
Splitting HTML String Into Two Parts With HtmlAgilityPack

I'm looking for the best way to split an HTML document over some tag in C# using HtmlAgilityPack. I want to preserve the intended markup as I'm doing the split. Here is an example.

Solution 1:

Definitely not by regex. (Note: this was originally a tag on the question; it has since been removed.) I'm usually not one to jump on the "The Pony is Coming" bandwagon, but this is one case in which regular expressions would be particularly bad.

First, I would write a recursive function that removes all siblings that follow a node, call it RemoveSiblingsAfter(node), and then calls itself on the node's parent, so that all siblings following the parent are removed as well (then all siblings following the grandparent, and so on). You can use an XPath to find the node(s) on which you want to split, e.g. doc.DocumentNode.SelectNodes("//a[@href='#']"), and call the function on that node. When done, you'd remove the splitting node itself, and that's it. You'd repeat these steps for a copy of the original document, except you'd implement RemoveSiblingsBefore(node) to remove siblings that precede a node.
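A minimal sketch of the first half of that idea, assuming the HtmlAgilityPack NuGet package is referenced. The class name SplitBefore and the helper FirstPart are hypothetical names chosen for illustration; only RemoveSiblingsAfter comes from the description above.

```csharp
using System;
using HtmlAgilityPack;

static class SplitBefore
{
    // Removes every sibling that follows `node`, then repeats for each ancestor.
    public static void RemoveSiblingsAfter(HtmlNode node)
    {
        // The document root has no parent, so the recursion stops there.
        if (node == null || node.ParentNode == null) return;

        while (node.NextSibling != null)
            node.ParentNode.RemoveChild(node.NextSibling);

        RemoveSiblingsAfter(node.ParentNode);
    }

    // Hypothetical helper: returns the markup that precedes the first node
    // matched by `xpath`. The HTML is parsed fresh, so the caller's data is untouched.
    public static string FirstPart(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var splitNode = doc.DocumentNode.SelectSingleNode(xpath);
        if (splitNode == null)
            throw new ArgumentException("No node matches the XPath.", nameof(xpath));

        RemoveSiblingsAfter(splitNode);
        splitNode.ParentNode.RemoveChild(splitNode); // finally drop the split node itself
        return doc.DocumentNode.OuterHtml;
    }
}
```

Called with a made-up input such as `<div><h1>Title</h1><ul><li>Bullet 1</li><li><a href="#">split</a></li><li>Bullet 3</li></ul></div>` and the XPath `//a[@href='#']`, this should keep the heading and "Bullet 1" and drop "Bullet 3". The `<li>` that held the link stays behind as an empty element, because nothing here prunes empty ancestors.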

In your example, RemoveSiblingsBefore would act as follows:

  1. <a href="#"> has no siblings, so recurse on parent, <li>.
  2. <li> has a preceding sibling—<li>Bullet 1</li>—so remove, and recurse on parent, <ul>.
  3. <ul> has no siblings, so recurse on parent, <p>.
  4. <p> has a preceding sibling—<p>Stuff</p>—so remove, and recurse on parent, <div>.
  5. and so on.
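The walk above could be sketched as the mirror image of RemoveSiblingsAfter. As before, this assumes the HtmlAgilityPack package; the class name SplitAfter and the helper SecondPart are hypothetical, and the sample markup in the usage note is invented, since the question's example is not reproduced here.

```csharp
using System;
using HtmlAgilityPack;

static class SplitAfter
{
    // Removes every sibling that precedes `node`, then repeats for each ancestor.
    public static void RemoveSiblingsBefore(HtmlNode node)
    {
        // The document root has no parent, so the recursion stops there.
        if (node == null || node.ParentNode == null) return;

        while (node.PreviousSibling != null)
            node.ParentNode.RemoveChild(node.PreviousSibling);

        RemoveSiblingsBefore(node.ParentNode);
    }

    // Hypothetical helper: returns the markup that follows the first node
    // matched by `xpath`, working on a freshly parsed copy of the document.
    public static string SecondPart(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var splitNode = doc.DocumentNode.SelectSingleNode(xpath);
        if (splitNode == null)
            throw new ArgumentException("No node matches the XPath.", nameof(xpath));

        RemoveSiblingsBefore(splitNode);
        splitNode.ParentNode.RemoveChild(splitNode); // drop the split node itself
        return doc.DocumentNode.OuterHtml;
    }
}
```

For an input like `<div><h1>Title</h1><ul><li>Bullet 1</li><li><a href="#">split</a></li><li>Bullet 3</li></ul><p>More</p></div>`, the result should retain "Bullet 3" and "More" while "Title" and "Bullet 1" are gone. Parsing the string a second time plays the role of the "copy of the original document" mentioned earlier.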

Solution 2:

Here is what I came up with. This does the split and removes the "empty" elements of the element where the split happens.

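    // Requires: using System; using System.Collections.Generic; using HtmlAgilityPack;
    private static void SplitDocument()
    {
        var doc = new HtmlDocument();
        doc.Load("HtmlDoc.html");

        // Split at the first <a> element that has an href attribute.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return; // SelectNodes returns null when nothing matches

        var firstPart = GetFirstPart(doc.DocumentNode, links[0]).DocumentNode.InnerHtml;
        var secondPart = GetSecondPart(links[0]).DocumentNode.InnerHtml;
    }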

    private static HtmlDocument GetFirstPart(HtmlNode currNode, HtmlNode link)
    {
        // Each stack entry pairs an original node with the new parent its copy attaches to.
        var nodeStack = new Stack<Tuple<HtmlNode, HtmlNode>>();
        var newDoc = new HtmlDocument();
        var parent = newDoc.DocumentNode;

        nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(currNode, parent));

        // Depth-first copy in document order; stop once the split node is reached.
        while (nodeStack.Count > 0)
        {
            var curr = nodeStack.Pop();
            var copyNode = curr.Item1.CloneNode(false);
            curr.Item2.AppendChild(copyNode);

            if (curr.Item1 == link)
            {
                // Drop the copied link together with any ancestors that hold nothing else.
                var nodeToRemove = NodeAndEmptyAncestors(copyNode);
                nodeToRemove.ParentNode.RemoveChild(nodeToRemove);
                break;
            }

            // Push children in reverse so they are popped in document order.
            for (var i = curr.Item1.ChildNodes.Count - 1; i >= 0; i--)
            {
                nodeStack.Push(new Tuple<HtmlNode, HtmlNode>(curr.Item1.ChildNodes[i], copyNode));
            }
        }

        return newDoc;
    }

    private static HtmlDocument GetSecondPart(HtmlNode link)
    {
        var nodeStack = new Stack<HtmlNode>();
        var newDoc = new HtmlDocument();

        // Collect shallow copies of the link's ancestors, innermost first.
        var currNode = link;
        while (currNode.ParentNode != null)
        {
            currNode = currNode.ParentNode;
            nodeStack.Push(currNode.CloneNode(false));
        }

        // Rebuild the ancestor chain in the new document, outermost first.
        var parent = newDoc.DocumentNode;
        while (nodeStack.Count > 0)
        {
            var node = nodeStack.Pop();
            parent.AppendChild(node);
            parent = node;
        }

        var newLink = link.CloneNode(false);
        parent.AppendChild(newLink);

        currNode = link;
        var newParent = newLink.ParentNode;

        // At each level, copy the siblings that follow the current node.
        while (currNode.ParentNode != null)
        {
            var foundNode = false;
            foreach (var child in currNode.ParentNode.ChildNodes)
            {
                if (foundNode) newParent.AppendChild(child.Clone());
                if (child == currNode) foundNode = true;
            }

            currNode = currNode.ParentNode;
            newParent = newParent.ParentNode;
        }

        // Drop the copied link together with any ancestors that hold nothing else.
        var nodeToRemove = NodeAndEmptyAncestors(newLink);
        nodeToRemove.ParentNode.RemoveChild(nodeToRemove);

        return newDoc;
    }

    private static HtmlNode NodeAndEmptyAncestors(HtmlNode node)
    {
        var currNode = node;
        while (currNode.ParentNode != null && currNode.ParentNode.ChildNodes.Count == 1)
        {
            currNode = currNode.ParentNode;
        }

        return currNode;
    }
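
To see what the pruning helper does on its own, here is a small standalone sketch, again assuming the HtmlAgilityPack package. The class name PruneDemo, the method RemoveWithEmptyAncestors, and the sample markup are hypothetical; the climbing loop repeats the logic of NodeAndEmptyAncestors above.

```csharp
using System;
using HtmlAgilityPack;

static class PruneDemo
{
    // Same logic as NodeAndEmptyAncestors: climb while the parent has exactly one child.
    static HtmlNode NodeAndEmptyAncestors(HtmlNode node)
    {
        var current = node;
        while (current.ParentNode != null && current.ParentNode.ChildNodes.Count == 1)
            current = current.ParentNode;
        return current;
    }

    // Hypothetical helper: removes the first node matched by `xpath` along with
    // every ancestor that would be left empty, and returns the remaining markup.
    public static string RemoveWithEmptyAncestors(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var target = doc.DocumentNode.SelectSingleNode(xpath);
        if (target == null)
            throw new ArgumentException("No node matches the XPath.", nameof(xpath));

        var top = NodeAndEmptyAncestors(target);
        if (top.ParentNode == null)
            return string.Empty; // the whole document was a single chain down to the target

        top.ParentNode.RemoveChild(top);
        return doc.DocumentNode.OuterHtml;
    }
}
```

With `<div><ul><li><a href="#">x</a></li></ul><p>keep</p></div>` and `//a[@href]`, the climb goes from the link to its `<li>` and then to the `<ul>`, and stops there because the `<div>` has two children, so the whole `<ul>` is removed and the `<p>` survives. One caveat: HtmlAgilityPack keeps whitespace between tags as text nodes, so on indented markup the `ChildNodes.Count == 1` test may stop climbing earlier than expected.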
