README.textile
February 21, 2009 ยท View on GitHub
h1. HtmlSharp
HtmlSharp is a C# library for parsing HTML that allows querying of the DOM with css selectors.
h2. Usage
Let's say you have the following html stored in a string named html.
Sample HTML
Make sure you read this part.
Some info about stuff.
Copyright data.
To be able to query this document, we'll need to parse it:
HtmlParser parser = new HtmlParser(); HtmlDocument doc = parser.Parse(html);
From there, we can use "css selectors":http://www.w3.org/TR/css3-selectors/#selectors to find the information we want from the document.
Finding the tag with the id of copyright:
Tag t = doc.Find("#copyright");
Automatic casting to access attributes statically:
P paragraph = doc.Find("#copyright"); string align = paragraph.Align; // "left" string class = paragraph.Class; // null
Finding multiple tags:
IEnumerableparagraphs = doc.FindAll("p"); // Returns the 3 paragraph objects it finds
h2. Supported selectors
|. Pattern |. Meaning | | * | any element | | E | an element of type E | | E[foo] | an E element with a "foo" attribute | | E[foo="bar"] | an E element whose "foo" attribute value is exactly equal to "bar" | | E[foo~="bar"] | an E element whose "foo" attribute value is a list of space-separated values, one of which is exactly equal to "bar" | | E[foo^="bar"] | an E element whose "foo" attribute value begins exactly with the string "bar" | | E[foo$="bar"] | an E element whose "foo" attribute value ends exactly with the string "bar" | | E[foo*="bar"] | an E element whose "foo" attribute value contains the substring "bar" | | E[hreflang|="en"] | an E element whose "hreflang" attribute has a hyphen-separated list of values beginning (from the left) with "en" | | E:root | an E element, root of the document | | E:nth-child(n) | an E element, the n-th child of its parent | | E:nth-last-child(n) | an E element, the n-th child of its parent, counting from the last one | | E:nth-of-type(n) | an E element, the n-th sibling of its type | | E:nth-last-of-type(n) | an E element, the n-th sibling of its type, counting from the last one | | E:first-child | an E element, first child of its parent | | E:last-child | an E element, last child of its parent | | E:first-of-type | an E element, first sibling of its type | | E:last-of-type | an E element, last sibling of its type | | E:only-child | an E element, only child of its parent | | E:only-of-type | an E element, only sibling of its type | | E:empty | an E element that has no children (including text nodes) | | E:lang(fr) | an element of type E in language "fr" (the document language specifies how language is determined) | | E:disabled | a user interface element E which is disabled | | E:enabled | a user interface element E which is enabled | | E:checked | a user interface element E which is checked (for instance a radio-button or checkbox) | | E.warning | an E element whose class is "warning" (the document language specifies how class is determined) | | E#myid | an E element with ID equal to "myid" | | E:not(s) | an E element that does not match simple selector s | | E F | an F element descendant of an E element | | E > F | an F element child of an E element | | E + F | an F element immediately preceded by an E element | | E ~ F | an F element preceded by an E element |