HTMLAgilityPack and separating on <br/>

I have some html, which is separated by <br/> e.g.:
Jack Janson
<br/>
309 123 456
<br/>
My Special Street 43
What is the easiest way to retrieve the information in 3 columns?
I am not an XPath expert, so another approach would be to separate the string on the line break, and just work with the array. Is there a smarter way to do it?

In pure XPath over XML, you would use an expression like //preceding-sibling::br or //following-sibling::br (see a reference on XPath axes for details).
But the XPath-over-HTML implementation in Html Agility Pack does not support selecting pure text nodes or attribute nodes in XPath selection expressions (//br/text() or //br/@blah do not work, for example). They do work in filters, so //br[text()='blah'] and //br[@att='blah'] are fine.
So, back to the question: you need to combine XPath and code, something like this:
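// Requires: using System; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(myHtmlFile);
// Each <br> is preceded by the text node that holds one field.
foreach (HtmlNode br in doc.DocumentNode.SelectNodes("//br"))
{
    Console.WriteLine(br.PreviousSibling.InnerText.Trim());
}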
That will output the text before each <br/> (the last line, which no <br/> follows, is not printed):
Jack Janson
309 123 456
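For comparison, a standard XPath 1.0 engine can select the text nodes directly, which also picks up the last field. The following is only a sketch using the JDK's javax.xml.xpath rather than Html Agility Pack; the wrapping <div> and the class name BrFields are assumptions, since the question shows only the fragment, and the input must be well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BrFields {
    // Returns the trimmed, non-empty text nodes that are direct children of the root <div>.
    static List<String> fields(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList texts = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/div/text()[normalize-space()]", doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < texts.getLength(); i++) {
            out.add(texts.item(i).getNodeValue().trim());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical wrapper element around the fragment from the question.
        String xml = "<div>Jack Janson<br/>309 123 456<br/>My Special Street 43</div>";
        for (String field : fields(xml)) {
            System.out.println(field);
        }
    }
}
```

Each list entry corresponds to one of the three columns asked about in the question.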

Related

Xpath get element above

suppose I have this structure:
<div class="a" attribute="foo">
<div class="b">
<span>Text Example</span>
</div>
</div>
In xpath, I would like to retrieve the value of the attribute "attribute" given I have the text inside: Text Example
If I use this xpath:
.//*[@class='a']//*[text()='Text Example']
It returns the element span, but I need the div.a, because I need to get the value of the attribute through Selenium WebDriver
There are several ways to do this.
Given the text Text Example, you can reach the div from the span:
//span[text()='Text Example']/../.. --> if you know it is 2 levels up
OR
//span[text()='Text Example']/ancestor::div[@class='a'] --> if you don't know how many levels up this `div` is
Use the two XPaths above when you want to locate the element through the text Text Example. If you don't need the text, you can identify the div directly:
//div[@class='a']
Your question already contains the answer:
but I need the div.a,
Try this:
driver.findElement(By.cssSelector("div.a")).getAttribute("attribute");
Use cssSelector for the best result.
Otherwise, try the following XPath:
//div[contains(@class, 'a')]
If you want the attribute of the div.a that has a descendant span containing a given text, try:
driver.findElement(By.xpath("//div[@class = 'a' and descendant::span[text() = 'Text Example']]")).getAttribute("attribute");
Hope it helps..:)
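To check the ancestor-axis expression without a browser, it can be evaluated against the sample markup with a plain XPath 1.0 engine. This is a minimal sketch using the JDK's javax.xml.xpath, not Selenium; the class and method names are made up for illustration, and unlike the Selenium calls above the attribute is selected inside the expression itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class AncestorAttribute {
    // Finds the span by its text, climbs to the enclosing div.a, and returns its "attribute" value.
    static String attributeOfAncestor(String xml, String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // The text is concatenated for brevity; it must not contain a single quote.
        String expr = "//span[text()='" + text + "']/ancestor::div[@class='a']/@attribute";
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div class=\"a\" attribute=\"foo\">"
                + "<div class=\"b\"><span>Text Example</span></div>"
                + "</div>";
        System.out.println(attributeOfAncestor(xml, "Text Example"));
    }
}
```

If no span matches, the string form of evaluate returns an empty string rather than throwing.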

XPATH - get all inner nodes except a particular one

this is my HTML
<book>
<div id="name"></div>
<span id="age"></span>
<p id="contact_number"></p>
...
...
(more attributes)
</book>
I need to extract all the text() inside <book></book> except the p with id="contact_number",
so basically I need //book//text() except //book//p[@id="contact_number"]//text().
How can I do this in a single XPath query?
There might be a better way if you can state the requirement differently. Anyway, to answer the question as asked, you can try this:
//book//text()[not(ancestor::p/@id='contact_number')]
or maybe just use parent::p instead of ancestor::p:
//book//text()[not(parent::p/@id='contact_number')]
Add [normalize-space()] at the end if you need to filter out empty text nodes.
Try the following:
//*[not(self::p[@id = 'contact_number'])]/text()[normalize-space()]
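A quick way to try the parent::p variant is to run it against a filled-in copy of the sample. The sketch below uses the JDK's javax.xml.xpath; the element contents (Ann, 42, 555-0100) and the class name are invented for illustration, because the elements in the question are empty.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BookTextExceptContact {
    // Collects every non-blank text node under <book> whose parent is not p#contact_number.
    static List<String> texts(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        String expr = "//book//text()[not(parent::p/@id='contact_number')][normalize-space()]";
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getNodeValue().trim());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical content; the original sample has empty elements.
        String xml = "<book><div id=\"name\">Ann</div><span id=\"age\">42</span>"
                + "<p id=\"contact_number\">555-0100</p></book>";
        System.out.println(texts(xml));
    }
}
```

With parent::p only text directly inside the excluded p is dropped; ancestor::p would also drop text nested deeper inside it.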

Using Xpath and HtmlAgilityPack to find all elements with innertext containing a specific word or words

I am trying to build a simple search-engine using HtmlAgilityPack and Xpath with C# (.NET 4).
I want to find every node containing a user-defined search word, but I can't seem to get the XPath right.
For Example:
<HTML>
<BODY>
<H1>Mr T for president</H1>
<div>We believe the new president should be</div>
<div>the awsome Mr T</div>
<div>
<H2>Mr T replies:</H2>
<p>I pity the fool who doesn't vote</p>
<p>for Mr T</p>
</div>
</BODY>
</HTML>
If the specified searchword is "Mr T" I'd want the following nodes: <H1>, The second <div>, <H2> and the second <p>.
I have tried numerous variants of doc.DocumentNode.SelectNodes("//text()[contains(., "+ searchword +")]"); but I always seem to wind up with every single node in the entire DOM.
Any hints to get me in the right direction would be very appreciated.
Use:
//*[text()[contains(., 'Mr T')]]
This selects all elements in the XML document that have a text-node child which contains the string 'Mr T'.
This can also be written shorter as:
//text()[contains(., 'Mr T')]/..
This selects the parent(s) of any text node that contains the string 'Mr T'.
In XPath, to find elements containing a specific keyword, use the following format ("keyword" is the word you want to search for):
//*[text()[contains(., 'keyword')]]
Use the same format in C#, where keyword is the string variable holding the search word:
doc.DocumentNode.SelectNodes("//*[text()[contains(., '" + keyword + "')]]");
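Use the following:
doc.DocumentNode.SelectNodes("//*[contains(text()[1], '" + searchword + "')]")
This selects all elements (*) whose first text child (text()[1]) contains the searchword. Note the single quotes around the search word: without them it is not treated as a string literal.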
Case-insensitive solution:
var xpathForFindText =
"//*[text()[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '" + lowerFocusKwd + "')]]";
var result=doc.DocumentNode.SelectNodes(xpathForFindText);
Note:
lowerFocusKwd must not contain a single quote ('), because it would end the string literal early and make the XPath expression malformed.
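Building the expression by string concatenation breaks as soon as the search word contains a quote. One way around that, where the XPath API supports it, is to pass the word as an XPath variable. The sketch below shows the idea with the JDK's javax.xml.xpath and its XPathVariableResolver, not with Html Agility Pack; the variable name kw and the class name are arbitrary choices for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class KeywordSearch {
    // Returns the names of all elements that have a text-node child containing the keyword.
    static List<String> elementsContaining(String xml, String keyword) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $kw to the keyword so it never becomes part of the expression text.
        xpath.setXPathVariableResolver(
                name -> "kw".equals(name.getLocalPart()) ? keyword : null);
        NodeList hits = (NodeList) xpath.evaluate(
                "//*[text()[contains(., $kw)]]", doc, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            names.add(hits.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<BODY><H1>Mr T for president</H1>"
                + "<div>We believe the new president should be</div>"
                + "<p>I pity the fool who doesn't vote</p></BODY>";
        System.out.println(elementsContaining(xml, "Mr T"));
        System.out.println(elementsContaining(xml, "doesn't"));
    }
}
```

The second call uses a keyword with an apostrophe, which would produce a malformed expression if it were concatenated into a single-quoted literal.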

To get text after the tag, containing another text

For example:
<p>
<b>Member Since:</b> Aug. 07, 2010<br><b>Time Played:</b> <span class="text_tooltip" title="Actual Time: 15.09:37:06">16 days</span><br><b>Last Game:</b>
<span class="text_tooltip" title="07/16/2011 23:41">1 minute ago</span>
<br><b>Wins:</b> 1,017<br><b>Losses / Quits:</b> 883 / 247<br><b>Frags / Deaths:</b> 26,955 / 42,553<br><b>Hits / Shots:</b> 690,695 / 4,229,566<br><b>Accuracy:</b> 16%<br>
</p>
I want to get 1,017. It is a text after the tag, containing text Wins:.
If I used regex, it would be [/<b>Wins:<\/b> ([^<]+)/,1], but how to do it with Nokogiri and XPath?
Or should I better parse this part of page with regex?
Here
doc = Nokogiri::HTML(html)
puts doc.at('b[text()="Wins:"]').next.text
You can use this XPath: //*[*/text() = 'Wins:']/text(). Note that it selects every text-node child of the enclosing <p>, so 1,017 still has to be picked out of the result.
About regex: RegEx match open tags except XHTML self-contained tags
I would use pure XPath like:
"//b[.='Wins:']/following::node()[1]"
I've heard thousands of times (and from gurus) "never use regex to parse XML". Can you provide some "shocking" reference demonstrating that this sentence is not valid any more?
Use:
//*[. = 'Wins:']/following-sibling::node()[1]
In case this is ambiguous (selects more than one node), more strict expressions can be specified:
//*[. = 'Wins:']/following-sibling::node()[self::text()][1]
Or:
(//*[. = 'Wins:'])[1]/following-sibling::node()[1]
Or:
(//*[. = 'Wins:'])[1]/following-sibling::node()[self::text()][1]
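The following-sibling approach can be tried on a cut-down copy of the sample. This is a sketch with the JDK's javax.xml.xpath rather than Nokogiri; the markup is shortened and uses <br/> instead of <br> because the JDK parser needs well-formed XML, and the class and method names are made up. Wrapping the expression in normalize-space() strips the leading blank from the text node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TextAfterLabel {
    // Returns the trimmed text node that directly follows the <b> element with the given label.
    static String valueAfter(String xml, String label) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // The label is concatenated for brevity; it must not contain a single quote.
        String expr = "normalize-space(//b[. = '" + label + "']/following-sibling::text()[1])";
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        // Shortened, well-formed version of the markup from the question.
        String xml = "<p><b>Wins:</b> 1,017<br/>"
                + "<b>Losses / Quits:</b> 883 / 247<br/>"
                + "<b>Accuracy:</b> 16%<br/></p>";
        System.out.println(valueAfter(xml, "Wins:"));
        System.out.println(valueAfter(xml, "Accuracy:"));
    }
}
```

The same method works for the other labels as long as the value is a plain text node; fields such as Time Played, whose value sits in a <span>, need following-sibling::span instead.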

XPath / XQuery: find text in a node, but ignoring content of specific descendant elements

I am trying to find a way to search for a string within nodes, but excluding the content of some subelements of those nodes. Plain and simple, I want to search for a string in paragraphs of a text, excluding the footnotes which are children elements of the paragraphs.
For example,
My document being:
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>
When I'm searching for "text", I would like the Xpath / XQuery to retrieve the first p element, but not the second one (where "text" is contained only in the footnote subelement).
I have tried the contains() function, but it retrieves both p elements.
Any help would be much appreciated :)
"I want to search for a string in paragraphs of a text, excluding the footnotes which are children elements of the paragraphs"
An XPath 1.0 - only solution:
Use:
//p//text()[not(ancestor::footnote) and contains(.,'text')]
Against the following XML document (derived from yours, with a p added inside the footnote to make this more interesting):
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>
this XPath expression selects exactly the wanted text node:
My text starts here/
With XPath 2.0 (the except operator is not available in XPath 1.0):
//p[(.//text() except .//footnote//text())[contains(., 'text')]]
/document/p[text()[contains(., 'text')]] should do.
For the record, as a complement to the other answers, I've found this workaround that also seems to do the job:
//p[contains(child::text()|not(descendant::footnote), "text")]
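To return the p elements themselves rather than the matching text nodes, the XPath 1.0 predicate can be moved inside //p[...]. The sketch below checks that variant against the document from the question using the JDK's javax.xml.xpath; the class and method names are hypothetical, and the n attribute is returned only to make the result easy to read.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FootnoteSearch {
    // Returns the n attribute of every <p> containing the word "text" outside of any <footnote>.
    static List<String> matchingParagraphs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        String expr = "//p[.//text()[not(ancestor::footnote) and contains(., 'text')]]/@n";
        NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            out.add(attrs.item(i).getNodeValue());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<document>"
                + "<p n=\"1\">My text starts here/</p>"
                + "<p n=\"2\">Then it goes on there<footnote>It's not a very long text!</footnote></p>"
                + "</document>";
        System.out.println(matchingParagraphs(xml));
    }
}
```

Only the first paragraph is reported, because in the second one the word occurs solely inside the footnote.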
