Xquery to extract text in html - xpath

I am working on extracting text from HTML documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest I use an XQuery expression in order to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract the "Hello world" text from the above HTML fragment.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However, what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text Hello world2 that is in between hw2 and hw3. I would rather not use text()[3]; is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?

Your XPath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
Update (following comments and update to question):
This XPath selects the second text node from an $item that has an a child whose name attribute is set to hw:
$item[a/@name='hw']//text()[2]

I would not want to use text()[3] but
is there some way I could extract the
text out between /a[@name='hw2'] and
/a[@name='hw3'].
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (aka the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace $ns1 in the above expression with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
Lastly, if you really have XQuery (or XPath 2), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
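For comparison, here is a minimal sketch of the same "text between two anchors" selection in Python's standard xml.etree.ElementTree. This is an assumption for illustration only: the question uses webharvest and XQuery, not Python. ElementTree has no text nodes; the text following an element is stored in that element's tail, so the sibling text nodes between two anchors are the tails of the children from the start anchor up to, but not including, the end anchor. The helper name text_between is hypothetical.

```python
import xml.etree.ElementTree as ET

# Same fragment as in the question.
td = ET.fromstring(
    '<td>\n'
    '<a name="hw1">HELLOWORLD1</a>Hello world1\n'
    '<a name="hw2">HELLOWORLD2</a>Hello world2\n'
    '<a name="hw3">HELLOWORLD3</a>Hello world3\n'
    '</td>'
)

def text_between(parent, start_name, end_name):
    """Return the non-empty sibling texts between two named <a> anchors."""
    children = list(parent)
    names = [child.get("name") for child in children]
    start = names.index(start_name)  # raises ValueError if the anchor is missing
    end = names.index(end_name)
    # The tail of each child in [start, end) is a sibling text node that
    # follows the start anchor and precedes the end anchor.
    return [c.tail.strip() for c in children[start:end] if c.tail and c.tail.strip()]

print(text_between(td, "hw2", "hw3"))  # ['Hello world2']
```

The slice plays the role of the intersection: everything after the first anchor that is also before the second one.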

This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/@name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".

Related

Xpath getting text with mixed elements in same div

Here is some sample HTML
<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements
<p> Next paragraph....
</div>
what I tried was
//div[contains('@class','something')/text()
and
//div[contains('@class','something')/*/text()
and
//div[contains('@class','something')/p/text()
all of these seem to skip the 'b' tags and the 'a' tags.
Try " ".join(sel.xpath("//div[contains(@class,'something')]//text()").extract()), where sel is your selector (in your case it may be response).
Use the XPath expression
//div[contains(@class,'something')]//text()
to select all the text() nodes under the chosen div element, including those nested inside child elements such as b; concatenated, they give the full text.
Output:
This is a Paragraph with mixed elements
Next paragraph....
It depends on what you want to obtain and how. In any case, there are a couple of problems with what you tried:
You are missing the closing bracket (]) after contains(...) in the XPath expression.
@class should not be enclosed in (single) quotes when used inside contains; quoted, it is just the literal string '@class'.
If you want to get all the text of the div element as one string, you might use
normalize-space(//div[contains(@class,'something')])
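As a rough illustration of what normalize-space() does with the string-value of the div, here is a sketch using Python's standard xml.etree.ElementTree rather than the selector used in the question (an assumption for illustration). The sample markup is closed up into well-formed XML so that it parses, and the helper name normalized_text is hypothetical. Python's split() also treats a few extra characters as whitespace, so this is only an approximation of the XPath function.

```python
import xml.etree.ElementTree as ET

# The sample from the question, with the <p> elements closed so it is well-formed.
div = ET.fromstring(
    '<div class="something">'
    '<p> This is a <b> Paragraph </b> with mixed elements</p>\n'
    '<p> Next paragraph....</p>'
    '</div>'
)

def normalized_text(element):
    # itertext() yields every descendant text node in document order,
    # which is how XPath builds the string-value of an element.
    # split() + join collapses whitespace runs and trims both ends.
    return " ".join("".join(element.itertext()).split())

print(normalized_text(div))
# This is a Paragraph with mixed elements Next paragraph....
```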

Xpath get element above

suppose I have this structure:
<div class="a" attribute="foo">
<div class="b">
<span>Text Example</span>
</div>
</div>
In XPath, I would like to retrieve the value of the attribute "attribute", given that I have the text inside: Text Example
If I use this XPath:
.//*[@class='a']//*[text()='Text Example']
it returns the span element, but I need the div.a, because I need to get the value of the attribute through Selenium WebDriver.
There are several ways to do this.
If you start from the text Text Example, you can locate the span by its text and walk up from there:
//span[text()='Text Example']/../.. --> if you know the div is exactly 2 levels up
OR
//span[text()='Text Example']/ancestor::div[@class='a'] --> if you don't know how many levels up this div is
The two XPaths above are useful when you want to identify the element via Text Example. If you don't need to go through the text, you can identify the div directly:
//div[@class='a']
Your question already mentions the answer:
but I need the div.a,
Try this:
driver.findElement(By.cssSelector("div.a")).getAttribute("attribute");
Use cssSelector for best results.
Or else try the following XPath:
//div[contains(@class, 'a')]
If you want the attribute of div.a identified by its descendant span containing the given text, try the following:
driver.findElement(By.xpath("//div[@class = 'a' and descendant::span[text() = 'Text Example']]")).getAttribute("attribute");
Hope it helps..:)
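The "div with class a that has a matching descendant span" condition can also be sketched without a browser. The snippet below uses Python's standard xml.etree.ElementTree instead of Selenium WebDriver (an assumption for illustration), wraps the sample in a made-up body element, and uses a hypothetical helper name, attribute_of_container. It filters the candidate containers by their descendants rather than climbing up from the span.

```python
import xml.etree.ElementTree as ET

# The structure from the question, wrapped in a hypothetical <body> root.
root = ET.fromstring(
    '<body>'
    '<div class="a" attribute="foo">'
    '<div class="b"><span>Text Example</span></div>'
    '</div>'
    '</body>'
)

def attribute_of_container(root, text):
    """Return the "attribute" value of the first div.a containing a span with this text."""
    for div in root.iter("div"):
        if div.get("class") != "a":
            continue
        # div.iter("span") walks all descendant spans, like descendant::span.
        if any(span.text == text for span in div.iter("span")):
            return div.get("attribute")
    return None

print(attribute_of_container(root, "Text Example"))  # foo
```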

Xpath - matching based on node() contains() content

I have the following HTML structure (there are many blocks using the same architecture):
<span id="mySpan">
<i>
Price
<b>
3 900
<small>€</small>
</b>
</i>
</span>
Now, I want to get the content of <b> using Xpath which I tried like so:
//span[@id="mySpan"]/i/node()[1][contains(text(),"Price")]
which does not match anything. How can I match this using the node()[1] text as anchor?
Regarding the XPath you tried: text() returns text node children, but the first node inside <i> is itself a text node and has no children. Simply use . instead:
//span[@id="mySpan"]/i/node()[1][contains(.,"Price")]
For the ultimate goal, I'd suggest this XPath:
//span[@id="mySpan"]/i[contains(.,"Price")]/b
or, if you want specifically to match against the first node within <i>:
//span[@id="mySpan"]/i[contains(node(),"Price")]/b
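A small sketch of the same idea in Python's standard xml.etree.ElementTree (an assumption for illustration; the question does not say which XPath engine is used). In ElementTree the first text node of an element is its text property, so "first node contains Price" becomes a check on i.text, and the content of <b> is collected with itertext(). The helper name price_after_label is hypothetical.

```python
import xml.etree.ElementTree as ET

# The block from the question.
span = ET.fromstring(
    '<span id="mySpan"><i>\nPrice\n<b>\n3 900\n<small>€</small>\n</b>\n</i></span>'
)

def price_after_label(span, label):
    """Return the whitespace-normalized text of <b> if the first node of <i> contains label."""
    for i in span.findall("i"):
        # i.text is the text before the first child element, i.e. node()[1] here.
        if label in (i.text or ""):
            b = i.find("b")
            if b is not None:
                return " ".join("".join(b.itertext()).split())
    return None

print(price_after_label(span, "Price"))  # 3 900 €
```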

CRLF causing problems in finding an element via xpath

I am working on a test case and need to find text within a table. The only thing to key off of is the label in the previous column. The keys are Next Trckng/Dschrg, Next Full, Next Qtrly, Next Mdcr. I would like to create an xpath expression that will find the Text 1, Text 2, Text 3, and Text 4 based on the key. Since all the keys have the word Next in them, I have mocked this up to find all four of them at once.
//td[preceding-sibling::td[contains(descendant::text(),'Next')]]/a
The third one is not found because it does not have an 'a' element, which is fine. The problem comes in the very first td. It has a span in it, unlike the others. The span is on a second physical line from the td. It appears that the CRLF is preventing FirePath from finding the first td: when I put the span on the same line as the td, it is found. The problem is that I cannot change the actual page; this is a test case.
Is this a FireBug issue or is this actually resulting in two text elements in the DOM? How do I tweak the xpath to find all four nodes?
Here is the HTML:
<table border=1>
<tbody>
<tr>
<td>
<span id="xxx"><a><img></a></span>
Next Trckng/Dschrg:</td>
<td><a>Text 1</a></td>
<td>Next Full:</td>
<td><a>Text 2</a></td>
<td>Next Qtrly:</td>
<td> <!-- Text 3 --></td>
<td>Next Mdcr:</td>
<td><a>Text 4</a></td>
<td>Change Of Therapy:</td>
</tr>
</tbody>
</table>
The problem is with the expression contains(descendant::text(),'Next'). The contains function takes two strings as arguments. Since you pass a node-set as first argument, it is converted to a string. The conversion works by calling the string function on the node-set which according to the spec returns the string-value of the node that is first in document order. In your case, this will be the first text child of a td element. For the first td element, this is a text node containing only whitespace.
The solution is simple: Pass the current td element to the contains function:
contains(., 'Next')
The string-value of this single node will contain the concatenation of the string-values of all text node descendants.
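The difference between "first text node" and "string-value of the element" can be seen in a short sketch. It uses Python's standard xml.etree.ElementTree rather than FirePath (an assumption for illustration), and the img element is self-closed so that the first td from the question parses as XML.

```python
import xml.etree.ElementTree as ET

# The first td from the question; <img> is self-closed to make it well-formed.
td = ET.fromstring(
    '<td>\n'
    '<span id="xxx"><a><img/></a></span>\n'
    'Next Trckng/Dschrg:</td>'
)

# What contains(descendant::text(), 'Next') effectively tests: only the
# first text node, which here is the line break before the span.
first_text_node = td.text

# What contains(., 'Next') tests: the concatenation of all descendant text.
string_value = "".join(td.itertext())

print("Next" in first_text_node)  # False
print("Next" in string_value)     # True
```

So the line break does produce a separate, whitespace-only text node ahead of the span, which is why the original predicate fails for this cell only.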

XPath / XQuery: find text in a node, but ignoring content of specific descendant elements

I am trying to find a way to search for a string within nodes, but excluding the content of some subelements of those nodes. Put plainly, I want to search for a string in the paragraphs of a text, excluding the footnotes, which are child elements of the paragraphs.
For example,
My document being:
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>
When I'm searching for "text", I would like the Xpath / XQuery to retrieve the first p element, but not the second one (where "text" is contained only in the footnote subelement).
I have tried the contains() function, but it retrieves both p elements.
Any help would be much appreciated :)
I want to search for a string in
paragraphs of a text, excluding the
footnotes which are children elements
of the paragraphs
An XPath 1.0 - only solution:
Use:
//p//text()[not(ancestor::footnote) and contains(.,'text')]
Against the following XML document (obtained from yours, but with a p added inside the footnote to make this more interesting):
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>
this XPath expression selects exactly the wanted text node:
My text starts here/
With XPath 2.0 or XQuery, the except operator can be used:
//p[(.//text() except .//footnote//text())[contains(., 'text')]]
/document/p[text()[contains(., 'text')]] should do.
For the record, as a complement to the other answers, I've found this workaround that also seems to do the job:
//p[contains(child::text()|not(descendant::footnote), "text")]
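The "search the paragraph but skip its footnotes" logic can be sketched in plain Python with the standard xml.etree.ElementTree (an assumption for illustration; the question asks for XPath / XQuery). The helper name text_outside is hypothetical. Note one difference from the XPath answers above: they test each text node separately, while this sketch concatenates the text outside footnotes before searching.

```python
import xml.etree.ElementTree as ET

# The document from the question.
doc = ET.fromstring(
    '<document>'
    '<p n="1">My text starts here/</p>'
    '<p n="2">Then it goes on there'
    '<footnote>It\'s not a very long text!</footnote></p>'
    '</document>'
)

def text_outside(element, skip_tag):
    """Concatenate the text of element, ignoring everything inside skip_tag elements."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skip_tag:
            parts.append(text_outside(child, skip_tag))
        # The tail follows the child's closing tag, so it belongs to the
        # parent and is kept even when the child itself is skipped.
        parts.append(child.tail or "")
    return "".join(parts)

matches = [p.get("n") for p in doc.findall("p") if "text" in text_outside(p, "footnote")]
print(matches)  # ['1']
```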

Resources