CRLF causing problems in finding an element via xpath - xpath

I am working on a test case and need to find text within a table. The only thing to key off of is the label in the previous column. The keys are Next Trckng/Dschrg, Next Full, Next Qtrly, Next Mdcr. I would like to create an xpath expression that will find the Text 1, Text 2, Text 3, and Text 4 based on the key. Since all the keys have the word Next in them, I have mocked this up to find all four of them at once.
//td[preceding-sibling::td[contains(descendant::text(),'Next')]]/a
The third one is not found because it does not have an 'a' element, which is fine. The problem comes in the very first td. It has a span in it, unlike the others, and the span starts on a second physical line after the opening td tag. It appears that the CRLF is preventing FirePath from finding the first td: when I put the span on the same line as the td, it is found. I cannot change the actual page, since this is a test case.
Is this a FireBug issue or is this actually resulting in two text elements in the DOM? How do I tweak the xpath to find all four nodes?
Here is the HTML:
<table border=1>
<tbody>
<tr>
<td>
<span id="xxx"><a><img></a></span>
Next Trckng/Dschrg:</td>
<td><a>Text 1</a></td>
<td>Next Full:</td>
<td><a>Text 2</a></td>
<td>Next Qtrly:</td>
<td> <!-- Text 3 --></td>
<td>Next Mdcr:</td>
<td><a>Text 4</a></td>
<td>Change Of Therapy:</td>
</tr>
</tbody>
</table>

The problem is with the expression contains(descendant::text(),'Next'). The contains function takes two strings as arguments. Since you pass a node-set as first argument, it is converted to a string. The conversion works by calling the string function on the node-set which according to the spec returns the string-value of the node that is first in document order. In your case, this will be the first text child of a td element. For the first td element, this is a text node containing only whitespace.
The solution is simple: Pass the current td element to the contains function:
contains(., 'Next')
The string-value of this single node will contain the concatenation of the string-values of all text node descendants.
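The difference between the two string conversions can be reproduced outside an XPath engine. The following is a minimal Python sketch, assuming the row is well-formed XML (the img element is self-closed and the comment is dropped here). It uses the standard library's xml.etree.ElementTree, which does not implement contains() itself, so the logic is emulated by hand. The helper name links_after_labels is hypothetical, and it checks only the immediately preceding cell, a slightly narrower condition than the expression in the question.

```python
import xml.etree.ElementTree as ET

# Assumption: the row from the question, made well-formed (<img/> self-closed).
ROW = """<tr>
<td>
<span id="xxx"><a><img/></a></span>
Next Trckng/Dschrg:</td>
<td><a>Text 1</a></td>
<td>Next Full:</td>
<td><a>Text 2</a></td>
<td>Next Qtrly:</td>
<td> </td>
<td>Next Mdcr:</td>
<td><a>Text 4</a></td>
<td>Change Of Therapy:</td>
</tr>"""

row = ET.fromstring(ROW)
cells = list(row)
first = cells[0]

# What contains(descendant::text(), 'Next') effectively tests:
# only the first text node, which is the line break before the span.
first_text_node = first.text

# What contains(., 'Next') tests: all descendant text nodes concatenated.
string_value = "".join(first.itertext())


def links_after_labels(cells, word):
    """Return link texts from cells whose previous cell's string-value contains word."""
    found = []
    for label, value in zip(cells, cells[1:]):
        if word in "".join(label.itertext()):
            link = value.find("a")
            if link is not None:
                found.append(link.text)
    return found


print("Next" in first_text_node)   # whitespace only, so the label is missed
print("Next" in string_value)      # the full string-value contains the label
print(links_after_labels(cells, "Next"))
```

The cell after "Next Qtrly:" contributes nothing because it has no 'a' child, which matches the behaviour described in the question.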

Related

Xpath getting text with mixed elements in same div

Here is some sample HTML
<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements
<p> Next paragraph....
</div>
what I tried was
//div[contains('@class','something')/text()
and
//div[contains('@class','something')/*/text()
and
//div[contains('@class','something')/p/text()
all of these seem to skip the 'b' tags and the 'a' tags.
Try " ".join(sel.xpath("//div[contains(@class,'something')]//text()").extract()), where sel is your selector (in your case it may be response).
Use the XPath expression
//div[contains(@class,'something')]//text()
to select all the descendant text() nodes of the chosen div element; concatenating them gives the text below.
Output:
This is a Paragraph with mixed elements
Next paragraph....
It depends on what you want to obtain and how. In any case, there are a couple of problems with what you tried:
You are missing the closing bracket (]) after contains in the XPath expression.
@class should not be enclosed in (single) quotes when used inside contains.
If you want to get all the text of the div element as one string, you might use
normalize-space(//div[contains(@class,'something')])
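To see why child text() steps skip the text inside nested elements while //text() and normalize-space() do not, here is a small Python sketch using only the standard library. It assumes the p elements are closed so that the fragment parses as XML, and it emulates the XPath steps by hand rather than running them; child_text_nodes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: the <p> elements are closed so the fragment parses as XML.
DIV = """<div class="something">
<p> This is a <b> Paragraph </b> with mixed elements</p>
<p> Next paragraph....</p>
</div>"""

div = ET.fromstring(DIV)


def child_text_nodes(element):
    """Text nodes that are direct children, roughly element/text() in XPath."""
    parts = [element.text] + [child.tail for child in element]
    return [part for part in parts if part]


# Rough equivalent of //div/p/text(): the text inside <b> is not included.
direct = [text for p in div.findall("p") for text in child_text_nodes(p)]

# Rough equivalent of //div//text(): every descendant text node.
everything = list(div.itertext())

# Rough equivalent of normalize-space(//div): join, then collapse whitespace.
normalized = " ".join("".join(everything).split())

print(direct)
print(normalized)
```

Joining the descendant text nodes and collapsing whitespace afterwards avoids the doubled spaces that a plain " ".join() over the raw nodes would leave behind.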

xpath nearest element to a given element

I am having trouble returning an element using xpath.
I need to get the text from the 2nd TD from a large table.
<tr>
<td>
<label for="PropertyA">Some text here </label>
</td>
<td> TEXT!! </td>
</tr>
I'm able to find the label element, but then I'm having trouble selecting the sibling TD to return the text.
This is how I select the label:
"//label[@for='PropertyA']"
thanks
You are looking for the following-sibling axis. It searches among the siblings under the same parent, which here is the tr. If the tds aren't in the same tr, they aren't found; in that case you can use the following axis instead.
//td[label[@for='PropertyA']]/following-sibling::td[1]
From the label element, it should be:
//label[@for='PropertyA']/following::td[1]
And then use the DOM method from the hosting language to get the string value.
Or select the text node (something I do not recommend) with:
//label[@for='PropertyA']/following::td[1]/text()
Or if there's going to be just this one only node, then you could use the string() function:
string(//label[@for='PropertyA']/following::td[1])
You can also select from the common ancestor tr like:
//tr[td/label/@for='PropertyA']/td[2]
Getting ANY following element:
//td[label[@for='PropertyA']]/following-sibling::*
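For readers working in a host language, the "find the label's cell, then take the next cell in the same row" idea can be sketched in Python with the standard library. This is only an illustration under assumptions: the row is wrapped in a table element so it parses as a standalone document, the sibling lookup is done by hand because xml.etree.ElementTree has no following-sibling axis, and value_next_to_label is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumption: the row from the question, wrapped in a <table> root.
TABLE = """<table>
<tr>
<td>
<label for="PropertyA">Some text here </label>
</td>
<td> TEXT!! </td>
</tr>
</table>"""

table = ET.fromstring(TABLE)


def value_next_to_label(table, label_for):
    """Return the text of the td that directly follows the td holding the label."""
    for row in table.iter("tr"):
        cells = row.findall("td")
        # Pair every cell with its immediate following sibling in the same row.
        for cell, neighbour in zip(cells, cells[1:]):
            if cell.find(f"label[@for='{label_for}']") is not None:
                return "".join(neighbour.itertext()).strip()
    return None


print(value_next_to_label(table, "PropertyA"))
```

Because the search stays inside one tr, it behaves like the following-sibling variant: a label in the last cell of a row yields None rather than a cell from the next row.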

XPath / XQuery: find text in a node, but ignoring content of specific descendant elements

I am trying to find a way to search for a string within nodes, but excluding the content of some subelements of those nodes. Put simply, I want to search for a string in the paragraphs of a text, excluding the footnotes, which are child elements of the paragraphs.
For example,
My document being:
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>
When I'm searching for "text", I would like the Xpath / XQuery to retrieve the first p element, but not the second one (where "text" is contained only in the footnote subelement).
I have tried the contains() function, but it retrieves both p elements.
Any help would be much appreciated :)
I want to search for a string in
paragraphs of a text, excluding the
footnotes which are children elements
of the paragraphs
An XPath 1.0 - only solution:
Use:
//p//text()[not(ancestor::footnote) and contains(.,'text')]
Against the following XML document (obtained from yours but added p s within a footnote to make this more interesting):
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>
this XPath expression selects exactly the wanted text node:
My text starts here/
//p[(.//text() except .//footnote//text())[contains(., 'text')]]
/document/p[text()[contains(., 'text')]] should do.
For the record, as a complement to the other answers, I've found this workaround that also seems to do the job:
//p[contains(child::text()|not(descendant::footnote), "text")]
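The "ignore everything under footnote" rule can also be written out by hand. The sketch below is a Python illustration using only the standard library, run against the extended document from the first answer; it does not execute any of the XPath expressions above, and text_outside is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DOC = """<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>"""


def text_outside(element, skipped_tag):
    """Concatenate descendant text, leaving out anything under skipped_tag."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skipped_tag:
            parts.append(text_outside(child, skipped_tag))
        # Text after a skipped child still belongs to the parent element.
        parts.append(child.tail or "")
    return "".join(parts)


document = ET.fromstring(DOC)
paragraphs = document.findall("p")

# A plain string-value test, like contains(., 'text'), matches both paragraphs.
plain = [p.get("n") for p in paragraphs if "text" in "".join(p.itertext())]

# Skipping footnote content leaves only the first paragraph.
filtered = [p.get("n") for p in paragraphs if "text" in text_outside(p, "footnote")]

print(plain)
print(filtered)
```

Note that this tests the concatenated text of each paragraph, so a match spanning two adjacent text nodes would also count, whereas the first answer tests each text node separately.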

XPath query. Preceding-sibling of a conditionally reduced set of nodes

I got html code like the following:
<p style="margin:0 0 0.5em 0;"><b>Blablub</b></p>
<table> ... </table>
Now I want to query the content of the <b> right above the table but only if the table does not have any attributes. I tried the following query:
//table[not(@*)]/preceding-sibling::p/b
If I remove the preceding-sibling::p/b part entirely it works. It gives me exactly the tables I need. However, if I use this query it gives me content of an <b> tag which precedes a table WITH attributes.
Use:
//table[not(@*)]/preceding-sibling::*[1][self::p]/b
This means: Select all b elements that are children of all p elements that are the first preceding sibling of a table that has no attributes.
This is quite different from the problematic expression cited in the question:
//table[not(@*)]/preceding-sibling::p[1]/b
The latter selects the b children of the first preceding p sibling, and there is no guarantee that the first preceding p sibling is also the first preceding element sibling.
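The distinction between "the nearest preceding sibling, if it is a p" and "the nearest preceding p" is easy to check on a small example. The following Python sketch uses only the standard library and emulates both predicates by hand; the sample markup, the body wrapper, and the helper name headings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the last table has no attributes, but its nearest
# preceding sibling is a table with attributes, not a p.
BODY = """<body>
<p><b>Wanted</b></p>
<table></table>
<p><b>Unwanted</b></p>
<table border="1"></table>
<table></table>
</body>"""

body = ET.fromstring(BODY)


def headings(body, strict):
    """Collect b texts for attribute-free tables.

    strict=True  emulates preceding-sibling::*[1][self::p]/b
    strict=False emulates preceding-sibling::p[1]/b
    """
    found = []
    children = list(body)
    for index, node in enumerate(children):
        if node.tag != "table" or node.attrib:
            continue
        before = children[:index][::-1]  # nearest sibling first
        if strict:
            candidates = [n for n in before[:1] if n.tag == "p"]
        else:
            candidates = [n for n in before if n.tag == "p"][:1]
        for paragraph in candidates:
            found.extend(b.text for b in paragraph.findall("b"))
    return found


print(headings(body, strict=True))
print(headings(body, strict=False))
```

With strict=False the last table reaches back past the attributed table and picks up the paragraph that belongs to it, which is the unwanted result described in the question.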

Xquery to extract text in html

I am working on extracting text out of html documents and storing it in a database. I am using the webharvest tool for extracting the content. However, I am stuck at one point. Inside webharvest I use an XQuery expression in order to extract the data. The html document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello World". Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text Hello world2 that is in between hw2 and hw3. I would not want to use text()[3], but is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Your xpath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
Update (following comments and update to question):
This xpath selects the second text node from an $item that has an a tag with a name attribute set to hw:
$item[a/@name='hw']//text()[2]
I would not want to use text()[3] but
is there some way I could extract the
text out between /a[@name='hw2'] and
/a[@name='hw3'].
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace $ns1 in the above expression with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
Lastly, if you really have XQuery (or XPath 2), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/@name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".
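Selecting the text by the name of the anchor in front of it, rather than by position, can also be sketched in Python with the standard library. This is an illustration under the assumption that only one text node sits between two neighbouring anchors: xml.etree.ElementTree stores the text that directly follows an element in that element's tail attribute, which here plays the role of following-sibling::text()[1]. The helper name text_after_anchor is hypothetical.

```python
import xml.etree.ElementTree as ET

TD = """<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>"""

td = ET.fromstring(TD)


def text_after_anchor(cell, name):
    """Return the text that directly follows the <a> with the given name."""
    anchor = cell.find(f"a[@name='{name}']")
    if anchor is None:
        return None
    # tail is the text between this element's end tag and the next sibling.
    return (anchor.tail or "").strip()


print(text_after_anchor(td, "hw2"))
```

If other elements or comments could appear between the anchors, the tail would stop at the first of them, so the intersection approach described above is the more general one.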
