xpath syntax meaning - xpath

I'm trying to understand a piece of code I should manage. I found some html manipulation in which HtmlAgilityPack is used for some node selection. Someone knows the meaning of this xpath selector?
//table/*[not(self::tr or self::tbody)]

In English:
Select any element node (*) such that it is not itself a tr or
tbody ([not(self::tr or self::tbody)]) and that is the child of a
table element that could appear anywhere in the document (//table).
It is equivalent to the following un-abbreviated expression
/descendant-or-self::node()/child::table/child::*[not(self::tr or self::tbody)]

self is a handy way of referring to the name of the element node under consideration, without namespaces.
In this example, we will match any element which is a child of a table, and is not a tr or a tbody.

Related

Xpath child of multiple types

I have this xpath:
.//*[#id='some_id']/td//div
and now I want to select any child of the div that is of certain type, for example every child that is either a label or span. Something like this
.//*[#id='some_id']/td//div/(label|span)/.......
but that is not valid xpath. How can I do that (wthout writing two full xpaths for the given 2 example for child types)
descendant:: finds on all level below, to find only children use
.//*[#id='some_id']/td//div/*[self::label or self::span]
you need to use
descendant::
to select child elements of particular element. look at the below example,
.//*[#id='some_id']/td//div/descendant::label[#class='some-class']
the above xpath will get all label with class "some-class" which is actually the child of ".//*[#id='some_id']/td//div/" element.
to find multiple child elements then use below xpath,
.//*[#id='some_id']/td//div/descendant::*[local-name()='label' or local-name='span']

Matching text with xpath?

I'm screen-scraping an HTML page which contains:
<table border=1 class="searchresult" cellpadding=2>
<tr><th colspan=2>Last search</th></tr>
<tr><th align=left>Search term</th><td>xxxxxx</td></tr>
<tr><th align=left>Result</th><td>yyyyyyyy/td></tr>
</table>
I want to write an XPATH expression which gets me the data cell containing "yyyyyyyy". I've gotten as far as
.//table[#class='searchresult']//tr/th
which gets me a list of all the table-header nodes in the table. I can iterate over them in user code, find the one whose .text is "Results" and then call .getnext() on that to get the table-data. But, is there a cleaner way to do this by writing a more specific XPATH pattern? It seems like there should be, but I haven't gotten my head that far around XPATH yet to figure out how.
If it matters, I'm doing this in Python with lxml.
.//table[#class='searchresult']//tr/td[preceding-sibling::th] might give you what you need.
Two comprehensive papers on semi-automatically creating XPath statements like this one, specifically for screen scraping purposes can be found here:
http://tobiasanton.com/Tobias_Anton/Academia.html
Use:
//table/tr[last()]/td
This selects any td element that is a child of any tr that is the last tr child of any table in this XHTML document.
This may select more than one td element, depending on whether or not there is only one table in the XHTML document. You need to make this expression more precise, if more than one table element is present.
For example, if the table in question is the first in the document, use:
(//table)[1]/tr[last()]/td

How to make one RxPath from two

I have those two RxPaths which I need to be written in one expresion:
/td[2]/a[1]/tag[1]
and
/td[2]/a[1]
So basically I need to select path with 'tag' element if exists, if not than to select 'a' element.
something like:
if exist /td[2]/a[1]/tag[1] select /td[2]/a[1]/tag[1]
else select /td[2]/a[1]
Those elements need to have innertext attribute with some value in them, so I tried:
/td[2]/descendant::node()[#innertext!='']
but it won't work...
Also those elements are at the bottom of hierarchy so if is there any way to just select first element at lowest level.
I managed to solve this with an regex at the end of my Xpath expression.
/dom/body/div[#id='isc_0']/div/div[#id='isc_B']/div[#id='isc_C']/div[#id='isc_10']/div/div/iframe/body/table/tbody/tr/td[1]/a[#innertext='any uri item']/../../td[2]/*[#innertext~'[^ ]+']
Sorry for misunderstanding with problem...
Regards,
Vajda Vladimir
So basically I need to select path
with 'tag' element if exists, if not
than to select 'a' element. something
like:
if exist
/td[2]/a[1]/tag[1]
select
/td[2]/a[1]/tag[1]
else select
/td[2]/a[1]
I highly doubt that the top element of the document is a td. Don't use /td -- it means you want to select the top element of the document and this top element must be a td .
Also, /td[2] never selects anything, because a (wellformed) XML document has exactly one top element.
Use:
someParentElement/td[2]/a[1]/tag[1]
|
someParentElement/td[2]/a[1][not(someParentElement/td[2]/a[1]/tag[1])]
Those elements need to have innertext
attribute with some value in them
Use:
someParentElement/td[2][.//#innertext[normalize-space()]]/a[1]/tag[1]
|
someParentElement/td[2]
[.//#innertext[normalize-space()]]/a[1]
[not(someParentElement/td[2]
[.//#innertext[normalize-space()]]/a[1]/tag[1])]
Also those elements are at the bottom
of hierarchy so if is there any way to
just select first element at lowest
level.
This is not clear. Please, clarify.
All "leaf" elements can be selected using the following XPath expression:
//*[not(*)]
The elements selected don't have any children-elements, but may have other children (such as text-nodes, PIs, comments) and attributes.
Besides all those good advices from #Dimitre, I want to add that a parent will always come before (in document order) than a child, so you could use this XPath expression:
(/real-path-from-root/td[2]/a[1]
|
/real-path-from-root/td[2]/a[1]/tag[1])[last()]
You could do this without | union set operator in XPath 1.0, but it will end up very unreadable... Of course, in XPath 2.0 you could just do:
(/real-path-from-root/td[2]/a[1]/(.|tag[1]))[last()]

Can't get nth node in Selenium

I try to write xpath expressions so that my tests won't be broken by small design changes. So instead of the expressions that Selenium IDE generates, I write my own.
Here's an issue:
//input[#name='question'][7]
This expression doesn't work at all. Input nodes named 'question' are spread across the page. They're not siblings.
I've tried using intermediate expression, but it also fails.
(//input[#name='question'])[2]
error = Error: Element (//input[#name='question'])[2] not found
That's why I suppose Seleniun has a wrong implementation of XPath.
According to XPath docs, the position predicate must filter by the position in the nodeset, so it must find the seventh input with the name 'question'. In Selenium this doesn't work. CSS selectors (:nth-of-kind) neither.
I had to write an expression that filters their common parents:
//*[contains(#class, 'question_section')][7]//input[#name='question']
Is this a Selenium specific issue, or I'm reading the specs wrong way? What can I do to make a shorter expression?
Here's an issue:
//input[#name='question'][7]
This expression doesn't work at all.
This is a FAQ.
[] has a higher priority than //.
The above expression selects every input element with #name = 'question', which is the 7th child of its parent -- and aparently the parents of input elements in the document that is not shown don't have so many input children.
Use (note the brackets):
(//input[#name='question'])[7]
This selects the 7th element input in the document that satisfies the conditions in the predicate.
Edit:
People, who know Selenium (Dave Hunt) suggest that the above expression is written in Selenium as:
xpath=(//input[#name='question'])[7]
If you want the 7th input with name attribute with a value of question in the source then try the following:
/descendant::input[#name='question'][7]

How do I use XPath in Nokogiri?

I have not found any documentation nor tutorial for that. Does anything like that exist?
doc.xpath('//table/tbody[#id="threadbits_forum_251"]/tr')
The code above will get me any table, anywhere, that has a tbody child with the attribute id equal to "threadbits_forum_251". But why does it start with double //? Why there is /tr at the end? See "Ruby Nokogiri Parsing HTML table II" for more details.
Can anybody tell me how to extract href, id, alt, src, etc., using Nokogiri?
td[3]/div[1]/a/text()' <--- extracts text
How can I extract other things?
Seems you need to read a XPath Tutorial
Your //table/tbody[#id="threadbits_forum_251"]/tr expression means:
// - Anywhere in your XML document
table/tbody - take a table element with a tbody child
[#id="threadbits_forum_251"] - where id attribute are equals to "threadbits_forum_251"
tr - and take its tr elements
So, basically, you need to know:
attributes begins with #
conditions go inside [] brackets
If I correcly understood that API, you can go with doc.xpath("td[3]/div[1]/a")["href"], or td[3]/div[1]/a/#href if there is just one <a> element.
Your XPath is correct and you seem to have answered your own question's first part (almost):
doc.xpath('//table/tbody[#id="threadbits_forum_251"]/tr')
"the code above will get me any table table's tr, anywhere, that has a tbody child with the attribute id equal to threadbits_forum_251"
// means the following element can appear anywhere in the document.
/tr at the end means, get the tr node of the matching element.
You dont need to extract each attribute one by one. Just get the entire node containing all four attributes in Nokogiri, and get the attributes using:
theNode['href']
theNode['src']
Where theNode is your Nokogiri Node object.
Edit:
Sorry I haven't used these libraries, but I think the XPath evaluation and parsing is being done by Mechanize. So here's how you would get the entire element and its attributes in one go.
doc.xpath("td[3]/div[1]/a").each do |anchor|
puts anchor['href']
puts anchor['src']
...
end

Resources