How to get table data from tables using XPath

What XPath query could I use to get the values from the 1st and 3rd <td> of each row in an HTML table?
The XPath query I have used is
/table/tr/td[1]|td[3].
This only returns the values in the first <td> tag for each row in a table.
EXAMPLE
I would expect to get the values Bob, 19, Jane, 11, Cameron and 32 from the table below, but I am only getting Bob, Jane and Cameron.
<table>
<tr><td>Bob</td><td>Male</td><td>19</td></tr>
<tr><td>Jane</td><td>Female</td><td>11</td></tr>
<tr><td>Cameron</td><td>Male</td><td>32</td></tr>
</table>

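@jakenoble's answer:
/table/tr/td[1]|/table/tr/td[3]
is correct. In the original expression the second operand of the union, td[3], is a relative path evaluated against the context node rather than against /table/tr, which is why only the first cells came back.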
An equivalent XPath expression that avoids the | (union) operator and may be more efficient is:
/table/tr/td[position() = 1 or position() = 3]
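To try the single-path version outside a browser, here is a minimal sketch in Ruby. It assumes REXML (which ships with standard Ruby installs) in place of a dedicated HTML parser, so that it runs without extra gems; the variable names are made up for illustration.

```ruby
require "rexml/document"

# Sample table from the question.
html = <<~'HTML'
  <table>
  <tr><td>Bob</td><td>Male</td><td>19</td></tr>
  <tr><td>Jane</td><td>Female</td><td>11</td></tr>
  <tr><td>Cameron</td><td>Male</td><td>32</td></tr>
  </table>
HTML

doc = REXML::Document.new(html)

# One location path; the predicate keeps the 1st and 3rd <td> of each row.
cells = REXML::XPath.match(doc, "/table/tr/td[position() = 1 or position() = 3]")
first_and_third = cells.map(&:text)

puts first_and_third.join(",")
```

Because the predicate is evaluated per row, the positions are counted among the <td> children of each <tr>, not across the whole table.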

Try
/table/tr/td[1]|/table/tr/td[3]
I remember doing this in the past and finding it rather annoying because it is ugly and long-winded.

Related

XPath with dittoed fields?

In this document if the second column is blank it means use the previous row's value.
<doc>
<table>
<tr><td>ASU</td><td>CS</td><td>3</td></tr>
<tr><td>ASU</td><td>English</td><td>3</td></tr>
<tr><td>ASU</td><td></td><td>4</td></tr>
<tr><td>ASU</td><td>French</td><td>3</td></tr>
</table>
<table>
<tr><td>CMU</td><td>CS</td><td>4</td></tr>
<tr><td>CMU</td><td>English</td><td>3</td></tr>
<tr><td>CMU</td><td>French</td><td>3</td></tr>
<tr><td>CMU</td><td></td><td>4</td></tr>
</table>
<table>
<tr><td>SDSU</td><td>English</td><td>3</td></tr>
<tr><td>SDSU</td><td></td><td>4</td></tr>
<tr><td>SDSU</td><td></td><td>5</td></tr>
<tr><td>SDSU</td><td>French</td><td>4</td></tr>
</table>
</doc>
I want the rows where the second column is English (including rows that inherit it from a previous row), so these would be the rows:
<tr><td>ASU</td><td>English</td><td>3</td></tr>
<tr><td>ASU</td><td></td><td>4</td></tr>
<tr><td>CMU</td><td>English</td><td>3</td></tr>
<tr><td>SDSU</td><td>English</td><td>3</td></tr>
<tr><td>SDSU</td><td></td><td>4</td></tr>
<tr><td>SDSU</td><td></td><td>5</td></tr>
What would the XPath be for this?
(This is using XPath 1.0; there may be better solutions with more recent XPath versions.)
First, you want trs, so that’s straightforward:
/doc/table/tr[...some predicate...]
The rows you want are either:
Those where the second td just contains “English”
td[2] = 'English'
Or those where the second td is empty...
td[2] = ''
and, looking at the previous sibling rows which don’t have an empty second td...
preceding-sibling::tr[td[2] != '']
the first one ([1]) has a second td that contains “English”
/td[2] = 'English'
So combining all that, a query that gives you the desired rows is:
/doc/table/tr[td[2] = 'English'
or (td[2] = ''
and preceding-sibling::tr[td[2] != ''][1]/td[2] = 'English')]
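As a cross-check on which rows that query should return, the same ditto rule can be applied procedurally. This is a sketch in plain Ruby, not the XPath itself: it assumes REXML only for parsing, and it carries the last non-empty second column forward within each table. The names current and english_rows are made up for illustration.

```ruby
require "rexml/document"

# Sample document from the question.
xml = <<~'XML'
  <doc>
  <table>
  <tr><td>ASU</td><td>CS</td><td>3</td></tr>
  <tr><td>ASU</td><td>English</td><td>3</td></tr>
  <tr><td>ASU</td><td></td><td>4</td></tr>
  <tr><td>ASU</td><td>French</td><td>3</td></tr>
  </table>
  <table>
  <tr><td>CMU</td><td>CS</td><td>4</td></tr>
  <tr><td>CMU</td><td>English</td><td>3</td></tr>
  <tr><td>CMU</td><td>French</td><td>3</td></tr>
  <tr><td>CMU</td><td></td><td>4</td></tr>
  </table>
  <table>
  <tr><td>SDSU</td><td>English</td><td>3</td></tr>
  <tr><td>SDSU</td><td></td><td>4</td></tr>
  <tr><td>SDSU</td><td></td><td>5</td></tr>
  <tr><td>SDSU</td><td>French</td><td>4</td></tr>
  </table>
  </doc>
XML

doc = REXML::Document.new(xml)
english_rows = []

doc.root.each_element("table") do |table|
  current = nil # last non-empty value seen in column 2 of this table
  table.each_element("tr") do |tr|
    # An empty <td></td> has nil text, so normalise it to "".
    cells = tr.elements.to_a("td").map { |td| td.text.to_s }
    current = cells[1] unless cells[1].empty?
    english_rows << cells if current == "English"
  end
end

english_rows.each { |cells| puts cells.join(" | ") }
```

Resetting current for every table mirrors the preceding-sibling axis, which never looks outside the current table.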

Accessing the text value of the last nested <tr> element with no id or class hooks

I need to access the value of the 10th <td> element in the last row of a table. I can't use an ID as a hook because only the table has an ID. I've managed to make it work using the code below. Unfortunately, it's static. I know I will always need the 10th <td> element, but I won't ever know which row it will be in. I just know it needs to be the last row in the table. How would I replace "tr[6]" with the actual last <tr> dynamically? (This is probably really easy, but this is literally my first time doing anything with Ruby.)
page = Nokogiri::HTML(open(url))
test = page.css("tr[6]").map { |row|
row.css("td[10]").text}
puts test
You want to do:
page.at("tr:last td:eq(10)")
If you do not need to do anything else with the page you can actually make this a single line with
test = Nokogiri::HTML(open(url)).search("tr").last.search("td")[9].text # index 9 is the 10th <td>
Otherwise (this will work):
page = Nokogiri::HTML(open(url))
test = page.search("tr").last.search("td")[9].text # results are indexed from zero, so [9] is the 10th <td>
puts test
Example (using a large table from another question on Stack Overflow):
Nokogiri::HTML(open("http://en.wikipedia.org/wiki/Richard_Dreyfuss")).search('table')[1].search('tr').last.search('td').children.map{|c| c.text}.join(" ")
#=> "2013 Paranoia Francis Cassidy"
Is there a particular reason you want an Array with 1 element? My example will return a string but you could easily modify it to return an Array.
You can use CSS pseudo class selectors for this:
page.css("table#the-table-id tr:last-of-type td:nth-of-type(10)")
This first selects the <table> with the appropriate id, then selects the last <tr> child of that table, and then selects the 10th <td> of that <tr>. The result is an array-like set of all matching elements; if you expect there to be only one, you could use at_css instead.
If you prefer XPath, you could use this:
page.xpath("//table[@id='the-table-id']/tr[last()]/td[10]")
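The difference between XPath positions (1-based) and Ruby indexes (0-based) is easy to trip over here, so a small sketch may help. It assumes REXML instead of Nokogiri so that it runs without extra gems, and the table is generated on the fly with hypothetical cell labels of the form r<row>c<column>.

```ruby
require "rexml/document"

# Hypothetical table: 3 rows of 10 cells, each cell labelled r<row>c<col>.
rows = (1..3).map do |r|
  "<tr>" + (1..10).map { |c| "<td>r#{r}c#{c}</td>" }.join + "</tr>"
end
html = "<table id='the-table-id'>#{rows.join}</table>"

doc = REXML::Document.new(html)

# XPath positions are 1-based: td[10] is the tenth cell of the last row.
via_xpath = REXML::XPath.first(doc, "//table[@id='the-table-id']/tr[last()]/td[10]").text

# Ruby arrays are 0-based: the tenth cell is index 9.
via_array = doc.root.elements.to_a("tr").last.elements.to_a("td")[9].text

puts via_xpath
puts via_array
```

Both lookups should land on the same cell, whichever row happens to be last.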

Selenium IDE with XPath to identify cell in table based on other column

Please take a look at the snippet of html below:
<tr class="clickable">
<td id="7b8ee8f9-b66f-4fba-83c1-4cf2827130b5" class="clickable">
<a class="editLink" href="#">Single</a>
</td>
<td class="clickable">£14.00</td>
</tr>
I'm trying to assert the value of td[2] when td[1] contains "Single". I've tried assorted variants of:
//td[2][(contains(text(),'£14.00'))]/../td[1][(contains(text(),'Single'))]
I've used similar notation elsewhere successfully - but to no avail here... I think it's down to td[1] having the nested element, but not sure.
Can someone enlighten me as to what I'm getting wrong? :)
Cheers!
What about:
//tr[contains(td[1], "Single")]/td[2]
First select the <tr> containing the <td> matching the text, and then select td[2].
Then,
contains(//tr[contains(td[1], "Single")]/td[2], "£14.00")
should return True.
Or, closer to the expression you tried, you could test if this matches:
//tr[contains(td[1], "Single")]/td[2][contains(., "£14.00")]
See @JensErat's answer to "find xth td with td contains in same tr xpath python".
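The row-first expression can be checked against the snippet from the question with a short Ruby sketch. It assumes REXML rather than Selenium or Nokogiri, and it wraps the <tr> in a <table> element purely so the fragment parses as one document; price is a made-up variable name.

```ruby
require "rexml/document"

# Row from the question, wrapped in <table> so there is a single root element.
html = <<~'HTML'
  <table>
  <tr class="clickable">
  <td id="7b8ee8f9-b66f-4fba-83c1-4cf2827130b5" class="clickable">
  <a class="editLink" href="#">Single</a>
  </td>
  <td class="clickable">£14.00</td>
  </tr>
  </table>
HTML

doc = REXML::Document.new(html)

# contains(td[1], ...) tests the string value of the cell, which includes
# the text of the nested <a>; text() alone would only see the whitespace.
cell = REXML::XPath.first(doc, "//tr[contains(td[1], 'Single')]/td[2]")
price = cell && cell.text

puts price
```

This also explains why the original attempt failed: text() on td[1] only matches the direct text nodes, and "Single" sits inside the nested <a>.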
Why not make it simple on yourself and do the if statement in your code? Pseudocode:
Select the top level tr.
Find the first td within the tr and check whether it contains Single.
If it does, assert that the second td contains £14.00.
Alternatively, you could just get the text of the top level tr and perform the checks on that text.

Parse a HTML table using Ruby, Nokogiri omitting the column headers

I have trouble parsing a HTML table using Nokogiri and Ruby. My HTML table structure looks like this
<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Middle</td>
</tr>
<tr>
<td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
....
....
.... {more tr's and td's with similar data exists.}
....
....
....
....
....
</tbody>
</table>
In the above HTML table I would like to entirely remove the first <tr> and its corresponding <td> elements, so remove Firstname, Lastname and Middle, i.e. I want to start stripping the text only from the second <tr>. This way I get only the contents of the table from the second <tr> (tr[2]) onwards and no column headers.
Can someone please provide code showing how to do this?
Thanks.
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(x)
rows = doc.xpath('//table/tbody/tr[position() > 1]')
# OR
rows = doc.xpath("//table/tbody/tr")
header = rows.shift
After you've run either one of the above 2 snippets, rows will contain every <tr>...</tr> after the first one. For example puts rows.to_xml prints the following:
<tr><td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
To get the inner text, removing all the html tags, run puts rows.text
ding
dong
ling
To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }
["ding", "dong", "ling"]
Alternatively:
table.css('tr')[1..-1]
or to strip out the text starting at row 2:
table.css('tr')[1..-1].map{|tr| tr.css('td').map &:text}
Since Nokogiri supports the :has CSS pseudo-class, you can get the heading row with
@doc.at_css('table#table_id').css('tr:has(th)')
and since it supports the :not CSS pseudo-class as well, you can get the other rows with
@doc.at_css('table#table_id').css('tr:not(:has(th))')
Depending on your preferences you might like to avoid negation and just use css('tr:has(td)').
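For a version that runs without installing any gems, the position() > 1 idea can be sketched with REXML, which is an assumption here in place of Nokogiri. The third row (foo, bar, baz) is made-up data added only to show that more than one data row comes back.

```ruby
require "rexml/document"

# Table from the question, plus one made-up data row (foo/bar/baz).
html = <<~'HTML'
  <table>
  <tbody>
  <tr><td>Firstname</td><td>Lastname</td><td>Middle</td></tr>
  <tr><td>ding</td><td>dong</td><td>ling</td></tr>
  <tr><td>foo</td><td>bar</td><td>baz</td></tr>
  </tbody>
  </table>
HTML

doc = REXML::Document.new(html)

# Every <tr> after the first one, i.e. everything except the header row.
data_rows = REXML::XPath.match(doc, "//table/tbody/tr[position() > 1]")

# One array of cell texts per data row.
records = data_rows.map { |tr| tr.elements.to_a("td").map(&:text) }

records.each { |cells| puts cells.join(", ") }
```

Keeping one array per row, rather than one flat list of cell texts, makes it easier to address columns by index later.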

XPath query. Preceding-sibling of a conditionally reduced set of nodes

I got html code like the following:
<p style="margin:0 0 0.5em 0;"><b>Blablub</b></p>
<table> ... </table>
Now I want to query the content of the <b> right above the table but only if the table does not have any attributes. I tried the following query:
//table[not(@*)]/preceding-sibling::p/b
If I remove the preceding-sibling::p/b part entirely it works. It gives me exactly the tables I need. However, if I use this query it gives me the content of a <b> tag which precedes a table WITH attributes.
Use:
//table[not(@*)]/preceding-sibling::*[1][self::p]/b
This means: Select all b elements that are children of all p elements that are the first preceding sibling of a table that has no attributes.
This is quite different from the problematic expression cited in the question:
//table[not(@*)]/preceding-sibling::p[1]/b
The latter selects the b children of the first preceding p sibling -- there is no guarantee that this p is also the first preceding element sibling.
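To make the distinction concrete, here is a sketch that applies the same three conditions step by step in plain Ruby instead of XPath. It assumes REXML for parsing, and the sample markup (three tables, with a <ul> sitting between the last <p> and its table) is made up for illustration.

```ruby
require "rexml/document"

# Hypothetical page fragment: only the second table has no attributes AND
# is immediately preceded by a <p>.
html = <<~'HTML'
  <div>
  <p><b>First</b></p>
  <table border="1"><tr><td>x</td></tr></table>
  <p><b>Second</b></p>
  <table><tr><td>y</td></tr></table>
  <p><b>Third</b></p>
  <ul><li>in between</li></ul>
  <table><tr><td>z</td></tr></table>
  </div>
HTML

doc = REXML::Document.new(html)
labels = []

doc.root.each_element("table") do |table|
  next unless table.attributes.size.zero? # table[not(@*)]
  prev = table.previous_element           # preceding-sibling::*[1]
  next unless prev && prev.name == "p"    # [self::p]
  prev.each_element("b") { |b| labels << b.text }
end

puts labels.inspect
```

With preceding-sibling::p[1] the third table would also contribute a label, because its nearest preceding <p> exists even though a <ul> sits in between.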
