XPath with dittoed fields? - xpath

In this document if the second column is blank it means use the previous row's value.
<doc>
<table>
<tr><td>ASU</td><td>CS</td><td>3</td></tr>
<tr><td>ASU</td><td>English</td><td>3</td></tr>
<tr><td>ASU</td><td></td><td>4</td></tr>
<tr><td>ASU</td><td>French</td><td>3</td></tr>
</table>
<table>
<tr><td>CMU</td><td>CS</td><td>4</td></tr>
<tr><td>CMU</td><td>English</td><td>3</td></tr>
<tr><td>CMU</td><td>French</td><td>3</td></tr>
<tr><td>CMU</td><td></td><td>4</td></tr>
</table>
<table>
<tr><td>SDSU</td><td>English</td><td>3</td></tr>
<tr><td>SDSU</td><td></td><td>4</td></tr>
<tr><td>SDSU</td><td></td><td>5</td></tr>
<tr><td>SDSU</td><td>French</td><td>4</td></tr>
</table>
</doc>
I want the rows where the second column is English, so these would be the rows:
<tr><td>ASU</td><td>English</td><td>3</td></tr>
<tr><td>ASU</td><td></td><td>4</td></tr>
<tr><td>CMU</td><td>English</td><td>3</td></tr>
<tr><td>SDSU</td><td>English</td><td>3</td></tr>
<tr><td>SDSU</td><td></td><td>4</td></tr>
<tr><td>SDSU</td><td></td><td>5</td></tr>
What would the XPath be for this?

(This uses XPath 1.0; there may be better solutions with more recent XPath versions.)
First, you want tr elements, so that’s straightforward:
/doc/table/tr[...some predicate...]
The rows you want are either:
Those where the second td just contains “English”
td[2] = 'English'
Or those where the second td is empty...
td[2] = ''
and, looking at the preceding sibling rows which don’t have an empty second td...
preceding-sibling::tr[td[2] != '']
the nearest one ([1], since preceding-sibling is a reverse axis) has a second td that contains “English”
/td[2] = 'English'
So combining all that, a query that gives you the desired rows is:
/doc/table/tr[td[2] = 'English'
or (td[2] = ''
and preceding-sibling::tr[td[2] != ''][1]/td[2] = 'English')]

Related

Scrapy: How do I select the next `td` in this `tr`?

I want to select the next sibling of a td tag in a tr element.
The tr element is this:
<tr>
<td>Created On:</td>
<td>06/28/2018 06:32 </td>
</tr>
My Scrapy code looks like this: response.xpath("//text()[contains(.,'Created On:')]/following-sibling::td"). But that gives me an empty list [].
How do I select the next td?
Try this XPath expression:
//text()[contains(.,'Created On:')]/../following-sibling::td
You were trying to use the following-sibling axis from the wrong context node. Going back one level fixes this problem.
An alternative is matching the td element in the first place like in this expression:
//td[contains(text(),'Created On:')]/following-sibling::td

Selenium IDE with XPath to identify cell in table based on other column

Please take a look at the snippet of html below:
<tr class="clickable">
<td id="7b8ee8f9-b66f-4fba-83c1-4cf2827130b5" class="clickable">
<a class="editLink" href="#">Single</a>
</td>
<td class="clickable">£14.00</td>
</tr>
I'm trying to assert the value of td[2] when td[1] contains "Single". I've tried assorted variants of:
//td[2][(contains(text(),'£14.00'))]/../td[1][(contains(text(),'Single'))]
I've used similar notation elsewhere successfully - but to no avail here... I think it's down to td[1] having the nested element, but not sure.
Can someone enlighten as to what I'm getting wrong? :)
Cheers!
What about:
//tr[contains(td[1], "Single")]/td[2]
First select the <tr> containing the <td> matching the text, and then select td[2].
Then,
contains(//tr[contains(td[1], "Single")]/td[2], "£14.00")
should return True.
Or, closer to the expression you tried, you could test if this matches:
//tr[contains(td[1], "Single")]/td[2][contains(., "£14.00")]
See @JensErat's answer to "find xth td with td contains in same tr xpath python".
Why not make it simple on yourself and do the if statement in your code? Pseudocode:
Select the top level tr.
Find the first td within the tr and check whether it contains Single.
If it does, assert that the second td contains £14.00.
Alternatively, you could just get the text of the top level tr and perform the checks on that text.

Get last word inside table cell?

I want to scrape data from a table with Ruby and Nokogiri.
There are a lot of <td> elements, but I only need the country which is just text after a <br> element. The problem is, the <td> elements differ. Sometimes there is more than just the country.
For example:
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
I want to address the element before the closing </td> tag because the country is always the last element.
How can I do that?
I'd use this:
require 'awesome_print'
require 'nokogiri'
html = '
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
'
doc = Nokogiri::HTML(html)
ap doc.search('td').map{ |td| td.search('text()').last.text }
[
[0] "USA",
[1] "UK",
[2] "Switzerland"
]
The catch is that in the real page the <td> tags won't sit in a bare list like this. They'll be nested inside <tr> tags, and maybe spread across different <table> tags, so you'll have to locate the ones you want to parse first. Because your HTML sample doesn't show the true structure of the document, I can't help you more.
There are a bunch of different solutions. Another one, using only the standard library, is to split the string and strip out the things you don't want.
node_string = <<-STRING
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
STRING
node_string.split("<td>").collect do |str|
last_str = str.split("<br>").last
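# Remove only the trailing "</td>"; a character class like /[\n,\<\/td\>]/ would also eat every "t" and "d" in the country name.
last_str.sub(%r{</td>\s*\z}, '').strip unless last_str.nil?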
end.compact

Parse a HTML table using Ruby, Nokogiri omitting the column headers

I have trouble parsing a HTML table using Nokogiri and Ruby. My HTML table structure looks like this
<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Middle</td>
</tr>
<tr>
<td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
....
....
.... {more tr's and td's with similar data exists.}
....
....
....
....
....
</tbody>
</table>
In the above HTML table I would like to entirely remove the first <tr> and its corresponding <td> elements, i.e. remove Firstname, Lastname and Middle, and start extracting the text only from the second <tr>. That way I get only the contents of the table from the second <tr> (tr[2]) onward, with no column headers.
Can someone please show me the code to do this?
Thanks.
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(x) # x holds the HTML source as a string
rows = doc.xpath('//table/tbody/tr[position() > 1]')
# OR
rows = doc.xpath("//table/tbody/tr")
header = rows.shift
After you've run either one of the above 2 snippets, rows will contain every <tr>...</tr> after the first one. For example puts rows.to_xml prints the following:
<tr><td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
To get the inner text, removing all the html tags, run puts rows.text
ding
dong
ling
To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }
["ding", "dong", "ling"]
Alternatively:
table.css('tr')[1..-1]
or to strip out the text starting at row 2:
table.css('tr')[1..-1].map{|tr| tr.css('td').map &:text}
Since Nokogiri supports the :has CSS pseudo-class, you can get the heading row with
@doc.at_css('table#table_id').css('tr:has(th)')
and since it supports the :not CSS pseudo-class as well, you can get the other rows with
@doc.at_css('table#table_id').css('tr:not(:has(th))')
respectively. Depending on your preferences you might like to avoid negation and just use css('tr:has(td)').

How to get table data from tables using xpath

What XPath query could I use to get the values from the 1st and 3rd <td> tag for each row in an HTML table?
The XPath query I have used is
/table/tr/td[1]|td[3]
This only returns the values in the first <td> tag for each row in a table.
EXAMPLE
I would expect to get the values bob,19,jane,11,cameron and 32 from the below table. But am only getting bob,jane,cameron.
<table>
<tr><td>Bob</td><td>Male</td><td>19</td></tr>
<tr><td>Jane</td><td>Female</td><td>11</td></tr>
<tr><td>Cameron</td><td>Male</td><td>32</td></tr>
</table>
@jakenoble's answer:
/table/tr/td[1]|/table/tr/td[3]
is correct.
An equivalent XPath expression that avoids the | (union) operator and may be more efficient is:
/table/tr/td[position() = 1 or position() = 3]
Try
/table/tr/td[1]|/table/tr/td[3]
I remember doing this in the past and found it rather annoying because it is ugly and long-winded.

Resources