Parse a HTML table using Ruby, Nokogiri omitting the column headers - ruby

I have trouble parsing a HTML table using Nokogiri and Ruby. My HTML table structure looks like this
<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Middle</td>
</tr>
<tr>
<td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
....
....
.... {more tr's and td's with similar data exists.}
....
....
....
....
....
</tbody>
</table>
In the above HTML table I would like to skip the first <tr> and its <td> elements entirely, so that Firstname, Lastname and Middle are removed. In other words, I want to start extracting text only from the second <tr>. This way I get only the contents of the table from the second <tr> (tr[2]) onward, with no column headers.
Can someone please show me code that does this?
Thanks.

require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(x)
rows = doc.xpath('//table/tbody/tr[position() > 1]')
# OR
rows = doc.xpath("//table/tbody/tr")
header = rows.shift
After you've run either one of the above 2 snippets, rows will contain every <tr>...</tr> after the first one. For example puts rows.to_xml prints the following:
<tr><td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
To get the inner text, removing all the html tags, run puts rows.text
ding
dong
ling
To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }
["ding", "dong", "ling"]

Alternatively:
table.css('tr')[1..-1]
or to strip out the text starting at row 2:
table.css('tr')[1..-1].map{|tr| tr.css('td').map &:text}

Since Nokogiri supports the :has CSS pseudo-class, you can get the heading row with
@doc.at_css('table#table_id').css('tr:has(th)')
and since it supports the :not CSS pseudo-class as well, you can get the other rows with
@doc.at_css('table#table_id').css('tr:not(:has(th))')
Depending on your preferences, you might like to avoid negation and just use css('tr:has(td)').

Related

How to get td text value of nested table by XPath

I have this table inside another table inside another table and so on. And then I want to get the text value of the td element with a specific class.
<tr>
<td width="5%"></td>
<td class="wintxt">The XML ....<br/><br/>Number: xyz</td>
</tr>
I need to get the text content "The XML ....Number: xyz"
I tried using:
List<?> submissionString = resultOfsubmissionPage.getByXPath("//tr[@class=\"wintxt\"]/td/text()");
...and many other variations but I always get a zero element List. Anyone has a clue?
There is a mistake in your XPath: it looks for a tr that has the class attribute and then for text() under its td children, but in the HTML you provided only the td has a class attribute. So try this instead:
List<?> submissionString = resultOfsubmissionPage.getByXPath("//tr/td[@class='wintxt']/text()");
Hope it helps..:)

XPath get only first Parent of nested HTML

I am newbie in XPath. Can someone explain how to resolve this problem:
<table>
<tr>
<td>
<table>
<tr>
<td>
<table>
<tr>
<td>Label</td>
<td>value</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
I try to get <tr> which contains Label value, but it does not work for me,
Here is my code :
//td[contains(.,'Label')]/ancestor::tr[1]
Desired result:
<tr>
<td>Label</td>
<td>value</td>
</tr>
Can someone help me ?
This expression matches the tr that you want:
//tr[contains(td/text(), 'Label')]
Like yours, this starts by scanning all tr elements in the document, but this version uses just a single predicate. The td/text() limits the test to actual text nodes which are grandchildren of the row. If you just used td, then all of the td's descendant text nodes would be collected and concatenated, and the outer tr would match.
UPDATE: Also, for what it's worth, the reason your expression isn't working is not the ancestor axis. ancestor is a reverse axis, so ancestor::tr[1] already picks the nearest enclosing tr. The problem is the predicate: contains(., 'Label') tests the full string value of each td, and the outer tds contain that text too, so every nesting level matches and each match contributes its own nearest tr. To make your approach work, restrict the test to the td's own text nodes:
//td[text()[contains(., 'Label')]]/ancestor::tr[1]
instead of
//td[contains(.,'Label')]/ancestor::tr[1]
I had the same issue, except that the text 'Label' was sometimes in a nested span, or even further nested in the td. For example:
<td><span>Label</span></td>
The previous answer only finds 'Label' if it is in a text node that is a direct child of the td. This case is a bit harder because we need to search for a td that contains the text 'Label' in any of its descendants. Since the tds are nested, all of the enclosing tds also qualify as having a descendant that contains the text 'Label'. So, the only way I found to overcome this is to add a check that makes sure the td we select does not itself contain a td with the search text.
//td[contains(., 'Label') and not(.//td[contains(., 'Label')])]/ancestor::tr[1]
This says: give me all of the tds that have descendant text containing 'Label', but exclude all tds that contain a td that has descendant text containing 'Label' (the nesting ancestors). This returns the innermost td that contains the text. Then you can go back to the tr that contains this td using ancestor.
Also, if you just want the lowest table that contains text use this:
//table[contains(., 'Label') and not(.//table[contains(., 'Label')])]
or you can select the tr directly:
//tr[contains(., 'Label') and not(.//tr[contains(., 'Label')])]
This seems like a common problem, but I didn't see a solution anywhere. So, I decided to post to this old unanswered question in hopes that it helps somebody.

accessing the text value of last nested <tr> element with no id or class hooks

I need to access the value of the 10th <td> element in the last row of a table. I can't use an ID as a hook because only the table has an ID. I've managed to make it work using the code below. Unfortunately, it's static. I know I will always need the 10th <td> element, but I won't ever know which row it will be in. I just know it needs to be the last row in the table. How would I replace "tr[6]" with the actual last <tr> dynamically? (This is probably really easy, but this is literally my first time doing anything with Ruby.)
page = Nokogiri::HTML(open(url))
test = page.css("tr[6]").map { |row|
row.css("td[10]").text}
puts test
You want to do:
page.at("tr:last td:eq(10)")
If you do not need to do anything else with the page, you can actually make this a single line (Ruby indexes are zero-based, so the 10th td is at index 9):
test = Nokogiri::HTML(open(url)).search("tr").last.search("td")[9].text
Otherwise (this will work):
page = Nokogiri::HTML(open(url))
test = page.search("tr").last.search("td")[9].text
puts test
Example:(Used a large table from another question on StackOverflow)
Nokogiri::HTML(open("http://en.wikipedia.org/wiki/Richard_Dreyfuss")).search('table')[1].search('tr').last.search('td').children.map{|c| c.text}.join(" ")
#=> "2013 Paranoia Francis Cassidy"
Is there a particular reason you want an Array with 1 element? My example will return a string but you could easily modify it to return an Array.
You can use CSS pseudo class selectors for this:
page.css("table#the-table-id tr:last-of-type td:nth-of-type(10)")
This first selects the <table> with the appropriate id, then selects the last <tr> child of that table, and then selects the 10th <td> of that <tr>. The result is a NodeSet of all matching elements; if you expect there to be only one, you could use at_css instead.
If you prefer XPath, you could use this:
page.xpath("//table[@id='the-table-id']/tr[last()]/td[10]")

Selenium IDE with XPath to identify cell in table based on other column

Please take a look at the snippet of html below:
<tr class="clickable">
<td id="7b8ee8f9-b66f-4fba-83c1-4cf2827130b5" class="clickable">
<a class="editLink" href="#">Single</a>
</td>
<td class="clickable">£14.00</td>
</tr>
I'm trying to assert the value of td[2] when td[1] contains "Single". I've tried assorted variants of:
//td[2][(contains(text(),'£14.00'))]/../td[1][(contains(text(),'Single'))]
I've used similar notation elsewhere successfully - but to no avail here... I think it's down to td[1] having the nested element, but not sure.
Can someone enlighten as to what I'm getting wrong? :)
Cheers!
What about:
//tr[contains(td[1], "Single")]/td[2]
First select the <tr> containing the <td> matching the text, and then select td[2].
Then,
contains(//tr[contains(td[1], "Single")]/td[2], "£14.00")
should return True.
Or, closer to the expression you tried, you could test if this matches:
//tr[contains(td[1], "Single")]/td[2][contains(., "£14.00")]
See @JensErat's answer to "find xth td with td contains in same tr xpath python".
Why not make it simple on yourself and do the if statement in your code? Pseudocode:
Select the top level tr.
Find the first td within the tr and check whether it contains Single.
If it does, assert that the second td contains £14.00.
Alternatively, you could just get the text of the top level tr and perform the checks on that text.

Get last word inside table cell?

I want to scrape data from a table with Ruby and Nokogiri.
There are a lot of <td> elements, but I only need the country which is just text after a <br> element. The problem is, the <td> elements differ. Sometimes there is more than just the country.
For example:
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
I want to address the element before the closing </td> tag because the country is always the last element.
How can I do that?
I'd use this:
require 'awesome_print'
require 'nokogiri'
html = '
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
'
doc = Nokogiri::HTML(html)
ap doc.search('td').map{ |td| td.search('text()').last.text }
[
[0] "USA",
[1] "UK",
[2] "Switzerland"
]
The catch is that in a real document the <td> tags won't appear as bare rows like this. They'll be nested inside <tr> tags, and possibly spread across different <table> tags, so you'll have to locate the ones you want to parse first. Because your HTML sample doesn't show the true structure of the document, I can't help you more.
There are a bunch of different solutions. Another one, using only the standard library, is to cut out the parts you don't want with string operations.
node_string = <<-STRING
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
STRING
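node_string.split("<td>").collect do |str|
last_str = str.split("<br>").last
# Strip only the closing tag and trailing whitespace. A character class like
# /[\n,\<\/td\>]/ would also delete every "t" and "d" inside the country name.
last_str.sub(%r{</td>\s*\z}, '') unless last_str.nil?
end.compact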

Resources