accessing the text value of last nested <tr> element with no id or class hooks - ruby

I need to access the value of the 10th <td> element in the last row of a table. I can't use an ID as a hook because only the table has an ID. I've managed to make it work using the code below. Unfortunately, it's static. I know I will always need the 10th <td> element, but I won't ever know which row it needs to be. I just know it needs to be the last row in the table. How would I replace "tr[6]" with the actual last <tr> dynamically? (This is probably really easy, but this is literally my first time doing anything with Ruby.)
page = Nokogiri::HTML(open(url))
test = page.css("tr[6]").map { |row|
row.css("td[10]").text}
puts test

You want to do:
page.at("tr:last td:eq(10)")

If you do not need to do anything else with the page you can actually make this a single line with
# Ruby indexes are zero-based, so [9] is the 10th <td>
test = Nokogiri::HTML(open(url)).search("tr").last.search("td")[9].text
Otherwise (this will work):
page = Nokogiri::HTML(open(url))
test = page.search("tr").last.search("td")[9].text # [9] is the 10th <td> (zero-based)
puts test
Example (using a large table from another question on Stack Overflow):
Nokogiri::HTML(open("http://en.wikipedia.org/wiki/Richard_Dreyfuss")).search('table')[1].search('tr').last.search('td').children.map{|c| c.text}.join(" ")
#=> "2013 Paranoia Francis Cassidy"
Is there a particular reason you want an Array with 1 element? My example will return a string but you could easily modify it to return an Array.

You can use CSS pseudo class selectors for this:
page.css("table#the-table-id tr:last-of-type td:nth-of-type(10)")
This first selects the <table> with the appropriate id, then selects the last <tr> inside that table, and then selects the 10th <td> of that <tr>. The result is a NodeSet (an array-like collection) of all matching elements; if you expect there to be only one, you could use at_css instead.
If you prefer XPath, you could use this:
page.xpath("//table[@id='the-table-id']/tr[last()]/td[10]")

Related

How to get td text value of nested table by XPath

I have this table inside another table inside another table and so on. And then I want to get the text value of the td element with a specific class.
<tr>
<td width="5%"></td>
<td class="wintxt">The XML ....<br/><br/>Number: xyz</td>
</tr>
I need to get the text content "The XML ....Number: xyz"
I tried using:
List<?> submissionString = resultOfsubmissionPage.getByXPath("//tr[@class=\"wintxt\"]/td/text()");
...and many other variations, but I always get a zero-element List. Does anyone have a clue?
There is a mistake in your XPath: it looks for text() under a <tr> that has the class attribute, but in the HTML you provided only the <td> has a class attribute. So try this instead:
List<?> submissionString = resultOfsubmissionPage.getByXPath("//tr/td[@class='wintxt']/text()");
Hope it helps..:)

scrapy xpath : selector with many <tr> <td>

Hello I want to ask a question
I scrape a website with xpath ,and the result is like this:
[u'<tr>\r\n
<td>address1</td>\r\n
<td>phone1</td>\r\n
<td>map1</td>\r\n
</tr>',
u'<tr>\r\n
<td>address1</td>\r\n
<td>telephone1</td>\r\n
<td>map1</td>\r\n
</tr>'...
u'<tr>\r\n
<td>address100</td>\r\n
<td>telephone100</td>\r\n
<td>map100</td>\r\n
</tr>']
Now I need to use XPath to analyze these results again.
I want to save the first <td> to address, the second to telephone, and the last one to map,
but I can't get it to work.
Please guide me. Thank you!
Here is my code. It is wrong: it catches other things as well.
store = sel.xpath("")
for s in store:
    address = s.xpath("//tr/td[1]/text()").extract()
    tel = s.xpath("//tr/td[2]/text()").extract()
    map = s.xpath("//tr/td[3]/text()").extract()
As the Scrapy documentation explains, to work with relative XPaths you have to use the .// notation to extract elements relative to the previous XPath; otherwise you get all matching elements from the whole document again. The Scrapy documentation gives this sample:
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
    print(p.extract())
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'): # extracts all <p> inside
    print(p.extract())
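So in your case, since each s in your output is already a <tr> element, I think your code must be something like this (.//tr would look for a <tr> nested inside the row and find nothing):
for s in store:
    address = s.xpath("./td[1]/text()").extract()
    tel = s.xpath("./td[2]/text()").extract()
    map = s.xpath("./td[3]/text()").extract()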
Hope this helps,

Get last word inside table cell?

I want to scrape data from a table with Ruby and Nokogiri.
There are a lot of <td> elements, but I only need the country which is just text after a <br> element. The problem is, the <td> elements differ. Sometimes there is more than just the country.
For example:
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
I want to address the element before the closing </td> tag because the country is always the last element.
How can I do that?
I'd use this:
require 'awesome_print'
require 'nokogiri'
html = '
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
'
doc = Nokogiri::HTML(html)
ap doc.search('td').map{ |td| td.search('text()').last.text }
[
[0] "USA",
[1] "UK",
[2] "Switzerland"
]
The problem is that the HTML actually being parsed won't be bare rows of <td> tags. Instead, they'll be nested inside <tr> tags, and maybe even spread across different <table> tags, so you'll have to locate the ones you want to parse. Because your HTML sample doesn't show the true structure of the document, I can't help you more.
There are a bunch of different solutions. Another one, using only the standard library, is to cut away the substrings you don't want.
node_string = <<-STRING
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
STRING
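node_string.split("<td>").collect do |str|
  last_str = str.split("<br>").last
  # Remove the closing tag and the newline as whole tokens. A character class such as
  # /[\n,\<\/td\>]/ would also delete every "t" and "d" ("Switzerland" -> "Swizerlan").
  last_str.gsub(%r{</td>|\n}, '') unless last_str.nil?
end.compact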

Parse a HTML table using Ruby, Nokogiri omitting the column headers

I have trouble parsing a HTML table using Nokogiri and Ruby. My HTML table structure looks like this
<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Middle</td>
</tr>
<tr>
<td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
....
....
.... {more tr's and td's with similar data exists.}
....
....
....
....
....
</tbody>
</table>
In the above HTML table I would like to entirely remove the first <tr> and its corresponding <td> elements, so remove Firstname, Lastname and Middle, i.e., I want to start extracting the text only from the second <tr>. This way I get only the contents of the table from the second <tr> (tr[2]) onward and no column headers.
Can someone please provide code showing how to do this?
Thanks.
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(x)
rows = doc.xpath('//table/tbody/tr[position() > 1]')
# OR
rows = doc.xpath("//table/tbody/tr")
header = rows.shift
After you've run either one of the above 2 snippets, rows will contain every <tr>...</tr> after the first one. For example puts rows.to_xml prints the following:
<tr><td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
To get the inner text, removing all the html tags, run puts rows.text
ding
dong
ling
To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }
["ding", "dong", "ling"]
Alternatively:
table.css('tr')[1..-1]
or to strip out the text starting at row 2:
table.css('tr')[1..-1].map{|tr| tr.css('td').map &:text}
Since Nokogiri supports the :has CSS pseudo-class, you can get the heading row with
@doc.at_css('table#table_id').css('tr:has(th)')
and since it supports the :not CSS pseudo-class as well, you can get the other rows with
@doc.at_css('table#table_id').css('tr:not(:has(th))')
Depending on your preferences, you might like to avoid negation and just use css('tr:has(td)').

How to get table data from tables using xpath

What XPath query could I use to get the values from the 1st and 3rd <td> tag for each row in an HTML table?
The XPath query I have used is
/table/tr/td[1]|td[3].
This only returns the values in the first <td> tag for each row in a table.
EXAMPLE
I would expect to get the values Bob, 19, Jane, 11, Cameron and 32 from the table below, but I am only getting Bob, Jane and Cameron.
<table>
<tr><td>Bob</td><td>Male</td><td>19</td></tr>
<tr><td>Jane</td><td>Feale</td><td>11</td></tr>
<tr><td>Cameron</td><td>Male</td><td>32</td></tr>
</table>
@jakenoble's answer:
/table/tr/td[1]|/table/tr/td[3]
is correct.
An equivalent XPath expression that avoids the | (union) operator and may be more efficient is:
/table/tr/td[position() = 1 or position() = 3]
Try
/table/tr/td[1]|/table/tr/td[3]
I remember doing this in the past and found it rather annoying because it is ugly and long-winded.
