Nokogiri and tables - ruby

I am parsing a web page with the following standard structure:
<html>
<body>
<table>
<tbody>
<tr class="active">
<td>name1</td>
<td>name2</td>
<td>name3</td>
</tr>
</tbody>
</table>
</body>
</html>
For the life of me, I can't access the 'tbody' or 'tr' elements.
require 'nokogiri'
require 'open-uri'
response = URI.open('http://my_url')
node = Nokogiri::HTML(response).css('table')
puts node
Returns
#<Nokogiri::XML::Element:0x8294c08c name="table" attributes=[#<Nokogiri::XML::Attr:0x8294c014 name="id" value="beta-users">] children=[#<Nokogiri::XML::Text:0x82953bc0 "\n">]>
I have tried various tricks but can't seem to dig deeper down to a lower-level child than 'table'.
At best, I can get to the lowest-level Text object by using
node.children
but
node.children.text
returns "\n".
Despite searching for some hours, I am none the wiser about how to sort it out. Any thoughts?

There is an unclosed class attribute value in your sample. It should be:
<html>
<body>
<table>
<tbody>
<tr class="active">
<td>name1</td>
<td>name2</td>
<td>name3</td>
</tr>
</tbody>
</table>
</body>
</html>
After correcting this, you can select the cells directly and print their text:
node = Nokogiri::HTML(response).css('table tbody tr td')
node.each { |child| puts child.text }
which prints:
name1
name2
name3

Related

Web Scraping - xPath issue

I need to extract the text 120 from this HTML code:
<section class="details">
<h2>Détails du bien</h2>
<table>
....
<tr>
<td>Surface habitable (m²)</td>
<td class="right" title="120">120 </td>
</tr>
...
</table>
</section>
I used this xpath, but it returns an empty list:
//td[contains(text(),"Surface")]/td[@class="right"]/text()
What am I doing wrong?
The second td is a sibling of the first one, not its child, so /td finds nothing. Use the following-sibling axis instead:
//td[contains(text(),"Surface")]/following-sibling::td[@class="right"]/text()
This should solve your problem.

How to get between two br tags in xpath?

I have a table with a td like this:
<td>
<span> Washington US <br>98101 Times Square</span>
</td>
I can get all the elements in the page, but I need to get those two values separately. If that isn't possible, I would like to somehow get just 98101 Times Square.
I have tried something like string(//tr[3]//td[2])/ but all I get is the two texts joined together.
You can select the text child nodes of the span element with span/text(). Assuming your posted path selects the td containing the span you want, that gives //tr[3]//td[2]/span/text().
Here is a sample:
$html = <<<EOD
<html>
<body>
<table>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
<tr>
<td>3,1</td>
<td>
<span> Washington US <br>98101 Times Square</span>
</td>
</tr>
</table>
</body>
</html>
EOD;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$textNodes = $xpath->query('//tr[3]//td[2]/span/text()');
foreach ($textNodes as $text) {
echo $text->textContent . "\n";
}
Outputs
Washington US
98101 Times Square
Try
td/span/node()[1]
and
td/span/node()[3]
or
td/span/text()[1]
and
td/span/text()[2]

Scraping page with correct xpath using Mechanize and nokogiri

I am trying to access data contained in a table that is itself contained in a table with class="L1".
So basically my html structure is like this:
<table class="L1">
<table>
<tr></tr>
<tr>
<td></td>
<td>data</td>
</tr>
<tr>
<td></td>
<td>data</td>
</tr>
...etc.
</table>
</table>
I need to catch the data contained in all the <a> </a> elements that are in the second <td> of each <tr> </tr>, but only starting with the second <tr> of the table.
So far I have come up with this:
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr/td[2]/a[1]")
But it seems to me that this doesn't express the fact that I want to start only from the second <tr> (second <tr> included).
What would be the right code to do this?
You can use position() to select the later elements that you want.
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td[2]/a[1]")
As the comments on the linked SO answer point out, XPath counts from 1, so position()>1 skips only the first tr.

How to get all the nodes which are coming after a particular tag using Nokogiri

I want to fetch all the HTML tags that come after a particular tag. For example:
<html>
<body>
<p>one</p>
<u><p>Two</p></u>
<b><p>Three</p></b>
<p>Four</p>
<table>
<tr><td>Five</td></tr>
<tr><td>Six</td></tr>
</table>
</body>
</html>
I want all the HTML tags that come after <u><p>Two</p></u> using Nokogiri.
My result should be:
<b><p>Three</p></b>
<p>Four</p>
<table>
<tr><td>Five</td></tr>
<tr><td>Six</td></tr>
</table>
The following-sibling XPath axis is what you want here. Your example isn’t valid HTML, and Nokogiri will change it when parsing it as HTML, which makes it hard to use for a demonstration. With this similar markup:
<html>
<body>
<p>one</p>
<p>Two</p>
<p>Three</p>
<p>Four</p>
<table>
<tr><td>Five</td></tr>
<tr><td>Six</td></tr>
</table>
</body>
</html>
this XPath expression:
//p[.="Two"]/following-sibling::*
will select this:
<p>Three</p>
<p>Four</p>
<table>
<tr><td>Five</td></tr>
<tr><td>Six</td></tr>
</table>
You might want to use node() instead of *, which selects text nodes as well as elements (including whitespace-only nodes):
<p>Three</p>
<p>Four</p>
<table>
<tr><td>Five</td></tr>
<tr><td>Six</td></tr>
</table>
(There will be some more leading whitespace on each line if you do this; I’ve removed it here.)

Ruby - nokogiri - parse only specific html table

I have an HTML document to parse and read a bunch of stuff from. The problem is that the HTML has multiple tables in it, and I am only interested in one of them. In addition, I want to read only the lines that have some useful content. Here is a sample HTML page: there are two tables with no ID, and I want only the second table and only the lines that are useful to humans.
<HTML>
<BODY>
<TABLE>
<TR>
<TD> I don't want this table </TD></TR>
<TR>
<TD></TD>
<TD> No No No <br></TD>
</TR>
....
</TABLE>
<TABLE>
<TR>
<TD>04/13/2012 22:51 I want this table </TD></TR>
<TR>
<TD></TD>
<TD> First - something there <br></TD>
</TR>
<TR>
<TD>04/13/2012 23:23 Update from xyz</TD></TR>
<TR>
<TD></TD>
<TD>Second - something here <br></TD>
</TR>
</TABLE>
</BODY>
</HTML>
I am trying this code, which is obviously not working. The output is not the text I want: it includes both tables, and I only want the second one. Help!
require 'curb'
require 'nokogiri'
c = Curl::Easy.perform("http://server/cgi-bin/page.cgi?id=123456")
html_doc = Nokogiri::HTML(c.body_str.to_s)
puts html_doc.xpath("//table/tr/td")
Have you tried the XPath //table[2]/tr/td to get the second table? If you can change the source of the HTML, the best solution would be to give your tables id attributes.
