Scraping page with correct xpath using Mechanize and nokogiri - ruby

I am trying to access data contained in a table that is itself contained in a table with class ='L1'.
So basically my html structure is like this:
<table class="L1">
<table>
<tr></tr>
<tr>
<td></td>
<td>data</td>
</tr>
<tr>
<td></td>
<td>data</td>
</tr>
...etc...
</table>
</table>
I need to get the data contained in all the <a> </a> elements that are in the second <td> of each <tr> </tr>, but only starting with the second <tr> of the table.
So far I have come up with this:
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr/td[2]/a[1]")
But it seems to me that this doesn't express the fact that I want to start only from the second <tr> (second <tr> included).
What would be the right code to do this?

You can use position() to select the later elements that you want.
html_body = Nokogiri::HTML(body)
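links = html_body.css('.L1').xpath(".//tr[position()>1]/td[2]/a[1]")
Remember that XPath counts from 1, so position()>1 skips the first tr. The path starts with .// so that it is evaluated relative to the nodes matched by css('.L1'); a path starting with // is evaluated from the document root and ignores that step. The tbody step is dropped because browsers insert tbody when displaying a table, but Nokogiri's libxml2-based HTML parser does not add one that is missing from the source.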

Related

How to turn a table into a single block of text with scrapy

I am trying to scrape a table which looks like the below.
<table class="table">
<caption>Caption</caption>
<tbody>
<tr>
<th scope="row">Title</th>
<td>Detail</td>
</tr>
<tr>
<th scope="row">Title 2</th>
<td>Detail 2</td>
</tr>
</tbody>
</table>
How would you set up scrapy so that my output file looks similar to the below?
Title: Detail
Title2: Detail2
Currently I can get all the text using two css selectors (one for the td's and one for the th's) but I would love to be able to combine these!
Unfortunately the number of rows differs from page to page..
Using xpath:
tabledata = {}
# Walk the table row by row and pair each <th> with the <td> in the same row
for row in response.xpath("//table[@class='table']//tr"):
    tabledata[row.xpath("th/text()").extract_first()] = row.xpath("td/text()").extract_first()
Output
{"Title":"Detail", "Title 2":"Detail 2"}

XPATH for tr that has multiple texts inside it

I need XPATH for <tr> that contains text 'abc' in second <td> and text 'xyz' in third <td>
Tried but no luck.
final String XPATH = "//tr[td[contains(.,'abc')] and td[contains(.,'xyz')]";
Your expression actually almost selects what you want (once you fix the missing last ]). You just need to specify positions of the <td> elements.
//tr[td[2][contains(.,'abc')] and td[3][contains(.,'xyz')]]
For the following XML document:
<document>
<table>
<tr>
<td>foo</td>
<td>abc</td>
<td>xyz</td>
</tr>
<tr>
<td>foo</td>
<td>bar</td>
<td>xyz</td>
</tr>
</table>
</document>
this returns a node-set containing only the first <tr> element: the second <tr> has 'xyz' in its third <td> but not 'abc' in its second.

Import data from HTML page using feeds importer in drupal

I'm trying to import some data from a HTML page with feeds importer. The context is this:
<table class="tabela">
<tr valign="TOP">
<td class="formulario-legenda">Nome:</td>
<td nowrap="nowrap">
<b>Raul Fernando de Almeida Moreira Vidal</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Sigla:</td>
<td>
<b>RMV</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Código:</td>
<td>206415</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Estado:</td>
<td>Ativo</td>
</tr>
</table>
<table>
<tr>
<td class="topo">
<table>
<tr>
<td class="formulario-legenda">Categoria:</td>
<td>Professor Associado</td>
</tr>
<tr>
<td class="formulario-legenda">Carreira:</td>
<td>Pessoal Docente de Universidades</td>
</tr>
<tr>
<td class="formulario-legenda">Grupo profissional:</td>
<td>Docente</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Departamento:</td>
<td>
<a href="uni_geral.unidade_view?pv_unidade=151"
title="Departamento de Engenharia Informática">Departamento de Engenharia Informática</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
I tried with this:
/html/body/div/div/div/div/div/div/div/table/tbody/tr/td/table/tbody/tr[1]/td[2]
but nothing appears. Can someone help me with the right syntax to obtain "Grupo Profissional"?
Quick answer that might work
Considering just the HTML sample you provided (which only has two tables) you can select the text you want using this expression, based on the table's position:
//table[2]//tr[3]/td[1]/text()
This will work in the HTML you pasted above. But it might not work in your actual scenario, since you might have other tables, the table you want to select has no ID and you didn't suggest some invariant text in your code which could be used to anchor the context for the expression. Assuming the initial part of your XPath expression (the div sequence) is correct, you might be able to use:
/html/body/div/div/div/div/div/div/div/table[2]//tr[3]/td[1]/text()
But it's quite a fragile expression and vulnerable to any changes in the document.
A (possibly) better solution
A better alternative is to look for some identifier you could use. I can only guess, since I don't know your code. In your sample code, I would guess that Código: and the number following it, 206415, might be some identifier. If it is, you could use it to anchor your context. First you select it:
//table[.//td[text()='Código:']/following-sibling::td='206415']
The expression above will select the table which contains a td with the exact text Código: followed by a td containing the exact text 206415. This will create a unique context (assuming that the number is a unique identifier). From that context, you can now select the text you want, which is inside the next table (following-sibling::table[1]). This is the context of the second table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]
And this should select the text you want (Grupo profissional:) which is in the third row tr[3] and first cell/column td[1] of that table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]//tr[3]/td[1]/text()

Nokogiri next_element with filter

Let's say I've got an ill formed html page:
<table>
<thead>
<th class="what_I_need">Super sweet text<th>
</thead>
<tr>
<td>
I also need this
</td>
<td>
and this (all td's in this and subsequent tr's)
</td>
</tr>
<tr>
...all td's here too
</tr>
<tr>
...all td's here too
</tr>
</table>
On BeautifulSoup, we were able to get the <th> and then call findNext("td"). Nokogiri has the next_element call, but that might not return what I want (in this case, it would return the tr element).
Is there a way to filter the next_element call of Nokogiri? e.g. next_element("td")?
EDIT
For clarification, I'll be looking at many sites, most of them ill formed in different ways.
For instance, the next site might be:
<table>
<th class="what_I_need">Super sweet text<th>
<tr>
<td>
I also need this
</td>
<td>
and this (all td's in this and subsequent tr's)
</td>
</tr>
<tr>
...all td's here too
</tr>
<tr>
...all td's here too
</tr>
</table>
I can't assume any structure other than there will be trs below the item that has the class what_I_need
First, note that your closing th tag is malformed: <th>. It should be </th>. Fixing that helps.
One way to do it is to use XPath to navigate to it once you've found the th node:
require 'nokogiri'
html = '
<table>
<thead>
<th class="what_I_need">Super sweet text<th>
</thead>
<tr>
<td>
I also need this
</td>
<tr>
</table>
'
doc = Nokogiri::HTML(html)
th = doc.at('th.what_I_need')
th.text # => "Super sweet text"
td = th.at('../../tr/td')
td.text # => "\n I also need this\n "
This is taking advantage of Nokogiri's ability to use either CSS accessors or XPath, and to do it pretty transparently.
Once you have the <th> node, you could also navigate using some of Node's methods:
th.parent.next_element.at('td').text # => "\n I also need this\n "
One more way to go about it is to start at the top of the table and look down:
table = doc.at('table')
th = table.at('th')
th.text # => "Super sweet text"
td = table.at('td')
td.text # => "\n I also need this\n "
If you need to access all <td> tags within a table you can iterate over them easily:
table.search('td').each do |td|
  # do something with the td...
  puts td.text
end
If you want the contents of all <td> grouped by their containing <tr>, iterate over the rows, then over the cells:
table.search('tr').each do |tr|
  cells = tr.search('td').map(&:text)
  # do something with all the cells
end

Ruby - nokogiri - parse only specific html table

I have an HTML doc to parse and read a bunch of stuff from. The problem is that the HTML has multiple tables in it, and I am only interested in one table. Plus, I want to read only the lines that have some useful content. Here is a sample HTML page: there are two tables with no ID, and I want only the second table and only the lines that are useful to humans.
<HTML>
<BODY>
<TABLE>
<TR>
<TD> I don't want this table </TD></TR>
<TR>
<TD></TD>
<TD> No No No <br></TD>
</TR>
....
</TABLE>
<TABLE>
<TR>
<TD>04/13/2012 22:51 I want this table </TD></TR>
<TR>
<TD></TD>
<TD> First - something there <br></TD>
</TR>
<TR>
<TD>04/13/2012 23:23 Update from xyz</TD></TR>
<TR>
<TD></TD>
<TD>Second - something here <br></TD>
</TR>
</TABLE>
</BODY>
</HTML>
I am trying this code, which is obviously not working. The output is not the text I want: it includes both tables, and I only want the second one. Help!
require 'curb'
require 'nokogiri'
c = Curl::Easy.perform("http://server/cgi-bin/page.cgi?id=123456")
html_doc = Nokogiri::HTML(c.body_str.to_s)
puts html_doc.xpath("//table/tr/td")
Have you tried the XPath //table[2]/tr/td to get the second table? If you can change the source of the HTML, the best solution would be to provide id attributes for your tables.
