How to turn a table into a single block of text with scrapy - xpath

I am trying to scrape a table which looks like the below.
<table class="table">
<caption>Caption</caption>
<tbody>
<tr>
<th scope="row">Title</th>
<td>Detail</td>
</tr>
<tr>
<th scope="row">Title 2</th>
<td>Detail 2</td>
</tr>
</tbody>
</table>
How would you set up scrapy so that my output file looks similar to the below?
Title: Detail
Title 2: Detail 2
Currently I can get all the text using two css selectors (one for the td's and one for the th's), but I would love to be able to combine these!
Unfortunately the number of rows differs from page to page.

Using xpath:
tabledata = {}
# One dict entry per row: the th text is the key, the td text is the value.
for row in response.xpath("//table[@class='table']//tr"):
    tabledata[row.xpath("th/text()").extract_first()] = row.xpath("td/text()").extract_first()
Output
{"Title":"Detail", "Title 2":"Detail 2"}
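If the goal is literally one block of text rather than a dict, the same per-row loop can build the lines directly. Here is a minimal sketch, assuming the standard library's xml.etree.ElementTree as a stand-in for the Scrapy response (this only works because the sample table is well-formed markup). table_to_text is a hypothetical helper name, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Sample table from the question, condensed onto fewer lines.
HTML = """<table class="table">
<caption>Caption</caption>
<tbody>
<tr><th scope="row">Title</th><td>Detail</td></tr>
<tr><th scope="row">Title 2</th><td>Detail 2</td></tr>
</tbody>
</table>"""


def table_to_text(markup):
    """Return one 'header: detail' line per table row, joined by newlines."""
    root = ET.fromstring(markup)
    lines = []
    for row in root.iter("tr"):
        header = row.findtext("th")
        detail = row.findtext("td")
        # Skip rows that lack either cell instead of emitting 'None'.
        if header is None or detail is None:
            continue
        lines.append(f"{header.strip()}: {detail.strip()}")
    return "\n".join(lines)


text_block = table_to_text(HTML)
print(text_block)
```

Because the loop runs once per tr, it does not matter that the number of rows differs from page to page. In a spider, the loop body should look the same with row.xpath("th/text()").extract_first() in place of findtext.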

Related

Html Agility Pack loop through Specific Row

I have a table like this
<table>
<thead>
<tr>
<th>Name</th>
<th>Department</th>
<th>Gender</th>
</tr>
</thead>
<tbody>
<tr id="data1">
</tr>
<tr>
</tr>
</tbody>
</table>
I want to use Html Agility Pack to parse a specific row, i.e. I want to display the row that follows the row with id="data1".
Below is the code I am trying:
//Selecting Document Node....
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(data);
//Selecting Specific Node...
var tableNodes = doc.DocumentNode.SelectNodes("//table");
Your XPath should look like this:
//table/tbody/tr[@id='data1']/following-sibling::tr
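To make it concrete what following-sibling::tr selects, here is a rough Python sketch of the same idea, written against the standard library's xml.etree.ElementTree (which has no following-sibling axis, so the sibling walk is done by hand). The cell texts are made up, since the rows in the question are empty, and rows_after is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Rows are filled with made-up cell text so there is something to display.
HTML = """<table>
<tbody>
<tr id="data1"><td>first</td></tr>
<tr><td>second</td></tr>
<tr><td>third</td></tr>
</tbody>
</table>"""


def rows_after(root, row_id):
    """Return every tr that follows the tr with the given id under the same parent."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "tr" and child.get("id") == row_id:
                return [c for c in children[index + 1:] if c.tag == "tr"]
    return []


root = ET.fromstring(HTML)
following = rows_after(root, "data1")
texts = [row.findtext("td") for row in following]
print(texts)
```

Note that the axis returns all later sibling rows, not just one. To get only the row immediately after data1, append [1] to the XPath (following-sibling::tr[1]), which corresponds to following[0] in the sketch.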

How to select all the rows from a table with specific text in its header with xpath

I'm new to XPath and I'm trying to get all the rows from a specific table in a Wikipedia article that has many tables. Luckily, the table I want has the text "Posición" in one of the th elements inside its header. How can I achieve this?
I am using C#, and any help or tips will be greatly appreciated :)
<table>
<thead>
<tr>
<th>Something</th>
<th>Posición</th>
<th>Something</th>
</tr>
</thead>
<tbody>
<tr>
<td>info1</td>
<td>info2</td>
<td>info3</td>
</tr>
... more trs
</tbody>
</table>

Creating pipe tables with pandoc

Is it possible to convert a html table into a pipe table with pandoc (or using any other tool)?
I tried pandoc bla.html --to markdown+pipe_tables and pandoc bla.html --to markdown+pipe_tables-simple_tables but both seem to produce simple tables.
bla.html contains:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>age</th>
<th>workclass</th>
<th>education</th>
<th>gender</th>
<th>hours-per-week</th>
<th>occupation</th>
<th>income</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>39</td>
<td>State-gov</td>
<td>Bachelors</td>
<td>Male</td>
<td>40</td>
<td>Adm-clerical</td>
<td>&lt;=50K</td>
</tr>
</tbody>
</table>
If I use -t markdown_github as suggested here, the output is html again.
I realized that "-t markdown_github" does produce the right result once I entered something into the first <th> cell. The empty cell seems to trip up pandoc.

Scraping page with correct xpath using Mechanize and nokogiri

I am trying to access data contained in a table that is itself contained in a table with class="L1".
So basically my html structure is like this:
<table class="L1">
<table>
<tr></tr>
<tr>
<td></td>
<td>data</td>
</tr>
<tr>
<td></td>
<td>data</td>
</tr>
...ect...ect
</table>
</table>
I need to grab the data in all the <a> </a> elements that sit in the second <td> of each <tr> </tr>, but only starting from the second <tr> of the table.
So far I came up with that:
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr/td[2]/a[1]")
But it seems to me that this doesn't express the fact that I want to start from the second <tr> (second <tr> included).
What would be the right code to do this?
You can use position() to select the later elements that you want.
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td[2]/a[1]")
As the comments on that SO answer say, remember that XPath counts from 1, so position()>1 skips only the first tr and keeps the second.
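For comparison, the same "skip the first row" logic can be sketched in Python with the standard library's xml.etree.ElementTree. Its XPath subset does not support position()>1, so the sketch slices the row list instead. The anchors and their href values are hypothetical, added only so the example has links to find.

```python
import xml.etree.ElementTree as ET

# Structure from the question, with hypothetical <a> elements in the second cell.
HTML = """<table class="L1">
<table>
<tr></tr>
<tr><td></td><td><a href="/a">first</a></td></tr>
<tr><td></td><td><a href="/b">second</a></td></tr>
</table>
</table>"""

root = ET.fromstring(HTML)   # the outer table with class "L1"
inner = root.find("table")   # the nested table
rows = inner.findall("tr")

links = []
# rows[1:] plays the role of tr[position()>1]: XPath is 1-based, Python is 0-based.
for row in rows[1:]:
    anchor = row.find("td[2]/a[1]")  # first <a> inside the second <td>
    if anchor is not None:
        links.append(anchor.text)

print(links)
```

The off-by-one is the thing to watch: position()>1 in XPath and the slice [1:] in Python both drop exactly one row, the first.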

Import data from HTML page using feeds importer in drupal

I'm trying to import some data from a HTML page with feeds importer. The context is this:
<table class="tabela">
<tr valign="TOP">
<td class="formulario-legenda">Nome:</td>
<td nowrap="nowrap">
<b>Raul Fernando de Almeida Moreira Vidal</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Sigla:</td>
<td>
<b>RMV</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Código:</td>
<td>206415</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Estado:</td>
<td>Ativo</td>
</tr>
</table>
<table>
<tr>
<td class="topo">
<table>
<tr>
<td class="formulario-legenda">Categoria:</td>
<td>Professor Associado</td>
</tr>
<tr>
<td class="formulario-legenda">Carreira:</td>
<td>Pessoal Docente de Universidades</td>
</tr>
<tr>
<td class="formulario-legenda">Grupo profissional:</td>
<td>Docente</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Departamento:</td>
<td>
<a href="uni_geral.unidade_view?pv_unidade=151"
title="Departamento de Engenharia Informática">Departamento de Engenharia Informática</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
I tried with this:
/html/body/div/div/div/div/div/div/div/table/tbody/tr/td/table/tbody/tr[1]/td[2]
but nothing appears. Can someone help me with the right syntax to obtain "Grupo Profissional"?
Quick answer that might work
Considering just the HTML sample you provided (which only has two tables) you can select the text you want using this expression, based on the table's position:
//table[2]//tr[3]/td[1]/text()
This will work in the HTML you pasted above. But it might not work in your actual scenario, since you might have other tables, the table you want to select has no ID and you didn't suggest some invariant text in your code which could be used to anchor the context for the expression. Assuming the initial part of your XPath expression (the div sequence) is correct, you might be able to use:
/html/body/div/div/div/div/div/div/div/table[2]//tr[3]/td[1]/text()
But it's quite a fragile expression and vulnerable to any changes in the document.
A (possibly) better solution
A better alternative is to look for some identifier you could use. I can only guess, since I don't know your code. In your sample code, I would guess that Código: and the number following it, 206415, might be some identifier. If it is, you could use it to anchor your context. First you select it:
//table[.//td[text()='Código:']/following-sibling::td='206415']
The expression above will select the table which contains a td with the exact text Código: followed by a td containing the exact text 206415. This will create a unique context (considering that the number is a unique identifier). From that context, you can now select the text you want, which is inside the next table (following-sibling::table[1]). This is the context of the second table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]
And this should select the text you want (Grupo profissional:) which is in the third row tr[3] and first cell/column td[1] of that table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]//tr[3]/td[1]/text()
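The anchoring idea can also be sketched procedurally. The snippet below is only an illustration in Python using the standard library's xml.etree.ElementTree, not something Feeds itself runs. It assumes the two tables are siblings under a common parent (a div is added here as a hypothetical wrapper) and uses a trimmed copy of the sample markup. has_pair and anchored_label are made-up helper names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample; the wrapping <div> is an assumption.
HTML = """<div>
<table class="tabela">
<tr><td class="formulario-legenda">Código:</td><td>206415</td></tr>
<tr><td class="formulario-legenda">Estado:</td><td>Ativo</td></tr>
</table>
<table>
<tr><td class="topo">
<table>
<tr><td class="formulario-legenda">Categoria:</td><td>Professor Associado</td></tr>
<tr><td class="formulario-legenda">Carreira:</td><td>Pessoal Docente de Universidades</td></tr>
<tr><td class="formulario-legenda">Grupo profissional:</td><td>Docente</td></tr>
</table>
</td></tr>
</table>
</div>"""


def has_pair(table, label, value):
    """True if some row of the table starts with the cells (label, value)."""
    for row in table.iter("tr"):
        cells = [(cell.text or "").strip() for cell in row.findall("td")]
        if cells[:2] == [label, value]:
            return True
    return False


def anchored_label(root, label, value):
    """Find the table holding (label, value), then read tr[3]/td[1] of the next table."""
    tables = root.findall("table")  # sibling tables directly under the wrapper
    for index, table in enumerate(tables[:-1]):
        if has_pair(table, label, value):
            cell = tables[index + 1].find(".//tr[3]/td[1]")
            return cell.text if cell is not None else None
    return None


root = ET.fromstring(HTML)
result = anchored_label(root, "Código:", "206415")
print(result)
```

The point of the design is the same as in the XPath version: the lookup depends on the Código value rather than on how many tables precede the one you want, so extra tables elsewhere on the page should not break it.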

Resources