Creating pipe tables with pandoc - pandoc

Is it possible to convert an HTML table into a pipe table with pandoc (or any other tool)?
I tried pandoc bla.html --to markdown+pipe_tables and pandoc bla.html --to markdown+pipe_tables-simple_tables, but both seem to produce simple tables.
bla.html contains:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>age</th>
<th>workclass</th>
<th>education</th>
<th>gender</th>
<th>hours-per-week</th>
<th>occupation</th>
<th>income</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>39</td>
<td>State-gov</td>
<td>Bachelors</td>
<td>Male</td>
<td>40</td>
<td>Adm-clerical</td>
<td><=50K</td>
</tr>
</tbody>
</table>
If I use -t markdown_github, as suggested in another answer, the output is HTML again.

I realized that -t markdown_github does produce the right result once I put some content into the first <th> cell. The empty header cell seems to trip up pandoc.
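Since the question also allows "any other tool", here is a minimal Python sketch that builds a pipe table directly from the HTML using only the standard library. It is not how pandoc does it; the names TableCollector, html_to_pipe_table and SAMPLE_HTML are made up for illustration, the sample is a shortened version of bla.html, and the code assumes one flat table whose first row is the header and whose cells contain no | characters.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. newlines between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_to_pipe_table(html):
    """Return a pipe table; the first row is treated as the header."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("---" for _ in header) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Shortened stand-in for bla.html, including the empty first header cell.
SAMPLE_HTML = """
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"><th></th><th>age</th><th>workclass</th><th>income</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>39</td><td>State-gov</td><td>&lt;=50K</td></tr>
</tbody>
</table>
"""

print(html_to_pipe_table(SAMPLE_HTML))
```

The empty header cell simply becomes an empty column title here, so it does not need special handling.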

Related

XPath find text according last word in the string

I need to find the whole text of a cell based on the last word in its string. I have something like this:
<table>
<tr>
<td style='white-space:nowrap;'>
<a href=''>test</a>
</td>
<td>any text</td>
<td>text text texttofind</td>
<td>Not Available</td>
<td class='aui-lozenge aui-lozenge-default'>text</td>
</tr>
<tr>
<td style='white-space:nowrap;'>
<a href=''>test</a>
</td>
<td>any text</td>
<td>text text texttofind2</td>
<td>Not Available</td>
<td class='aui-lozenge aui-lozenge-default'>text</td>
</tr>
<tr>
<td style='white-space:nowrap;'>
<a href=''>test</a>
</td>
<td>any text</td>
<td>text text texttofind3</td>
<td>Not Available</td>
<td class='aui-lozenge aui-lozenge-default'>text</td>
</tr>
</table>
I need to find the whole text value based on the last word, texttofind:
<td>text text texttofind</td>
I can't use contains, because it would match multiple values. I need something like ends-with, but I am using XPath 1.0.
I tried something like this, but it is not working and I am not sure what is wrong:
//tr[substring(., string-length(@td)
- string-length('texttofind') + 1) = 'texttofind']
Or would it be better to use matches?
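You're almost there. The length has to be taken from the node being tested (.), and the test should be applied to each td rather than to the whole tr. Try changing your XPath expression to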
//tr//td[substring(., string-length(.)
- string-length('texttofind') + 1) = 'texttofind']
and see if it works.
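To see why the substring arithmetic behaves like ends-with, here is a small Python sketch that mimics it. Python's xml.etree.ElementTree does not implement XPath functions such as substring, so the helper xpath_substring below is a hand-written stand-in (an assumption for illustration), and ROWS is a trimmed copy of the table from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from the question.
ROWS = """<table>
<tr><td>any text</td><td>text text texttofind</td></tr>
<tr><td>any text</td><td>text text texttofind2</td></tr>
<tr><td>any text</td><td>text text texttofind3</td></tr>
</table>"""


def xpath_substring(value, start):
    """Mimic XPath 1.0 substring(value, start) for an integer, 1-based start."""
    return value[max(start, 1) - 1:]


def ends_with(value, suffix):
    # Same arithmetic as the predicate:
    # substring(., string-length(.) - string-length(suffix) + 1) = suffix
    start = len(value) - len(suffix) + 1
    return xpath_substring(value, start) == suffix


table = ET.fromstring(ROWS)
matches = [
    "".join(td.itertext())          # string value of the cell, like "." in XPath
    for td in table.iter("td")
    if ends_with("".join(td.itertext()), "texttofind")
]
print(matches)
```

Only the cell whose text ends exactly in texttofind passes; texttofind2 and texttofind3 fail because their last ten characters differ.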

How to turn a table into a single block of text with scrapy

I am trying to scrape a table which looks like the below.
<table class="table">
<caption>Caption</caption>
<tbody>
<tr>
<th scope="row">Title</th>
<td>Detail</td>
</tr>
<tr>
<th scope="row">Title 2</th>
<td>Detail 2</td>
</tr>
</tbody>
</table>
How would you set up Scrapy so that my output file looks similar to the below?
Title: Detail
Title2: Detail2
Currently I can get all the text using two CSS selectors (one for the td's and one for the th's), but I would love to be able to combine these!
Unfortunately, the number of rows differs from page to page.
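Using XPath:
tabledata = {}
# Walk the table row by row, so any number of rows works.
for row in response.xpath("//table[@class='table']//tr"):
    # The th text is the key, the td text of the same row is the value.
    tabledata[row.xpath("th/text()").extract_first()] = row.xpath("td/text()").extract_first()
Output:
{"Title": "Detail", "Title 2": "Detail 2"}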
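To get from that dictionary to the "Title: Detail" block the question asks for, the pairs only need to be joined line by line. The sketch below uses xml.etree.ElementTree on a copy of the sample table as a stand-in for the Scrapy response, so that it runs without Scrapy installed; inside a spider the same join would apply to the tabledata built from response.xpath. The names TABLE and block are illustrative.

```python
import xml.etree.ElementTree as ET

# Copy of the sample table, standing in for the page a spider would receive.
TABLE = """<table class="table">
<caption>Caption</caption>
<tbody>
<tr><th scope="row">Title</th><td>Detail</td></tr>
<tr><th scope="row">Title 2</th><td>Detail 2</td></tr>
</tbody>
</table>"""

table = ET.fromstring(TABLE)

# Pair each row's <th> with its <td>, however many rows there are.
tabledata = {}
for row in table.iter("tr"):
    tabledata[row.findtext("th")] = row.findtext("td")

# One "Title: Detail" line per row.
block = "\n".join(f"{title}: {detail}" for title, detail in tabledata.items())
print(block)
```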

Import data from HTML page using feeds importer in drupal

I'm trying to import some data from an HTML page with Feeds importer. The context is this:
<table class="tabela">
<tr valign="TOP">
<td class="formulario-legenda">Nome:</td>
<td nowrap="nowrap">
<b>Raul Fernando de Almeida Moreira Vidal</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Sigla:</td>
<td>
<b>RMV</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Código:</td>
<td>206415</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Estado:</td>
<td>Ativo</td>
</tr>
</table>
<table>
<tr>
<td class="topo">
<table>
<tr>
<td class="formulario-legenda">Categoria:</td>
<td>Professor Associado</td>
</tr>
<tr>
<td class="formulario-legenda">Carreira:</td>
<td>Pessoal Docente de Universidades</td>
</tr>
<tr>
<td class="formulario-legenda">Grupo profissional:</td>
<td>Docente</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Departamento:</td>
<td>
<a href="uni_geral.unidade_view?pv_unidade=151"
title="Departamento de Engenharia Informática">Departamento de Engenharia Informática</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
I tried with this:
/html/body/div/div/div/div/div/div/div/table/tbody/tr/td/table/tbody/tr[1]/td[2]
but nothing appears. Can someone help me with the right syntax to obtain "Grupo profissional"?
Quick answer that might work
Considering just the HTML sample you provided (which only has two tables) you can select the text you want using this expression, based on the table's position:
//table[2]//tr[3]/td[1]/text()
This works for the HTML you pasted above, but it might not work on the real page: there may be other tables, the table you want has no ID, and you didn't point to any invariant text that could anchor the expression. Assuming the initial part of your XPath expression (the div sequence) is correct, you might be able to use:
/html/body/div/div/div/div/div/div/div/table[2]//tr[3]/td[1]/text()
But it's quite a fragile expression, vulnerable to any change in the document.
A (possibly) better solution
A better alternative is to look for some identifier you can rely on. I can only guess, since I don't know your page. In your sample, Código and the number following it, 206415, look like an identifier. If they are, you can use them to anchor your context. First you select the table that contains them:
//table[.//td[text()='Código:']/following-sibling::td='206415']
The expression above selects the table which contains a td with the exact text Código: followed by a td containing the exact text 206415. This creates a unique context (assuming the number is a unique identifier). From that context, you can now select the text you want, which is inside the next table (following-sibling::table[1]). This is the context of the second table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]
And this should select the text you want (Grupo profissional:) which is in the third row tr[3] and first cell/column td[1] of that table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]//tr[3]/td[1]/text()
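The position-based expression can be checked quickly in Python. This is only a sketch: xml.etree.ElementTree supports a limited XPath subset (relative paths, positions, simple predicates) and has no following-sibling axis, so the Código-anchored expression cannot be run with it. PAGE is a shortened copy of the sample wrapped in a body element, and the last lookup, which anchors on the label text instead of the row position, is an extra assumption that is not part of the answer above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample page, wrapped in <body> to give it one root.
PAGE = """<body>
<table class="tabela">
<tr><td class="formulario-legenda">Código:</td><td>206415</td></tr>
</table>
<table>
<tr>
<td class="topo">
<table>
<tr><td class="formulario-legenda">Categoria:</td><td>Professor Associado</td></tr>
<tr><td class="formulario-legenda">Carreira:</td><td>Pessoal Docente de Universidades</td></tr>
<tr><td class="formulario-legenda">Grupo profissional:</td><td>Docente</td></tr>
</table>
</td>
</tr>
</table>
</body>"""

body = ET.fromstring(PAGE)

# Same shape as //table[2]//tr[3]/td[1], written as a relative path.
label = body.findtext(".//table[2]//tr[3]/td[1]")
# The neighbouring cell holds the value that belongs to the label.
value = body.findtext(".//table[2]//tr[3]/td[2]")
# Alternative anchored on the label text rather than on the row position.
value_by_label = body.findtext(".//tr[td='Grupo profissional:']/td[2]")

print(label, value, value_by_label)
```

Anchoring on the label keeps working if rows are added or reordered, which the purely positional path does not.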

About Selenium IDE

How can I record an autosuggest box in an application with Selenium IDE?
You may have some success using a combination of mouseDownAt, mouseUpAt and waitForElementPresent.
e.g.
mouseDownAt,id=autoSuggestInput,5,5
mouseUpAt,id=autoSuggestInput,5,5
waitForElementPresent,css=.autoSuggestLink
Can you provide a link to the demo page of the autosuggest script you're using?
I have created a demo script using "http://jquery.bassistance.de/" as the Base URL. Please check the following script:
<tr>
<td>open</td>
<td>/autocomplete/demo/</td>
<td></td>
</tr>
<tr>
<td>waitForText</td>
<td>id=suggest1</td>
<td></td>
</tr>
<tr>
<td>typeKeys</td>
<td>id=suggest1</td>
<td>Ad</td>
</tr>
<tr>
<td>waitForText</td>
<td>css=li.ac_even.ac_over</td>
<td>Ada</td>
</tr>
<tr>
<td>assertText</td>
<td>xpath=/x:html/x:body/x:div[2]/x:ul/x:li[1]</td>
<td>Ada</td>
</tr>
<tr>
<td>assertText</td>
<td>xpath=/x:html/x:body/x:div[2]/x:ul/x:li[2]</td>
<td>Adamsville</td>
</tr>

Cucumber + Selenium - How to count the number of rows in a table?

Does anyone know a quick way to count the number of entries in a table using Ruby, Cucumber & Selenium?
The table is fairly basic, I want to count the number of rows:
<table id="product_container">
<tr>
<th>Product Name</th>
<th>Qty In Stock</th>
</tr>
<tr>
<td>...</td>
<td>...</td>
</tr>
</table>
You can use:
page.should have_css "#product_container tr", :count => number_of_rows.to_i
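The following step definition should work with Capybara.
Then /^I should have (\d+) table rows$/ do |number_of_rows|
  # Count every tr inside the table, header row included.
  actual_number = page.all('#product_container tr').size
  # The captured group arrives as a string, so convert it before comparing.
  actual_number.should == number_of_rows.to_i
end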
Usage:
Then I should have 10 table rows
See the Capybara documentation for page.all.
I always use getXpathCount() (a Selenium method) in such situations and it works fine :)
In PHP:
$rowsCount = $this->getXpathCount("//table[@id='product_container']/tr");
And if you don't want to count header rows, you should change the table to:
<table id="product_container">
<thead>
<tr>
<th>Product Name</th>
<th>Qty In Stock</th>
</tr>
</thead>
<tbody>
<tr>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>
Then you can get the products count:
$rowsCount = $this->getXpathCount("//table[@id='product_container']/tbody/tr");
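The difference between counting all rows and counting only the tbody rows can be seen in a short Python sketch. It uses xml.etree.ElementTree on a copy of the restructured table rather than Selenium, so it only illustrates the two XPath shapes; the second data row and the names TABLE, all_rows and data_rows are added for illustration.

```python
import xml.etree.ElementTree as ET

# Copy of the restructured table, with a second data row added for illustration.
TABLE = """<table id="product_container">
<thead>
<tr><th>Product Name</th><th>Qty In Stock</th></tr>
</thead>
<tbody>
<tr><td>...</td><td>...</td></tr>
<tr><td>...</td><td>...</td></tr>
</tbody>
</table>"""

root = ET.fromstring("<body>" + TABLE + "</body>")

# Every tr below the table, header row included.
all_rows = root.findall(".//table[@id='product_container']//tr")
# Only the tr elements directly inside tbody, i.e. the product rows.
data_rows = root.findall(".//table[@id='product_container']/tbody/tr")

print(len(all_rows), len(data_rows))
```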
