Getting attributed html element

Getting attributed html element - ruby

I'm trying to get table with content of MMEL codes from this site and I'm trying to accomplish it with CSS Selectors.
What I've got so far is:
require_relative 'sources/Downloader'
require 'nokogiri'
html_content = Downloader.download_page('http://www.s-techent.com/ATA100.htm')
parsed_html = Nokogiri::HTML(html_content)
tmp = parsed_html.css("tr[*]")
puts tmp.text
And I'm getting error while trying to get this tr with attribute. How can I complete this task to get this table in simple form because I want to parse it to JSON. It would be nice go get this in sections and call it in.each block.
EDIT:
I'd be nic if I can get things in block like this (look into pages source)
<TR><TD WIDTH="10%" VALIGN="TOP" ROWSPAN=5>
<B><FONT FACE="Arial" SIZE=2><P ALIGN="CENTER">11</B></FONT></TD>
<TD WIDTH="40%" VALIGN="TOP" COLSPAN=2>
<B><FONT FACE="Arial" SIZE=2><P>PLACARDS AND MARKINGS</B></FONT></TD>
<TD WIDTH="50%" VALIGN="TOP">
<FONT FACE="Arial" SIZE=2><P ALIGN="LEFT">All procurable placards, labels, etc., shall be included in the illustrated Parts Catalog. They shall be illustrated, showing the part number, Legend and Location. The Maintenance Manual shall provide the approximate Location (i.e., FWD -UPPER -RH) and illustrate each placard, label, marking, self -illuminating sign, etc., required for safety information, maintenance significant information or by government regulations. Those required by government regulations shall be so identified.</FONT></TD>
</TR>

This should print all those TR's from source at line 96. There are three tables in that page and table[1] has all the text you needed:
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.s-techent.com/ATA100.htm'))
doc.css("table")[1].css("tr").each do |i|
puts i #=> prints the exact html between TR tags (including)
puts i.text #=> prints the text
end
For instance:
puts doc.css("table")[1].css("tr")[2]
prints the following:
<tr>
<td valign="TOP" colspan="3">
<b><font face="Arial" size="2"><p align="CENTER">GROUP DEFINITION - AIRCRAFT</p></font></b>
</td>
<td valign="TOP">
<font face="Arial" size="2"><p align="LEFT">The complete operational unit. Includes dimensions and
areas, lifting and shoring, leveling and weighing, towing and taxiing, parking and mooring, requi
red placards, servicing.</p></font>
</td>
</tr>

You could do the same using xpath also:
Below is the content from the first table of the webpage given in the post by OP:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri.HTML(open('http://www.s-techent.com/ATA100.htm'))
doc.xpath('(//table)[1]/tr').each do |tr|
puts tr.to_html(:encoding => 'utf-8')
end
Output:
<tr>
<td width="33%" valign="MIDDLE" colspan="2">
<p><img src="S-Tech-Logo-Blue2.gif" width="274" height="127"></p>
</td>
<td width="67%" valign="MIDDLE">
<b><i><font face="Arial" color="#0000ff">
<p align="CENTER"><big>AIRCRAFT PARTS MANUFACTURING ASSISTANCE (PMA)</big><br><big>DAR SERVICES</big></p></font></i></b>
</td>
</tr>
Now, if you want to collect the last table rows, then do:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri.HTML(open('http://www.s-techent.com/ATA100.htm'))
p doc.xpath('(//table)[3]/tr').to_a.size # => 1
doc.xpath('(//table)[3]/tr').each do |tr|
puts tr.to_html(:encoding => 'utf-8')
end
Output:
<tr>
<td width="40%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">149 AZALEA CIRCLE • LIMERICK, PA 19468-1330</font></b></p>
</td>
<td width="30%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">610-495-6898 (Office) • 484-680-0507 (Cell)</font></b></p>
</td>
<td width="110%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2">E-mail S-Tech</font></b></p>
</td>
</tr>

Related

Using Ruby / Nokogiri to parse randomized class names

I've been doing calculations by hand when it comes to the remaining percentage of the US Presidential election votes in various states. With so many updates and states – this is getting tiring. So why not automate the process?
Here's what I'm looking at:
The problem is that the class names have been randomized. For example, here's the one I'm interested in:
<td class="jsx-3768461732 votes votes-row">2,450,186</td>
Playing around in irb, I tried to use a wildcard on "votes votes-row", since this only appears when I need it in the doc:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("https://www.politico.com/2020-election/results/georgia/"))
votes = doc.css("[td*='votes-row']")
...which yields no results (=> [])
What am I doing wrong and how to fix? I'm ok with xpath – I just want to make sure changes made elsewhere in the doc don't affect finding these elements.

There's probably a better way but...
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("https://www.politico.com/2020-election/results/georgia/"))
votes = doc.css('tr[class*="candidate-row"]').map { |row| row.css('td').map { |cell| cell.content } }
biden_row = votes.find_index { |row| row[0] =~ /biden/i }
trump_row = votes.find_index { |row| row[0] =~ /trump/i }
biden_votes = votes[biden_row][1].split('%')[1]
trump_votes = votes[trump_row][1].split('%')[1]
Edit: from the HTML source the relevant table looks like:
<table class="jsx-1526769828 candidate-table">
<thead class="jsx-3554868417 table-head">
<tr class="jsx-3554868417">
<th class="table-header jsx-3554868417 candidate-name">
<h5 class="jsx-3554868417">Candidate</h5>
</th>
<th class="table-header jsx-3554868417 percent">
<h5 class="jsx-3554868417">Pct.</h5>
</th>
<th class="table-header jsx-3554868417 vote-bar"></th>
</tr>
</thead>
<tbody class="jsx-2085888330 table-head">
<tr class="jsx-2677388595 candidate-row">
<td class="jsx-3948343365 candidate-name name-row">
<div class="jsx-1912693590 name-only candidate-short-name">Biden</div>
<div class="jsx-3948343365 candidate-party-tag">
<div class="jsx-1420258095 party-label dem">dem</div>
</div>
<div class="jsx-3948343365 candidate-winner-check"></div>
</td>
<td class="jsx-3830922081 percent percent-row">
<div class="candidate-percent-only jsx-3830922081">49.4%</div>
<div class="candidate-votes-next-to-percent jsx-3830922081">2,450,193</div>
</td>
<td class="jsx-3458171655 vote-bar vote-bar-row">
<div style="width:49.4%" class="jsx-3458171655 bar dem"></div>
</td>
</tr>
<tr class="jsx-2677388595 candidate-row">
<td class="jsx-3948343365 candidate-name name-row">
<div class="jsx-1912693590 name-only candidate-short-name">Trump*</div>
<div class="jsx-3948343365 candidate-party-tag">
<div class="jsx-1420258095 party-label gop">gop</div>
</div>
<div class="jsx-3948343365 candidate-winner-check"></div>
</td>
<td class="jsx-3830922081 percent percent-row">
<div class="candidate-percent-only jsx-3830922081">49.4%</div>
<div class="candidate-votes-next-to-percent jsx-3830922081">2,448,635</div>
</td>
<td class="jsx-3458171655 vote-bar vote-bar-row">
<div style="width:49.4%" class="jsx-3458171655 bar gop"></div>
</td>
</tr>
</tbody>
</table>
So you could probably use the candidate-votes-next-to-percent to get this value. e.g.:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("https://www.politico.com/2020-election/results/georgia/"))
votes = doc.css('tr[class*="candidate-row"]').map do |row|
[
row.css('div[class*="candidate-short-name"]').first.content,
row.css('div[class*="candidate-votes-next-to-percent"]').first.content
]
end
# => [["Biden", "2,450,193"], ["Trump*", "2,448,635"]]

Import data from HTML page using feeds importer in drupal

I'm trying to import some data from a HTML page with feeds importer. The context is this:
<table class="tabela">
<tr valign="TOP">
<td class="formulario-legenda">Nome:</td>
<td nowrap="nowrap">
<b>Raul Fernando de Almeida Moreira Vidal</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Sigla:</td>
<td>
<b>RMV</b>
</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Código:</td>
<td>206415</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Estado:</td>
<td>Ativo</td>
</tr>
</table>
<table>
<tr>
<td class="topo">
<table>
<tr>
<td class="formulario-legenda">Categoria:</td>
<td>Professor Associado</td>
</tr>
<tr>
<td class="formulario-legenda">Carreira:</td>
<td>Pessoal Docente de Universidades</td>
</tr>
<tr>
<td class="formulario-legenda">Grupo profissional:</td>
<td>Docente</td>
</tr>
<tr valign="TOP">
<td class="formulario-legenda">Departamento:</td>
<td>
<a href="uni_geral.unidade_view?pv_unidade=151"
title="Departamento de Engenharia Informática">Departamento de Engenharia Informática</a>
</td>
</tr>
</table>
</td>
</tr>
</table>
I tried with this:
/html/body/div/div/div/div/div/div/div/table/tbody/tr/td/table/tbody/tr[1]/td[2]
but nothing appears. Can someone help me with the right syntax to obtain "Grupo Profissional"?

Quick answer that might work
Considering just the HTML sample you provided (which only has two tables) you can select the text you want using this expression, based on the table's position:
//table[2]//tr[3]/td[1]/text()
This will work in the HTML you pasted above. But it might not work in your actual scenario, since you might have other tables, the table you want to select has no ID and you didn't suggest some invariant text in your code which could be used to anchor the context for the expression. Assuming the initial part of your XPath expression (the div sequence) is correct, you might be able to use:
/html/body/div/div/div/div/div/div/div/table[2]//tr[3]/td[1]/text()
But it's wuite a fragile expression and vulnerable to any changes in the document.
A (possibly) better solution
A better alternative is to look for some identifier you could use. I can only guess, since I don't know your code. In your sample code, I would guess that Codigo and the number following it 206415 might be some identifier. If it is, you could use it to anchor your context. First you select it:
//table[.//td[text()='Código:']/following-sibling::td='206415']
The expression above will select the table which contains a td with the exact text Código: followed by a td containing the exact text 206415. This will create a unique context (considering that the number is an unique identifier). From that context, you can now select the text you want, which is inside the next table (following-sibling::table[1]). This is the context of the second table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]
And this should select the text you want (Grupo profissional:) which is in the third row tr[3] and first cell/column td[1] of that table:
//table[.//td[text()='Código:']/following-sibling::td='206415']/following-sibling::table[1]//tr[3]/td[1]/text()

How do I parse HTML using Nokogiri?

I'd like to parse an HTML file extracting relevant data to use in my research. Here's a piece of the HTML:
<td class="color_line1" valign="center"><a class="linkpadrao" href="javascript:Direciona('5453*SERRA#TALHADA');">Serra Talhada</a></td>
<td class="color_line" valign="center" align="center">9</td>
<td class="color_line" valign="center" align="center">2,973</td>
<td class="color_line" valign="center" align="center">0,016</td>
<td class="color_line" valign="center" align="center">2,939</td>
<td class="color_line" valign="center" align="center">3,000</td>
<td class="color_line" valign="center" align="center">0,572</td>
<td class="color_line" valign="center" align="center">2,401</td>
<td class="color_line" valign="center" align="center">0,024</td>
<td class="color_line" valign="center" align="center">2,378</td>
<td class="color_line" valign="center" align="center">2,426</td>
</tr>
Being more specific, I'd like to get the "Serra Talhada" (as a city name), and also all of the numbers below the city name (it's the max, min and average price of gas).
I tried this so far:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
url = "http://www.anp.gov.br/preco/prc/Resumo_Por_Estado_Municipio.asp"
agent = Mechanize.new
parameters = {'selSemana' => '737*De+28%2F07%2F2013+a+03%2F08%2F2013',
'desc_semana' => 'de+28%2F07%2F2013+a+03%2F08%2F2013',
'cod_Semana' => '737',
'tipo' => '1',
'Cod_Combustivel' => 'undefined',
'selEstado' => 'PE*PERNAMBUCO',
'selCombustivel' => '487*Gasolina',
}
municipio = []
page = agent.post(url, parameters)
extrair = page.parser
extrair.css('.linkpadrao').each do |posto|
# Municipios
municipio << posto.text
end
I can't figure out how to get the numbers as they have the same HTML structure.
Any thoughts?!

Since you need to find the cells with respect to the city link, you should find their common ancestor - in this case their tr.
Using xpath, you can locate a specific cell by its text:
# This is the table that contains all of the city data
data_table = extrair.at_css('.table_padrao')
# This is the specific row that contains the specified city
row = data_table.xpath('//tr[td/a[#class="linkpadrao" and text()="Serra Talhada"]]')
# This is the data in the specific row
data = row.css(".color_line").map{|e| e.text }
#=> ["9", "2,973", "0,016", "2,939", "3,000", "0,572", "2,401", "0,024", "2,378", "2,426"]

You can get the numbers following each posto with:
posto.parent.search('~ td').map &:text

xpath selecting text from link in <td> & text from <td>

I have the following code which works very well:
rows = diary_HTML.xpath('//*[#id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
detail = {}
[
["Food", 'td[1]/text()'],
["Calories", 'td[2]/text()'],
["Carbs", 'td[3]/text()'],
["Fat", 'td[4]/text()'],
["Protein", 'td[5]/text()'],
["Cholest", 'td[6]/text()'],
].each do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
However the "Food" td does not only include text, but also a link from which I want to get the text.
I know I can use 'td[1]/a/text()'to get the link text, but how do I do both?
'td[1]/a/text()' or 'td[1]/text()'
EDITED - Added Snippet.
I am trying to include the <tr class="meal_header">
<td class="first alt">Breakfast</td> on the first row, all lines with other regular tds on other rows whilst excluding td1 on the bottom row.
<tr class="meal_header">
<td class="first alt">Breakfast</td>
<td class="alt">Calories</td>
<td class="alt">Carbs</td>
<td class="alt">Fat</td>
<td class="alt">Protein</td>
<td class="alt">Sodium</td>
<td class="alt">Sugar</td>
</tr>
<tr>
<td class="first alt">
<a onclick="showEditFood(3992385560);" href="#">Hovis (Uk - White Bread (40g) Toasted With Flora Light Marg, 2 slice</a> </td>
<td>262</td>
<td>36</td>
<td>9</td>
<td>7</td>
<td>0</td>
<td>3</td>
</tr>
<tr class="bottom">
<td class="first alt" style="z-index: 10">
Add Food
<div class="quick_tools">
Quick Tools
<div id="quick_tools_0" class="quick_tools_options hidden">
<ul>
<li><a onclick="showLightbox(200, 250, '/food/quick_add?meal=0&date=2013-04-15'); return false;">Quick add calories</a></li>
<li>Remember meal</li>
<li>Copy yesterday</li>
<li>Copy from date</li>
<li>Copy to date</li>
</ul>
</div>
<div id="recent_meals_0" class="recent_meal_options hidden">
<ul id="recent_meal_options_0">
<li class="header">Copy from which date?</li>
<li>Sunday, April 14</li>
<li>Saturday, April 13</li>
</ul>
</div>
</div>
</td>
<td>285</td>
<td>39</td>
<td>9</td>
<td>10</td>
<td>0</td>
<td>3</td>
<td></td>

The short answer is: use Nokogiri::XML::Element#text, it will give the text of the element plus subelements (your a for example).
You can also clean that code up quite a bit:
keys = ["Food", "Calories", "Carbs", "Fat", "Protein", "Cholest"]
food_diary = rows.collect do |row|
Hash[keys.zip row.search('td').map(&:text)]
end
And as a final tip, avoid using xpath with html, css is so much nicer.

I think you can achieve this by altering the logic to look at element content when you don't have an explicit text() extraction in the xpath
rows = diary_HTML.xpath('//*[#id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
detail = {}
[
["Food", 'td[1]'],
["Calories", 'td[2]/text()'],
["Carbs", 'td[3]/text()'],
["Fat", 'td[4]/text()'],
["Protein", 'td[5]/text()'],
["Cholest", 'td[6]/text()'],
].each do |name, xpath|
if xpath.include?('/text()')
detail[name] = row.at_xpath(xpath).to_s.strip
else
detail[name] = row.at_xpath(xpath).content.strip
end
end
detail
end
You could also add e.g. a symbol to the array, to describe how you were extracting the data, and have a case block which handled items depending on what the last stage was to do following the xpath
Note you could also do what you want by walking the node structure returned by xpath recursively, but that seems like overkill if you just want to ignore markup, links etc.

Not able to click on the nested anchor element using xpath of selenium webdriver

Please find the Html below :
<table class="data" id="filteredTable" cellpadding="0" cellspacing="1">
<tbody>
<tr class="rowLight">
<td class="lt4"><input name="ids" value="att1" type="checkbox"></td>
<td><a hfref= " link1" > foo </a>
</td>
<td>item1</td>
<td>item2</td>
<td>item3</td>
</tr>
<tr class="rowDark">
<td class="lt4"><input name="ids" value="att2" type="checkbox"></td>
<td><a hfref= " link2" > boo </a>
</td>
<td>item1</td>
<td>item2</td>
<td>item3</td>
</tr>
<tr class="rowLight">
<td class="lt4"><input name="ids" value="att3" type="checkbox"></td>
<td><a hfref= " link3" > bar </a>
</td>
<td>item1</td>
<td>item2</td>
<td>item3</td>
</tr>
Now I need to click on the link of bar. But my below Xpath not helping me to get into the bar as well. So any help how to be done the same.
I didn't give the html for the part //form[contains(#name,'filterset_FilterSetListForm')]/table[contains(#class,'contentBody')]/tbody/tr/td/table[contains(#class,'content')]/tbody/tr/td/. <~~ till this I am correct. Confusion start after <tr> from here /table[contains(#id,'filteredTable')]/tbody/tr
Part-II:
When there will be a match say bar , can their associated check box (s) be clicked?
Any help in this regard?
I am using selenium -web driver with Ruby 1.9.3 .

You can get the a element this way
/table[contains(#id,'filteredTable')]/tbody/tr/td/a[contains(text(),'bar')]
or if you want an exact match to the link text
/table[contains(#id,'filteredTable')]/tbody/tr/td/a[text()=' bar ']

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Getting attributed html element - ruby

Related

Using Ruby / Nokogiri to parse randomized class names

Import data from HTML page using feeds importer in drupal

How do I parse HTML using Nokogiri?

xpath selecting text from link in <td> & text from <td>

Not able to click on the nested anchor element using xpath of selenium webdriver

Categories

Resources