I want to scrape data from a table with Ruby and Nokogiri.
There are a lot of <td> elements, but I only need the country which is just text after a <br> element. The problem is, the <td> elements differ. Sometimes there is more than just the country.
For example:
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
I want to address the element before the closing </td> tag because the country is always the last element.
How can I do that?
I'd use this:
require 'awesome_print'
require 'nokogiri'
html = '
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
'
doc = Nokogiri::HTML(html)
ap doc.search('td').map{ |td| td.search('text()').last.text }
[
[0] "USA",
[1] "UK",
[2] "Switzerland"
]
The problem is that the real HTML you're parsing won't be bare rows of <td> tags. Instead, they'll be nested inside <tr> tags, and maybe even spread across different <table> tags, so you'll have to locate the ones you want to parse. Because your HTML sample doesn't show the true structure of the document, I can't help you more.
There are a bunch of different solutions. Another one, using only the standard library, is to strip out the parts you don't want with plain string operations.
node_string = <<-STRING
<td>Title1<br>USA</td>
<td>Title2<br>Michael Powell<br>UK</td>
<td>Title3<br>Leopold Lindtberg<br>Ralph Meeker<br>Switzerland</td>
STRING
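node_string.split("<td>").collect do |str|
  last_str = str.split("<br>").last
  # Remove the closing tag and newlines. A character class such as [\n,\<\/td\>]
  # would also delete every "t" and "d" from the country name.
  last_str.gsub(%r{</td>|\n}, '') unless last_str.nil?
end.compact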
I'm working on a Ruby script that uses Nokogiri and CSS selectors. I'm trying to scrape some data from HTML that looks like this:
<h2>Title 1</h2>
(Part 1)
<h2>Title 2</h2>
(Part 2)
<h2>Title 3</h2>
(Part 3)
Is there a way to select from Part 2 only by specifying the text of the h2 elements that represent the start and end points?
The data of interest in Part 2 is a table with tr and td elements that don't have any class or id identifiers. The other parts also have tables I'm not interested in. Something like
page.css('table tr td')
on the entire page would select from all of those other tables in addition to the one I'm after, and I'd like to avoid that if at all possible.
I'd probably use this as a first attempt:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<h2>Title 1</h2>
(Part 1)
<h2>Title 2</h2>
<table>
<tr><td>(Part 2)</td></tr>
</table>
<h2>Title 3</h2>
(Part 3)
EOT
doc.css('h2')[1].next_element
.to_html # => "<table>\n <tr><td>(Part 2)</td></tr>\n </table>"
Alternately, rather than use css('h2')[1], I could pass some of the task to the CSS selector:
doc.at('h2:nth-of-type(2)').next_element
.to_html # => "<table>\n <tr><td>(Part 2)</td></tr>\n </table>"
Once you have the table then it's easy to grab data from it. There are lots of examples how to do it out there.
According to "Is there a CSS selector for elements containing certain text?", I'm afraid there is no CSS selector that works on element text. How about first extracting "(Part 2)", and then using Nokogiri to select the table elements inside it?
text = "" # your string, or content from a file
# /m lets . match newlines; the non-greedy (.+?) stops at the first <h2> after "Title 2"
part2 = text.scan(/<h2>Title 2<\/h2>\s+(.+?)<h2>/m).first.first
doc = Nokogiri::HTML(part2)
# continue select table elements from doc
(Part 2) must not contain any h2 tag; otherwise the regex needs to be different.
If you know that the tables will be static and the data you require will always be in the second table, you can do something like:
page.css('table')[1].css('tr')[3].css('td')
This will get us the second table on the page, access the 4th row of that table and get us all the values of that row.
I haven't tested this, but this would be the way I would do it if the table I require doesn't have a class or identifier.
next_element is the trick used to find the node following the current one. There are many "next" and "previous" methods so read up on them as they're very useful for this sort of situation.
Finally, to_html is used above to show us what Nokogiri returned in a more friendly output. You wouldn't use it unless it was necessary to output HTML.
Hi, I am scraping a webpage with Mechanize and Nokogiri. I am selecting a series of links <a></a>:
html_body = Nokogiri::HTML(body)
links = html_body.css('.L1').xpath("//table/tbody/tr/td[2]/a[1]")
Then I need to check if the content of each link (<a>content</a>, not the href) matches some stuff in my db. I am doing this:
links.each do |link|
  if link.text == @tournament.homologation_number
If my condition is met, I need to select the <td></td> that is right before the <td> of the link I checked and click on the link that's in it.
<td></td>
<td>content I check with my condition</td>
How can I achieve this using Mechanize and Nokogiri?
I would iterate over the first tds, because it's easier to get at following elements than previous ones (with CSS, anyway):
page.search('td[1]').each do |td|
if td.at('+ td a').text == 'foo'
page2 = agent.get td.at('a')[:href]
end
end
First of all you have to select all the <td></td> elements. The XPath //table/tbody/tr/td[2]/a[1] selects only the first <a></a> in the second cell of each row, so you could try something like //table/tbody/tr/td instead, but this depends on the situation.
Once you have your array of <td></td> you can access their links like this:
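tds.each do |td|
  link = td.at('a') # first link inside this cell, or nil if there is none
  # condition_is_matched and goto_url are placeholders for your own code
  next unless link && condition_is_matched(link.text) # check the link text, not the href
  previous_td = td.previous_element # plain previous may return a whitespace text node
  next unless previous_td # the first cell in a row has nothing before it
  previous_url = previous_td.at('a')['href']
  goto_url previous_url
end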
I have trouble parsing a HTML table using Nokogiri and Ruby. My HTML table structure looks like this
<table>
<tbody>
<tr>
<td>Firstname</td>
<td>Lastname</td>
<td>Middle</td>
</tr>
<tr>
<td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
....
....
.... {more tr's and td's with similar data exists.}
....
....
....
....
....
</tbody>
</table>
In the above HTML table I would like to entirely remove the first <tr> and its corresponding <td> elements, so remove Firstname, Lastname and Middle; i.e., I want to start extracting the text only from the second <tr>. This way I get only the contents of the table from the second <tr> (tr[2]) onward and no column headers.
Can someone please provide code showing how to do this?
Thanks.
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(x)
rows = doc.xpath('//table/tbody/tr[position() > 1]')
# OR
rows = doc.xpath("//table/tbody/tr")
header = rows.shift
After you've run either one of the above 2 snippets, rows will contain every <tr>...</tr> after the first one. For example puts rows.to_xml prints the following:
<tr><td>ding</td>
<td>dong</td>
<td>ling</td>
</tr>
To get the inner text, removing all the html tags, run puts rows.text
ding
dong
ling
To get the inner text of the td tags only, run rows.xpath('td').map {|td| td.text }
["ding", "dong", "ling"]
Alternatively:
table.css('tr')[1..-1]
or to strip out the text starting at row 2:
table.css('tr')[1..-1].map{|tr| tr.css('td').map &:text}
Since Nokogiri supports the :has CSS pseudo-class, you can get the heading row with
@doc.at_css('table#table_id').css('tr:has(th)')
and since it supports the :not CSS pseudo-class as well, you can get the other rows with
@doc.at_css('table#table_id').css('tr:not(:has(th))')
Depending on your preferences, you might prefer to avoid the negation and just use css('tr:has(td)').
I am trying to work out a method to check the content of an HTML table with Watir-webdriver. Basically I want to validate the table contents against a saved valid table (CSV file) and confirm they are still the same after a refresh or redraw action.
Ideas I've come up with so far are to:
Grab the table HTML and compare that as a string with the baseline value.
Iterate through each cell and compare the HTML or text content.
Generate a 2D array representation on the table contents and do an array compare.
What would be the fastest/best approach? Do you have insights on how you handled a similar problem?
Here is an example of the table:
<table id="attr-table">
<thead>
<tr><th id="attr-action-col"><input type="checkbox" id="attr-action-col_box" class="attr-action-box" value=""></th><th id="attr-scope-col"></th><th id="attr-workflow-col">Status</th><th id="attr-type-col"></th><th id="attr-name-col">Name<span class="ui-icon ui-icon-triangle-1-n"></span></th><th id="attr-value-col">Francais Value</th></tr></thead>
<tbody>
<tr id="attr-row-209"><td id="attr_action_209" class="attr-action-col"><input type="checkbox" id="attr_action_209_box" class="attr-action-box" value=""></td><td id="attr_scope_209" class="attr-scope-col"><img src="images/attrib_bullet_global.png" title="global"></td><td id="attr_workflow_209" class="attr-workflow-col"></td><td id="attr_type_209" class="attr-type-col"><img src="images/attrib_text.png"></td><td id="attr_name_209" class="attr-name-col">Name of: Catalogue</td><td id="attr_value_209" class="attr-value-col"><p class="acms ws-editable-content lang_10">2010 EI-176</p></td></tr>
<tr id="attr-row-316"><td id="attr_action_316" class="attr-action-col"><input type="checkbox" id="attr_action_316_box" class="attr-action-box" value=""></td><td id="attr_scope_316" class="attr-scope-col"><img src="images/attrib_bullet_global.png" title="global"></td><td id="attr_workflow_316" class="attr-workflow-col"></td><td id="attr_type_316" class="attr-type-col"><img src="images/attrib_text.png"></td><td id="attr_name_316" class="attr-name-col">_[Key] Media key</td><td id="attr_value_316" class="attr-value-col"><p class="acms ws-editable-content lang_10"><span class="acms acms-choice" contenteditable="false" id="568">163</span></p></td></tr>
<tr id="attr-row-392"><td id="attr_action_392" class="attr-action-col"><input type="checkbox" id="attr_action_392_box" class="attr-action-box" value=""></td><td id="attr_scope_392" class="attr-scope-col"><img src="images/attrib_bullet_global.png" title="global"></td><td id="attr_workflow_392" class="attr-workflow-col"></td><td id="attr_type_392" class="attr-type-col"><img src="images/attrib_numeric.png"></td><td id="attr_name_392" class="attr-name-col">_[Key] Numéro d'ordre</td><td id="attr_value_392" class="attr-value-col"><p class="acms ws-editable-content lang_10">2</p></td></tr>
</tbody>
</table>
Just one idea I came up with. I used Hash and Class object instead of 2D array.
foo.csv
209,global,text,Catalogue,2010 EI-176
392,global,numeric,Numéro d'ordre,2
require 'csv'
expected_datas = CSV.readlines('foo.csv').map do |row|
{
:id => row[0],
:scope => row[1],
:type => row[2],
:name => row[3],
:value => row[4]
}
end
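# Note: Ruby 3.2+ ships a core Data class, so pick another name there
# (for example a hypothetical AttrRow) and update the references below.
class Data
  attr_reader :id, :scope, :type, :name, :value

  def initialize(tr)
    # Assign instance variables; plain locals would leave every reader nil.
    @id    = tr.id.slice(/attr-row-([0-9]+)/, 1)
    @scope = tr.td(:id, /scope/).img.src.slice(/attrib_bullet_(.+?)\.png/, 1)
    @type  = tr.td(:id, /type/).img.src.slice(/attrib_(.+?)\.png/, 1)
    @name  = tr.td(:id, /name/).text
    @value = tr.td(:id, /value/).text
  end
end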
browser = Watir::Browser.new
browser.goto 'foobar'
datas = browser.table(:id,'attr-table').tbody.trs.map{|tr| Data.new(tr)}
datas.zip(expected_datas).each do |data,expected_data|
Data.instance_methods(false).each do |method|
data.send(method).should == expected_data[method.to_sym]
end
end
# something action (refresh or redraw action)
browser.refresh
after_datas = browser.table(:id,'attr-table').tbody.trs.map{|tr| Data.new(tr)}
datas.zip(after_datas).each do |data,after_data|
Data.instance_methods(false).each do |method|
data.send(method).should == after_data.send(method)
end
end
What level of detail do you want the mismatch(es) reported with? I think that might well define the approach you want to take.
For example if you just want to know if there's a mismatch, and don't care where, then comparing arrays might be easiest.
If the order of the rows could vary, then I think comparing Hashes might be best.
If you want each mismatch reported individually, then iterating by row and column would allow you to report discrete errors, especially if you build a list of differences and then do your assert at the very end, based on the number of differences found.
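The last option could be sketched roughly as follows, using only the standard library. The baseline values and the actual array are illustrative stand-ins for the saved CSV file and for whatever is scraped from the page.

```ruby
require 'csv'

# Baseline as it might be stored in the CSV file (values are illustrative).
baseline = CSV.parse("209,global,text\n392,global,numeric\n")

# Stand-in for the 2D array built from the table after the refresh.
actual = [%w[209 global text], %w[392 local numeric]]

differences = []
differences << "row count: expected #{baseline.size}, got #{actual.size}" unless baseline.size == actual.size

baseline.each_with_index do |expected_row, r|
  expected_row.each_with_index do |expected, c|
    got = actual.dig(r, c) # nil when the row or cell is missing
    next if got == expected
    differences << "row #{r}, col #{c}: expected #{expected.inspect}, got #{got.inspect}"
  end
end

# Report everything at once; a single assertion on differences.empty? can follow.
puts differences
```

Collecting the messages first means one test run shows every mismatch instead of stopping at the first.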
You could go for an exact match
before_htmltable == after_htmltable
Or you could normalize whitespace first
before_htmltable.gsub(/\s+/, ' ') == after_htmltable.gsub(/\s+/, ' ')
I would think that creating the array then comparing each element would be more expensive.
Dave
I'm stuck not being able to parse irregularly embedded html tags. Is there a way to remove all html tags from a node and retain all text?
I'm using the code:
rows = doc.search('//table[@id="table_1"]/tbody/tr')
details = rows.collect do |row|
detail = {}
[
[:word, 'td[1]/text()'],
[:meaning, 'td[6]/font'],
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
Using Xpath:
[:meaning, 'td[6]/font']
generates
:meaning: ! '<font size="3">asking for information specifying <font
color="#CC0000" size="3">what is your name?</font> /what/ as in, <font color="#CC0000" size="3">I'm not sure what you mean</font>
/what/ as in <a style="text-decoration: none;" href="http://somesecretlink.com">what</a></font>
On the other hand, using Xpath:
'td/font/text()'
generates
:meaning: asking for information specifying
thus ignoring all children of the node. What I want to achieve is this
:meaning: asking for information specifying what is your name? /what/ as in, I'm not sure what you mean /what/ as in what? I can't hear you
This depends on what you need to extract. If you want all text in font elements, you can do it with the following xpath:
'td/font//text()'
It extracts all text nodes in font tags. If you want all text nodes in the cell, then:
'td//text()'
You can also call the text method on a Nokogiri node:
row.at_xpath(xpath).text
I added an answer for this same sort of question the other day. It's a very easy process.
Take a look at: Convert HTML to plain text and maintain structure/formatting, with ruby