Ruby - traverse through nokogiri element - ruby

I have HTML like this:
...
<table>
<tbody>
...
<tr>
<th> head </th>
<td> td1 text</td>
<td> td2 text</td>
...
</tr>
</tbody>
<tfoot>
</tfoot>
</table>
...
I'm using Nokogiri with Ruby. I want to traverse each row and collect the text of the th and its corresponding td cells into a hash.

require "nokogiri"
# Parse the HTML input
html_data = "...stripped HTML markup code..."
html_doc = Nokogiri::HTML html_data
# Iterate over each row in the table
# Note that you may need to adjust the CSS selector below
result = html_doc.css("table tr").inject({}) do |all, row|
  # Modify if you need to collect only the first td, for example
  all[row.css("th").text] = row.css("td").text
  # Return the accumulator so inject passes the hash on to the next row
  all
end

I didn't run this code, so I'm not absolutely sure but the overall idea should be right:
html_doc = Nokogiri::HTML("<html> ... </html>")
result = []
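html_doc.xpath("//tr").each do |tr|
  hash = {}
  # element_children skips the whitespace text nodes between cells;
  # note that several td cells in one row overwrite each other under the "td" key
  tr.element_children.each do |node|
    hash[node.node_name] = node.content
  end
  result << hash
end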
puts result.inspect
See the docs for more info: http://nokogiri.org/Nokogiri/XML/Node.html

Related

How to use Conditional in Nokogiri

Is there a way to output "No URL found" for a blank or missing anchor tag?
I'm asking because the text-node query outputs 50 entries but the URL query outputs only 47, since some anchors are missing or not available. The next list then falls out of step, which completely ruins it.
I can get the text nodes and the attributes; the only problem is that some td cells have no anchor, which throws the other list out of step.
<table>
<tr>
<td>TextNode</td>
</tr>
<tr>
<td>TextNode</td>
</tr>
<tr>
<td>TextNode</td>
</tr>
<tr>
<td>TextNode With No Anchor</td>
</tr>
<tr>
<td>TextNode</td>
</tr>
<tr>
<td>TextNode With No Anchor</td>
</tr>
</table>
company_name = page.css("td:nth-child(2)")
company_name.each do |line|
c_name = line.text.strip
# this will output 50 titles
puts c_name
end
directory_url = page.css("td:nth-child(1) a")
directory_url.each do |line|
dir_url = line["href"]
# this will output 47 Urls since some list has no anchor tag.
puts dir_url
end
You can't find things that aren't there. You have to find things that are there, and then search within them for elements that may or may not be present.
Like:
directory = page.css("td:nth-child(1)")
directory.each do |e|
anchor = e.css('a')
puts anchor.any? ? anchor[0]['href'] : '(No URL)'
end

How to iterate through web tables using selenium and ruby

I want to iterate through an HTML table with n rows and columns, as follows:
<table class='table'>
<tbody>
<tr>
<td>Spratly Islands</td>
<td>Vietnam</td>
<td>Azerbaijan</td>
<td>Georgia</td>
</tr>
<tr>
<td>Sri Lanka</td>
<td>Israel</td>
<td>Cyprus</td>
<td>Yemen</td>
</tr>
<tr>
<td>Maldives</td>
<td>Kuwait</td>
<td>West Malaysia</td>
<td>Nepal</td>
</tr>
...
</tbody>
</table>
I want to get the column names for each row using XPath and print them. How do I do this in Ruby?
Thanks,
RV
To iterate over the table in Ruby, use the following code.
I assume the first row is at index 1.
driver.find_elements(xpath: "//table[@class='table']//tr").each.with_index(1) do |_, index|
  driver.find_elements(xpath: "//table[@class='table']//tr[#{index}]/td").each do |cell|
    puts cell.text
  end
  puts '*****'
end
I also suggest moving to WATIR, a very nice wrapper around the Ruby Selenium bindings that has dedicated syntax for table iteration.
In WATIR, you could do:
b.table(class: 'table').rows.each do |row|
row.cells.each do |cell|
puts cell.text
end
puts '*****'
end

'html-table' gem and issues with HTML::Table::Head

I'm creating two HTML tables. The first one is perfect, and the second one has the same HEAD as the first table no matter what I do.
Here's the problematic code:
require 'html/table'
include HTML
title1 = [1,2,3]
data1 = [1,2,3]
table1 = HTML::Table.new
table1.push Table::Head.create{ |row| row.content = title1 }
data1.each { |entry| table1.push Table::Row.new{|row| row.content = entry}}
title2 = [1,2]
data2 = [1,2]
table2 = HTML::Table.new
table2.push Table::Head.create{ |row| row.content = title2}
data2.each { |entry| table2.push Table::Row.new{ |row| row.content = entry} }
This is the result from puts table1.html:
<table>
<thead>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
</thead>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
<tr>
<td>3</td>
</tr>
</table>
This is the result from puts table2.html:
<table>
<thead>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
</thead>
<tr>
<td>1</td>
</tr>
<tr>
<td>2</td>
</tr>
</table>
There are no issues with the content but HEAD looks the same in both tables. Why?
EDIT:
I've simplified the initial code a bit:
require 'html/table'
include HTML
s = Table::Head.create{ |row| row.content = 1 }
m = Table::Head.create{ |row| row.content = 2 }
puts s
<td>1</td>
puts m
<td>1</td>
Calling inspect shows that both variables hold the same object:
puts s.inspect
puts m.inspect
[[#<HTML::Table::Row::Data:0x007ff52b096e38 @html_begin="<td", @html_body="1", @html_end="</td>">]]
[[#<HTML::Table::Row::Data:0x007ff52b096e38 @html_begin="<td", @html_body="1", @html_end="</td>">]]
Until now I was not aware of this gem and I don't understand the reason why someone would want to add this to his codebase. What value does it add? What do you accomplish WITH the gem that you can not accomplish WITHOUT?
You see the same output because create returns the same object both times:
require 'html/table'
include HTML
s = Table::Head.create{ |row| row.content = 1 }
m = Table::Head.create{ |row| row.content = 2 }
puts s.object_id == m.object_id # => true
If you look at the source code (https://github.com/djberg96/html-table/blob/master/lib/html/head.rb#L18) then it is clear that this is intended behavior:
# This is our constructor for Head objects because it is a singleton
# class. Optionally, a block may be provided. If an argument is
# provided it is treated as content.
#
def self.create(arg=nil, &block)
  @@head = new(arg, &block) unless @@head
  @@head
end
According to this code and the comment, a <thead> is a singleton and should only exist once.
Without looking any further into the library and how it works: IMHO treating <thead> as a singleton is plain wrong and a reason to stop using this library right away. You could contact the author if you are curious.
As a rule of thumb: when there are class-level variables (@@), there is trouble.
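To see the effect in isolation, here is a minimal sketch of the same caching pattern in plain Ruby. CachedHead is a hypothetical stand-in written for illustration, not the gem's class; it only mirrors the shape of the quoted create method.

```ruby
# Hypothetical stand-in that caches its instance in a class variable,
# mirroring the shape of the quoted `create` method (not the gem itself).
class CachedHead
  attr_accessor :content

  @@head = nil

  def initialize(content = nil)
    @content = content
    yield self if block_given?
  end

  # Builds the object on the first call only; later calls ignore
  # their arguments and block and return the cached instance.
  def self.create(content = nil, &block)
    @@head = new(content, &block) unless @@head
    @@head
  end
end

s = CachedHead.create { |head| head.content = 1 }
m = CachedHead.create { |head| head.content = 2 }

puts s.equal?(m) # both names refer to one object
puts m.content   # the second block never ran, so the content is still 1
```

The second call never reaches new, which is why the second table in the question ends up with the first table's header.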
So what can you do?
Do you need to create an HTML table outside of a web framework like Rails? You could:
Use ERB: http://ruby-doc.org/stdlib-2.3.1/libdoc/erb/rdoc/ERB.html
HTML is "just XML (tm)" so you can use REXML: http://ruby-doc.org/stdlib-1.9.3/libdoc/rexml/rdoc/REXML.html
These are both "built in" solutions, available in your Ruby right away. But you could also use a different templating solution (haml, slim, ...) or, because the REXML interface is not the most straightforward, I think, another XML generator (ox, oga, nokogiri) or builder/xml.
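As a rough illustration of the ERB route, the sketch below renders a header row plus one-cell body rows, similar to the tables in the question. TABLE_TEMPLATE and render_table are hypothetical names chosen for this example, and the layout is an assumption rather than a reproduction of the gem's output.

```ruby
require "erb"

# Hypothetical template: one header row, then one single-cell row per entry.
TABLE_TEMPLATE = ERB.new(<<~HTML, trim_mode: "-")
  <table>
  <thead>
  <tr>
  <%- titles.each do |title| -%>
  <td><%= ERB::Util.html_escape(title) %></td>
  <%- end -%>
  </tr>
  </thead>
  <%- rows.each do |entry| -%>
  <tr>
  <td><%= ERB::Util.html_escape(entry) %></td>
  </tr>
  <%- end -%>
  </table>
HTML

# Each call renders from its own arguments, so no state is shared between tables.
def render_table(titles, rows)
  TABLE_TEMPLATE.result_with_hash(titles: titles, rows: rows)
end

table1 = render_table([1, 2, 3], [1, 2, 3])
table2 = render_table([1, 2], [1, 2])
puts table2
```

Because the template only sees the values passed in, the second table gets its own two-column header.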

How to create an array scraping HTML?

I have a Rake task set up, and it works almost how I want it to.
I'm scraping information from a site and want to get all of the player ratings into an array, ordered by how they appear in the HTML. I have player_ratings and want to do exactly what I did with the player_names variable.
I only want the fourth <td> within a <tr> in the specified part of the doc because that corresponds to the ratings. If I use Nokogiri's text, I only get the first player rating when I really want an array of all of them.
task :update => :environment do
require "nokogiri"
require "open-uri"
team_ids = [7689, 7679, 7676, 7680]
player_names = []
for team_id in team_ids do
url = URI.encode("http://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=#{team_id}")
doc = Nokogiri::HTML(open(url))
player_names = doc.css('.table.table-bordered.table-striped.table-condensed')[1].css('tr td a').map(&:content)
player_ratings = doc.css('.table.table-bordered.table-striped.table-condensed')[1].css('tr td')[3]
puts player_ratings
player_names.map{|player| puts player}
end
end
Any advice on how to do this?
I think changing your XPath might help. Here is the XPath:
nodes = doc.xpath "//table[@class='table table-bordered table-striped table-condensed'][2]//tr/td[4]"
# Use map rather than each: each returns the node set itself, map returns the texts
data = nodes.map { |node| node.text }
Mapping the nodes with node.text gives me
4.682200 
5.439000 
5.568400 
5.133700 
4.480800 
4.368700 
2.768100 
3.814300 
5.103400 
4.567000 
5.103900 
3.804400 
3.737100 
4.742400 
I'd recommend using Wombat (https://github.com/felipecsl/wombat), where you can specify that you want to retrieve a list of elements matched by your CSS selector, and it will do all the hard work for you.
It's not well known, but Nokogiri implements some of jQuery's JavaScript extensions for searching using CSS selectors. In your case, the :eq(n) method will be useful:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<html>
<body>
<table>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
</table>
</body>
</html>
EOT
doc.at('td:eq(4)').text # => "4"

match table row id's with a common prefix

This might be merely a syntax question.
I am unclear how to match only table rows whose id begins with rowId_.
agent = Mechanize.new
pageC1 = agent.get("/customStrategyScreener!list.action")
The table has class=tableCellDT.
pageC1.search('table.tableCellDT tr[@id=rowId_]') # parses OK but returns 0 rows since rowId_ is not matched exactly.
pageC1.search('table.tableCellDT tr[@id=rowId_*]') # Throws an error since * is not treated like a wildcard string match
EXAMPLE HTML:
<table id="row" cellpadding="5" class="tableCellDT" cellspacing="1">
<thead>
<tr>
<th class="tableHeaderDT">#</th>
<th class="tableHeaderDT sortable">
Screener</th>
<th class="tableHeaderDT sortable">
Strategy</th>
<th class="tableHeaderDT"> </th></tr></thead>
<tbody>
<tr id="rowId_BullPut" class="odd">
<td> 1 </td>
<td> Bull</td>
<td></td>
<td>Edit
Delete
View
</td></tr>
NOTE
pageC1 is a Mechanize::Page object, not a Nokogiri anything. Sorry that wasn't clear at first.
Mechanize::Page doesn't have #css or #xpath methods, but a Nokogiri doc can be extracted from it (used internally anyway).
To get the tr elements that have an id starting with "rowId_":
pageC1.search('//tr[starts-with(@id, "rowId_")]')
You want either the CSS3 attribute starts-with selector:
pageC1.css('table.tableCellDT tr[id^="rowId_"]')
or the XPath starts-with() function:
pageC1.xpath('.//table[@class="tableCellDT"]//tr[starts-with(@id,"rowId_")]')
Although the Nokogiri Node#search method will intelligently pick between CSS or XPath selector syntax based on what you wrote, that does not mean that you can mix both CSS and XPath selector syntax in the same query.
In action:
>> require 'nokogiri'
#=> true
>> doc = Nokogiri.HTML <<ENDHTML; true #hide output from IRB
">> <table class="foo"><tr id="rowId_nonono"><td>Nope</td></tr></table>
">> <table class="tableCellDT">
">> <tr id="rowId_yesyes"><td>Yes1</td></tr>
">> <tr id="rowId_andme2"><td>Yes2</td></tr>
">> <tr id="rowIdNONONO"><td>Needs underscore</td></tr>
">> </table>
">> ENDHTML
#=> true
>> doc.css('table.tableCellDT tr[id^="rowId_"]').map(&:text)
#=> ["Yes1", "Yes2"]
>> doc.xpath('.//table[@class="tableCellDT"]//tr[starts-with(@id,"rowId_")]').map(&:text)
#=> ["Yes1", "Yes2"]
Thanks to
http://nokogiri.org/Nokogiri/XML/Node.html#method-i-css
and the answers above, here is the final code that solves my problem of getting just the rows I need, and then reading only certain information from each one:
pageC1.search('//tr[starts-with(@id, "rowId_")]').each do |row|
  # Read the string after _ in rowId_, part of the "id" in <tr>
  rid = row.attribute("id").text.split("_")[1] # => "BullPut"
  # Get the URL of the 3rd <a> link in <td> cell 4
  link = row.css("td[4] a[3]")[0].attributes["href"].text # => "link3?model.itemId=2262&model.source=list"
end

Resources