How to create an array scraping HTML? - ruby

I have a Rake task set up, and it works almost how I want it to.
I'm scraping information from a site and want to get all of the player ratings into an array, ordered by how they appear in the HTML. I have player_ratings and want to do exactly what I did with the player_names variable.
I only want the fourth <td> within each <tr> in the specified part of the document, because that corresponds to the ratings. If I use Nokogiri's text method, I only get the first player rating, when I really want an array of all of them.
task :update => :environment do
  require "nokogiri"
  require "open-uri"

  team_ids = [7689, 7679, 7676, 7680]
  player_names = []

  for team_id in team_ids do
    url = URI.encode("http://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=#{team_id}")
    doc = Nokogiri::HTML(open(url))
    player_names = doc.css('.table.table-bordered.table-striped.table-condensed')[1].css('tr td a').map(&:content)
    player_ratings = doc.css('.table.table-bordered.table-striped.table-condensed')[1].css('tr td')[3]
    puts player_ratings
    player_names.map{|player| puts player}
  end
end
Any advice on how to do this?

I think changing your XPath might help. Here is the XPath; note that map (rather than each) is what actually collects the text into an array:
nodes = doc.xpath "//table[@class='table table-bordered table-striped table-condensed'][2]//tr/td[4]"
data = nodes.map { |node| node.text }
Iterating over the nodes with node.text gives me:
4.682200 
5.439000 
5.568400 
5.133700 
4.480800 
4.368700 
2.768100 
3.814300 
5.103400 
4.567000 
5.103900 
3.804400 
3.737100 
4.742400 
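
If you want those ratings in an array alongside the player names, a minimal sketch reusing the url and selectors from the question (it assumes the names and ratings come from the same rows of that second table, which has not been re-verified against the live page):

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open(url))

table   = doc.css('.table.table-bordered.table-striped.table-condensed')[1]
names   = table.css('tr td a').map(&:content)
ratings = doc.xpath("//table[@class='table table-bordered table-striped table-condensed'][2]//tr/td[4]").map(&:text)

# Pair each player with their rating, preserving document order
names.zip(ratings).each do |name, rating|
  puts "#{name}: #{rating}"
end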

I'd recommend using Wombat (https://github.com/felipecsl/wombat), where you can specify that you want to retrieve a list of elements matched by your CSS selector, and it will do all the hard work for you.
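
For reference, a Wombat crawl is declared roughly like this (a sketch based on Wombat's README; the property name, path, and CSS selector here are illustrative assumptions and have not been run against the site):

require 'wombat'

result = Wombat.crawl do
  base_url "http://modules.ussquash.com"
  path "/ssm/pages/leagues/Team_Information.asp?id=7689"

  # :list collects every node matched by the selector into an array
  player_ratings({ css: "table.table-condensed tr td:nth-child(4)" }, :list)
end

result["player_ratings"] # => array of rating strings, if the selector matches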

It's not well known, but Nokogiri implements some of jQuery's JavaScript extensions for searching using CSS selectors. In your case, the :eq(n) method will be useful:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<html>
  <body>
    <table>
      <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
      </tr>
    </table>
  </body>
</html>
EOT
doc.at('td:eq(4)').text # => "4"
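
Applied to the question, something along these lines should give the whole column of ratings as an array (a sketch reusing the question's selector; note that Nokogiri's :eq is 1-based, as the example above shows):

ratings_table  = doc.css('.table.table-bordered.table-striped.table-condensed')[1]
player_ratings = ratings_table.css('tr td:eq(4)').map(&:text)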

Related

ruby selenium xpath td css

I am testing a webapp using Ruby and Selenium web-driver. I have not been able to examine the contents of a cell in the displayed webpage.
What I would like to get is the IP in the td.
<td class="multi_select_column"><input name="object_ids" type="checkbox"
value="adcf0467-2756-4c02-9edd-bb83c40b8685" /></td>
<td class="sortable normal_column">Core</td>
<td class="sortable nowrap-col normal_column">r1-c4-b4</td>
<td class="sortable anchor normal_column"><a href="/horizon/admin/instances
/adcf0467-2756-4c02-9edd-bb83c40b8685/detail" class="">pg-gtmpg--675</a></td>
<td class="sortable normal_column">column_name</td><td class="sortable normal_column">
<tr
<ul>
<li>172.25.1.12</li>
</ul>
I used the Firefox addon firepath to get the Xpath of the IP.
It gives "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[6]/ul/li", which looks correct.
However I have not been able to display the IP.
Here is my test code:
#!/usr/bin/env ruby
#
# Sample Ruby script using the Selenium client API
#
require "rubygems"
require "selenium-webdriver"   # needed for Selenium::WebDriver
require "selenium/client"
require "test/unit"

begin
  driver = Selenium::WebDriver.for(:remote, :url => "http://dog.dog.jump.acme.com:4444/wd/hub")
  driver.navigate.to "http://10.87.252.37/acme/auth/login/"

  g_user_name = driver.find_element(:id, 'id_username')
  g_user_name.send_keys("user")
  g_user_name.submit

  g_password = driver.find_element(:id, 'id_password')
  g_password.send_keys("password")
  g_password.submit

  g_instance_1 = driver.find_element(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[4]/a")
  puts g_instance_1.text()    # <- here I can see the text

  g_instance_2 = driver.find_elements(:xpath, "html/body/div[1]/div[2]/div[3]/form/table/tbody/tr[1]/td[6]/ul/li[1]")
  puts g_instance_2           # output: <Selenium::WebDriver::Element:0x000000023c1700
  puts g_instance_2.inspect   # output: [#<Selenium::WebDriver::Element:0x22f3b7c6e7724d4a id="4">]
  puts g_instance_2.class     # output: Array
  puts g_instance_2.count     # output: 1
end
When there is no <a> in the td it doesn't seem to work.
I have tried puts g_instance_2.text, g_instance_2.text(), and many others with no success.
I must be missing something obvious, but I am not seeing it.
ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-linux] on
Linux ubuntu 3.8.0-34-generic #49~precise1-Ubuntu
I decided to try a different approach, using a CSS selector instead of XPath.
When I insert the following CSS selector into the FirePath window, the desired HTML section is selected.
g_instance_2 = driver.find_elements(:css, "table#instances tbody tr:nth-of-type(1) td:nth-of-type(6) ul li:nth-of-type(1)" )
The problem is the same as before: I don't seem to be able to access the contents of g_instance_2.
I have tried:
puts g_instance_2
g_instance_22 = [g_instance_2]
puts g_instance_22
Both return:
#<Selenium::WebDriver::Element:0x000000028a6ba8>
#<Selenium::WebDriver::Element:0x000000028a6ba8>
How can I check the value returned from the remote web-server?
Would Python be a better choice to do this?
The HTML code fragment you are trying to test is not valid HTML. It might be worth filing a bug report for it.
With the given code, the following CSS selector retrieves the <a> you want:
[href^="/horizon/admin/instances"]
Translated: any element whose "href" attribute starts with "/horizon/admin/instances".
For XPath, this is the selector:
("//a[contains(@href, '/horizon/admin/instances')]")
Same translation, just uglier syntax.
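For example, with the question's driver variable (a sketch; find_elements returns an Array, so you index into it or take the first match, while find_element returns a single element):
# CSS attribute selector -- take the first match and read its text
link = driver.find_elements(:css, '[href^="/horizon/admin/instances"]').first
puts link.text if link

# XPath equivalent
link = driver.find_element(:xpath, "//a[contains(@href, '/horizon/admin/instances')]")
puts link.text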
The problem was that I was not accessing the returned array properly.
puts g_instance_2[0].text()
This works for both the CSS and XPath versions.

How do you traverse an HTML document, search, and skip to the next item using Nokogiri?

How do you traverse up to a certain found element and then continue to the next found item? In my example I am trying to search for the first <font> element, grab its text, and then continue until I find the next <font> tag or until I hit an <img> tag. The reason I also need to take the <img> tag into account is because I want to do something there.
HTML:
<table border=0>
  <tr>
    <td width=180>
      <font size=+1><b>apple</b></font>
    </td>
    <td>Description of an apple</td>
  </tr>
  <tr>
    <td width=180>
      <font size=+1><b>banana</b></font>
    </td>
    <td>Description of a banana</td>
  </tr>
  <tr>
    <td><img vspace=4 hspace=0 src="common/dot_clear.gif"></td>
  </tr>
...Then this repeats itself in a similar format
Current scrape.rb
#...
document.at_css("body").traverse do |node|
  # if <font> is found
  #   puts text in font
  # else if <img> is found
  #   puts img src and continue loop until end of document
end
Thank you!
Interesting. You basically want to traverse all the children in your tree and perform some operation depending on the node you hit.
So here is how we can do that:
require 'nokogiri'
require 'open-uri'

# Acquiring a dummy page
page = Nokogiri::HTML(open('http://en.wikipedia.org/wiki/Ruby_%28programming_language%29'))
Now, if you want to traverse all the body elements, we can employ XPath to the rescue. The XPath expression //body//* gives back all of the children and grandchildren of body.
This returns a NodeSet of Nokogiri::XML::Element objects:
page.xpath('//body//*')
page.xpath('//body//*').first.node_name
#=> "div"
So, you can now iterate over that NodeSet and perform your operations:
page.xpath('//body//*').each do |node|
  case node.name
  when 'div'  then # do this
  when 'font' then # do that
  end
end
Something like this, perhaps:
document.at_css("body").traverse do |node|
  if node.name == 'font'
    puts node.content
  elsif node.name == 'img'
    puts node.attribute("src")
  end
end
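If you want to collect the values instead of printing them as you go, a small variation on the same traversal (a sketch; pairing the font texts and img srcs into one array is an assumption about what "do something there" means):
items = []

document.at_css("body").traverse do |node|
  if node.name == 'font'
    items << { text: node.content.strip }
  elsif node.name == 'img'
    items << { src: node.attribute("src").value }
  end
end

# items now holds the font texts and img srcs in document order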

Formatting HTML into CSV

I'm scraping a website using Ruby with Nokogiri.
This script creates a local text file, opens a URL, and writes the text of every //tr/td match to the file. It is working fine.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
DOC_URL_FILE = "doc.csv"
url = "http://www.SuperSecretWebSite.com"
data = Nokogiri::HTML(open(url))
all_data = data.xpath('//tr/td').text
File.open(DOC_URL_FILE, 'w'){|file| file.write all_data}
Each line has five fields which I would like to run horizontally, then go to the next line after five cells are filled. The data is all there but isn't usable.
I was hoping to learn, or get the code from someone who knows, how to create CSV-formatting code that:
Dumps each group of five <td>...</td> cells into its own row of cells, horizontally, as the script reads the HTML.
Then goes to the next line, and so on.
The layout of the HTML is:
<tr>
<td>John Smith</td>
<td>I live here 123</td>
<td>phone ###</td>
<td>Birthday</td>
<td>Other Data</td>
</tr>
What the final product should look like:
http://picpaste.com/pics/Screenshot-KRnqRGrP.1361813552.png
Current output:
john Smith I live here 123 phone ### Birthday Other Data,
This is pretty standard code to walk a table and extract its cells into an array of arrays. What you do with the data at that point is up to you, but it's very easy to pass it to CSV.
require 'nokogiri'
require 'pp'
doc = Nokogiri::HTML(<<EOT)
<table>
<tr>
<td>John Smith</td>
<td>I live here 123</td>
<td>phone ###</td>
<td>Birthday</td>
<td>Other Data</td>
</tr>
<tr>
<td>John Smyth</td>
<td>I live here 456</td>
<td>phone ###</td>
<td>Birthday</td>
<td>Other Data</td>
</tr>
</table>
EOT
data = []
doc.at('table').search('tr').each do |tr|
  data << tr.search('td').map(&:text)
end
pp data
Which outputs:
[["John Smith", "I live here 123", "phone ###", "Birthday", "Other Data"],
["John Smyth", "I live here 456", "phone ###", "Birthday", "Other Data"]]
The code uses at to locate the first <table>, then iterates over each <tr> using search. For each row, it iterates over the cells and extracts their text.
Nokogiri's at finds the first occurrence of something, and returns a Node. search finds all occurrences and returns a NodeSet, which acts like an array. I'm using CSS accessors, instead of XPath, for simplicity.
As an FYI:
File.open(DOC_URL_FILE, 'w'){|file| file.write all_data}
can be written more succinctly as:
File.write(DOC_URL_FILE, all_data)
I've been working on this problem for a while. Can you give me any more help?
Sigh...
Did you read the CSV documents, especially the examples? What happens if, instead of defining data = [] we replace it with:
CSV.open("path/to/file.csv", "wb") do |data|
and wrap the loop with the CSV block, like:
require 'csv'   # CSV is in the standard library

CSV.open("path/to/file.csv", "wb") do |data|
  doc.at('table').search('tr').each do |tr|
    data << tr.search('td').map(&:text)
  end
end
That's not tested, but it's really that simple. Go and fiddle with that.
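Pulling the pieces together, an end-to-end version might look like this (untested against the real site; the URL and output filename are the placeholders from the question):
require 'nokogiri'
require 'open-uri'
require 'csv'

DOC_URL_FILE = "doc.csv"
url = "http://www.SuperSecretWebSite.com"

doc = Nokogiri::HTML(open(url))

# One CSV row per <tr>, one cell per <td>
CSV.open(DOC_URL_FILE, "wb") do |csv|
  doc.search('tr').each do |tr|
    cells = tr.search('td').map { |td| td.text.strip }
    csv << cells unless cells.empty?
  end
end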

match table row IDs with a common prefix

This might be merely a syntax question.
I am unclear how to match only table rows whose id begins with rowId_
agent = Mechanize.new
pageC1 = agent.get("/customStrategyScreener!list.action")
The table has class=tableCellDT.
pageC1.search('table.tableCellDT tr[@id=rowId_]')   # parses OK, but returns 0 rows since rowId_ is not matched exactly
pageC1.search('table.tableCellDT tr[@id=rowId_*]')  # throws an error since * is not treated as a wildcard string match
EXAMPLE HTML:
<table id="row" cellpadding="5" class="tableCellDT" cellspacing="1">
<thead>
<tr>
<th class="tableHeaderDT">#</th>
<th class="tableHeaderDT sortable">
Screener</th>
<th class="tableHeaderDT sortable">
Strategy</th>
<th class="tableHeaderDT"> </th></tr></thead>
<tbody>
<tr id="rowId_BullPut" class="odd">
<td> 1 </td>
<td> Bull</td>
<td></td>
<td>Edit
Delete
View
</td></tr>
NOTE:
pageC1 is a Mechanize::Page object, not a Nokogiri anything. Sorry that wasn't clear at first.
Mechanize::Page doesn't have #css or #xpath methods, but a Nokogiri doc can be extracted from it (it is used internally anyway).
To get the tr elements that have an id starting with "rowId_":
pageC1.search('//tr[starts-with(@id, "rowId_")]')
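If you do want a plain Nokogiri document to call css or xpath on directly, Mechanize exposes it through #parser (a small sketch reusing the selector above):
doc = pageC1.parser   # the Nokogiri::HTML::Document behind the Mechanize::Page

doc.xpath('//tr[starts-with(@id, "rowId_")]').each do |tr|
  puts tr["id"]       # => "rowId_BullPut", ...
end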
You want either the CSS3 attribute starts-with selector:
pageC1.css('table.tableCellDT tr[id^="rowId_"]')
or the XPath starts-with() function:
pageC1.xpath('.//table[@class="tableCellDT"]//tr[starts-with(@id, "rowId_")]')
Although the Nokogiri Node#search method will intelligently pick between CSS or XPath selector syntax based on what you wrote, that does not mean that you can mix both CSS and XPath selector syntax in the same query.
In action:
>> require 'nokogiri'
#=> true
>> doc = Nokogiri.HTML <<ENDHTML; true # hide output from IRB
<table class="foo"><tr id="rowId_nonono"><td>Nope</td></tr></table>
<table class="tableCellDT">
  <tr id="rowId_yesyes"><td>Yes1</td></tr>
  <tr id="rowId_andme2"><td>Yes2</td></tr>
  <tr id="rowIdNONONO"><td>Needs underscore</td></tr>
</table>
ENDHTML
#=> true
>> doc.css('table.tableCellDT tr[id^="rowId_"]').map(&:text)
#=> ["Yes1", "Yes2"]
>> doc.xpath('.//table[@class="tableCellDT"]//tr[starts-with(@id, "rowId_")]').map(&:text)
#=> ["Yes1", "Yes2"]
Thanks to http://nokogiri.org/Nokogiri/XML/Node.html#method-i-css and the answers above, here is the final code that solves my problem of getting just the rows I need, and then reading only certain information from each one:
pageC1.search('//tr[starts-with(@id, "rowId_")]').each do |row|
  # Read the string after _ in rowId_, part of the "id" in <tr>
  rid = row.attribute("id").text.split("_")[1] # => "BullPut"

  # Get the URL of the 3rd <a> link in <td> cell 4
  link = row.css("td[4] a[3]")[0].attributes["href"].text # => "link3?model.itemId=2262&amp;model.source=list"
end

Ruby - traverse through nokogiri element

I have HTML like this:
...
<table>
  <tbody>
    ...
    <tr>
      <th> head </th>
      <td> td1 text</td>
      <td> td2 text</td>
      ...
    </tr>
  </tbody>
  <tfoot>
  </tfoot>
</table>
...
I'm using Nokogiri with Ruby. I want to traverse each row and get the text of the th and the corresponding td cells into a hash.
require "nokogiri"
#Parses your HTML input
html_data = "...stripped HTML markup code..."
html_doc = Nokogiri::HTML html_data
#Iterates over each row in your table
#Note that you may need to clarify the CSS selector below
result = html_doc.css("table tr").inject({}) do |all, row|
#Modify if you need to collect only the first td, for example
all[row.css("th").text] = row.css("td").text
end
I didn't run this code, so I'm not absolutely sure but the overall idea should be right:
html_doc = Nokogiri::HTML("<html> ... </html>")
result = []
html_doc.xpath("//tr").each do |tr|
  hash = {}
  tr.children.each do |node|
    hash[node.node_name] = node.content
  end
  result << hash
end
puts result.inspect
See the docs for more info: http://nokogiri.org/Nokogiri/XML/Node.html
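One caveat with the code above: tr.children also yields whitespace text nodes, so the hash can pick up stray "text" keys. A variation that only looks at element children (a sketch, assuming the th text should be the key and the td texts the values):
result = []
html_doc.xpath("//tr").each do |tr|
  cells = tr.element_children                      # skips whitespace/text nodes
  th    = cells.find   { |c| c.name == "th" }
  tds   = cells.select { |c| c.name == "td" }
  result << { th.text.strip => tds.map { |td| td.text.strip } } if th
end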
