Using Ruby and Mechanize to make a new Olympic medal count

I want to remake the Olympic medal count on London2012 to better reflect the value of the medals. Currently it is sorted only by gold medals. I'd like to re-rank it by points, with gold = 4, silver = 2 and bronze = 1, to make a new, more rational list. I probably want to remember the previous rank and then add a new rank column as well.
I'd like to try Mechanize to get the raw data from the site, then parse the data into rows and columns, apply the new counts, and remake the list.
From source at http://www.london2012.com/medals/medal-count/ each country has a block with medals like so:
<span class="countryName">Canada</span></a></div></div></td><td class="gold c">0</td><td class="silver c">2</td><td class="bronze c">5</td>
If I use agent.get('http://www.london2012.com/medals/medal-count') it shows the whole list. How do I parse specific spans and table data?
I also need to remember the old rank so that, when I make the new page, I can put the new rank beside it.
Any tips on Mechanize parsing and remembering data would be really helpful. More importantly, I'd appreciate your thinking process for approaching something like this; this doesn't have to be a code answer.
Thanks

First, identify the table. In Chrome, load the page and right click anywhere on the table. Go to Inspect Element, then go up the hierarchy until you're on the table. Select it and you'll see it looks like this:
<table class="or-tbl overall_medals sortable" summary="Schedule">
The overall_medals class looks like it will be unique, so that's a good one to use. Now start irb and do:
require 'mechanize'
agent = Mechanize.new
page = agent.get 'http://www.london2012.com/medals/medal-count/'
double check that the table is unique:
page.search('table.overall_medals').size
#=> 1 (good, it is)
You can get all the data from the table into an array with:
page.search('table.overall_medals tr').map{|tr| tr.search('td').map(&:text)}
Notice that the first two rows appear empty; let's get rid of them by using a range:
data = page.search('table.overall_medals tr')[2..-1].map{|tr| tr.search('td').map(&:text)}
The second row isn't really empty; it has the column names (in th's instead of td's). You can get those with:
columns = page.search('table.overall_medals tr[2] th').map{|th| th.text.strip}
You can get these into hashes with:
rows = data.map{|row| Hash[columns.zip row]}
Now you can do
rows[0]['Country']
#=> "United States of America"
Or even one big hash:
countries = rows.map{|row| {row['Country'] => row}}.reduce &:merge
now:
countries['France']['Gold']
#=> "8"

You might find this Medals API useful (assuming your question is not specifically about Mechanize):
http://apify.heroku.com/resources/5014626da8cdbb0002000006
It uses Nokogiri to parse the site and the output is available as JSON:
http://apify.heroku.com/api/olympics2012_medals.json
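If you go that route, the re-ranking works the same way from the JSON. A sketch, with the caveat that the key names below ('gold', 'silver', 'bronze') are assumptions; inspect the actual JSON output to confirm them:
require 'open-uri'
require 'json'

medals = JSON.parse(open('http://apify.heroku.com/api/olympics2012_medals.json').read)
# Key names are assumed, not verified against the API's output
reranked = medals.sort_by{|c| -(c['gold'].to_i * 4 + c['silver'].to_i * 2 + c['bronze'].to_i)}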

Related

Having trouble parsing these data in watir-webdriver

See hierarchy below:
All I need here is "Company Title", "Company Owner", "Company Owner Title", "Street Number Street Name", and "City, State Zipcode".
I tried b.div.span.bs, but that didn't work (bs because there are multiple blocks I'm gathering data from). I also thought I'd try something like b.tds.split('<br>'), then replace all instances of tags and somehow delete the empty array cells, but I found that each block is different, so the data don't align: Company Title might be in cell 1 for the first array, but if Company Title isn't present in the second block, then its cell 1 would be Company Owner, which conflicts. Anyway, I'm just trying to find a clever way to get these data. Thank you.
Here is the actual HTML; however you must first click "View All".
You can split out everything inside the <div> and then split that by <br>. The first part is Company Title (if exists) and then Company Owner is last/second.
The rest is trickier. Some parts are pretty straightforward: Fax and Member Since have labels, so those are easy. The <a> is easy too.
You could probably test the phone number with a regex and then back up from there. If the one before the phone number isn't <a> then it's city, state zip and the one before that is the address. If one exists before that, it's the Company Owner Title.
Everything after the phone number in your examples has labels, so those are easy.
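As a hedged sketch of that strategy (the real markup was only linked, not shown, so the single-<td> structure and the US phone format below are assumptions):
require 'nokogiri'

# Hypothetical block: fields separated by <br> inside one <td>
html = '<td>Acme Inc<br>Jane Doe<br>President<br>123 Main St<br>Springfield, IL 62704<br>(555) 555-1234</td>'
parts = Nokogiri::HTML(html).at('td').children.map{|n| n.text.strip}.reject(&:empty?)

phone_idx = parts.index{|s| s =~ /\(\d{3}\)\s*\d{3}-\d{4}/} # assumed phone format
city_state_zip = parts[phone_idx - 1]
street_address = parts[phone_idx - 2]
owner_title    = parts[phone_idx - 3] if phone_idx > 2 # the optional field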
I'm not sure about all of your use cases, but often, for pages where the DOM is not very helpful, I just get the text and parse it with Ruby:
browser.td.text.split("\n").reject(&:empty?)
This doesn't directly answer the question, but it shows how I'd go about doing this using Nokogiri, which is the standard HTML/XML parser for Ruby:
require 'nokogiri'
doc = Nokogiri::HTML('<td><div></div><br>a<br>b<br>c</td>')
doc is Nokogiri's internal representation of the document.
We use landmarks in the markup to navigate and find things we want. In this case <div> is a good starting point:
doc.at('div').next_sibling.next_sibling.text # => "a"
next_sibling is how we tell Nokogiri to look at the next node. In this case it's stepping past the first <br> and looking at the a TextNode.
That'd result in unworkable code though, so there's a better way to go:
doc.search('td br').to_html # => "<br><br><br>"
That shows we can find all the <br> tags inside the <td>, so we just have to iterate over them and use them as our landmarks:
doc.search('td br').map{ |br| br.next_sibling.text } # => ["a", "b", "c"]

Parsing one large array into several sub-arrays

I have a list of adjectives (found here), that I would like to be the basis for a "random_adjective(category)" method.
I'm really just taking a stab at this, as my first real attempt at a useful program.
Step 1: Open file, remove formatting. No problem.
list=File.read('adjectivelist')
list = list.gsub(/\n/, " ")
The next step is to break the string up by category:
list.split(" ")
Now I have an array of every word in the file. Neat. The ones with a tilde before them represent the category names.
Now I would like to break up this LARGE array into several smaller ones, based on category.
I need help with the syntax here, although the pseudocode for this would be something like
Scan the array for an element which begins with a tilde.
Now create a new array based on the name of that element sans the tilde, and ALSO place this "category name" into a "categories" array. Then pull the elements from the main array and push them into the sub-array until you meet another tilde. Repeat the process until there are no more elements in the array.
Finally I would pull a random word from the category named in the parameter. If there was no category name matching the parameter, it would return false and exit (this is simply in case I want to add more categories later.)
Tips would be appreciated.
You may want to go back and split first time around like this:
categories = list.split(" ~")
Then each list item will start with the category name. This will save you having to go back through your data structure as you suggest. Consider that a tip: sometimes it's better to re-think the start of a coding problem than to head inexorably forwards.
The structure you are reaching towards is probably a Hash, where the keys are category names, and the values are arrays of all the matching adjectives. It might look like this:
{
  'category' => [ 'word1', 'word2', 'word3' ]
}
So you might do this:
words_in_category = Hash.new
categories.each do |category_string|
  cat_name, *words = category_string.split(" ")
  words_in_category[cat_name] = words
end
Finally, to pick a random element from an array, Ruby provides a very useful method sample, so you can just do this
words_in_category[ chosen_category ].sample
. . . assuming chosen_category contains the string name of an actual category. I'll leave it to you to figure out how to put this all together and handle errors, bad input, etc.
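If it helps to get started, a minimal sketch of the requested method, returning false for an unknown category as the question specifies (the hash is passed in, since a top-level def can't see local variables):
def random_adjective(category, words_in_category)
  words = words_in_category[category]
  return false unless words # no such category
  words.sample
end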
Use slice_before:
categories = list.split(" ").slice_before(/~\w+/)
This will create a sub-array for each word starting with ~, containing that word and all the words up to the next matching one.
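Each slice starts with its own ~marker, so turning the slices into the Hash structure suggested above takes one more step. A sketch, assuming the same file format:
words_in_category = categories.each_with_object({}) do |(marker, *words), h|
  h[marker.delete('~')] = words
end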
If this file format is your own and you have the freedom to change it, then I recommend saving the data in YAML or JSON format and reading it in when needed. There are libraries to do this, so there's no need to worry about the parsing mess; don't spend time reinventing the wheel.

Extracting text from a webtable in Watir / Ruby

I'm trying to extract the data from an Income Statement, url is http://finance.yahoo.com/q/is?s=LMT+Income+Statement&annual
I was unable to find the table using browser.table(:name, 'blah') or (:id, 'blah'), but had some luck using XPath with Nokogiri, with this code, which picks up after I've initialized everything and browsed to the page:
page_html = Nokogiri::HTML.parse(browser.html)
tobj = page_html.xpath('//*[@id="yfncsumtab"]').inner_text
Now I'm able to take tobj and pull the data out, but it doesn't do me any good for trying to manipulate the object as a table. Any suggestions on how to go about storing the table as a variable would help. I can probably figure out iterating through the rows/columns from there, but I wouldn't mind if you tacked on some code that would do that.
Do you know Watir has xpath support?
browser.element(:xpath => '//*[@id="yfncsumtab"]')
Look at it this way:
doc = Nokogiri::HTML.parse(browser.html)
table = doc.at('table#yfncsumtab')
# iterate through tr's
table.search('tr').each do |tr|
  # do something with tr
end
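From there, getting the text into plain Ruby arrays (one per row) is the same pattern as the medal-count example earlier:
data = table.search('tr').map{|tr| tr.search('td').map{|td| td.text.strip}}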
Try browser.element(id: "yfncsumtab").text

Performance issue of Watir table object processing. How to make a Nokogiri HTML table into an array?

The following works but is always very slow, seemingly halting my scraping program and its Firefox or Chrome browser for whole minutes per page:
pp recArray = $browser.table(:id,"recordTable").to_a
Getting the HTML table's text or html source is fast though:
htmlcode = $browser.table(:id,"recordTable").html # .text shows only plaintext portion like lynx
How might I be able to create the same recArray (each element from a <TR>) using for example a Nokogiri object holding only that table's html?
recArray = Nokogiri::HTML(htmlcode). ??
I wrote a blog post about that a few days ago: http://zeljkofilipin.com/watir-nokogiri/
If you have further questions, ask.
You want each tr in the table?
Nokogiri::HTML($browser.html).css('table[id="recordTable"] > tr')
This gives a NodeSet, which can be more useful than an Array. Of course, there's still to_a.
Thought it would be useful to sum up all the steps from here and there:
The question was how to produce the same array object filled with strings from the page's text content that a Watir::Webdriver Table #to_a might produce, but much faster:
recArray = Nokogiri::HTML(htmlcode). ??
So instead of this as I was doing before:
recArray = $browser.table(:class, 'detail-table w-Positions').to_a
I send the whole page's html as a string to Nokogiri to let it do the parsing:
recArray = Nokogiri::HTML($browser.html).css('table[class="detail-table w-Positions"] tr').to_a
Which found me the rows of the table I want and put them into an array.
Not done yet, since the elements of that array are still Nokogiri (table row?) types, which barfed when I attempted things like .join(",") (useful for writing into a .CSV file or database, for instance).
So the following iterates through each row element, turning each into an array of pure Ruby String types, containing only the text content of each table cell stripped of html tags:
recArray = recArray.map{|row| row.css("td").map{|c| c.text}.to_a} # could be merged with the above into an even longer, nastier one-liner
Each cell had also been a Nokogiri Element type; the .text mapping does away with that.
Significant speedup achieved.
Next I wonder what it would take to simply override the #to_a method of every Watir::Webdriver Table object globally in my Ruby code files...
(I realize that may not be 100% compatible but it would spare me so much code rewriting. Am willing to try in my personal.lib.rb include file.)
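For what it's worth, a sketch of that override, reopening the class; whether this stays compatible with the rest of the Table API is exactly the open question above:
require 'nokogiri'
require 'watir-webdriver'

class Watir::Table
  # Replace the slow cell-by-cell WebDriver walk with a single
  # Nokogiri parse of this table's own HTML
  def to_a
    Nokogiri::HTML(html).css('tr').map{|tr| tr.css('td').map{|td| td.text.strip}}
  end
end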

Scraping a website with Nokogiri

I am using Nokogiri to scrape a website and am running into an issue when I try to grab a field from a table. I am using selector gadget to find the CSS selector of the table. I am grabbing data from a government website that details information on motor carriers.
The method that I am using looks like:
def scrape_database
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=#{self.dot}#Inspections"
doc = Nokogiri::HTML(open(url))
self.name = doc.at_css("tr:nth-child(4) .queryfield").text
self.address = doc.at_css("tr:nth-child(6) .queryfield").text
end
I grab all of the fields in the upper table using that syntax and the method operates fine, however I am having issues with the crash rate/inspections table below it.
Here is what I am using to grab that info:
self.vehicle_inspections = doc.at_css("center:nth-child(13) tr:nth-child(2) :nth-child(2)").text
undefined method `text' for nil:NilClass
If I remove text from the end of this, the method runs but doesn't grab any relevant information (obviously). I am assuming this is due to the complicated selector that I am using to grab the field, but am not quite sure.
Has anyone run into a similar problem and can you give me some advice?
Yes, that error means that your CSS selector is not finding the information; at_css is returning nil, and nil.text is not valid. You can guard against it like so:
insp = doc.at_css("long example css selector")
self.vehicle_inspections = insp && insp.text
However, it sounds to me like you "need" this data. Since you have not provided the HTML page or the CSS selectors, I can't help you craft a working CSS or XPath selector.
For future questions, or an edit to this one, note that actual (pared-down) code is strongly preferred over hand waving and loose descriptions of what your code looks like. If you show us the HTML page, or a relevant snippet, and describe which element/text/attribute you want, we can tell you how to select it.
I see six tables on that page. Which is the "crash rate/inspections" table? Given that your URL includes #Inspections on the end, I'm assuming you're talking about the two tables immediately underneath the "Inspections/Crashes In US" section. Here are XPath selectors that match each:
require 'nokogiri'
require 'open-uri'
url = "http://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=800585"
doc = Nokogiri::HTML(open(url))
table1 = doc.at_xpath('//table[@summary="Inspections"][preceding::h4[.//a[@name="Inspections"]]]')
table2 = doc.at_xpath('//table[@summary="Crashes"][preceding::h4[.//a[@name="Inspections"]]]')
# Find a row by index (1 is the first row)
vehicle_inspections = table1.at_xpath('.//tr[2]/td').text.to_i
# Find a row by header text
out_of_service_drivers = table1.at_xpath('.//tr[th="Out of Service"]/td[2]').text.to_i
p [ vehicle_inspections, out_of_service_drivers ]
#=> [6, 0]
tow_crashes = table2.at_xpath('.//tr[th="Crashes"]/td[3]').text.to_i
p tow_crashes
#=> 0
The XPath queries may look intimidating. Let me explain how they work:
//table[@summary="Inspections"][preceding::h4[.//a[@name="Inspections"]]]
//table find a <table> at any level of the document
[@summary="Inspections"] …but only if it has a summary attribute with this value
[preceding::h4…] …and only if you can find an <h4> element earlier in the document
[.//a…] …specifically, an <h4> that has an <a> somewhere underneath it
[@name="Inspections"] …and that <a> has to have a name attribute with this text.
This would actually match two tables (there's another summary="Inspections" table later on the page), but using at_xpath finds the first matching table.
.//tr[2]/td
. Starting at the current node (this table)
//tr[2] …find the second <tr> that is a descendant at any level
/td …and then find the <td> children of that.
Again, because we're using at_xpath we find the first matching <td>.
.//tr[th="Out of Service"]/td[2]
. Starting at the current node (this table)
//tr …find any <tr> that is a descendant at any level
[th="Out of Service] …but only those <tr> that have a <th> child with this text
/td[2] …and then find the second <td> child of those.
In this case there is only one <tr> that matches the criteria, and thus only one <td> that matches, but we still use at_xpath so that we get that node directly instead of a NodeSet with a single element in it.
The goal here (and with any screen scraping) is to latch onto meaningful values on the page, not arbitrary indices.
For example, I could have written my table1 xpath as:
# Find the first table with this summary
table1 = doc.at_xpath('//table[@summary="Inspections"][1]')
…or even…
# Find the 20th table on the page
//table[20]
However, those are fragile. Someone adding a new section to the page, or code that happens to add or remove a formatting table, would cause those expressions to break. You want to hunt for strong attributes and text that likely won't change, and anchor your searches based on that.
The vehicle_inspections XPath is similarly fragile, relying on the ordering of rows instead of the label text for the row.
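A less fragile alternative for that one, keyed on the row label like the out_of_service_drivers example (this assumes the row's <th> text is exactly "Inspections", which is worth verifying against the live page):
vehicle_inspections = table1.at_xpath('.//tr[th="Inspections"]/td[1]').text.to_i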
