Using Nokogiri to scrape a value from Yahoo Finance? - ruby

I wrote a simple script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text
puts "#{name} - #{market_cap} - #{ebit}"
The script grabs three values from Yahoo Finance. The problem is that the ebit XPath returns nil. I got the XPath by copying it from the Chrome developer tools.
This is the page I'm trying to get the value from: http://au.finance.yahoo.com/q/bs?s=MYGN. The actual value is 483,992, in the Total Current Assets row.
Any help would be appreciated, especially if there is a way to get this value with CSS selectors.

Nokogiri supports the :contains pseudo-class, so you can anchor on the row label:
require 'nokogiri'
require 'open-uri'
# URI.open: on Ruby 3.0+ a bare open() no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://au.finance.yahoo.com/q/bs?s=MYGN"))
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')
puts ebit
# >> 483,992
I'm using the <strong> tag as a place-marker with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td>, and grabbing its text. Finally, gsub(/[^,\d]+/, '') strips the whitespace by removing everything that isn't a digit or a comma.
Nokogiri supports a number of jQuery's JavaScript extensions, which is why :contains works.

This seems to work for me
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.tr(",","").to_i
#=> 483992
Or as a string
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.strip.gsub(/\u00A0/,"")
#=> "483,992"
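The two clean-up approaches in the answers above can be tried on a plain string without fetching the page. This is only a sketch: RAW_CELL is a hypothetical guess at what the cell text looks like (the number padded with whitespace and non-breaking spaces), and the helper names are made up for illustration.

```ruby
# Hypothetical cell text; the real page may pad the number differently.
RAW_CELL = "\u00A0\n  483,992\u00A0\u00A0"

# Keep only digits and commas, as gsub(/[^,\d]+/, '') does in the first answer.
def digits_and_commas(text)
  text.gsub(/[^,\d]+/, '')
end

# Drop the thousands separators and convert to an Integer,
# as tr(",", "").to_i does in the second answer.
def to_integer(text)
  digits_and_commas(text).tr(',', '').to_i
end

puts digits_and_commas(RAW_CELL) # 483,992
puts to_integer(RAW_CELL)        # 483992
```

Note that String#strip alone does not remove non-breaking spaces, which is why the second answer needs the extra gsub(/\u00A0/, "").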

Related

Parsing large HTML files with Nokogiri

I'm trying to parse http://www.pro-medic.ru/index.php?ht=246&perpage=all with Nokogiri, but unfortunately I can't get all items from the page.
My simple test code is:
require 'open-uri'
require 'nokogiri'
html = Nokogiri::HTML open('http://www.pro-medic.ru/index.php?ht=246&perpage=all')
p html.css('ul.products-grid-compact li .goods_container').count
It returns only 83 items but the real count is about 186.
I thought that the problem could be in open, but it seems that function reads the HTML page correctly.
Has anybody faced the same problem?
The file seems to exceed Nokogiri's parser limits. You can relax the limits by adding the HUGE flag:
require 'open-uri'
require 'nokogiri'
url = 'http://www.pro-medic.ru/index.php?ht=246&perpage=all'
# URI.open: on Ruby 3.0+ a bare open() no longer fetches URLs
html = Nokogiri::HTML(URI.open(url)) do |config|
  config.options |= Nokogiri::XML::ParseOptions::HUGE
end
html.css('ul.products-grid-compact li .goods_container').count
#=> 186
Note that |= is the bitwise OR assignment operator. Don't confuse it with the logical OR assignment operator ||=.
According to the Parse Options documentation, you can also set this flag via config.huge.
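The difference between |= and ||= can be seen with plain integers. In this sketch the flag values and helper names are hypothetical and chosen only for illustration; they are not Nokogiri's real constants.

```ruby
# Hypothetical flag values for illustration only (NOT Nokogiri's real constants).
RECOVER = 0b001
NONET   = 0b010
HUGE    = 0b100

# Bitwise OR assignment adds a flag and keeps the ones already set.
def add_flag(options, flag)
  options |= flag
  options
end

# Logical OR assignment only assigns when the left side is nil or false,
# so an existing options value is left untouched and the flag is lost.
def logical_or_assign(options, flag)
  options ||= flag
  options
end

defaults = RECOVER | NONET
puts add_flag(defaults, HUGE).to_s(2)          # 111
puts logical_or_assign(defaults, HUGE).to_s(2) # 11
```

With ||= the HUGE bit is never set because the existing options value is already truthy.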

Web Scraping with Nokogiri and Mechanize

I am parsing prada.com and would like to scrape data in the div class "nextItem" and get its name and price. Here is my code:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
html_doc = Nokogiri::HTML(page)
page = html_doc.xpath("//ol[@class='nextItem']")
page.each do {|i| fp.write(i.text + "\n")}
end
I get an error and no output. What I think I am doing is instantiating a mechanize object and calling it agent.
Then creating a page variable and assigning it the url provided.
Then creating a variable that is a nokogiri object with the mechanize url passed in
Then searching the url for all class references that are titled nextItem
Then printing all the data contained there
Can someone show me where I might have gone wrong?
Since Prada's website dynamically loads its content via JavaScript, it will be hard to scrape its content. See "Scraping dynamic content in a website" for more information.
Generally speaking, with Mechanize, after you get a page:
page = agent.get(page_url)
you can easily search items with CSS selectors and scrape for data:
next_items = page.search(".fooClass")
next_items.each do |item|
  price = item.search(".fooPrice").text
end
Then simply handle the strings or generate hashes as you desire.
Here are the wrong parts:
Check the block syntax again: use either {} or do/end, but not both at the same time.
Mechanize#get returns a Mechanize::Page, which acts like a Nokogiri document; at the least, it has search, xpath, and css. Use them instead of trying to coerce the page into a Nokogiri::HTML object.
There is no need to require 'open-uri' and 'nokogiri' when you are not using them directly.
Finally, it may be worth reviewing Ruby's basics before continuing with web scraping.
Here is the code with fixes:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.prada.com/en/US/e-store/department/woman/handbags.html?cmp=from_home')
fp = File.new('prada_prices','w')
page = page.search("//ol[@class='nextItem']").each do |i|
  fp.write(i.text + "\n")
end
fp.close
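As a variation on the fixed script, the block form of File.open closes the file automatically, so the explicit fp.close is not needed. This is a minimal sketch: write_lines is a hypothetical helper, and the items array is a stand-in for the text of the scraped nodes.

```ruby
require 'tmpdir'

# Writes one line per item. File.open with a block closes the file
# when the block exits, even if an exception is raised inside it.
def write_lines(path, lines)
  File.open(path, 'w') do |fp|                  # do/end for the multi-line block
    lines.each { |line| fp.write(line + "\n") } # {} for the one-liner
  end
end

# Hypothetical stand-ins for the text of each scraped node.
items = ['first item', 'second item']

Dir.mktmpdir do |dir|
  path = File.join(dir, 'prada_prices')
  write_lines(path, items)
  puts File.read(path)
end
```

It also shows both block styles used correctly: each block uses either braces or do/end, never both.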

How do I print XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is some example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
divs = html.xpath('//div').map { |div| div.content } # here you can do whatever is needed with the divs
# I've mapped their content into an array
There are two things wrong:
The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
# URI.open: on Ruby 3.0+ a bare open() no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.

Why is Nokogiri returning blank output to excel?

My task
Extract all specifications from http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications and put them in a spreadsheet (we will work on formatting later).
Problem
Spreadsheet is created but my output is returning blank.
My Code
require 'Nokogiri'
require 'open-uri'
require 'spreadsheet'
doc = Nokogiri::HTML(open("http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications"))
data = puts doc.css('//div#specifications/div#spec-area/ul#product-spec/li')
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
sheet1.name = 'My First Worksheet'
sheet1[0,0] = data
book.write 'C:/Users/Barry/Desktop/output.xls'
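The following code worked for me:
require 'nokogiri' # lowercase: 'Nokogiri' only loads on case-insensitive file systems
require 'open-uri'
require 'spreadsheet'
# URI.open: on Ruby 3.0+ a bare open() no longer fetches URLs
doc = Nokogiri::HTML(URI.open("http://www.asus.com/Notebooks_Ultrabooks/ASUS_TAICHI_21/#specifications"))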
data = doc.css('div#specifications div#spec-area ul.product-spec')[0].text
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet::Workbook.new
sheet1 = book.create_worksheet
sheet1.name = 'My First Worksheet'
sheet1[0,0] = data
book.write 'C:/Users/Barry/Desktop/output.xls'
There are a few problems here:
It looks like you’re trying to debug by printing out the result of the css call in the line:
data = puts doc.css('//div#specifications/div#spec-area/ul#product-spec/li')
The method puts returns nil, so data will be nil and will result in nothing being shown.
In the page you’re parsing, the product-spec list is in fact a class, not an id, so you need .product-spec (. instead of #).
The syntax you’re using isn’t actually CSS; it looks like you’re mixing CSS and XPath. You want something like this:
doc.css('div#specifications div#spec-area ul.product-spec li')
(This last point doesn’t seem to actually affect the result. Nokogiri converts CSS selectors to XPath, and it appears that the transformation results in valid XPath anyway.)
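The first point, that puts returns nil, can be checked without Nokogiri or the spreadsheet gem. In this sketch spec_text is a hypothetical stand-in for the text of the matched nodes, and the other method names are made up for illustration.

```ruby
# Hypothetical stand-in for the text of the matched nodes.
def spec_text
  "Processor: example\nMemory: example"
end

# puts prints its argument and returns nil, so data ends up nil.
def captured_with_puts
  data = puts spec_text
  data
end

# Assign first, then print separately when debugging.
def captured_then_printed
  data = spec_text
  puts data
  data
end

p captured_with_puts    # nil
p captured_then_printed # the full string
```

Writing the first result into a spreadsheet cell would store nil, which matches the blank output described in the question.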

Screen scraping with Nokogiri and Each method returning zero

I'm running the following code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://sfbay.craigslist.org/search/sss?query=bike&catAbb=sss&srchType=A&minAsk=&maxAsk="
doc = Nokogiri::HTML(open(url))
doc.css(".row").each do |row|
row.css("a").text
end
The only thing I get back is 0. However, when I just run doc.css(".row"), I get the entire list of rows from Craigslist. Why does it return zero when I use the each method, and how do I fix it?
.each doesn't collect the results of its block; it's a simple iterator. Perhaps you are looking for .map?
This will return an array of the anchor element text:
doc.css(".row").map {|row| row.css("a").text }
You don't need to issue two different css queries; you can combine them:
doc.css(".row > a").map(&:text)
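The each versus map distinction can be sketched with a plain array standing in for the node set. ROWS and the helper names are hypothetical, and what each returns for a Nokogiri node set may differ from what Array#each returns here; the point is only that each discards the block's results.

```ruby
# Hypothetical stand-ins for the link text in each row.
ROWS = ['red bike', 'blue bike', 'bike rack']

# each runs the block for its side effects; the block's results are discarded.
# For an Array, each returns the array it was called on.
def titles_with_each(rows)
  rows.each { |row| row.upcase }
end

# map collects the block's result for every element into a new array.
def titles_with_map(rows)
  rows.map { |row| row.upcase }
end

p titles_with_each(ROWS) # the original rows, unchanged
p titles_with_map(ROWS)  # the upcased titles
```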

Resources