Scraping track data from HTML? - ruby

I'd like to be able to scrape data from a track list page at 1001tracklists. A URL example is:
http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html
Here is an example of how the data is displayed on the page:
Above & Beyond - Black Room Boy (Above & Beyond Club Mix) [ANJUNABEATS]
I'd like to pull out all the songs from this page in the following format:
$byArtist - $name [$publisher]
After reviewing the HTML for this page, it appears the content I am after is stored in HTML5 meta microdata format:
<td class="" id="tlptr_433662">
<a name="tlp_433662"></a>
<div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording" id="tlp5_content">
<meta itemprop="byArtist" content="Above & Beyond">
<meta itemprop="name" content="Black Room Boy (Above & Beyond Club Mix)">
<meta itemprop="publisher" content="ANJUNABEATS">
<meta itemprop="url" content="/track/103905_above-beyond-black-room-boy-above-beyond-club-mix/index.html">
<span class="tracklistTrack floatL" id="tr_103905">Above & Beyond - Black Room Boy (Above & Beyond Club Mix) </span><span class="floatL">[ANJUNABEATS]</span>
<div id="tlp5_actions" class="floatL" style="margin-top:1px;">
Each track's markup carries a unique id: this one has "tlp_433662", the next will have "tlp_433628", or something similar.
Is there a way to extract all songs listed on the tracklist page using Nokogiri and XPath?
I will probably want to "do" an "each" on my "data" listed below so that the scraper loops over the data extracting each set of relevant data. Here is the start of my Ruby program:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html"
data = Nokogiri::HTML(open(url))
# what to do next? print out the xpath loop code which extracts my data.
# code block I need help with
data.xpath("...").each do |block|
  block.xpath("...").each do |span|
    puts "..." # print out what I want
  end
end
My ultimate goal, which I know how to do, is to take this Ruby script to Sinatra to "webify" the data and add some nice Twitter bootstrap CSS as shown in this youtube video: http://www.youtube.com/watch?v=PWI1PIvy4A8
Can you help me with the XPath code block so that I can scrape the data and print the array?

Here's some code to gather the information into an array of hashes.
I prefer using CSS accessors over XPath, because they're more readable if you have any HTML/CSS or jQuery experience.
require 'nokogiri'
require 'open-uri'
require 'pp'

doc = Nokogiri::HTML(open('http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'))

data = doc.search('tr.tlpItem div[itemtype="http://schema.org/MusicRecording"]').each_with_object([]) do |div, array|
  hash = div.search('meta').each_with_object({}) do |m, h|
    h[m['itemprop']] = m['content']
  end

  link = div.at('span a')
  hash['tracklistTrack'] = [link['href'], link.text]

  title = div.at('span.floatL a')
  hash['title'] = [title['href'], title.text]

  array << hash
end

pp data[0, 2]
Which outputs a subset of the page's data. After some massaging the structure looks like this:
[
  {
    "byArtist"=>"Markus Schulz",
    "name"=>"The Spiritual Gateway (Transmission 2013 Theme)",
    "publisher"=>"COLDHARBOUR RECORDINGS",
    "url"=>"/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
    "tracklistTrack"=>[
      "/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
      "Markus Schulz - The Spiritual Gateway (Transmission 2013 Theme)"
    ],
    "title"=>[
      "/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
      "Markus Schulz - The Spiritual Gateway (Transmission 2013 Theme)"
    ]
  },
  {
    "byArtist"=>"Lange & Audrey Gallagher",
    "name"=>"Our Way Home (Noah Neiman Remix)",
    "publisher"=>"LANGE RECORDINGS",
    "url"=>"/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
    "tracklistTrack"=>[
      "/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
      "Lange & Audrey Gallagher - Our Way Home (Noah Neiman Remix)"
    ],
    "title"=>[
      "/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
      "Lange & Audrey Gallagher - Our Way Home (Noah Neiman Remix)"
    ]
  }
]

require 'nokogiri'
require 'rest-client'
url = 'http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'
page = Nokogiri::HTML(RestClient.get(url, :user_agent => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'), nil, 'UTF-8')

page.css('table.detail tr.tlpItem').each do |row|
  artist = row.css('meta[itemprop="byArtist"]').attr('content')
  name = row.css('meta[itemprop="name"]').attr('content')
  puts "#{artist} - #{name}"
end
...and a more advanced version that grabs all the meta info from the row and prints 'Artist - Song [Publisher]':
require 'nokogiri'
require 'rest-client'
url = 'http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'
page = Nokogiri::HTML(RestClient.get(url, :user_agent => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'), nil, 'UTF-8')

page.css('table.detail tr.tlpItem').each do |row|
  meta = row.search('meta').each_with_object({}) do |tag, hash|
    hash[tag['itemprop']] = tag['content']
  end
  puts "#{meta['byArtist']} - #{meta['name']} [#{meta['publisher'] || 'Unknown'}]"
end
You get the picture for the rest of the properties. You will need to do some error/exists? checking, because some songs don't have all the properties, but this should get you on the right track. I've used the rest-client gem here, but feel free to use whatever you like to retrieve the page.

There is a free web service which scrapes all 400+ schema.org classes from a given URL and returns them as JSON:
http://scrappy.netfluid.org/


Scraping - Loading dynamic buttons

I'm trying to web-scrape the "Fresh & Chilled" products of Waitrose & Partners using Ruby and Nokogiri.
In order to load more products, I'd need to click 'Load More...', which dynamically loads more products without altering the URL or redirecting to a new page.
How do I 'click' the "Load More" button to load more products?
I think it is a dynamic website, as items are loaded dynamically after clicking the "Load More..." button and the URL is not altered at all (so no pagination is visible).
Here's the code I've tried so far, but I'm stuck on loading more items. My guess is that the DOM loads on its own, but the button can't actually be clicked, because clicking it calls a JavaScript method that loads the rest of the items.
require "csv"
require "json"
require "nokogiri"
require "open-uri"
require "pry"

def scrape_category(category)
  CSV.open("out/waitrose_items_#{category}.csv", "w") do |csv|
    headers = [:id, :name, :category, :price_per_unit, :price_per_quantity, :image_url, :available, :url]
    csv << headers

    url = "https://www.waitrose.com/ecom/shop/browse/groceries/#{category}"
    html = open(url)
    doc = Nokogiri::HTML(html)

    load_more = doc.css(".loadMoreWrapper___UneG1").first
    pages = 0
    while load_more != nil
      puts pages.to_s
      load_more.content # Here's where I don't know how to click the button to load more items
      products = doc.css(".podHeader___3yaub")
      puts "products = " + products.length.to_s
      pages = pages + 1
      load_more = doc.css(".loadMoreWrapper___UneG1").first
    end

    (0..products.length - 1).each do |i|
      puts "url = " + products[i].text
    end
    load_more = doc.css(".loadMoreWrapper___UneG1")[0]
    # here goes the processing of each single item to put in csv file
  end
end

def scrape_waitrose
  categories = [
    "fresh_and_chilled",
  ]
  threads = categories.map do |category|
    Thread.new { scrape_category(category) }
  end
  threads.each(&:join)
end

# binding.pry
Nokogiri is a way of parsing HTML. It's the Ruby equivalent to Javascript's Cheerio or Java's Jsoup. This is actually not a Nokogiri question.
You are confusing parsing the HTML with retrieving the HTML as it is delivered over the network. It is important to remember that lots of functionality, like your button click, is enabled by JavaScript; these days many sites, like React sites, are built entirely by JavaScript.
So when you execute this line:
doc = Nokogiri::HTML(html)
It is the html variable you have to concentrate on. Your html is NOT the same as the html that I would view from the same page in my browser.
To do any sort of reliable scraping of such pages, you have to use a headless browser that executes JavaScript. In Ruby terms, that used to mean using Poltergeist to drive PhantomJS, a headless version of the WebKit browser; PhantomJS became unsupported when Puppeteer and headless Chrome arrived.

Ruby: How do I parse links with Nokogiri with content/text all the same?

What I am trying to do: parse links from a website (http://nytm.org/made-in-nyc) that all have exactly the same text, "(hiring)", then write the list of links to a file, jobs.html. (If it is a violation to publish these websites I will quickly take down the direct URL; I thought it might be useful as a reference for what I am trying to do. First time posting on Stack.)
DOM Structure:
<article>
  <ol>
    <li>#waywire</li>
    <li><a href="http://1800Postcards.com" target="_self" class="vt-p">1800Postcards.com</a></li>
    <li>Adafruit Industries</li>
    <li><a href="http://www.adafruit.com/jobs/" target="_self" class="vt-p">(hiring)</a></li>
    etc...
What I have tried:
require 'nokogiri'
require 'open-uri'

def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.each { |link| puts link['href'] }
  begin
    file = File.open("./jobs.html", "w")
    file.write("#{results}")
  rescue IOError => e
  ensure
    file.close unless file == nil
  end
  puts hire_links
end

find_jobs
Here is a Gist
Example Result:
[344] #<Nokogiri::XML::Element:0x3fcfa2e2276c name="a" attributes=[#<Nokogiri::XML::Attr:0x3fcfa2e226e0 name="href" value="http://www.zocdoc.com/careers">, #<Nokogiri::XML::Attr:0x3fcfa2e2267c name="target" value="_blank">] children=[#<Nokogiri::XML::Text:0x3fcfa2e1ff1c "(hiring)">]>
So it successfully writes these entries into the jobs.html file, but as inspected XML elements rather than plain links. I'm not sure how to target just the href value and build a link from that, or where to go from here. Thanks!
The problem is with how results is defined. results is an array of Nokogiri::XML::Element:
results = hire_links.each{|link| puts link['href']}
p results.class
#=> Array
p results.first.class
#=> Nokogiri::XML::Element
When you go to write the Nokogiri::XML::Element to the file, you get the results of inspecting it:
puts results.first.inspect
#=> "#<Nokogiri::XML::Element:0x15b9694 name="a" attributes=...."
Given that you want the href attribute of each link, you should collect that in the results instead:
results = hire_links.map{ |link| link['href'] }
Assuming you want each href/link displayed as a line in the file, you can join the array:
File.write('./jobs.html', results.join("\n"))
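The each-versus-map distinction is easy to check in plain Ruby, with hashes standing in for Nokogiri elements:

```ruby
# Hashes standing in for Nokogiri::XML::Element objects with an href attribute.
links = [{ 'href' => 'http://www.20x200.com/jobs/' },
         { 'href' => 'http://www.8coupons.com/home/jobs' }]

from_each = links.each { |link| link['href'] } # each returns its receiver
from_map  = links.map  { |link| link['href'] } # map returns the block's results

p from_each.equal?(links)
# => true
p from_map
# => ["http://www.20x200.com/jobs/", "http://www.8coupons.com/home/jobs"]
```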
The modified script:
require 'nokogiri'
require 'open-uri'

def find_jobs
  doc = Nokogiri::HTML(open('http://nytm.org/made-in-nyc'))
  hire_links = doc.css("a").select { |link| link.text == "(hiring)" }
  results = hire_links.map { |link| link['href'] }
  File.write('./jobs.html', results.join("\n"))
end

find_jobs
#=> produces a jobs.html with:
#=> http://www.20x200.com/jobs/
#=> http://www.8coupons.com/home/jobs
#=> http://jobs.about.com/index.html
#=> ...
Try using Mechanize. It leverages Nokogiri, and you can do something like
require 'mechanize'
browser = Mechanize.new
page = browser.get('http://nytm.org/made-in-nyc')
links = page.links_with(text: /\(hiring\)/)
Then you will have an array of link objects that you can get whatever info you want. You can also use the link.click method that Mechanize provides.

How do I print XPath value?

I want to print the contents of an XPath node. Here is what I have:
require "mechanize"
agent = Mechanize.new
agent.get("http://store.steampowered.com/promotion/snowglobefaq")
puts agent.xpath("//*[@id='item_52b3985a70d58']/div[4]")
This returns: <main>: undefined method xpath for #<Mechanize:0x2fa18c0> (NoMethodError).
I just started using Mechanize and have no idea what I'm doing, however, I've used Watir and thought this would work but it didn't.
You can use Nokogiri to parse the page after retrieving it. Here is example code:
m = Mechanize.new
result = m.get("http://google.com")
html = Nokogiri::HTML(result.body)
# Do whatever is needed with the divs; here their content is mapped into an array.
divs = html.xpath('//div').map { |div| div.content }
There are two things wrong:
The ID doesn't exist on that page. Try this to see the list of tag IDs available:
require "open-uri"
require 'nokogiri'
doc = Nokogiri::HTML(open("http://store.steampowered.com/promotion/snowglobefaq"))
puts doc.search('[id*="item"]').map{ |n| n['id'] }.sort
The correct chain of methods is agent.page.xpath.
Because there is no sample HTML showing exactly which tag you want, we can't help you much.

Using Nokogiri to scrape a value from Yahoo Finance?

I wrote a simple script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text
puts "#{name} - #{market_cap} - #{ebit}"
The script grabs three values from Yahoo Finance. The problem is that the ebit XPath returns nil. I got the XPath by copying and pasting from the Chrome developer tools.
This is the page I'm trying to get the value from http://au.finance.yahoo.com/q/bs?s=MYGN and the actual value is 483,992 in the total current assets row.
Any help would be appreciated, especially if there is a way to get this value with CSS selectors.
Nokogiri supports jQuery-style CSS extensions, which make this straightforward:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://au.finance.yahoo.com/q/bs?s=MYGN"))
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')
puts ebit
# >> 483,992
I'm using the <strong> tag as a place-marker with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td> and grabbing its text, and finally cleaning it up with gsub(/[^,\d]+/, ''), which removes everything that isn't a digit or a comma.
Nokogiri supports a number of jQuery's JavaScript extensions, which is why :contains works.
This seems to work for me
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.tr(",","").to_i
#=> 483992
Or as a string
doc.css("table.yfnc_tabledata1 tr[11] td[2]").text.strip.gsub(/\u00A0/,"")
#=> "483,992"

Ruby Regex Help

I want to extract the members' home-site links from a site.
The links look like this:
<a href="http://www.ptop.se" target="_blank">
I tested my regex with this site:
http://www.rubular.com/
<a href="(.*?)" target="_blank">
It should output http://www.ptop.se.
Here is the code:
require 'open-uri'

url = "http://itproffs.se/forumv2/showprofile.aspx?memid=2683"
open(url) { |page| content = page.read()
  links = content.scan(/<a href="(.*?)" target="_blank">/)
  links.each { |link| puts #{link}
  }
}
If you run this, it doesn't work. Why not?
I would suggest that you use one of the good Ruby HTML/XML parsing libraries, e.g. Hpricot or Nokogiri.
If you need to log in on the site, you might be interested in a library like WWW::Mechanize.
Code example:
require "open-uri"
require "hpricot"
require "nokogiri"

url = "http://itproffs.se/forumv2"

# Using Hpricot
doc = Hpricot(open(url))
doc.search("//a[@target='_blank']").each { |user| puts "found #{user.inner_html}" }

# Using Nokogiri
doc = Nokogiri::HTML(open(url))
doc.xpath("//a[@target='_blank']").each { |user| puts "found #{user.text}" }
Several issues with your code:
- I don't know what you mean by using #{link}. But if you want to interpolate the link, make sure you wrap it with quotes, i.e. "#{link}".
- String#scan accepts a block. Use it to loop through the matches.
- The page you are trying to access does not return any links that the regex would match anyway.
Here's something that would work:
require 'open-uri'

url = "http://itproffs.se/forumv2/"
open(url) do |page|
  content = page.read()
  content.scan(/<a href="(.*?)" target="_blank">/) do |match|
    match.each { |link| puts link }
  end
end
There are better ways to do it, I'm sure, but this should work. Hope it helps.
