How to scrape Google search using Nokogiri - ruby

I'd like to scrape a few Google search pages for the "Did you mean" spelling checking section.
For example, if I search for "cardiovascular diesese", it will be linked to
https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=cardiovascular%20diesese
I want to scrape the "Search instead for cardiovascular diesese" part.
How can I do this using Nokogiri and XPath?

If you can use the non-JavaScript URL, this should work:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("https://www.google.com/search?q=cardiovascular+diesese"))
doc.xpath("string(//span[@class='spell_orig']/a)") # => "cardiovascular diesese"
If you can render JavaScript and need to use your original example URL, this XPath selector should work once you've loaded the document into Nokogiri (tested with $x in Chrome):
doc.xpath("//a[@class='spell_orig'][boolean(@href)]/text()") # => "cardiovascular diesese"

Since you want to extract only a single result, you can use the at_xpath shortcut, which under the hood still calls xpath(...).first (at_css works the same way). To locate an element via Dev Tools, go to the Elements tab -> right-click the element -> Copy -> Copy XPath.
To grab text:
doc.at_xpath("//*[@id='fprs']/a[2]/text()") #=> cardiovascular diesese
# or you can use at_css, which is faster for class names
doc.at_css("a.spell_orig").text #=> cardiovascular diesese
To grab link:
doc.at_xpath("//*[@id='fprs']/a[2]/@href") #=> /search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjqhZfu0KbyAhVLRKwKHWbBDNsQvgUoAXoECAEQMg
# or you can use at_css, which is faster for class names
doc.at_css("a.spell_orig")[:href] #=> /search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjqhZfu0KbyAhVLRKwKHWbBDNsQvgUoAXoECAEQMg
Code and example in the online IDE:
require 'nokogiri'
require 'httparty'

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
  q: "cardiovascular diesese",
  hl: "en"
}

response = HTTParty.get("https://www.google.com/search",
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)

puts doc.at_xpath("//*[@id='fprs']/a[2]/text()"),
     "https://www.google.com#{doc.at_xpath("//*[@id='fprs']/a[2]/@href")}"

# or at_css, which is faster for class names and avoids brittle hand-written XPath
puts doc.at_css("a.spell_orig").text,
     doc.at_css("a.spell_orig")[:href]
-------
=begin
cardiovascular diesese
https://www.google.com/search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjS5Mevr6vyAhWMK80KHXg8AwoQvgUoAXoECAEQMQ
cardiovascular diesese
/search?hl=en&q=cardiovascular+diesese&nfpr=1&sa=X&ved=2ahUKEwjS5Mevr6vyAhWMK80KHXg8AwoQvgUoAXoECAEQMQ
=end
Alternatively, you can use the Google Organic Results API from SerpApi. It's a paid API with a free plan that supports different languages.
The difference is that you don't have to figure out how to extract elements from the page yourself. All that needs to be done is to iterate over structured JSON.
Code to integrate:
require 'google_search_results'
params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "cardiovascular diesese",
  hl: "en"
}
search = GoogleSearch.new(params)
hash_results = search.get_hash
search_instead_for = hash_results[:search_information][:spelling_fix]
puts search_instead_for
-------
#=> cardiovascular disease
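Working with the structured response is then plain hash access. A minimal sketch using only the standard library, with a trimmed, hypothetical payload shaped like the fields used above (search_information / spelling_fix):

```ruby
require 'json'

# Trimmed, hypothetical payload mimicking the structured response shape.
body = '{"search_information":{"spelling_fix":"cardiovascular disease"}}'

results = JSON.parse(body, symbolize_names: true)
puts results[:search_information][:spelling_fix]  # => cardiovascular disease
```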
Disclaimer, I work for SerpApi.

Related

Mechanize Rails - Web Scraping - Server responds with JSON - How to Parse URL from to Download CSV

I am new to Mechanize and trying to overcome what is probably a very obvious problem.
I put together a short script to auth on an external site, then click a link that generates a CSV file dynamically.
I have finally got it to click on the export button, however, it returns an AWS URL.
I'm trying to get the script to download said CSV from this JSON Response (seen below).
Myscript.rb
require 'mechanize'
require 'logger'
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'zlib'
USERNAME = "myemail"
PASSWORD = "mysecret"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
mechanize = Mechanize.new do |a|
  a.user_agent = USER_AGENT
end
form_page = mechanize.get('https://XXXX.XXXXX.com/signin')
form = form_page.form_with(:id =>'login')
form.field_with(:id => 'user_email').value=USERNAME
form.field_with(:id => 'user_password').value=PASSWORD
page = form.click_button
donations = mechanize.get('https://XXXXX.XXXXXX.com/pages/ACCOUNT/statistics')
puts donations.body
donations = mechanize.get('https://xxx.siteimscraping.com/pages/myaccount/statistics')
bs_csv_download = page.link_with(:text => 'Download CSV')
JSON response from the website containing the link to the CSV I need to parse and download via Mechanize and/or Nokogiri:
{"message":"Find your report at https://s3.amazonaws.com/reports.XXXXXXX.com/XXXXXXX.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256\u0026X-Amz-Credential=AKIAIKW4BJKQUNOJ6D2A%2F20190228%2Fus-east-1%2Fs3%2Faws4_request\u0026X-Amz-Date=20190228T025844Z\u0026X-Amz-Expires=86400\u0026X-Amz-SignedHeaders=host\u0026X-Amz-Signature=b19b6f1d5120398c850fc03c474889570820d33f5ede5ff3446b7b8ecbaf706e"}
I very much appreciate any help.
You could parse it as JSON and then retrieve a substring from the response (assuming it always responds in the same format):
require 'json'
...
bs_csv_download = page.link_with(:text => 'Download CSV')
json_response = JSON.parse(bs_csv_download.click.body)
direct_link = json_response["message"][20..-1]
mechanize.get(direct_link).save('file.csv')
We're taking the substring starting at the 20th character of the "message" value with [20..-1] (-1 means up to the end of the string), which skips the "Find your report at " prefix — exactly 20 characters.
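A slightly more robust variant (plain Ruby, using the JSON body from the question, with the long signed URL shortened to a hypothetical one for readability) extracts the URL with a regex instead of a hard-coded offset, so it keeps working if the prefix wording changes:

```ruby
require 'json'

# JSON body shaped like the question's response; the signed S3 URL is
# shortened to a hypothetical one here.
body = '{"message":"Find your report at https://s3.amazonaws.com/reports.example.com/report.csv?X-Amz-Expires=86400"}'

message = JSON.parse(body)["message"]

# Option 1: hard-coded offset ("Find your report at " is 20 characters long)
url1 = message[20..-1]

# Option 2: pull out the first http(s) URL with a regex
url2 = message[%r{https?://\S+}]

puts url1
puts url2  # same URL, without depending on the prefix length
```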

Ruby - nokogiri, open-uri - Fail to parse page [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
Improve this question
This code works on some pages, like klix.ba, but I can't figure out why it doesn't work on others.
There is no error to explain what went wrong, nothing.
If puts page works, which means I can fetch and parse the page, why can't I get single elements?
require 'nokogiri'
require 'open-uri'
url = 'http://www.olx.ba/'
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
page = Nokogiri::XML(open(url,'User-Agent' => user_agent), nil, "UTF-8")
#puts page - This line work
puts page.xpath('a')
First of all, why are you parsing it as XML?
The following should be correct, considering your page is an HTML website:
page = Nokogiri::HTML(open(url,'User-Agent' => user_agent), nil, "UTF-8")
Furthermore, if you want to extract all the links (a tags), this is how:
page.css('a').each do |element|
  puts element
end
If you want to parse content from a web page, you need to do this:
require 'nokogiri'
require 'open-uri'
url = 'http://www.olx.ba/'
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
page = Nokogiri::HTML(open(url,'User-Agent' => user_agent), nil, "UTF-8")
# puts page - this line works
puts page.xpath('//a')
Take a look at the Nokogiri documentation.
One thing I would suggest is to set a breakpoint in your code (probably right after assigning page). Take a look at the pry-debugger gem.
So I would do something like this:
require 'nokogiri'
require 'open-uri'
require 'pry' # require the necessary library
url = 'http://www.olx.ba/'
user_agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7"
page = Nokogiri::HTML(open(url,'User-Agent' => user_agent), nil, "UTF-8")
binding.pry # stop execution here in your code (breakpoint)
#puts page - This line work
puts page.xpath('a')

Trouble scraping Google trends using Capybara and Poltergeist

I want to get the top trending queries in a particular category on Google Trends. I could download the CSV for that category but that is not a viable solution because I want to branch into each query and find the trending sub-queries for each.
I am unable to capture the contents of the following table, which contains the top 10 trending queries for a topic. Also for some weird reason taking a screenshot using capybara returns a darkened image.
<div id="TOP_QUERIES_0_0table" class="trends-table">
Please run the code on the Ruby console to see it working. Capturing elements/screenshot works fine for facebook.com or google.com but doesn't work for trends.
I am guessing this has to do with the table getting generated dynamically on page load but I'm not sure if that should block capybara from capturing the elements already loaded on the page. Any hints would be very valuable.
require 'capybara/poltergeist'
require 'capybara/dsl'
require 'csv'
class PoltergeistCrawler
  include Capybara::DSL

  def initialize
    Capybara.register_driver :poltergeist_crawler do |app|
      Capybara::Poltergeist::Driver.new(app, {
        :js_errors => false,
        :inspector => false,
        phantomjs_logger: open('/dev/null')
      })
    end
    Capybara.default_wait_time = 3
    Capybara.run_server = false
    Capybara.default_driver = :poltergeist_crawler
    page.driver.headers = {
      "DNT" => 1,
      "User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
    }
  end

  # handy to peek into what the browser is doing right now
  def screenshot(name="screenshot")
    page.driver.render("public/#{name}.jpg", full: true)
  end

  # find("path") and all("path") work ok for most cases. Sometimes I need more control, like finding hidden fields
  def doc
    Nokogiri.parse(page.body)
  end
end
crawler = PoltergeistCrawler.new
url = "http://www.google.com/trends/explore#cat=0-45&geo=US&date=today%2012-m&cmpt=q"
crawler.visit url
crawler.screenshot
crawler.find(:xpath, "//div[@id='TOP_QUERIES_0_0table']")
Capybara::ElementNotFound: Unable to find xpath "//div[@id='TOP_QUERIES_0_0table']"
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:41:in `block in find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/base.rb:84:in `synchronize'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:30:in `find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/session.rb:676:in `block (2 levels) in '
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/dsl.rb:51:in `block (2 levels) in <module:DSL>'
from (irb):45
from /Users/karan/.rbenv/versions/1.9.3-p484/bin/irb:12:in `'
The JavaScript error was due to an incorrect User-Agent. Once I changed the User-Agent to that of my Chrome browser, it worked:
"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"

Using Nokogiri to scrape a value from Yahoo Finance?

I wrote a simple script:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://au.finance.yahoo.com/q/bs?s=MYGN"
doc = Nokogiri::HTML(open(url))
name = doc.at_css("#yfi_rt_quote_summary h2").text
market_cap = doc.at_css("#yfs_j10_mygn").text
ebit = doc.at("//*[@id='yfncsumtab']/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody/tr[11]/td[2]/strong").text
puts "#{name} - #{market_cap} - #{ebit}"
The script grabs three values from Yahoo Finance. The problem is that the ebit XPath returns nil. I got the XPath by copying it from the Chrome developer tools.
This is the page I'm trying to get the value from http://au.finance.yahoo.com/q/bs?s=MYGN and the actual value is 483,992 in the total current assets row.
Any help would be appreciated, especially if there is a way to get this value with CSS selectors.
Nokogiri supports jQuery-style CSS extensions such as :contains, so this works:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://au.finance.yahoo.com/q/bs?s=MYGN"))
ebit = doc.at('strong:contains("Total Current Assets")').parent.next_sibling.text.gsub(/[^,\d]+/, '')
puts ebit
# >> 483,992
I'm using the <strong> tag as a place-marker with the :contains pseudo-class, then backing up to the containing <td>, moving to the next <td> and grabbing its text, then finally cleaning it up with gsub(/[^,\d]+/, ''), which removes everything that isn't a digit or a comma.
Nokogiri supports a number of jQuery's JavaScript extensions, which is why :contains works.
This seems to work for me
doc.css("table.yfnc_tabledata1 tr:nth-of-type(11) td:nth-of-type(2)").text.tr(",","").to_i
#=> 483992
Or as a string
doc.css("table.yfnc_tabledata1 tr:nth-of-type(11) td:nth-of-type(2)").text.strip.gsub(/\u00A0/,"")
#=> "483,992"

Scraping track data from HTML?

I'd like to be able to scrape data from a track list page at 1001tracklists. A URL example is:
http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html
Here is an example of how the data is displayed on the page:
Above & Beyond - Black Room Boy (Above & Beyond Club Mix) [ANJUNABEATS]
I'd like to pull out all the songs from this page in the following format:
$byArtist - $name [$publisher]
After reviewing the HTML for this page, it appears the content I am after is stored in HTML5 meta microdata format:
<td class="" id="tlptr_433662">
<a name="tlp_433662"></a>
<div itemprop="tracks" itemscope itemtype="http://schema.org/MusicRecording" id="tlp5_content">
<meta itemprop="byArtist" content="Above & Beyond">
<meta itemprop="name" content="Black Room Boy (Above & Beyond Club Mix)">
<meta itemprop="publisher" content="ANJUNABEATS">
<meta itemprop="url" content="/track/103905_above-beyond-black-room-boy-above-beyond-club-mix/index.html">
<span class="tracklistTrack floatL" id="tr_103905">Above & Beyond - Black Room Boy (Above & Beyond Club Mix) </span><span class="floatL">[ANJUNABEATS]</span>
<div id="tlp5_actions" class="floatL" style="margin-top:1px;">
There is a CSS selector with a "tlp_433662" value. Each song on the page will have its own unique id. One will have "tlp_433662" and the next will have "tlp_433628" or something similar.
Is there a way to extract all songs listed on the tracklist page using Nokogiri and XPath?
I will probably want to "do" an "each" on my "data" listed below so that the scraper loops over the data extracting each set of relevant data. Here is the start of my Ruby program:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html"
data = Nokogiri::HTML(open(url))
# what to do next? print out the xpath loop code which extracts my data.
# code block I need help with
data.xpath(.........).each do |block|
  block.xpath("...........").each do |span|
    puts stuff printing out what I want.
  end
end
My ultimate goal, which I know how to do, is to take this Ruby script to Sinatra to "webify" the data and add some nice Twitter bootstrap CSS as shown in this youtube video: http://www.youtube.com/watch?v=PWI1PIvy4A8
Can you help me with the XPath code block so that I can scrape the data and print the array?
Here's some code to gather the information into an array of hashes.
I prefer using CSS accessors over XPath, because they're more readable if you have any HTML/CSS or jQuery experience.
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'))
data = doc.search('tr.tlpItem div[itemtype="http://schema.org/MusicRecording"]').each_with_object([]) do |div, array|
  hash = div.search('meta').each_with_object({}) do |m, h|
    h[m['itemprop']] = m['content']
  end

  link = div.at('span a')
  hash['tracklistTrack'] = [link['href'], link.text]

  title = div.at('span.floatL a')
  hash['title'] = [title['href'], title.text]

  array << hash
end
pp data[0, 2]
Which outputs a subset of the page's data. After some massaging the structure looks like this:
[
  {
    "byArtist"=>"Markus Schulz",
    "name"=>"The Spiritual Gateway (Transmission 2013 Theme)",
    "publisher"=>"COLDHARBOUR RECORDINGS",
    "url"=>"/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
    "tracklistTrack"=>[
      "/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
      "Markus Schulz - The Spiritual Gateway (Transmission 2013 Theme)"
    ],
    "title"=>[
      "/track/108928_markus-schulz-the-spiritual-gateway-transmission-2013-theme/index.html",
      "Markus Schulz - The Spiritual Gateway (Transmission 2013 Theme)"
    ]
  },
  {
    "byArtist"=>"Lange & Audrey Gallagher",
    "name"=>"Our Way Home (Noah Neiman Remix)",
    "publisher"=>"LANGE RECORDINGS",
    "url"=>"/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
    "tracklistTrack"=>[
      "/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
      "Lange & Audrey Gallagher - Our Way Home (Noah Neiman Remix)"
    ],
    "title"=>[
      "/track/119667_lange-audrey-gallagher-our-way-home-noah-neiman-remix/index.html",
      "Lange & Audrey Gallagher - Our Way Home (Noah Neiman Remix)"
    ]
  }
]
require 'nokogiri'
require 'rest-client'

url = 'http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'
page = Nokogiri::HTML(RestClient.get(url, :user_agent => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'), nil, 'UTF-8')

page.css('table.detail tr.tlpItem').each do |row|
  artist = row.css('meta[itemprop="byArtist"]').attr('content')
  name   = row.css('meta[itemprop="name"]').attr('content')
  puts "#{artist} - #{name}"
end
...a more advanced version that grabs all the meta info from the row and prints 'Artist - Song [Publisher]':
require 'nokogiri'
require 'rest-client'

url = 'http://www.1001tracklists.com/tracklist/25122_lange-intercity-podcast-115-2013-03-06.html'
page = Nokogiri::HTML(RestClient.get(url, :user_agent => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'), nil, 'UTF-8')

page.css('table.detail tr.tlpItem').each do |row|
  meta = row.search('meta').each_with_object({}) do |tag, hash|
    hash[tag['itemprop']] = tag['content']
  end
  puts "#{meta['byArtist']} - #{meta['name']} [#{meta['publisher'] || 'Unknown'}]"
end
You get the picture for the rest of the properties. You will need to do some error/exists? checking because some songs don't have all the properties, but this should get you on the right track. I've also used the rest-client gem, so feel free to use whatever you want to retrieve the page.
There is a free web service which scrapes all 400+ schema.org classes from a given URL and returns them as JSON:
http://scrappy.netfluid.org/
