How to extract data from a dynamic collapsing table with hidden elements using Nokogiri and Ruby

I am trying to scrape the following website:
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get into the collapsed data/gridcell data.
I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.

Although Layon Ferreira's answer already states the problem, it does not provide the steps needed to load the data.
As the linked answer already says, the data is loaded asynchronously. This means that the data is not present in the initial page and only appears once the browser's JavaScript engine executes the page's code.
Open the browser's development tools and go to the "Network" tab. Clear out all requests, then refresh the page. You'll see a list of all requests made. If you're looking for asynchronously loaded data, the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This seems to be caused by a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> ""
format '%X', response.body[0].ord
#=> "FEFF"
To handle the BOM correctly, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet using Ruby 2.7, you can strip the BOM manually instead; however, the former is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
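As a self-contained illustration of the BOM handling above, the sketch below uses a hard-coded sample string in place of the HTTP response, so it runs without network access. The helper name parse_json_without_bom and the sample row are made up for this example, and delete_prefix assumes Ruby 2.5 or later.

```ruby
require 'json'

# Hypothetical helper (not part of HTTParty or JSON): interpret the raw bytes
# as UTF-8, drop a leading BOM if there is one, then parse the rest as JSON.
def parse_json_without_bom(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  JSON.parse(text.delete_prefix("\uFEFF"))
end

# Made-up sample standing in for response.body: BOM bytes followed by JSON.
sample = "\xEF\xBB\xBF".b + '[{"Jurisdiction":"Alabama","Cases Reported":10145}]'.b

rows = parse_json_without_bom(sample)
puts rows.first['Jurisdiction']
```

Input without a BOM passes through unchanged, so the helper can be applied to every response body.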

That page is using AJAX to load its data.
In that case you can use Watir to fetch the page using a browser,
as answered here: https://stackoverflow.com/a/13792540/2784833
Another way is to get the data from the API directly.
You can see the other endpoints by checking the Network tab in your browser's developer tools.

I replicated your code and found a couple of errors that you might have made.
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around the value of your url variable, i.e.
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it just worked fine for me.
Also, if you're trying to get the Covid-19 data you might want to use these APIs
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here

Related

How to scrape pages which have lazy loading

Here is the code I used to parse the web page. I ran it in the Rails console, but I am not getting any output. The site I want to scrape uses lazy loading.
require 'nokogiri'
require 'open-uri'
page = 1
while true
  url = "http://www.justdial.com/functions"+"/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits"+"&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"
  doc = Nokogiri::HTML(open(url))
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  d = doc.css(".rslwrp")
  d.each do |t|
    puts t.css(".jrcw").text
    puts t.css("span.jcn").text
    puts t.css(".jaid").text
    puts t.css(".estd").text
    page+=1
  end
end
You have 2 options here:
Switch from pure HTTP scraping to a tool that supports JavaScript evaluation, such as Capybara (with a proper driver selected). This can be slow, since you're running a headless browser under the hood, and you'll have to set some timeouts or find another way to make sure the blocks of text you're interested in are loaded before you start scraping.
The second option is to use the web developer console to figure out how those blocks of text are loaded (which AJAX calls, their parameters, etc.) and implement the same calls in your scraper. This is a more advanced approach, but it performs better, since you avoid the extra work that option 1 incurs.
Have a nice day!
UPDATE:
Your code above doesn't work because the response is HTML code wrapped in a JSON object, while you're trying to parse it as raw HTML. It looks like this:
{
"error": 0,
"msg": "request successful",
"paidDocIds": "some ids here",
"itemStartIndex": 20,
"lastPageNum": 50,
"markup": 'LOTS AND LOTS AND LOTS OF MARKUP'
}
What you need is to unwrap the JSON and then parse the markup as HTML:
require 'json'
json = JSON.parse(URI.open(url).read) # URI.open is needed on Ruby 3.0+; make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
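The checks hinted at in the comments above could look roughly like the sketch below. It only covers the unwrapping step and uses a hard-coded sample body in the shape shown earlier, so it runs without network access. The helper name extract_markup is made up, and treating a non-zero "error" value as a failure is an assumption about this API.

```ruby
require 'json'

# Hypothetical helper: takes the raw response body and returns the HTML
# stored in the "markup" field, or raises if the payload looks wrong.
def extract_markup(body)
  payload = JSON.parse(body)
  # Assumption: a non-zero "error" value signals a failed request.
  raise "API reported error: #{payload['msg']}" unless payload['error'] == 0

  markup = payload['markup'].to_s
  raise 'response contained no markup' if markup.empty?

  markup
end

# Made-up sample in the shape of the response shown above.
sample = '{"error": 0, "msg": "request successful", "markup": "<span class=\"jcn\">Example</span>"}'
puts extract_markup(sample)
```

The returned string can then be handed to Nokogiri::HTML exactly as in the snippet above.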
I'd also advise against using open-uri, since your code may become vulnerable if you use dynamic URLs because of the way open-uri works (read the linked article for the details). Use a more full-featured library such as HTTParty or RestClient instead.
UPDATE 2: Minimal working script for me:
require 'json'
require 'open-uri'
require 'nokogiri'
url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'
json = JSON.parse(URI.open(url).read) # URI.open is needed on Ruby 3.0+; make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi

How to parse a webpage in Ruby without any library or gem?

I want to use the API of a website in a Ruby script, and the only thing the API returns is a number over HTTPS. Nothing more, not even tags, so I was wondering if there is a way to get that number as a string or integer in my script without using any XML parsing library or gem like REXML, hpricot or libXML, because the pages I want to parse are, as I said, extremely basic...
If I understand correctly, a request to https://www.website.com/api/getid returns 2.
Then, I guess this would do:
require 'net/https'
require 'uri'
def open(url)
  Net::HTTP.get(URI.parse(url))
end
response = open("https://www.website.com/api/getid")
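Since the body comes back as a string, it still has to be converted if you want an integer. A minimal sketch of that step, using a hard-coded body instead of a real request (the helper name parse_id is made up for illustration):

```ruby
# Hypothetical helper: convert a plain-text body such as "2\n" to an Integer.
# Integer() raises ArgumentError on non-numeric input instead of silently
# returning 0 the way String#to_i does.
def parse_id(body)
  Integer(body.strip, 10)
end

puts parse_id("2\n")
```

Using Integer() rather than to_i means an unexpected body, such as an HTML error page, fails loudly.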
EDIT
You'll find many useful examples here.
As it is mentioned in the link above, HTTParty is quite popular. An example:
require 'httparty'
response = HTTParty.get('http://twitter.com/statuses/public_timeline.json')
puts response.body, response.code, response.message, response.headers.inspect

How do I use Nokogiri to parse a bit.ly stats page?

I'm trying to parse the Twitter usernames from a bit.ly stats page using Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://bitly.com/U026ue+/global'))
twitter_accounts = []
shares = doc.xpath('//*[@id="tweets"]/li')
shares.map do |tweet|
twitter_accounts << tweet.at_css('.conv.tweet.a')
end
puts twitter_accounts
My understanding is that Nokogiri will save shares in some form of tree structure, which I can use to drill down into, but my mileage is varying.
That data is coming in from an Ajax request with a JSON response. It's pretty easy to get at though:
require 'json'
url = 'http://search.twitter.com/search.json?_usragnt=Bitly&include_entities=true&rpp=100&q=nowness.com%2Fday%2F2012%2F12%2F6%2F2643'
hash = JSON.parse open(url).read
puts hash['results'].map{|x| x['from_user']}
I got that URL by loading the page in Chrome and then looking at the network panel. I also removed the timestamp and callback parameters just to clean things up a bit.
Actually, Eric Walker was onto something. If you look at doc, the section where the tweets are supposed to be looks like:
<h2>Tweets</h2>
<ul id="tweets"></ul>
</div>
This is likely because they're generated by some JavaScript call which Nokogiri isn't executing. One possible solution is to use watir to traverse to the page, load the JavaScript and then save the HTML.
Here is a script that accomplishes just that. Note that you had some issues with your XPath arguments which I've since solved, and that watir will open a new browser every time you run this script:
require 'watir'
require 'nokogiri'
browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'
doc = Nokogiri::HTML.parse(browser.html)
twitter_accounts = []
shares = doc.xpath('//li[contains(@class, "tweet")]/a')
shares.each do |tweet|
  twitter_accounts << tweet.attr('title')
end
puts twitter_accounts
browser.close
You can also use headless to prevent a window from opening.

Use ruby mechanize to get data from foursquare

I am trying to use ruby and Mechanize to parse data on foursquare's website. Here is my code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://foursquare.com')
page = agent.click page.link_with(:text => /Log In/)
form = page.forms[1]
form.F12778070592981DXGWJ = ARGV[0]
form.F1277807059296KSFTWQ = ARGV[1]
page = form.submit form.buttons.first
puts page.body
But when I run this code, the following error popped up:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/mechanize-2.0.1/lib/mechanize/form.rb:162:in
`method_missing': undefined method `F12778070592981DXGWJ='
for #<Mechanize::Form:0x2b31f70> (NoMethodError)
from four.rb:10:in `<main>'
I checked and found that the two field names on the form object, "F12778070592981DXGWJ" and "F1277807059296KSFTWQ", change every time I open foursquare's webpage.
Has anyone had the same problem before, where the field names change every time you open a webpage? How should I solve this problem?
Our project is about parsing the data on foursquare, so I need to be able to log in first.
Mechanize is useful for sites which don't expose an API, but Foursquare has an established REST API already. I'd recommend using one of the Ruby libraries, perhaps foursquare2. These libraries abstract away things like authentication, so you just have to register your app and use the provided keys.
Instead of indexing the form fields by their name, just index them by their order. That way you don't have to worry about the name that changes on each request:
form.fields[0].value = ARGV[0]
form.fields[1].value = ARGV[1]
...
However like dwhalen said, using the REST API is probably a much better way. That's why it's there.

Anyone know of a caching plugin for Ruby Mechanize?

I have a Mechanize based Ruby script to scrape a website. I am hoping to speed it up by caching the downloaded HTML pages locally to make the whole "tweak output -> run -> tweak output" cycle quicker. I would prefer not to have to install an external cache on the machine just for this script. The ideal solution would plugin to Mechanize and transparently cache fetched pages, images and so on.
Anyone know of a library that will do this? Or another way of achieving the same outcome (script runs much quicker second time round)?
A good way of doing this type of thing is to use the (AWESOME) VCR gem.
Here's an example of how you would do it:
require 'vcr'
require 'mechanize'
# Setup VCR's configs. The cassette library directory is where
# all of your "recordings" are saved as YAML files.
VCR.configure do |c|
  c.cassette_library_dir = 'vcr_cassettes'
  c.hook_into :webmock
end
# Make a request...
# The first time you do this it will actually make the call out
# Subsequent calls will read the cassette file instead of hitting the network
VCR.use_cassette('google_homepage') do
  a = Mechanize.new
  a.get('http://google.com/')
end
As you can see... VCR records the communication as a YAML file on the first run:
mario$ find tester -mindepth 1 -maxdepth 3
tester/vcr_cassettes
tester/vcr_cassettes/google_homepage.yml
If you want to have VCR create new versions of the cassettes, just delete the corresponding file.
I'm not sure that caching the pages is going to help that much. What will help more is to have a record of previously visited URLs so you don't revisit them repeatedly. The page caching is moot because you should have already grabbed the important information when you saw the page the first time so all you need to do is check to see if you've seen it already. If you have, grab the summary information you care about and manipulate it as necessary.
I used to write analytical spiders using Perl's Mechanize. Ruby's Mechanize is based on it. Storing the previously visited URLs in SOME sort of cache was useful, like a hash, but, because apps crash or hosts go down mid-session, all the previous results would be gone. A real disk-based database was essential at that point.
I like Postgres, but even SQLite is a good choice. Whatever you use, get the important information on the drive where it can survive a restart or crash.
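To make the idea of a crash-safe record of visited URLs concrete, here is a minimal sketch. It swaps in Ruby's PStore (a simple file-backed store that ships with Ruby) rather than Postgres or SQLite, purely so the example has no external dependencies. The class name VisitedLog, the URL and the summary string are all made up for illustration.

```ruby
require 'pstore'
require 'tmpdir'

# Hypothetical helper: remembers visited URLs and a summary in a PStore file.
class VisitedLog
  def initialize(path)
    @store = PStore.new(path)
  end

  # Records the URL with its summary; returns false if it was already stored.
  def record(url, summary)
    @store.transaction do
      next false if @store.root?(url)

      @store[url] = summary
      true
    end
  end

  # Reads the stored summary back (nil if the URL was never recorded).
  def summary_for(url)
    @store.transaction(true) { @store[url] }
  end
end

path = File.join(Dir.mktmpdir, 'visited.pstore')
log = VisitedLog.new(path)
puts log.record('http://example.com/', 'title: Example') # first visit
puts log.record('http://example.com/', 'title: Example') # already seen
# A new object reading the same file simulates a restart after a crash.
puts VisitedLog.new(path).summary_for('http://example.com/')
```

For a long-running spider with many URLs, a real database as suggested above is likely the better fit; the sketch only shows the check-before-visit pattern.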
Something else I'd recommend is to use a YAML file for the configuration of your app. Put every parameter that is likely to be changed during the app's run in there. Then, write the app so it periodically checks that file's modification time and reloads it if there's been a change. That way, you can adjust its run-time behavior on the fly. I had to write a spider to analyze a Fortune 50 corporation's multiple websites several years ago. The app ran for three weeks spidering many different sites tied to that corporation, and because I could tweak the regex used to control which pages the app processed, I could fine-tune it without shutting down that app.
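A reloading configuration along those lines could be sketched as follows. The class name ReloadingConfig and the url_pattern key are invented for the example, and a temporary file stands in for the real config file.

```ruby
require 'yaml'
require 'tempfile'

# Hypothetical helper: re-reads a YAML file whenever its mtime changes.
class ReloadingConfig
  def initialize(path)
    @path = path
    @mtime = nil
    @data = {}
  end

  # Returns the current settings, reloading the file only if it changed.
  def current
    mtime = File.mtime(@path)
    if mtime != @mtime
      @data = YAML.safe_load(File.read(@path)) || {}
      @mtime = mtime
    end
    @data
  end
end

file = Tempfile.new(['spider', '.yml'])
file.write("url_pattern: 'products'\n")
file.flush

config = ReloadingConfig.new(file.path)
puts config.current['url_pattern']

# Simulate an edit made while the app is running.
File.write(file.path, "url_pattern: 'reviews'\n")
File.utime(Time.now, Time.now + 1, file.path) # force a visibly newer mtime
puts config.current['url_pattern']
```

Calling current at the top of each loop iteration is usually enough; checking the mtime is cheap compared with fetching a page.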
If you store some information about the page after the first request, you can rebuild the page later without having to re-request it from the server.
# 1) store the page information
# uri: a URI instance
# response: a hash of response headers
# body: a string
# code: the HTTP response code
page = agent.get(url)
uri, response, body, code = [page.uri, page.response, page.body, page.code]
# 2) rebuild the page, given the stored information
page = Mechanize::Page.new(uri, response, body, code, agent)
I've used this technique in spiders/scrapers so that the code can be tweaked without having to re-request all the pages. e.g.:
# agent: a Mechanize instance
# storage: must respond to [] and []=, and must accept and return arbitrary ruby objects.
# for in-memory storage, you could use a Hash.
# or, you could write something that is backed by a filesystem, mongodb, riak, redis, s3, etc...
# logger: a Logger instance
class Foobar < Struct.new(:agent, :storage, :logger)
  def get_cached(uri)
    cache_key = "_cache/#{uri}"
    if args = storage[cache_key]
      logger.debug("getting (cached) #{uri}")
      uri, response, body, code = args
      page = Mechanize::Page.new(uri, response, body, code, agent)
      agent.send(:add_to_history, page)
      page
    else
      logger.debug("getting (UNCACHED) #{uri}")
      page = agent.get(uri)
      storage[cache_key] = [page.uri, page.response, page.body, page.code]
      page
    end
  end
end
Which you could use like this:
require 'logger'
require 'pp'
require 'rubygems'
require 'mechanize'
storage = {}
foo = Foobar.new(Mechanize.new, storage, Logger.new(STDOUT))
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/ua")
foo.get_cached("http://ifconfig.me/encoding")
foo.get_cached("http://ifconfig.me/encoding")
pp storage
Which prints the following information:
D, [2013-10-19T14:13:32.019291 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.375649 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376822 #18107] DEBUG -- : getting (cached) http://ifconfig.me/ua
D, [2013-10-19T14:13:36.376910 #18107] DEBUG -- : getting (UNCACHED) http://ifconfig.me/encoding
D, [2013-10-19T14:13:52.830416 #18107] DEBUG -- : getting (cached) http://ifconfig.me/encoding
{"_cache/http://ifconfig.me/ua"=>
[#<URI::HTTP:0x007fe4ac94d098 URL:http://ifconfig.me/ua>,
{"date"=>"Sat, 19 Oct 2013 19:13:33 GMT",
"server"=>"Apache",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"87",
"connection"=>"close",
"content-type"=>"text/plain"},
"Mechanize/2.7.2 Ruby/2.0.0p247 (http://github.com/sparklemotion/mechanize/)\n",
"200"],
"_cache/http://ifconfig.me/encoding"=>
[#<URI::HTTP:0x007fe4ac99d2a0 URL:http://ifconfig.me/encoding>,
{"date"=>"Sat, 19 Oct 2013 19:13:48 GMT",
"server"=>"Apache",
"vary"=>"Accept-Encoding",
"content-encoding"=>"gzip",
"content-length"=>"42",
"connection"=>"close",
"content-type"=>"text/plain"},
"gzip,deflate,identity\n",
"200"]}
How about writing pages out to files, each page in an individual file, and separating the tweak and run cycles?
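One way that file-per-page idea might look is sketched below. The class name PageFileCache is made up, and the block passed to fetch stands in for the real download (for example agent.get(url).body with Mechanize), so the example runs without network access.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: stores each page body in its own file, named after a
# SHA-256 digest of the URL so that any URL maps to a safe filename.
class PageFileCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  # Returns the cached body for url; on a miss, calls the block to fetch
  # the body, writes it to disk, and returns it.
  def fetch(url)
    path = File.join(@dir, "#{Digest::SHA256.hexdigest(url)}.html")
    return File.read(path) if File.exist?(path)

    body = yield(url)
    File.write(path, body)
    body
  end
end

cache = PageFileCache.new(File.join(Dir.mktmpdir, 'pages'))
downloads = 0
# The lambda stands in for a real download and counts how often it runs.
fetcher = ->(url) { downloads += 1; "<html>#{url}</html>" }

2.times { cache.fetch('http://example.com/', &fetcher) }
puts downloads
```

With the pages on disk, the "run" step only has to happen once, and the "tweak output" step can re-read the files as often as needed.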
