How to scrape pages which have lazy loading - ruby

Here is the code I used to parse a web page. I ran it in the Rails console, but I am not getting any output. The site I want to scrape uses lazy loading.
require 'nokogiri'
require 'open-uri'
page = 1
while true
  url = "http://www.justdial.com/functions"+"/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits"+"&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"
  doc = Nokogiri::HTML(open(url))
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  d = doc.css(".rslwrp")
  d.each do |t|
    puts t.css(".jrcw").text
    puts t.css("span.jcn").text
    puts t.css(".jaid").text
    puts t.css(".estd").text
    page+=1
  end
end

You have 2 options here:
Switch from pure HTTP scraping to a tool that supports JavaScript evaluation, such as Capybara (with a suitable driver selected). This can be slow, since you're running a headless browser under the hood, and you'll have to set timeouts or find another way to make sure the blocks of text you're interested in have loaded before you start scraping.
The second option is to use the browser's developer console to figure out how those blocks of text are loaded (which AJAX calls are made, with which parameters, and so on) and reproduce those calls in your scraper. This is a more advanced approach, but it performs better, since you avoid the extra work that option 1 requires.
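For option 1, the "make sure the content is loaded" part usually comes down to polling with a timeout. Here is a minimal sketch of that idea in plain Ruby. The `wait_until` helper is hypothetical (it is not part of Capybara or any other library), and the block passed to it stands in for whatever check your browser driver offers.

```ruby
# Hypothetical helper: poll a block until it returns a truthy value or the
# timeout expires. Not part of Capybara or any other library.
def wait_until(timeout: 5.0, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for "has the lazy-loaded block appeared yet?":
# this fake check succeeds on the third attempt.
attempts = 0
found = wait_until(timeout: 2, interval: 0.01) do
  attempts += 1
  attempts >= 3 ? "loaded after #{attempts} checks" : nil
end
puts found
```

In a real scraper the block would ask the browser whether the element exists yet; the timeout keeps the script from hanging forever if it never appears.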
Have a nice day!
UPDATE:
Your code above doesn't work because the response is HTML wrapped in a JSON object, while you're trying to parse it as raw HTML. It looks like this:
{
"error": 0,
"msg": "request successful",
"paidDocIds": "some ids here",
"itemStartIndex": 20,
"lastPageNum": 50,
"markup": "LOTS AND LOTS AND LOTS OF MARKUP"
}
What you need to do is unwrap the JSON and then parse the markup as HTML:
require 'json'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
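The checks hinted at in the comments above could look roughly like this. It is only a sketch: `extract_markup` is a hypothetical helper, the sample bodies are made up, and it assumes (based on the sample response) that `"error": 0` means success.

```ruby
require 'json'

# Hypothetical helper: returns the HTML from the "markup" field, or nil when
# the body is not valid JSON, reports an error, or carries no markup.
def extract_markup(body)
  json = JSON.parse(body)
  return nil unless json.is_a?(Hash)
  return nil unless json['error'].to_i.zero? # assumption: 0 means success
  markup = json['markup'].to_s
  markup.empty? ? nil : markup
rescue JSON::ParserError
  nil
end

# Made-up response bodies for illustration.
ok  = '{"error": 0, "msg": "request successful", "markup": "<div id=\"ajax\">hi</div>"}'
bad = '{"error": 1, "msg": "made-up failure", "markup": ""}'

puts extract_markup(ok)
p extract_markup(bad)
p extract_markup('<html>not json</html>')
```

Whatever `extract_markup` returns can then be handed to `Nokogiri::HTML` as before; a `nil` tells you to stop or retry instead of parsing an empty document.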
I'd also advise against using open-uri: if you build URLs dynamically, your code may become vulnerable because of the way open-uri works (read the linked article for the details). Consider a more full-featured library such as HTTParty or RestClient instead.
UPDATE 2: Minimal working script for me:
require 'json'
require 'open-uri'
require 'nokogiri'
url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi
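To go from that single page back to the loop in the question, note that the original loop has `page+=1` inside the `each` block, so the page number advances once per result rather than once per page, and `while true` never stops. A possible shape for the pagination is sketched below. `each_page_markup` and `fake_fetch` are hypothetical names, the responses are stubbed with made-up data, and treating `lastPageNum` as the index of the final page is an assumption based on the sample response.

```ruby
require 'json'

# Hypothetical helper: walks pages 1, 2, ... and yields each page's markup.
# `fetch_page` is any callable that takes a page number and returns the raw
# response body, so the HTTP client can be swapped freely.
def each_page_markup(fetch_page, max_pages: 100)
  page = 1
  while page <= max_pages
    json = JSON.parse(fetch_page.call(page))
    break unless json['error'].to_i.zero? # assumption: 0 means success
    markup = json['markup'].to_s
    break if markup.empty?
    yield page, markup
    last = json['lastPageNum'].to_i       # assumption: index of the final page
    break if last.positive? && page >= last
    page += 1                             # advance once per page, not once per result
  end
end

# Stubbed responses standing in for the real endpoint (made-up data).
fake_fetch = lambda do |page|
  JSON.generate('error' => 0, 'lastPageNum' => 3, 'markup' => "<p>page #{page}</p>")
end

seen = []
each_page_markup(fake_fetch) { |page, markup| seen << [page, markup] }
p seen
```

In the real script, `fetch_page` would perform the HTTP request for the given page, and the block would pass `markup` to `Nokogiri::HTML`.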

Related

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

I am trying to scrape through the following website :
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get into the collapsed data/gridcell data.
I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.
Although Layon Ferreira's answer already identifies the problem, it does not provide the steps needed to load the data.
As already stated in the linked answer, the data is loaded asynchronously. This means the data is not present in the initial page and is only loaded when the browser's JavaScript engine executes the page's code.
When you open up the browser development tools, go to the "Network" tab. You can clear out all requests, then refresh the page. You'll get to see a list of all requests made. If you're looking for asynchronously loaded data the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This seems to be caused by a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> ""
format '%X', response.body[0].ord
#=> "FEFF"
To correctly handle the BOM Ruby 2.7 added the set_encoding_by_bom method to IO which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet using Ruby 2.7, you can remove the BOM manually instead, although the former is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
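The same manual removal can also be written without a regular expression. The sketch below uses `String#delete_prefix` (available from Ruby 2.5) on made-up bytes that imitate a body starting with a UTF-8 BOM; `strip_bom` is a hypothetical helper name.

```ruby
require 'json'

# Hypothetical helper: interpret the bytes as UTF-8 and drop a leading BOM
# (U+FEFF) if one is present. Strings without a BOM pass through unchanged.
def strip_bom(raw)
  raw.dup.force_encoding(Encoding::UTF_8).delete_prefix("\uFEFF")
end

# Made-up bytes imitating a response body that starts with a UTF-8 BOM.
with_bom    = "\xEF\xBB\xBF[{\"Jurisdiction\":\"Alabama\"}]".b
without_bom = '[{"Jurisdiction":"Alabama"}]'

p JSON.parse(strip_bom(with_bom))
p JSON.parse(strip_bom(without_bom))
```

Working on a copy (`dup`) leaves the original response body untouched, which helps if you still want to inspect the raw bytes afterwards.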
That page uses AJAX to load its data.
In that case you can use Watir to fetch the page through a browser,
as answered here: https://stackoverflow.com/a/13792540/2784833
Another way is to get the data from the API directly.
You can find the other endpoints by checking the Network tab in your browser's developer tools.
I replicated your code and found some errors that you might have made.
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around the value of your url variable, i.e.
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it just worked fine for me.
Also, if you're trying to get the Covid-19 data you might want to use these APIs
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here

Manipulate webpage and have it send back results using ruby

so here it is:
I use Ruby to get user input the easy way. Say I request 2 inputs:
input1 = gets.chomp
input2 = gets.chomp
Now I would like to send this information to, say, a search engine that takes these two options separately and performs the search. How can I do this? What APIs or gems would be helpful in this case?
I know that I can take these 2 inputs and insert them into the URL, but it's not that simple, because the URL structure is not constant and depends on the inputs. (I wouldn't want to do it this way anyway.)
It's been a long time since I last programmed in Ruby. I know how to access web pages and things like that, but I want to manipulate a page and receive the results back. Any ideas?
If you are talking about a site's front end that has no API access and no sophisticated JS logic, you could simply use the mechanize gem, which allows you to do something like:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://google.com/')
form = page.forms.first
form['field_name_1'] = input1
form['field_name_2'] = input2
page = agent.submit(form, form.buttons.first)
puts page
→ Check out the official documentation for more examples
If you are going to use a third-party REST API, you would do better to try something like faraday or another popular gem (depending on your taste and the particular task).
Please correct me if I misunderstood you.
From what I understand, you want to encode your two inputs in a URL, send them to an API and receive the results back.
You can use the Net::HTTP library from the Ruby stdlib. Here's the example with dynamic parameters from the docs:
require 'net/http'
uri = URI('http://example.com/index.html')
params = { :limit => 10, :page => 3 }
uri.query = URI.encode_www_form(params)
res = Net::HTTP.get_response(uri)
puts res.body if res.is_a?(Net::HTTPSuccess)
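Applied to the two inputs from the question, building the URL might look like the sketch below. The base URL and the parameter names `q` and `where` are placeholders rather than a real search API, and `build_search_uri` is a hypothetical helper. The point is that `URI.encode_www_form` takes care of escaping, so spaces and `&` in the user input cannot break the query string.

```ruby
require 'uri'

# Hypothetical helper: the base URL and the parameter names `q` and `where`
# are placeholders, not a real search API.
def build_search_uri(base, input1, input2)
  uri = URI(base)
  uri.query = URI.encode_www_form(q: input1, where: input2)
  uri
end

# In the real script these two values would come from gets.chomp.
uri = build_search_uri('http://example.com/search', 'ruby gems', 'New York & NJ')
puts uri
```

The resulting `uri` can be passed straight to `Net::HTTP.get_response`.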
Or you can use some gems to wrap it up for you. HTTParty seems quite popular. You can do it as simple as
HTTParty.get('http://foo.com/resource.json', query: {limit: 10})

How to parse a webpage in Ruby without any library or gem?

I want to use the API of a website in a Ruby script. The API returns only a number over HTTPS: nothing more, not even tags. So I was wondering whether there is a way to get that number as a string or integer in my script without using any XML parsing library or gem such as REXML, hpricot or libXML, because the web pages I want to parse are, as I said, extremely basic.
If I understand correctly, a request to https://www.website.com/api/getid returns 2.
Then I guess this would do:
require 'net/https'
require 'uri'
# Named fetch rather than open so that it does not shadow Kernel#open.
def fetch(url)
  Net::HTTP.get(URI.parse(url))
end
response = fetch("https://www.website.com/api/getid")
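The response body is a String, so if you need the number as an Integer you still have to convert it. A small sketch, assuming the body is just the digits plus possibly some whitespace; `parse_id` is a hypothetical helper name:

```ruby
# Hypothetical helper: turn a plain-text body such as "2\n" into an Integer,
# returning nil when the body is not a whole number.
def parse_id(body)
  Integer(body.to_s.strip, 10)
rescue ArgumentError
  nil
end

p parse_id("2\n")
p parse_id("  42 ")
p parse_id("<html>error</html>")
```

Using `Integer(...)` instead of `to_i` means an unexpected body (an error page, for example) gives `nil` rather than silently turning into `0`.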
EDIT
You'll find many useful examples here.
As it is mentioned in the link above, HTTParty is quite popular. An example:
require 'httparty'
response = HTTParty.get('http://twitter.com/statuses/public_timeline.json')
puts response.body, response.code, response.message, response.headers.inspect

How do I use Nokogiri to parse a bit.ly stats page?

I'm trying to parse the Twitter usernames from a bit.ly stats page using Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://bitly.com/U026ue+/global'))
twitter_accounts = []
shares = doc.xpath('//*[@id="tweets"]/li')
shares.map do |tweet|
  twitter_accounts << tweet.at_css('.conv.tweet.a')
end
puts twitter_accounts
My understanding is that Nokogiri will save shares in some form of tree structure, which I can use to drill down into, but my mileage is varying.
That data is coming in from an Ajax request with a JSON response. It's pretty easy to get at though:
require 'json'
require 'open-uri'
url = 'http://search.twitter.com/search.json?_usragnt=Bitly&include_entities=true&rpp=100&q=nowness.com%2Fday%2F2012%2F12%2F6%2F2643'
hash = JSON.parse open(url).read
puts hash['results'].map{|x| x['from_user']}
I got that URL by loading the page in Chrome and then looking at the network panel, I also removed the timestamp and callback parameters just to clean things up a bit.
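To show the shape of data that the `map` call relies on, here is a sketch with a made-up payload instead of a live request. The sample JSON is invented for illustration; only the `results` array and the `from_user` key are taken from the code above.

```ruby
require 'json'

# Made-up payload with the same shape the code above relies on: a "results"
# array whose items carry a "from_user" key.
sample = <<~JSON_BODY
  {"results": [{"from_user": "alice", "text": "first"},
               {"from_user": "bob", "text": "second"},
               {"text": "no user here"}]}
JSON_BODY

hash = JSON.parse(sample)
# fetch with a default avoids a NoMethodError when "results" is missing;
# compact drops entries that have no "from_user".
users = hash.fetch('results', []).map { |x| x['from_user'] }.compact
puts users
```

The `fetch` default and `compact` are optional extras; they only matter when the response is missing fields.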
Actually, Eric Walker was onto something. If you look at doc, the section where the tweets are supposed to be looks like:
<h2>Tweets</h2>
<ul id="tweets"></ul>
</div>
This is likely because they're generated by a JavaScript call, which Nokogiri doesn't execute. One possible solution is to use watir to navigate to the page, let the JavaScript run, and then save the HTML.
Here is a script that accomplishes just that. Note that you had some issues with your XPath arguments which I've since solved, and that watir will open a new browser every time you run this script:
require 'watir'
require 'nokogiri'
browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'
doc = Nokogiri::HTML.parse(browser.html)
twitter_accounts = []
shares = doc.xpath('//li[contains(@class, "tweet")]/a')
shares.each do |tweet|
  twitter_accounts << tweet.attr('title')
end
puts twitter_accounts
browser.close
You can also use headless to prevent a window from opening.

Use ruby mechanize to get data from foursquare

I am trying to use ruby and Mechanize to parse data on foursquare's website. Here is my code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://foursquare.com')
page = agent.click page.link_with(:text => /Log In/)
form = page.forms[1]
form.F12778070592981DXGWJ = ARGV[0]
form.F1277807059296KSFTWQ = ARGV[1]
page = form.submit form.buttons.first
puts page.body
But when I run this code, the following error pops up:
C:/Ruby192/lib/ruby/gems/1.9.1/gems/mechanize-2.0.1/lib/mechanize/form.rb:162:in
`method_missing': undefined method `F12778070592981DXGWJ='
for #<Mechanize::Form:0x2b31f70> (NoMethodError)
from four.rb:10:in `<main>'
I checked and found that the two form field names, "F12778070592981DXGWJ" and "F1277807059296KSFTWQ", change every time I open foursquare's web page.
Has anyone had this problem before, where the field names change every time you open a web page? How should I solve it?
Our project is about parsing the data on foursquare, so I need to be able to log in first.
Mechanize is useful for sites which don't expose an API, but Foursquare has an established REST API already. I'd recommend using one of the Ruby libraries, perhaps foursquare2. These libraries abstract away things like authentication, so you just have to register your app and use the provided keys.
Instead of indexing the form fields by their name, just index them by their order. That way you don't have to worry about the name that changes on each request:
form.fields[0].value = ARGV[0]
form.fields[1].value = ARGV[1]
...
However like dwhalen said, using the REST API is probably a much better way. That's why it's there.

Resources