How to parse a webpage in Ruby without any library or gem? - ruby

I want to use the API of a website in a Ruby script, and the only return from the API is a number through the HTTPS protocol. Nothing more, not even tags or something, so I was wondering if there is a way to get that number in a string or integer in my script without using any XML parsing livrary or gem like REXML or hpricot or libXML, because the webpages that I want to parse are, as I said, extremely basic...

If I understand. A request to https://www.website.com/api/getid return 2.
Then, I guess this would do:
require 'net/https'
require 'uri'
def open(url)
Net::HTTP.get(URI.parse(url))
end
response = open("https://www.website.com/api/getid")
EDIT
You'll find much usefull examples here.
As it is mentioned in the link above, HTTParty is quite popular. An example:
require 'httparty'
response = HTTParty.get('http://twitter.com/statuses/public_timeline.json')
puts response.body, response.code, response.message, response.headers.inspect

Related

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

I am trying to scrape through the following website :
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get into the collapsed data/gridcell data.
I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.
Although the answer of Layon Ferreira already states the problem it does not provide the steps needed to load the data.
Like already said in the linked answer the data is loaded asynchronously. This means that the data is not present on the initial page and is loaded through the JavaScript engine executing code.
When you open up the browser development tools, go to the "Network" tab. You can clear out all requests, then refresh the page. You'll get to see a list of all requests made. If you're looking for asynchronously loaded data the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This seems to be a Byte Order Mark (BOM) that is not removed by HTTParty. Most likely because the response doesn't specify an UTF-8 charset.
response.body[0]
#=> ""
format '%X', response.body[0].ord
#=> "FEFF"
To correctly handle the BOM Ruby 2.7 added the set_encoding_by_bom method to IO which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet using Ruby 2.7 you can use a substitute to remove the BOM, however the former is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
That page is using AJAX to load its data.
in that case you may use Watir to fetch the page using a browser
as answered here: https://stackoverflow.com/a/13792540/2784833
Another way is to get data from the API directly.
You can see the other endpoints by checking the network tab on your browser console
I replicated your code and found some of the errors that you might have done
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around your variable url value i.e
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it just worked fine for me.
Also, if you're trying to get the Covid-19 data you might want to use these APIs
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here

How to scrape pages which have lazy loading

Here is the code which i used for parsing of web page.I did it in rails console.But i am not getting any output in my rails console.The site which i want to scrape is having lazy loading
require 'nokogiri'
require 'open-uri'
page = 1
while true
url = "http://www.justdial.com/functions"+"/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits"+"&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=#{page}"
doc = Nokogiri::HTML(open(url))
doc = Nokogiri::HTML(doc.at_css('#ajax').text)
d = doc.css(".rslwrp")
d.each do |t|
puts t.css(".jrcw").text
puts t.css("span.jcn").text
puts t.css(".jaid").text
puts t.css(".estd").text
page+=1
end
end
You have 2 options here:
Switch pure HTTP scraping to some tool which supports javascript evaluation, such as Capybara (with proper driver selected). This can be slow, since you're running headless browser under the hood plus you'll have to set some timeouts or figure another way to make sure the blocks of text you're interested in are loaded before you start any scraping.
Second option is to use Web Developer console and figure out how those blocks of text are loaded (which AJAX calls, their parameters and etc.) and implement them in your scraper. This is more advanced approach, but more performant, since you won't make any extra work, like you've done in option 1.
Have a nice day!
UPDATE:
Your code above doesn't work, because the response is HTML code wrapped in JSON object, while you're trying to parse it as a raw HTML. It looks like this:
{
"error": 0,
"msg": "request successful",
"paidDocIds": "some ids here",
"itemStartIndex": 20,
"lastPageNum": 50,
"markup": 'LOTS AND LOTS AND LOTS OF MARKUP'
}
What you need is unwrap JSON and then parse as HTML:
require 'json'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
I'd also advise you against using open-uri since your code may become vulnerable if you use dynamic urls because of the way open-uri works (read the linked article for the details) and use good and more feature-wise libraries such as HTTParty and RestClient.
UPDATE 2: Minimal working script for me:
require 'json'
require 'open-uri'
require 'nokogiri'
url = 'http://www.justdial.com/functions/ajxsearch.php?national_search=0&act=pagination&city=Delhi+%2F+NCR&search=Pandits&where=Delhi+Cantt&catid=1195&psearch=&prid=&page=2'
json = JSON.parse(open(url).read) # make sure you check http errors here
html = json['markup'] # can this field be empty? check for the json['error'] field
doc = Nokogiri::HTML(html) # parse as you like
puts doc.at_css('#newphoto10').attr('title')
# => Dr Raaj Batra Lal Kitab Expert in East Patel Nagar, Delhi

How do I use Nokogiri to parse a bit.ly stats page?

I'm trying to parse the Twitter usernames from a bit.ly stats page using Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://bitly.com/U026ue+/global'))
twitter_accounts = []
shares = doc.xpath('//*[#id="tweets"]/li')
shares.map do |tweet|
twitter_accounts << tweet.at_css('.conv.tweet.a')
end
puts twitter_accounts
My understanding is that Nokogiri will save shares in some form of tree structure, which I can use to drill down into, but my mileage is varying.
That data is coming in from an Ajax request with a JSON response. It's pretty easy to get at though:
require 'json'
url = 'http://search.twitter.com/search.json?_usragnt=Bitly&include_entities=true&rpp=100&q=nowness.com%2Fday%2F2012%2F12%2F6%2F2643'
hash = JSON.parse open(url).read
puts hash['results'].map{|x| x['from_user']}
I got that URL by loading the page in Chrome and then looking at the network panel, I also removed the timestamp and callback parameters just to clean things up a bit.
Actually, Eric Walker was onto something. If you look at doc, the section where the tweets are supposed to be look like:
<h2>Tweets</h2>
<ul id="tweets"></ul>
</div>
This is likely because they're generated by some JavaScript call which Nokogiri isn't executing. One possible solution is to use watir to traverse to the page, load the JavaScript and then save the HTML.
Here is a script that accomplishes just that. Note that you had some issues with your XPath arguments which I've since solved, and that watir will open a new browser every time you run this script:
require 'watir'
require 'nokogiri'
browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'
doc = Nokogiri::HTML.parse(browser.html)
twitter_accounts = []
shares = doc.xpath('//li[contains(#class, "tweet")]/a')
shares.each do |tweet|
twitter_accounts << tweet.attr('title')
end
puts twitter_accounts
browser.close
You can also use headless to prevent a window from opening.

Ruby: expand shorten urls the hard way

Is there a way to open URLS in ruby and output the re-directed url:
ie convert http://bit.ly/l223ue to http://paper.li/CoyDavidsonCRE/1309121465
I find that there are more url shortener services than gems can keep up with, so I'm asking for the hard -but robust- way, instead of using a gem that connects to some API.
Here is a lengthen method
This has very little error handling but it might help you get started.
You could wrap lengthen with a begin rescue block that returns nil or attempt to retry it later. Not sure what you are trying to build but hope it helps.
require 'uri'
require 'net/http'
def lengthen(url)
uri = URI(url)
Net::HTTP.new(uri.host, uri.port).get(uri.path).header['location']
end
irb(main):008:0> lengthen('http://bit.ly/l223ue')
=> "http://paper.li/CoyDavidsonCRE/1309121465"

Parsing an HTML table using Hpricot (Ruby)

I am trying to parse an HTML table using Hpricot but am stuck, not able to select a table element from the page which has a specified id.
Here is my ruby code:-
require 'rubygems'
require 'mechanize'
require 'hpricot'
agent = WWW::Mechanize.new
page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')
form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])
doc = Hpricot(page.body)
puts doc.to_html # Here the doc contains the full HTML page
puts doc.search("//table[#id='gvw_offices']").first # This is NIL
Can anyone help me to identify what's wrong with this.
Mechanize will use hpricot internally (it's mechanize's default parser). What's more, it'll pass the hpricot stuff on to the parser, so you don't have to do it yourself:
require 'rubygems'
require 'mechanize'
#You don't really need this if you don't use hpricot directly
require 'hpricot'
agent = WWW::Mechanize.new
page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')
form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])
puts page.parser.to_html # page.parser returns the hpricot parser
puts page.at("//table[#id='gvw_offices']") # This passes through to hpricot
Also note that page.search("foo").first is equivalent to page.at("foo").
Note that Mechanize no longer uses Hpricot (it uses Nokogiri) by default in the later versions (0.9.0) and you have to explicitly specify Hpricot to continue using with:
WWW::Mechanize.html_parser = Hpricot
Just like that, no quotes or anything around Hpricot - there's probably a module you can specify for Hpricot, because it won't work if you put this statement inside your own module declaration. Here's the best way to do it at the top of your class (before opening module or class)
require 'mechanize'
require 'hpricot'
# Later versions of Mechanize no longer use Hpricot by default
# but have an attribute we can set to use it
begin
WWW::Mechanize.html_parser = Hpricot
rescue NoMethodError
# must be using an older version of Mechanize that doesn't
# have the html_parser attribute - just ignore it since
# this older version will use Hpricot anyway
end
By using the rescue block you ensure that if they do have an older version of mechanize, it won't barf on the nonexistent html_parser attribute. (Otherwise you need to make your code dependent on the latest version of Mechanize)
Also in the latest version, WWW::Mechanize::List was deprecated. Don't ask me why because it totally breaks backward compatibility for statements like
page.forms.name('form1').first
which used to be a common idiom that worked because Page#forms returned a mechanize List which had a "name" method. Now it returns a simple array of Forms.
I found this out the hard way, but your usage will work because you're using find which is a method of array.
But a better method for finding the first form with a given name is Page#form so your form finding line becomes
form = page.form('form1')
this method works with old an new versions.

Resources