Download an image from a URL? - ruby

I am trying to use Net::HTTP#get to download an image of a Google chart from a URL I created.
This was my first attempt:
failures_url = [title, type, data, size, colors, labels].join("&")
require 'net/http'
Net::HTTP.start("http://chart.googleapis.com") { |http|
resp = http.get("/chart?#{failures_url")
open("pie.png" ,"wb") { |file|
file.write(resp.body)
}
}
Which produced only an empty PNG file.
For my second attempt I pasted the value stored in failures_url directly into the http.get() call.
require 'net/http'
Net::HTTP.start("http://chart.googleapis.com") { |http|
resp = http.get("/chart?chtt=Builds+in+the+last+12+months&cht=bvg&chd=t:296,1058,1217,1615,1200,611,2055,1663,1746,1950,2044,2781,1553&chs=800x375&chco=4466AA&chxl=0:|Jul-2010|Aug-2010|Sep-2010|Oct-2010|Nov-2010|Dec-2010|Jan-2011|Feb-2011|Mar-2011|Apr-2011|May-2011|Jun-2011|Jul-2011|2:|Months|3:|Builds&chxt=x,y,x,y&chg=0,6.6666666666666666666666666666667,5,5,0,0&chxp=3,50|2,50&chbh=23,5,30&chxr=1,0,3000&chds=0,3000")
open("pie.png" ,"wb") { |file|
file.write(resp.body)
}
}
And, for some reason, this version works even though the first attempt had the same data inside the http.get() call. Does anyone know why this is?
SOLUTION:
After trying to figure out why this was happening, I found "How do I download a binary file over HTTP?".
One of the comments mentions removing http:// from the Net::HTTP.start(...) call, otherwise it won't succeed. Sure enough, after I did this:
failures_url = [title, type, data, size, colors, labels].join("&")
require 'net/http'
Net::HTTP.start("chart.googleapis.com") { |http|
resp = http.get("/chart?#{failures_url")
open("pie.png" ,"wb") { |file|
file.write(resp.body)
}
}
it worked.
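The fix works because Net::HTTP.start expects a host name, not a full URL. A minimal sketch of letting URI split the URL instead of hard-coding the host is shown below; http_parts and save_url are hypothetical helper names, and the shortened chart URL in the usage comment is only an illustration.

```ruby
require 'net/http'
require 'uri'

# Split a full URL into the pieces Net::HTTP expects.
# Returns [host, port, request_uri]; request_uri is the path plus query string.
def http_parts(url)
  uri = URI.parse(url)
  [uri.host, uri.port, uri.request_uri]
end

# Download url and write the response body to dest in binary mode.
# (save_url is a hypothetical helper name, not part of Net::HTTP.)
def save_url(url, dest)
  host, port, request_uri = http_parts(url)
  Net::HTTP.start(host, port) do |http|
    resp = http.get(request_uri)
    File.open(dest, 'wb') { |file| file.write(resp.body) }
  end
end

# Usage (performs a network request):
# save_url("http://chart.googleapis.com/chart?cht=bvg&chs=800x375&chd=t:296,1058", "pie.png")
```

This sketch only covers plain HTTP; an https URL would also need use_ssl: true passed to Net::HTTP.start.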

I'd go after the file using Ruby's OpenURI:
require "open-uri"
# URI.open requires Ruby 2.5+; on older versions call open(...) instead.
File.open('pie.png', 'wb') do |fo|
  fo.write URI.open("http://chart.googleapis.com/chart?#{failures_url}").read
end
The reason I prefer OpenURI is that it handles redirects automatically, so WHEN Google makes a change to their back-end and tries to redirect the URL, the code will handle it magically. It also handles timeouts and retries more gracefully, if I remember right.
If you must have lower-level control then I'd look at one of the many other HTTP clients for Ruby; Net::HTTP is fine for creating new services or when a client doesn't exist, but I'd use OpenURI or something besides Net::HTTP until the need presents itself.
The URL:
http://chart.googleapis.com/chart?chtt=Builds+in+the+last+12+months&cht=bvg&chd=t:296,1058,1217,1615,1200,611,2055,1663,1746,1950,2044,2781,1553&chs=800x375&chco=4466AA&chxl=0:|Jul-2010|Aug-2010|Sep-2010|Oct-2010|Nov-2010|Dec-2010|Jan-2011|Feb-2011|Mar-2011|Apr-2011|May-2011|Jun-2011|Jul-2011|2:|Months|3:|Builds&chxt=x,y,x,y&chg=0,6.6666666666666666666666666666667,5,5,0,0&chxp=3,50|2,50&chbh=23,5,30&chxr=1,0,3000&chds=0,3000
makes URI upset. I suspect it is seeing characters that should be encoded in URLs.
For documentation purposes, here is what URI says when trying to parse that URL as-is:
URI::InvalidURIError: bad URI(is not URI?)
If I encode the URI first, I get a successful parse. Testing further using OpenURI shows it is able to retrieve the document at that point and returns 23701 bytes.
I think that is the appropriate fix for the problem if some of those characters are truly not acceptable to URI and fall outside the RFC.
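One way to do the encoding step is to build the query string from a Hash with URI.encode_www_form, which percent-encodes characters such as "|" and ":". This is a sketch, not the code used for the test above: the parameter names come from the URL shown earlier, the values are shortened for illustration, and I have not verified how the chart service responds to the encoded form.

```ruby
require 'uri'

# Chart parameters as a Hash; the names (chtt, cht, chs, chxl) come from the
# URL above, the shortened values are illustrative.
params = {
  'chtt' => 'Builds in the last 12 months',
  'cht'  => 'bvg',
  'chs'  => '800x375',
  'chxl' => '0:|Jul-2010|Aug-2010|2:|Months|3:|Builds'
}

# encode_www_form percent-encodes reserved characters such as "|" and ":",
# and turns spaces into "+".
query = URI.encode_www_form(params)
chart_uri = URI::HTTP.build(host: 'chart.googleapis.com', path: '/chart', query: query)

puts chart_uri
```

Because the URI object is built from already-encoded parts, it can be handed to an HTTP client without tripping URI's parser.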
Just for information, Addressable::URI (from the addressable gem) is a great replacement for the built-in URI.

resp = http.get("/chart?#{failures_url")
If you copied your original code then you're missing a closing curly bracket in your path string.

Your original version did not have the parameter name for each parameter, just the data. For example, on the title, you cannot just submit "Builds+in+the+last+12+months", but instead it must be "chtt=Builds+in+the+last+12+months".
Try this:
failures_url = ["chtt="+title, "cht="+type, "chd="+data, "chs="+size, "chco="+colors, "chxl="+labels].join("&")

Related

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

I am trying to scrape through the following website :
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get into the collapsed data/gridcell data.
I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.
Although Layon Ferreira's answer already states the problem, it does not provide the steps needed to load the data.
As the linked answer says, the data is loaded asynchronously. This means that the data is not present in the initial page and only arrives once the browser's JavaScript engine runs the page's code.
Open your browser's development tools and go to the "Network" tab. Clear out all requests, then refresh the page to see a list of every request made. If you're looking for asynchronously loaded data, the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This seems to be caused by a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> "\uFEFF" (an invisible BOM character)
format '%X', response.body[0].ord
#=> "FEFF"
To correctly handle the BOM Ruby 2.7 added the set_encoding_by_bom method to IO which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet using Ruby 2.7 you can strip the BOM with a substitution instead, although the former is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
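The BOM handling can also be isolated in a small helper so it is easy to test without hitting the network. The sketch below assumes Ruby 2.5 or later (for String#delete_prefix); parse_json_with_bom is a hypothetical name, not part of HTTParty or the json library.

```ruby
require 'json'

# Parse JSON text that may start with a UTF-8 byte order mark (bytes EF BB BF).
# (parse_json_with_bom is a hypothetical helper name.)
def parse_json_with_bom(raw)
  # Work on a copy so the caller's string keeps its original encoding.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  text = text.delete_prefix("\uFEFF") # no-op when there is no BOM
  JSON.parse(text)
end

# Usage with the response from above:
# data = parse_json_with_bom(response.body)
```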
That page uses AJAX to load its data.
In that case you can use Watir to fetch the page with a real browser,
as answered here: https://stackoverflow.com/a/13792540/2784833
Another way is to get the data from the API directly.
You can find the other endpoints by checking the Network tab in your browser's developer tools.
I replicated your code and found some errors that you might have made.
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around the value of your url variable, i.e.
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it just worked fine for me.
Also, if you're trying to get the Covid-19 data you might want to use these APIs
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here

Ruby get RSS feed won't get the latest feed

I have a problem parsing an RSS feed.
When I do this:
feed = getFeed("http://example.com/rss")
If the feed content changes, it doesn't update.
If I do it like this:
feed = getFeed("http://example.com/rss?" + Random.rand(20).to_s)
It works most of the time but not always.
getFeed() is implemented like this:
def getFeed(url)
rss_content = ""
open(url) do |f|
rss_content = f.read
end
return rss_content
end
I used this in Sinatra with Ruby 1.9.3, if this makes a difference.
In my opinion it gets cached somewhere, but I have no idea where.
Edit:
Okay, after half a day running on the server it works without a problem.
This:
feed = getFeed("http://example.com/rss?" + Random.rand(20).to_s)
implies the problem is with caching, but Ruby, OpenURI and Sinatra shouldn't be caching anything. Perhaps your code is running behind a caching device or app that is handling outgoing requests as well as incoming?
This isn't the fix, but your code can be streamlined greatly:
def getFeed(url)
  open(url).read
end
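If an intermediate cache is the cause, as suggested above, two things may help. Random.rand(20) only produces 20 distinct URLs, which could explain why the trick works most of the time but not always; a timestamp is unique on every call. A request can also ask caches not to serve a stored copy. The sketch below assumes Ruby 2.5 or later for URI.open; cache_busted, get_feed and the "_" parameter name are hypothetical choices, and whether a given cache honours the headers is not guaranteed.

```ruby
require 'open-uri'
require 'uri'

# Append a timestamp-based parameter so every request URL is unique.
# (cache_busted and the "_" parameter name are hypothetical choices.)
def cache_busted(url, now = Time.now)
  uri = URI.parse(url)
  stamp = "_=#{(now.to_f * 1000).to_i}" # milliseconds since the epoch
  uri.query = [uri.query, stamp].compact.join('&')
  uri.to_s
end

# Fetch the feed while asking intermediate caches not to serve a stored copy.
# String keys passed to URI.open are sent as HTTP request headers.
def get_feed(url)
  URI.open(cache_busted(url), 'Cache-Control' => 'no-cache', 'Pragma' => 'no-cache').read
end

# Usage (performs a network request):
# feed = get_feed("http://example.com/rss")
```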

Ruby: expand shorten urls the hard way

Is there a way to open URLs in Ruby and output the redirected URL,
i.e. convert http://bit.ly/l223ue to http://paper.li/CoyDavidsonCRE/1309121465?
I find that there are more URL shortener services than gems can keep up with, so I'm asking for the hard, but robust, way instead of using a gem that connects to some API.
Here is a lengthen method
This has very little error handling but it might help you get started.
You could wrap lengthen with a begin rescue block that returns nil or attempt to retry it later. Not sure what you are trying to build but hope it helps.
require 'uri'
require 'net/http'

# Return the Location header of the first response, i.e. where the short URL redirects to.
def lengthen(url)
  uri = URI(url)
  Net::HTTP.new(uri.host, uri.port).get(uri.path)['location']
end
irb(main):008:0> lengthen('http://bit.ly/l223ue')
=> "http://paper.li/CoyDavidsonCRE/1309121465"
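Some shorteners redirect through more than one hop, so a variant that keeps following the Location header may be useful. The following is a sketch under a few assumptions: expand_url and HEAD_LOCATION are hypothetical names, HEAD requests are used to avoid downloading bodies (some servers may answer HEAD differently from GET), and any response without a Location header is treated as the final destination.

```ruby
require 'net/http'
require 'uri'

# Ask the server for headers only and return the Location header, or nil.
# This is the default network step; a stub can be passed in for testing.
HEAD_LOCATION = lambda do |uri|
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)['location']
  end
end

# Follow redirects until a response has no Location header or the limit is hit.
# (expand_url is a hypothetical name.)
def expand_url(url, limit = 5, locate = HEAD_LOCATION)
  limit.times do
    location = locate.call(URI.parse(url))
    return url if location.nil?
    url = URI.join(url, location).to_s # Location may be relative
  end
  url
end

# Usage (performs network requests):
# puts expand_url('http://bit.ly/l223ue')
```

The limit guards against redirect loops; when it is reached, the last URL seen is returned rather than an error.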

Using Watir to check for bad links

I have an unordered list of links that I save off to the side, and I want to click each link and make sure it goes to a real page and doesn't 404, 500, etc.
The issue is that I do not know how to do it. Is there some object I can inspect which will give me the http status code or anything?
mylinks = Browser.ul(:id, 'my_ul_id').links
mylinks.each do |link|
link.click
# need to check for a 200 status or something here! how?
Browser.back
end
My answer is similar in idea to the Tin Man's.
require 'net/http'
require 'uri'
mylinks = Browser.ul(:id, 'my_ul_id').links
mylinks.each do |link|
u = URI.parse link.href
status_code = Net::HTTP.start(u.host,u.port){|http| http.head(u.request_uri).code }
# testing with rspec
status_code.should == '200'
end
If you use Test::Unit as your testing framework, I think you can test it like this:
assert_equal '200',status_code
Another sample (including Chuck van der Linden's idea): check the status code and log the URL if the status is not good.
require 'net/http'
require 'uri'
mylinks = Browser.ul(:id, 'my_ul_id').links
mylinks.each do |link|
u = URI.parse link.href
status_code = Net::HTTP.start(u.host,u.port){|http| http.head(u.request_uri).code }
unless status_code == '200'
File.open('error_log.txt','a+'){|file| file.puts "#{link.href} is #{status_code}" }
end
end
There's no need to use Watir for this. An HTTP HEAD request will give you an idea whether the URL resolves and will be faster.
Ruby's Net::HTTP can do it, or you can use OpenURI.
Using OpenURI you can request a URI, and get a page back. Because you don't really care what the page contains, you can throw away that part and only return whether you got something:
require 'open-uri'
# URI.open requires Ruby 2.5+; on older versions call open(...) instead.
# Note that it raises OpenURI::HTTPError for 4xx/5xx responses.
if URI.open('http://www.example.com').read.empty?
  puts "isn't"
else
  puts "is"
end
The upside is that OpenURI resolves HTTP redirects. The downside is that it returns full pages, so it can be slow.
Ruby's Net::HTTP can help somewhat, because it can use HTTP HEAD requests, which don't return the entire page, only a header. That by itself isn't enough to know whether the actual page is reachable because the HEAD response could redirect to a page that doesn't resolve, so you have to loop through the redirects until you either don't get a redirect, or you get an error. The Net::HTTP docs have an example to get you started:
require 'net/http'
require 'uri'

def fetch(uri_str, limit = 10)
  # You should choose better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess     then response
  when Net::HTTPRedirection then fetch(response['location'], limit - 1)
  else
    response.error!
  end
end

print fetch('http://www.ruby-lang.org')
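Again, that example returns whole pages, which might slow you down. You can swap get_response for request_head, which returns a response object just as get_response does but without the body. Note that request_head is an instance method, so it has to be called on a Net::HTTP connection (for example inside Net::HTTP.start) rather than on the class.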
In either case, there's another thing you have to consider. A lot of sites use "meta refreshes", which cause the browser to refresh the page, using an alternate URL, after parsing the page. Handling these requires requesting the page and parsing it, looking for the <meta http-equiv="refresh" content="5" /> tags.
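As a rough sketch of the meta-refresh check described above, the helper below pulls the target URL out of a page body. It relies on a regular expression and assumes that http-equiv appears before content and that the target is written as an unquoted url=... value; meta_refresh_target is a hypothetical name, and a real HTML parser such as Nokogiri would be more robust.

```ruby
# Extract the target URL from a <meta http-equiv="refresh"> tag, or nil.
# A regular expression is a rough approach that only handles the common layout.
META_REFRESH = /<meta[^>]*http-equiv=["']?refresh["']?[^>]*content=["']\s*\d+\s*;\s*url=([^"'>\s]+)/i

def meta_refresh_target(html)
  match = META_REFRESH.match(html)
  match && match[1]
end

# Usage: after fetching a page body, check whether it redirects via meta refresh.
# target = meta_refresh_target(body)
# puts "Meta refresh to #{target}" if target
```

A tag without a url= part, such as the content="5" example above, simply reloads the same page, so the helper returns nil for it.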
Other HTTP gems like Typhoeus and Patron also can do HEAD requests easily, so take a look at them too. In particular, Typhoeus can handle some heavy loads via its companion Hydra, allowing you to easily use parallel requests.
EDIT:
require 'typhoeus'
response = Typhoeus::Request.head("http://www.example.com")
response.code # => 302
case response.code
when (200 .. 299)
  #
when (300 .. 399)
  headers = Hash[*response.headers.split(/[\r\n]+/).map{ |h| h.split(' ', 2) }.flatten]
  puts "Redirected to: #{ headers['Location:'] }"
when (400 .. 499)
  #
when (500 .. 599)
  #
end
# >> Redirected to: http://www.iana.org/domains/example/
Just in case you haven't played with one, here's what the response looks like. It's useful for exactly the sort of situation you're looking at:
(rdb:1) pp response
#<Typhoeus::Response:0x00000100ac3f68
 @app_connect_time=0.0,
 @body="",
 @code=302,
 @connect_time=0.055054,
 @curl_error_message="No error",
 @curl_return_code=0,
 @effective_url="http://www.example.com",
 @headers=
  "HTTP/1.0 302 Found\r\nLocation: http://www.iana.org/domains/example/\r\nServer: BigIP\r\nConnection: Keep-Alive\r\nContent-Length: 0\r\n\r\n",
 @http_version=nil,
 @mock=false,
 @name_lookup_time=0.001436,
 @pretransfer_time=0.055058,
 @request=
  :method => :head,
  :url => http://www.example.com,
  :headers => {"User-Agent"=>"Typhoeus - http://github.com/dbalatero/typhoeus/tree/master"},
 @requested_http_method=nil,
 @requested_url=nil,
 @start_time=nil,
 @start_transfer_time=0.109741,
 @status_message=nil,
 @time=0.109822>
If you have a lot of URLs to check, see the Hydra example that is part of Typhoeus.
There's a bit of a philosophical debate on whether watir or watir-webdriver should provide HTTP return code information. The premise being that an ordinary 'user' which is what Watir is simulating on the DOM is ignorant of HTTP return codes. I don't necessarily agree with this, as I have a slightly different use case perhaps to the main (performance testing etc)... but it is what it is. This thread expresses some opinions about the distinction => http://groups.google.com/group/watir-general/browse_thread/thread/26486904e89340b7
At present there's no easy way to determine HTTP response codes from Watir without using supplementary tools like proxies/Fiddler/HTTPWatch/TCPdump, or downgrading to a net/http level of scripting mid test... I personally like using firebug with the netexport plugin to keep a retrospective look at tests.
All previous solutions are inefficient if you have a very large number of links, because each one establishes a new HTTP connection to the server hosting the link.
I have written a one-line bash command that uses curl to fetch a list of links supplied on stdin and returns the status code for each link. The key point is that curl takes the whole batch of links in a single invocation and reuses HTTP connections, which dramatically improves speed.
However, curl will divide the list into chunks of 256, which is still by far more than 1! To make sure connections are reused, sort the links first (simply using the sort command) so that links on the same host sit next to each other.
cat <YOUR_LINKS_FILE_ONE_PER_LINE> | xargs curl --head --location -w '---HTTP_STATUS_CODE:%{http_code}\n\n' -s --retry 10 --globoff | grep HTTP_STATUS_CODE | cut -d: -f2 > <RESULTS_FILE>
It is worth noting that the above command will follow HTTP redirects, retry 10 times for temporary errors (timeouts or 5xx) and of course will only fetch headers.
Update: added --globoff so that curl won't expand any url if it contains {} or []
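The same connection-reuse idea can be sketched in Ruby by grouping the links by server and sending all HEAD requests for one server over a single Net::HTTP session. The names group_by_server and check_links are hypothetical, and unlike the curl command this sketch neither follows redirects nor retries.

```ruby
require 'net/http'
require 'uri'

# Group links by scheme, host and port so each server gets one connection.
def group_by_server(links)
  links.group_by do |link|
    uri = URI.parse(link)
    [uri.scheme, uri.host, uri.port]
  end
end

# Send HEAD requests over one connection per server; returns { link => status code }.
# (check_links is a hypothetical name; redirects are reported, not followed.)
def check_links(links)
  statuses = {}
  group_by_server(links).each do |(scheme, host, port), group|
    Net::HTTP.start(host, port, use_ssl: scheme == 'https') do |http|
      group.each { |link| statuses[link] = http.head(URI.parse(link).request_uri).code }
    end
  end
  statuses
end

# Usage (performs network requests):
# check_links(File.readlines('links.txt', chomp: true)).each { |link, code| puts "#{code} #{link}" }
```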

Is there a workaround to open URLs containing underscores in Ruby?

I'm using open-uri to open URLs.
resp = open("http://sub_domain.domain.com")
If it contains underscore I get an error:
URI::InvalidURIError: the scheme http does not accept registry part: sub_domain.domain.com (or bad hostname?)
I understand that this is because according to RFC URLs can contain only letters and numbers. Is there any workaround?
This looks like a bug in URI, and open-uri, HTTParty and many other gems make use of URI.parse.
Here's a workaround:
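require 'net/http'
require 'open-uri'

# Try open-uri first; fall back to a plain Net::HTTP GET when URI rejects the host.
def hopen(url)
  open(url)
rescue URI::InvalidURIError
  host = url.match(".+\:\/\/([^\/]+)")[1]
  path = url.partition(host)[2]
  path = "/" if path.empty? # partition returns "" (never nil) when nothing follows the host
  Net::HTTP.get host, path
end

resp = hopen("http://dear_raed.blogspot.com/2009_01_01_archive.html")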
URI has an old-fashioned idea of what a URL looks like.
Lately I'm using addressable to get around that:
require 'open-uri'
require 'addressable/uri'
class URI::Parser
def split url
a = Addressable::URI::parse url
[a.scheme, a.userinfo, a.host, a.port, nil, a.path, nil, a.query, a.fragment]
end
end
resp = open("http://sub_domain.domain.com") # Yay!
Don't forget to gem install addressable
This initializer in my rails app seems to make URI.parse work at least:
# config/initializers/uri_underscore.rb
class URI::Generic
def initialize_with_registry_check(scheme,
userinfo, host, port, registry,
path, opaque,
query,
fragment,
parser = DEFAULT_PARSER,
arg_check = false)
if %w(http https).include?(scheme) && host.nil? && registry =~ /_/
initialize_without_registry_check(scheme, userinfo, registry, port, nil, path, opaque, query, fragment, parser, arg_check)
else
initialize_without_registry_check(scheme, userinfo, host, port, registry, path, opaque, query, fragment, parser, arg_check)
end
end
alias_method_chain :initialize, :registry_check
end
Here is a patch that solves the problem for a wide variety of situations (rest-client, open-uri, etc.) without using external gems or overriding parts of URI.parse:
module URI
DEFAULT_PARSER = Parser.new(:HOSTNAME => "(?:(?:[a-zA-Z\\d](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.)*(?:[a-zA-Z](?:[-\\_a-zA-Z\\d]*[a-zA-Z\\d])?)\\.?")
end
Source: lib/uri/rfc2396_parser.rb#L86
Ruby-core has an open issue: https://bugs.ruby-lang.org/issues/8241
An underscore cannot be contained in a domain name like that. That is part of the DNS standard. Did you mean to use a dash (-)?
Even if open-uri didn't throw an error, such a command would be pointless. Why? Because there is no way it can resolve such a domain name. At best you'd get an unknown host error. There is no way for you to register a domain name with an _ in it, and even when running your own private DNS server, it is against the specification to use a _. You could bend the rules and allow it (by modifying the DNS server software), but then your operating system's DNS resolver won't support it, and neither will your router's DNS software.
Solution: Don't try to use a _ in a DNS name. It won't work anywhere and it's against the specifications.
I had this same error while trying to use gem update / gem install etc., so I used the IP address instead and it's fine now.
Here is another ugly hack, no gem needed:
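require 'uri'

# Parse with URI; if the host is rejected, parse with a placeholder host and patch the real one back in.
def parse(url = nil)
  URI.parse(url)
rescue URI::InvalidURIError
  host = url.match(".+\:\/\/([^\/]+)")[1]
  uri = URI.parse(url.sub(host, 'dummy-host'))
  uri.instance_variable_set('@host', host)
  uri
end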
I recommend using the Curb gem (https://github.com/taf2/curb), which just wraps libcurl. Here is a simple example that automatically follows redirects and prints the response code and response body:
require 'curb'
rsp = Curl::Easy.http_get(url) { |curl| curl.follow_location = true; curl.max_redirects = 10 }
puts rsp.response_code
puts rsp.body_str
I usually avoid the Ruby URI classes since they stick too strictly to the spec, and as you know the web is the wild west :) Curl / curb handles every URL I throw at it like a champ.
For anyone stumbling upon this:
Ruby's URI.parse used to be based on RFC 2396 (published in Aug 1998), see https://bugs.ruby-lang.org/issues/8241
Starting with Ruby 2.2, however, URI was upgraded to RFC 3986, so if you're on a modern version, no monkey patches are necessary.
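A quick way to check whether your Ruby still needs one of the workarounds is to parse an underscore host directly. This is a small sketch; host_or_nil is a hypothetical helper name. On Ruby 2.2 or later it should print the host, and on older versions it should print nil.

```ruby
require 'uri'

# Return the host if the built-in parser accepts the URL, or nil if it raises.
# (host_or_nil is a hypothetical helper name.)
def host_or_nil(url)
  URI.parse(url).host
rescue URI::InvalidURIError
  nil
end

puts RUBY_VERSION
puts host_or_nil('http://sub_domain.domain.com').inspect
```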
