Why doesn't Nokogiri load the full page? - ruby

I'm using Nokogiri to open Wikipedia pages about various countries, and then extracting the names of these countries in other languages from the interwiki links (links to foreign-language wikipedias). However, when I try to open the page for France, Nokogiri does not download the full page. Maybe it's too large; in any case, it doesn't contain the interwiki links that I need. How can I force it to download the whole page?
Here's my code:
require 'open-uri'
require 'nokogiri'

url = "http://en.wikipedia.org/wiki/" + country_name
page = nil
begin
  page = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError => e
  puts "No article found for " + country_name
end
language_part = page.css('div#p-lang')
Test:
with country_name = "France"
=> []
with country_name = "Thailand"
=> really long array that I don't want to quote here,
but containing all the right data
Maybe this issue goes beyond Nokogiri and into OpenURI; in any case, I need to find a solution.

Nokogiri does not retrieve the page; it asks OpenURI to do it with an internal read on the StringIO object that OpenURI returns.
require 'open-uri'
require 'zlib'

stream = open('http://en.wikipedia.org/wiki/France')
if stream.content_encoding.empty?
  body = stream.read
else
  body = Zlib::GzipReader.new(stream).read
end
p body
Here's what you can key off of:
>> require 'open-uri' #=> true
>> open('http://en.wikipedia.org/wiki/France').content_encoding #=> ["gzip"]
>> open('http://en.wikipedia.org/wiki/Thailand').content_encoding #=> []
In this case, if content_encoding is [] (plain "text/html") the code reads the stream directly; if it's ["gzip"] it runs the stream through GzipReader first.
Doing all the stuff above and tossing it to:
require 'nokogiri'
page = Nokogiri::HTML(body)
language_part = page.css('div#p-lang')
should get you back on track.
Do this after all the above to confirm visually you're getting something usable:
p language_part.text.gsub("\t", '')
See Casper's answer and comments about why you saw two different results. Originally it looked like Open-URI was inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzip. That is fairly safe with today's browsers but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should help fix.

After quite a bit of head scratching, the problem is here:
> wget -S 'http://en.wikipedia.org/wiki/France'
Resolving en.wikipedia.org... 91.198.174.232
Connecting to en.wikipedia.org|91.198.174.232|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 200 OK
Content-Language: en
Last-Modified: Fri, 01 Jul 2011 23:31:36 GMT
Content-Encoding: gzip <<<<------ BINGO!
...
You need to unpack the gzipped data, which open-uri does not do automatically.
Solution:
require 'net/http'
require 'uri'
require 'zlib'
require 'stringio'

def http_get(uri)
  url = URI.parse(uri)
  res = Net::HTTP.start(url.host, url.port) { |h| h.get(url.path) }
  headers = res.to_hash
  gzipped = headers['content-encoding'] && headers['content-encoding'][0] == "gzip"
  gzipped ? Zlib::GzipReader.new(StringIO.new(res.body)).read : res.body
end
And then:
page = Nokogiri::HTML(http_get("http://en.wikipedia.org/wiki/France"))

require 'open-uri'
require 'zlib'

# open_gunzipped is just an illustrative name for a small wrapper around
# open-uri that transparently gunzips the response when needed.
def open_gunzipped(url)
  open(url, 'Accept-Encoding' => 'gzip, deflate') do |response|
    if response.content_encoding.include?('gzip')
      response = Zlib::GzipReader.new(response)
      # delegate anything GzipReader doesn't know about to the underlying IO
      response.define_singleton_method(:method_missing) do |name, *args|
        to_io.public_send(name, *args)
      end
    end
    yield response if block_given?
    response
  end
end
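A short usage sketch, assuming the helper is named open_gunzipped as in the snippet above (the name is only illustrative):
require 'nokogiri'

open_gunzipped('http://en.wikipedia.org/wiki/France') do |io|
  page = Nokogiri::HTML(io.read)
  p page.css('div#p-lang').length
end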

Related

How to get HTTP headers before downloading with Ruby's OpenUri

I am currently using OpenURI to download a file in Ruby. Unfortunately, it seems impossible to get the HTTP headers without downloading the full file:
pbar = nil
open(base_url,
  :content_length_proc => lambda { |t|
    if t && 0 < t
      pbar = ProgressBar.create(:total => t)
    end
  },
  :progress_proc => lambda { |s|
    pbar.progress = s if pbar
  }) { |io|
  puts io.size
  puts io.meta['content-disposition']
}
Running the code above shows that it first downloads the full file and only then prints the header I need.
Is there a way to get the headers before the full file is downloaded, so I can cancel the download if the headers are not what I expect them to be?
You can use Net::HTTP for this matter, for example:
require 'net/http'
http = Net::HTTP.start('stackoverflow.com')
resp = http.head('/')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish
Another example, this time getting the headers of the wonderful book, Object-Oriented Programming with ANSI-C:
require 'net/http'
http = Net::HTTP.start('www.planetpdf.com')
resp = http.head('/codecuts/pdfs/ooc.pdf')
resp.each { |k, v| puts "#{k}: #{v}" }
http.finish
It seems what I wanted is not possible to achieve using OpenURI, at least not, as I said, without loading the whole file first.
I was able to do what I wanted by using Net::HTTP's request_get. Here is an example:
http.request_get('/largefile.jpg') do |response|
  if response['content-length'].to_i < max_length
    response.read_body do |str|   # read body now
      # save to file
    end
  end
end
Note that this only works when using a block. If you call it without one, like this:
response = http.request_get('/largefile.jpg')
the body will already have been read by the time the call returns.
Rather than use Net::HTTP, which can be like digging a pool on the beach using a sand shovel, you can use one of the many HTTP client gems for Ruby and clean up the code.
Here's a sample using HTTParty:
require 'httparty'
resp = HTTParty.head('http://example.org')
resp.headers
# => {"accept-ranges"=>["bytes"], "cache-control"=>["max-age=604800"], "content-type"=>["text/html"], "date"=>["Thu, 02 Mar 2017 18:52:42 GMT"], "etag"=>["\"359670651\""], "expires"=>["Thu, 09 Mar 2017 18:52:42 GMT"], "last-modified"=>["Fri, 09 Aug 2013 23:54:35 GMT"], "server"=>["ECS (oxr/83AB)"], "x-cache"=>["HIT"], "content-length"=>["1270"], "connection"=>["close"]}
At that point it's easy to check the size of the document:
resp.headers['content-length'] # => "1270"
Unfortunately, the HTTPd you're talking to might not know how big the content will be; in order to respond quickly, servers don't necessarily calculate the size of dynamically generated output, since that would take almost as long and be almost as CPU intensive as actually sending it. So relying on the "content-length" value can be unreliable.
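A rough sketch of a defensive check along those lines, still using HTTParty (example.org is just a placeholder): trust Content-Length when the server sends it, otherwise fetch the body and measure it yourself.
require 'httparty'

resp = HTTParty.head('http://example.org')
length = resp.headers['content-length']

if length
  puts "Server reports #{length} bytes"
else
  # no Content-Length header, so download and measure the body ourselves
  body = HTTParty.get('http://example.org').body
  puts "Measured #{body.bytesize} bytes"
end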
The issue with Net::HTTP is that it won't automatically handle redirects, so you have to add additional code. Granted, that code is supplied in the documentation, but it keeps growing as you need to do more things, until you've ended up writing yet another HTTP client (YAHC). So, avoid that and use an existing wheel.
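For example, a minimal sketch (the redirecting URL below is just a placeholder) showing that HTTParty chases a 301 without any extra code:
require 'httparty'

resp = HTTParty.get('http://example.org/old-path')
puts resp.code                    # 200 if the redirect leads to a real page
puts resp.request.last_uri.to_s   # the URL that actually served the response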

Ruby Net::HTTP - following 301 redirects

My users submit urls (to mixes on mixcloud.com) and my app uses them to perform web requests.
A good url returns a 200 status code:
uri = URI.parse("http://www.mixcloud.com/ErolAlkan/hard-summer-mix/")
request = Net::HTTP.get_response(uri)
#<Net::HTTPOK 200 OK readbody=true>
But if you forget the trailing slash then our otherwise good url returns a 301:
uri = URI.parse("http://www.mixcloud.com/ErolAlkan/hard-summer-mix")
request = Net::HTTP.get_response(uri)
#<Net::HTTPMovedPermanently 301 MOVED PERMANENTLY readbody=true>
The same thing happens with 404's:
# bad path returns a 404
"http://www.mixcloud.com/bad/path/"
# bad path minus trailing slash returns a 301
"http://www.mixcloud.com/bad/path"
How can I 'drill down' into the 301 to see if it takes us on to a valid resource or an error page?
Is there a tool that provides a comprehensive overview of the rules that a particular domain might apply to their urls?
301 redirects are fairly common if you do not type the URL exactly as the web server expects it. They happen much more frequently than you'd think, you just don't normally ever notice them while browsing because the browser does all that automatically for you.
Two alternatives come to mind:
1: Use open-uri
open-uri handles redirects automatically. So all you'd need to do is:
require 'open-uri'
...
response = open('http://xyz...').read
If you have trouble redirecting between HTTP and HTTPS, then have a look here for a solution:
Ruby open-uri redirect forbidden
2: Handle redirects with Net::HTTP
def get_response_with_redirect(uri)
  r = Net::HTTP.get_response(uri)
  if r.code == "301"
    r = Net::HTTP.get_response(URI.parse(r['location']))
  end
  r
end
If you want to be even smarter you could try to add or remove a trailing slash on the URL when you get a 404 response. You could do that by creating a method like get_response_smart which handles this URL fiddling in addition to the redirects (see the sketch below).
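A possible sketch of that idea (the body below is my own guess at such a get_response_smart, not code from the answer): follow a single 301, and on a 404 retry once with the trailing slash toggled.
require 'net/http'
require 'uri'

def get_response_smart(uri)
  r = Net::HTTP.get_response(uri)
  r = Net::HTTP.get_response(URI.parse(r['location'])) if r.code == "301"
  if r.code == "404"
    # retry once with the trailing slash added or removed
    toggled = uri.to_s.end_with?("/") ? uri.to_s.chomp("/") : uri.to_s + "/"
    r = Net::HTTP.get_response(URI.parse(toggled))
    r = Net::HTTP.get_response(URI.parse(r['location'])) if r.code == "301"
  end
  r
end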
I can't figure out how to comment on the accepted answer (this question might be closed), but I should note that r.header is now obsolete, so r.header['location'] should be replaced by r['location'] (per https://stackoverflow.com/a/6934503/1084675 )
rest-client follows redirections for GET and HEAD requests without any additional configuration. It works very nicely.
for result codes between 200 and 207, a RestClient::Response will be returned
for result codes 301, 302 or 307, the redirection will be followed if the request is a GET or a HEAD
for result code 303, the redirection will be followed and the request transformed into a GET
example of usage:
require 'rest-client'
RestClient.get 'http://example.com/resource'
The rest-client README also gives an example of following redirects with POST requests:
begin
  RestClient.post('http://example.com/redirect', 'body')
rescue RestClient::MovedPermanently,
       RestClient::Found,
       RestClient::TemporaryRedirect => err
  err.response.follow_redirection
end
Here is the code I came up with (derived from different examples) which will bail out if there are too many redirects (note that ensure_success is optional):
require "net/http"
require "uri"
class Net::HTTPResponse
  def ensure_success
    unless kind_of? Net::HTTPSuccess
      warn "Request failed with HTTP #{code}"
      each_header do |h, v|
        warn "#{h} => #{v}"
      end
      abort
    end
  end
end
def do_request(uri_string)
  response = nil
  tries = 0
  loop do
    uri = URI.parse(uri_string)
    http = Net::HTTP.new(uri.host, uri.port)
    request = Net::HTTP::Get.new(uri.request_uri)
    response = http.request(request)
    uri_string = response['location'] if response['location']
    unless response.kind_of? Net::HTTPRedirection
      response.ensure_success
      break
    end
    if tries == 10
      puts "Timing out after 10 tries"
      break
    end
    tries += 1
  end
  response
end
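A quick usage sketch with the URL from the question (expect a "200" once the 301 to the trailing-slash URL has been followed):
response = do_request("http://www.mixcloud.com/ErolAlkan/hard-summer-mix")
puts response.code
puts response['content-type']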
Not sure if anyone is looking for this exact solution, but if you are trying to download an image over HTTP/HTTPS and store it in a variable:
require 'open_uri_redirections'
require 'net/https'
web_contents = open('file_url_goes_here', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE, :allow_redirections => :all) {|f| f.read }
puts web_contents

Using Ruby, what is the most efficient way to get the content type of a given URL?

What is the most efficient way to get the content-type of a given URL using Ruby?
This is what I'd do if I want simple code:
require 'open-uri'
str = open('http://example.com')
str.content_type #=> "text/html"
The big advantage is it follows redirects.
If you're checking a bunch of URLs you might want to call close on the handles after you've found what you want.
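A minimal sketch of that pattern, using placeholder URLs:
require 'open-uri'

%w[http://example.com http://example.org].each do |url|
  handle = open(url)                        # follows redirects automatically
  puts "#{url} => #{handle.content_type}"
  handle.close                              # release the handle before the next URL
end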
Take a look at the Net::HTTP library.
require 'net/http'
response = nil
uri, path = 'google.com', '/'
Net::HTTP.start(uri, 80) { |http| response = http.head(path) }
p response['content-type']

use ruby to get content length of URLs

I am trying to write a ruby script that gets some details about files on a website using net/http. My code looks like this:
require 'open-uri'
require 'net/http'

url = URI.parse(asset)
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get(asset)
}
headers = res.to_hash
p headers
I would like to get two pieces of information from this request: the total length of the content inflated, and (as appropriate) the length of the content deflated.
Sometimes, the headers will include a content-length parameter, which appears to be the gzipped length of the content. I can also approximate the inflated size of the content using res.body.length, but this has not been foolproof by any stretch of the imagination. The documentation on net/http says that gzip headers are removed from the list automatically (to help me, gee thanks) so I cannot seem to get a reliable handle on this information.
Any help is appreciated (including other gems if they will do this more easily).
Got it! The "magic" behavior here only occurs if you don't specify your own accept-encoding header. Amended code as follows:
require 'open-uri'
require 'net/http'
require 'date'
require 'zlib'
require 'stringio'

headers = { "accept-encoding" => "gzip;q=1.0,deflate;q=0.6,identity;q=0.3" }
url = URI.parse(asset)
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get(asset, headers)
}
headers = res.to_hash
gzipped = headers['content-encoding'] && headers['content-encoding'][0] == "gzip"
content = gzipped ? Zlib::GzipReader.new(StringIO.new(res.body)).read : res.body
full_length = content.length
compressed_length = (headers["content-length"] && headers["content-length"][0] || res.body.length)
You can try using sockets to send a HEAD request to the server, which is faster (no content is returned), and don't send "Accept-Encoding: gzip", so the response will not be gzipped.
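A bare-bones sketch of that idea (host and path are placeholders): write a HEAD request to a raw socket, leave out Accept-Encoding, and read back just the status line and headers.
require 'socket'

host = 'example.com'
path = '/'

socket = TCPSocket.new(host, 80)
socket.write("HEAD #{path} HTTP/1.1\r\nHost: #{host}\r\nConnection: close\r\n\r\n")
raw = socket.read                     # HEAD has no body, so this is only headers
socket.close

puts raw.split("\r\n\r\n").first      # status line plus headers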

How to make an HTTP GET with modified headers?

What is the best way to make an HTTP GET request in Ruby with modified headers?
I want to get a range of bytes from the end of a log file and have been toying with the following code, but the server is throwing back a response saying that "it is a request that the server could not understand" (the server is Apache).
require 'net/http'
require 'uri'
# with @address, @port, @path all defined elsewhere
httpcall = Net::HTTP.new(@address, @port)
headers = {
  'Range' => 'bytes=1000-'
}
resp, data = httpcall.get2(@path, headers)
Is there a better way to define headers in Ruby?
Does anyone know why this would be failing against Apache? If I do a get in a browser to http://[address]:[port]/[path] I get the data I am seeking without issue.
I created a solution that worked very well for me; this example gets a range offset:
require 'uri'
require 'net/http'

size = 1000 # the last offset (for the Range header)
uri = URI("http://localhost:80/index.html")
http = Net::HTTP.new(uri.host, uri.port)
headers = {
  'Range' => "bytes=#{size}-"
}
path = uri.path.empty? ? "/" : uri.path

# test to ensure that the request will be valid - first get the head
code = http.head(path, headers).code.to_i
if code >= 200 && code < 300
  # the data is available...
  http.get(path, headers) do |chunk|
    # provided the data is good, print it...
    print chunk unless chunk =~ />416.+Range/
  end
end
If you have access to the server logs, try comparing the request from the browser with the one from Ruby and see if that tells you anything. If this isn't practical, fire up Webrick as a mock of the file server. Don't worry about the results, just compare the requests to see what they are doing differently.
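A minimal WEBrick sketch along those lines: a throwaway server that dumps each request's request line and headers so you can diff the browser's request against Ruby's.
require 'webrick'

server = WEBrick::HTTPServer.new(Port: 8000)

server.mount_proc '/' do |req, res|
  puts req.request_line                        # e.g. "GET / HTTP/1.1"
  req.each { |name, value| puts "#{name}: #{value}" }
  res.body = 'ok'
end

trap('INT') { server.shutdown }
server.start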
As for Ruby style, you could move the headers inline, like so:
httpcall = Net::HTTP.new(@address, @port)
resp, data = httpcall.get2(@path, 'Range' => 'bytes=1000-')
Also, note that in Ruby 1.8+ (which you are almost certainly running), Net::HTTP#get2 returns a single HTTPResponse object, not a resp, data pair.
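A minimal sketch of the single-return form (still assuming @address, @port, and @path are defined elsewhere, as in the question):
resp = httpcall.get2(@path, 'Range' => 'bytes=1000-')
puts resp.code
puts resp.body.bytesize if resp.body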
