Skip large pages in Mechanize (Ruby)

I'm trying to skip processing a few large pages (some over 10MB) scattered in a result set, as Mechanize (version 2.7.3) crawls an array of links.
Unfortunately I can't find a 'content-length' property or a similar indicator. The Mechanize::FileResponse class has a content_length method but Mechanize::Page does not.
Current approach
At the moment I'm calling content.length on the page. This is very slow when one of the large pages is crawled:
detail_links.each do |detail_link|
  detail_page = detail_link.click
  # skip long pages
  next if detail_page.content.length > 100_000
  # rest of the processing
end
Content_length during response_read:
In the Mechanize source code I found a reference to content_length when the response is read. Is querying the response properties a possible solution?
# agent.rb extract from the Mechanize project
def response_read response, request, uri
  content_length = response.content_length

  if use_tempfile? content_length then
    body_io = make_tempfile 'mechanize-raw'
  else
    body_io = StringIO.new.set_encoding(Encoding::BINARY)
  end

Mechanize will normally "get" the entire page. Instead, you should issue a HEAD request first to get the page size, then conditionally GET the page. See "How can I perform a Head request using mechanize in Ruby" for an example.
The thing to be careful of is that a dynamically generated resource might not have a known size when you do the HEAD request, so you could get a response without the size entry. Notice in the selected answer for the question linked above that Google didn't return the Content-Length header, because it's a dynamically generated page. Static pages and resources should have the header... unless the server doesn't send it for some reason.
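For example, something along these lines could replace the loop in the question. This is only a sketch: it assumes agent is the Mechanize instance that produced detail_links, that the server includes Content-Length in its HEAD responses, and that the response headers are reachable through the returned object's #response hash (keyed by lower-case header names).

MAX_BYTES = 100_000

detail_links.each do |detail_link|
  # HEAD the target first to learn its size without downloading the body.
  head_result = agent.head(detail_link.href)
  reported_length = head_result.response['content-length'].to_i

  # Skip pages the server reports as large. A zero here usually means the
  # header was missing (e.g. a dynamically generated page), so decide per
  # site whether to fetch or skip those.
  next if reported_length > MAX_BYTES

  detail_page = detail_link.click
  # rest of the processing
end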
The Mechanize documentation mentions this:
Problems with content-length
Some sites return an incorrect content-length value. Unlike a browser, mechanize raises an error when the content-length header does not match the response length since it does not know if there was a connection problem or if the mismatch is a server bug.
The error raised, Mechanize::ResponseReadError, can be converted to a parsed Page, File, etc. depending upon the content-type:
agent = Mechanize.new
uri = URI 'http://example/invalid_content_length'

begin
  page = agent.get uri
rescue Mechanize::ResponseReadError => e
  page = e.force_parse
end
In other words, while HEAD can help, it's not necessarily going to give you enough information to let you skip huge pages. You have to investigate the site you're crawling and learn how its server responds.

Related

Wait for a selector to be present

When doing web scraping with Nokogiri I occasionally get the following error message:
undefined method `at_css' for nil:NilClass (NoMethodError)
I know the selected element is present at some point, but the site is sometimes a bit slow to respond, and I suspect that's why I'm getting the error.
Is there some way to wait until a certain selector is present before proceeding with the script?
My current HTTP request block looks like this:
url = URL
body = BODY

uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 200 # default 60 seconds
http.open_timeout = 200 # default nil
http.use_ssl = true

request = Net::HTTP::Post.new(uri.request_uri)
request.body = body
request["Content-Type"] = "application/x-www-form-urlencoded"

begin
  response = http.request(request)
  doc = Nokogiri::HTML(response.body)
rescue
  sleep 100
  retry
end
While you can use a streaming Net::HTTP, as @Stefan says in his comment, along with an associated handler that feeds Nokogiri, you can't parse a partial HTTP document with a DOM parser (Nokogiri's default) because it expects the full document.
You could use Nokogiri's SAX parser, but that's an entirely different programming style.
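For completeness, this is roughly what the SAX style looks like; a hypothetical handler that just counts anchor tags as the markup streams through, without ever building a DOM:

require 'nokogiri'

# Hypothetical SAX handler: count <a> elements without building a DOM.
class LinkCounter < Nokogiri::XML::SAX::Document
  attr_reader :count

  def initialize
    super
    @count = 0
  end

  def start_element(name, _attrs = [])
    @count += 1 if name == 'a'
  end
end

counter = LinkCounter.new
Nokogiri::HTML::SAX::Parser.new(counter).parse(html) # html: the markup you fetched
puts counter.count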
If you're retrieving an entire page, then use OpenURI instead of the lower-level Net::HTTP. It automatically handles a number of things that Net::HTTP will not do by default, such as redirection, which makes it a lot easier to retrieve pages and will greatly simplify your code.
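A minimal sketch of that, assuming the resource can be fetched with a plain GET (OpenURI doesn't do POSTs) and using a placeholder URL:

require 'open-uri'
require 'nokogiri'

# OpenURI follows redirects for us and accepts timeout options directly.
html = URI.open('http://example.com/page', read_timeout: 200, open_timeout: 200)
doc  = Nokogiri::HTML(html)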
I suspect the problem is either that the site is timing out, or the tag you're trying to find is dynamically loaded after the real page loads.
If it's timing out you'll need to increase your wait time.
If it's dynamically loading that markup, you can request the main page, locate the appropriate URL for the dynamic content and load it separately. Once you have it, you can either insert it into the first page if you need everything, or just parse it separately.
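If it is the latter, the flow looks roughly like this. The URL, selector and attribute below are placeholders; inspect the real page to find where it points at its dynamic content.

require 'open-uri'
require 'nokogiri'

main_url = 'http://example.com/page' # hypothetical
main_doc = Nokogiri::HTML(URI.open(main_url))

# Find the address the page loads its dynamic content from. The selector and
# attribute here are placeholders specific to each site.
el = main_doc.at_css('div#content[data-src]')
fragment_href = el && el['data-src']

if fragment_href
  fragment_url = URI.join(main_url, fragment_href)
  fragment_doc = Nokogiri::HTML(URI.open(fragment_url.to_s))
  # parse fragment_doc separately, or graft its nodes into main_doc if you need everything
end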

Maintaining session and cookies over a 302 redirect

I am trying to fetch a PDF file that gets generated on-demand behind an auth wall. Based on my testing, the flow is as follows:
I make a GET request with several parameters (including auth credentials) to the appropriate page. That page validates my credentials and then processes my request. When the request is finished processing (nearly instantly), I am sent a 302 response that redirects me to the location of the generated PDF. This PDF can then only be accessed by that session.
Using a browser, there's really nothing strange that happens. I attempted to do the same via curl and wget without any optional parameters, but those both failed. I was able to get curl working by adding -L -b /tmp/cookie.txt as options, though (to follow redirects and store cookies).
According to the ruby-doc, using Net::HTTP.start should get me close to what I want. After playing around with it, I was indeed fairly close. I believe the only issue, however, was that my Set-Cookie values were different between requests, even though they were using the same http object in the same start block.
I tried keeping it as simple as possible and then expanding once I got the results I was looking for:
url = URI.parse("http://dev.example.com:8888/path/to/page.jsp?option1=test1&option2=test2&username=user1&password=password1")

Net::HTTP.start(url.host, url.port) do |http|
  # Request the first URL
  first_req = Net::HTTP::Get.new url
  first_res = http.request first_req

  # Grab the 302 redirect location (it will always be relative like "../servlet/sendfile/result/543675843657843965743895642865273847328.pdf")
  redirect_loc = URI.parse(first_res['Location'])

  # Request the PDF
  second_req = Net::HTTP::Get.new redirect_loc
  second_res = http.request second_req
end
I also attempted to use http.get instead of creating a new request each time, but still no luck.
The problem is with the cookie: it needs to be passed along with the second request. Something like:
second_req = Net::HTTP::Get.new(redirect_loc, { 'Cookie' => first_res['Set-Cookie'] })
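Putting that together with the code from the question, a sketch of the whole exchange. It assumes the session lives in a single Set-Cookie header on the first response and that the Location header is relative to the first URL; multiple cookies would need first_res.get_fields('Set-Cookie') instead.

require 'net/http'

url = URI.parse("http://dev.example.com:8888/path/to/page.jsp?option1=test1&option2=test2&username=user1&password=password1")

Net::HTTP.start(url.host, url.port) do |http|
  # First request: authenticates, processes, and answers with a 302 plus a session cookie
  first_req = Net::HTTP::Get.new url
  first_res = http.request first_req

  # Resolve the relative redirect target against the original URL
  redirect_loc = URI.join(url, first_res['Location'])

  # Second request: same connection, carrying the session cookie along
  second_req = Net::HTTP::Get.new redirect_loc
  second_req['Cookie'] = first_res['Set-Cookie']
  second_res = http.request second_req

  File.binwrite('result.pdf', second_res.body) if second_res.is_a?(Net::HTTPSuccess)
end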

To 406 or not to 406 (http status code)

I'm developing a RESTful web application in Ruby with Sinatra. It should support CRUD operations, and to respond to Read requests I have the following function that formats the data according to what the request specified:
def handleResponse(data, haml_path, haml_locals)
  case true
  when request.accept.include?("application/json") # JSON requested
    return data.to_json
  when request.accept.include?("text/html") # HTML requested
    return haml(haml_path.to_sym, :locals => haml_locals, :layout => !request.xhr?)
  else # Unknown/unsupported type requested
    return 406 # Not acceptable
  end
end
Only I don't know what is best to do in the else statement. The main problem is that browsers and jQuery AJAX will accept */*, so technically a 406 error is not really the best idea. But: what do I send? I could do data.to_s which is meaningless. I could send what HAML returns, but they didn't ask for text/html and I would rather notify them of that somehow.
Secondly, supposing the 406 code is the right way to go, how do I format the response to be valid according to the W3 spec?
Unless it was a HEAD request, the response SHOULD include an entity containing a list of available entity characteristics and location(s) from which the user or user agent can choose the one most appropriate. The entity format is specified by the media type given in the Content-Type header field. Depending upon the format and the capabilities of the user agent, selection of the most appropriate choice MAY be performed automatically. However, this specification does not define any standard for such automatic selection.
It looks like you're trying to do a clearing-house method for all the data types you could return, but that can be confusing for the user of the API. Instead, they should know that a particular URL will always return the same data type.
For my in-house REST APIs, I create certain URLs that return HTML for documentation, and others that return JSON for data. If the user crosses the streams, they'll do it during their development phase and they'll get some data they didn't expect and will fix it.
If I had to use something like what you're writing, and the client can't handle 'application/json' or 'text/html', I'd return 'text/plain', send data.to_s, and let them sort out the mess. JSON and HTML are pretty well-established standards now.
Here's the doc for Setting Sinatra response headers.
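Concretely, a sketch of that fallback: a variation of the handler from the question that returns plain text instead of a 406 and sets the Content-Type with Sinatra's content_type helper (so it has to live inside a Sinatra app):

def handleResponse(data, haml_path, haml_locals)
  if request.accept.include?("application/json")   # JSON requested
    content_type 'application/json'
    data.to_json
  elsif request.accept.include?("text/html")       # HTML requested
    haml(haml_path.to_sym, :locals => haml_locals, :layout => !request.xhr?)
  else                                             # anything else gets plain text
    content_type 'text/plain'
    data.to_s
  end
end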

Fetching only X/HTML links (not images) based on mime type

I'm crawling a site using Ruby + OpenURI + Nokogiri. I fetch a page, find all the a[href] links and (if they're in the same domain and the right protocol) follow them to crawl again.
Sometimes there are links to large binaries (e.g. jpeg, exe), and I don't want to crawl those.
I tried using the HTTP "Accept" header to get an error or empty response for the wrong mime types like so:
require 'open-uri'
page = open(url, 'Accept'=>'text/html,application/xhtml+xml,application/xml')
...but OpenURI still downloads binaries sent with another mime type.
Other than looking at file extensions in the url for a probable file type, how can I prevent the download (or detect a conflicting response type) for an arbitrary URL?
You could send a HEAD request first, then check the Content-Type header of the response and only make the real request if it’s acceptable:
ACCEPTABLE_TYPES = %w{text/html application/xhtml+xml application/xml}

uri = URI(url)
type = Net::HTTP.start(uri.host, uri.port) do |http|
  http.head(uri.path).content_type
end

if ACCEPTABLE_TYPES.include? type
  # fetch the url
else
  # do whatever
end
This will need an extra request for each page, but I can’t see a way of avoiding it. It also relies on the server sending the same headers for a HEAD request as it does for a GET, which I think is a reasonable assumption but something to be aware of.

How do I read only x number of bytes of the body using Net::HTTP?

It seems like the methods of Ruby's Net::HTTP are all or nothing when it comes to reading the body of a web page. How can I read, say, just the first 100 bytes of the body?
I am trying to read from a content server that returns a short error message in the body of the response if the file requested isn't available. I need to read enough of the body to determine whether the file is there. The files are huge, so I don't want to get the whole body just to check if the file is available.
This is an old thread, but the question of how to read only a portion of a file via HTTP in Ruby is still a mostly unanswered one according to my research. Here's a solution I came up with by monkey-patching Net::HTTP a bit:
require 'net/http'

# provide access to the actual socket
class Net::HTTPResponse
  attr_reader :socket
end

uri = URI("http://www.example.com/path/to/file")

begin
  Net::HTTP.start(uri.host, uri.port) do |http|
    request = Net::HTTP::Get.new(uri.request_uri)
    # calling request with a block prevents body from being read
    http.request(request) do |response|
      # do whatever limited reading you want to do with the socket
      x = response.socket.read(100)
      # be sure to call finish before exiting the block
      http.finish
    end
  end
rescue IOError
  # ignore
end
The rescue catches the IOError that's raised when you call http.finish prematurely.
FYI, the socket within the HTTPResponse object isn't a true IO object (it's an internal class called BufferedIO), but it's pretty easy to monkey-patch that, too, to mimic the IO methods you need. For example, another library I was using (exifr) needed the readchar method, which was easy to add:
class Net::BufferedIO
  def readchar
    read(1)[0].ord
  end
end
Shouldn't you just use an HTTP HEAD request (Ruby's Net::HTTP#head method, or a Net::HTTP::Head request object) to see if the resource is there, and only proceed if you get a 2xx or 3xx response? This presumes your server is configured to return a 4xx error code if the document is not available. I would argue this is the correct solution.
An alternative is to issue the HEAD request and look at the Content-Length header value in the result: if your server is correctly configured, you should easily be able to tell the difference in length between a short message and a long document. Another alternative: set the Range header field in the request to ask for only the first few bytes (which again assumes that the server is behaving correctly with respect to the HTTP spec).
I don't think that solving the problem in the client after you've sent the GET request is the way to go: by that time, the network has done the heavy lifting, and you won't really save any wasted resources.
Reference: http header definitions
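A sketch of that HEAD-first check with Net::HTTP (hypothetical URL; it assumes the server reports Content-Length and answers with a 4xx status when the file is missing):

require 'net/http'

uri = URI('http://content.example.com/path/to/huge_file') # hypothetical URL

Net::HTTP.start(uri.host, uri.port) do |http|
  head = http.head(uri.request_uri)

  if head.is_a?(Net::HTTPSuccess) && head.content_length.to_i > 1_000
    # Plausibly the real file rather than a short error message: fetch it.
    File.binwrite('huge_file', http.get(uri.request_uri).body)
  else
    # 4xx/5xx, no Content-Length, or a suspiciously small body:
    # treat the file as unavailable.
  end
end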
I wanted to do this once, and the only thing I could think of was monkey-patching the Net::HTTPResponse#read_body and #read_body_0 methods to accept a length parameter, then having the former pass the length through to read_body_0, where you can read at most length bytes.
To read the body of an HTTP response in chunks, you'll need to use Net::HTTPResponse#read_body like this:
http.request_get('/large_resource') do |response|
  response.read_body do |segment|
    print segment
  end
end
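If you only need the first part of the body, you can stop the streaming read early. A sketch (hypothetical host and path): throw/catch unwinds out of the streaming block, and the start block then closes the connection, so the rest of the body is never pulled down.

require 'net/http'

MAX_BYTES = 100
body_prefix = ''.b # binary buffer for the raw segments

catch(:enough) do
  Net::HTTP.start('www.example.com', 80) do |http|
    http.request_get('/large_resource') do |response|
      response.read_body do |segment|
        body_prefix << segment
        # Stop once we have enough; the connection is abandoned and closed
        # by the enclosing start block, so don't try to reuse it.
        throw :enough if body_prefix.bytesize >= MAX_BYTES
      end
    end
  end
end

puts body_prefix.byteslice(0, MAX_BYTES)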
Are you sure the content server only returns a short error page?
Doesn't it also set the HTTP response status to something appropriate, like a 404? In that case you can trap the exception that Net::HTTPResponse#value raises for a client-error response (most likely a 404 Not Found).
If you get an error, your file wasn't there; if you get a 200, the file is starting to download and you can close the connection.
You can't. But why do you need to? Surely if the page just says that the file isn't available then it won't be a huge page (i.e. by definition, the file won't be there)?
