Fetching only X/HTML links (not images) based on mime type - ruby

I'm crawling a site using Ruby + OpenURI + Nokogiri: fetch a page, find all the a[href] links and (if they're in the same domain and use the right protocol) follow them to crawl further.
Sometimes there are links to large binaries (e.g. JPEG, EXE), and I don't want to crawl those.
I tried using the HTTP "Accept" header to get an error or empty response for the wrong MIME types, like so:
require 'open-uri'
page = open(url, 'Accept' => 'text/html,application/xhtml+xml,application/xml')
...but OpenURI still downloads binaries sent with another MIME type.
Other than looking at the file extension in the URL for a probable file type, how can I prevent the download (or detect a conflicting response type) for an arbitrary URL?

You could send a HEAD request first, then check the Content-Type header of the response and only make the real request if it's acceptable:
require 'net/http'
require 'uri'

ACCEPTABLE_TYPES = %w{text/html application/xhtml+xml application/xml}

uri = URI(url)
type = Net::HTTP.start(uri.host, uri.port) do |http|
  # request_uri keeps any query string; content_type is the media type
  # without its charset parameter
  http.head(uri.request_uri).content_type
end

if ACCEPTABLE_TYPES.include? type
  # fetch the url
else
  # do whatever
end
This will need an extra request for each page, but I can’t see a way of avoiding it. It also relies on the server sending the same headers for a HEAD request as it does for a GET, which I think is a reasonable assumption but something to be aware of.
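That check can also be wrapped in a small helper the crawl loop calls before following each link. This is only a sketch building on the answer above: crawlable? is an illustrative name, and the open-uri/Nokogiri calls mirror the question.
require 'net/http'
require 'open-uri'
require 'nokogiri'

ACCEPTABLE_TYPES = %w{text/html application/xhtml+xml application/xml}

# Illustrative helper: true only when the server reports an acceptable
# Content-Type in response to a HEAD request.
def crawlable?(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    ACCEPTABLE_TYPES.include?(http.head(uri.request_uri).content_type)
  end
end

# In the crawl loop:
if crawlable?(link)
  page = Nokogiri::HTML(open(link))   # open-uri's Kernel#open, as in the question
  # ...queue page.css('a[href]') for further crawling...
end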

Related

What is the "accept" part for?

When connecting to a website using Net::HTTP you can parse the URL and inspect each of the request headers with #each_header (or dump them all with #to_hash). I understand what the encoding and the user agent and such mean, but not what the "accept"=>["*/*"] part is. Is this the accepted payload? Or is it something else?
require 'net/http'

uri = URI('http://www.bible-history.com/subcat.php?id=2')
http_request = Net::HTTP::Get.new(uri)
p http_request.to_hash
# => {"accept-encoding"=>["gzip;q=1.0,deflate;q=0.6,identity;q=0.3"], "accept"=>["*/*"], "user-agent"=>["Ruby"], "host"=>["www.bible-history.com"]}
From https://www.w3.org/Protocols/HTTP/HTRQ_Headers.html#z3
This field contains a semicolon-separated list of representation schemes ( Content-Type metainformation values) which will be accepted in the response to this request.
Basically, it specifies what kinds of content you can read back. If you write an API client, you may only be interested in application/json, for example (and you couldn't care less about text/html).
In this case, your header would look like this:
Accept: application/json
And the app will know not to send any HTML your way.
Using the Accept header, the client can specify the MIME types it is willing to accept for the requested URL. If the requested resource is available in multiple representations (e.g. an image as PNG, JPG or SVG), the user agent can specify that it wants the PNG version only. It is up to the server to honor this request.
In your example, the request header specifies that you are willing to accept any content type.
The header is defined in RFC 2616.
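For example, to ask a JSON API for JSON only, you set the header on the request before sending it. A minimal sketch (the URL is a placeholder):
require 'net/http'

uri = URI('http://www.example.com/api/resource')
request = Net::HTTP::Get.new(uri)
request['Accept'] = 'application/json'   # replaces the default "*/*"

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts response.content_type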

Wait for selector to present

When doing web scraping with Nokogiri I occasionally get the following error message
undefined method `at_css' for nil:NilClass (NoMethodError)
I know that the selected element is present at some time, but the site is sometimes a bit slow to respond, and I guess this is the reason why I'm getting the error.
Is there some way to wait until a certain selector is present before proceeding with the script?
My current HTTP request block looks like this:
require 'net/http'
require 'nokogiri'

url  = URL
body = BODY

uri  = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 200 # default 60 seconds
http.open_timeout = 200 # default nil
http.use_ssl = true

request = Net::HTTP::Post.new(uri.request_uri)
request.body = body
request["Content-Type"] = "application/x-www-form-urlencoded"

begin
  response = http.request(request)
  doc = Nokogiri::HTML(response.body)
rescue
  sleep 100
  retry
end
While you can use a streaming Net::HTTP like @Stefan says in his comment, with a handler that feeds Nokogiri, you can't parse a partially downloaded document with a DOM model, which is Nokogiri's default, because a DOM parser expects the full document.
You could use Nokogiri's SAX parser, but that's an entirely different programming style.
If you're retrieving an entire page, then use OpenURI instead of the lower-level Net::HTTP. It automatically handles a number of things that Net::HTTP will not do by default, such as redirection, which makes it a lot easier to retrieve pages and will greatly simplify your code.
I suspect the problem is either that the site is timing out, or the tag you're trying to find is dynamically loaded after the real page loads.
If it's timing out you'll need to increase your wait time.
If it's dynamically loading that markup, you can request the main page, locate the appropriate URL for the dynamic content and load it separately. Once you have it, you can either insert it into the first page if you need everything, or just parse it separately.
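As a rough sketch of the OpenURI route suggested above, for pages that can be fetched with a plain GET (OpenURI does not send POST bodies), with a longer timeout and a bounded retry on timeouts only; URL stands in for the real address, as in the question:
require 'open-uri'
require 'nokogiri'

attempts = 0
begin
  html = open(URL, read_timeout: 200).read   # open-uri follows redirects for you
  doc  = Nokogiri::HTML(html)
rescue Net::OpenTimeout, Net::ReadTimeout
  attempts += 1
  retry if attempts < 3   # retry on timeouts only, and not forever
  raise
end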

Skip large pages in Mechanize (Ruby)

I'm trying to skip processing a few large pages (some over 10MB) scattered in a result set, as Mechanize (version 2.7.3) crawls an array of links.
Unfortunately I can't find a 'content-length' property or a similar indicator. The Mechanize::FileResponse class has a content_length method but Mechanize::Page does not.
Current approach
At the moment I'm calling content.length on the page. This is very slow when one of the large pages is crawled:
detail_links.each do |detail_link|
  detail_page = detail_link.click

  # skip long pages
  next if detail_page.content.length > 100_000

  # rest of the processing
end
Content_length during response_read:
In the Mechanize source code I found a reference to content_length when the response is read. Is querying the response properties a possible solution?
# agent.rb extract from the Mechanize project
def response_read response, request, uri
  content_length = response.content_length

  if use_tempfile? content_length then
    body_io = make_tempfile 'mechanize-raw'
  else
    body_io = StringIO.new.set_encoding(Encoding::BINARY)
  end
  # ...
Mechanize will normally "get" the entire page. Instead you should use a HEAD request first to get the page size, then conditionally get the page. See "How can I perform a Head request using mechanize in Ruby" for an example.
The thing to be careful of is that a dynamically generated resource might not have a known size when you do the HEAD request, so you could get a response without the size entry. Notice that in the selected answer for the question linked above, Google didn't return the Content-Length header because the page is dynamically generated. Static pages and resources should have the header... unless the server doesn't return it for some reason.
The Mechanize documentation mentions this:
Problems with content-length
Some sites return an incorrect content-length value. Unlike a browser, mechanize raises an error when the content-length header does not match the response length since it does not know if there was a connection problem or if the mismatch is a server bug.
The error raised, Mechanize::ResponseReadError, can be converted to a parsed Page, File, etc. depending upon the content-type:
agent = Mechanize.new
uri = URI 'http://example/invalid_content_length'

begin
  page = agent.get uri
rescue Mechanize::ResponseReadError => e
  page = e.force_parse
end
In other words, while HEAD can help, it's not necessarily going to give you enough information to allow you to skip huge pages. You have to investigate the site you're crawling and learn how its server responds.
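As a sketch of that HEAD-first approach (agent and detail_links are the Mechanize instance and link collection from the question; header access via #header is as on Mechanize::File):
require 'mechanize'

MAX_BYTES = 100_000

detail_links.each do |detail_link|
  # Ask for the headers only; dynamically generated pages may omit Content-Length
  head_response = agent.head(detail_link.uri)
  length = head_response.header['content-length']

  # Only skip when the server actually reports a size
  next if length && length.to_i > MAX_BYTES

  detail_page = detail_link.click
  # rest of the processing
end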

Maintaining session and cookies over a 302 redirect

I am trying to fetch a PDF file that gets generated on demand behind an auth wall. Based on my testing, the flow is as follows:
I make a GET request with several parameters (including auth credentials) to the appropriate page. That page validates my credentials and then processes my request. When the request is finished processing (nearly instantly), I am sent a 302 response that redirects me to the location of the generated PDF. This PDF can then only be accessed by that session.
Using a browser, there's really nothing strange that happens. I attempted to do the same via curl and wget without any optional parameters, but those both failed. I was able to get curl working by adding -L -b /tmp/cookie.txt as options, though (to follow redirects and store cookies).
According to the ruby-doc, using Net::HTTP.start should get me close to what I want. After playing around with it, I was indeed fairly close. I believe the only issue, however, was that my Set-Cookie values were different between requests, even though they were using the same http object in the same start block.
I tried keeping it as simple as possible and then expanding once I got the results I was looking for:
require 'net/http'
require 'uri'

url = URI.parse("http://dev.example.com:8888/path/to/page.jsp?option1=test1&option2=test2&username=user1&password=password1")

Net::HTTP.start(url.host, url.port) do |http|
  # Request the first URL
  first_req = Net::HTTP::Get.new url
  first_res = http.request first_req

  # Grab the 302 redirect location (it will always be relative, like
  # "../servlet/sendfile/result/543675843657843965743895642865273847328.pdf")
  redirect_loc = URI.parse(first_res['Location'])

  # Request the PDF
  second_req = Net::HTTP::Get.new redirect_loc
  second_res = http.request second_req
end
I also attempted to use http.get instead of creating a new request each time, but still no luck.
The problem is with the cookie: it has to be passed along with the second request, something like:
second_req = Net::HTTP::Get.new(redirect_loc.path, 'Cookie' => first_res['Set-Cookie'])
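Putting it together, here is a sketch of the corrected flow. It assumes a single session cookie (multiple Set-Cookie headers would need real cookie parsing) and resolves the relative Location against the original URL:
require 'net/http'
require 'uri'

url = URI.parse("http://dev.example.com:8888/path/to/page.jsp?option1=test1&option2=test2&username=user1&password=password1")

pdf = Net::HTTP.start(url.host, url.port) do |http|
  # First request: authenticates, sets the session cookie and returns a 302
  first_res = http.request(Net::HTTP::Get.new(url))

  # Resolve the relative Location ("../servlet/...") against the original URL
  redirect_loc = url + first_res['Location']

  # Second request: send the session cookie back so the PDF is served
  second_req = Net::HTTP::Get.new(redirect_loc.request_uri)
  second_req['Cookie'] = first_res['Set-Cookie']
  http.request(second_req).body
end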

Output Net::HTTP request to human readable form

I am trying to debug a request that I am making to a server, but I can't figure out what is wrong with it, as it seems to be just like the request that I am putting through on RESTClient.
I am initializing it like so:
request = Net::HTTP::Post.new(url.to_s)
request.add_field "HeaderKey", "HeaderValue"
request.body = requestBody
and then I am executing it like so:
Net::HTTP.start(url.host, url.port) do |http|
  response = http.request(request)
end
The requestBody is a string that is encoded with Base64.encode64.
Is there a way to output the request to see exactly where it's going and with what contents? I've used Paros for checking my iOS connections and I can also output a description of requests from most platforms I've worked with, but I can't figure it out for Ruby.
I've found that HTTP Scoop works pretty well for grabbing network traffic (disclaimer - it's not free!)
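Another option that needs no extra tooling: Net::HTTP can dump the raw request and response for you via set_debug_output (its docs warn it is for debugging only, never production). A sketch reusing url and request from the question:
require 'net/http'

http = Net::HTTP.new(url.host, url.port)
http.set_debug_output($stdout)   # prints the raw wire traffic to stdout

http.start do |conn|
  conn.request(request)          # `request` is the Net::HTTP::Post built above
end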
