Wait for selector to present - ruby

When doing web scraping with Nokogiri I occasionally get the following error message
undefined method `at_css' for nil:NilClass (NoMethodError)
I know that the selected element is present at some time, but the site is sometimes a bit slow to respond, and I guess this is the reason why I'm getting the error.
Is there some way to wait until a certain selector is present before proceeding with the script?
My current http request block looks like this
url = URL
body = BODY
uri = URI.parse(url)
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 200 # default 60 seconds
http.open_timeout = 200 # default nil
http.use_ssl = true
request = Net::HTTP::Post.new(uri.request_uri)
request.body = body
request["Content-Type"] = "application/x-www-form-urlencoded"
begin
response = http.request(request)
doc = Nokogiri::HTML(response.body)
rescue
sleep 100
retry
end

While you can use a streaming Net::HTTP like #Stefan says in his comment, and an associated handler that includes Nokogiri, you can't parse a partial HTTP document using a DOM model, which is Nokogiri's default, because it expects the full document also.
You could use Nokogiri's SAX parser, but that's an entirely different programming style.
If you're retrieving an entire page, then use OpenURI instead of the lower-level Net::HTTP. It automatically handles a number of things that Net::HTTP will not do by default, such as redirection, which makes it a lot easier to retrieve pages and will greatly simplify your code.
I suspect the problem is either that the site is timing out, or the tag you're trying to find is dynamically loaded after the real page loads.
If it's timing out you'll need to increase your wait time.
If it's dynamically loading that markup, you can request the main page, locate the appropriate URL for the dynamic content and load it separately. Once you have it, you can either insert it into the first page if you need everything, or just parse it separately.

Related

Skip large pages in Mechanize (Ruby)

I'm trying to skip processing a few large pages (some over 10MB) scattered in a result set, as Mechanize (version 2.7.3) crawls an array of links.
Unfortunately I can't find a 'content-length' property or a similar indicator. The Mechanize::FileResponse class has a content_length method but Mechanize::Page does not.
Current approach
At the moment I'm calling content.length on the page. This is very slow when one of the large pages is crawled:
detail_links.each do |detail_link|
detail_page = detail_link.click
# skip long pages
break if detail_page.content.length > 100_000
# rest of the processing
end
Content_length during response_read:
In the Mechanize source code I found a reference to content_length when the response is read. Is querying the response properties a possible solution?
# agent.rb extract from the Mechanize project
def response_read response, request, uri
content_length = response.content_length
if use_tempfile? content_length then
body_io = make_tempfile 'mechanize-raw'
else
body_io = StringIO.new.set_encoding(Encoding::BINARY)
end
Mechanize will normally "get" the entire page. Instead you should use a head request first to get the page size, then conditionally get the page. See "How can I perform a Head request using mechanize in Ruby" for an example.
The thing to be careful of is that a dynamically generated resource might not have a known size when you do the head request, so you could get a response without the size entry. Notice that in the selected answer for the question linked above, that Google didn't return the content-length header because it's a dynamically generated page. Static pages and resources should have the header... unless the server doesn't return them for some reason.
The Mechanize documentation mentions this:
Problems with content-length
Some sites return an incorrect content-length value. Unlike a browser, mechanize raises an error when the content-length header does not match the response length since it does not know if there was a connection problem or if the mismatch is a server bug.
The error raised, Mechanize::ResponseReadError, can be converted to a parsed Page, File, etc. depending upon the content-type:
agent = Mechanize.new
uri = URI 'http://example/invalid_content_length'
begin
page = agent.get uri
rescue Mechanize::ResponseReadError => e
page = e.force_parse
end
In other words, while head can help, it's not necessarily going to give you enough information to allow you to skip huge pages. You have to investigate the site you're crawling and learn how their server responds.

Maintaining session and cookies over a 302 redirect

I am trying to make fetch a PDF file that gets generated on-demand behind an auth wall. Based on my testing, the flow is as follows:
I make a GET request with several parameters (including auth credentials) to the appropriate page. That page validates my credentials and then processes my request. When the request is finished processing (nearly instantly), I am sent a 302 response that redirects me to the location of the generated PDF. This PDF can then only be accessed by that session.
Using a browser, there's really nothing strange that happens. I attempted to do the same via curl and wget without any optional parameters, but those both failed. I was able to get curl working by adding -L -b /tmp/cookie.txt as options, though (to follow redirects and store cookies).
According to the ruby-doc, using Net::HTTP.start should get me close to what I want. After playing around with it, I was indeed fairly close. I believe the only issue, however, was that my Set-Cookie values were different between requests, even though they were using the same http object in the same start block.
I tried keeping it as simple as possible and then expanding once I got the results I was looking for:
url = URI.parse("http://dev.example.com:8888/path/to/page.jsp?option1=test1&option2=test2&username=user1&password=password1")
Net::HTTP.start(url.host, url.port) do |http|
# Request the first URL
first_req = Net::HTTP::Get.new url
first_res = http.request first_req
# Grab the 302 redirect location (it will always be relative like "../servlet/sendfile/result/543675843657843965743895642865273847328.pdf")
redirect_loc = URI.parse(first_res['Location']
# Request the PDF
second_req = Net::HTTP::Get.new redirect_loc
second_res = http.request first_req
end
I also attempted to use http.get instead of creating a new request each time, but still no luck.
The problem is with cookie: it should be passed within the second request. Smth like:
second_req = Net::HTTP::Get.new(uri.path, {'Cookie' => first_req['Set-Cookie']})

How to properly close a HTTP connection in Ruby

I've been looking for a proper way to close a HTTP connection and have found nothing yet.
require 'net/http'
require 'uri'
uri = URI.parse("http://www.google.com")
http = Net::HTTP.new(uri.host, uri.port)
request = Net::HTTP::Get.new("/")
http.start
resp = http.request request
p resp.body
So far so good, but now when I try to close the connection guessing which method to use:
http.shutdown # error
http.close # error
when I check ri Net::HTTP.start, the documentation explicitly says
the caller is responsible for closing it (the connection) upon completion
I don't want to use the block form.
The method to close the connection is http.finish.
The net/http API is particularly confusing to use. It is generally easier to use a higher-level library instead, such as HTTParty or REST Client, which will provide a more intuitive API and take care of the lower level details for you.
If the optional block is given, the newly created Net::HTTP object is passed to it and closed when the block finishes.
from ruby-doc.org

Fetching only X/HTML links (not images) based on mime type

I'm crawling a site using Ruby + OpenURI + Nokogiri. Fetch a page, find all the a[href] and (if they're in the same domain and right protocol) follow them to crawl again.
Sometimes there are links to large binaries (e.g. jpeg, exe), and I don't want to crawl those.
I tried using the HTTP "Accept" header to get an error or empty response for the wrong mime types like so:
require 'open-uri'
page = open(url, 'Accept'=>'text/html,application/xhtml+xml,application/xml')
...but OpenURI still downloads binaries sent with another mime type.
Other than looking at file extensions in the url for a probable file type, how can I prevent the download (or detect a conflicting response type) for an arbitrary URL?
You could send a HEAD request first, then check the Content-type header of the response and only make the real request if it’s acceptable:
ACCEPTABLE_TYPES = %w{text/html application/xhtml+xml application/xml}
uri = URI(url)
type = Net::HTTP.start(uri.host, uri.port) do |http|
http.head(uri.path).content_type
end
if ACCEPTABLE_TYPES.include? type
# fetch the url
else
# do whatever
end
This will need an extra request for each page, but I can’t see a way of avoiding it. It also relies on the server sending the same headers for a HEAD request as it does for a GET, which I think is a reasonable assumption but something to be aware of.

How do I read only x number of bytes of the body using Net::HTTP?

It seems like the methods of Ruby's Net::HTTP are all or nothing when it comes to reading the body of a web page. How can I read, say, the just the first 100 bytes of the body?
I am trying to read from a content server that returns a short error message in the body of the response if the file requested isn't available. I need to read enough of the body to determine whether the file is there. The files are huge, so I don't want to get the whole body just to check if the file is available.
This is an old thread, but the question of how to read only a portion of a file via HTTP in Ruby is still a mostly unanswered one according to my research. Here's a solution I came up with by monkey-patching Net::HTTP a bit:
require 'net/http'
# provide access to the actual socket
class Net::HTTPResponse
attr_reader :socket
end
uri = URI("http://www.example.com/path/to/file")
begin
Net::HTTP.start(uri.host, uri.port) do |http|
request = Net::HTTP::Get.new(uri.request_uri)
# calling request with a block prevents body from being read
http.request(request) do |response|
# do whatever limited reading you want to do with the socket
x = response.socket.read(100);
# be sure to call finish before exiting the block
http.finish
end
end
rescue IOError
# ignore
end
The rescue catches the IOError that's thrown when you call HTTP.finish prematurely.
FYI, the socket within the HTTPResponse object isn't a true IO object (it's an internal class called BufferedIO), but it's pretty easy to monkey-patch that, too, to mimic the IO methods you need. For example, another library I was using (exifr) needed the readchar method, which was easy to add:
class Net::BufferedIO
def readchar
read(1)[0].ord
end
end
Shouldn't you just use an HTTP HEAD request (Ruby Net::HTTP::Head method) to see if the resource is there, and only proceed if you get a 2xx or 3xx response? This presumes your server is configured to return a 4xx error code if the document is not available. I would argue this was the correct solution.
An alternative is to request the HTTP head and look at the content-length header value in the result: if your server is correctly configured, you should easily be able to tell the difference in length between a short message and a long document. Another alternative: set the content-range header field in the request (which again assumes that the server is behaving correctly WRT the HTTP spec).
I don't think that solving the problem in the client after you've sent the GET request is the way to go: by that time, the network has done the heavy lifting, and you won't really save any wasted resources.
Reference: http header definitions
I wanted to do this once, and the only thing that I could think of is monkey patching the Net::HTTP#read_body and Net::HTTP#read_body_0 methods to accept a length parameter, and then in the former just pass the length parameter to the read_body_0 method, where you can read only as much as length bytes.
To read the body of an HTTP request in chunks, you'll need to use Net::HTTPResponse#read_body like this:
http.request_get('/large_resource') do |response|
response.read_body do |segment|
print segment
end
end
Are you sure the content server only returns a short error page?
Doesn't it also set the HTTPResponse to something appropriate like 404. In which case you can trap the HTTPClientError derived exception (most likely HTTPNotFound) which is raised when accessing Net::HTTP.value().
If you get an error then your file wasn't there if you get 200 the file is starting to download and you can close the connection.
You can't. But why do you need to? Surely if the page just says that the file isn't available then it won't be a huge page (i.e. by definition, the file won't be there)?

Resources