Ruby: How to screen-scrape the result of an Ajax request

I have written a ruby script to screen scrape something using the 'open-uri' and 'hpricot' gems - everything works great so far.
But now I have to screen scrape something which is returned after a form is submitted via a javascript function (called by an 'onchange' event handler from a drop-down menu):
function submit_form() {
document.list.action="/some/sort/of/path";
document.list.submit();
}
AFAIK, open-uri lets you submit only GET requests. And if I'm not mistaken, a POST request would be needed here.
So my question is: what do I need to install and 'require', and what would the Ruby code look like (to make that POST request)? Sorry, I'm still pretty much a n00b...
Thank you very much for your help!
Tom

I think you definitely should use Mechanize. It provides a nifty interface to interact with remote pages, forms on them, and so forth (see this example).

The Ruby standard library has the Net::HTTP class, which supports POST requests natively.
require 'net/http'
# 'field' => 'value' is a placeholder; use the real form's input names
Net::HTTP.post_form(URI.parse('http://www.example.com/some/sort/of/path'), 'field' => 'value')
If you find that API less than optimal, take a look at the httparty gem.
Finally, while hpricot is a great gem, it isn't actively developed any longer. You should consider moving to nokogiri which practically replaces hpricot and improves upon it.
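For the form in the question, a minimal sketch with only the standard library could look like the following. It assumes the form posts to the path set in submit_form(); the helper names (build_form_post, submit_form) and the 'category' field are made up for illustration, so substitute the names of the real form's inputs.

```ruby
require 'net/http'
require 'uri'

# Build the POST request that the browser would send when the form is
# submitted. The field names passed in are hypothetical placeholders.
def build_form_post(url, fields)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri)
  request.set_form_data(fields) # URL-encodes the fields into the body
  [uri, request]
end

# Send the request and return the response body (the HTML to scrape).
def submit_form(url, fields)
  uri, request = build_form_post(url, fields)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.body
end

# Usage (performs a real network request):
#   html = submit_form('http://www.example.com/some/sort/of/path', 'category' => 'books')
```

The returned string can then be handed to hpricot or nokogiri in the same way as the result of open-uri.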

Related

Fastest Net::HTTP/Net::HTTPS wrapper for Ruby

What is the fastest way to download a webpage for parsing in Ruby? I've tried open-uri and HTTParty; both seem to take roughly 25 seconds to download simple webpages (I've tried multiple sites).
I'm passing the sites to Nokogiri but the latency takes place prior to passing any parameters to Nokogiri.
I prefer to use the 'http' gem (https://github.com/httprb/http). It is fast and has a clean API.
You can also take a look at the HTTP clients comparison table:
https://github.com/httprb/http#another-ruby-http-library-why-should-i-care

Ruby Net::HTTP::Options does not allow response body

I am writing a script to test various web-services in ruby. To make http requests thus far I have been using Net::HTTP but today I realized I needed to make an OPTIONS request and retrieve some JSON from the response.
Unfortunately ruby does not currently support this: https://bugs.ruby-lang.org/issues/8429
Does anyone know of gem that supports this or some other way to get this response?
This is an alternative which supports a lot of options:
https://rubygems.org/gems/curb
Mechanize gem:
Try this gem, it's very useful and simple to use. I use it for parsing and various other tasks.
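If you would rather stay with Net::HTTP on a Ruby version affected by the linked issue, one possible workaround is to define your own request class that uses the OPTIONS verb but allows a response body. This is only a sketch: the class name OptionsWithBody and the helper options_json are made up here, and it assumes the service really does return JSON for OPTIONS.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Custom request class (hypothetical name): same verb as
# Net::HTTP::Options, but it tells Net::HTTP to read the response body.
class OptionsWithBody < Net::HTTPRequest
  METHOD = 'OPTIONS'
  REQUEST_HAS_BODY = false
  RESPONSE_HAS_BODY = true
end

# Send an OPTIONS request to the URL and parse the JSON response body.
def options_json(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(OptionsWithBody.new(uri.request_uri))
  end
  JSON.parse(response.body)
end

# Usage (performs a real network request; the URL is a placeholder):
#   options_json('http://www.example.com/service')
```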

Ruby Equivalent of Python Requests Library (HTTP Client)

There is a library in Python that I love called "Requests". Requests is an HTTP client built on urllib3 (see the Requests docs).
I am looking for something similar in Ruby. Basically what I need is:
Upload files support (multipart/form-data).
Easy get/post.
Cookies can be passed from a response object to a request object (build manually login script).
Stable and Flexible.
Sessions support (to not have to handle cookies manually if we don't have to).
I've looked at Typhoeus, but the code example on the home page doesn't work; they have moved the code along and the get method is no longer directly accessible like that, so it's not starting well. Curb seems nice and I like cURL. There is also rest-client, which seems popular, and em-http, which seems pretty fast according to benchmarks. There are also Patron and curb-fu, which I haven't had the time to try. And, of course, Net::HTTP. But there doesn't seem to be a mainstream solution that everyone points to.
I think a lot of people have been in my situation, and I wonder what they have chosen and why.
The author of the comparison is the author of httpclient, but from the looks of it the comparison is fair.
For a more narrative style with some explanation of the matrix, see http://www.slideshare.net/HiroshiNakamura/rubyhttp-clients-comparison from the same author.
The comparison comes out partly in favor of httpclient, which I can also recommend. Simple, featureful, compatible with all Ruby platforms and performant. Better cookie support than anything else out there, but the presentation mentions that cookies may leak from one (malevolent) site to another if you use the same client object. Don't know if this is still true.
There is https://github.com/cyx/requests, which is exactly what the question is asking for, a port of the requests lib from python.
The built-in OpenURI is the first place to look. It's simple and handles the basics nicely.
Typhoeus, which I've used several times for parallel processes, works nicely. Documentation and the codebase are available at Github.
irb(main):009:0> response = Typhoeus::Request.get("www.example.com")
=> #<Typhoeus::Response:0x007ffbcc067cf8 #code=302, #curl_return_code=0, #curl_error_message="No error", #status_message=nil, #http_version=nil, #headers="HTTP/1.0 302 Found\r\nLocation: http://www.iana.org/domains/example/\r\nServer: BigIP\r\nConnection: close\r\nContent-Length: 0\r\n\r\n", #body="", #time=0.035584, #requested_url=nil, #requested_http_method=nil, #start_time=nil, #start_transfer_time=0.035529, #app_connect_time=2.8e-05, #pretransfer_time=0.000429, #connect_time=2.8e-05, #name_lookup_time=2.8e-05, #request=:method => :get,
:url => www.example.com, #effective_url="HTTP://www.example.com", #primary_ip="192.0.43.10", #redirect_count=0, #mock=false>
irb(main):010:0> puts response.headers
HTTP/1.0 302 Found
Location: http://www.iana.org/domains/example/
Server: BigIP
Connection: close
Content-Length: 0
I use Net::HTTP occasionally too, but OpenURI and Typhoeus, with Hydra, have proven to be easy to use and integrate with my code.
I've eventually found this HTTPClient :
https://github.com/nahi/httpclient
I've started using it; it has the features I wanted, and moreover it's pretty fast according to some benchmarks. It also supports some advanced things like streaming and chunked responses. It's a shame, though, that it's not better known in the Ruby community. :)
Have you looked at the HTTParty gem?
If you need cookies and form handling, mechanize is the only way to go.
I'm sorry to hear that Typhoeus didn't work out for you. The reason is that the README shows how to work with Typhoeus v0.5.0.rc, which can be installed with
gem install typhoeus --pre
or
gem "typhoeus", git: "git://github.com/typhoeus/typhoeus.git"
There is no session support in Typhoeus, but other than that it could be a good fit. At least it's stable as hell, since it is built on top of libcurl.
File sending example:
Typhoeus.post("www.example.com/file", body: { file: File.open("testfile.txt","r") })
There is unfortunately no shortcut to deal with cookies, you have to set them manually:
Typhoeus.get("www.example.com/needs_cookie", headers: { Cookie: "PRIVATE" })
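Setting cookies manually mostly comes down to turning the Set-Cookie values of one response into the Cookie header of the next request. A rough sketch of that step is below; the helper name cookie_header is made up, and it deliberately ignores cookie attributes such as Domain, Path and Expires, so it is no substitute for a real cookie jar.

```ruby
# Turn the Set-Cookie header values of a response into the value of a
# Cookie request header. Only the "name=value" part of each cookie is
# kept; attributes such as Path or Expires are dropped.
def cookie_header(set_cookie_values)
  Array(set_cookie_values)
    .map { |line| line.split(';').first.to_s.strip }
    .reject(&:empty?)
    .join('; ')
end

# Usage: pass the result as the Cookie header of the next request, e.g.
#   Typhoeus.get("www.example.com/needs_cookie",
#                headers: { Cookie: cookie_header(values) })
# With Net::HTTP, response.get_fields('Set-Cookie') returns the values
# as an Array (or nil if the response set no cookies).
```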
TLDR: I would choose Typhoeus for its speed and libcurl if you're willing to set things up yourself. Otherwise I would look into Faraday and use it with the Typhoeus adapter.
Edit: I've added installation instructions to the README.
Edit: 0.5 is released.
This question seems to be lacking recent answers, so I am filling the void.
Coming from Python myself, and having loved the requests library for how easy it makes things, I recently discovered a very nice Ruby equivalent in rest-client.
It supports all the features mentioned in the question, and seems to be very nice from usability perspective - what requests library aimed to achieve.

Rails HTTP streaming with HAML

There appears to be an issue with using HTTP streaming with HAML projects in rails. It works perfectly if I use ERB instead. Apparently, I'm not the only one with this problem.
It doesn't work with placing stream at the top of the controller, or with using render :stream => true in the action.
How can I get HAML and HTTP streaming to play nicely together?
Update: I've opened an issue on the gem's page, here.
This is not yet supported by HAML (source):
HTTP streaming is the sort of thing that would require a substantial
set of modifications to the core Haml engine. It's only moderately
tricky to get it working even in basic cases, but when you factor in
things like the whitespace-eating operators it gets much more
difficult.
This isn't something I'm opposed to in theory, but it's also not
something that's high on my priority list given the difficulty of
implementing it.
The internals of Haml are such that it is indeed writing out to a buffer as it goes along. However, the "standard" API that Rails has traditionally provided for templating languages is a fairly straightforward in-and-out call. I don't think Haml currently has "streaming support", but it's more of an API issue than anything else.
I'm curious as to how Rails is plugging into ERB to do this.

Easy Ruby http/curl API to program with

I've been using Ruby for quite some time now. However, as far as I know, there is no standard HTTP/cURL-like library (fetching pages, processing forms) that is as easy and powerful as PHP's libcurl binding.
While Net::HTTP is part of the Ruby standard library, I always find that API hard to remember and program with.
Can anyone give suggestions on which http/curl library I should use over Net::HTTP?
Take a look at HTTParty or REST Client.
I would recommend using the Typhoeus gem. It's got a pretty clean API and allows you to make concurrent requests.
I'll second Ryan's recommendation for Typhoeus, and recommend HTTPClient also. Both are very full featured and handle parallel requests easily.
For simple requests it's hard to beat Open-URI for its simplicity:
require 'open-uri'
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
html = URI.open('http://www.example.com').read
If you're parsing a page it works great with Nokogiri:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
I wrote a wrapper for the Net::HTTP lib recently; it's very, very simplistic. I wanted something with a simple API that was easy to use and remember, and it's been working well for me:
https://github.com/ctcherry/plain_http
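To give an idea of how small such a wrapper can be, here is a sketch of one around Net::HTTP. It is not the plain_http API; the module name TinyHTTP and its methods are invented for this example, and it only covers GET without following redirects.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper module, for illustration only.
module TinyHTTP
  # Build a URI, merging extra query parameters into the given URL.
  def self.uri_for(url, params = {})
    uri = URI.parse(url)
    unless params.empty?
      existing = URI.decode_www_form(uri.query.to_s)
      uri.query = URI.encode_www_form(existing + params.to_a)
    end
    uri
  end

  # GET the URL and return the body as a String.
  # Raises an exception unless the response status is 2xx.
  def self.get(url, params = {})
    response = Net::HTTP.get_response(uri_for(url, params))
    response.value # raises on any non-2xx response
    response.body
  end
end

# Usage (performs a real network request):
#   html = TinyHTTP.get('http://www.example.com/search', q: 'ruby http')
```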
