Fastest Net::HTTP/Net::HTTPS wrapper for Ruby

What is the fastest way to download a webpage for parsing in Ruby? I've tried open-uri and HTTParty; both seem to take roughly 25 seconds to download simple webpages (I've tried multiple sites).
I'm passing the pages to Nokogiri, but the latency occurs before anything is handed to Nokogiri.

I prefer the http gem (https://github.com/httprb/http). It is fast and has a clean API.
You can also take a look at the HTTP clients comparison table:
https://github.com/httprb/http#another-ruby-http-library-why-should-i-care
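For reference, a minimal sketch of the http gem's chainable API (the URLs are placeholders):
require 'http'

response = HTTP.get('http://www.example.com')
puts response.status   # => 200
html = response.to_s   # body as a String, ready to hand to Nokogiri

# POST with form data:
HTTP.post('http://www.example.com/form', form: { name: 'value' })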

Related

Ruby Net::HTTP::Options does not allow response body

I am writing a script to test various web services in Ruby. To make HTTP requests I have thus far been using Net::HTTP, but today I realized I need to make an OPTIONS request and retrieve some JSON from the response.
Unfortunately Ruby does not currently support this: https://bugs.ruby-lang.org/issues/8429
Does anyone know of a gem that supports this, or some other way to get this response?
This is an alternative which supports lots of options:
https://rubygems.org/gems/curb
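For what it's worth, curb can issue arbitrary verbs through Curl::Easy#http, so an OPTIONS request with a readable body should look roughly like this sketch (the URL is a placeholder):
require 'curb'

curl = Curl::Easy.new('http://www.example.com/endpoint')
curl.http(:OPTIONS)    # send a verb Net::HTTP won't return a body for
puts curl.response_code
puts curl.body_str     # the response body, e.g. JSON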
Mechanize Gem
Try this gem; it's very useful and simple to use. I use it for parsing and various other tasks.

Ruby Equivalent of Python Requests Library (HTTP Client)

There is a library in Python that I love called Requests. Requests is an HTTP client built on urllib3.
I am looking for something similar in Ruby. Basically what I need is:
Upload files support (multipart/form-data).
Easy get/post.
Cookies can be passed from a response object to a request object (to manually build a login script).
Stable and Flexible.
Sessions support (to not have to handle cookies manually if we don't have to).
I've looked at Typhoeus, but the code example on the home page doesn't work; the code has moved on and the get method is no longer directly accessible like that, so it's not starting well. Curb seems nice and I like cURL; there is also rest-client, which seems popular, and em-http seems pretty fast according to benchmarks. There are also Patron and curb-fu, which I haven't had the time to try. And, of course, Net::HTTP. But there doesn't seem to be a mainstream solution that everyone points to.
I think a lot of people have been in my situation, and I wonder what they have chosen and why.
The author of the comparison is the author of httpclient, but from the looks of it the comparison is fair.
For a more narrative style with some explanation of the matrix, see http://www.slideshare.net/HiroshiNakamura/rubyhttp-clients-comparison from the same author.
The comparison comes out partly in favor of httpclient, which I can also recommend. Simple, featureful, compatible with all Ruby platforms and performant. Better cookie support than anything else out there, but the presentation mentions that cookies may leak from one (malevolent) site to another if you use the same client object. Don't know if this is still true.
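To give a flavor of it, here is a minimal httpclient sketch (URLs and form fields are placeholders); reusing one client object keeps connections and cookies alive across calls:
require 'httpclient'

client = HTTPClient.new
html = client.get_content('http://www.example.com')   # follows redirects
res  = client.post('http://www.example.com/login',
                   body: { user: 'name', pass: 'secret' })
puts res.status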
There is https://github.com/cyx/requests, which is exactly what the question is asking for, a port of the requests lib from python.
The built-in OpenURI is the first place to look. It's simple and handles the basics nicely.
Typhoeus, which I've used several times for parallel processes, works nicely. Documentation and the codebase are available on GitHub.
irb(main):009:0> response = Typhoeus::Request.get("www.example.com")
=> #<Typhoeus::Response:0x007ffbcc067cf8 #code=302, #curl_return_code=0, #curl_error_message="No error", #status_message=nil, #http_version=nil, #headers="HTTP/1.0 302 Found\r\nLocation: http://www.iana.org/domains/example/\r\nServer: BigIP\r\nConnection: close\r\nContent-Length: 0\r\n\r\n", #body="", #time=0.035584, #requested_url=nil, #requested_http_method=nil, #start_time=nil, #start_transfer_time=0.035529, #app_connect_time=2.8e-05, #pretransfer_time=0.000429, #connect_time=2.8e-05, #name_lookup_time=2.8e-05, #request=:method => :get,
:url => www.example.com, #effective_url="HTTP://www.example.com", #primary_ip="192.0.43.10", #redirect_count=0, #mock=false>
irb(main):010:0> puts response.headers
HTTP/1.0 302 Found
Location: http://www.iana.org/domains/example/
Server: BigIP
Connection: close
Content-Length: 0
I use Net::HTTP occasionally too, but OpenURI and Typhoeus, with Hydra, have proven to be easy to use and integrate with my code.
I eventually found HTTPClient:
https://github.com/nahi/httpclient
I've started using it; it matches the features I wanted, and moreover it's pretty fast according to some benchmarks. It also supports advanced things like streaming and chunked responses. It's a shame it's not better known in the Ruby community. :)
Have you looked at the HTTParty gem?
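If not, it's about as terse as it gets; a quick sketch (placeholder URLs):
require 'httparty'

response = HTTParty.get('http://www.example.com')
puts response.code
puts response.body

HTTParty.post('http://www.example.com/form', body: { name: 'value' })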
If you need cookies and form handling, mechanize is the only way to go.
I'm sorry to hear that Typhoeus didn't work out for you. The reason is that the README shows how to work with Typhoeus v0.5.0.rc, which can be installed with
gem install typhoeus --pre
or
gem "typhoeus", git: "git://github.com/typhoeus/typhoeus.git"
There is no session support in Typhoeus, but other than that it could be a good fit. At least it's stable as hell, since it is built on top of libcurl.
File sending example:
Typhoeus.post("www.example.com/file", body: { file: File.open("testfile.txt","r") })
There is unfortunately no shortcut to deal with cookies, you have to set them manually:
Typhoeus.get("www.example.com/needs_cookie", headers: { Cookie: "PRIVATE" })
TL;DR: I would choose Typhoeus for its speed and libcurl if you're willing to set things up yourself. Otherwise I would look into Faraday and use it with the Typhoeus adapter (sketched below).
Edit: I've added installation instructions to the README.
Edit: 0.5 is released.
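For the Faraday route mentioned in the TL;DR, the setup is roughly the following sketch (the URL and cookie value are placeholders, matching the examples above):
require 'faraday'
require 'typhoeus'
require 'typhoeus/adapters/faraday'

conn = Faraday.new(url: 'http://www.example.com') do |f|
  f.adapter :typhoeus    # let libcurl do the actual transfer
end
response = conn.get('/needs_cookie') do |req|
  req.headers['Cookie'] = 'PRIVATE'
end
puts response.status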
This question seems to be lacking recent answers, so I am filling the void.
Coming from Python myself, and having loved the Requests library for what it does easily, I recently discovered a very nice Ruby equivalent in rest-client.
It supports all the features mentioned in the question and is very nice from a usability perspective, which is what the Requests library aimed to achieve.
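As a rough sketch of how it covers the checklist from the question (URLs and credentials are placeholders):
require 'rest-client'

# easy GET:
RestClient.get('http://www.example.com')

# multipart/form-data upload (multipart is used automatically for File payloads):
RestClient.post('http://www.example.com/upload',
                file: File.new('testfile.txt', 'rb'))

# pass cookies from a response into the next request:
login = RestClient.post('http://www.example.com/login',
                        user: 'name', pass: 'secret')
RestClient.get('http://www.example.com/private', cookies: login.cookies)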

vcr with capybara-webkit

I'm using capybara-webkit to test integration with a third-party website (I need JavaScript).
I want to use VCR to record requests made during the integration test, but capybara-webkit doesn't go over Net::HTTP, so VCR is unable to record them. How would I go about writing an adapter for VCR that would allow me to record the requests?
Unfortunately, VCR is very much incompatible with capybara-webkit. The problem is that capybara-webkit drives WebKit, which is native code, while WebMock and FakeWeb, the basis for VCR, can only intercept Ruby web requests. Making the two work together would likely be a monumental task.
I've solved this problem two ways:
The first (hacky, but valid) is to add a new JavaScript file to the application that is only included in the test environment. This file stubs out the JS classes which make external web requests. Aside from the pure hackitude of this approach, it requires that every time a request is added or changed, you change the stubs as well.
The second approach is to route all external requests through my own server, effectively proxying them. This has the huge disadvantage that you have to have an action for everything you want to consume (you could genericize it, with some work). It also suffers from the fact that it can as much as double the time for the request to complete. However, since the requests are now being made by Ruby, you can use VCR in all its glory.
In my situation, approach #2 has been much more to my advantage, thanks to the fact that I need Ruby to manipulate the data so that I can keep my JavaScript source-agnostic. I was, however, using approach #1 successfully for quite a while.
I've written a small ruby library (puffing-billy) for rspec+capybara that does exactly this -- it injects a proxy in between your browser and the outside world and allows you to fake responses to specific requests.
Example:
describe 'fetching badges from stackoverflow API' do
  it 'should show a nice message when you have no badges' do
    # stub some JSONP
    proxy.stub('http://api.stackoverflow.com/1.1/users/1/badges',
               :jsonp => { :badges => [] })
    visit '/my_badges'
    page.should have_content("You don't have any badges :(")
  end
end

Ruby: How to screen-scrape the result of an Ajax request

I have written a Ruby script to screen-scrape something using the 'open-uri' and 'hpricot' gems; everything works great so far.
But now I have to screen-scrape something which is returned after a form is submitted via a JavaScript function (called by an 'onchange' event handler on a drop-down menu):
function submit_form() {
  document.list.action = "/some/sort/of/path";
  document.list.submit();
}
AFAIK, open-uri only lets you submit GET requests, and if I'm not mistaken, a POST request is needed here.
So my question is: what do I need to install and require, and what would the Ruby code look like to make that POST request? Sorry, I'm still pretty much a n00b...
Thank you very much for your help!
Tom
I think you should definitely use Mechanize. It provides a nifty interface to interact with remote pages, forms on them, and so forth (see the sketch below).
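Something along these lines should do it (a sketch: the page URL and field name are placeholders; the form name 'list' comes from document.list in the JavaScript above):
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://www.example.com/page_with_form')
form  = page.form_with(name: 'list')
form.action = '/some/sort/of/path'   # what submit_form() sets before submitting
form['menu'] = 'chosen value'        # hypothetical drop-down field
result = agent.submit(form)          # submits the form as the browser would
puts result.body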
The Ruby standard library has the Net::HTTP class, which naturally supports the POST operation:
require 'net/http'
res = Net::HTTP.post_form(URI.parse('http://www.example.com/some/sort/of/path'),
                          'key' => 'value')   # placeholder form fields
puts res.body
If you find the API there less than optimal, then take a look at the httparty gem
Finally, while Hpricot is a great gem, it isn't actively developed any longer. You should consider moving to Nokogiri, which practically replaces Hpricot and improves upon it.

Easy Ruby http/curl API to program with

I've been using Ruby for quite some time now; however, unlike PHP, as far as I know Ruby has no standard HTTP/curl-style library (for fetching pages and processing forms) that is as easy and powerful as PHP's libcurl binding.
While Net::HTTP is part of the Ruby standard library, I always find that API hard to remember and program with.
Can anyone give suggestions on which http/curl library I should use over Net::HTTP?
Take a look at HTTParty or REST Client.
I would recommend using the Typhoeus gem. It's got a pretty clean API and allows you to make concurrent requests.
I'll second Ryan's recommendation for Typhoeus, and recommend HTTPClient also. Both are very full-featured and handle parallel requests easily.
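Parallel requests with Typhoeus go through Hydra; a minimal sketch (placeholder URLs):
require 'typhoeus'

hydra = Typhoeus::Hydra.new
requests = %w[
  http://www.example.com/a
  http://www.example.com/b
].map do |url|
  request = Typhoeus::Request.new(url)
  hydra.queue(request)
  request
end
hydra.run   # runs all queued requests concurrently
requests.each { |r| puts r.response.code }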
For simple requests it's hard to beat Open-URI for its simplicity:
require 'open-uri'
html = open('http://www.example.com').read
If you're parsing a page it works great with Nokogiri:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.example.com'))
I recently wrote a wrapper for the Net::HTTP lib; it's very, very simplistic. I wanted something with a simple API that was easy to use and remember, and it's been working well for me:
https://github.com/ctcherry/plain_http
