Running parallel HTTP requests using Typhoeus with Hydra in Ruby - ruby

I was going through http://typhoeus.github.com/articles/getting_started.html#making_parallel_requests
and I couldn't understand how Typhoeus with Hydra makes parallel HTTP requests possible. Is it similar to how EventMachine::Iterator and EventMachine::HttpRequest handle concurrent requests? I am planning to go through its source code, but if anyone already knows what is going on behind the scenes, please enlighten me. It would help me a lot to understand Typhoeus better.
Thanks!

Typhoeus is a libcurl wrapper and doesn't do parallel requests itself. But it provides an interface to libcurl's multi interface (http://curl.haxx.se/libcurl/c/libcurl-multi.html), which takes care of running the requests in parallel. That makes it different from EventMachine: libcurl does the heavy lifting, so your Ruby code doesn't have to manage the concurrent transfers itself.
To be more precise, Typhoeus (since 0.5.0.alpha) uses Ethon (https://github.com/typhoeus/ethon) instead of dealing with libcurl on its own. If you want to see how Ethon works with libcurl's multi interface, this is a good starting point: https://github.com/typhoeus/ethon/blob/master/lib/ethon/multi.rb.
In case you want to know what's really going on, you should look into libcurl itself.

Related

Issues with EventMachine (and looking into Sinatra Async)

I've been trying to find a good way of dealing with asynchronous requests and organizing jobs that need to be repeated. EventMachine seemed a good way to go, but I found some posts discouraging users from EventMachine (for example https://github.com/kyledrake/sinatra-synchrony). What issues are they referring to? (And, if someone would be nice enough, what are the alternatives?)
Considering you're basically searching for a job queue, take a look at Background Jobs at Ruby Toolbox and you'll find a plethora of good options. Manageability vs. speed goes something like this:
Delayed Job
Sidekiq/Resque
Beanstalkd
with DJ being the slowest and most manageable and Beanstalkd being the fastest and least manageable. Your best bet is probably Sidekiq or Resque; both depend on Redis for managing their queue.
I'd discourage you from using EventMachine because:
It's hard to reason about the reactor pattern.
Fibers untangle the reactor pattern's callback pyramid of doom into synchronous-looking code, but fiber support in third-party apps tends to bite you.
You're restricted to a very small ecosystem when it comes to net-related code.
It's hard not to block the reactor and it's often not easy to catch it when you do.
There are finished solutions for background processing, you don't need to code your own.
It's not really maintained any more; just take a look at the latest commits and the issue list on GitHub.
There are alternatives: celluloid, celluloid-io and dcell.
Actually, the Sinatra Synchrony people sum it up well:
This gem should not be considered for a new application. It is better
to use threads with Ruby, rather than EventMachine. It also tends to
break when new releases of ruby come out, and EM itself is not
maintained very well and has some pretty fundamental problems.
I will not be maintaining this gem anymore. If anyone is interested in
maintaining it, feel free to inquire, but I recommend not using
EventMachine or sinatra-synchrony anymore.
Use EM if it fits your workflow. Callbacks can be fine to work with as long as you don't get too crazy. We built a lot of software on top of EM at my last job.
There is pretty good support for third party protocols, just take a look at the protocol implementations page.
As to blocking the reactor, you just need to make sure you don't do work on the main thread, and if you do, make sure it's work that finishes fast. You can check whether this is working; the simplest way is to add a latency check to your code. Add a periodic timer that fires every x seconds and logs a message (in development). Printing the time between calls tells you how lagged the reactor has become: the further this time exceeds your x value, the more work you're doing on the main thread.
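A rough sketch of such a latency check follows. The class `ReactorLagMonitor` and the helper `install_lag_monitor` are made-up names for illustration; only `EventMachine.add_periodic_timer` comes from EM itself, and the eventmachine gem is loaded lazily so the lag arithmetic can be tried without it.

```ruby
# Tracks how far a periodic timer drifts from its expected interval.
class ReactorLagMonitor
  attr_reader :interval, :last_lag

  def initialize(interval, now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    @interval = interval
    @last_tick = now
    @last_lag = 0.0
  end

  # Call on every timer tick. Returns how many seconds later than
  # the expected interval this tick arrived (never negative).
  def tick(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    elapsed = now - @last_tick
    @last_tick = now
    @last_lag = [elapsed - interval, 0.0].max
  end
end

# Hypothetical wiring; call this from inside an EM.run block.
def install_lag_monitor(interval = 1, out: $stderr)
  require "eventmachine" # assumes the eventmachine gem is available
  monitor = ReactorLagMonitor.new(interval)
  EventMachine.add_periodic_timer(interval) do
    out.puts(format("reactor lag: %.3fs", monitor.tick))
  end
end
```

If the logged lag stays near zero, the reactor is keeping up; a lag that grows under load suggests that something is blocking the main thread.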
So, I'd say, try it for yourself. Try celluloid, try straight up threads, try EM with EM-Synchrony and fibers.
It really comes down to personal preference.
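For comparison, the "straight up threads" option can be sketched with nothing but the standard library. The helper name `run_in_threads` and the pool size are arbitrary choices for illustration, and this is not a replacement for a real job queue such as the ones listed above.

```ruby
# Runs each job (any object responding to #call) on a small pool of
# threads and returns the results in the original order.
def run_in_threads(jobs, pool_size: 4)
  queue = Queue.new
  jobs.each_with_index { |job, index| queue << [job, index] }
  # A closed queue returns nil from #pop once it is empty,
  # which lets the workers exit their loop.
  queue.close

  results = Array.new(jobs.size)
  workers = Array.new(pool_size) do
    Thread.new do
      while (item = queue.pop)
        job, index = item
        results[index] = job.call
      end
    end
  end

  workers.each(&:join)
  results
end
```

For example, `run_in_threads(urls.map { |u| -> { Net::HTTP.get(URI(u)) } })` would fetch several pages concurrently, given `require "net/http"` and a `urls` array.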

Best ruby binding/gem for curl/libcurl

I want to use curl from Ruby. So far I have invoked curl on the command line and then parsed the output it dumped to a file. However, I would like to use it from within my application, which would give me better control over the handling etc.
There are a few gems out there (http://curb.rubyforge.org/ and http://curl-multi.rubyforge.org/), but it's not clear which one is the best to use. My criteria for the decision are:
Stability and reliability of the library
Comprehensive support of underlying curl features. (I will need data posting, forging HTTP headers, redirects and multi-threaded requests heavily.)
It would be great to get some feedback.
Thanks for your help.
-Pulkit
I highly recommend Typhoeus. It relies on libcurl and allows for all sorts of parallel and async possibilities. It offers SSL, stubbing, redirect following, custom headers, and true parallel requests for blazing speed, and it has yet to let me down. Also, it is well maintained: at the moment, the last commit was 2 days ago!

Are there any web frameworks on top of EventMachine?

Are there any web frameworks on top of EventMachine? So far, I've found Fastr and Cramp. Both seem to be outdated.
Moreover, Googling how to set up Rails + EventMachine returns a limited number of results.
NodeJS is really nothing new. Evented I/O has been around for a very long time (Twisted for Python and EventMachine for Ruby). However, what attracts me to NodeJS is the implementations that are built on top of it.
For example, NodeJS has TowerJS, among plenty of others. Perhaps this is one of the many reasons it is trending.
What I like most about TowerJS is its Rails-like structure. Is there anything like it for EventMachine?
Goliath is an open source version of the non-blocking (asynchronous) Ruby web server framework.
You may find async sinatra interesting.
Besides EventMachine and the others mentioned here, there's vert.x. I'm not sure how much of a "web framework" it is, but its site shows examples for a simple app like one might write in Sinatra.

Nokogiri vs Goliath...or, can they get along?

I have a project that needs to parse literally hundreds of thousands of HTML and XML documents.
I thought this would be a perfect opportunity to learn Ruby fibers and the new Goliath framework.
Obviously, Goliath falls flat if you use blocking libraries. The problem is, I don't know how to tell what is "thread safe" (if that's even the correct term for Goliath).
So my question is, is Nokogiri going to cause any issues with Goliath or multi-threading/fibers in general?
If so, is there something safer to use than Nokogiri?
Thanks
Goliath is a web framework, so I'm assuming you're planning to "ingest" these documents via HTTP? Each request gets mapped into a Ruby fiber, but effectively, the server runs in a single reactor thread.
So, to answer your question: Nokogiri is thread safe to the best of my knowledge, but that shouldn't really matter here. The thing you will have to look out for: while a document is being parsed, the CPU is pinned, and Goliath won't accept any new requests in the meantime. So you'll have to implement the correct logic to handle your specific case (e.g., you could do a stream parse on chunks of data arriving from the socket, or load balance between multiple Goliath servers, or both ... :-))

Client side http proxy in ruby

What is a good approach to a client-side proxy written in Ruby that I can use to create a custom filter?
So far I've found
Ruby Proxy using webrick
Mousehole, a scriptable Ruby proxy by _why (UPDATE this was not robust)
A little on the fringe: this question asks how to "Use rack as thin proxy". I don't think the asker got an answer, or even a hint that it was possible.
What is your advice on these suggested approaches, or do you have a better approach?
Thanks!
I can’t speak from personal experience, as I’ve not done this myself, but I have heard of mouseHole before and it seems to be a good package. Why not try writing a simple script for it and see how you find it?
There are also some sample scripts in that repository that you could check out.
