Nokogiri vs Goliath...or, can they get along?

I have a project that needs to parse literally hundreds of thousands of HTML and XML documents.
I thought this would be a perfect opportunity to learn Ruby fibers and the new Goliath framework.
Obviously, Goliath falls flat if you use blocking libraries. The problem is, I don't know how to tell what is "thread safe" (if that's even the correct term for Goliath).
So my question is, is Nokogiri going to cause any issues with Goliath or multi-threading/fibers in general?
If so, is there something safer to use than Nokogiri?
Thanks

Goliath is a web framework, so I'm assuming you're planning to "ingest" these documents via HTTP? Each request gets mapped onto a Ruby fiber, but effectively the server runs in a single reactor thread.
So, to answer your question: Nokogiri is thread safe to the best of my knowledge, but that shouldn't really matter here. The thing to look out for is that while a document is being parsed, the CPU is pinned and Goliath won't accept any new requests in the meantime. So you'll have to implement the right logic for your specific case (e.g., stream-parse chunks of data as they arrive from the socket, or load balance between multiple Goliath servers, or both... :-))
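For the stream-parsing route, Nokogiri's SAX push parser can consume chunks as they arrive; a minimal sketch (the handler class and element name are just illustrative):

    require 'nokogiri'

    # SAX handler that reacts to elements as they stream in.
    class ItemCounter < Nokogiri::XML::SAX::Document
      attr_reader :count

      def initialize
        @count = 0
        super
      end

      def start_element(name, attrs = [])
        @count += 1 if name == 'item'
      end
    end

    handler = ItemCounter.new
    parser  = Nokogiri::XML::SAX::PushParser.new(handler)

    # Feed each chunk as it arrives from the socket (e.g. in Goliath's
    # on_body callback), then finalize once the request completes.
    parser << '<items><item/>'
    parser << '<item/></items>'
    parser.finish

    puts handler.count  # => 2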

Related

Issues with EventMachine (and looking into Sinatra Async)

I've been trying to find a good way of dealing with asynchronous requests and organizing jobs that need to be repeated, and EventMachine seemed a good way to go, but I found some posts trying to discourage users from EventMachine (for example https://github.com/kyledrake/sinatra-synchrony). I was wondering what issues they are referring to (and, if someone would be nice enough to say, what the alternatives are)?
Considering you're basically searching for a job queue, take a look at Background Jobs at Ruby Toolbox and you'll find a plethora of good options. Manageability vs Speed goes something like this,
Delayed Job
Sidekiq/Resque
Beanstalkd
with DJ being the slowest and most manageable and Beanstalkd the fastest and least manageable. Your best bet is probably Sidekiq or Resque; both depend on Redis for managing their queues.
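For a sense of how little ceremony Sidekiq needs, here's a minimal worker (class and argument names are placeholders):

    require 'sidekiq'

    # Workers are plain classes; arguments must be JSON-serializable
    # because Sidekiq stores the job payload in Redis.
    class DocumentImportWorker
      include Sidekiq::Worker

      def perform(doc_id)
        puts "Importing document #{doc_id}"
      end
    end

    # Enqueue from anywhere in your app; a separate `sidekiq` process
    # pulls jobs off the Redis queue and runs them.
    DocumentImportWorker.perform_async(42)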
I'd discourage you from using EventMachine because:
It's hard to reason about the reactor pattern.
Fibers detangle the reactor pattern's callback pyramid of doom into synchronous-looking code, but fiber support in third-party apps tends to bite you.
You're confined to a very limited ecosystem when it comes to network-related code.
It's hard not to block the reactor and it's often not easy to catch it when you do.
There are finished solutions for background processing, you don't need to code your own.
It's not really maintained any more, just take a look at last commits and issue list on github.
As alternatives, there are celluloid, celluloid-io, and dcell.
Actually, the Sinatra Synchrony people sum it up well:
This gem should not be considered for a new application. It is better
to use threads with Ruby, rather than EventMachine. It also tends to
break when new releases of ruby come out, and EM itself is not
maintained very well and has some pretty fundamental problems.
I will not be maintaining this gem anymore. If anyone is interested in
maintaining it, feel free to inquire, but I recommend not using
EventMachine or sinatra-synchrony anymore.
Use EM if it fits your workflow. Callbacks can be fine to work with as long as you don't get too crazy. We built a lot of software on top of EM at my last job.
There is pretty good support for third party protocols, just take a look at the protocol implementations page.
As to blocking the reactor, you just need to make sure you don't do work on the main thread, and if you do, make sure it's work you can do fast. There are some things you can do to determine whether this is working. The simplest is to add a latency check to your code: a periodic timer that fires every x seconds and logs a message (in development). Printing the time between calls tells you how lagged the reactor has become; the more that interval exceeds your x value, the more work you're doing on the main thread.
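A throwaway version of that check might look like this (the one-second interval and 100 ms threshold are arbitrary):

    require 'eventmachine'

    EM.run do
      interval = 1.0
      last = Time.now

      # If the reactor is healthy, this fires every `interval` seconds;
      # any extra delay is time the main thread spent blocked.
      EM.add_periodic_timer(interval) do
        now = Time.now
        lag = (now - last) - interval
        puts "reactor lag: #{(lag * 1000).round(1)}ms" if lag > 0.1
        last = now
      end

      # ... the rest of your application ...
    end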
So, I'd say, try it for yourself. Try celluloid, try straight up threads, try EM with EM-Synchrony and fibers.
It really comes down to personal preference.

Why should I avoid using CGI?

I was trying to create my website using CGI and ERB, but when I search on the web, I see people saying I should always avoid using CGI, and always use Rack.
I understand CGI will fork a lot of Ruby processes, but if I use FastCGI, only one persistent process will be created, and it is adopted by PHP websites too. Plus, the FastCGI interface creates only one object per request and has very good performance, as opposed to Rack, which creates 7 objects at once.
Is there any specific reason I should not use CGI? Or is that just a false assumption, and is it entirely OK to use CGI/FastCGI?
CGI, by which I mean both the interface and the common programming libraries and practices around it, was written in a different time. It has a view of request handlers as distinct processes connected to the webserver via environment variables and standard I/O streams.
This was state-of-the-art in its day, when there were not really "web frameworks" and "embedded server modules" as we think of them today. Thus...
CGI tends to be slow
Again, the CGI model spawns one new process per connection. While spawning processes per se is cheap these days, heavy web app initialization — reading and parsing scores of modules, making database connections, etc. — makes this quite expensive.
CGI tends toward too-low-level (IMHO) design
Again, the CGI model explicitly mentions environment variables and standard input as the interface between request and handler. But ... who cares? That's much lower level than the app designer should generally be thinking about. If you look at libraries and code based on CGI, you'll see that the bulk of it encourages "business logic" right alongside form parsing and HTML generation, which is now widely seen as a dangerous mixing of concerns.
Contrast with something like Rack::Builder, where right away the coder is thinking of mapping a namespace to an action, and what that means for the broader web application. (Suddenly we are free to argue about the semantic web and the virtues of REST and this and that, because we're not thinking about generating radio buttons based off user-supplied input.)
Yes, something like Rack::Builder could be implemented on top of CGI, but, that's the point. It'd have to be a layer of abstraction built on top of CGI.
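To make the contrast concrete, a namespace-to-action mapping with Rack::Builder looks roughly like this (routes and responses invented for illustration):

    require 'rack'

    app = Rack::Builder.new do
      # Each `map` mounts a handler at a URL namespace; the handler is
      # any object responding to #call(env).
      map '/health' do
        run ->(env) { [200, { 'Content-Type' => 'text/plain' }, ['OK']] }
      end

      map '/greet' do
        run ->(env) { [200, { 'Content-Type' => 'text/plain' }, ['Hello']] }
      end
    end.to_app

    # Serve with any Rack-compatible server, e.g.:
    #   Rack::Handler::WEBrick.run(app, Port: 9292)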
CGI tends to be sneeringly dismissed
Despite CGI working perfectly well within its limitations, despite it being simple and widely understood, CGI is often dismissed out of hand. You, too, might be dismissed out of hand if CGI is all you know.
Don't use CGI. Please. It's not worth it. Back in the 1990s when nobody knew better it seemed like a good idea, but that was when scripts were infrequent, used for special cases like handling form submissions, not driving entire sites.
FastCGI is an attempt at a "better CGI" but it's still deficient in a large number of ways, especially because you have to manage your FastCGI worker processes.
Rack is a much better system, and it works very well. If you use Rack, you have a wide variety of hosting systems to choose from, even Passenger which is really simple and reliable.
I don't know what you mean when you say Rack creates "7 objects at once," unless you mean there are seven different Rack processes running somehow, or you've made a mistake in your implementation.
I can't think of a single instance where CGI would be better than a Rack equivalent.
There exists a lot of confusion about what CGI, Rack etc. really are. As I describe here, Rack is an API, and FastCGI is a protocol. CGI is also a protocol, but in its narrow sense also an implementation, and in the sense you're speaking of it is not at all the same thing as FastCGI. So let's start with the background.
Back in the early 90s, web servers simply read files (HTML, images, whatever) off the disk and sent them to the client. People started to want to do some processing at the time of the request, and the early solution that came out was to run a program that would produce the result sent back to the client, rather than just reading the file. The "protocol" for this was for the web server to be given a URL that it was configured to execute as a program (e.g., /cgi-bin/my-script), where the web server would then set up a set of environment variables with various information about the request and run the program with the body of the request on the standard input. This was referred to as the "Common Gateway Interface."
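To make that concrete, a classic Ruby CGI script reads the request from environment variables and stdin and writes the response to stdout, something like this sketch:

    #!/usr/bin/env ruby
    require 'cgi'

    cgi = CGI.new

    # The response goes to stdout: headers, a blank line, then the body.
    print cgi.header('text/plain')
    puts "Method: #{ENV['REQUEST_METHOD']}"
    puts "Path:   #{ENV['PATH_INFO']}"
    puts "Query:  #{ENV['QUERY_STRING']}"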
Given that this forks off a new process for every request, it's clearly inefficient, and you almost certainly don't want to use this style of dynamic request handling on high-volume web sites. (Starting a whole new process is relatively expensive in computational resources.)
One solution to making this more efficient is to, rather than starting a new process, send the request information to an existing process that's already running. This is what FastCGI is all about; it maintains a very similar interface to CGI (you have a set of variables with most of the request information, and a stream of data for the body of the request). But instead of setting actual Unix environment variables and starting a new process with the body on stdin, it sends a request similar to an HTTP request to an FCGI server already running on the machine where it specifies the values of these variables and the request body contents.
If the web server can have the program code embedded in it somehow, this becomes even more efficient because it just runs the code itself. Two classic examples of how you might do this would be:
Have PHP embedded in Apache, so that the "Apache server code" just calls the "PHP server code" that's part of the same process; and
Not run Apache at all, but have the web server be written in Ruby (or Python, or whatever) and load and run more Ruby code that's been custom-written to handle the request.
So where does Rack come in to this? Rack is an API that lets code that handles web requests receive it in a common way, regardless of the web server. So given some Ruby code to process a request that uses the Rack API, the web server might:
Be a Ruby web server that simply makes function calls in its own process to the Rack-compliant code that it loaded;
Be a web server (written in any language) that uses the FastCGI protocol to talk to another process with FastCGI server code that, again, makes function calls to the Rack-compliant code that handles the request; or
Be a server that starts a brand new process that interprets the CGI environment variables and standard input passed to it and then calls the Rack-compliant code.
So whether you're using CGI, FastCGI, another inter-process protocol, or an intra-process protocol makes no difference; you can do any of those with Rack, so long as the server either knows about Rack itself or is talking to a process that can understand CGI, FastCGI or whatever and call the Rack-compliant code based on that request.
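In code, that protocol-agnosticism is why a trivial Rack app never mentions its transport (handler names here are from classic Rack):

    # A Rack application is any object responding to #call(env) and
    # returning the [status, headers, body] triple.
    app = lambda do |env|
      [200, { 'Content-Type' => 'text/plain' }, ["Hello from Rack\n"]]
    end

    # The server-side wiring changes; `app` does not:
    #   Rack::Handler::CGI.run(app)       # new process per request
    #   Rack::Handler::FastCGI.run(app)   # long-running FCGI process
    #   Rack::Handler::WEBrick.run(app)   # in-process Ruby web server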
So:
For performance scaling, you definitely don't want to be using CGI; you want to be using FastCGI, a similar protocol (such as the Tomcat one), or direct in-process calling of the code.
If you use the Rack API, you don't need to worry at the early stages which protocol you're using between your web server and your program because the whole point of APIs like Rack is that you can change it later.

DRb: how to check if remote object exists?

I've been toying around with DRb as my solution for communicating across multiple processes. I'm using the standard procedure: one process creates a service and registers it at a dRuby URI, and the other process creates a DRbObject referencing that URI. So far so good. Let's say I kill the first process. Every subsequent method call on the remote object will culminate in an Errno::ECONNREFUSED exception. Which is only fair. But isn't there a way to see if the DRbObject is indeed registered at the given URI? Testing it by forcing an Errno::ECONNREFUSED on every instance start just to see if it exists seems a bit silly.
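By "testing it", I mean a probe along these lines (the method called is arbitrary; anything that forces a round trip will do):

    require 'drb/drb'

    DRb.start_service

    def drb_service_alive?(uri)
      remote = DRbObject.new_with_uri(uri)
      # respond_to? is forwarded to the remote object, so it forces an
      # actual round trip and fails if nothing is listening.
      remote.respond_to?(:to_s)
      true
    rescue DRb::DRbConnError, Errno::ECONNREFUSED
      false
    end

    puts drb_service_alive?('druby://localhost:8787')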
Of course, other solutions involving resources other than DRb are always welcome, provided they indeed represent a plus.
You should check out ZeroMQ. It is somewhat more complex to set up than DRb but it handles all the presence/reconnection issues mostly transparently.
This may not be what you are looking for, but I have developed an IPC framework on top of DRb that hides all of the DRb stuff from the applications level. This includes client methods to find whatever services have registered with the server across the network. Probably too much overhead for you but maybe worth poking around in it. Anyway, you can check it out on Github.

Are there any web frameworks on top of EventMachine?

Are there any web frameworks on top of EventMachine? So far, I've found Fastr and Cramp. Both seem to be outdated.
Moreover, Googling how to set up Rails + EventMachine returns a limited number of results.
NodeJS is really nothing new. Evented I/O has been around for a very long time (Twisted for Python and EventMachine for Ruby). However, what attracts me to NodeJS, is the implementations that are built on top of it.
For example, NodeJS has TowerJS, among plenty of others. Perhaps this is one of the many reasons contributing to its popularity.
What I like most about TowerJS, is its Rails-like structure. Is there anything like it for EventMachine?
Goliath is an open-source, non-blocking (asynchronous) Ruby web server framework.
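A minimal Goliath endpoint looks like this (class name invented; running the file directly boots Goliath's own server):

    require 'goliath'

    class Hello < Goliath::API
      # Goliath calls #response with the Rack-style env and expects the
      # usual [status, headers, body] triple back.
      def response(env)
        [200, { 'Content-Type' => 'text/plain' }, 'Hello from Goliath']
      end
    end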
You may find async_sinatra interesting.
Besides EventMachine and the others mentioned here, there's vert.x. I'm not sure how much of a "web framework" it is, but its site shows examples for a simple app like one might write in Sinatra.

High load RESTful API in Ruby (sync/async implementation)

I'm struggling with implementing a RESTful API that should return JSON response and should sustain very high load.
The highest load will be generated by 'read' part of the API and very little load will be generated by 'write' part of the API.
My first attempt was to write the whole API using nodejs. I almost did it, but faced very high duplication of models and logic between JavaScript and Ruby, because the API is part of a bigger system. I tried moving all the logic into the backend (MySQL), but that idea turned out even uglier.
My second attempt is to write the API in Ruby ecosystem in order to share models/logic and tests between all parts of the system.
I tried using Cramp and Goliath alone, but all that async stuff really complicated the API implementation. I only need 2 read URLs to be async, because they generate the highest load, but by going async all the way I was forced to implement the rest of the API in an async fashion, which didn't add any value.
My current attempt is to go hybrid: use a Thin/Sinatra/Cramp cocktail. I'm instantiating the Thin Rack handler right in Ruby code, and using Rack::Builder I'm splitting the API between Sinatra, which takes the sync implementation, and Cramp, which implements the 2 URLs in an async way.
Is this a good way to go? Or will having Sinatra and Cramp in one web server (Thin) get me into even more trouble for some reason?
update:
I'm trying a solution with Sinatra alone, mixed with rack/fiber_pool and em_mysql2. It seems I'm hitting both goals: making the API async while keeping a sync-looking implementation. But I'm suffering from a bug which I think will be fixed quite soon.
Will there be any gotchas going this way?
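For reference, the shape of what I'm trying looks roughly like this (a sketch; the route and query are placeholders):

    # config.ru -- run under an EventMachine-based server such as Thin.
    require 'sinatra/base'
    require 'rack/fiber_pool'

    class ReadAPI < Sinatra::Base
      # Each request runs in its own fiber; an EM-aware MySQL client
      # (em_mysql2) yields the fiber instead of blocking the reactor.
      use Rack::FiberPool

      get '/items/:id' do
        # A query via the EM-aware client would suspend this fiber
        # until the result arrives, keeping the code sync-looking.
        content_type :json
        %({"id": #{params[:id].to_i}})
      end
    end

    run ReadAPI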
I don't think it's a good idea to have sync (sinatra) and async (cramp) apps within the same thin process. If the sync part is relatively simple, I'd suggest implementing that in Cramp. A little biased here as I authored Cramp :)
In case you didn't know, Cramp has out-of-the-box support for AR/fiber pool - https://github.com/lifo/cramp/blob/master/examples/fibers/long_ar_query.ru
If you decide to use Cramp, I'm happy to help out with any issues as I've been working a lot on cramp recently and am quite pumped up! Just throw me an email!
I'm curious what async stuff you ran into with Goliath? In the common case there should be no async code visible to the end developer.
Is there something we can do better to make this less visible to the end user?
