I'm trying to interact with Google's Calendar API. My tests so far show response times of 5-10 seconds to insert a single event, and I may need to export thousands of events at once [don't ask]. This seems likely to spam the heck out of my queues for unreasonable amounts of time. (95% of current jobs in this app finish in <300ms, so this will make it harder to allocate resources appropriately.)
I'm currently using Faraday in this app to call other, faster Google APIs. The Faraday wiki suggests using Typhoeus for parallel HTTP requests; however, using Typhoeus with Sidekiq was deemed "a bad idea" as of 2014.
Is Typhoeus still a bad idea? If so, is it reasonable to spawn N threads in a Sidekiq worker, make an HTTP request within each thread, and then wait for all threads to rejoin? Is there some other way to accomplish this extremely I/O-bound task without throwing more workers at the problem? Should I ask my manager to increase our Sidekiq Enterprise spend? ;) Or should I just throw these jobs in a low-priority queue and tell our users with ridiculous habits that they'll just have to wait?
It's reasonable to use threads within Sidekiq job threads. It's not reasonable to build your own threading infrastructure. You can use a reusable thread pool with the concurrent-ruby or parallel gems, you can use an http client which is thread-safe and allows concurrent requests, etc. HTTP.rb is a good one from Tony Arcieri but plain old net/http will work too:
https://github.com/httprb/http/wiki/Thread-Safety
Just remember that there are a few complexities: the job might be retried, and you need to decide how to handle errors that the HTTP client raises. If you don't split these requests 1-to-1 with jobs, you might need to track the outcome of each request, or idempotency becomes an issue.
And you are always welcome to increase your Sidekiq Enterprise thread count. :-D
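As a rough illustration of the thread-pool idea, here is a minimal sketch that uses only the standard library's Thread and Queue so that it stays self-contained; in a real job, a pool from concurrent-ruby or the parallel gem would likely replace the hand-rolled loop. The helper name `run_concurrently`, the pool size, and the `fake_insert` lambda are all hypothetical. Outcomes are recorded per item so that a retried job could skip the items that already succeeded.

```ruby
# Run `work` for each item on a small fixed pool of threads and record
# per-item outcomes. `run_concurrently` is a hypothetical helper, not part
# of Sidekiq or any gem.
def run_concurrently(items, pool_size: 4, &work)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # pop returns nil once the closed queue is drained

  results = {}
  lock = Mutex.new

  threads = Array.new(pool_size) do
    Thread.new do
      while (item = queue.pop)
        outcome =
          begin
            [:ok, work.call(item)]
          rescue StandardError => e
            [:error, e]
          end
        lock.synchronize { results[item] = outcome }
      end
    end
  end

  threads.each(&:join)
  results
end

# Stand-in for a slow HTTP call (assumption: the real block would call the
# Calendar API through a thread-safe client).
fake_insert = lambda do |event_id|
  sleep 0.05
  raise "rate limited" if event_id == 3
  "created-#{event_id}"
end

outcomes = run_concurrently([1, 2, 3, 4, 5], pool_size: 5, &fake_insert)
failed = outcomes.select { |_id, outcome| outcome.first == :error }.keys
```

Here `failed` holds the ids whose request raised, which is the information a retry would need in order to avoid inserting the other events twice.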
Related
I have a Rails app which handles the activation of a license using an external service. The external service sometimes delays handling the Rails request for over 30s, which then returns an error to the front end (I'm running on Heroku, so the max is 30s).
I tried using Active Job with the default Rails async adapter (Rails 5), and I can see that it works on Heroku out of the box. I keep reading that I should be using a separate process and, for example, Redis, but if the background job should just be performed straight after the request is done, and it is just hitting an external API which may be slow, is it so bad to use the default async adapter?
I can see that this is handled in an in-process thread, but I don't see a reason for such a small job to have a separate process.
I use the async adapter in production for sending emails. This is a very small job. An email could take up to 3 seconds to send.
The docs say it's a poor fit for production because it will drop pending jobs on restart. If I remember correctly, Heroku restarts dynos once a day.
If your job is pending during the restart, the job will be lost. In my case, the chance of an email being pending during the restart is pretty slim. So far so good.
But if you have jobs taking 30 seconds, I'd use Resque or DelayedJob.
For a small background job in production that does not require 100% persistence across failures or server restarts, and whose duration is short enough that a separate process would be overkill, I'd recommend using Sucker Punch.
The Sucker Punch gem is designed to handle exactly such a case. It prepares an execution thread pool for each job class you create, using the concurrent-ruby gem, which is (probably) the most robust concurrency library in Ruby. It also hooks at_exit to finish all the pending tasks, so I guess you can expect this gem to be more reliable than the async adapter.
One thing to note is that although Sucker Punch is supported by Active Job, the adapter is not well written. Or, at least, when you use the Sucker Punch adapter, its behavior is just like that of the async adapter. So I'd recommend using bare Sucker Punch if you want something just a little more useful/robust than the async adapter.
Environment
I'm installing Airbrake on Heroku for a Ruby web app (not Rails).
So Airbrake#notify for Airbrake version 5 for Ruby sends a notification asynchronously.
My worry is that if I don't use a Sidekiq worker + Redis, calling Airbrake#notify might still slow down the app's response time, depending on how it's used (whether in a Rails-like controller or some other part of the app).
Besides overcoming the potential issue mentioned above, the other advantage of using Sidekiq worker + Redis to call Airbrake#notify I can think of is that Redis has a couple of persistence strategies so if the app crashes I can backtrack and look over the backed up error notifications from the Sidekiq queue.
Whereas if I don't use Sidekiq + Redis and the app crashes, then there will be no backed up data....
Questions
Does that mean I don't need to use Sidekiq + Redis (or some other equivalent database)?
Am I understanding the issue correctly? I don't have a very complete understanding of "pooled connections" and asynchronous processing, so this makes understanding what to do here a bit challenging.
This is the class that sends async notices https://github.com/airbrake/airbrake-ruby/blob/master/lib/airbrake-ruby/async_sender.rb
It's using standard Ruby threads to send messages, so no background service should be necessary.
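To make the "threads plus a queue" idea concrete, here is a hypothetical sketch of an in-process asynchronous sender. It is not Airbrake's actual code; the class name `AsyncSender` and its methods are invented for illustration, and `sleep` stands in for the HTTP round trip.

```ruby
# Hypothetical sketch of an in-process asynchronous sender: callers push
# onto a queue and return immediately; one background thread delivers.
class AsyncSender
  def initialize(&deliver)
    @queue = Queue.new
    @worker = Thread.new do
      while (notice = @queue.pop)
        begin
          deliver.call(notice)
        rescue StandardError
          # A failed delivery is dropped here; nothing is persisted.
        end
      end
    end
  end

  # Returns as soon as the notice is queued.
  def notify(notice)
    @queue << notice
    nil
  end

  # Stop accepting notices and wait for the queued ones to be delivered.
  def close
    @queue.close
    @worker.join
  end
end

delivered = []
sender = AsyncSender.new do |notice|
  sleep 0.05 # stand-in for the HTTP round trip
  delivered << notice
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
3.times { |i| sender.notify("error #{i}") }
enqueue_time = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

sender.close
```

The caller only pays for the enqueue, so response time is not tied to the delivery. The trade-off raised in the question still applies: the queue lives in memory, so anything still queued when the process dies is gone.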
I'm working on a Ruby script that will be making hundreds of network requests (via open-uri) to various APIs and I'd like to do this in parallel since each request is slow, and blocking.
I have been looking at using Thread or Process to achieve this but I'm not sure which method to use.
With regard to network requests, when should I use a Thread over a Process, or does it not matter?
Before going into detail, there is already a library solving your problem. Typhoeus is optimized to run a large number of HTTP requests in parallel and is based on the libcurl library.
Like a modern code version of the mythical beast with 100 serpent
heads, Typhoeus runs HTTP requests in parallel while cleanly
encapsulating handling logic.
Threads run in the same process as your application. Since Ruby 1.9, native threads are used as the underlying implementation. Resources can be easily shared across threads, as they can all access the shared state of the application. The problem, however, is that you cannot utilize the multiple cores of your CPU with most Ruby implementations.
MRI, the reference Ruby implementation, uses a Global Interpreter Lock (GIL). The GIL is a locking mechanism that ensures shared state is not corrupted by parallel modifications from different threads. A thread releases the lock while it waits on blocking I/O such as a network request, so threads still overlap well for I/O-bound work. Other Ruby implementations like JRuby, Rubinius or MacRuby offer an approach without a GIL.
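In MRI a thread gives up the interpreter lock while it is blocked waiting, which is why threads help with slow network calls even though they cannot run Ruby code on several cores at once. A small sketch of that effect follows; `slow_request` and `elapsed` are hypothetical helpers, and `sleep` stands in for a blocking request.

```ruby
# sleep stands in for a blocking network call: in MRI both release the
# interpreter lock while waiting, so other threads can run.
def slow_request(id)
  sleep 0.2
  "response #{id}"
end

# Returns the block's result together with the wall-clock time it took.
def elapsed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

# One request after another: the waits add up.
sequential, sequential_time = elapsed { (1..5).map { |id| slow_request(id) } }

# One thread per request: the waits overlap.
threaded, threaded_time = elapsed do
  (1..5).map { |id| Thread.new { slow_request(id) } }.map(&:value)
end
```

Both runs return the same responses, but the threaded run should take roughly as long as a single request rather than the sum of all five.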
Processes run separately from each other. Processes do not share resources, which means every process has its own state. This can be a problem if you want to share data across your requests. A process also allocates its own stack of memory. You could still share data by using a message bus like RabbitMQ.
I cannot recommend using only threads or only processes. If you want to implement this yourself, you should use both: for every n requests, fork a new process, which in turn spawns a number of threads to issue the HTTP requests. Why?
If you fork another process for every HTTP request, you will end up with too many processes. Although your operating system might be able to handle this, the overhead is still tremendous. Some HTTP requests might finish very fast, so why bother with an extra process? Just run them in another thread.
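A minimal sketch of that hybrid, under stated assumptions: `fetch` and `fetch_all` are hypothetical names, `sleep` stands in for the real HTTP request, and the platform supports `fork` (so not Windows or JRuby). Because processes do not share memory, each child sends its results back to the parent over a pipe.

```ruby
# Stand-in for the real HTTP request (hypothetical helper).
def fetch(url)
  sleep 0.05
  "body of #{url}"
end

# Split the URLs into groups, fork one process per group, and run one
# thread per URL inside each child.
def fetch_all(urls, per_process:)
  children = urls.each_slice(per_process).map do |group|
    reader, writer = IO.pipe
    reader.binmode
    writer.binmode

    pid = fork do
      reader.close
      pairs = group.map { |url| Thread.new { [url, fetch(url)] } }.map(&:value)
      writer.write(Marshal.dump(pairs))
      writer.close
      exit!(0) # skip at_exit handlers inherited from the parent
    end

    writer.close # the parent only reads
    [pid, reader]
  end

  pairs = children.flat_map do |pid, reader|
    data = reader.read # returns once the child has closed its end
    reader.close
    Process.wait(pid)
    Marshal.load(data)
  end
  pairs.to_h
end

urls = (1..6).map { |n| "https://example.com/#{n}" }
pages = fetch_all(urls, per_process: 3)
```

With six URLs and `per_process: 3`, this forks two children with three threads each. For purely I/O-bound work, a single process with threads may already be enough; the extra processes matter mostly when handling the responses is CPU-heavy.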
I have a Sinatra app I plan on hosting on Heroku.
This application, in part, scrapes a lot of information from other pages around the net and stores the information in a database. These scraping operations are slow, so I need them to run in another thread/process separate from my Sinatra app.
My plan is just to have a button for each process that I can click, and the scraping will take place in the background.
I'm unsure what's the best way to do this, complicated by what Heroku will allow.
There's a gem called hirefire specifically for that:
HireFire automatically "hires" and "fires" (aka "scales") Delayed Job
and Resque workers on Heroku. When there are no queue jobs, HireFire
will fire (shut down) all workers. If there are queued jobs, then
it'll hire (spin up) workers. The amount of workers that get hired
depends on the amount of queued jobs (the ratio can be configured by
you). HireFire is great for both high, mid and low traffic
applications. It can save you a lot of money by only hiring workers
when there are pending jobs, and then firing them again once all the
jobs have been processed. It's also capable of dramatically reducing
processing time by automatically hiring more workers when the queue
size increases.
My server process is basically an API that responds to REST requests.
Some of these requests are for starting long running tasks.
Is it a bad idea to do something like this?
    get "/crawl_the_web" do
      Thread.new do
        Crawler.new # this will take many many days to complete
      end
    end

    get "/status" do
      "going well" # this can be run while there are active Crawler threads
    end
The server won't be handling more than 1000 requests a day.
Not the best idea....
Use a background job runner to run jobs.
POST /crawl_the_web should simply add a job to the job queue. The background job runner will periodically check for new jobs on the queue and execute them in order.
You can use, for example, delayed_job for this, setting up a single separate process to poll for and run the jobs. If you are on Heroku, you can use the delayed_job feature to run the jobs in a separate background worker/dyno.
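The shape of that pattern can be sketched without any gem. Everything below is hypothetical: `JobQueue` is an invented class, and its in-memory array merely stands in for the persistent store a real job runner such as delayed_job would use, so this version would not survive a restart.

```ruby
# Hypothetical sketch of the queue-and-runner pattern. The in-memory array
# stands in for a persistent store, so jobs here are lost on restart.
class JobQueue
  Job = Struct.new(:id, :name, :status)

  def initialize
    @jobs = []
    @lock = Mutex.new
  end

  # What the request handler would call: record the job and return at once.
  def enqueue(name)
    @lock.synchronize do
      job = Job.new(@jobs.size + 1, name, :pending)
      @jobs << job
      job.id
    end
  end

  # What a status endpoint would call.
  def status(id)
    @lock.synchronize { @jobs.find { |job| job.id == id }&.status }
  end

  # What the separate runner would call on each poll: run pending jobs in
  # the order they were queued.
  def run_pending
    pending = @lock.synchronize { @jobs.select { |job| job.status == :pending } }
    pending.each do |job|
      begin
        yield job.name
        job.status = :done
      rescue StandardError
        job.status = :failed
      end
    end
  end
end

queue = JobQueue.new
id = queue.enqueue("crawl_the_web")
status_before = queue.status(id)

ran = []
queue.run_pending { |name| ran << name }
status_after = queue.status(id)
```

The request handler only ever calls `enqueue`, so it returns immediately, while the long-running work happens wherever `run_pending` is called, ideally in a separate worker process.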
If you do this, how are you planning to stop/restart your Sinatra app? When you finally deploy your app, it is probably going to be served by Unicorn, Passenger/mod_rails, etc. Unicorn manages the lifecycle of its child processes, and it would have no knowledge of the long-running threads you might have launched. That's a problem.
As someone suggested above, use delayed_job, resque or any other queue-based system to run background jobs. You get persistence of the jobs, you get horizontal scalability (just launch more workers on more nodes), etc.
Starting threads during request processing is a bad idea.
Besides the fact that you cannot control your worker threads (start/stop them in a controlled way), you'll quickly get into trouble if you start a thread inside request processing. Think about what happens: the request ends and the process gets ready to serve the next request, while your worker thread still runs and accesses process-global resources like the database connection, open files, the same class variables and global variables, and so on. Sooner or later, your worker thread (or any library used from it) will affect the main thread somehow and break other requests, and it will be almost impossible to debug.
You're really better off using separate worker processes. delayed_job for example is a really small dependency and easy to use.