Rails - Concurrency issue with puma workers - ruby

I have a Puma server configured with two workers, each running 16 threads, and config.threadsafe! disabled to allow threading under Puma.
Now I have some code that I suspect is not thread-safe, even though I guard it with a Mutex stored in a constant. I want this code to be executed by only one Puma thread at a time to avoid concurrency issues, and I use the Mutex for that.
My questions are:
Does a Mutex provide thread safety across Puma threads when there are multiple workers? As I understand it, each worker is a separate process, so the Mutex will not work across them.
If the Mutex doesn't work across workers, what would be a solution to make this particular code thread-safe?
Code example
class MyService
  MUTEX = Mutex.new

  def initialize
    # ...
  end

  def do_task
    MUTEX.synchronize do
      # critical section: only one thread per process runs this at a time
    end
  end
end

The MUTEX approach didn't work for me, so I needed another approach. Please see the solution below.
The problem is that different Puma threads make requests to an external remote API at the same time, and sometimes the remote API takes a while to respond.
I wanted to restrict the total number of concurrent API requests, but the Mutex couldn't do that across workers.
To resolve this:
I created a DB table where I insert a new entry marked in-progress whenever a request is sent to the external API.
Once the API responds, I update that entry to processed.
Before making any new request to the external API, I count how many entries are still in-progress.
This way, I can restrict the total number of concurrent requests from my system to the external API, as sketched below.
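A minimal sketch of that table-based throttle, assuming an ActiveRecord model named ApiRequest with a string status column; the model name, the column, and the limit of five are illustrative assumptions:

class ApiClient
  MAX_IN_FLIGHT = 5 # assumed limit on concurrent external calls

  def call_remote_api
    # Every Puma worker sees the same table, so this count works across
    # processes, unlike a Mutex.
    return :throttled if ApiRequest.where(status: 'in-progress').count >= MAX_IN_FLIGHT

    entry = ApiRequest.create!(status: 'in-progress')
    begin
      send_request_to_external_api # hypothetical helper for the remote call
    ensure
      # Mark the entry done even if the call raises, so a failure doesn't
      # leave a stuck in-progress row holding a slot forever.
      entry.update!(status: 'processed')
    end
  end
end

Note that the count and the insert are not atomic: two processes can pass the check at the same moment. Wrapping the check and insert in a transaction with a row or advisory lock closes that window.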

Related

Concurrent HTTP requests from within a single Sidekiq worker?

I'm trying to interact with Google's Calendar API. My tests so far show response times of 5-10 seconds to insert a single event, and I may need to export thousands of events at once [don't ask]. This seems likely to spam the heck out of my queues for unreasonable amounts of time. (95% of current jobs in this app finish in <300ms, so this will make it harder to allocate resources appropriately.)
I'm currently using Faraday in this app to call other, faster Google APIs. The Faraday wiki suggests using Typhoeus for parallel HTTP requests; however, using Typhoeus with Sidekiq was deemed "a bad idea" as of 2014.
Is Typhoeus still a bad idea? If so, is it reasonable to spawn N threads in a Sidekiq worker, make an HTTP request within each thread, and then wait for all threads to rejoin? Is there some other way to accomplish this extremely I/O-bound task without throwing more workers at the problem? Should I ask my manager to increase our Sidekiq Enterprise spend? ;) Or should I just throw these jobs in a low-priority queue and tell our users with ridiculous habits that they'll just have to wait?
It's reasonable to use threads within Sidekiq job threads. It's not reasonable to build your own threading infrastructure. You can use a reusable thread pool from the concurrent-ruby or parallel gems, or use an HTTP client that is thread-safe and allows concurrent requests. HTTP.rb is a good one from Tony Arcieri, but plain old net/http will work too:
https://github.com/httprb/http/wiki/Thread-Safety
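For instance, a hedged sketch of the thread-per-request pattern with plain net/http, run inside a single Sidekiq job; the endpoint URL is a placeholder and authentication is omitted:

require 'net/http'
require 'json'

# Fan out one thread per event insert, then wait for all of them.
def insert_events_concurrently(events)
  threads = events.map do |event|
    Thread.new do
      begin
        uri = URI('https://example.googleapis.com/calendar/v3/events') # placeholder
        Net::HTTP.post(uri, event.to_json, 'Content-Type' => 'application/json')
      rescue StandardError => e
        e # hand the error back so the job can decide whether to retry
      end
    end
  end
  threads.map(&:value) # Thread#value joins each thread and returns its result
end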
Just remember that there are a few complexities: the job might be retried, and you have to decide how to handle errors that the HTTP client raises. If you don't split these requests 1-to-1 with jobs, you'll need to track each request yourself, or idempotency becomes an issue.
And you are always welcome to increase your Sidekiq Enterprise thread count. :-D

What is the best way to use Redis in a Multi-threaded Rails environment? (Puma / Sidekiq)

I'm using Redis in my application, both for Sidekiq queues, and for model caching.
What is the best way to have a Redis connection available to my models, considering that the models hitting Redis will be called both from my web application (run via Puma) and from background jobs inside Sidekiq?
I'm currently doing this in my initializers:
Redis.current = Redis.new(host: 'localhost', port: 6379)
And then simply use Redis.current.get / Redis.current.set (and similar) throughout the code...
This should be thread-safe, as far as I understand, since the Redis Client only runs one command at a time, using a Monitor.
Now, Sidekiq has its own connection pool to Redis, and recommends doing
Sidekiq.redis do |conn|
  conn.get('some_key')
  conn.set('some_key', 'some_value')
end
As I understand it, this would be better than the approach of just using Redis.current because you don't have multiple workers on multiple threads waiting on each other on a single connection when they hit Redis.
However, how can I make this connection that I get from Sidekiq.redis available to my models? (without having to pass it around as a parameter in every method call)
I can't set Redis.current inside that block, since it's global, and I'm back to everyone using the same connection (plus switching between them randomly, which might even be non-thread-safe)
Should I store the connection that I get from Sidekiq.redis in a thread-local variable, and use that thread-local variable everywhere?
In that case, what do I do in the "Puma" context? How do I set the thread-local variable?
Any thoughts on this are greatly appreciated.
Thank you!
You use a separate global connection pool for your application code. Put something like this in your redis.rb initializer:
require 'redis'
require 'connection_pool'

# Up to 10 lazily created Redis connections, shared process-wide.
REDIS = ConnectionPool.new(size: 10) { Redis.new }
Now in your application code anywhere, you can do this:
REDIS.with do |conn|
# some redis operations
end
You'll have up to 10 connections to share amongst your puma/sidekiq workers. This will lead to better performance since, as you correctly note, you won't have all the threads fighting over a single Redis connection.
All of this is documented here: https://github.com/mperham/sidekiq/wiki/Advanced-Options#connection-pooling
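For example, a model method built on the shared pool might look like this; it's a sketch, and the key scheme and 15-minute TTL are illustrative assumptions:

class Product < ActiveRecord::Base
  def cached_price
    REDIS.with do |conn|
      cached = conn.get("product:#{id}:price")
      return cached if cached

      price.to_s.tap do |value|
        conn.setex("product:#{id}:price", 900, value) # cache for 15 minutes
      end
    end
  end
end

It's also worth sizing the pool to at least your Puma and Sidekiq thread counts, so threads don't queue up waiting for a free connection.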

How can I efficiently poll a lot of servers?

I am looking for a good way to poll a lot of servers for their status over TCP. I am currently using synchronous code and the Minecraft Query Protocol, but whenever a server is offline the rest of the queue gets held up.
Another problem with my current code is that some servers block the machine I use for polling in their firewall, so those servers appear offline on my server list.
I am using a Ruby rake task with an infinite loop in which every Minecraft server in my MongoDB database gets checked and updated roughly every 10 minutes (I try to hit this interval by letting the loop sleep (600 / s.count.to_i).ceil seconds between servers).
Is there any way I can do this task efficiently (and prevent servers from blacklisting my IP in their firewall), preferably with Async code in Ruby?
You need non-blocking sockets or multithreading for the checks. The simplest approach is to spawn several threads so that several servers are checked at once; that way your main thread won't be held up by one slow or offline server.
This question contains a lot of information about multithreading in Ruby: you should be able to spawn multiple concurrent threads, or at least use non-blocking sockets.
Another option, pointed out by @Lie Ryan: you can use IO.select to poll an array of sockets all at once and collect the "online" servers when it returns; this can be more elegant than spawning multiple threads.
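A hedged sketch of the IO.select approach with non-blocking TCP connects; the hostnames, port, and five-second timeout are illustrative assumptions:

require 'socket'

servers = { 'mc1.example.com' => 25565, 'mc2.example.com' => 25565 }

# Start a non-blocking connect to every server at once.
pending = {}
servers.each do |host, port|
  sock = Socket.new(:INET, :STREAM)
  begin
    sock.connect_nonblock(Socket.sockaddr_in(port, host))
  rescue IO::WaitWritable
    # connect is in progress; the socket turns writable once it finishes
  end
  pending[sock] = [host, port]
end

# Wait up to five seconds for the connects to complete.
_, writable, = IO.select(nil, pending.keys, nil, 5)
online = (writable || []).select do |sock|
  host, port = pending[sock]
  begin
    sock.connect_nonblock(Socket.sockaddr_in(port, host))
    true
  rescue Errno::EISCONN
    true  # already connected, so the server is up
  rescue SystemCallError
    false # refused, unreachable, etc.
  end
end.map { |sock| pending[sock][0] }

pending.each_key(&:close)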

What's the best way to fetch a POP3 server for new mails every 15 minutes?

I'm developing an app that needs to check a POP3 account every 5-15 minutes for new email and process it. I have written all the code except the part that makes it run automatically every 5-15 minutes.
I'm using Sinatra and DataMapper, and hosting on Heroku, which puts cron jobs out of the question, because Heroku only provides hourly cron jobs at best.
I have looked into Delayed::Job, which natively supports neither Sinatra nor DataMapper, though there are workarounds for both. Since my Ruby knowledge is limited, I couldn't find a way to merge these two forks into one working Delayed::Job setup for Sinatra/DataMapper.
Initially I used Mailman to check for emails, which has built-in polling and runs continuously, but since it's not Rack-based it doesn't run on Heroku.
Any pointers on where to go next? Before you say: a different webhost, I should add I really prefer to stick with Heroku because of its ease of use (except of course, for the above issue).
Heroku supports CloudMailin
A simple trick is to write your code contained in a loop, then sleep at the bottom of it for however long you want:
Untested sample code...
loop do
  do_something_way_cool
  sleep 5 * 60 # sleep takes seconds, so this is five minutes
end
If it has to be contained in the main body of the app, wrap the loop in a Thread so the thread does the work. You'll need to figure out the shared data structures for transferring data out of the loop; Queue is your friend there.
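A minimal sketch of that thread-plus-Queue arrangement using net/pop from the standard library; the host, credentials, and ten-minute interval are placeholder assumptions:

require 'net/pop'

mail_queue = Queue.new

# Background poller: fetch new mail, hand it to the main app via the
# queue, then sleep until the next round.
Thread.new do
  loop do
    Net::POP3.start('pop.example.com', 110, 'user', 'secret') do |pop|
      pop.each_mail do |mail|
        mail_queue << mail.pop # raw text of the message
        mail.delete
      end
    end
    sleep 10 * 60
  end
end

# Elsewhere in the app, consume messages as they arrive:
# message = mail_queue.pop # blocks until the poller queues something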

Is it a bad idea to create worker threads in a server process?

My server process is basically an API that responds to REST requests.
Some of these requests are for starting long running tasks.
Is it a bad idea to do something like this?
get "/crawl_the_web" do
Thread.new do
Crawler.new # this will take many many days to complete
end
end
get "/status" do
"going well" # this can be run while there are active Crawler threads
end
The server won't be handling more than 1000 requests a day.
Not the best idea....
Use a background job runner to run jobs.
POST /crawl_the_web should simply add a job to the job queue. The background job runner will periodically check for new jobs on the queue and execute them in order.
You can use, for example, delayed_job for this, setting up a single separate process to poll for and run the jobs. If you are on Heroku, you can use the delayed_job feature to run the jobs in a separate background worker/dyno.
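A hedged sketch of that queue-based shape using delayed_job's custom-job API; CrawlJob is a hypothetical class name, and the route is POST as suggested above:

# A plain Ruby job object: delayed_job serializes it into the jobs table
# and calls #perform from the separate worker process.
class CrawlJob
  def perform
    Crawler.new # runs in the background worker, not in the web request
  end
end

post "/crawl_the_web" do
  Delayed::Job.enqueue(CrawlJob.new)
  "crawl queued"
end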
If you do this, how are you planning to stop/restart your Sinatra app? When you finally deploy it, your application will probably be served by Unicorn, Passenger/mod_rails, etc. Unicorn manages the lifecycle of its child processes, and it has no knowledge of the long-running threads you might have launched; that's a problem.
As someone suggested above, use delayed_job, resque or any other queue-based system to run background jobs. You get persistence of the jobs, you get horizontal scalability (just launch more workers on more nodes), etc.
Starting threads during request processing is a bad idea.
Besides not being able to control your worker threads (start/stop them in a controlled way), you'll quickly get into trouble if you start a thread inside request processing. Think about what happens: the request ends and the process prepares to serve the next request while your worker thread is still running, accessing process-global resources such as the database connection, open files, class variables, and globals. Sooner or later, the worker thread (or a library it uses) will affect the main thread somehow, break other requests, and be almost impossible to debug.
You're really better off using separate worker processes. delayed_job, for example, is a really small dependency and easy to use.
