Ruby threading/forking with API (Sinatra) - ruby

I am using the Sinatra gem for my API. What I want to do is: when a request is received, process it, return the response, and then start a new long-running task.
I am a newbie to Ruby; I have read about threading but I'm not sure what the best way to accomplish this is.
Here is my Sinatra endpoint:
post '/items' do
  # Processing data
  # Return response (body ...)
  # Start long running task
end
I would be grateful for any advice or example.

I believe the better way to do this is to use background jobs. While your web worker executes a long-running task, it is unavailable for new requests. With background jobs, they do the work while your web worker stays free to handle new requests.
You can have a look at the most popular background job gems for Ruby as a starting point: resque, delayed_job, sidekiq.
UPD: The implementation depends on the chosen gem, but the general scheme will look like this:
# Controller
post '/items' do
  # Processing data
  MyAwesomeJob.enqueue # here you put your job into the queue
  head :ok # or whatever
end
In MyAwesomeJob you implement your long-running task.
Next, about Mongoid and background jobs: you should never use complex objects as job arguments. I don't know what kind of task you are implementing, but the general answer is to use simple objects.
For example, instead of passing your User as an argument, pass user_id and then find the user inside your job. If you do it like that, you can use any DB without problems.
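A minimal sketch of that pattern, assuming Sidekiq and a hypothetical ProcessItemJob worker, might look like this:
require 'sidekiq'

# Hypothetical worker; the class name and the work inside perform are placeholders.
class ProcessItemJob
  include Sidekiq::Worker

  # Pass only simple arguments (ids, strings), never whole model objects.
  def perform(user_id)
    user = User.find(user_id) # re-load the record inside the job
    # ... long-running work for this user ...
  end
end

# In the Sinatra route, enqueue and return immediately:
post '/items' do
  # Processing data
  ProcessItemJob.perform_async(params['user_id'])
  status 200
  body 'ok'
end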

Agree with unkmas.
There are two ways to do this.
Threads or a background job gem like sidekiq.
Threads are perfectly fine if the processing times aren't that high and you don't want to write code for a worker. But there is a strong possibility that you'll spin up too many threads if you don't use a thread pool or if you're expecting bursty HTTP traffic.
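For light work, a hedged sketch of that in-process thread approach (do_long_running_task is a hypothetical helper) could look like this:
post '/items' do
  data = params.to_h # capture what the task needs before the request returns

  # Fire-and-forget thread: fine for light, occasional work, but under bursty
  # traffic these pile up, so prefer a pool or a job queue for anything heavy.
  Thread.new(data) do |payload|
    do_long_running_task(payload)
  rescue => e
    warn "background task failed: #{e.message}"
  end

  status 200
  body 'accepted'
end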
The best way to do it is by using Sidekiq or something similar. You could even have a job queue like beanstalkd in between, enqueue the job to it, and return the response. A worker can then read from the queue and process the job later on.

Related

How can I manage multiple threads in Ruby on Rails (equivalent of a Java thread factory)?

I'm using Ruby on Rails 5, although I'm fairly new to Ruby/Rails. I have read about creating threads using:
t = Thread.new {
  sleep(rand(0) / 10.0)
  Thread.current["mycount"] = count
  count += 1
}
However, I'm wondering if there is a standard way of managing a bunch of application-created threads in Ruby/Rails. I'm familiar with Java, which has a thread factory. That allows a certain number of threads to concurrently run while others must wait in a queue. I'm wondering how I would do something similar in Ruby/Rails.
Note that I'm not talking about the types of threads that are generated automatically when a web page is requested. I'm talking about threads that I (the application owner) creates.
I think https://github.com/ruby-concurrency/concurrent-ruby is the most widely used library of concurrency utilities in Ruby.
It has a lot of useful things, including thread pools (http://ruby-concurrency.github.io/concurrent-ruby/file.thread_pools.html), which I think is what you are looking for.
Keep in mind that MRI Ruby has a GIL, so you only get parallelism if your threads are waiting on IO. For heavy computation you might want to use JRuby or look elsewhere :)
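A minimal sketch using concurrent-ruby's FixedThreadPool (items and process are placeholders for your own collection and work):
require 'concurrent'

# Fixed pool of five worker threads; extra work waits in the pool's queue.
pool = Concurrent::FixedThreadPool.new(5)

items.each do |item|
  pool.post { process(item) }
end

pool.shutdown             # stop accepting new work
pool.wait_for_termination # block until everything queued has run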
Especially if you're using ActiveRecord inside your threads, you need to be careful about concurrency issues such as database connections leaking.
Usually if you want to launch a new thread, you want to do something asynchronously in the background without having the user's request hang on an expensive action. For this there are some great libraries like Sucker Punch and Sidekiq. I'd recommend using one of these instead of creating and managing threads manually.
Hope this helps

Is it safe to call the Sidekiq API from inside perform?

Nothing seems to prevent a perform method from using the Sidekiq API. It should be safe in read-only mode.
What if it calls a "write" method? Especially when that method acts on the current job itself.
We would like to reschedule a job without creating a new job because we need to track the job completion with the sidekiq-status gem from another worker.
Using MyWorker.perform_in or MyWorker.perform_at to reschedule the job from inside the worker creates a new job, making it difficult to track the total completion. We're thinking of using Sidekiq::ScheduledSet.new.find and the reschedule method but it seems awkward and potentially dangerous to reschedule a job that is about to complete.
Do Sidekiq and its API support this use case?
You might be able to hack something together but it'll be really slow if you try to modify the Sets and Lists in Redis directly. They aren't designed to be used that way.
The official Sidekiq solution to this problem is a Batch.
https://github.com/mperham/sidekiq/wiki/Batches#status
You create a one-job batch. If the job needs to be rescheduled, it adds a new job to the Batch to be executed later. Your other worker just checks the status of the overall Batch and whether it is 100% complete.
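A hedged sketch of that flow, assuming Sidekiq Pro's Batch API (the worker and helper names are placeholders):
# Kick off a one-job batch and keep its bid for status checks.
batch = Sidekiq::Batch.new
batch.jobs do
  MyWorker.perform_async(item_id)
end
bid = batch.bid

class MyWorker
  include Sidekiq::Worker

  def perform(item_id)
    do_the_work(item_id)           # hypothetical
    if needs_another_run?(item_id) # hypothetical reschedule check
      # Re-open the same batch so the rescheduled run still counts toward it,
      # instead of becoming an untracked brand-new job.
      batch.jobs do
        MyWorker.perform_in(600, item_id) # run again in 10 minutes
      end
    end
  end
end

# The other worker tracks overall completion:
status = Sidekiq::Batch::Status.new(bid)
status.complete? # true only once every job added to the batch has finished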

Using timeout in a sidekiq or activejob perform method

I'm planning to move from Heroku Scheduler to a custom clock process using the clockwork gem. Heroku Scheduler will kill a task if it didn't complete before the next scheduled one of the same type.
How do I achieve this in Sidekiq?
Given that Timeout is not thread-safe, is it a bad idea to do this in a Sidekiq worker?
class RunsTooLongWorker
  include Sidekiq::Worker
  sidekiq_options :retry => false

  def perform(*args)
    Timeout::timeout(2.hours) do
      # do possibly long running task
    end
  end
end
If not, what's the alternative? Let's say I want to run a job every 10 minutes, but I don't want to have the same jobs running at the same time. How should I deal with that?
To answer this part of your question:
Given that Timeout is not thread-safe, is it a bad idea to do this in a Sidekiq worker?
Here is a great blog post written by Mike Perham about
why you shouldn't use Ruby's timeout module in sidekiq jobs
As for an alternative... It sounds like what you need most is to ensure that your jobs don't trip over each other. For this, I can think of two approaches:
Poor man's approach: create an 'enqueued' attribute on the model of the object you're working on. When you begin processing, mark the item as enqueued, and when you finish, mark it as not enqueued. Scope your every-ten-minute job to items that are not enqueued. Alternatively, create a new table with your job's name and a status field, and query it for availability before re-executing your processing.
If you have something more complex going on, this sounds like a great case for the Sidekiq Unique Jobs gem. I think the 'while executing' approach is the one you want.
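A minimal sketch of the poor man's approach, assuming an ActiveRecord model with a boolean enqueued column (the model and helper names are hypothetical):
class EveryTenMinutesWorker
  include Sidekiq::Worker
  sidekiq_options retry: false

  def perform
    # Only pick up items nobody else is already working on.
    Item.where(enqueued: false).find_each do |item|
      item.update!(enqueued: true)
      begin
        process(item) # hypothetical long-running work
      ensure
        item.update!(enqueued: false)
      end
    end
  end
end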
I guess the solution for your task is to use the whenever gem, as suggested in the Sidekiq wiki (and there is a wiki for whenever as well). After installation it creates a config/schedule.rb file where you can define the schedule of your jobs, for example:
every 3.hours do
  runner "MyModel.some_process"
  rake "my:rake:task"
  command "/usr/bin/my_great_command"
end
It has three built-in job types, as you can see from the snippet: runner, rake, and command. You can also define your own, which is explained in the wiki as well.

Ruby on Rails, Resque

I have a Resque job class that is responsible for producing a report on user activity. The class queries the database and then performs numerous calculations/data parsing to send out an email to certain people. My question is: should Resque jobs like this, with numerous methods (200 lines or so of code), be built entirely from class methods responding to the single ResqueClass.perform call? Or should I be instantiating a new instance of this Resque class to represent the single report that is being produced? If both approaches properly calculate the data and email it, is there a convention or best practice for how this should be handled in background jobs?
Thank You
Both strategies are valid. I generally approach this from the perspective of concurrency. While your job is running, the resque worker servicing your job is busy, so if you have N workers and N of these jobs running, you're going to have to wait until one is done before anything else in the queue gets processed.
Maybe that's ok - if you just have one report at a time then you in effect will dedicate one worker to running the report, your others can do other things. But if you have a pile of these and it takes a while, you might impact other jobs in your queue.
The downside is that if your report dies, you may need logic to pick up where you left off. If you instantiate the report once per user, you'd simply need to retry the failed jobs - no "where was I" logic is required.
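As a hedged sketch (the class, queue, and helper names are placeholders), a common middle ground is to keep the class-level perform thin and delegate to an instance, which also keeps the 200 lines testable outside Resque:
class UserActivityReportJob
  @queue = :reports

  # Resque always calls this class method with the enqueued arguments.
  def self.perform(user_id)
    new(user_id).run
  end

  def initialize(user_id)
    @user = User.find(user_id)
  end

  def run
    data  = query_activity  # hypothetical: pull this user's rows
    stats = calculate(data) # hypothetical: the heavy parsing/calculations
    ReportMailer.activity(@user, stats).deliver_now # hypothetical mailer
  end
end

# Enqueue one job per user so a failure only retries that user's report:
Resque.enqueue(UserActivityReportJob, user.id)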

How to run multiple threads at the same time in ruby while working with a file?

I've been messing around with Ruby and threading a little bit today. I have a list of proxies that I want to check. Assuming a timeout of 10 seconds, going through a very large list of proxies will take many hours if I write something like:
proxies.each do |proxy|
  check_proxy(proxy)
end
My first problem with trying to figure out threads is how to START multiple at the same exact time. I found a neat little snippet of code online:
for page in pages
  threads << Thread.new(page) { |myPage|
    puts "Fetching: #{myPage}\n"
    doc = Hpricot(open(myPage.to_s)).to_s
    puts "Got #{myPage}: #{doc.size}"
  }
end
Seems to work nicely as far as starting them all at the same time. So now I can... start checking all 7 thousand records at the same time?
How do I go to a file, take out a line for each thread, run a batch of like 20 and repeat the process?
Can I run a while loop that in turn starts 20 threads at the same time (each removing lines from a file) and keeps going until the file is empty?
I'm a little weak on the logic of what I'm supposed to do.
Thanks guys!
PS.
Another thought: Will there be file access issues if 20 workers are constantly messing with it randomly? What would be a good way around that if this is so?
The keyword you are after is thread pool. You can either try to find one for Ruby (I am sure there are a couple at least on GitHub), or roll your own.
Here's a simple implementation here on SO.
Re: the file access, IMO you shouldn't let workers alter the file directly, but do it in your main thread. You don't want to allow simultaneous edits there.
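A hedged sketch of a small hand-rolled pool built on Ruby's thread-safe Queue (check_proxy comes from the question; the file names are placeholders):
POOL_SIZE = 20
jobs    = Queue.new
results = Queue.new # workers report here instead of touching files themselves

# The main thread owns the file: read it once and push work onto the queue.
File.foreach('proxies.txt') { |line| jobs << line.strip }

workers = POOL_SIZE.times.map do
  Thread.new do
    until jobs.empty?
      proxy = jobs.pop(true) rescue break # non-blocking pop; stop when drained
      results << [proxy, check_proxy(proxy)]
    end
  end
end
workers.each(&:join)

# Back in the main thread, write the results so only one thread touches files.
File.open('results.txt', 'w') do |f|
  f.puts(results.pop.inspect) until results.empty?
end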
Try the delayed_job gem:
https://github.com/tobi/delayed_job
You don't need to generate that many Threads in order to do this work. In fact generating a lot of Threads can decrease the overall performance of your application. If you handle checking each proxy asynchronously, without blocking, you can get by with far fewer threads.
You'd create a file manager thread to process the file. Each line gets added as a request to an array (the request queue). On the other end of the request queue you can use eventmachine to send the requests without blocking. eventmachine would also be used to receive the responses and handle the timeout. Each response can then be placed on another array (the response queue), which your file manager thread polls. The file manager thread pulls the responses from the response queue and resolves whether the proxy exists or not.
This gets you down to just creating two threads. One issue that you will have is limiting the number of requests in flight, since this model will be able to send out all of the requests in less than a second and flood the nearest router. In my experience you should be able to have around 500 outstanding requests at any one time.
There is more than one way to solve this problem asynchronously but hopefully the above is enough to help get you started with non-blocking I/O.
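The eventmachine wiring is too long for a short snippet, but the request-queue / response-queue structure described above (with a plain worker thread standing in for the eventmachine side, and check_proxy taken from the question) might be organized like this:
requests  = Queue.new # filled by the file manager thread
responses = Queue.new # drained by the file manager thread

# File manager thread: the only thread that reads or writes files.
manager = Thread.new do
  File.foreach('proxies.txt') { |line| requests << line.strip }
  requests << :done

  loop do
    proxy, alive = responses.pop
    break if proxy == :done
    File.open('good_proxies.txt', 'a') { |f| f.puts(proxy) } if alive
  end
end

# Sender side: in the answer above this would be an eventmachine reactor sending
# non-blocking requests; a plain thread stands in for it in this sketch.
sender = Thread.new do
  loop do
    proxy = requests.pop
    if proxy == :done
      responses << [:done, nil]
      break
    end
    responses << [proxy, check_proxy(proxy)]
  end
end

[manager, sender].each(&:join)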

Resources