Concurrent access to an API with rate limits - Ruby

This question is not only about Ruby.
I have many workers running that create many connections to an external API, and the API has a rate limit.
Right now I use Sidekiq and Redis to limit access: every time a rate-limited API call is needed, I enqueue a worker.
When the worker starts, it checks when the API was last accessed. If that was more recently than the API allows, the worker reschedules itself; otherwise it updates the timestamp in Redis and runs the request.
For example:
    def run_or_schedule
      limiter = RedisLimiter.new(account_token)
      if limiter.can_i_run?
        limiter.like
        run
      else
        ApiWorker.perform_at(limiter.next_like, *params)
      end
    end
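RedisLimiter itself isn't shown in the question; purely as a sketch of what such a class might look like, assuming a simple one-request-per-interval rule (the key layout and the INTERVAL value are assumptions, not the asker's actual implementation):

    require 'redis'

    # Hypothetical sketch of the RedisLimiter used above: one request
    # per INTERVAL seconds per account token.
    class RedisLimiter
      INTERVAL = 1 # seconds between allowed requests (assumed)

      def initialize(account_token)
        @redis = Redis.new
        @key   = "api_limiter:#{account_token}"
      end

      # true if enough time has passed since the last recorded request
      def can_i_run?
        last = @redis.get(@key)
        last.nil? || Time.now.to_f - last.to_f >= INTERVAL
      end

      # record now as the time of the most recent request
      def like
        @redis.set(@key, Time.now.to_f)
      end

      # earliest Time at which the next request may run
      def next_like
        Time.at(@redis.get(@key).to_f + INTERVAL)
      end
    end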
The problem is that I create many requests and they get rescheduled many times.
Can someone recommend a better solution?
Is there a design pattern for this?

One alternative to the polling approach you are using would be to have a supervisor.
So instead of each worker handling the question itself, you have another object/worker/process that decides when it is time for the next worker to run.
If the API imposes a time limit between requests, you could have this supervisor execute as often as the time limit allows.
If the limit is more complex (e.g. a total number of requests per X interval), you could have the supervisor running constantly and, upon hitting a limit, block (sleep) for the remainder of the interval. Upon resuming, it could continue with the next worker from the queue.
One apparent advantage of this approach is that you could skip the overhead related to each individual worker being instantiated, checking itself whether it should run, and being rescheduled if it shouldn't.
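A minimal sketch of the supervisor idea, built on Ruby's standard thread-safe Queue with a fixed delay between requests (the Supervisor class and its method names are illustrative, not from an existing library):

    # Illustrative supervisor: a single loop that dequeues pending jobs
    # and runs them no faster than the API allows.
    class Supervisor
      def initialize(interval)
        @interval = interval   # minimum seconds between API requests
        @jobs     = Queue.new  # thread-safe queue of pending jobs
      end

      def enqueue(job)
        @jobs << job
      end

      def run
        loop do
          job = @jobs.pop      # blocks until a job is available
          job.call             # the actual API request
          sleep @interval      # respect the rate limit before the next one
        end
      end
    end

Workers then only enqueue jobs; the supervisor alone decides when each one actually hits the API.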

You can use a queue and a dedicated thread that sends the queued events (whenever any are waiting) at the maximum allowable rate.
Say you can send one API call per second; you could do the following:
    class APIProxy
      def like(data)
        @like_queue << data
      end

      def run
        @like_queue = Queue.new
        Thread.new do
          loop do
            actual_send_like @like_queue.pop
            sleep 1
          end
        end
      end

      private

      def actual_send_like(data)
        # use the API you need
      end
    end
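Usage would look something like this (payload is a stand-in for whatever data the API call needs; run must be called once before any like calls, since it creates the queue):

    proxy = APIProxy.new
    proxy.run             # starts the consumer thread and creates the queue
    proxy.like(payload)   # enqueued; sent at a maximum rate of one per second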

Related

Do a set number of requests per hour in Ruby?

We use an API that imposes a rate limit per hour.
I wonder what the best way would be to make a set number of requests per hour to the API from our own scripts, i.e. making 10 requests per hour so we don't exceed our allowance and avoid overage charges.
I was thinking of just using sleep(60*6) in my loop, but API calls can take minutes, so it might end up doing a lot fewer requests than allowed.
What is the best practice for spreading out our requests?
Edit:
I ended up doing something like this. What do you think?

    while queue.size > 0
      Thread.new {
        element = queue.pop
        # do the rate-limited API calls and things
      }
      sleep(60 * 6)   # 6 minutes between thread launches => 10 per hour
    end
Consider the rack-attack middleware.
To sum it up: you keep a count somewhere (in memory, or in a database like Redis) of the requests executed by a specific client (identified by IP, identity, or some other form) within a given time window.
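Rack::Attack throttles requests coming into your own app rather than outgoing calls, but its configuration shows exactly this bookkeeping; a minimal sketch (the rule name, limit, and period are arbitrary examples):

    # config/initializers/rack_attack.rb
    # Count requests per client IP in the configured cache store
    # (e.g. Redis) and reject clients exceeding 10 requests per hour.
    Rack::Attack.throttle("requests per ip", limit: 10, period: 3600) do |request|
      request.ip
    end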

Ruby on Rails, Resque

I have a Resque job class that is responsible for producing a report on user activity. The class queries the database and then performs numerous calculations and data-parsing steps to send out an email to certain people. My question is: should Resque jobs like this, with numerous methods (200 lines or so of code), be made up entirely of class methods behind the single ResqueClass.perform entry point? Or should I instantiate a new instance of the Resque class to represent the single report being produced? If both approaches properly calculate the data and email it, is there a convention or best practice for how this should be handled in background jobs?
Thank you!
Both strategies are valid. I generally approach this from the perspective of concurrency. While your job is running, the resque worker servicing your job is busy, so if you have N workers and N of these jobs running, you're going to have to wait until one is done before anything else in the queue gets processed.
Maybe that's OK: if you just have one report at a time, then in effect you will dedicate one worker to running the report and your others can do other things. But if you have a pile of these and each takes a while, you might impact other jobs in your queue.
The downside is that if your report dies, you may need logic to pick up where you left off. If you instantiate the report once per user, you'd simply need to retry the failed jobs - no "where was I" logic is required.
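A sketch of the per-instance approach, one job per user report, following the standard Resque convention of a queue name plus a self.perform entry point (the class name and queue name are illustrative assumptions):

    # Illustrative: each job builds and emails one user's report, so a
    # failure only requires retrying that single job.
    class UserActivityReport
      @queue = :reports

      def self.perform(user_id)
        new(user_id).deliver
      end

      def initialize(user_id)
        @user_id = user_id
      end

      def deliver
        # query the database, run the calculations, and email the
        # report for @user_id
      end
    end

Each report would then be enqueued with Resque.enqueue(UserActivityReport, user.id), so the retry granularity after a failure is a single user's report.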

How can I make resque worker process other jobs while current job is sleeping?

Each task I have works in short bursts, then sleeps for about an hour, then works again, and so on until the job is done. Some jobs may take about 10 hours to complete and there is nothing I can do about that.
What bothers me is that while a job is sleeping, its Resque worker stays busy, so if I have 4 workers and 5 jobs, the last job has to wait 10 hours before it can be processed. That is grossly suboptimal, since it could run while any other worker is sleeping. Is there any way to make a Resque worker process another job while the current job is sleeping?
Currently I have a worker similar to this:
    class ImportSongs
      def self.perform(api_token, songs)
        api = API.new api_token

        songs.each_with_index do |song, i|
          # make current worker proceed with another job while it's sleeping
          sleep 60*60 if i != 0 && i % 100 == 0
          api.import_song song
        end
      end
    end
It looks like the problem you're trying to solve is API rate limiting with batch processing of the import process.
You should have one job that runs as soon as it's enqueued and enumerates all the songs to be imported. You can then break those down into groups of 100 (or whatever size the limit requires) and schedule deferred jobs at one-hour intervals using resque-scheduler.
However, if you have a hard API rate limit and you execute several of these distributed imports concurrently, you may not be able to control how much API traffic is going out at once. If your rate limit is that strict, you may want to build a specialized process as a single point of control to enforce the rate limit, with its own work queue.
With resque-scheduler, you'll be able to repeat discrete jobs at scheduled or delayed times as an alternative to a single, long running job that loops with sleep statements.
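A sketch of the batching idea; Resque.enqueue_in is resque-scheduler's delayed-enqueue call, while the ImportSongsBatch class and the batch size are illustrative:

    # Illustrative: split the song list into batches of 100 and schedule
    # each batch one hour after the previous one, instead of sleeping
    # inside a single long-running worker.
    class ImportSongsBatch
      @queue = :imports

      def self.perform(api_token, songs)
        api = API.new(api_token)
        songs.each { |song| api.import_song(song) }
      end
    end

    songs.each_slice(100).with_index do |batch, i|
      Resque.enqueue_in(i * 3600, ImportSongsBatch, api_token, batch)
    end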

How to run multiple threads at the same time in ruby while working with a file?

I've been messing around with Ruby and threading a little bit today. I have a list of proxies that I want to check. Assuming a timeout of 10 seconds, going through a very large list of proxies will take many hours if I write something like this:
    proxies.each do |proxy|
      check_proxy(proxy)
    end
My first problem in trying to figure out threads is how to START multiple threads at the exact same time. I found a neat little snippet of code online:
    require 'open-uri'
    require 'hpricot'

    threads = []
    for page in pages
      threads << Thread.new(page) { |myPage|
        puts "Fetching: #{myPage}\n"
        doc = Hpricot(open(myPage.to_s)).to_s
        puts "Got #{myPage}: #{doc.size}"
      }
    end
Seems to work nicely as far as starting them all at the same time. So now I can... start checking all 7 thousand records at the same time?
How do I go to a file, take out a line for each thread, run a batch of like 20 and repeat the process?
Can I run a while loop that in turn starts 20 threads at the same time (which remove lines from a file) and keeps going until the file is blank?
I'm a little weak on the logic of what I'm supposed to do.
Thanks guys!
PS.
Another thought: Will there be file access issues if 20 workers are constantly messing with it randomly? What would be a good way around that if this is so?
The keyword you are after is thread pool. You can either try to find one for Ruby (I am sure there are at least a couple on GitHub), or roll your own.
There's a simple implementation here on SO.
Re: the file access, IMO you shouldn't let workers alter the file directly, but do it in your main thread. You don't want to allow simultaneous edits there.
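Putting both points together, here is a minimal roll-your-own thread pool built on Ruby's standard Queue, where only the main thread reads the file (the pool size, file name, and the check_proxy call are placeholders taken from the question):

    POOL_SIZE = 20
    jobs = Queue.new

    # Only the main thread touches the file; workers never see it.
    File.foreach("proxies.txt") { |line| jobs << line.chomp }
    POOL_SIZE.times { jobs << :done }   # one stop marker per worker

    workers = POOL_SIZE.times.map do
      Thread.new do
        while (proxy = jobs.pop) != :done
          check_proxy(proxy)            # hypothetical, from the question
        end
      end
    end
    workers.each(&:join)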
Try the delayed_job gem:
https://github.com/tobi/delayed_job
You don't need to spawn that many threads to do this work. In fact, spawning a lot of threads can decrease the overall performance of your application. If you handle checking each proxy asynchronously, without blocking, you can get by with far fewer threads.
You'd create a file-manager thread to process the file. Each line gets added as a request to an array (the request queue). On the other end of the request queue you can use EventMachine to send the requests without blocking; EventMachine would also receive the responses and handle the timeouts. Each response is then placed on another array (the response queue), which the file-manager thread polls. The file-manager thread pulls responses from the response queue and records whether each proxy works or not.
This gets you down to just two threads. One issue you will have is limiting the number of outstanding requests, since this model can send out all of the requests in less than a second and flood the nearest router. In my experience you should be able to have around 500 outstanding requests at any one time.
There is more than one way to solve this problem asynchronously but hopefully the above is enough to help get you started with non-blocking I/O.
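A rough sketch of the non-blocking send side, using EventMachine with the em-http-request gem (the shape of the proxies list and the record_result helper are hypothetical, and real code would also need the outstanding-request cap described above):

    require 'eventmachine'
    require 'em-http-request'

    # Check each proxy by fetching a known URL through it; no thread
    # blocks on any single connection.
    EventMachine.run do
      pending = proxies.size
      proxies.each do |proxy|
        conn = EventMachine::HttpRequest.new(
          "http://example.com/",
          proxy: { host: proxy[:host], port: proxy[:port] },
          connect_timeout: 10
        )
        http = conn.get
        http.callback { record_result(proxy, :alive); EventMachine.stop if (pending -= 1).zero? }
        http.errback  { record_result(proxy, :dead);  EventMachine.stop if (pending -= 1).zero? }
      end
    end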

Tell Merb not to timeout

After posting a question related to nginx, I'm a bit further along in my investigation: the problem is that the Merb framework times out after about 30 seconds. If I tell the underlying nginx server not to time out, Merb still does, and I can't find a way to tell it not to. I need to make requests that take up to several minutes.
Any hints? Thanks a lot.
-- UPDATE --
It seems that Mongrel, running behind Merb, is causing the error. Is there any way to change the Mongrel timeout when running with Merb?
Perhaps a different approach would yield better results: rather than working around the timeouts, how about maximizing throughput by deferring execution of the task?
Some approaches for long-running tasks are to either use run_later or exec a separate worker process to complete the task:
    def run_in_background(r)
      Thread.new do
        response = IO.popen(r) do |f|
          f.read
        end
      end
    end
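For the run_later alternative, a Merb controller action might look something like this sketch (the action name and perform_expensive_import are illustrative; run_later is Merb's built-in hook for deferring a block until after the response is sent):

    def import
      run_later do
        perform_expensive_import   # hypothetical long-running work
      end
      self.status = 202            # Accepted
      "Import started; poll the status URL for progress"
    end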
In both cases you should return 202 (Accepted) as the status code and a URL where the calling application can get status updates.
I use this approach to handle requests that cause background batch processes to execute. Each process writes its start time, progress, and completion time to a database (you could easily use a file). When the status URL is invoked, I fetch the progress from the database and return it to the calling process.
