How to schedule a job that runs every 2 minutes in Ruby

I have code that generally does this:
# every 2 minutes
begin
  reap_crops
  sow_seeds_via_some_api
rescue StandardError => e
  # this must happen if the API call fails
  tell_neighbor_to_take_care_of_crops
end
and say I eventually want to do this in multiple fields simultaneously every 2 minutes, and I'm only in Ruby (not Rails), what's the easiest way to do this? Two approaches I've considered are using sidekiq-scheduler or using the Thread class. What are the advantages and disadvantages of each approach? Take note that if the API fails, I need to get into the rescue clause, otherwise a lot of money is lost.
If I wanted to write this as a recurring piece of work that runs every 2 minutes (and this does not need user input), what's the best way to write this in Ruby?
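For reference, the bare-bones Thread-based version I have in mind looks roughly like this (the method names are placeholders):

worker = Thread.new do
  loop do
    begin
      reap_crops
      sow_seeds_via_some_api
    rescue StandardError => e
      # must run whenever the API call fails
      tell_neighbor_to_take_care_of_crops
    end
    sleep 120   # wait 2 minutes before the next run
  end
end

worker.join   # keep the main thread alive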

#Jwan622 did you find a solution already?
I would go for Sidekiq, as it offers many features like scheduled jobs, different retry options, a web UI, etc.
You also need to think about situations like service restarts or deployments. I fear a solution based on threads will not be reliable (out of the box) in these situations.
Disclaimer: I maintain Sidekiq::Undertaker, an open-source plugin for Sidekiq that allows retrying dead jobs. I'm not involved in the main Sidekiq project and I don't receive any affiliate fees.
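As a rough sketch of what the Sidekiq route could look like (the class name, method names, and schedule key are made up, and the exact YAML layout may differ between sidekiq-scheduler versions):

# farm_tick_job.rb
require 'sidekiq'

class FarmTickJob
  include Sidekiq::Worker
  sidekiq_options retry: 5   # Sidekiq retries failed jobs with exponential backoff

  def perform
    reap_crops
    sow_seeds_via_some_api
  rescue StandardError
    # guaranteed to run when the API call raises; re-raise so Sidekiq still retries
    tell_neighbor_to_take_care_of_crops
    raise
  end
end

# config/sidekiq.yml, read by the sidekiq-scheduler gem
:scheduler:
  :schedule:
    farm_tick:
      every: '2m'
      class: FarmTickJob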

Related

Is it a reasonable trade-off to use `sleep` in an asynchronous job?

Assume I have a scenario where I am processing a background job in a worker. It simply receives a URL for a file (image, video, PDF, ...) hosted on a remote CDN, and the worker does its work as follows:
1) Some processing on the file content in-memory
2) Then calls a 3rd party API to retrieve a signed valid URL for uploading the content to that same 3rd party
3) Uploads the content to the 3rd party API – the response contains a unique file ID
4) Sends a message to a user through the 3rd party API with the unique file ID received earlier
Now, the problem is between steps (3) and (4). The constraint here is that the 3rd party API needs a few seconds to process the file (step 3) before we actually send a message containing the file ID we just uploaded (step 4).
One more assumption here is that I need to make sure all 4 steps execute in one go, as in, not to have any partial failure opportunities.
Possible approaches
The most naive way to go is to use sleep 5 between steps (3) and (4). It might hurt or hard-fail since I am not exactly sure how many seconds the 3rd party API needs for processing, but according to my trials, a 5-second sleep seemed alright.
I could do an in-process exponential retry, 3 (or X) times, for step (3): catch an exception from the 3rd party and attempt step (4) once step (3) is successful. This is what I have now and it works alright.
I could perhaps either use a job scheduler or a ruby concurrency library to do step (4) in a delayed fashion. I don't appreciate this path as it feels like it is favouring complexity.
This piece of logic is built in Ruby. The question might not be very Ruby-specific and could apply to other languages, but I would like to hear what Ruby folks think.
The API docs you linked to say:
Attention! Some time needed by a server to process an uploaded file.
File should be sent to a chat after a short timeout (a couple of
seconds)
I would usually advise against something of this nature, but since your vendor specifically says "timeout", sleep is the best option.
I'd try a delayed task, as it allows the thread to continue working: the thread pool won't need to create new threads (which are quite expensive memory-wise), and your thread can keep doing useful work without a context switch (which is expensive CPU-wise).
As for purity of the solution, asynchronous programming should not involve any blocking calls (blocking is exactly what asynchronous programming is fighting against), so that is one more reason to use a delayed task.
If the application does not need to achieve the highest performance (is Ruby a performance-oriented language anyway?), then sleep may really be the easiest, though not the most optimal, solution.
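To illustrate the delayed-task idea in plain Ruby, here is a minimal sketch using the concurrent-ruby gem (content, upload_content_to_third_party and send_message_with_file_id are placeholders for the asker's steps, and the 5-second delay is illustrative):

require 'concurrent'

file_id = upload_content_to_third_party(content)   # step (3)

# Schedule step (4) roughly 5 seconds later instead of blocking the worker with sleep;
# the current thread is free to do other work in the meantime.
task = Concurrent::ScheduledTask.execute(5) do
  send_message_with_file_id(file_id)               # step (4)
end

# Only block here if all four steps really must complete inside this one job.
task.value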

Scheduling tasks/messages for later processing/delivery

I'm creating a new service, and for it I have database entries (Mongo) with a state field that I need to update based on the current time. For instance, if the start time was set to two hours from now, I need to change the state from CREATED -> STARTED in the database at that point, and there can be multiple such states.
Approaches I've thought of:
Keep querying database entries that are <= current time and then change their states accordingly. This causes extra reads for no reason and half the time empty reads, and it will get complicated fast with more states coming in.
I could write a job scheduler (I am using Go, so that wouldn't be too hard) and schedule all the jobs, but I might lose queue data in case of a panic/crash.
I could use a product like Celery; I've found a Go implementation of it: https://github.com/gocelery/gocelery
Another task scheduler I've found is on Google Cloud (https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine), but I don't want to get stuck with proprietary technologies.
I wanted to use some PubSub service for this, but I couldn't find one that has delayed messages (if that's a thing). My problem is mainly not being able to find an actual name for this problem so I can search for it properly; I've even tried searching the Microsoft docs. If someone can point me in the right direction, or if any of the approaches I've written down are the ones I should use, please let me know; that would be a great help!
UPDATE:
Found one more solution by Netflix, for the same problem
https://medium.com/netflix-techblog/distributed-delay-queues-based-on-dynomite-6b31eca37fbc
I think you are right in that the problem you are trying to solve is the job or task scheduling problem.
One approach that many companies use is the system you are proposing: jobs are inserted into a datastore with a time to execute at, and then that datastore is polled for jobs to be run. There are optimizations that reduce extra reads, like polling the database at a regular interval and using exponential back-off. The advantage of this system is that it is tolerant of node failure, and the disadvantage is added complexity to the system.
Looking around, in addition to the one you linked (https://github.com/gocelery/gocelery) there are other implementations of this model (https://github.com/ajvb/kala or https://github.com/rakanalh/scheduler were ones I found after a quick search).
The other approach you described, scheduling jobs in-process, is very simple in Go because parked goroutines are extremely cheap, so you can just spawn a goroutine per job. The downside is that if the process dies, the job is lost. For example:
go func() {
    // Park this goroutine until the expiration time, then run the job.
    <-time.After(expirationTime.Sub(time.Now()))
    // do work here.
}()
A final approach that I have seen but wouldn't recommend is the callback model (something like https://gitlab.com/andreynech/dsched). This is where your service calls to another service (over http, grpc, etc.) and schedules a callback for a specific time. The advantage is that if you have multiple services in different languages, they can use the same scheduler.
Overall, before you decide on a solution, I would consider some trade-offs:
How acceptable is job loss? If it's ok that some jobs are lost a small percentage of the time, maybe an in-process solution is acceptable.
How long will jobs be waiting? If it's longer than the shutdown period of your host, maybe a datastore based solution is better.
Will you need to distribute job load across multiple machines? If you need to distribute the load, sharding and scheduling are tricky things and you might want to consider using a more off-the-shelf solution.
Good luck! Hope that helps.

Laravel Raffle Project. Is a Queue the best way to achieve this?

I'm creating a raffle site as a small side project. It will handle multiple raffles each with an end time. At the end of each raffle a single winner is chosen.
Are Laravel Jobs the best way to go with this? Do I just create a single forever-repeating job to check if any raffles have ended and need a winner?
If not, what would be the best way to go?
I don't think that forever-repeating scripts are generally a good idea.
I just create a single forever-repeating job
This is almost never a good idea. It has its applications in legacy code bases, but websockets and events are better suited to this job. Also, you have the benefit of using a really good framework like Laravel, so take advantage of it.
Websockets
If you want people to be notified in real time in the browser.
If you have all your users subscribe to a websocket channel when they load the page, you can easily send a message to a websocket server to all subscribed clients (ie browsers) to let them know who the winner is.
Then, in your client-side code (JavaScript), you can parse that message to determine who the winner is and render a pop-up that lets the user know.
Events
If you don't mind a bit of a delay, most definitely use events for this.
At the end of every action that might potentially end a raffle (e.g. a name is chosen at random by the computer in a function chooseName()), fire an event that notifies all participants in the raffle.
https://laravel.com/docs/5.2/events
NB: I've listed the above two as separate issues, but actually they could be used together. For example, in the event that a name is chosen at random, determine if the raffle is over and notify clients via a websocket connection.
Why I wouldn't use delayed Jobs
The crux of the reason - maintainability
Imagine a scenario where something extends the time of your raffle by a week. This could've happened because a raffle was cheated on or whatever (can't really think of all the use cases in that area).
Now, your job has a set delay in place - is it really a good programming principle to have to change two things when only one scenario changed? Nope. Having something like an event in place - onRaffleEnd - explicitly looks for the occurrence of an event. Laravel doesn't care when that event happens.
Using delayed Jobs can work - it's just not a good programming use case in your scenario and limits what you're able to do in the longer run. It will force you to make more considerations when unforeseen circumstances come along as well as when you want to change things. This also decentralizes the logic related to your raffle. Whilst decoupling code is good practice, having logic sit in completely different places makes maintenance a nightmare.

How do I wait for both threads to finish and files to be ready for reading without polling?

What I'd like to achieve is as follows (pseudocode):
f, t = select(files, threads)
if f
  <read from files>
elsif t
  <do something else>
end
Where select is a method similar to IO.select. But it seems unlikely to be possible.
The big picture is that I'm trying to write a program which has to perform several types of jobs. The idea was to pass job data using a database, but also to inform the program about new jobs using pipes (by sending just the type of the job), so that it wouldn't need to poll for jobs. So I was planning to create a loop waiting for either new notifications from pipes or for worker threads to finish. After a thread finishes, I check whether there was at least one notification about this particular type of job and run the worker thread again if needed. I'm not really sure what's the best route to take here, so if you've got suggestions I'd like to hear them out.
Don't reinvent the wheel mate :) check out https://github.com/eventmachine/eventmachine (IO lib based on reactor pattern like node.js etc) or (perhaps preferably) https://github.com/celluloid/celluloid-io (IO lib based on actor pattern, better docs and active maintainers)
OPTION 1 - use EM or Celluloid to handle non-blocking sockets
EM and Celluloid are quite different: EM is the reactor pattern ("same thing" as node.js, with a threadpool as a workaround for blocking calls) and Celluloid is the actor pattern (an actor thread pool).
Both can do non-blocking IO to/from a lot of sockets and delegate work to a lot of threads, depending on how you go about it. Both libs are very robust, efficient and battle-tested; EM has more history but seems to have fallen slightly out of maintenance (https://www.youtube.com/watch?v=mPDs-xQhPb0), while Celluloid has a nicer API and a more active community (http://www.youtube.com/watch?v=KilbFPvLBaI).
Best advice I can give is to play with code samples that both projects provide and see what feels best. I'd go with celluloid for a new project, but that's a personal opinion - you may find that EM has more IO-related features (such as handling files, keyboard, unix sockets, ...)
OPTION 2 - use background job queues
I may have been misguided by the low level of your question :) Have you considered using some of the job queues available under ruby? There's a TON of decent and different options available, see https://www.ruby-toolbox.com/categories/Background_Jobs
OPTION 3 - DIY (not recommended)
There is a pure-Ruby implementation of EM; it uses IO selectables to handle sockets, so it offers a pattern for what you're trying to do. Check it out: https://github.com/eventmachine/eventmachine/blob/master/lib/em/pure_ruby.rb#L311 (see the selectables handling).
However, given the amount of other options, hopefully you shouldn't need to resort to such low level coding.
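That said, if you do end up rolling your own, one low-level pattern that comes close to your pseudocode is the self-pipe trick: worker threads write a byte to a pipe when they finish, so a single IO.select call can wake up on either file input or thread completion. A minimal sketch (jobs, files, run, handle_file_input and handle_finished_thread are placeholders for your own data and logic):

done_reader, done_writer = IO.pipe
results = Queue.new

jobs.each do |job|
  Thread.new do
    results << run(job)       # do the actual work
    done_writer.write("x")    # wake the main loop: a thread has finished
  end
end

loop do
  ready, = IO.select(files + [done_reader])
  ready.each do |io|
    if io == done_reader
      io.read(1)              # consume the wake-up byte
      handle_finished_thread(results.pop)
    else
      handle_file_input(io.read_nonblock(4096))
    end
  end
end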

Web crawler in Ruby: How to achieve the best performance?

I'm writing a web crawler that should be able to parse multiple pages at the same time. I use Nokogiri for parsing, which is quite good and solves all my tasks, but I don't know how to achieve better performance.
I use threads to make many open-uri requests at the same time, and it makes the process quicker, but it seems that it's still far from the potential I could achieve from a single server. Should I use multiple processes? What are the limits on the threads and processes that can be launched for a single Ruby application?
In other words: how do I achieve the best performance in this case?
I really like Typhoeus and Hydra for handling multiple requests at once.
Typhoeus is the http client side, and Hydra is the part that handles multiple requests. The examples are good so go through them and see.
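A minimal Typhoeus/Hydra sketch (urls is a placeholder for your list of pages, and the concurrency limit is illustrative):

require 'typhoeus'
require 'nokogiri'

hydra = Typhoeus::Hydra.new(max_concurrency: 20)

urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    if response.success?
      doc = Nokogiri::HTML(response.body)
      # ... extract whatever you need from doc ...
    end
  end
  hydra.queue(request)
end

hydra.run   # runs all queued requests concurrently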
While it sounds like you're not looking for something quite so complex, I found this thesis an interesting read a while ago: Building blocks of a scalable webcrawler - Marc Seeger.
In terms of threading/process limits, Ruby has very low threading potential. Standard Ruby (MRI/YARV) and Rubinius don't support simultaneous thread execution unless you use an extension specifically built to support it. Depending on how much of your performance trouble is in the IO and how much is in the processing, I could suggest using EventMachine.
With multiple processes, however, Ruby works very well: as long as you've got a good manager/database for all the processes to communicate with, running multiple processes should scale as well as your processing power allows.
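As a rough illustration of the multi-process side, assuming you can partition the URL list up front and that each child reports its results to a shared store (crawl is a placeholder for your fetch-and-parse logic):

slices = urls.each_slice((urls.size / 4.0).ceil)   # split the work across 4 children

slices.each do |slice|
  fork do
    slice.each { |url| crawl(url) }   # each child crawls its own slice
  end
end

Process.waitall   # wait for all child processes to finish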
Hey another way is to use a combination of Nokogiri and IronWorker (IronMQ and IronCache).
See a full blog entry on the topic here.
We use a combination of ActiveMQ/Active Messaging, Event Machine, and multi-threading for this problem. We start off with a big list of URLs to fetch. We then break them down into batches of 100 URLs per batch. Each batch is then pushed into ActiveMQ. Then, we have an array of poller/consumer processes listening to the queue. These consumers can all be on one computer, or they can be spread across multiple computers. The array of consumers can grow arbitrarily large to support as much parallelism as we want. The consumers use Active Messaging, which is a nice Ruby integration with ActiveMQ.
When a consumer receives a message to process a batch of 100 URLs, it kicks off Event Machine to create a thread pool that can process multiple messages in multiple threads. Like you, we use Nokogiri to process each URL.
So, there are three levels of parallelism:
1) Multiple concurrent requests per consumer process, supported by Event Machine and threads.
2) Multiple consumer processes per computer.
3) Multiple computers.
If you want something easy, go for http://anemone.rubyforge.org/
If you want something fast, code something with eventmachine/em-http-request
I found redis to be a great multi-purpose tool for queue management, caching and so on. You could also use specialized things like beanstalkd/active mq/... but at least in my use case, I didn't really find them to be a big advantage compared to redis.
Especially the load on the backend system could be a bottleneck, so choose your database carefully and pay attention to what you save.
