Is it safe to call the Sidekiq API from inside perform? - ruby

Nothing seems to prevent a perform method from using the Sidekiq API. It should be safe in read-only mode.
What if it calls a "write" method, especially one that acts on the current job itself?
We would like to reschedule a job without creating a new job because we need to track the job completion with the sidekiq-status gem from another worker.
Using MyWorker.perform_in or MyWorker.perform_at to reschedule the job from inside the worker creates a new job, making it difficult to track the total completion. We're thinking of using Sidekiq::ScheduledSet.new.find and the reschedule method but it seems awkward and potentially dangerous to reschedule a job that is about to complete.
Do Sidekiq and its API support this use case?

You might be able to hack something together but it'll be really slow if you try to modify the Sets and Lists in Redis directly. They aren't designed to be used that way.
The official Sidekiq solution to this problem is a Batch.
https://github.com/mperham/sidekiq/wiki/Batches#status
You create a one-job batch. If the job needs to be rescheduled, it adds a new job to the Batch to be executed later. Your other worker just checks the status of the overall Batch to see whether it is 100% complete.

Related

Spring Boot, Cron job synchronization

In my Spring Boot application, a Cron job (which runs every 5 minutes) needs to process 2000 products in my database.
Right now, processing these 2000 products takes more than 5 minutes. I ran into the issue where the second Cron job starts while the first one has not completed yet.
Is there out-of-the-box functionality in Spring/Cron that synchronizes these jobs and waits for the previous job to complete before starting the next one?
Please advise how to properly implement this kind of system. The following technologies are also available: Neo4j, MongoDB, Kafka. Please advise how to properly design and implement this functionality using Spring/Cron on its own or together with the mentioned technologies.
1) You may try to use @Scheduled(fixedDelay = 5*60*1000). It guarantees that the next invocation starts 5 minutes after the previous one finishes. But this may break your scheduling requirements.
2) You can limit the underlying executor's pool size to 1 thread, so the next invocation has to wait until the previous one finishes. But this, again, can break the logic, since it affects all periodic tasks invoked by @Scheduled.
3) You can use Quartz instead of Spring's native @Scheduled. It's more complicated to configure, but it achieves the desired behaviour via the @DisallowConcurrentExecution annotation or by setting JobDetail::isConcurrentExectionDisallowed in your job details.

Ruby threading/forking with API (Sinatra)

I am using the Sinatra gem for my API. When a request is received, I want to process it, return the response, and then start a new long-running task.
I am a newbie to Ruby. I have read about threading but am not sure what the best way to accomplish my task is.
Here is my Sinatra endpoint:
post '/items' do
  # Processing data
  # Return response (body ...)
  # Start long running task
end
I would be grateful for any advice or example.
I believe the better way to do it is to use background jobs. While your web worker executes a long-running task, it is unavailable for new requests. With background jobs, the jobs do the work while your web worker handles new requests.
You can have a look at the most popular background job gems for Ruby as a starting point: resque, delayed_job, sidekiq
UPD: Implementation depends on chosen gem, but general scheme will be like this:
# Route handler
post '/items' do
  # Processing data
  MyAwesomeJob.enqueue # put your job into the queue here (the exact call depends on the gem)
  status 200 # or whatever status fits
end
In MyAwesomeJob you implement your long-running task.
Next, about Mongoid and background jobs. You should never use complex objects as job arguments. I don't know what kind of task you are implementing, but there is a general answer: use simple objects.
For example, instead of using your User as an argument, use user_id and then find the user inside your job. If you do it that way, you can use any DB without problems.
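A minimal sketch of the "simple arguments" advice, using only the Ruby standard library. The USERS hash, the enqueue and work_one helpers, and GreetUserJob are hypothetical stand-ins for a real datastore, queue, and job class. The JSON round trip imitates how job libraries such as Sidekiq serialize arguments, which is why a plain id survives the trip while a rich object would not.

```ruby
require "json"

# Hypothetical stand-in for a database table keyed by id.
USERS = { 42 => { name: "Alice" } }.freeze

# Hypothetical in-memory queue holding serialized job payloads.
QUEUE = []

# Serialize the job class name and its arguments, as a real queue would.
def enqueue(job_class, *args)
  QUEUE << JSON.generate("class" => job_class.name, "args" => args)
end

class GreetUserJob
  # Receives only the id and loads the record inside the job.
  def perform(user_id)
    user = USERS.fetch(user_id)
    "Hello, #{user[:name]}"
  end
end

# Pop one payload, rebuild the job, and run it.
def work_one
  payload = JSON.parse(QUEUE.shift)
  Object.const_get(payload["class"]).new.perform(*payload["args"])
end

enqueue(GreetUserJob, 42)
puts work_one
```

Because the job looks the record up itself, it also sees the current state of the data at the time it runs rather than a stale copy taken at enqueue time.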
Agree with unkmas.
There are two ways to do this.
Threads or a background job gem like sidekiq.
Threads are perfectly fine if the processing times aren't that high and if you don't want to write code for the worker. But there is a strong possibility that you will spawn too many threads if you don't use a thread pool or if you're expecting bursty HTTP traffic.
The best way to do it is by using sidekiq or something similar. You could even put a job queue like beanstalkd in between, enqueue the job to it, and return the response. A worker can then read from the queue and process the job later on.
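To illustrate the thread pool point, here is a rough sketch built on Ruby's core Queue and Thread classes. The run_in_pool helper and its pool_size parameter are hypothetical names chosen for this example; the idea is only that a fixed number of workers drain a shared queue, so a burst of work never creates more than pool_size threads.

```ruby
# Hypothetical helper: run callables on a fixed number of worker threads
# and return their results in the original order.
def run_in_pool(jobs, pool_size: 4)
  queue = Queue.new
  jobs.each_with_index { |job, index| queue << [index, job] }
  queue.close # once drained, pop returns nil and the workers exit

  results = Array.new(jobs.size)
  workers = Array.new(pool_size) do
    Thread.new do
      while (item = queue.pop)
        index, job = item
        results[index] = job.call
      end
    end
  end

  workers.each(&:join)
  results
end

# Ten small tasks, but never more than three threads at a time.
jobs = (1..10).map { |n| -> { n * n } }
squares = run_in_pool(jobs, pool_size: 3)
puts squares.inspect
```

In a web endpoint the pool would be created once at startup and fed from each request, which is essentially what a background job library does for you with persistence and retries on top.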

Run non-blocking series of jobs

A certain number of jobs need to be executed in sequence, such that the result of one job is the input to another. There's also a loop in one part of the job chain. Currently, I'm running this sequence using wait for completion, but I'm going to start this sequence from a web service, so I don't want to get stuck waiting for a response. I want to start the sequence and return.
How can I do that, considering that the jobs depend on each other?
The typical approach I follow is to use an Oozie workflow to chain the sequence of jobs, passing the dependent inputs to each job accordingly.
I used a shell script to invoke the Oozie job.
I am not sure about loops within an Oozie workflow, but the link below describes a way to implement loops within the workflow. Hope it helps.
http://zapone.org/bernadette/2015/01/05/how-to-loop-in-oozie-using-sub-workflow/
Apart from this, the JobControl class is also a good option if the jobs need to run in sequence, and it requires less effort to implement. The loop would be easy to do since it would be written entirely in Java code.
http://gandhigeet.blogspot.com/2012/12/hadoop-mapreduce-chaining.html
https://cloudcelebrity.wordpress.com/2012/03/30/how-to-chain-multiple-mapreduce-jobs-in-hadoop/

what is a `Scheduler` in RxJS

I've seen the term Scheduler very frequently in the documentation.
But what does this term mean? I don't even know how to use a so-called Scheduler. The official documentation didn't tell me what exactly a Scheduler is. Is this a common concept or a concept specific to RxJS?
Rx schedulers provide an abstraction that allows work to be scheduled to run, possibly in the future, without the calling code needing to be aware of the mechanism used to schedule the work.
Whenever an Rx method needs to generate a notification, it schedules the work on a scheduler. By supplying a scheduler to the Rx method instead of using the default, you can subtly control how those notifications are sent out.
In server-side implementations of Rx (such as Rx.NET), schedulers play an important role. They allow you to schedule heavy duty work on the thread pool or dedicated threads, and run the final subscription on the UI thread so you can update your UI.
When using RxJs, it is actually pretty rare that you need to worry about the scheduler argument to most methods. Since JavaScript is essentially single-threaded, there are not a lot of options for scheduling and the default schedulers are usually the right choice.
The only real choices are:
immediateScheduler - Runs the work synchronously and immediately. Sort of like not using a scheduler at all. Work scheduled thus is guaranteed to run synchronously.
currentThreadScheduler - Similar to immediateScheduler in that the work is run immediately. However, it does not run work recursively. So, if the work is running and schedules more work, then that additional work is put into queue to be run after the current work finishes. Thus work sometimes runs synchronously and sometimes asynchronously. This scheduler is useful to avoid stack overflows or infinite recursion. For example Rx.Observable.of(42).repeat().subscribe() would cause infinite recursion if it ran on the immediate scheduler, but since return runs on the currentThread scheduler by default, infinite recursion is avoided.
timeoutScheduler - The only scheduler that supports work scheduled to be run in the future. Essentially uses setTimeout to schedule all work (though if you schedule the work to be run "now", then it uses other faster asynchronous methods to schedule the work). Any work scheduled on this scheduler is guaranteed to be run asynchronously.
There may be some more now, such as a scheduler that schedules work on the browser animation frames, etc.
If you are trying to write testable code, then you almost always want to supply the scheduler argument. This is because in your unit tests, you will be creating testScheduler instances, which will let your unit test control the clock used by your Rx code (and thus control the exact timing of the operations).

Is hadoop's job ThreadSafe?

Does anyone know if org.apache.hadoop.mapreduce.Job is thread-safe? In my application I create a thread for each job, and then call waitForCompletion. I also have another monitor thread that checks every job's state with isComplete.
Is that safe? Are jobs thread-safe? The documentation doesn't seem to mention anything about it...
Thanks
Udi
Unlike the others, I also use threads to submit jobs in parallel and wait for their completion. You just have to use one Job instance per thread. If you share the same Job instance across multiple threads, you have to take care of the synchronization yourself.
Why would you want to write a separate thread for each job? What exactly is your use case?
You can run multiple jobs in your Hadoop cluster. Do you have dependencies between the multiple jobs?
Suppose you have 10 jobs running. If 1 job fails, would you need to re-run the 9 successful jobs?
Finally, job tracker will take care of scheduling multiple jobs on the Hadoop cluster. If you do not have dependencies then you should not be worried about thread safety. If you have dependencies then you may need to re-think your design.
Yes, they are. Actually, the file is split into blocks and each block is processed on a separate node. All the map tasks run in parallel and their output is fed to the reducer after they are done. There is no question of synchronization as you would think about it in a multi-threaded program. In a multi-threaded program all the threads run on the same box, and since they share some of the data you have to synchronize them.
Just in case you need another kind of parallelism at the map task level, you should override the run() method in your mapper and work with multiple threads there. The default implementation calls setup(), then map() once per record to process, and finally calls cleanup() once.
Hope this helps someone!
If you are checking whether the jobs have finished, I think you are a bit confused about how MapReduce works. You ought to let Hadoop do that itself.
