All,
I have some client code that needs to execute tasks that might be long-running. A user will want to upload a video for processing, which could take a long time and/or possibly fail. Another user might want to upload a small picture, which could finish quickly or also fail. In all cases I need to be able to update the client code with some sort of progress as the job moves along. Is there a Spring 4 solution for this kind of pattern? I have found many pub/sub solutions, but they were all several years old. I am hoping this type of problem is now common enough to have a structured solution.
Related
I have code that generally does this:
# every 2 minutes
begin
  reap_crops
  sow_seeds_via_some_api
rescue StandardError => e
  tell_neighbor_to_take_care_of_crops(e) # this must happen
end
and say I eventually want to do this in multiple fields simultaneously, every 2 minutes, and I'm only in Ruby (not Rails). What's the easiest way to do this? Two approaches I've considered are using sidekiq-scheduler or using the Thread class. What are the advantages and disadvantages of each? Note that if the API fails, I need to get into the rescue clause; otherwise a lot of money is lost.
If I wanted to write this as a recurring piece of work that runs every 2 minutes (and this does not need user input), what's the best way to write this in Ruby?
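For the plain-Thread approach, a minimal in-process scheduler could look like the sketch below. All names here are illustrative (not from any gem), and the rescue funnels every failure into an on_error handler so the money-losing path is always handled; the big caveat is that this only lives as long as the process.

```ruby
# Run the given block every `interval` seconds in a background thread.
# Failures are passed to on_error (e.g. tell the neighbor to take over)
# instead of killing the loop. Illustrative sketch, not a library.
def every(interval, on_error: ->(e) { warn e.message }, &work)
  Thread.new do
    loop do
      begin
        work.call
      rescue StandardError => e
        on_error.call(e) # the must-happen fallback
      end
      sleep interval
    end
  end
end

# Usage: every(120) { reap_crops; sow_seeds_via_some_api }
```

Note that the thread dies silently with the process on a deploy or crash, which is the usual argument for moving to a persistent queue like Sidekiq.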
@Jwan622 did you find a solution already?
I would go for Sidekiq, as it offers many features such as scheduled jobs, different retry options, a web UI, etc.
You also need to think about situations like service restarts or deployments. I fear a solution based on threads will not be reliable (out of the box) in these situations.
Disclaimer: I maintain Sidekiq::Undertaker, an open-source plugin for Sidekiq that allows retrying dead jobs. I'm not involved in the main Sidekiq project and I don't get any affiliate fees.
Context: in my country a new instant-payment system is previewed for November. Basically, the Central Bank will provide two endpoints: (1) a POST endpoint to which we post a single money transfer, and (2) a GET endpoint from which we fetch the results of transfers sent before, possibly completely out of order. Each GET answers with a single money-transfer result, and a response header indicates whether there is another result we must GET. It never says how many results are available, only whether the returned result is the last one or whether more remain for the next GET.
Top limitation: from the moment the user taps the Transfer button in the mobile app until the final result (success or failure) shows on the screen, at most 10 seconds may pass.
Strategy: I want a scheduler that triggers a GET to the Central Bank every second, or even more often. The scheduler will basically invoke a simple function which:
Calls the GET endpoint,
Pushes the result to Kafka or persists it in a database, and
If the response headers indicate more results are available, starts the same function again.
Issue: Since we are Spring users/followers, I thought my decision was between Spring Batch and org.springframework.scheduling.annotation.SchedulingConfigurer/TaskScheduler. I have used Spring Batch successfully for a while, but never with such a short trigger period (never a 1-second period). I stumbled on a discussion that made me wonder whether, for a very simple task with a very short period, I should consider Spring Cloud Data Flow or Spring Cloud Task instead of Spring Batch.
According to this answer, "... Spring Batch is ... designed for the building of complex compute problems ... You can orchestrate Spring Batch jobs with Spring Scheduler if you want". Based on that, it seems I shouldn't use Spring Batch, because my case isn't complex. The design challenge is more about a short trigger period and about triggering another run from the current one, rather than transformation, calculation, or an ETL process. Nevertheless, as far as I can see, Spring Batch with its tasklets is well designed for restarting, resuming, and retrying, and fits a scenario that never finishes, while org.springframework.scheduling seems to be only a way to trigger an event on a configured period. Well, this is my feeling based on personal use and study.
According to an answer to someone asking about orchestration of composed tasks, "... you can achieve your design goals using Spring Cloud Data Flow along with the Spring Cloud Task/Spring Batch...". In my case, I don't see composed tasks: the second trigger doesn't depend on the result of the previous one. It sounds more like "chained" tasks than "composed" ones. I have never used Spring Cloud Data Flow, but it seems a nice candidate for managing and viewing the triggered tasks via a console or dashboards. Nevertheless, I couldn't find anything describing limitations or rules of thumb for short-period and "chained" triggers.
So my straight question is: what is the currently recommended Spring component for such a short trigger period? Assuming Spring Cloud Data Flow is used as the manager/dashboard, which Spring component is recommended for triggering in such short-period scenarios? Spring Cloud Task seems designed for launching complex functions, Spring Batch seems to add much more than I need, and org.springframework.scheduling.* lacks integration with Spring Cloud Data Flow. As an analogy (not a comparison): the AWS documentation clearly says not to use CloudWatch for periods under one minute, and to instead have CloudWatch fire each minute and start another scheduler/cron that ticks each second. There might be a well-known rule of thumb for a simple task that needs to be triggered every second, or even sub-second, while taking advantage of the Spring family's approach, concerns, and experience.
This may be a stupid answer, but why do you need a scheduler here? Wouldn't a never-ending job achieve the goal?
You start a job; it does a GET request and pushes the result to Kafka.
If the GET response indicates there are more results, it immediately does another GET and pushes that result to Kafka.
If the GET response indicates there are no more results, it sleeps for 1 second and then does the GET request again.
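The never-ending job can be sketched language-agnostically; shown here in Ruby for brevity, where fetch_result and publish are hypothetical stand-ins for the Central Bank GET call and the Kafka producer:

```ruby
# Drain all currently available results: keep GETting while the
# response header says more are waiting. fetch_result returns
# [payload, more?]; publish pushes the payload to Kafka.
# Both callables are illustrative stand-ins.
def drain_results(fetch_result, publish)
  count = 0
  loop do
    payload, more = fetch_result.call
    publish.call(payload)
    count += 1
    break unless more
  end
  count
end

# The never-ending job is then just:
#   loop { drain_results(fetch, publish); sleep 1 }
```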
Assume I have a scenario where I am processing a background job in a worker. It receives a URL for a file (image, video, PDF, ...) hosted on a remote CDN, and the worker does its work as follows:
Some processing on the file content in-memory
Then calls a 3rd party API to retrieve a signed valid URL for uploading the content to that same 3rd party.
Uploads the content to the 3rd party API – the response contains a unique file ID
Sends a message to a user through the 3rd party API with the unique file ID received earlier
Now, the problem is between steps (3) and (4). The constraint is that the 3rd-party API needs a few seconds to process the file uploaded in step (3) before we can send a message containing its file ID (step 4).
One more assumption here is that I need to make sure all 4 steps execute in one go, that is, with no opportunity for partial failure.
Possible approaches
The most naive way is to sleep 5 between steps (3) and (4). It might hurt or hard-fail, since I'm not exactly sure how many seconds the 3rd-party API needs for processing, but in my trials a 5-second sleep seemed alright.
I could do an in-process exponential retry, 3 (or X) times, for step (3): catch the exception from the 3rd party and attempt step (4) once step (3) succeeds. This is what I have now, and it works alright.
I could perhaps use a job scheduler or a Ruby concurrency library to do step (4) in a delayed fashion. I don't like this path, as it feels like it favours complexity.
This piece of logic is built in Ruby. The question might not be very Ruby-specific and could apply in other languages, but I would like to hear what Ruby folks think.
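The in-process exponential retry described above can be sketched as a small helper. The helper name, attempt counts, and delays below are illustrative, not from any gem:

```ruby
# Retry the block with exponential back-off, for steps that may fail
# while the 3rd party is still processing the upload. Raises the
# last error once `attempts` tries are exhausted.
def with_backoff(attempts: 3, base_delay: 1)
  tries = 0
  begin
    yield
  rescue StandardError
    tries += 1
    raise if tries >= attempts
    sleep base_delay * (2**(tries - 1)) # 1s, 2s, 4s, ...
    retry
  end
end

# Usage: with_backoff { send_message(file_id) }
```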
The API docs you linked to say:
Attention! Some time needed by a server to process an uploaded file.
File should be sent to a chat after a short timeout (a couple of
seconds)
I would usually advise against something of this nature, but since your vendor specifically says "timeout", sleep is the best option.
I'd try a delayed task, as it allows the thread to keep working: the thread pool won't need to create new threads (which are quite expensive memory-wise), and your thread can continue doing useful work without a context switch (which is expensive CPU-wise).
As for purity of solution, asynchronous programming should not involve any blocking tasks (blocking is exactly what asynchronous programming fights against), so this is one more reason to use a delayed task.
If the application is not chasing the highest performance (is Ruby a performance-oriented language anyway?), sleep may really be the easiest, though not the most optimal, solution.
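A minimal delayed-task sketch using a plain Thread, so the calling thread is never blocked. In production you would more likely reach for concurrent-ruby's Concurrent::ScheduledTask; the helper name here is illustrative:

```ruby
# Schedule the block to run after delay_seconds without blocking
# the caller; returns the Thread so the caller can join if needed.
def run_after(delay_seconds, &block)
  Thread.new do
    sleep delay_seconds
    block.call
  end
end

# Usage: schedule step (4) without blocking the worker:
#   t = run_after(5) { send_message(file_id) }
#   t.join # only if you need to wait for completion
```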
I'm creating a new service, and for it I have database entries (Mongo) with a state field that I need to update based on the current time. For instance, if the start time was set to two hours from now, I need to change the state from CREATED -> STARTED in the database at that moment, and there can be multiple such states.
Approaches I've thought of:
Keep querying for database entries whose time is <= the current time and change their states accordingly. This causes extra reads for no reason, half the time they come back empty, and it will get complicated fast as more states are added.
I write a job scheduler (I am using Go, so that wouldn't be hard) and schedule all the jobs, but I might lose queued data in case of a panic/crash.
I use a product like Celery; I have found a Go implementation of it at https://github.com/gocelery/gocelery
Another task scheduler I've found is on Google Cloud (https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine), but I don't want to get locked into proprietary technologies.
I wanted to use some Pub/Sub service for this, but I couldn't find one that has delayed messages (if that's a thing). My problem is mainly not being able to find an actual name for this problem so that I can search for it properly; I've even tried searching the Microsoft docs. If someone can point me in the right direction, or tell me whether any of the approaches I've written is the one I should use, please let me know; that would be a great help!
UPDATE:
Found one more solution, by Netflix, for the same problem:
https://medium.com/netflix-techblog/distributed-delay-queues-based-on-dynomite-6b31eca37fbc
I think you are right in that the problem you are trying to solve is the job or task scheduling problem.
One approach that many companies use is the system you are proposing: jobs are inserted into a datastore with a time to execute at, and then that datastore is polled for jobs that are due to run. There are optimizations that prevent extra reads, such as polling the database at a regular interval and using exponential back-off. The advantage of this system is that it tolerates node failure; the disadvantage is the added complexity.
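The poll-with-back-off idea can be sketched like this (shown in Ruby for brevity; the same shape translates directly to Go; due_jobs is a hypothetical stand-in for the datastore query returning due jobs):

```ruby
# One polling step: run any due jobs, then return the next wait time
# and the number of jobs run. The wait doubles on empty reads (capped
# at max_wait) and resets to min_wait when work was found.
def poll_step(due_jobs, wait, min_wait: 0.5, max_wait: 8)
  jobs = due_jobs.call
  if jobs.empty?
    [[wait * 2, max_wait].min, 0]
  else
    jobs.each(&:call)
    [min_wait, jobs.size]
  end
end

# The scheduler loop is then:
#   wait = 0.5
#   loop { wait, _ran = poll_step(store, wait); sleep wait }
```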
Looking around, in addition to the one you linked (https://github.com/gocelery/gocelery) there are other implementations of this model (https://github.com/ajvb/kala or https://github.com/rakanalh/scheduler were ones I found after a quick search).
The other approach you described, scheduling jobs in-process, is very simple in Go because parked goroutines are extremely cheap, so spawning a goroutine for your work costs almost nothing. The downside is that if the process dies, the job is lost.
go func() {
    <-time.After(time.Until(expirationTime))
    // do work here.
}()
A final approach that I have seen but wouldn't recommend is the callback model (something like https://gitlab.com/andreynech/dsched). This is where your service calls to another service (over http, grpc, etc.) and schedules a callback for a specific time. The advantage is that if you have multiple services in different languages, they can use the same scheduler.
Overall, before you decide on a solution, I would consider some trade-offs:
How acceptable is job loss? If it's ok that some jobs are lost a small percentage of the time, maybe an in-process solution is acceptable.
How long will jobs be waiting? If it's longer than the shutdown period of your host, maybe a datastore based solution is better.
Will you need to distribute job load across multiple machines? If you need to distribute the load, sharding and scheduling are tricky things and you might want to consider using a more off-the-shelf solution.
Good luck! Hope that helps.
I need to write a servlet that will return to the user a csv that holds some statistics.
I know how to return just the file, but how can I do it while showing a progress bar of the file creation process?
I am having trouble understanding how to do something Ajax-y to show the progress of the file creation while creating the file at the same time. If I create a servlet that returns the completion percentage, how can it keep building the same file while returning a response every x seconds to the browser to show the progress?
There are two fundamentally different approaches. One is true asynchronous delivery using an approach such as Comet; you can see some descriptions in articles such as this. I would use this approach where the data you are delivering is naturally incremental, for example live measurements from instrumentation. Some Java app servers have nice integration between their JMS messaging systems and Comet delivery to the browser.
The other approach is that you have a polling mechanism. The JavaScript in the browser makes periodic calls to the server to get status (and maybe the next chunk of data). The advantage of this approach is that you are using a very standard programming model, less new stuff to learn. For many cases, such as "are there new answers for the Stack Overflow question I'm working on?" this is quite sufficient.
Your challenge may be to determine any useful progress information. How would you know how far through the generation of the CSV file you are?
If you are firing off a long-running request from a servlet, it's quite likely that you will effectively spin off a worker thread to do that work (maybe using JMS, maybe using async workers) and immediately return a response to the browser saying "Understood, I'm thinking". This ensures that you are not vulnerable to HTTP response timeouts. The problem then is how to determine the current progress: unless the worker doing the work has some way to communicate its partial progress, you have nothing useful to say. This tends to be very application-specific. Some tasks naturally have progress points (consider printing: we know how many pages there are and how many have been printed); others don't (consider determining whether a number is prime: the answer is yes or no, with perhaps no useful intermediate stages).
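The "worker communicates its partial progress" idea boils down to shared state that the status endpoint can read. A minimal sketch (in Ruby; the equivalent in a servlet would be a synchronized counter the progress servlet reads), with all names illustrative:

```ruby
# The CSV-generating worker calls step! once per row (or chunk),
# and the polled status endpoint just returns percent. The Mutex
# keeps the counter safe across the worker and handler threads.
class ProgressTracker
  def initialize(total)
    @total = total
    @done = 0
    @lock = Mutex.new
  end

  def step!
    @lock.synchronize { @done += 1 }
  end

  def percent
    @lock.synchronize { (100.0 * @done / @total).round }
  end
end
```

This only works when the task has a known total (rows to write); for tasks without natural progress points, the endpoint can only report "still running".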