'Fire & Forget' call from Sinatra - ruby

I am writing an endpoint using Sinatra where I will be receiving raw PDFs from the client and need to process each PDF for internal use. The PDF processing takes a while, and I do not necessarily want the client to wait until the processing is finished and risk a timeout (504). Instead, I would like to invoke another method that handles the PDF processing while I respond to the client with an appropriate status code. What is the best way to implement that using Sinatra?

There are a few parts to this, so let me break down the various steps that are going to happen:
The client uploads a PDF file: depending on the size of the PDF and the speed of their connection, this could take a while. While you're waiting for the upload, your web process is busy receiving the data and is unable to process requests from any other clients.
You then need to process the uploaded file, store it somewhere, and possibly manipulate it somehow. If you do all of that as part of the request, there is yet more time during which you're tied up dealing with this one request and unable to serve other clients.
The typical way to solve the latter of those problems, manipulating or processing an uploaded asset, is to use a background job queue such as Sidekiq (http://sidekiq.org). You store the required data somewhere, keep enough information to know what to work on (e.g., the database ID of a model that has stored the relevant information, a filename, etc.), and then pass all of that required information into a background job. You then have separate worker processes that pick up that work and complete it, but they aren't part of your web process so they aren't blocking other clients from receiving information.
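For illustration, here is a minimal sketch of that pattern with Sinatra and Sidekiq. The job class, upload path and ProcessPdf call are placeholders, and Sidekiq also needs a running Redis instance plus a separate worker process (e.g. bundle exec sidekiq -r ./app.rb):
require "sinatra"
require "sidekiq"
require "fileutils"

# Runs in the separate Sidekiq worker process, not in the web process.
class PdfProcessingJob
  include Sidekiq::Job
  def perform(path)
    ProcessPdf.call(path)   # placeholder for your actual processing code
  end
end

post "/pdfs" do
  # Save the upload somewhere both the web and worker processes can reach
  # (a local path only works when both run on the same host; otherwise use
  # shared storage such as S3), then enqueue just the reference and return.
  FileUtils.mkdir_p("/tmp/uploads")
  path = File.join("/tmp/uploads", params[:file][:filename])
  File.binwrite(path, params[:file][:tempfile].read)
  PdfProcessingJob.perform_async(path)
  status 202
  "queued"
end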
This still leaves us with the problem of handling large uploads; fortunately, that has a solution too. Take advantage of Amazon's web capacity and have the clients upload the file directly to S3; when it's complete, they can post just the filename to you, and you can then queue that up for your worker from the previous step and have it all happen in the background. This blog post has a good explanation of how to wire it together using Paperclip: http://blog.littleblimp.com/post/53942611764/direct-uploads-to-s3-with-rails-paperclip-and
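The linked post is about Rails and Paperclip; as a rough sketch of the same idea using the aws-sdk-s3 gem directly (the region, bucket name and key here are placeholders), you can generate a presigned PUT URL and hand it to the client:
require "aws-sdk-s3"
require "securerandom"

# Generate a short-lived URL the client can upload the PDF to directly,
# so the upload itself never occupies your web process.
s3 = Aws::S3::Resource.new(region: "us-east-1")                   # assumed region
object = s3.bucket("my-upload-bucket").object("uploads/#{SecureRandom.uuid}.pdf")
upload_url = object.presigned_url(:put, expires_in: 900)
# The client PUTs the file to upload_url, then POSTs just the object key back
# to the app, which enqueues the background job from the previous step.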

Related

Transfer file takes too much time

I have an empty API endpoint in Laravel running behind Nginx and Apache. The problem is that the API takes a long time when I send it files, but responds quickly when I send blank data.
Case 1: I called the API with a blank request; the response time was only 228 ms.
Case 2: I called the API with a 5 MB file; the file transfer takes too long, so the response time is 15.58 s.
So how can I reduce the transfer time in Apache or Nginx? Is there any server configuration or anything else that I have missed?
When I searched on Google, the advice was to keep all versions up to date and use PHP-FPM, but when I configured PHP-FPM and HTTP/2 on my server I noticed that it took even more time than above. All server software is already up to date.
This has more to do with the fact that one request has nothing to process, so the response will be prompt, whereas the other request requires actual processing, so the response will take as long as the server needs to process the content of your request.
Depending on the size of the file and your server configuration, you might hit a limit which will result in a timeout response.
A solution to the issue you're encountering is to chunk your file upload. There are a few packages available so that you don't have to write that functionality yourself; an example of such a package is the Pionl Laravel Chunk Upload package.
An alternative solution would be to offload the file processing to a queue.
Update
When I searched on Google about chunking, it seems it's not the best solution for small files of around 5-10 MB; it's best suited to big files of around 50-100 MB. So is there any server-side chunking configuration, or anything else I can do, or can I use this library to chunk small files?
According to the library's documentation this is a web library. What should I use if my API is called from Android and iOS apps?
True, chunking might not be the best solution for smaller files, but it is worthwhile knowing about. My recommendation would be to use some client-side logic to determine whether sending the file in chunks is required. On the server, use a queue to process the file upload in the background, allowing the request to complete without waiting on the processing, so a response can be sent back to the client (iOS/Android app) in a timely manner.

How to handle long time processing request

I have a spring boot rest API deployed on AWS Elastic Beanstalk and I am trying to upload pictures through it.
This is what I did: upload a zip file through a file input in the browser, get the zip file on the server, go through all the files, and upload each one to AWS S3.
It works fine, but I ran into a problem: when I try to upload lots of pictures, I get an HTTP error (504 Gateway Timeout). I found out this is because the server takes too long to respond, and I am trying to figure out how to set a higher timeout for the requests (I haven't found it yet).
But in the meantime I am asking myself if this is the best solution.
Wouldn't it be better to end the request directly after receiving the zip file, make the uploads to S3 and after that notify the user that the uploads are done ? Is there even a way to do that ? Is there a good practice for this ? (operation that takes lots of time to process).
I know how to do the process asynchronously but I would really like to know how to notify the user after it completes.
Wouldn't it be better to end the request directly after receiving the zip file, make the uploads to S3 and after that notify the user that the uploads are done ?
Yes, asynchronous processing of the uploaded images in the zip file would be better.
Is there even a way to do that ? Is there a good practice for this ? (operation that takes lots of time to process).
Yes, there is a better way. To keep everything within EB, you could look at an Elastic Beanstalk worker environment. The worker environment is ideal for processing your images.
In this solution, your web environment would store the uploaded images in S3 and submit their names, along with other identifying information, to an SQS queue. The queue is the entry point for the worker environment.
Your workers would process the images from the queue independently of the web environment. In the meantime, the web environment would have to check for the results and notify your users once the images have been processed.
EB also supports linking different environments, so you could establish a link between the web and worker environments for easier integration.
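The question is about Spring Boot, but the queue submission itself is a single SDK call. Here is a hedged sketch in Ruby (used to stay consistent with the rest of this page) with the aws-sdk-sqs gem; the region, queue URL, bucket and key are placeholders, and the AWS SDK for Java exposes the equivalent SendMessage operation:
require "aws-sdk-sqs"
require "json"

# Web tier: after storing an uploaded image in S3, describe it in a message
# on the worker environment's queue and return to the client right away.
sqs = Aws::SQS::Client.new(region: "us-east-1")                               # assumed region
sqs.send_message(
  queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/image-jobs",   # placeholder
  message_body: { bucket: "my-bucket", key: "uploads/photo-1.jpg" }.to_json
)
The worker environment's daemon then delivers each queue message to your worker application as an HTTP POST, so the worker code only has to handle ordinary requests.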

How to pass a video file to worker function in Heroku (Flask)

Context
I have a web dyno which receives a video file with the intention of applying a computer vision algorithm to return an analysis. The algorithm takes about 10 seconds to run. My current method is to process it with the web dyno. The whole thing is pretty fast. The user doesn't have to wait any more than a minute.
What's not working
But of course, tying up the web dyno is a bad idea. And some users have gotten timeouts... So I tried implementing Redis to pass the job to a worker dyno.
@application.route('/video', methods=['POST'])
@cross_origin()
def video():
    video_file = request.files['videoData']
    job = q.enqueue_call(
        func=run_analysis, args=(video_file,), result_ttl=5000
    )
    return json.dumps(job.id)
But this gives me an error, and I understand why:
TypeError: cannot serialize '_io.BufferedRandom' object
In my dev env, I can save the video to the filesystem and pass the file path only, but this doesn't work in production as the web dyno's file system is ephemeral and the worker won't see the file.
So I'm looking for the fastest way to get the video file across. Speed is of the essence here, as the user is waiting for their video to be processed.
What I've tried
I've tried S3 (uploading directly from the client and downloading in the worker), but it made the whole process way slower. First of all, it takes longer to upload to S3 than to my Heroku endpoint. Second, I then have to download it to the worker, which takes a while as well. I don't really need to keep the file, so it's a very inefficient workaround.
Heroku dynos are completely isolated containers, which is why they cannot share a filesystem the way you want. But if you host them somewhere else, such as DigitalOcean or Amazon, you'll be able to access files stored by Flask from the other workers instantly (or almost instantly; don't forget to make a copy of the temp file, since Flask or the WSGI server should delete it after the response is sent).
Another option is to find the fastest way of "transporting" the video data (not always as a file) to a worker. You can do it using:
Queue - put the whole file onto a queue. This method is not recommended, but it is still OK if the video files are really small.
In-memory database - save the file to an in-memory database (e.g. Redis). These have a lot of mechanisms to quickly transport data between servers or processes (con: expensive on Heroku).
Database - save the file to a general-purpose database (e.g. PostgreSQL), which will do the same as Redis but can handle bigger data more cheaply, though a bit more slowly.
WebSockets or even a Unix socket - you can have one worker that Flask connects to and sends the file to before returning its HTTP response. That "listener" actually starts the task. It can either save the video to a file and provide the path to the next worker (but then it must be on the same dyno as the rest of the workers) or pass the data on directly using args, "fork", threading, subprocessing, etc.
You can start with Flask-SocketIO.
But remember that you need to configure a server-to-server connection between the web app and the worker, which should listen for it in a loop or in a separate thread. This is not JavaScript in the browser, although that is potentially also an option: start the task and upload the file directly to the worker.
P.S. There are no 30-second timeouts on Heroku for WebSockets :)

How to allow sinatra poll for data smartly

I want to design an application where the back end is constantly polling different sensors while the front end (Sinatra) allows this data to be viewed either via a JSON API or by simply displaying the results in HTML.
What considerations should I take into account when developing such an application, and how should I structure it for best scaling and ease of maintenance?
My first thought is to simply let Sinatra poll the sensors every time it receives a request to the proper endpoints, but this seems like it could bog down quite fast, especially since some sensors only update themselves every couple of seconds.
My second thought is to have a background process (or thread) poll the sensors and store the values for Sinatra. When a request is received, Sinatra can then simply poll the background process for a cached value (or pull it from the threaded code) and present it to the client.
I like the second thought more, but I am not sure how I would develop the "background application" so that Sinatra could poll it for data to present to the client. The other option would be for Sinatra to thread the sensor polling code so that it can simply grab values from it inside the same process rather than requesting them from another process.
Do note that this application will also be responsible for automation of different relays and such based on the sensors, and Sinatra is only responsible for relaying the status of the sensors to the user. I think separating the backend (automation + sensor information) into a background process/daemon, apart from the frontend (Sinatra), would be ideal, but I am not sure how I would fetch the data for Sinatra.
Anyone have any input on how I could structure this? If possible I would also appreciate a sample application that simply demonstrates the idea, which I could adopt and modify.
Thanks
Edit:
After a bit more research I have discovered DRb (distributed Ruby, http://ruby-doc.org/stdlib-1.9.3/libdoc/drb/rdoc/DRb.html), which allows you to make remote calls on objects over the network. This may be a suitable solution to this problem, as the daemon can automate the relays, read the sensors and store the values in class objects, and then expose those objects over DRb so that Sinatra can call the getters on the remote objects to obtain up-to-date data from the daemon. This is what I initially wanted to attempt to do.
What do you guys think? Is this advisable for such an application?
I have decided to go with Sinatra, DRb, and Daemons to meet the requirements I have stated above.
The web front end will run in its own process and only serve up statistical information via DRb interactions with the backend. This will allow quick response times for the clients and allow me to separate front end code from backend code.
The backend will run in its own process and constantly poll the sensors for updates, storing them as class objects with getters so that Sinatra can fetch the information over DRb when required. It will also use the gathered information for automation that is project specific.
Finally, the backend and frontend will be wrapped with a Daemons wrapper so that the project has the capability of starting, restarting, stopping, reporting run status, and automatically restarting the daemons if they crash or quit for whatever reason.
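Here is a minimal sketch of that split; the DRb URI/port and the read_temperature_sensor call are placeholders:
require "drb/drb"

# Backend daemon process: polls the sensors on its own schedule and exposes
# the cached readings over DRb for the Sinatra process to read on demand.
class SensorStore
  def initialize
    @readings = {}
    @mutex = Mutex.new
  end

  def update(name, value)
    @mutex.synchronize { @readings[name] = value }
  end

  def reading(name)
    @mutex.synchronize { @readings[name] }
  end
end

store = SensorStore.new
Thread.new do
  loop do
    store.update(:temperature, read_temperature_sensor)   # placeholder sensor call
    sleep 2
  end
end
DRb.start_service("druby://localhost:8787", store)        # assumed URI/port
DRb.thread.join

# In the Sinatra process, a DRb proxy behaves like a local object, so a route
# can simply call:
#   sensors = DRbObject.new_with_uri("druby://localhost:8787")
#   sensors.reading(:temperature)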
Source information:
http://phrogz.net/drb-server-for-long-running-web-processes
http://ruby-doc.org/stdlib-1.9.3/libdoc/drb/rdoc/DRb.html
http://www.sinatrarb.com/
https://github.com/thuehlinger/daemons

async execution of tasks for a web application

A web application I am developing needs to perform tasks that take too long to be executed during the HTTP request/response cycle. Typically, the user will make the request, and the server will take this request and, among other things, run some scripts to generate data (for example, render images with POV-Ray).
Of course, these tasks can take a long time, so the server should not hang waiting for the scripts to complete execution before sending the response to the client. I therefore need to execute the scripts asynchronously, give the client a "the resource is here, but not ready" response, and probably tell it an AJAX endpoint to poll so it can retrieve and display the resource when ready.
Now, my question is not about the design (although I would very much enjoy any hints in this regard as well). My question is: does a system to solve this issue already exist, so I do not reinvent the square wheel? If I had to, I would use a process queue manager to submit the task and put up an HTTP endpoint to report the status, something like "pending", "aborted", "completed", to the AJAX client, but if something similar already exists specifically for this task, I would much prefer to use it.
I am working in python+django.
Edit: Please note that the main issue here is not how the server and the client must negotiate and exchange information about the status of the task.
The issue is how the server handles the submission and enqueue of very long tasks. In other words, I need a better system than having my server submit scripts on LSF. Not that it would not work, but I think it's a bit too much...
Edit 2: I added a bounty to see if I can get some other answer. I checked pyprocessing, but I cannot perform submission of a job and reconnect to the queue at a later stage.
You should avoid re-inventing the wheel here.
Check out gearman. It has libraries in a lot of languages (including Python) and is fairly popular. I'm not sure if anyone has any out-of-the-box ways to easily connect Django up to gearman and AJAX calls, but it shouldn't be too complicated to do that part yourself.
The basic idea is that you run the gearman job server (or multiple job servers), have your web request queue up a job (like 'resize_photo') with some arguments (like '{photo_id: 1234}'). You queue this as a background task. You get a handle back. Your ajax request is then going to poll on that handle value until it's marked as complete.
Then you have a worker (or probably many) that is a separate python process connect up to this job server and registers itself for 'resize_photo' jobs, does the work and then marks it as complete.
I also found this blog post that does a pretty good job of summarizing its usage.
You can try two approaches:
Call the web server every n seconds and pass a job id; the server processes the request and returns some information about the current execution of that task.
Implement a long-running page that sends data every n seconds; for the client, that HTTP request will "always" be "loading", and it needs to collect new information every time a new piece of data is received.
Regarding the second option, you can learn more by reading about Comet; using ASP.NET, you can do something similar by implementing the System.Web.IHttpAsyncHandler interface.
I don't know of a system that does it, but it would be fairly easy to implement one's own system:
create a database table with jobid, jobparameters, jobresult
jobresult is a string that will hold a pickle of the result
jobparameters is a pickled list of input arguments
when the server starts working on a job, it creates a new row in the table and spawns a new process to handle it, passing that process the jobid
the task handler process updates the jobresult in the table when it has finished
a webpage (xmlrpc or whatever you are using) contains a method 'getResult(jobid)' that will check the table for a jobresult
if it finds a result, it returns the result, and deletes the row from the table
otherwise it returns an empty list, or None, or your preferred return value to signal that the job is not finished yet
There are a few edge cases to take care of, so an existing framework would clearly be better, as you say.
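As a hedged sketch of that jobs-table approach, written in Ruby to stay consistent with the rest of this page (SQLite and Marshal stand in for the database and pickle, and handle_job is a placeholder worker entry point):
require "sqlite3"

DB = SQLite3::Database.new("jobs.db")
DB.execute <<~SQL
  CREATE TABLE IF NOT EXISTS jobs (
    jobid INTEGER PRIMARY KEY AUTOINCREMENT,
    jobparameters BLOB,
    jobresult BLOB
  )
SQL

# Enqueue: store the marshalled arguments, then spawn a process for the work.
def enqueue(args)
  DB.execute("INSERT INTO jobs (jobparameters) VALUES (?)",
             [SQLite3::Blob.new(Marshal.dump(args))])
  jobid = DB.last_insert_row_id
  Process.detach(fork { handle_job(jobid) })   # placeholder worker entry point
  jobid
end

# Poll: return the unmarshalled result once the worker has written one, else nil.
def get_result(jobid)
  raw = DB.get_first_value("SELECT jobresult FROM jobs WHERE jobid = ?", [jobid])
  return nil if raw.nil?
  DB.execute("DELETE FROM jobs WHERE jobid = ?", [jobid])
  Marshal.load(raw)
end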
First, you need a separate "worker" service, which is started separately at power-up and communicates with the HTTP request handlers via some local IPC such as a UNIX socket (fast) or a database (simple).
While handling a request, the CGI process asks the worker for state or other data and replies to the client.
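As a hedged illustration of the UNIX-socket variant (Ruby is used to match the rest of this page; the socket path and process_job are placeholders):
require "socket"

# Worker service, started at boot: accepts job descriptions on a local socket,
# acknowledges immediately, then does the slow work (one job at a time in this
# sketch; spawn a thread or process per job for real use).
File.delete("/tmp/worker.sock") if File.exist?("/tmp/worker.sock")
server = UNIXServer.new("/tmp/worker.sock")        # assumed socket path
loop do
  conn = server.accept
  job = conn.gets&.chomp
  conn.puts("accepted")
  conn.close
  process_job(job)                                 # placeholder long-running handler
end

# Request-handler side: hand the job off and reply to the client right away.
# sock = UNIXSocket.new("/tmp/worker.sock")
# sock.puts("render:42")
# ack = sock.gets&.chomp
# sock.close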
You can signal that a resource is being "worked on" by replying with a 202 HTTP status code: the client side will have to retry later to get the completed resource. Depending on the case, you might have to issue a "request id" in order to match a request with a response.
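A rough sketch of that 202-plus-request-id flow, shown with Sinatra to match the main question on this page; the in-memory JOBS hash and do_the_work are placeholders:
require "sinatra"
require "securerandom"
require "json"

JOBS = {}   # request id => :pending or the finished result; use a real store in production

post "/resource" do
  request_id = SecureRandom.uuid
  JOBS[request_id] = :pending
  Thread.new { JOBS[request_id] = do_the_work(params) }   # placeholder worker call
  status 202                                              # accepted, not finished yet
  headers "Location" => "/resource/#{request_id}"
  { request_id: request_id }.to_json
end

get "/resource/:id" do
  case JOBS[params[:id]]
  when nil      then halt 404, "unknown request id"
  when :pending then halt 202, "still working"
  else JOBS[params[:id]].to_s
  end
end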
Alternatively, you could have a look at existing COMET libraries which might fill your needs more "out of the box". I am not sure if there are any that match your current Django design though.
Probably not a great answer for the Python/Django solution you are working with, but we use Microsoft Message Queuing (MSMQ) for things just like this. It basically runs like this:
Website updates a database row somewhere with a "Processing" status
Website sends a message to the MSMQ (this is a non blocking call so it returns control back to the website right away)
Windows service (could be any program really) is "watching" the MSMQ and gets the message
Windows service updates the database row with a "Finished" status.
That's the gist of it anyway. It's been quite reliable for us and really straightforward to scale and manage.
-al
Another good option for Python and Django is Celery.
And if you think that Celery is too heavy for your needs then you might want to look at simple distributed taskqueue.
