Heroku - letting users download files from tmp - ruby

Let me start by saying I understand that Heroku's dynos are temporary and unreliable. I only need them to persist for at most 5 minutes, and from what I've read that generally won't be an issue.
I am making a tool that gathers files from websites and zips them up for download. My tool does everything and creates the zip - I'm just stuck at the last part: providing the user with a way to download the file. I've tried direct links to the file location and HTTP GET requests, and Heroku didn't like either. I really don't want to have to set up AWS just to host a file that only needs to persist for a couple of minutes. Is there another way to download files stored in /tmp?

As far as I know, you have absolutely no guarantee that a request goes to the same dyno as the previous request.
The best way to do this would probably be to either host the file somewhere else, like S3, or to send it immediately in the same request.
If you're generating the file in a background worker, then it most definitely won't work. Every process runs on a separate dyno.
See How Heroku Works for more information on their backend.
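For what it's worth, here is a minimal sketch of the "send it in the same request" approach. It's written in Python/Flask purely to illustrate the flow (the Rails/Sinatra equivalent is send_file), and build_zip is a stand-in for whatever your tool already does:
import os
import tempfile
from flask import Flask, send_file

app = Flask(__name__)

def build_zip(zip_path):
    # Stand-in for the existing logic that gathers the files and writes the zip.
    ...

@app.route('/download')
def download():
    # Write the archive to the dyno's ephemeral /tmp.
    fd, zip_path = tempfile.mkstemp(suffix='.zip', dir='/tmp')
    os.close(fd)
    build_zip(zip_path)
    # Stream it back before the request finishes, so it never matters which
    # dyno did the work or when /tmp gets discarded.
    return send_file(zip_path, as_attachment=True,
                     download_name='files.zip')  # download_name needs Flask >= 2.0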

Related

How to handle long time processing request

I have a Spring Boot REST API deployed on AWS Elastic Beanstalk and I am trying to upload pictures through it.
This is what I did: upload a zip file through a file input in the browser, get the zip file on the server, go through all the files, and upload each one to AWS S3.
It works fine but I ran into a problem: when I try to upload lots of pictures, I get an HTTP error (504 Gateway Timeout). I found out this is because the server takes too much time to respond, and I am trying to figure out how to set a higher timeout for the requests (I haven't found it yet).
But in the meantime I am asking myself if it is the best solution.
Wouldn't it be better to end the request directly after receiving the zip file, make the uploads to S3 and after that notify the user that the uploads are done? Is there even a way to do that? Is there a good practice for this (an operation that takes a lot of time to process)?
I know how to do the process asynchronously but I would really like to know how to notify the user after it completes.
Wouldn't it be better to end the request directly after receiving the zip file, make the uploads to S3 and after that notify the user that the uploads are done?
Yes, asynchronous processing of the uploaded images in the zip file would be better.
Is there even a way to do that? Is there a good practice for this (an operation that takes a lot of time to process)?
Yes, there is a better way. To keep everything within EB, you could look at an Elastic Beanstalk worker environment. A worker environment is ideal for processing your images.
In this solution, your web-based environment would store the uploaded images in S3 and submit their names, along with other identifying information, to an SQS queue. The queue is the entry point for the worker environment.
Your workers would process the images from the queue independently of the web environment. In the meantime, the web environment would have to check for the results and notify your users once the images get processed.
EB also supports linking different environments, so you could establish a link between the web and worker environments for easier integration.
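A rough sketch of that hand-off, shown in Python with boto3 only to keep it short (a Spring Boot app would do the same through the AWS SDK for Java); the bucket name and queue URL are placeholders:
import json
import boto3

s3 = boto3.client('s3')
sqs = boto3.client('sqs')
BUCKET = 'my-upload-bucket'  # placeholder
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/image-jobs'  # placeholder

def handle_uploaded_image(file_obj, key, user_id):
    # Web environment: store the raw image in S3 first...
    s3.upload_fileobj(file_obj, BUCKET, key)
    # ...then submit a small message describing the work to SQS.
    # The worker environment polls this queue and does the slow processing.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({'bucket': BUCKET, 'key': key, 'userId': user_id}),
    )
In a worker environment, the SQS daemon delivers each message to your worker app as an HTTP POST, and the worker can update a job-status record that the web environment checks in order to notify the user.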

Download or Backup a generated file from PCF automatically

We have a microservices app running on PCF.
Some of the microservices generate log files in their log folders.
Is there a way to automate downloading these log files and saving them to a shared folder or remote storage (Google Drive and the like)?
Your suggestions and advice are highly appreciated.
Thank you.
In a perfect world, you would not write things to the local filesystem that you need to keep. It's OK to write cached files or artifacts you can simply recreate, but you shouldn't put anything important there.
https://docs.cloudfoundry.org/devguide/deploy-apps/prepare-to-deploy.html#filesystem
The local file system exposed to your app is ephemeral and it's not safe to store important things there even for a short period of time. You could certainly try to set up a process that runs periodically and sends log files out of your container to somewhere else. However, when your app crashes you're going to lose log messages, probably the important ones that say why your app crashed, because your sync process isn't going to have time to run before the container is cleaned up.
What you want to do instead is to configure your applications to write their logs to STDOUT or STDERR.
https://docs.cloudfoundry.org/devguide/deploy-apps/streaming-logs.html#writing
Anything written to STDOUT/STDERR is automatically captured by the platform and sent out the log stream for your app. You can then send your log stream to a variety of durable locations.
https://docs.cloudfoundry.org/devguide/services/log-management.html
Most applications can easily be configured to write to STDOUT/STDERR. You've tagged spring-boot on this post, so I assume your apps are running Spring Boot. By default, Spring Boot should log to STDOUT/STDERR so there shouldn't be anything you need to do.
What might be happening, though, is that your app developers have specifically configured the app to send logs to a file. Look in the src/main/resources/application.properties or application.yml file of your application for the properties logging.file.path or logging.file.name. If present, comment out or remove them. That should make your logs go to STDOUT/STDERR.
https://docs.spring.io/spring-boot/docs/current/reference/html/spring-boot-features.html#boot-features-logging-file-output
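For example, in application.properties the file-logging settings to disable would look something like this (property names as documented by Spring Boot; the values are just examples):
# Comment these out (or delete them) so logs go only to STDOUT/STDERR
# logging.file.name=logs/app.log
# logging.file.path=/home/vcap/app/logs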

How to pass a video file to worker function in Heroku (Flask)

Context
I have a web dyno which receives a video file with the intention of applying a computer vision algorithm to return an analysis. The algorithm takes about 10 seconds to run. My current method is to process it with the web dyno. The whole thing is pretty fast. The user doesn't have to wait any more than a minute.
What's not working
But of course, tying up the web dyno is a bad idea. And some users have gotten timeouts... So I tried implementing redis to pass the job to a worker dyno.
@application.route('/video', methods=['POST'])
@cross_origin()
def video():
    video_file = request.files['videoData']
    job = q.enqueue_call(
        func=run_analysis, args=(video_file,), result_ttl=5000
    )
    return json.dumps(job.id)
But this gives me an error:
TypeError: cannot serialize '_io.BufferedRandom' object
and I understand why.
In my dev env, I can save the video to the filesystem and pass the file path only, but this doesn't work in production as the web dyno's file system is ephemeral and the worker won't see the file.
So I'm looking for the fastest way to get the video file across. Speed is of the essence here, as the user is waiting for their video to be processed.
What I've tried
I've tried S3 (uploading directly from the client and downloading in the worker) but it made the whole process way slower. First of all, it takes longer to upload to S3 than to my Heroku endpoint. Second, I then have to download it to the worker, which takes a while as well. I don't really need to keep the file, so it's a very inefficient workaround.
Heroku dynos are completely isolated containers, which is why they cannot share a filesystem the way you want. But if you host the app somewhere else, such as DigitalOcean or Amazon, you will be able to access files stored by Flask from the other workers almost instantly (just don't forget to make a copy of the temp file, as Flask or the WSGI server may delete it after the response is sent).
Another option is to find the fastest way of "transporting" the video data (not necessarily as a file) to a worker. You can do that using:
queue - put the whole file on the queue. Not recommended, but still OK if the video files are really small.
in-memory database - save the file to an in-memory database (e.g. Redis), which has plenty of mechanisms for quickly moving data between servers or processes (con: memory is expensive on Heroku); see the sketch at the end of this answer.
database - save the file to a general-purpose database (e.g. PostgreSQL), which does the same job as Redis but handles bigger data more cheaply, though a bit more slowly.
WebSockets or even a Unix socket - you can have one worker that Flask connects to and sends the file to before returning its HTTP response. That "listener" actually starts the task. It can either save the video to a file and provide the path to the next worker (but then it must be on the same dyno as the rest of the workers) or provide the data directly using args, fork, threading, subprocessing, etc.
You can start with Flask-SocketIO.
But remember that you need to configure a server-to-server connection between the web app and a worker that listens for it in a loop or a separate thread, not JavaScript in the browser (though that is potentially also an option: start the task and upload the file directly to the worker).
P.S. There is no 30-second timeout on Heroku for WebSockets :)
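To make the Redis option above concrete: one way around the serialization error is to stash the raw bytes under a key and enqueue only the key. A rough sketch that plugs into the question's existing setup (application, cross_origin, q and run_analysis come from your code), assumes a REDIS_URL config var, and uses a hypothetical run_analysis_from_redis wrapper that expects run_analysis to accept bytes:
import os
import json
import uuid
import redis
from flask import request

r = redis.Redis.from_url(os.environ['REDIS_URL'])

@application.route('/video', methods=['POST'])
@cross_origin()
def video():
    data = request.files['videoData'].read()  # raw bytes of the upload
    key = 'video:' + uuid.uuid4().hex
    r.set(key, data, ex=3600)  # expire after an hour so nothing lingers
    # Only the short string key is serialized onto the queue.
    job = q.enqueue_call(func=run_analysis_from_redis, args=(key,), result_ttl=5000)
    return json.dumps(job.id)

def run_analysis_from_redis(key):
    # Runs on the worker dyno: pull the bytes back out, clean up, analyze.
    data = r.get(key)
    r.delete(key)
    return run_analysis(data)  # assumes run_analysis can work on bytes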

Best practice when using a Rails app to overwrite a file that the app relies on

I have a Rails app that reads from a .yml file each time it performs a search. (This is a full-text search app.) The .yml file tells the app which URL it should be making search requests to, because different versions of the search index reside on different servers, and I occasionally switch between indexes.
I have an admin section of the app that allows me to rewrite the aforementioned .yml file so that I can add new search urls or remove unneeded ones. While I could manually edit the file on the server, I would prefer to be able to also edit it in my site admin section so that when I don't have access to the server, I can still make any necessary changes.
What is the best practice for making edits to a file that is actually used by my app? (I guess this could also apply to, say, an app that had the ability to rewrite one of its own helper files, post-deployment.)
Is it a problem that I could be in the process of rewriting this file while another user connecting to my site wants to perform a search? Could I make their search fail if I'm in the middle of a write operation? Should I initially write my new .yml file to a temp file and only later replace the original .yml file? I know that a write operation is pretty fast, but I just wanted to see what others thought.
UPDATE: Thanks for the replies everyone! Although I see that I'd be better off using some sort of caching rather than reading the file on each request, it helped to find out what the best way to actually do the file rewrite is, given that I'm specifically looking to re-read it each time in this specific case.
If you must use a file for this then the safe process looks like this:
Write the new content to a temporary file of some sort.
Use File.rename to atomically replace the old file with the new one.
If you don't use separate files, you can easily end up with a half-written broken file when the inevitable problems occur. The File.rename class method is just a wrapper for the rename(2) system call and that's guaranteed to be atomic (i.e. it either fully succeeds or fully fails, it won't leave you in an inconsistent in-between state).
If you want to replace /some/path/f.yml then you'd do something like this:
begin
# Write your new stuff to /some/path/f.yml.tmp here
File.rename('/some/path/f.yml.tmp', '/some/path/f.yml')
rescue SystemCallError => e
# Log an error, complain loudly, fall over and cry, ...
end
As others have said, a file really isn't the best way to deal with this and if you have multiple servers, using a file will fail when the servers become out of sync. You'd be better off using a database that several servers can access, then you could:
Cache the value in each web server process.
Blindly refresh it every 10 minutes (or whatever works).
Refresh the cached value if connecting to the remote server fails (with extra error checking to avoid refresh/connect/fail loops).
Firstly, let me say that reading that file on every request is a performance killer. Don't do it! If you really, really need to keep that data in a .yml file, then you need to cache it and reload only after it changes (based on the file's timestamp).
But don't check the timestamp on every request - that's almost as bad. Check it on a request only if it's been n minutes since the last check, probably in a before_filter somewhere. And if you're running in threaded mode (most people aren't), be careful to use a Mutex or something.
If you really want to do this via overwriting files, use the filesystem's locking features to block other threads from accessing your configuration file while it's being written. Maybe check out something like this.
I'd strongly recommend not using files for configuration that needs to be changed without re-deploying the app though. First, you're now requiring that a file be read every time someone does a search. Second, for security reasons it's generally a bad idea to allow your web application write access to its own code. I would store these search index URLs in the database or a memcached key.
edit: As @bioneuralnet points out, it's important to decide whether you need real-time configuration updates or just eventual syncing.

Best approach to collecting log files from remote machines?

I have over 500 machines distributed across a WAN covering three continents. Periodically, I need to collect text files which are on the local hard disk on each blade. Each server is running Windows Server 2003 and the files are mounted on a share which can be accessed remotely as \\server\Logs. Each machine holds many files which can be several MB each, and the size can be reduced by zipping.
Thus far I have tried using PowerShell scripts and a simple Java application to do the copying. Both approaches take several days to collect the 500 GB or so of files. Is there a better solution which would be faster and more efficient?
I guess it depends what you do with them ... if you are going to parse them for metrics data into a database, it would be faster to have that parsing utility installed on each of those machines to parse and load into your central database at the same time.
Even if all you are doing is compressing and copying to a central location, set up those commands in a .cmd file and schedule it to run on each of the servers automatically. Then you will have distributed the work amongst all those servers, rather than forcing your one local system to do all the work. :-)
The first improvement that comes to mind is to not ship entire log files, but only the records from after the last shipment. This of course is assuming that the files are being accumulated over time and are not entirely new each time.
You could implement this in various ways: if the files have date/time stamps you can rely on, running them through a filter that removes the older records from consideration and dumps the remainder would be sufficient. If there is no such discriminator available, I would keep track of the last byte/line sent and advance to that location prior to shipping.
Either way, the goal is to only ship new content. In our own system, logs are shipped via a service that replicates the logs as they are written. That required a small service hooked into how the log files are written, but it reduced the latency in capturing logs and cut bandwidth use immensely.
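A small sketch of the keep-track-of-the-last-byte idea in Python (the offset directory is made up for illustration):
import os

OFFSET_DIR = r'C:\logship\offsets'  # made-up location for remembering progress

def read_new_content(log_path):
    # Load the byte offset reached on the previous run (0 if this is the first run).
    offset_file = os.path.join(OFFSET_DIR, os.path.basename(log_path) + '.offset')
    try:
        with open(offset_file) as f:
            offset = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        offset = 0
    # If the log was rotated or truncated since last time, start over.
    if offset > os.path.getsize(log_path):
        offset = 0
    # Read only what was appended since the last shipment.
    with open(log_path, 'rb') as f:
        f.seek(offset)
        new_data = f.read()
        new_offset = f.tell()
    # Remember where we got to for next time.
    os.makedirs(OFFSET_DIR, exist_ok=True)
    with open(offset_file, 'w') as f:
        f.write(str(new_offset))
    return new_data  # compress and ship only this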
Each server should probably:
manage its own log files (start new logs before uploading and delete sent logs after uploading)
name the files (or prepend metadata) so the server knows which client sent them and what period they cover
compress log files before shipping (compress + FTP + uncompress is often faster than FTP alone)
push log files to a central location (FTP is faster than SMB, the windows FTP command can be automated with "-s:scriptfile")
notify you when it cannot push its log for any reason
do all the above on a staggered schedule (to avoid overloading the central server)
Perhaps use the server's last IP octet multiplied by a constant to offset in minutes from midnight?
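That staggering rule is easy to compute; a sketch, with an arbitrary constant:
import socket

def upload_offset_minutes(minutes_per_octet=5):
    # Offset each server's upload from midnight by its last IP octet times a
    # constant, so hundreds of machines don't all push at the same moment.
    ip = socket.gethostbyname(socket.gethostname())
    last_octet = int(ip.split('.')[-1])
    return last_octet * minutes_per_octet

# e.g. a server at 10.0.3.47 with the default constant uploads at 03:55.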
The central server should probably:
accept log files sent and queue them for processing
gracefully handle receiving the same log file twice (should it ignore or reprocess?)
uncompress and process the log files as necessary
delete/archive processed log files according to your retention policy
notify you when a server has not pushed its logs lately
We have a similar product on a smaller scale here. Our solution is to have the machines generating the log files push them to a NAS on a daily basis in a randomly staggered pattern. This solved a lot of the problems of a more pull-based method, including bunched-up read/write times that kept a server busy for days.
It doesn't sound like the storage server's bandwidth would be saturated, so you could pull from several clients at different locations in parallel. The main question is: what is the bottleneck that slows the whole process down?
I would do the following:
Write a program to run on each server, which will do the following:
Monitor the logs on the server
Compress them on a defined schedule
Pass information to the analysis server.
Write another program which sits on the core server and does the following:
Pulls compressed files when the network/cpu is not too busy.
(This can be multi-threaded.)
This uses the information passed to it from the end computers to determine which log to get next.
Uncompresses and loads the data into your database continuously.
This should give you a solution which provides up to date information, with a minimum of downtime.
The downside will be relatively consistent network/computer use, but tbh that is often a good thing.
It will also allow easy management of the system, to detect any problems or issues which need resolving.
NetBIOS copies are not as fast as, say, FTP. The problem is that you don't want an FTP server on each server. If you can't process the log files locally on each server, another solution is to have all the servers upload the log files via FTP to a central location, which you can then process. For instance:
Set up an FTP server as a central collection point. Schedule tasks on each server to zip up the log files and FTP the archives to your central FTP server. You can write a program which automates the scheduling of the tasks remotely using a tool like schtasks.exe:
KB 814596: How to use schtasks.exe to Schedule Tasks in Windows Server 2003
You'll likely want to stagger the uploads back to the FTP server.
