I am going to build a UGC project. I have a Main Server (application), 4 Encoder Servers (for converting videos), and a Storage Server (for hosting videos).
I want to use the database driver for the Laravel queue, with the jobs table as the target. For each uploaded video I have 5 jobs that convert the video to 240p, 360p, 480p, 720p and 1080p.
But a job does not specify which Encoder Server it belongs to. For example, a video is uploaded to Encoder Server #4, but Encoder Server #2 tries to start the job and fails because the files are on Encoder Server #4.
How can I solve this challenge?
As @apokryfos says, upload the file to shared storage like Amazon S3, Google Cloud Storage or DigitalOcean Spaces, whatever you prefer.
Have the processing job download it from the central store, process it, and upload the result back to central storage.
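A minimal sketch of that pattern, shown here in Python with boto3 and ffmpeg purely for illustration (the bucket names, key layout and scale settings are assumptions; in a Laravel setup the equivalent logic would live in the queued job's handle() method):

# Hypothetical sketch: any encoder worker can take the job, because the source
# video lives in shared storage rather than on the server that received it.
import os
import subprocess
import tempfile

import boto3

s3 = boto3.client("s3")

def encode_variant(source_key, height, source_bucket="uploads", target_bucket="encoded"):
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source.mp4")
        dst = os.path.join(tmp, f"{height}p.mp4")
        s3.download_file(source_bucket, source_key, src)   # pull from the central store
        subprocess.run(
            ["ffmpeg", "-i", src, "-vf", f"scale=-2:{height}", dst],
            check=True,
        )                                                   # transcode locally
        s3.upload_file(dst, target_bucket, f"{source_key}/{height}p.mp4")  # push result back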
If you bind a single job to a single worker (encoder server, as you call it), it does not make much sense: it will never be scalable, and you are pretty much doomed to run into issues.
Doing it this way, you can simply scale up the number of workers once you need to; you could even auto-scale them.
Consider using a Kubernetes deployment to allow easy (auto) scaling.
I'm developing an app in GCP that processes a video file in Cloud Run to extract the frames and store them in another bucket.
But my Cloud Run application needs to download the whole file, save it in the container instance, and process the video frame by frame.
So I changed it to process the video in parallel, with each instance processing a block of frames, but each container still downloads the whole video file.
This process is relatively fast for short videos, but large videos of 20 GB or more can take a long time and the resource usage is quite large.
So my idea is to download only 500 MB (or less) of the video per container and process that fragment only.
So, my question is:
How can I optimize the download so that each container fetches only the frames it needs, without downloading the whole video?
Can I stream the video from Cloud Storage, downloading only the needed block of frames?
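For reference, byte-range reads are possible with the google-cloud-storage Python client; here is a minimal sketch (the bucket and object names are placeholders, and note that byte ranges do not line up with frame boundaries in compressed video, so fragments would need some overlap or a prior index):

# Hypothetical sketch: download only one ~500 MB slice of a large object
# instead of the whole file.
from google.cloud import storage

CHUNK = 500 * 1024 * 1024  # ~500 MB per container

def download_fragment(bucket_name, blob_name, fragment_index):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    start = fragment_index * CHUNK
    end = start + CHUNK - 1                        # inclusive end offset
    return blob.download_as_bytes(start=start, end=end)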
I have a Spring Boot REST API deployed on AWS Elastic Beanstalk and I am trying to upload pictures through it.
This is what I did: upload a zip file through a file input from the browser, get the zip file on the server, go through all the files, and upload each one to AWS S3.
It works fine, but I ran into a problem: when I try to upload lots of pictures, I get an HTTP error (504 Gateway Timeout). I found out this is because the server takes too much time to respond, and I am trying to figure out how to set a higher timeout for the requests (haven't found it yet).
But in the meantime I am asking myself if it is the best solution.
Wouldn't it be better to end the request directly after receiving the zip file, do the uploads to S3, and after that notify the user that the uploads are done? Is there even a way to do that? Is there a good practice for this (an operation that takes a long time to process)?
I know how to do the processing asynchronously, but I would really like to know how to notify the user after it completes.
Wouldn't it be better to end the request directly after receiving the zip file, do the uploads to S3, and after that notify the user that the uploads are done?
Yes, asynchronous processing of the uploaded images in the zip file would be better.
Is there even a way to do that? Is there a good practice for this (an operation that takes a long time to process)?
Yes, there is a better way. To keep everything within EB, you could look at an Elastic Beanstalk worker environment. The worker environment is ideal for processing your images.
In this solution, your web-based environment would store the uploaded images in S3 and submit their names, along with other identifying information, to an SQS queue. The queue is the entry point for the worker environment.
Your workers would process the images from the queue independently of the web environment. In the meantime, the web environment would have to check for the results and notify your users once the images get processed.
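A rough sketch of that hand-off, shown in Python with boto3 just to illustrate the flow (the bucket name, queue URL and message format are placeholders; in the Spring Boot app the same calls would go through the AWS SDK for Java):

# Hypothetical sketch: the web tier stores each image in S3 and enqueues a
# message; the worker environment consumes the queue independently.
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-jobs"  # placeholder

def submit_image(upload_id, filename, data):
    key = f"{upload_id}/{filename}"
    s3.put_object(Bucket="incoming-images", Key=key, Body=data)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"upload_id": upload_id, "key": key}),
    )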
EB also supports linking different environments, so you could establish a link between the web and worker environments for easier integration.
Context
I have a web dyno which receives a video file with the intention of applying a computer vision algorithm to return an analysis. The algorithm takes about 10 seconds to run. My current method is to process it with the web dyno. The whole thing is pretty fast. The user doesn't have to wait any more than a minute.
What's not working
But of course, tying up the web dyno is a bad idea. And some users have gotten timeouts... So I tried implementing Redis to pass the job to a worker dyno.
@application.route('/video', methods=['POST'])
@cross_origin()
def video():
    video_file = request.files['videoData']
    job = q.enqueue_call(
        func=run_analysis, args=(video_file,), result_ttl=5000
    )
    return json.dumps(job.id)
But this gives me an error: TypeError: cannot serialize '_io.BufferedRandom' object, and I understand why.
In my dev env, I can save the video to the filesystem and pass the file path only, but this doesn't work in production as the web dyno's file system is ephemeral and the worker won't see the file.
So I'm looking for the fastest way to get the video file across. Speed is of the essence here, as the user is waiting for their video to be processed.
What I've tried
I've tried S3 (uploading directly from the client and downloading in the worker), but it made the whole process way slower. First of all, it takes longer to upload to S3 than to my Heroku endpoint. Second of all, I then have to download it to the worker, which takes a while as well. I don't really need to keep the file, so it's a very inefficient workaround.
Heroku dynos are completely isolated containers, which is why they cannot share a file system the way you want. But if you host them on another provider, like DigitalOcean or Amazon, you will be able to access files stored by Flask from the other workers instantly (or almost instantly; don't forget to make a copy of the temp file, since Flask or the WSGI server deletes it after the response is sent).
Another option is to find the fastest way of "transporting" the video data (not always a file) to a worker. You can do it using:
queue - put the whole file into a queue. Not recommended, but still OK if the video files are really small.
in-memory database - save the file to an in-memory database (e.g. Redis). These have a lot of mechanisms to quickly transport data between servers or processes (cons: expensive on Heroku). A minimal sketch of this option follows at the end of this answer.
database - save the file to a general-purpose database (e.g. PostgreSQL), which will do the same as Redis but can handle bigger data more cheaply, though a bit more slowly.
WebSockets or even a Unix socket - you can have one worker that Flask connects to, sends the file to, and then returns the HTTP response. That "listener" actually starts the task. It can either save the video to a file and pass the path to the next worker (but then it must always be on the same dyno as the other workers) or pass the data directly using args, fork, threading, subprocessing, etc.
You can start with Flask-SocketIO.
But remember that you need to configure a server-to-server connection, between the web app and a worker that listens to it in a loop or a separate thread; not JavaScript in the browser, although that is potentially also an option - start the task and upload the file directly to the worker.
P.S. There are no 30-second timeouts on Heroku for WebSockets :)
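A minimal sketch of the in-memory database option mentioned above, assuming the same Redis instance backs both the RQ queue and the temporary storage; the key scheme is made up, and run_analysis (the asker's existing function) is assumed to accept raw bytes rather than a file object:

# Hypothetical sketch: stash the upload in Redis under a key and enqueue only
# the key, so the worker dyno can fetch the bytes itself.
import uuid

from redis import Redis
from rq import Queue

redis_conn = Redis()
q = Queue(connection=redis_conn)

def enqueue_video(video_file):
    key = f"video:{uuid.uuid4()}"
    redis_conn.set(key, video_file.read(), ex=3600)     # raw bytes, 1-hour expiry
    job = q.enqueue(process_by_key, key, result_ttl=5000)
    return job.id

def process_by_key(key):
    data = redis_conn.get(key)       # runs on the worker dyno
    redis_conn.delete(key)
    return run_analysis(data)        # existing analysis function (assumed to take bytes)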
I am transferring around 31 TB of data, consisting of 4500 files ranging from 69 MB to 25 GB, from a remote server to an S3 bucket. I am using s4cmd put to do this, wrapped in a bash script upload.sh:
#!/bin/bash
FILES="/path/to/*.fastq.gz"
for i in $FILES
do
  echo "$i"
  s4cmd put --sync-check -c 10 "$i" s3://bucket-name/directory/
done
Then I use qsub to submit the job:
qsub -cwd -e error.txt -o output.txt -l h_vmem=10G -l mem_free=8G -l m_mem_free=8G -pe smp 10 upload.sh
This is taking way too long - it took 10 hours to upload ~20 files. Can someone suggest alternatives or modifications to my command?
Thanks!
Your case may be one of those situations where copying the data onto physical media and shipping it by regular mail is faster and cheaper than transferring it over the internet. AWS supports such a "protocol" and has a special name for it: AWS Snowball.
Snowball is a petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud. Using Snowball addresses common challenges with large-scale data transfers including high network costs, long transfer times, and security concerns. Transferring data with Snowball is simple, fast, secure, and can be as little as one-fifth the cost of high-speed Internet.

With Snowball, you don’t need to write any code or purchase any hardware to transfer your data. Simply create a job in the AWS Management Console and a Snowball appliance will be automatically shipped to you*. Once it arrives, attach the appliance to your local network, download and run the Snowball client to establish a connection, and then use the client to select the file directories that you want to transfer to the appliance. The client will then encrypt and transfer the files to the appliance at high speed. Once the transfer is complete and the appliance is ready to be returned, the E Ink shipping label will automatically update and you can track the job status via Amazon Simple Notification Service (SNS), text messages, or directly in the Console.

* Snowball is currently available in select regions. Your location will be verified once a job has been created in the AWS Management Console.
The capacity of their smaller device is 50TB, a good fit for your case.
There is also a similar service, AWS Import/Export Disk, where you ship your own hardware (hard drives) instead of their special device:
To use AWS Import/Export Disk:
Prepare a portable storage device (see the Product Details page for supported devices).
Submit a Create Job request. You’ll get a job ID with a digital signature used to authenticate your device.
Print out your pre-paid shipping label.
Securely identify and authenticate your device. For Amazon S3, place the signature file on the root directory of your device. For Amazon EBS or Amazon Glacier, tape the signature barcode to the exterior of the device.
Attach your pre-paid shipping label to the shipping container and ship your device, along with its interface connectors and power supply, to AWS.
When your package arrives, it will be processed and securely transferred to an AWS data center, where your device will be attached to an AWS Import/Export station. After the data load completes, the device will be returned to you.
I have over 500 machines distributed across a WAN covering three continents. Periodically, I need to collect text files which are on the local hard disk of each blade. Each server is running Windows Server 2003 and the files are mounted on a share which can be accessed remotely as \\server\Logs. Each machine holds many files which can be several MB each, and the size can be reduced by zipping.
Thus far I have tried using PowerShell scripts and a simple Java application to do the copying. Both approaches take several days to collect the 500 GB or so of files. Is there a better solution which would be faster and more efficient?
I guess it depends what you do with them ... if you are going to parse them for metrics data into a database, it would be faster to have that parsing utility installed on each of those machines to parse and load into your central database at the same time.
Even if all you are doing is compressing and copying to a central location, set up those commands in a .cmd file and schedule it to run on each of the servers automatically. Then you will have distributed the work amongst all those servers, rather than forcing your one local system to do all the work. :-)
The first improvement that comes to mind is to not ship entire log files, but only the records from after the last shipment. This of course is assuming that the files are being accumulated over time and are not entirely new each time.
You could implement this in various ways: if the files have date/time stamps you can rely on, running them through a filter that removes the older records from consideration and dumps the remainder would be sufficient. If there is no such discriminator available, I would keep track of the last byte/line sent and advance to that location prior to shipping.
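A minimal sketch of the byte-offset approach in Python (the state-file layout and the ship callback are made up for illustration):

# Hypothetical sketch: remember how many bytes of each log were already sent,
# and ship only what was appended since the last run.
import json
import os

STATE_FILE = "shipped_offsets.json"

def ship_new_content(log_path, ship):
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    offset = state.get(log_path, 0)
    with open(log_path, "rb") as f:
        f.seek(offset)                 # skip everything already shipped
        new_data = f.read()
    if new_data:
        ship(new_data)                 # e.g. compress and FTP/HTTP it to the collector
        state[log_path] = offset + len(new_data)
        with open(STATE_FILE, "w") as out:
            json.dump(state, out)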
Either way, the goal is to ship only new content. In our own system, logs are shipped via a service that replicates them as they are written. That required a small service hooked into how the log files are written, but it reduced the latency of capturing logs and cut bandwidth use immensely.
Each server should probably:
manage its own log files (start new logs before uploading and delete sent logs after uploading)
name the files (or prepend metadata) so the server knows which client sent them and what period they cover
compress log files before shipping (compress + FTP + uncompress is often faster than FTP alone)
push log files to a central location (FTP is faster than SMB; the Windows FTP command can be automated with "-s:scriptfile")
notify you when it cannot push its log for any reason
do all the above on a staggered schedule (to avoid overloading the central server)
Perhaps use the server's last IP octet multiplied by a constant to offset in minutes from midnight?
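For example, a tiny illustrative sketch of that offset calculation (Python; the constant is arbitrary):

# Hypothetical sketch: stagger each server's upload time by its last IP octet,
# so 500 machines don't all push logs at the same moment.
import socket

def upload_offset_minutes(constant=5):
    ip = socket.gethostbyname(socket.gethostname())
    last_octet = int(ip.split(".")[-1])           # 0-255
    return (last_octet * constant) % (24 * 60)    # minutes after midnight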
The central server should probably:
accept log files sent and queue them for processing
gracefully handle receiving the same log file twice (should it ignore or reprocess?)
uncompress and process the log files as necessary
delete/archive processed log files according to your retention policy
notify you when a server has not pushed its logs lately
We have a similar product on a smaller scale here. Our solution is to have the machines generating the log files push them to a NAS on a daily basis in a randomly staggered pattern. This solved a lot of the problems of a more pull-based method, including bunched-up read-write times that kept a server busy for days.
It doesn't sound like the storage server's bandwidth would be saturated, so you could pull from several clients at different locations in parallel. The main question is: what is the bottleneck that slows the whole process down?
I would do the following:
Write a program to run on each server, which will do the following:
Monitor the logs on the server
Compress them at a particular defined schedule
Pass information to the analysis server.
Write another program which sits on the core server and does the following:
Pulls compressed files when the network/cpu is not too busy.
(This can be multi-threaded.)
This uses the information passed to it from the end computers to determine which log to get next.
Uncompress and upload to your database continuously.
This should give you a solution which provides up to date information, with a minimum of downtime.
The downside will be relatively consistent network/computer use, but tbh that is often a good thing.
It will also allow easy management of the system, to detect any problems or issues which need resolving.
NetBIOS copies are not as fast as, say, FTP. The problem is that you don't want an FTP server on each server. If you can't process the log files locally on each server, another solution is to have all the servers upload the log files via FTP to a central location, which you can then process. For instance:
Set up an FTP server as a central collection point. Schedule tasks on each server to zip up the log files and FTP the archives to your central FTP server. You can write a program which automates the scheduling of the tasks remotely using a tool like schtasks.exe:
KB 814596: How to use schtasks.exe to Schedule Tasks in Windows Server 2003
You'll likely want to stagger the uploads back to the FTP server.
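A sketch of how that remote scheduling could be scripted, here driven from Python via subprocess (the server name, credentials, start time and the upload_logs.cmd script are all placeholders):

# Hypothetical sketch: create a daily "zip + FTP upload" task on a remote
# Windows Server 2003 machine using schtasks.exe.
import subprocess

def schedule_upload(server, user, password, start_time="01:30:00"):
    subprocess.run([
        "schtasks.exe", "/Create",
        "/S", server,                      # target machine
        "/U", user, "/P", password,        # credentials with rights on that machine
        "/TN", "UploadLogs",               # task name
        "/TR", r"C:\scripts\upload_logs.cmd",
        "/SC", "DAILY", "/ST", start_time, # time format may vary by Windows version
    ], check=True)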