I have an Android and iOS app that uploads images (about 15,000 per minute) to an AWS S3 bucket. That part works fine, but I need to process those images in a web app that is used by 2 to 50 different users called 'Monitores'. When one of these users logs in and begins to process the images, the app scans the S3 bucket for the filenames, something like:
$recibidos = Storage::disk('s3recibidos');
$total_archivos = $recibidos->allFiles();
This generates an array of the files that are stored at the time the route is invoked. If only one user runs the process there is no problem, because it runs only once. But what if 2 or more users trigger it? Each of them does not retrieve exactly the same list, but I think many of the unprocessed files will appear in more than one list.
Processing a filename means storing it in a database and moving the file to a subdirectory.
For example:
I have 1000 files in the AWS S3 bucket and user1 invokes the process, so the array has 1000 filenames to process. Processing those files currently takes about 3 minutes, so before the process finishes, 1000 new files have been added to the bucket; these files are not in user1's array. Then user2 logs in and begins to process. At that point the bucket contains both new and old files, so user2's array includes some old filenames (the ones not yet processed). In fact, when user2 processes the files, some of them are no longer available, because user1's process has already handled them.
I need help with these two things:
1.- How to deal with the process.
2.- How can I use wildcards? One of the final steps changes the filenames of the files in S3, so the list of filenames that I need to process has its own specific format.
Thanks for any advice
I'm a little confused about your process, but let's assume:
You have a large number of incoming images
You need to perform some operation on each of those images
There are two recommended approaches to do this:
Option 1: Serverless
Configure the Amazon S3 bucket to trigger an AWS Lambda function whenever a new object is created in the bucket
Create an AWS Lambda function as a worker -- it receives information about each file, then processes the file
AWS Lambda will automatically scale to run multiple Lambda functions in parallel. The default is up to 1000 concurrent Lambda functions, but this can be increased upon request.
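As a rough illustration of Option 1, here is a minimal sketch of a Lambda worker in Python. It assumes the standard S3 event notification payload; process_image is a hypothetical placeholder for the real work (the database insert and the move to a subdirectory), not something from the question.

```python
import urllib.parse


def process_image(bucket: str, key: str) -> str:
    """Hypothetical placeholder for the real per-image work (DB insert, move, etc.)."""
    return f"s3://{bucket}/{key}"


def handler(event, context):
    """Entry point invoked by Lambda for each S3 'object created' notification."""
    processed = []
    # One notification can carry several records, so loop over all of them.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces become '+'), so decode them first.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        processed.append(process_image(bucket, key))
    return {"processed": processed}
```

Because each invocation only sees the objects named in its own event, no worker ever has to list the bucket, which is what sidesteps the overlapping-lists problem.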
Option 2: Traditional
Create an Amazon SQS queue to store details of images to process
Configure the Amazon S3 bucket to send an event to the SQS queue whenever a new object is created in the bucket
Use Amazon EC2 instance(s) to run multiple workers
Each worker reads the file information from the queue, processes the image, then deletes the message from the queue. It then repeats, pulling the next message from the queue.
Scale the number of EC2 instances and/or workers as necessary
Both of these approaches have workers operating on one image file at a time, so you do not have the problem of maintaining lists while images are being continually added. They are also highly scalable with no code changes.
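For Option 2, a worker loop might look roughly like the sketch below. It assumes sqs behaves like a boto3 SQS client (receive_message / delete_message) and that the queue receives S3 event notifications as JSON; drain_once and the process callback are hypothetical names used only for illustration.

```python
import json
import urllib.parse


def drain_once(sqs, queue_url, process):
    """Receive one batch of messages, process each object, then delete the message.

    `sqs` is expected to behave like a boto3 SQS client (assumption);
    `process(bucket, key)` is a hypothetical callback for the real work.
    Returns the number of messages handled.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,      # long polling, avoids busy-looping on an empty queue
    )
    handled = 0
    for message in response.get("Messages", []):
        body = json.loads(message["Body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            process(bucket, key)
        # Delete only after successful processing; if process() raises, the
        # message becomes visible again after the visibility timeout.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

A real worker would call drain_once in a loop with boto3.client("sqs"). While a message is in flight it is hidden from the other workers, so two workers should not normally pick up the same image.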
I'm building an application in Laravel (v9) where users upload videos, and these get converted to MP4 (showing progress percentage), thumbnail gets created… etc
Once the video is uploaded, I dispatch a new job in the background that runs all my FFMPEG commands, and marks the video as ready on the database once FFMPEG has finished.
However, if there are multiple users uploading multiple videos, this leaves them waiting, as Laravel’s queue executes each job one by one.
How can I make it so that videos get converted immediately without waiting for the previous job to finish?
You're probably always going to want to use a queue, but you could look into increasing the number of queue workers that are running at any given time. Take a look at the Laravel docs on running your queue via Supervisor and consider setting the numprocs value high enough to support the concurrent load you need to handle.
The caveat is that each queue worker will need CPU/memory, so if you set the number of concurrent workers too high, it may exceed your server's capacity.
You can use this article on php-fpm tuning to help figure out your server capacity needs. The article is focused on tuning web servers, but you can use the same technique to determine how much memory your queue workers are using, and from there determine how many workers you can reasonably run at once.
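The capacity estimate comes down to simple arithmetic, sketched here in Python. The numbers are purely illustrative assumptions, not measurements; measure your own per-worker memory (including the ffmpeg child process) before picking a numprocs value.

```python
def max_workers(total_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Number of queue workers that fit in memory after reserving room for everything else."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    return max((total_mb - reserved_mb) // per_worker_mb, 0)


# Illustrative numbers only: 8 GB server, 2 GB kept for the OS, web server and
# database, each worker (PHP process plus its ffmpeg child) peaking around 600 MB.
print(max_workers(total_mb=8192, reserved_mb=2048, per_worker_mb=600))  # prints 10
```

CPU matters as well: ffmpeg is usually CPU-bound, so the core count may cap the useful number of workers before memory does.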
One other option would be to look at Sidecar to run your ffmpeg processes in AWS Lambdas rather than relying on a queue at all. This project may help you get started…
I have a Lambda attached to an S3 bucket that will contain historical data (20-25 MB). The data is organized in folders by month, and each month has over 400k records in a txt file. The Lambda is triggered for every S3EventNotification, parses the file line by line, and saves the records in a DynamoDB table. I need to load the historical data into the DynamoDB table before launching to prod. Is it better to write a script than to run the Lambda?
From the research I've done: because the files may be large, the Lambda can time out. Also, memory usage is restricted to 512 MB for the Lambda.
Your question is a little unclear, but I think you are asking how to run the Lambda function again over all objects that are already in the S3 bucket.
The easiest method is to use AWS Lambda with Amazon S3 batch operations. From the AWS Lambda documentation:
You can use Amazon S3 batch operations to invoke a Lambda function on a large set of Amazon S3 objects. Amazon S3 tracks the progress of batch operations, sends notifications, and stores a completion report that shows the status of each action.
This way, the Lambda function can be run over objects that are already in the bucket, much as if they had been freshly uploaded. Be aware that S3 batch operations invokes the function with its own event format (a tasks list) rather than the S3 notification format, so the handler may need a small adapter. When handling regular S3 event notifications, multiple objects might be passed to a single Lambda function invocation, so ensure that the function is looping through the event['Records'] list, rather than merely processing event['Records'][0].
If you fear that the Lambda function might time out, you can increase the timeout to a maximum of 15 minutes. Allocating more memory to a function will also allocate more CPU, which might make it run faster (but costs also increase). After processing a file, be sure to delete it from /tmp/ to avoid hitting the storage limit there.
However, if objects are bigger than 512MB or take longer than 15 minutes to process, then using an AWS Lambda function is not appropriate.
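The /tmp cleanup can be sketched with a try/finally, so the scratch file is removed even when parsing fails. The download and parse callbacks are hypothetical stand-ins for the S3 download and the line-by-line DynamoDB load; this is only one way to arrange it.

```python
import os
import tempfile


def process_with_scratch_file(download, parse):
    """Download an object to scratch space, parse it, and always remove the file.

    `download(path)` and `parse(path)` are hypothetical callbacks standing in
    for the S3 download and the line-by-line DynamoDB load.
    """
    # mkstemp uses the system temp directory, which is normally /tmp inside Lambda.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        download(path)
        return parse(path)
    finally:
        # Runs even if parse() raises, so reused containers do not accumulate files.
        if os.path.exists(path):
            os.remove(path)
```

This matters because Lambda may reuse the same execution environment for later invocations, and leftover files in /tmp would still be there.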
I want to use a queue for file uploads. Users can upload files. Each file will have around 500 rows. Now I want to implement this logic:
Maximum of 5 files can be processed at the same time. The remaining files should be in the queue.
Each file should have 5 processes, so 5 rows will be inserted into the database at the same time. In short, there will be a maximum of 25 processes (5 processes for each of 5 files).
Right now I am adding all files to one queue, and the files are processed one by one. In short, it is first in, first out: the 2nd file has to wait for the 1st file to finish.
How can I implement this? Or do you have any other suggestions?
What exactly is the difference between processing a file, and inserting rows into the DB?
If you want to run multiple workers for the same queue, you can simply start more workers using php artisan queue:work, and additionally use flags to specify the queue, for example --queue=process-files. See the documentation.
In a production environment, consider configuring Supervisor to run a specific number of workers on a queue using the numprocs directive.
Do I understand correctly you want to run 25 queue workers per user? That does not seem right. Instead, you should consider creating queues for fast/slow jobs.
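To make the "5 files at once, 5 rows per file" limit concrete, here is a language-neutral sketch in Python using thread pools. This is not Laravel code: in Laravel the same shape would come from how many workers you run per queue. insert_row is a hypothetical stand-in for the real database insert.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_FILES = 5          # files processed at the same time
MAX_ROWS_PER_FILE = 5  # rows inserted concurrently within one file


def insert_row(row):
    """Hypothetical stand-in for the real database insert."""
    return row


def process_file(rows):
    # Each file gets its own small pool, so at most 5 of its rows are in flight.
    with ThreadPoolExecutor(max_workers=MAX_ROWS_PER_FILE) as row_pool:
        return list(row_pool.map(insert_row, rows))


def process_uploads(files):
    # The outer pool caps concurrent files at 5; the rest wait in its internal queue.
    with ThreadPoolExecutor(max_workers=MAX_FILES) as file_pool:
        return list(file_pool.map(process_file, files))
```

The upper bound is 5 x 5 = 25 concurrent inserts, which matches the limit described in the question.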
We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB of RAM each.
The app streams from a directory in S3 which is constantly being written; this is the line of code that achieves that:
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost , (f:Path)=> true, true )
//some maps and other logic here
ssc.start()
ssc.awaitTermination()
The purpose of using fileStream instead of textFileStream is to customize the way that spark handles existing files when the process starts. We want to process just the new files that are added after the process launched and omit the existing ones. We configured a batch duration of 10 seconds.
The process goes fine while we add a small number of files to s3, let's say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But sometimes when we try to add a larger number of files, we face a strange behavior; the application starts streaming files that have already been streamed.
For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these phases endlessly with the same files!
Any thoughts on what could be causing this?
I've posted a Jira ticket for this issue:
https://issues.apache.org/jira/browse/SPARK-3553
Note the sentence "The files must be created in the dataDirectory by atomically moving or renaming them into the data directory" from the Spark Streaming Programming Guide. The entire file must appear all at once, rather than creating the file empty and appending to it.
One approach is to get cloudberry to put the files somewhere else, and then run a script periodically that either moves or renames the files into the directory you've attached your streaming app to.
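The "write elsewhere, then move in" idea looks roughly like this on a regular filesystem, sketched in Python. The directory names are hypothetical. Note that S3 itself has no rename operation (a move is a copy followed by a delete), so treat this as an illustration of the pattern rather than an S3 recipe.

```python
import os
import tempfile


def publish_atomically(staging_dir: str, watched_dir: str, name: str, data: bytes) -> str:
    """Write the full file in a staging directory, then rename it into the watched one."""
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    final_path = os.path.join(watched_dir, name)
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path
```

The streaming job then only ever sees complete files, never an empty or half-written one.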
We have a system that receives archives on a specified directory and on a regular basis it launches a mapreduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we're hooked into the close() method on our RecordReader to have it deleted after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has been deleted by the record reader from the first one and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
// we're done: delete the input files, but how?
return 0;
}
Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also whatever the back end map reduce processing is doing could be idempotent: processing the same record twice should not be any different than processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
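A minimal sketch of the rename-instead-of-delete idea, written in Python against a local directory rather than HDFS. The .done suffix is a hypothetical marker, and in a real job the exclusion would live in the input glob passed to Hadoop.

```python
from pathlib import Path

DONE_SUFFIX = ".done"  # hypothetical marker appended to processed archives


def pending_archives(input_dir: Path):
    """Archives that do not yet carry the processed marker, in a stable order."""
    return sorted(
        p for p in input_dir.glob("*") if p.is_file() and p.suffix != DONE_SUFFIX
    )


def mark_processed(archive: Path) -> Path:
    """Rename instead of deleting, so a failed job can still be backfilled later."""
    target = archive.with_name(archive.name + DONE_SUFFIX)
    archive.rename(target)
    return target
```

Renamed archives can be purged later by a separate retention job, which keeps a window open for reprocessing.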
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.
The simplest way would probably be to do a multiple-input job: read the directory for the files before you run the job and pass those files to the job instead of the directory (then delete the files in the list after the job is done).
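The snapshot approach can be sketched as follows, in Python against a local directory for brevity. run_job is a hypothetical callback standing in for submitting the MapReduce job with an explicit list of input paths and waiting for it to complete.

```python
import os


def run_on_snapshot(input_dir, run_job):
    """List the inputs once, run the job on exactly that list, then delete only those files.

    `run_job(paths)` is a hypothetical callback that returns True on success.
    """
    snapshot = sorted(
        os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if os.path.isfile(os.path.join(input_dir, name))
    )
    if not snapshot or not run_job(snapshot):
        return []  # nothing to do, or the job failed: keep every input for a retry
    for path in snapshot:
        os.remove(path)  # files that arrived after the snapshot are left untouched
    return snapshot
```

Since deletion happens only after the whole job succeeds, a retried mapper still finds its input.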
Based on the situation you are describing, I can suggest the following solution:
1. Monitoring the directory where the archives land should be done by a separate process. That process can use a metadata table (in MySQL, for example) to record a status entry for each archive it sees. The metadata entries can also be used to check for duplicates.
2. A second, separate process can then handle triggering the MapReduce jobs based on the metadata entries, checking each entry's status before triggering a job.
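A minimal sketch of such a metadata table, using SQLite from Python as a stand-in for MySQL (an assumption made only to keep the example self-contained). The table layout and status values are hypothetical.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Create the metadata table; the primary key rejects duplicate entries."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archives ("
        " path TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'new')"
    )
    return conn


def register(conn, path):
    """Called by the monitoring process; returns False if the archive was already known."""
    with conn:
        cur = conn.execute("INSERT OR IGNORE INTO archives (path) VALUES (?)", (path,))
    return cur.rowcount == 1


def claim(conn, path):
    """Called by the job trigger; only one caller can move 'new' to 'processing'."""
    with conn:
        cur = conn.execute(
            "UPDATE archives SET status = 'processing' WHERE path = ? AND status = 'new'",
            (path,),
        )
    return cur.rowcount == 1
```

The conditional UPDATE acts as the claim step: whichever process flips the status first owns the archive, and everyone else sees zero affected rows.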
I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.