Split S3 files line by line and put into another bucket - amazon-data-pipeline

I have moderately sized files in S3 (which I got from EMR), and each line has a key and object which I would like to put in a separate bucket.
Is there any way to implement this using Data Pipeline?

If you just want to copy this data into another S3 bucket, you can use CopyActivity in Data Pipeline to do this. Your pipeline will have an S3DataNode as an input to the CopyActivity and an output S3DataNode. You can get started by trying out one of the templates available in the AWS Data Pipeline console and editing your pipeline to have the objects I have mentioned above.
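For illustration, here is a hedged boto3 sketch of that wiring; the names and bucket paths are placeholders, and the Default, Schedule, and Ec2Resource objects a real pipeline also needs are omitted for brevity.
# Hedged sketch (boto3): a CopyActivity between two S3DataNodes.
# Bucket paths and IDs are placeholders; supporting pipeline objects are omitted.
import boto3

dp = boto3.client("datapipeline")

pipeline_id = dp.create_pipeline(
    name="copy-s3-to-s3", uniqueId="copy-s3-to-s3-demo"
)["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {   # input S3DataNode
            "id": "InputNode", "name": "InputNode",
            "fields": [
                {"key": "type", "stringValue": "S3DataNode"},
                {"key": "directoryPath", "stringValue": "s3://source-bucket/emr-output/"},
            ],
        },
        {   # output S3DataNode
            "id": "OutputNode", "name": "OutputNode",
            "fields": [
                {"key": "type", "stringValue": "S3DataNode"},
                {"key": "directoryPath", "stringValue": "s3://destination-bucket/copied/"},
            ],
        },
        {   # CopyActivity wiring input to output
            "id": "CopyData", "name": "CopyData",
            "fields": [
                {"key": "type", "stringValue": "CopyActivity"},
                {"key": "input", "refValue": "InputNode"},
                {"key": "output", "refValue": "OutputNode"},
            ],
        },
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)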

Related

How to delete all objects with a specific name from an S3 folder and its subfolders that are older than n days using the AWS Ruby SDK

I have a requirement to delete files with the prefix application.log which are older than 5 days in an S3 folder.
The files are present inside log-bucket/main-shell/apps/app-main-shell-55f79d74fc-4sx6c/helpkit.
Is there a way to list and delete files recursively using the AWS Ruby SDK?
Rather than writing your own code, you can set up an S3 lifecycle rule with a prefix using the Ruby SDK.
In the lifecycle rule, specify that data under the particular path should be deleted after 5 days.
Below are reference links for configuring the S3 lifecycle and the Ruby SDK.
https://docs.aws.amazon.com/sdkforruby/api/Aws/S3/BucketLifecycle.html
https://docs.aws.amazon.com/AmazonS3/latest/dev/lifecycle-configuration-examples.html
If you want to perform a 1-time clean-up of objects under a single key prefix, you can use the batch operations on the objects collection.
require 'aws-sdk-s3'

# List everything under the prefix and delete it in batches.
s3 = Aws::S3::Resource.new
s3.bucket('bucket-name').objects(prefix: 'some/key/prefix/').batch_delete!
This will list objects with the given key prefix and then issue a batch delete for each page of results. The more objects with the given prefix, the more API calls; it should be 2 requests (1x list, 1x batch delete) per 1k objects to delete.
Please note, this is a destructive operation. Make sure your key prefix is correct before you issue the batch delete.
If you want to do this on a frequent basis, then I would use a bucket lifecycle configuration.
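Purely as an illustration of the rule shape, here is a hedged sketch using boto3 (Python); the Ruby SDK's put_bucket_lifecycle_configuration call takes an equivalent structure, and the bucket name and prefix below are placeholders for the paths in the question.
# Hedged sketch: expire objects under a prefix 5 days after creation.
# Bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="log-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-app-logs",
                "Filter": {"Prefix": "main-shell/apps/"},
                "Status": "Enabled",
                "Expiration": {"Days": 5},
            }
        ]
    },
)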

Spark Dataset Parquet partition on s3 creating temporary folder

In Spark (version 2.2.0) there is no DirectParquetOutputCommitter. As an alternative, I can use
dataset
  .write
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2") // magic here
  .parquet("s3a://...")
to avoid creating a _temporary folder on S3.
Everything works fine until I add partitionBy to my Dataset:
dataset
  .write
  .partitionBy("a", "b")
  .option("mapreduce.fileoutputcommitter.algorithm.version", "2") // magic stops working, creating _temporary on S3
  .parquet("s3a://...")
I also tried adding the following, but it didn't work:
spark.conf.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
With partitionBy on a Spark Dataset, it creates _temporary and then moves the files, which becomes a very slow operation.
Is there any alternative or missing configuration?
Hadoop 3.1's s3a will have a zero-rename committer built in (via HADOOP-13786). Until then, you can make use of its precursor, which is from Netflix.
Note that "algorithm 2" isn't a magic step that eliminates the _temporary dir; it just renames task output directly to the destination when the individual tasks commit. It is still prone to errors if there is delayed consistency in the directory listing, and still O(data). You cannot safely use either the v1 or v2 committers directly with S3, not with the S3A connector as shipped in Hadoop 2.x.
Alternatives (in order of recommendation + ease - top is best):
Use Netflix's S3Committer: https://github.com/rdblue/s3committer/
Write to HDFS, then copy to S3 (e.g. via s3distcp)
Don't use partitionBy, but instead iterate over all the partition permutations and write the results dynamically to each partitioned directory (see the sketch after this list)
Write a custom file committer
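A rough sketch of the third option, shown in PySpark for brevity; the column names ("a", "b") and paths are placeholders, not something the answer specifies.
# Hedged sketch: discover the partition permutations, then write each slice
# straight to its own directory instead of relying on partitionBy.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3a://source-bucket/input/")  # placeholder input

# One pass to find the distinct partition value combinations.
combos = df.select("a", "b").distinct().collect()

for row in combos:
    (df.filter((F.col("a") == row["a"]) & (F.col("b") == row["b"]))
       .drop("a", "b")
       .write
       .mode("overwrite")
       .parquet(f"s3a://dest-bucket/table/a={row['a']}/b={row['b']}/"))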

Spark saveAsTextFile writes empty file - <directory>_$folder$ to S3

rdd.saveAsTextFile("s3n://bucket-name/path") is creating an empty file with the folder name as [folder-name]_$folder$.
It seems this empty file is used by the hadoop-aws jar (from org.apache.hadoop) to mimic the S3 object store as a Hadoop filesystem.
But my application writes thousands of files to S3. As saveAsTextFile creates a folder (from the given path) to write the data (from the RDD), my application ends up creating thousands of these empty [directory-name]_$folder$ files.
Is there a way to make rdd.saveAsTextFile not write these empty files?
Stop using s3n, switch to s3a. It's faster and actually supported. That will make this issue go away, along with the atrocious performance problems reading large Parquet/ORC files.
Also, if your app is creating thousands of small files in S3, you are creating future performance problems: listing and opening files on S3 is slow. Try to combine source data into larger columnar-formatted files and use whatever SELECT mechanism your framework has to read only the bits you want.
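A minimal sketch of both suggestions together; the partition count and path are placeholders, and s3a credentials are assumed to come from the usual provider chain or spark.hadoop.fs.s3a.* settings.
# Hedged sketch: write through s3a instead of s3n, and coalesce to fewer,
# larger output files first. Partition count and path are placeholders.
rdd.coalesce(32).saveAsTextFile("s3a://bucket-name/path")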

Get the size of a folder in Amazon s3 using laravel

I want to get the size of a folder without looping through all the files in Laravel. The folder is in Amazon S3. My current code is:
$size = 0;
$files = Storage::allFiles($dir);
foreach ($files as $file) {
    $size += Storage::size($file);
}
I want to avoid the looping; is there any way to accomplish this?
Using listContents you can get an array of files, including file sizes, and then you can map that array into a total size.
$disk = Storage::disk('s3');
$size = array_sum(array_map(function ($file) {
    return (int) $file['size'];
}, array_filter($disk->listContents('your_folder', true /* <- recursive */), function ($file) {
    return $file['type'] == 'file';
})));
The other option you have, if you can deal with day-old stats, is the newly released 'S3 Storage Inventory' feature.
S3 can put out a daily (or weekly) file that has an inventory of all of your objects in the folder, including size:
http://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html
Amazon S3 inventory is one of the tools Amazon S3 provides to help manage your storage. You can simplify and speed up business workflows and big data jobs using the Amazon S3 inventory, which provides a scheduled alternative to the Amazon S3 synchronous List API operation. Amazon S3 inventory provides a comma-separated values (CSV) flat-file output of your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
You can configure what object metadata to include in the inventory, whether to list all object versions or only current versions, where to store the inventory list flat-file output, and whether to generate the inventory on a daily or weekly basis. You can have multiple inventory lists configured for a bucket. For information about pricing, see Amazon S3 Pricing.
There is no way to compute the size of a folder without recursively looping through it.
A quick command line solution is using du.
du -hs /path/to/directory will output the disk usage.
-h is to get the numbers "human readable", e.g. get 140M instead of 143260 (size in KBytes)
-s is for summary (otherwise you'll get not only the size of the folder but also for everything in the folder separately)
Referenced: https://askubuntu.com/questions/1224/how-do-i-determine-the-total-size-of-a-directory-folder-from-the-command-line
Amazon CloudWatch provides automatic metrics for the number of objects stored in a bucket and the storage space occupied. I'm not sure how often these metrics are updated, but that would be the simplest to use. However, this measures the whole bucket rather than just a particular folder.
See: Amazon Simple Storage Service Metrics and Dimensions
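If a bucket-level number is acceptable, here is a hedged sketch of reading that metric, shown with boto3 (Python) for illustration; the bucket name is a placeholder and the StorageType dimension must match the storage class in use.
# Hedged sketch: read the daily BucketSizeBytes metric from CloudWatch.
import datetime
import boto3

cw = boto3.client("cloudwatch")
now = datetime.datetime.now(datetime.timezone.utc)
resp = cw.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="BucketSizeBytes",
    Dimensions=[
        {"Name": "BucketName", "Value": "your-bucket"},      # placeholder
        {"Name": "StorageType", "Value": "StandardStorage"},
    ],
    StartTime=now - datetime.timedelta(days=2),
    EndTime=now,
    Period=86400,
    Statistics=["Average"],
)
datapoints = resp["Datapoints"]
if datapoints:
    latest = max(datapoints, key=lambda d: d["Timestamp"])
    print(latest["Average"], "bytes")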

How can I place an image back in the same S3 bucket after running AWS Lambda function?

I currently have two buckets in S3 - let's call them photos and photos-thumbnails. Right now, when a user uploads an image from our iOS app, we directly upload that photo to the photos bucket, which triggers a lambda function that resizes the photo into a thumbnail and uploads the thumbnail into the photos-thumbnails bucket.
I now want to include some image compression for the images in the photos bucket before a thumbnail is created in the original bucket (photos). However, if I set the compression lambda function to be triggered whenever an object is created in the photos bucket, it will wind up in a never-ending loop of the user uploading the original photo, triggering the compression and placing back in the same bucket, triggering compression again, etc.
Is there a way I can intercept this before it becomes a recursive call for image compression? Or is the only way to create a third bucket?
A third bucket would probably be the best. If you really want to use the same bucket, just choose some criteria controlling whether the image in photos should be modified or not (perhaps image file size or something), then ensure that images that have been processed once fall below the threshold. The lambda will still run twice, but the second time it will examine the image and find it has already been processed and thus not process it again. To my knowledge there is no way to suppress the second run of the lambda.
Another option might be to filter based on how the object is created. The following event types can be used in S3. Use one for what your users upload (maybe POST?) and the other for what your Lambda does (maybe PUT?); see the sketch after this list.
s3:ObjectCreated:Put
s3:ObjectCreated:Post
s3:ObjectCreated:Copy
s3:ObjectCreated:CompleteMultipartUpload
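For example, a hedged boto3 sketch that scopes the photos bucket trigger to POST uploads only; the bucket name and function ARN are placeholders, and this assumes the app uploads with POST while the Lambda writes back with PUT.
# Hedged sketch: notify the Lambda only for POST uploads, so the Lambda's
# own PUTs back into the same bucket do not re-trigger it.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="photos",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:compress-photo",  # placeholder ARN
                "Events": ["s3:ObjectCreated:Post"],
            }
        ]
    },
)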
A third bucket would work. Or, for essentially the same effect, rename the file with a prefix after compressing and then check for that prefix before reprocessing the file.
If you name the outputs of your function in a predictable way, you can simply filter out, at the start of the function, any files that were created by your function.
However, as was mentioned previously, using a different bucket for the output would be simpler.
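A rough sketch of that prefix guard in a Python handler; the compressed/ prefix and the compress() helper are placeholders, not part of the original setup.
# Hedged sketch: skip objects the function itself wrote back, based on a key
# prefix it adds when storing the compressed copy.
import urllib.parse
import boto3

s3 = boto3.client("s3")
OUTPUT_PREFIX = "compressed/"  # placeholder prefix for the function's own output

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Guard against recursion: ignore our own output.
        if key.startswith(OUTPUT_PREFIX):
            continue

        obj = s3.get_object(Bucket=bucket, Key=key)
        compressed = compress(obj["Body"].read())  # placeholder compression helper
        s3.put_object(Bucket=bucket, Key=OUTPUT_PREFIX + key, Body=compressed)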
