How can I list the files which are contained in my output Bucket in a Shell Script?
ls ${OUTPUT1_STAGING_DIR}
does not work, as I get the message that there's no file or directory by this name.
I am sure there is an easy way to do this but I can't seem to find a solution.
From my experience using DataPipeline, that is not supported. You can only read from the input bucket directory. The output bucket directory is just a place you can write files to; they are copied over into S3 afterwards.
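If you just want to see what ends up in the output, a workaround (only a sketch, assuming the AWS CLI is available and you know the final S3 output path, which is a placeholder below) is to write into the staging directory as usual and list the real S3 location after the activity finishes:

# Inside the activity: write into the staging directory as usual
echo "result" > "${OUTPUT1_STAGING_DIR}/result.txt"

# After the step completes, list the actual S3 location instead
# (s3://my-output-bucket/path/ is a hypothetical placeholder)
aws s3 ls s3://my-output-bucket/path/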
I have a bash script file where each line is a curl command to download a file. This bash file is in a Google bucket.
I would like to execute the file either directly from the storage and copy its downloaded contents there or execute it locally and directly copy its content to the bucket.
Basically, I do not want to have these files on my local machine. I have tried things along these lines, but it either failed or simply downloaded everything locally.
gsutil cp gs://bucket/my_file.sh - | bash gs://bucket/folder_to_copy_to/
Thank you!
To do so, the bucket needs to be mounted on the pod (the pod would see it as a directory).
If the bucket supports NFS, you would be able to mount it as shown here.
Also, there is another way as shown in this question.
Otherwise, you would need to copy the script to the pod, run it, then upload the generated files to the bucket, and finally clean everything up; a sketch of that approach is below.
A better option is to use Filestore, which can be easily mounted using CSI drivers as mentioned here.
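A minimal sketch of that manual copy-run-upload approach, assuming gsutil is installed where the script runs and the account has write access to the bucket (the bucket and folder names are the placeholders from the question):

WORKDIR=$(mktemp -d)
cd "$WORKDIR"

# Fetch and run the script; its curl commands download into the temp directory
gsutil cp gs://bucket/my_file.sh .
bash my_file.sh
rm my_file.sh

# Upload everything it produced, then clean up locally
gsutil -m cp -r . gs://bucket/folder_to_copy_to/
cd / && rm -rf "$WORKDIR"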
There are files in an AWS S3 bucket that I would like to download; they all have the same name but are in different subfolders. No credentials are required to connect to this bucket and download from it. I would like to download all the files called "B01.tif" in s3://sentinel-cogs/sentinel-s2-l2a-cogs/7/V/EG/, and save them with the name of the subfolder they are in (for example: S2A_7VEG_20170205_0_L2AB01.tif).
Path example:
s3://sentinel-cogs/sentinel-s2-l2a-cogs/7/V/EG/2017/2/S2A_7VEG_20170205_0_L2A/B01.tif
I was thinking of using a bash script that parses the output of ls, downloads each file with cp, and saves it on my PC with a name generated from the path.
Command to use ls:
aws s3 ls s3://sentinel-cogs/sentinel-s2-l2a-cogs/7/V/EG/2017/2/ --no-sign-request
Command to download a single file:
aws s3 cp s3://sentinel-cogs/sentinel-s2-l2a-cogs/7/V/EG/2017/2/S2A_7VEG_20170205_0_L2A/B01.tif --no-sign-request B01.tif
Attempt to download multiple files:
VAR1=B01.tif
for a in s3://sentinel-cogs/sentinel-s2-l2a-cogs/7/V/EG/:
  for b in s3://sentinel-cogs/sentinel-s2-l2a-cogs/7/V/EG/2017/:
    for c in s3://sentinel-cogs/sentinel-s2-l2a-cogs/7/V/EG/2017/2/:
      NAME=$(aws s3 ls s3://sentinel-cogs/sentinel-s2-l2a-cogs/7/V/EG/$a$b$c | head -1)
      aws s3 cp s3://sentinel-cogs/sentinel-s2-l2a-cogs/7/V/EG/$NAME/B01.tif --no-sign-request $NAME$VAR1
    done
  done
done
I don't know if there is a simple way to go through every subfolder automatically and save the files directly. I know my ls command is broken, because if there are multiple subfolders it will only take the first one as a variable.
It's easier to do this in a programming language than as a shell script.
Here's a Python script that will do it for you:
import boto3
from botocore import UNSIGNED
from botocore.config import Config

BUCKET = 'sentinel-cogs'
PREFIX = 'sentinel-s2-l2a-cogs/7/V/EG/'
FILE = 'B01.tif'

# Unsigned requests, since the bucket is public and no credentials are needed
s3_resource = boto3.resource('s3', config=Config(signature_version=UNSIGNED))

for obj in s3_resource.Bucket(BUCKET).objects.filter(Prefix=PREFIX):
    if obj.key.endswith(FILE):
        # e.g. '2017/2/S2A_7VEG_20170205_0_L2A/B01.tif' -> '2017_2_S2A_7VEG_20170205_0_L2A_B01.tif'
        target = obj.key[len(PREFIX):].replace('/', '_')
        obj.Object().download_file(target)
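If you would rather stay in bash, a rough (untested) sketch of the same idea: list every key under the prefix recursively, keep only the B01.tif ones, and rebuild a flat local name from the path. It assumes the AWS CLI is installed; --no-sign-request keeps the access anonymous, as in the question:

PREFIX=sentinel-s2-l2a-cogs/7/V/EG/
aws s3 ls "s3://sentinel-cogs/${PREFIX}" --recursive --no-sign-request \
  | awk '{print $4}' \
  | grep 'B01\.tif$' \
  | while read -r key; do
      # Strip the common prefix and turn the remaining path into a flat file name
      target=$(echo "${key#$PREFIX}" | tr '/' '_')
      aws s3 cp "s3://sentinel-cogs/${key}" "${target}" --no-sign-request
    done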
I want to know the path to the latest file under each directory using gsutil ls.
shell script
for dir in "${dir_list[@]}"; do
  file+=$(gsutil ls -R "${dir}" | tail -n 1)
done
Running the command in a loop like this is very slow. I want the final output to be:
gs://bucket/dir_a/latest.txt
gs://bucket/dir_b/latest.txt
gs://bucket/dir_c/latest.txt
gs://bucket/dir_d/latest.txt
Is there another way?
There isn't another strategy, for a good reason: directories don't exist in GCS. So you need to scan all the files, get their metadata, pick the one that is the latest, and do that for each distinct prefix.
A prefix is what you would call a directory ("/path/to/prefix/"). That's why you can only search by prefix in GCS, not by file pattern.
So you could build a custom app which, for each distinct prefix (directory), creates a concurrent process (fork) dedicated to that prefix. That way you get parallelization. It's not so simple to write, but you can! A rough sketch is below.
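A rough shell sketch of that parallel idea, assuming dir_list already holds the gs:// prefixes to scan; it just fans the per-directory listing out over several processes (tune -P to taste) and is not a tested solution:

# Run one gsutil listing per prefix, 8 at a time, instead of a serial loop
printf '%s\n' "${dir_list[@]}" \
  | xargs -P 8 -I {} sh -c 'gsutil ls -R "$1" | tail -n 1' _ {}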
Basically I have a .bz2.gz.bz2 file which on extraction gives a .bz2.gz file, and on extraction again gives a .bz2 file. Inside this .bz2 file is my txt file, which I want to search with the grep command. I have searched for this, but I only found the bzgrep command, which searches inside a .bz2 file and not the enclosing .gz.bz2 file, so it gives me no results.
Is there a command on a Unix system which will recursively search a compressed archive for nested compressed archives and return results only when it finds the txt file inside?
P.S.: the txt file may be nested in the archive up to 10 levels deep. I want the command to find the txt file recursively and search it for the required string. There will be nothing other than an archive inside each archive until the level of the txt file.
I'm not sure I fully understand but maybe this will help:
for i in /path/to/each/*.tar.bz2; do
  tar -xvjf "$i" -C /path/to/save/in
  rm "$i"
done
This extracts each `tar.bz2` archive into the target directory and then removes the original archive.
Thanks for sharing your question.
There are a couple of strange things with it though:
It makes no sense to have a .bz2.gz.bz2 file, so have you created this file yourself? If so, I'd advise you to reconsider doing so in that manner.
Also, you mention there is a .bz2 that would apparently contain different archives, but a .bz2 can only contain one single file by design. So if it contains archives it is probably a .tar.bz2 file in which the tar-file holds the actual archives.
In answer to your question: why can't you write a simple shell script that unpacks your .bz2.gz.bz2 into a .bz2.gz, then into a .bz2, and then runs your bzgrep command on that file? A rough sketch of that idea follows below.
I do not understand where exactly it is that you seem to get stuck.
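A rough sketch of that unpack-then-grep idea, assuming the nesting really is only bzip2/gzip layers as described; the file name and search string are hypothetical placeholders:

src="my_file.bz2.gz.bz2"     # hypothetical input file
pattern="search string"      # hypothetical string to grep for

cur="$src"
while :; do
  case "$(file -b "$cur")" in
    bzip2*) next=$(mktemp); bzcat "$cur" > "$next" ;;
    gzip*)  next=$(mktemp); zcat "$cur" > "$next" ;;
    *)      break ;;         # no compression layer left: this should be the txt file
  esac
  [ "$cur" != "$src" ] && rm -f "$cur"   # drop intermediate temp files
  cur="$next"
done

grep "$pattern" "$cur"
[ "$cur" != "$src" ] && rm -f "$cur"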
I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:
/tmp/data/identifier1/identifier1-0,0001
/tmp/data/identifier1/identifier1-0,0002
/tmp/data/identifier2/identifier2-0,0001
/tmp/data/identifier3/identifier3-0,0001
I'm attempting to use S3DistCp to copy these files to S3. I am using the --groupBy .*(identifier[0-9]).* option to combine files based on the identifier. The combination works, but when copying to S3, the folders are also copied. The end output is:
/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
Is there a way to copy these files without that first folder? Ideally, my output in S3 would look like:
/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3
Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3. Is that a reasonable solution?
Thanks!
The solution I've arrived at is to use distcp to bring these files out of their directories before using s3distcp:
hadoop distcp -update /tmp/data/** /tmp/grouped
Then, I changed the s3distcp script to move data from /tmp/grouped into my S3 bucket.
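For reference, a sketch of that two-step flow (the exact s3-dist-cp invocation depends on your EMR setup, so treat the second command as an assumption rather than a tested line):

# Step 1: flatten one directory level so the per-identifier folders disappear
hadoop distcp -update /tmp/data/** /tmp/grouped

# Step 2: group the part files and push them straight to the bucket root
s3-dist-cp --src hdfs:///tmp/grouped \
           --dest s3://s3bucket/ \
           --groupBy '.*(identifier[0-9]).*'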
Using distcp before s3distcp is really expensive. Another option is to create a manifest file listing all your files and pass its path to s3distcp. In the manifest you can define the "base name" of each file. If you need an example of a manifest file, just run s3distcp on any folder with the --outputManifest argument.
More information can be found here.