I'd like to ask for help with changing uppercase file names in AWS S3 to lowercase.
I have two files. One is a list of file names from an AWS S3 bucket with uppercase letters, like so (let's call it uppercase.txt):
ABc.txT
aBCd.pHp
AbCdE.jpg
and a second file with the same names translated to lowercase (lowercase.txt, easily done with tr '[:upper:]' '[:lower:]'):
abc.txt
abcd.php
abcde.jpg
I tried a bunch of for loops; the command I wish to repeat is aws s3 mv $first_list_value $second_list_value.
Tried this:
for i in `cat uppercase_file.txt`; do aws s3 mv $i `cat lowercase_file.txt`; done
No dice :-( The AWS S3 API is limited and doesn't take well to most Linux commands.
Yelp?
Something like this should work:
paste uppercase_file.txt lowercase_file.txt | while read -r uc lc
do
    aws s3 mv "$uc" "$lc"
done
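One thing to check: aws s3 mv operates on s3:// URIs, so if the two lists hold bare object keys rather than full paths, you would prefix the bucket yourself. A minimal sketch, assuming the keys are bare and using a hypothetical bucket name in $bucket:
bucket=my-bucket   # hypothetical bucket name
paste uppercase_file.txt lowercase_file.txt | while read -r uc lc
do
    # both source and destination need the full s3:// URI
    aws s3 mv "s3://$bucket/$uc" "s3://$bucket/$lc"
done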
I'm working with some files organized within a folder (named RAW) that contains several other folders with different names, all of them containing files ending with a suffix like _1 or _2 before the extension (.fq.gz in this case). Below is a sketch of the layout for guidance.
RAW/
    FOLDER1/
        FILE_qwer_1.fq.gz
        FILE_qwer_2.fq.gz
    FOLDER2/
        FILE_tyui_1.fq.gz
        FILE_tyui_2.fq.gz
    OTHER1/
        FILE_asdf_1.fq.gz
        FILE_asdf_2.fq.gz
    ...
So I am basically running a loop over all those directories under RAW and running a script that creates an output file, say out.
What I'm trying to accomplish is to name that out file after the folder it belongs to under $RAW (e.g. FOLDER1 after processing FILE_qwer_1.fq.gz and FILE_qwer_2.fq.gz above).
The loop below actually works, but as you can imagine, it depends on how deep $RAW sits relative to the root /, because the -f option to the cut command is hard-coded.
for file1 in ${RAW}/*/*_1.fq.gz; do
    file2="${file1/_1/_2}"
    out="$(echo $file1 | cut -d '/' -f2)"
    bash script_to_be_run.sh $file1 $file2 $out
done
Ideally, the variable out should be set to whatever the first * of the glob matched in that iteration (e.g. FOLDER1 in the first iteration), followed by a custom extension, but I do not really know how to do that, nor whether it is possible.
You can use ${var#prefix} to remove a prefix from the start of a variable.
for file1 in ${RAW}/*/*_1.fq.gz; do
    file2="${file1/_1/_2}"
    out="$(dirname "${file1#$RAW/}")"   # strip $RAW/ from the front, then keep only the directory part
    bash script_to_be_run.sh "$file1" "$file2" "$out"
done
(It's a good idea to quote variable expansions in case they contain spaces or other special characters: "$file1" is safer than $file1.)
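To make the expansion concrete, here is a quick sketch of what the two steps produce (the paths below are hypothetical):
RAW=/data/RAW
file1=/data/RAW/FOLDER1/FILE_qwer_1.fq.gz
echo "${file1#$RAW/}"                  # FOLDER1/FILE_qwer_1.fq.gz  (prefix removed)
echo "$(dirname "${file1#$RAW/}")"     # FOLDER1                    (directory part only)
out="$(dirname "${file1#$RAW/}").out"  # append a custom extension; ".out" is just an example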
I am attempting to use either rsync or cp in a for loop to copy files matching a list of 200 names stored, one per line, in a .txt file. The files have the .pdbqt extension and sit in a series of subdirectories under one parent folder. The .txt file looks as follows:
file01
file02
file08
file75
file45
...
I have attempted to use rsync with the following command:
rsync -a /home/ubuntu/Project/files/pdbqt/*/*.pdbqt \
--files-from=/home/ubuntu/Project/working/output.txt \
/home/ubuntu/Project/files/top/
When I run the rsync command I receive:
rsync error: syntax or usage error (code 1) at options.c(2346) [client=3.1.2]
I have written a bash script as follows in an attempt to get that to work:
#!/bin/bash
for i in "$(cat /home/ubuntu/Project/working/output.txt | tr '\n' '')"; do
    cp /home/ubuntu/Project/files/pdbqt/*/"$i".pdbqt /home/ubuntu/Project/files/top/;
done
I understand cat isn't a great command to use, but I could not figure out an alternative, as I am still new to bash. Running that I get the following errors:
tr: when not truncating set1, string2 must be non-empty
cp: cannot stat '/home/ubuntu/Project/files/pdbqt/*/.pdbqt': No such file or directory
I assume the cp error is thrown as a result of the tr error, but I am not sure how else to get rid of the \n that is read from the newline-separated list.
The expected result is that, of the 12000 .pdbqt files in the subdirectories of /pdbqt/, the 200 files listed in output.txt would be copied into the /top/ directory.
for loops are good when your data is already in shell variables. When reading in data from a file, while ... read loops work better. In your case, try:
while IFS= read -r file; do cp -i -- /home/ubuntu/Project/files/pdbqt/*/"$file".pdbqt /home/ubuntu/Project/files/top/; done </home/ubuntu/Project/working/output.txt
or, if you find the multiline version more readable:
while IFS= read -r file
do
    cp -i -- /home/ubuntu/Project/files/pdbqt/*/"$file".pdbqt /home/ubuntu/Project/files/top/
done </home/ubuntu/Project/working/output.txt
How it works
while IFS= read -r file; do
This starts a while loop that reads one line at a time. IFS= tells bash not to strip leading or trailing whitespace from the line, and -r tells read not to mangle backslashes. Each line is stored in the shell variable called file.
cp -i -- /home/ubuntu/Project/files/pdbqt/*/"$file".pdbqt /home/ubuntu/Project/files/top/
This copies the file. -i tells cp to ask before overwriting an existing file.
done </home/ubuntu/Project/working/output.txt
This marks the end of the while loop and tells the shell to take the loop's input from /home/ubuntu/Project/working/output.txt.
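One caveat: if a name in output.txt has no matching .pdbqt file, the glob stays unexpanded and cp complains about the literal pattern. A variation that skips (and reports) such names, sketched with the same paths as above:
while IFS= read -r file
do
    # expand the glob into an array; if nothing matched, the literal pattern is left behind
    matches=(/home/ubuntu/Project/files/pdbqt/*/"$file".pdbqt)
    if [ ! -e "${matches[0]}" ]; then
        echo "no match for $file" >&2
        continue
    fi
    cp -i -- "${matches[@]}" /home/ubuntu/Project/files/top/
done </home/ubuntu/Project/working/output.txt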
Do dirs in Project/files/pdbqt/* or files *.pdbqt have dashes (-) in the name?
The error points at this line in the rsync source code, options.c:
"Your options have been rejected by the server.\n"
which makes me think rsync is interpreting some of the files/dirs expanded from your glob as rsync options.
for i in $( < /home/ubuntu/Project/working/output.txt LC_CTYPE=C tr '\n' ' ' )
do
    cp /home/ubuntu/Project/files/pdbqt/*/"${i}.pdbqt" /home/ubuntu/Project/files/top/
done
I think the tr in your cat pipeline is missing a space as the replacement set:
cat /home/ubuntu/Project/working/output.txt | tr '\n' ' '
John1024's use of while and read is better than mine.
You are thinking along the right lines with rsync. rsync provides the --files-from="yourfile" option, which syncs all the files listed in your text file (relative to the base directory you specify next) to the destination (either host:/dest/path or a local /dest/path alone).
You will want to add --no-R to tell rsync not to use relative filenames, since --files-from= takes the base path as the next argument. For example, to transfer all the files in your text file to some remote host, where the files are located in the current directory, you could use:
rsync -uai --no-R --files-from="textfile" ./ host:/dest/path
The command essentially says: read the names to transfer from textfile, look for them under ./ (the current directory), and transfer them to host:/dest/path on the host you specify. See man 1 rsync for full details.
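Applied to the layout in the question, one wrinkle is that --files-from expects paths relative to the source directory, while output.txt only holds bare names. A sketch that resolves the names first, following the --no-R approach above (the temporary list file name is hypothetical):
cd /home/ubuntu/Project/files/pdbqt
# turn each bare name into a path relative to the pdbqt/ base directory
while IFS= read -r name
do
    printf '%s\n' */"$name".pdbqt
done </home/ubuntu/Project/working/output.txt > /tmp/pdbqt_list.txt
# --no-R flattens the copies into top/ instead of recreating the subdirectories
rsync -uai --no-R --files-from=/tmp/pdbqt_list.txt ./ /home/ubuntu/Project/files/top/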
I am listing all the files in an S3 bucket and writing them to a text file. For example, my bucket has the following list of files:
text.zip
fixed.zip
hello.zip
good test.zip
I use the following code:
fileList=$(aws s3 ls s3://$inputBucketName/ | awk '{print $4}')
if [ ! -z "$fileList" ]
then
    $AWS_CLI s3 ls s3://$inputBucketName/ | awk '{print $1,$2,$4}' > s3op.txt
    sort -k1,1 -k2 s3op.txt > s3op_srt.txt
    awk '{print $3}' s3op_srt.txt > filesOrder.txt
fi
cat filesOrder.txt;
After this, I iterate over the file names from the list I created (I delete each file in S3 at the end of the loop, so it won't be processed again):
fileName=`head -1 filesOrder.txt`
the file names come out like this:
text.zip
fixed.zip
hello.zip
good
So the problem is that file names containing spaces are not listed correctly.
Because the file name is returned as "good" instead of "good test.zip", the script cannot delete that file from S3.
The expected result is:
text.zip
fixed.zip
hello.zip
good test.zip
I used the following command to delete files in S3:
aws s3 rm s3://$inputBucketName/$fileName
Put the full S3 path in double quotes.
For example:
aws s3 rm "s3://test-bucket/good test.zip"
In your case, it would be:
aws s3 rm "s3://$inputBucketName/$fileName"
This way, even if fileName contains spaces, the file will be deleted.
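Note that quoting only fixes the rm call; the name is already truncated earlier, because awk '{print $4}' splits on whitespace. One way to keep the whole name (a sketch against the usual date / time / size / name columns of aws s3 ls) is to print everything from the fourth field onwards:
aws s3 ls "s3://$inputBucketName/" | awk '{print substr($0, index($0, $4))}' > filesOrder.txt
The same substr/index trick can replace the '{print $4}' and '{print $3}' steps in the original pipeline so that "good test.zip" survives intact.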
How do I pass the value of the variable to a command in a bash script?
Specifically, I want to create an AWS S3 bucket with a (partially) random name. This is what I've got so far:
#!/bin/bash
random_id=$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 8 | head -n 1)
bucket_name=s3://mybucket-$random_id
echo Bucket name: ${bucket_name}
aws s3 mb ${bucket_name}
The output I get:
Bucket name: s3://mybucket-z4nnli2k
Parameter validation failed:cket-z4nnli2k
": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$"
The bucket name is generated correctly, but aws s3 mb ${bucket_name} fails. If I just run aws s3 mb s3://mybucket-z4nnli2k then the bucket is created, so I assume that aws s3 mb ${bucket_name} is not the correct way to pass the value of the bucket_name to the aws s3 mb command.
It must be something obvious but I have almost zero experience with shell scripts and can't figure it out.
How do I pass the value of bucket_name to the aws s3 mb command?
Thanks to the comments above, this is what I got working:
#!/bin/bash
random_id=$(cat /dev/urandom | tr -dc 'a-z0-9' | fold -w 8 | head -n 1)
bucket_name=s3://mybucket-$random_id
echo Bucket name: ${bucket_name}
aws s3 mb "${bucket_name}"
I also had to run dos2unix on the script; apparently it had Windows-style (CRLF) line endings.
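That would also explain the garbled error message: a stray carriage return at the end of the bucket name makes the terminal jump back to the start of the line, so the regex text overwrites the beginning of the output. A quick way to confirm this kind of problem (the script name here is hypothetical):
file create_bucket.sh            # reports "... with CRLF line terminators" if affected
cat -v create_bucket.sh | head   # stray carriage returns show up as ^M at line ends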
Is there an easy/efficient way to get the duration of about 20k videos stored in an S3 bucket?
Right now, I tried mounting the bucket on OS X using ExpanDrive and running a bash script that calls mediainfo, but I always get an "Argument list too long" error.
This is the script
#! /bin/bash
# get video length of file.
for MP4 in `ls *mp4`
do
    mediainfo $MP4 | grep "^Duration" | head -1 | sed 's/^.*: \([0-9][0-9]*\)mn *\([0-9][0-9]*\)s/00:\1:\2/' >> results.txt
done
# END
ffprobe can read videos from various sources. HTTP is also supported, which should help you here, as it lifts the burden of transferring all the files to your computer.
ffprobe -i http://org.mp4parser.s3.amazonaws.com/examples/Cosmos%20Laundromat%20faststart.mp4
Even if your S3 bucket is not public, you can easily generate signed URLs, which allow time-limited access to an object, if security is a concern.
Use the Bucket GET (list) operation to get all the files in the bucket, then run ffprobe, with appropriate filtering, on each file.
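A sketch of how that could be wired together with the AWS CLI (the bucket name and output file are hypothetical; aws s3 presign generates the signed URLs mentioned above):
bucket=my-video-bucket
aws s3 ls "s3://$bucket/" --recursive | awk '{print substr($0, index($0, $4))}' |
while IFS= read -r key
do
    # time-limited signed URL, so ffprobe can read straight from S3 over HTTP
    url=$(aws s3 presign "s3://$bucket/$key" --expires-in 3600)
    printf '%s\t' "$key"
    ffprobe -v error -show_entries format=duration \
            -of default=noprint_wrappers=1:nokey=1 "$url"
done > durations.txt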
This answers your question but the problem you are having is well explained by Rambo Ramone's answer.
Try using xargs instead of the for loop. The backticks run the command and substitute its output in place; 20K file names are probably too much for your shell. Note that grep and head need to run once per file rather than once over the whole stream, so wrap the per-file pipeline in sh -c:
ls *.mp4 | xargs -I{} sh -c 'mediainfo "{}" | grep "^Duration" | head -1 | sed "s/^.*: \([0-9][0-9]*\)mn *\([0-9][0-9]*\)s/00:\1:\2/"' >> results.txt
If mounting the S3 bucket and running mediainfo against a video file to retrieve its metadata (including the duration header) results in a complete download of the video from S3, then that is probably a bad way to do this, especially if you're going to do it again and again.
For new files being uploaded to S3, I would pre-calculate the duration (using mediainfo or whatever) and upload the calculated duration as S3 object metadata.
Or you could use a Lambda function that executes when a video is uploaded and have it read the relevant part of the video file, extract the duration header, and store it back in the S3 object metadata. For existing files, you could programmatically invoke the Lambda function against the existing S3 objects. Or you could simply do the upload process again from scratch, triggering the Lambda.
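For the upload-time variant described above, a sketch with the AWS CLI (the bucket and file names are hypothetical; the duration could come from mediainfo or ffprobe as shown earlier):
# compute the duration locally, then attach it as object metadata on upload
duration=$(ffprobe -v error -show_entries format=duration \
           -of default=noprint_wrappers=1:nokey=1 video.mp4)
aws s3 cp video.mp4 "s3://my-video-bucket/video.mp4" --metadata "duration=$duration"
# later, read it back without downloading the video
aws s3api head-object --bucket my-video-bucket --key video.mp4 --query Metadata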