Move files in S3 to folders based on filename - bash

I have an S3 folder where files are staged by an application.
I need to move these files based on a specified folder structure using the filenames.
The files are named in a particular format:
s3://bucketname/staging/file1_YYYY_MM_DD_HH_MM_SS
s3://bucketname/staging/file1_YYYY_MM_DD_HH_MM_SS
I need to move them to s3 folders of this format:
s3://bucketname/file1/YYYY/MM/DD
I have the following code now to store all the filenames present in the staging folder in a file.
path=s3://bucketname/staging
count=$(s3cmd ls "$path" | wc -l)
echo "$count"
if [[ $count -gt 0 ]]; then
    # Save every object URL (4th column of the listing) to a file
    s3cmd ls -r "$path" | awk '{print $4}' > files_in_bucket.txt
    echo "exists"
else
    echo "do not exist"
fi
I now need to read the filenames and move the files accordingly.
Can you please help.

You can parse the contents of files_in_bucket.txt with sed to produce the output you want (in the session below the sample file is called tests3.txt):
---> cat tests3.txt
s3://bucketname/staging/file1_YYYY_MM_DD_HH_MM_SS
s3://bucketname/staging/file1_YYYY_MM_DD_HH_MM_SS
---> sed -r "s|^(s3://.*)/.*/(.*)_(.*)_(.*)_(.*)_.*_.*_.*$|\1/\2/\3/\4/\5|g" tests3.txt
s3://bucketname/file1/YYYY/MM/DD
s3://bucketname/file1/YYYY/MM/DD
--->
What's happening there is that sed parses each line of tests3.txt. Each parenthesized part of the pattern is a capture group, which the replacement string can then reference as a backreference: \1, \2, \3, etc. The first group captures the bucket URL (everything before the last directory component), the pattern then skips the "staging" component, and the remaining groups pick out the file name and the year, month and day from the object name.
Note that this assumes a very standardized layout of the filenames and your desired output.
Let me know if you have any questions about this or need further help.
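To go from the rewritten paths to actual moves, the same parsing can be done per file in bash. The following is a minimal sketch, not a tested migration script. The function names dest_for and move_staged are made up for illustration. It assumes every object sits directly under staging/, that its name ends in _YYYY_MM_DD_HH_MM_SS with numeric fields, and that the object should keep its original name inside the dated folder. It only prints the s3cmd mv commands so they can be reviewed first.

```bash
#!/usr/bin/env bash

# Print the destination folder for one staged object, e.g.
#   s3://bucketname/staging/file1_2020_01_02_03_04_05
#   -> s3://bucketname/file1/2020/01/02
# Returns non-zero if the name does not follow the expected layout.
dest_for() {
    local src=$1
    [[ $src == */staging/* ]] || return 1
    local bucket=${src%/staging/*}   # everything before /staging/
    local base=${src##*/}            # object name without the prefix
    local re='^(.+)_([0-9]{4})_([0-9]{2})_([0-9]{2})_[0-9]{2}_[0-9]{2}_[0-9]{2}$'
    [[ $base =~ $re ]] || return 1
    printf '%s/%s/%s/%s/%s\n' "$bucket" \
        "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" \
        "${BASH_REMATCH[3]}" "${BASH_REMATCH[4]}"
}

# Dry run: print one "s3cmd mv" command per line of the listing file.
# Assumption: the object keeps its original name in the new folder.
move_staged() {
    local list=$1 src dest
    while IFS= read -r src; do
        dest=$(dest_for "$src") || { echo "skipping: $src" >&2; continue; }
        echo s3cmd mv "$src" "$dest/${src##*/}"
    done < "$list"
}
```

Calling move_staged files_in_bucket.txt prints the commands. Once the output looks right, removing the echo in front of s3cmd would perform the moves.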

Related

Bash to rename multiple files to append different folder names

I am currently analysing genomes assembled with SPAdes.
I currently have 500+ directories from SPAdes named EC18PR-0001, EC18PR-0002, ECPK-0001, ECPK-0002, etc. Inside each directory is a contig file named 'contigs.fasta'.
I was trying to find a way to go through each directory and append each individual directory name to the 'contigs.fasta' file so it would be like: EC18PR-0001-contigs.fasta.
This loop doesn't seem to work:
for file in *EC18
do
sample=${file/.fasta} perl -ane
'if(/\>/){$a++;print ">NODE_$a\n"}else{print;}' ${sample}.fasta >
/pathway/where/files/are/SPADEs/${sample}.fasta
done
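This might work for the EC18 directories (widen the glob to cover the ECPK ones as well). It only prints the new names and does not rename anything yet:
for file in EC18*/*; do
if [[ $file == */contigs.fasta ]]; then
echo "$file" | sed 's#/#-#g'
fi
done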
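To actually produce the renamed files, the directory name can be taken from the path with parameter expansion instead of sed. This is only a sketch under assumptions: prefixed_name and collect_contigs are hypothetical helper names, and copying into a separate output directory (rather than renaming in place) is my choice to keep the SPAdes output untouched.

```bash
#!/usr/bin/env bash

# Print "<dir>-<file>" for a path of the form <dir>/<file>,
# e.g. EC18PR-0001/contigs.fasta -> EC18PR-0001-contigs.fasta
prefixed_name() {
    local path=$1
    local dir=${path%/*}                 # drop the file name
    printf '%s-%s\n' "${dir##*/}" "${path##*/}"
}

# Copy every */contigs.fasta below the current directory into out_dir,
# each under its directory-prefixed name.
collect_contigs() {
    local out_dir=$1 file
    mkdir -p "$out_dir"
    for file in */contigs.fasta; do
        [[ -e $file ]] || continue       # the glob matched nothing
        cp -- "$file" "$out_dir/$(prefixed_name "$file")"
    done
}
```

Run from the directory that holds the sample folders, for example with collect_contigs renamed. Using the glob */contigs.fasta covers the EC18PR-* and ECPK-* directories alike.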

Naming an output file after the input directory

I'm working with some files that are organized within a folder (named RAW) that contains several other folders with different names, all of them containing files whose names end in _1 or _2 followed by the extension (.fq.gz in this case). Below is a schematic for guidance.
RAW/
    FOLDER1/
        FILE_qwer_1.fa.gz
        FILE_qwer_2.fa.gz
    FOLDER2/
        FILE_tyui_1.fa.gz
        FILE_tyui_2.fa.gz
    OTHER1/
        FILE_asdf_1.fa.gz
        FILE_asdf_2.fa.gz
    ...
So I am basically looping over all those directories under RAW and running a script that creates an output file, say out.
What I'm trying to accomplish is to name that out file after the folder it belongs to under $RAW (e.g. FOLDER1.eg after processing FILE_qwer_1.fa.gz and FILE_qwer_2.fa.gz above).
The loop below actually works, but as you can imagine, it depends on how many folders deep I am below the root /, as the option -f is hard-coded for the cut command.
for file1 in ${RAW}/*/*_1.fq.gz; do
file2="${file1/_1/_2}"
out="$(echo $file1 | cut -d '/' -f2)"
bash script_to_be_run.sh $file1 $file2 $out
done
Ideally, the variable out should be named as the replacement of the first * character of the glob used in the loop (e.g. FOLDER1.eg in the first iteration) followed by a custom extension, but I do not really know how to do it, nor if it is possible.
You can use ${var#prefix} to remove a prefix from the start of a variable.
for file1 in "$RAW"/*/*_1.fq.gz; do
file2="${file1/_1/_2}"
out="$(dirname "${file1#"$RAW"/}")" # strip the leading $RAW/, then keep the folder name
bash script_to_be_run.sh "$file1" "$file2" "$out"
done
(It's a good idea to quote variable expansions in case they contain spaces or other special characters: "$file1" is safer than $file1.)
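The same prefix and suffix removal can also build both the output name and the name of the second file. The sketch below is illustrative only: pair_and_out is a made-up helper, /data/RAW is a made-up path, and the .eg extension is taken from the question's example. It removes the trailing _1.fq.gz instead of using ${file1/_1/_2}, because that substitution replaces the first _1 in the whole path and would hit a folder called FOLDER_1.

```bash
#!/usr/bin/env bash

# Given the RAW directory and the path of a *_1.fq.gz file, print the
# matching *_2.fq.gz path and the output name, one per line.
pair_and_out() {
    local raw=$1 file1=$2
    local rel=${file1#"$raw"/}              # FOLDER_1/FILE_qwer_1.fq.gz
    local out=${rel%%/*}.eg                 # first path component plus .eg
    local file2=${file1%_1.fq.gz}_2.fq.gz   # swap only the trailing _1
    printf '%s\n%s\n' "$file2" "$out"
}

pair_and_out /data/RAW /data/RAW/FOLDER_1/FILE_qwer_1.fq.gz
```

Because ${rel%%/*} keeps only the first component after $RAW, the result does not depend on how deep RAW itself sits in the file system.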

Comparing two directories to produce output

I am writing a Bash script that will replace files in folder A (source) with folder B (target). But before this happens, I want to record 2 files.
The first file will contain a list of files in folder B that are newer than folder A, along with files that are different/orphans in folder B against folder A
The second file will contain a list of files in folder A that are newer than folder B, along with files that are different/orphans in folder A against folder B
How do I accomplish this in Bash? I've tried using diff -qr but it yields the following output:
Files old/VERSION and new/VERSION differ
Files old/conf/mime.conf and new/conf/mime.conf differ
Only in new/data/pages: playground
Files old/doku.php and new/doku.php differ
Files old/inc/auth.php and new/inc/auth.php differ
Files old/inc/lang/no/lang.php and new/inc/lang/no/lang.php differ
Files old/lib/plugins/acl/remote.php and new/lib/plugins/acl/remote.php differ
Files old/lib/plugins/authplain/auth.php and new/lib/plugins/authplain/auth.php differ
Files old/lib/plugins/usermanager/admin.php and new/lib/plugins/usermanager/admin.php differ
I've also tried this
(rsync -rcn --out-format="%n" old/ new/ && rsync -rcn --out-format="%n" new/ old/) | sort | uniq
but it doesn't give me the scope of results I require. The struggle here is that the data isn't in the correct format: I just want files, not directories, to show in the text files, e.g.:
conf/mime.conf
data/pages/playground/
data/pages/playground/playground.txt
doku.php
inc/auth.php
inc/lang/no/lang.php
lib/plugins/acl/remote.php
lib/plugins/authplain/auth.php
lib/plugins/usermanager/admin.php
List of files in directory B (new/) that are newer than directory A (old/):
find new -newermm old
This merely runs find and examines the content of new/ as filtered by -newerXY reference with X and Y both set to m (modification time) and reference being the old directory itself.
Files that are missing in directory B (new/) but are present in directory A (old/):
A=old B=new
diff -u <(find "$B" |sed "s:$B::") <(find "$A" |sed "s:$A::") \
|sed "/^+\//!d; s::$A/:"
This sets variables $A and $B to your target directories, then runs a unified diff on their contents (using process substitution to locate with find and remove the directory name with sed so diff isn't confused). The final sed command first matches for the additions (lines starting with a +/), modifies them to replace that +/ with the directory name and a slash, and prints them (other lines are removed).
Here is a bash script that will create the file:
#!/bin/bash
# Usage: bash script.bash OLD_DIR NEW_DIR [OUTPUT_FILE]
# compare given directories
if [ -n "$3" ]; then # the optional 3rd argument is the output file
OUTPUT="$3"
else # if it isn't provided, escape path slashes to underscores
OUTPUT="${2////_}-newer-than-${1////_}"
fi
{
find "$2" -newermm "$1"
diff -u <(find "$2" |sed "s:$2::") <(find "$1" |sed "s:$1::") \
|sed "/^+\//!d; s::$1/:"
} |sort > "$OUTPUT"
First, this determines the output file, which either comes from the third argument or else is created from the other inputs using a replacement to convert slashes to underscores in case there are paths, so for example, running as bash script.bash /usr/local/bin /usr/bin would output its file list to _usr_local_bin-newer-than-_usr_bin in the current working directory.
This combines the two commands and then ensures they are sorted. There won't be any duplicates, so you don't need to worry about that (if there were, you'd use sort -u).
You can get your first and second files by changing the order of arguments as you invoke this script.
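Note that find -newermm compares every file against the modification time of the reference directory itself. If a per-file comparison is wanted instead (each file in one tree against the file of the same relative path in the other), that is a different technique, and one possible sketch follows. The function name newer_or_orphan is hypothetical, only regular files are listed, and content differences between files with identical timestamps are not detected.

```bash
#!/usr/bin/env bash

# List (relative to $2) the regular files in directory $2 that are either
# missing from directory $1 or newer than their counterpart in $1.
newer_or_orphan() {
    local a=$1 b=$2 path rel
    while IFS= read -r -d '' path; do
        rel=${path#"$b"/}                      # path relative to $b
        if [[ ! -e $a/$rel || $path -nt $a/$rel ]]; then
            printf '%s\n' "$rel"
        fi
    done < <(find "$b" -type f -print0) | sort
}
```

With this, newer_or_orphan old new > first.txt and newer_or_orphan new old > second.txt would produce the two lists described in the question.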

How do I change all filenames with a similar but not identical structure?

Due to a variety of complex photo library migrations that had to be done using a combination of manual copying and importing tools that renamed the files, it seems I wound up with a ton of files with a similar structure. Here's an example:
2009-05-05 - 2009-05-05 - IMG_0486 - 2009-05-05 at 10-13-43 - 4209 - 2009-05-05.JPG
What it should be:
2009-05-05 - IMG_0486.jpg
The other files have the same structure, but obviously the individual dates and IMG numbers are different.
Is there any way I can do some command line magic in Terminal to automatically rename these files to the shortened/correct version?
I assume you may have sub-directories and want to find all files inside this directory tree.
This first code block (which you could put in a script) is "safe" (does nothing), but will help you see what would be done.
datep="[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]"
dir="PUT_THE_FULL_PATH_OF_YOUR_MAIN_DIRECTORY"
while IFS= read -r file
do
name="$(basename "$file")"
[[ "$name" =~ ^($datep)\ -\ $datep\ -\ ([^[:space:]]+)\ -\ $datep.*[.](.+)$ ]] || continue
date="${BASH_REMATCH[1]}"
imgname="${BASH_REMATCH[2]}"
ext="${BASH_REMATCH[3],,}"
dir_of_file="$(dirname "$file")"
target="$dir_of_file/$date - $imgname.$ext"
echo "$file"
echo " would be moved to..."
echo " $target"
done < <(find "$dir" -type f)
Make sure the output is what you want and are expecting. I cannot test on your actual files, and if this script does not produce results that are entirely satisfactory, I do not take any responsibility for hair being pulled out. Do not blindly let anyone (including me) mess with your precious data by copy and pasting code from the internet if you have no reliable, checked backup.
Once you are sure, decide if you want to take a chance on some guy's code written without any opportunity for testing and replace the three consecutive lines beginning with echo with this :
mv "$file" "$target"
Note that file names have to match to a pretty strict pattern to be considered for processing, so if you notice that some files are not being processed, then the pattern may need to be modified.
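One portability point: the ${BASH_REMATCH[3],,} lowercasing needs bash 4 or later, while the /bin/bash shipped with macOS is version 3.2. If the script is run there, a variant along these lines might help. It is a sketch only: short_name is a hypothetical helper that applies the same pattern to a single file name and lowercases the extension with tr.

```bash
#!/usr/bin/env bash

# Print the shortened name for one file name, or return 1 if it does not
# match the expected "date - date - IMG - date ... .ext" layout.
short_name() {
    local name=$1
    local datep='[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
    # Same pattern as above, kept in a variable so the spaces need no escaping.
    local re="^($datep) - $datep - ([^[:space:]]+) - $datep.*[.](.+)\$"
    [[ $name =~ $re ]] || return 1
    local ext
    # tr lowercases the extension without relying on bash 4 syntax
    ext=$(printf '%s' "${BASH_REMATCH[3]}" | tr '[:upper:]' '[:lower:]')
    printf '%s - %s.%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "$ext"
}

short_name "2009-05-05 - 2009-05-05 - IMG_0486 - 2009-05-05 at 10-13-43 - 4209 - 2009-05-05.JPG"
```

Inside the loop above, target="$dir_of_file/$(short_name "$name")" could then replace the three BASH_REMATCH assignments.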
Assuming they are all the exact same structure, spaces and everything, you can use awk to split the names up using the spaces as break points. Here's a quick and dirty example:
#!/bin/bash
for file in /path/to/files/*; do
name="$(basename "$file")" #Strip the directory so awk only sees the file name
output="$(echo "$name" | awk '{print $1}')" #Assign the first field to the output variable
output="$output"" - " #Append with [space][dash][space]
output="$output""$(echo "$name" | awk '{print $5}')" #Append with IMG_* field
output="$output""." #Append with period
#Use -F '.' to split by period, and $NF to grab the last field (to get the extension)
output="$output""$(echo "$name" | awk -F '.' '{print $NF}')"
done
From there, something like mv "$file" "/path/to/files/$output" as a final line in the file loop will rename the file ($file already contains the directory). I'd copy a few files into another folder to test with first, since we're dealing with file manipulation.
All the output assigning lines can be consolidated into a single line, as well, but it's less easy to read.
output="$(basename "$file" | awk '{print $1 " - " $5 "."}')""$(basename "$file" | awk -F '.' '{print $NF}')"
You'll still want a file loop, though.
Assuming that you want to convert the filename with the first date and the IMG* name, you can run the following on the folder:
IFS=$'\n'
for file in *
do
printf "mv '%s' '" "$file"
printf '%s' "$(cut -d" " -f1,4,5 <<< "$file")"
printf "'.jpg\n" # end each generated command with a newline so sh sees one mv per line
done | sh

How can I remove hidden characters after a file extension in a variable

When I do
echo $filename
I get
Pew Pew.mp4
However,
echo "${#filename}"
Returns 19
How do I delete all characters after the file extension? It needs to work no matter what the file extension is because the file name in the variable will not always match *.mp4
You should try to find out why you have such strange files before fixing it.
Once you know, you can rename files.
When you just want to rename 1 file, just use the command
mv "Pew Pew.mp4"* "Pew Pew.mp4"
Cutting off the complete extension (with filename=${filename%%.*}) won't help you if you want to use the stripped extension (mp4 or jpg or ...).
EDIT:
I think the OP wants a work-around, so I'll give it another try.
When you have a short list of extensions, you can try
for ext in mpeg mpg jpg avo mov; do
for filename in *.${ext}*; do
mv "${filename%%.*}.${ext}"* "${filename%%.*}.${ext}"
done
done
You can try strings to get the readable string.
echo "${filename}" | strings | wc
# Rename file
mv "${filename}" "$(echo "${filename}"| strings)"
EDIT:
strings gives more than 1 line as a result and unwanted spaces. Since Pew Pew has a space inside, I hope that all spaces, underscores and minus-signs are in front of the dot.
The newname can be constructed with something like
tmpname=$(echo "${filename}"| strings | head -1)
newname=${tmpname% *}
# or another way
newname=$(echo "${filename}"| sed 's/\([[:alnum:]_ -]*\.[[:alnum:]]*\).*/\1/')
# or another (the best?) way (hoping that the first unwanted character is not a space)
newname="${filename%%[^[:alnum:] ._-]*}"
# resulting in
mv "${filename}" "${filename%%[^[:alnum:] ._-]*}"
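Before stripping anything it helps to see what the extra characters are. The sketch below uses printf '%q' to make them visible and then cuts the value at the first character that is not a letter, digit, space, dot, underscore or hyphen. The function names are hypothetical and the sample value (a carriage return plus a terminal escape sequence) is invented, since the question does not say what the hidden characters are.

```bash
#!/usr/bin/env bash

# Show the value with non-printing characters made visible.
show_hidden() {
    printf '%q\n' "$1"
}

# Cut the value at the first character that is not a letter, digit,
# space, dot, underscore or hyphen.
strip_hidden() {
    local filename=$1
    printf '%s\n' "${filename%%[^[:alnum:] ._-]*}"
}

filename=$'Pew Pew.mp4\r\033[0m'     # made-up example with trailing junk
show_hidden "$filename"
clean=$(strip_hidden "$filename")
echo "length before: ${#filename}, after: ${#clean}"
```

As noted above, this cannot remove trailing spaces, because a space is also a legitimate character inside the name.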

Resources