Efficiently moving half a million files based on extension in bash - bash

Scenario:
With the Locky virus on the rampage, the computer center I work for has found that the only method of file recovery is using tools like Recuva. The problem with that is it dumps all the recovered files into a single directory. I would like to move all those files into categories based on their file extensions: all JPGs in one, all BMPs in another, and so on. I have looked around Stack Overflow and, based on various other questions and responses, I managed to build a small bash script (sample provided) that kinda does that, but it takes forever to finish and I think I have the extensions messed up.
Code:
#!/bin/bash
path=$2 # Starting path to the directory of the junk files
var=0 # How many records were processed
SECONDS=0 # reset the clock so we can time the event
clear
echo "Searching $2 for file types and then moving all files into grouped folders."
# Only want to move files from the first level, as directories are OK where they are
for FILE in `find $2 -maxdepth 1 -type f`
do
# Split the EXT off for the directory name using AWK
DIR=$(awk -F. '{print $NF}' <<<"$FILE")
# DEBUG ONLY
# echo "Moving file: $FILE into directory $DIR"
# Make a directory in our path then Move that file into the directory
mkdir -p "$DIR"
mv "$FILE" "$DIR"
((var++))
done
diff=$SECONDS # elapsed time since the clock was reset above
echo "$var files found and organized in:"
echo "$(($diff / 3600)) hours, $((($diff / 60) % 60)) minutes and $(($diff % 60)) seconds."
Question:
How can I make this more efficient while dealing with 500,000+ files? The find takes forever to grab a list of files, and in the loop it attempts to create a directory even if that path already exists. I would like to deal with those two particular aspects of the loop more efficiently, if at all possible.

The bottleneck of any bash script is usually the number of external processes you start. In this case, you can vastly reduce the number of calls to mv you make by recognizing that a large percentage of the files you want to move will have a common suffix like jpg, etc. Start with those.
for ext in jpg mp3; do
mkdir -p "$ext"
# For simplicity, I'll assume your mv command supports the -t option
find "$2" -maxdepth 1 -name "*.$ext" -exec mv -t "$ext" {} +
done
Using -exec mv -t "$ext" {} + means find will pass as many files as possible to each call to mv. For each extension, this means one call to find and a minimum number of calls to mv.
Once those files are moved, then you can start analyzing files one at a time.
for f in "$2"/*; do
ext=${f##*.}
# Probably more efficient to check in-shell if the directory
# already exists than to start a new process to make the check
# for you.
[[ -d $ext ]] || mkdir "$ext"
mv "$f" "$ext"
done
The trade-off occurs in deciding how much work you want to do beforehand identifying the common extensions to minimize the number of iterations of the second for loop.
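If you would rather automate the step of identifying the common extensions, here is a minimal sketch combining both ideas above. It is not the answer's exact method: it assumes GNU find, sed and sort, and a mv that supports -t, and $1 stands for the directory of recovered files.
#!/bin/bash
# Sketch only: files without a dot in their name are skipped, and
# extensions containing glob characters would need extra care.
src=$1 # directory holding the recovered files
# Collect the distinct extensions of the top-level files.
mapfile -t exts < <(find "$src" -maxdepth 1 -type f -name '*.*' | sed 's/.*\.//' | sort -u)
for ext in "${exts[@]}"; do
mkdir -p "$src/$ext"
# One find per extension, batching as many files as possible into each mv.
find "$src" -maxdepth 1 -type f -name "*.$ext" -exec mv -t "$src/$ext" {} +
done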

Related

BASH: copy only images from directory, not copying folder structure and rename copied files in sequential order

I have found an old HDD which was used in the family computer back in 2011. There are a lot of images on it which I would love to copy to my computer and print out in a nice photobook as a surprise to my parents and sister.
However, I have a problem: these photos were taken with older cameras, which means I have a lot of photos with names such as 01, 02, etc. These are now spread across hundreds of sub-folders.
I have already tried the following command but I still get exceptions where the file cannot be copied because one with the same name already exists.
Example: cp: cannot create regular file 'C:/Users/patri/Desktop/Fotoboek/battery.jpg': File exists
The command I execute:
$ find . -type f -regex '.*\(jpg\|jpeg\|png\|gif\|bmp\|mp4\)' -exec cp --backup=numbered '{}' C:/Users/patri/Desktop/Fotoboek \;
I had hoped that the --backup=numbered would solve my problem. (I thought that it would add either a 0,1,2 etc to the filename if it already exists, which it unfortunately doesn't do successfully).
Is there a way to find only media files such as images and videos like I have above and make it so that every file copied gets renamed to a sequential number? So the first copied image would have the name 0, then the 2nd 1, etc.
"Doesn't do successfully" is not a clear problem statement. If I try your find command on sample directories on my system (Linux Mint 20), it works just fine: it creates files with ~1~, ~2~, ... added to the filename (mind you, after the extension).
If you want a quick and dirty solution, you could do:
#!/bin/bash
counter=1
find sourcedir -type f -print0 | while IFS= read -r -d '' file
do
filename=$(basename -- "$file")
extension="${filename##*.}"
fileonly="${filename%.*}"
cp "$file" "targetdir/${fileonly}_${counter}.$extension"
(( counter += 1 ))
done
In this solution the counter is incremented every time a file is copied. The numbers are not sequential for each filename.
Yes, I know it is an anti-pattern and not ideal, but it works.
If you want a "more evolved" version of the previous, where the numbers are sequential, you could do:
#!/bin/bash
find sourcedir -type f -print0 | while IFS= read -r -d '' file
do
filename=$(basename -- "$file")
extension="${filename##*.}"
fileonly="${filename%.*}"
counter=1
while [[ -f "targetdir/${fileonly}_${counter}.$extension" ]]
do
(( counter += 1 ))
done
cp "$file" "targetdir/${fileonly}_${counter}.$extension"
done
This version increments the counter every time a file is found to exist with that counter. Ex. if you have 3 a.jpg files, they will be named a_1.jpg, a_2.jpg, a_3.jpg
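If you also want to keep the original command's restriction to media files, the same filter can be added to either loop above; as a sketch, assuming GNU find's -iregex (which follows the same regex syntax as your original command), the first line of the loop would become:
find sourcedir -type f -iregex '.*\.\(jpg\|jpeg\|png\|gif\|bmp\|mp4\)' -print0 | while IFS= read -r -d '' file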

How to check if there are jpg in a folder and then sort them by date in other folders in Bash?

I am making a Bash script to organize photos that arrive in a folder at different times and on different days (not every day has photos). The photos must be moved to a folder called PhotosOrder, which contains one folder per day named with the date. The task runs on a Synology server and the result is later synchronized with Syncthing to a Windows server. First I must say that I have generalized it, since I must run it on many different folders and am currently duplicating the script for each one. That is surely optimizable, but we will get to that after it works. The script must check whether there are JPGs and list them in an auxiliary variable arr, then check in an if that this list is not empty. If it is empty it does nothing, but if there are JPGs it:
Creates the folder for the current day.
It counts the number of photos already there, because different people add photos at different times and I want to avoid any being overwritten.
It moves the photos, renaming them based on that count and on the name parameters I set at the beginning.
I have to say that I can't simply delete the empty folders afterward, because one of those folders is used by Syncthing for synchronization (I sync that folder with a folder on another server). So far an alternative script works for me that creates a folder every day, whether or not there are photos, and moves them (if there are any), but then I have to delete the empty folders by hand. If I tell the script to delete those empty folders, it also deletes the folder Syncthing uses, and it no longer syncs with the other server (besides, I don't think that's optimal either). Hence the if check to verify there are photos before doing anything.
The script I have for now is this one:
#!/bin/sh
#values that change from each other
FOLDER="/volume1/obraxx/jpg/"
OBRA="-obraxx-"
#Create jpg listing in variable arr:
arr=$$(ls -1 /volume1/obraxx/jpg/*.jpg 2>/dev/null)
#if the variable is not empty, the if is executed:
if [[ !(-z $arr) ]]; then
#Create the folder of the day
d="$(date +"%Y-%m-%d")"
mkdir -p "$FOLDER"/PhotosOrdered/"$d"
DESTINATION="$FOLDER/PhotosOrder/$d/"
#Count existing photos:
a=$$(ls -1 $FOLDER | wc -l)
#Move and rename the photos to the destination folder.
for image in $arr; do
NEW="$PICTURE$a"
mv -n $image $DESTINATION$(date +"%Y%m%d")$NEW.jpg
let a++
done
fi
The shebang line should look like #!/bin/bash, not #!/bin/sh.
Your usage of arrays has syntax problems.
You should not parse the output of ls.
You are counting the existing photos in the source folder. It should be
the destination folder.
You are putting the current date in both folder name and the
file name. (I do not know if this is the requirement.)
The variable OBRA is defined but not used.
The variable PICTURE is not defined.
It is not recommended to use uppercase names for user variables because
they may conflict with system variables.
Then would you please try the following:
#!/bin/bash
prefix="picture" # new file name before the number
src="/volume1/obraxx/jpg/" # source directory
# array "ary" is assigned to the list of jpg files in the source directory
mapfile -d "" -t ary < <(find "$src" -maxdepth 1 -type f -name "*.jpg" -print0)
(( ${#ary[@]} == 0 )) && exit # if the list is empty, do nothing
# first detect the maximum file number in the destination directory
d=$(date +%Y-%m-%d)
dest="$src/PhotosOrder/$d/" # destination directory
mkdir -p "$dest"
max=0 # highest file number found so far in the destination
for f in "$dest"*.jpg; do
if [[ -f $f ]]; then # check if the file exists
n=${f//*$prefix/} # remove everything up to and including the prefix
n=${n%.jpg} # remove suffix leaving a file number
if (( n > max )); then
max=$n
fi
fi
done
a=$(( max + 1 )) # starting (non-overwriting) number in the destination
# move jpg files renaming
for f in "${ary[@]}"; do
new="$prefix$a.jpg"
mv -n -- "$f" "$dest$new"
(( a++ )) # increment the file number
done
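Since the question mentions duplicating the script for every folder, one way to avoid that is to pass the source directory as an argument; a hypothetical sketch, assuming the script above is saved as sort_photos.sh and its src= line is changed to src="$1" (the paths below are only examples, and a per-folder prefix could be passed as a second argument in the same way):
#!/bin/bash
# Hypothetical wrapper: sort_photos.sh is assumed to take the source
# directory as its first argument.
for dir in /volume1/obraxx/jpg/ /volume1/obrayy/jpg/; do
bash sort_photos.sh "$dir"
done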

Bash scanning for filenames containing keywords and move them

I'm looking for a way to constantly scan a folder tree for new subfolders containing MKV/MP4 files. If a file contains a keyword and ends in MP4 or MKV, it will be moved to a defined location matching that keyword. As a bonus, it would delete the folder, and all its leftover contents, where the file previously resided. The idea is to have this run in the background, sort everything where it belongs, and clean up after itself if possible.
example:
Media\anime\Timmy\Timmy_S1E1\Timmy_S1E1_720p.mkv #Found Keyword Timmy, allowed filetype
Move to destination:
Media\series\Timmy\
Delete subfolder:
Media\anime\Timmy\Timmy_S1E1\
I would either do separate scripts for each keyword, or, if possible, have the script match each keyword with a destination
#!/bin/bash
#!/bin/sh
#!/etc/shells/bin/bash
while true
do
shopt -s globstar
start_dir="//srv/MEDIA2/shows"
for name in "$start_dir"/**/*.*; do
# search the directory recursively
done
sleep 300
done
This could be done by:
creating a script that does what you want to do, once.
running the script from cron at a certain interval, say a couple of minutes or a couple of hours, depending on the volume of files you receive.
There is no need for a continually running daemon.
Ex:
#!/bin/bash
start_dir="/start/directory"
if [[ ! -d "$start_dir" ]]
then
echo "ERROR: start_dir ($start_dir) not found."
exit 1
fi
target_dir="/target/directory"
if [[ ! -d "$target_dir" ]]
then
echo "ERROR: target_dir ($target_dir) not found."
exit 1
fi
# Move all MP4 and MKV files to the target directory
find "$start_dir" -type f \( -name "*keyword*.MP4" -o -name "*keyword*.MKV" \) -print0 | while read -r -d $'\0' file
do
# add any processing here...
filename=$(basename "$file")
echo "Moving $filename to $target_dir..."
mv "$file" "$target_dir/$filename"
done
# That being done, all that is left in start_dir can be deleted
find "$start_dir" -type d ! -path "$start_dir" -exec /bin/rm -fr {} \;
Details:
scanning for files is most efficient with the find command
the -print0 with read ... method is to ensure all valid filenames are processed, even if they include spaces or other "weird" characters.
the result of the above code is that each file that matches your keyword, with extension MP4 or MKV, will be processed once.
you can then use "$file" to access the file being processed in the current loop.
make sure you ALWAYS double quote $file, otherwise any weird filename will break your code. Well, you should always double quote your variables anyway.
more complex logic can be added for your specific needs. Ex. create the target directory if it does not exist. Create a different target directory depending on your keyword. etc.
to delete all sub-directories under $start_dir, I use find. Again this will process weird directory names.
One point: some will argue that it could all be done in one find command with the -exec option (a sketch of that follows after this list). True, but IMHO the version with the while loop is easier to code, understand, debug and learn.
And this construct is good to have in your bash toolbox.
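For reference, a rough single-find equivalent of the loop above (assuming your mv supports the -t option, and giving up the per-file echo) would be:
find "$start_dir" -type f \( -name "*keyword*.MP4" -o -name "*keyword*.MKV" \) -exec mv -t "$target_dir" {} +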
When you create a script, only one #! line is needed.
And I fixed the indentation in your question, much easier to read your code properly indented and formatted (see the edit help in the question editor).
Last point to discuss: let's say you have a LARGE number of directories and files to process, and it is possible that new files are added while the script is running. Ex. you are moving many MP4 files, and while it is doing that, new files are deposited in the directories. Then when you do the deletion you could potentially lose files.
If such a case is possible, you could add a check for new files just before you do the /bin/rm, it would help. To be absolutely certain, you could setup a script that processes 1 file, and have it triggered by inotify. But that is another ball game, more complicated and out of scope for this answer.

Rename files within folders to folder names while retaining extensions

I have a large repository of media files that follow torrent naming conventions, which is unpleasant to read. At one point, I had properly named the folders that contain said files, but now want to dump all the .avi, .mkv, etc. files into my main media directory using a bash script.
Overview:
Current directory tree:
Proper Movie Title/
->Proper.Movie.Title.2013.avi
->Proper.Movie.Title.2013.srt
Title 2/
->Title2[proper].mkv
Movie- Epilogue/
->MOVIE EPILOGUE .AVI
Media Movie/
->MEDIAMOVIE.CD1.mkv
->MEDIAMOVIE.CD2.mkv
.
.
.
Desired directory tree:
Proper Movie Title/
->Proper Movie Title.avi
->Proper Movie Title.srt
Title 2.mkv
Movie- Epilogue.avi
Media Movie/
->Media Movie.cd1.mkv
->Media Movie.cd2.mkv
Though this would be the ideal, my main wish is for directories containing only a single movie file to have that file renamed and moved into the parent directory.
My current approach is to use a double for loop in a .sh file, but I'm currently having a hard time keeping new bash knowledge in my head.
Help would be appreciated.
My current code (Just to get access to the internal movie files):
#!/bin/bash
FILES=./*
for f in $FILES
do
if [[ -d $f ]]; then
INFILES=$f/*
for file in $INFILES
do
echo "Processing >$file< folder..."
done
#cat $f
fi
done
Here's something simple:
find * -maxdepth 1 -type f | while read -r file
do
dirname="$(dirname "$file")"
new_name="${dirname##*/}"
file_ext=${file##*.}
if [ -n "$file_ext" -a -n "$dirname" -a -n "$new_name" ]
then
echo "mv '$file' '$dirname/$new_name.$file_ext'"
fi
done
The find * says to run find on all items in the current directory. The -type f says you only are interested in files, and -maxdepth 1 limits the depth of the search to the immediate directory.
The ${file##*.} is using a pattern match. The ## removes the longest left-hand match of *., which strips everything up to and including the last dot and leaves just the file extension.
The dirname="$(dirname "$file")" gets the directory name.
Note quotes everywhere! You have to be careful about white spaces.
By the way, I echo instead of doing the actual move. I can pipe the output to a file, examine that file and make sure everything looks okay, then run that file as a shell script.
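As a usage example of that review-then-run workflow (the script name rename_to_folders.sh is just a placeholder for wherever you saved the loop above):
./rename_to_folders.sh > moves.sh
less moves.sh # inspect the generated mv commands
bash moves.sh # run them once they look right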

Bash: Maintaining a set of files and their gzipped equivalent

I have a directory tree in which there are some files and some subdirectories.
/
/file1.txt
/file2.png
/dir1
/subfile1.gif
The objective is to have a script that generates a gzipped version of each file and saves it next to each file, with an added .gz suffix:
/
/file1.txt
/file1.txt.gz
/file2.png
/file2.png.gz
/dir1
/subfile1.gif
/subfile1.gif.gz
This would handle the creation of new .gz files.
Another part is deletion: whenever a non-gzipped file is deleted, the script would need to remove the now-orphaned .gz version when it runs.
The last and trickiest part is modification: Whenever some (non-gzipped) files are changed, re-running the script would update the .gz version of only those changed files, based on file timestamp (mtime) comparison between a file and its gzipped version.
Is it possible to implement such a script in bash?
Edit: The goal of this is to have prepared compressed copies of each file for nginx to serve using the gzip_static module. It is not meant to be a background service which automatically compresses things as soon as anything changes, because nginx's gzip_static module is smart enough to serve content from the uncompressed version if no compressed version exists, or if the uncompressed version's timestamp is more recent than the gzipped version's timestamp. As such, this is a script that would run occasionally, whenever the server is not busy.
Here is my attempt at it:
#!/bin/bash
# you need to clean up .gz files when you remove things
find . -type f -perm -o=r -not -iname \*.gz | \
while read -r x
do
if [ "$x" -nt "$x.gz" ]; then
gzip -cn9 "$x" > "$x.gz"
chown --reference="$x" "$x.gz"
chmod --reference="$x" "$x.gz"
touch --reference="$x" "$x.gz"
if [ `stat -c %s "$x.gz"` -ge `stat -c %s "$x"` ]; then
rm "$x.gz"
fi
fi
done
Stole most of it from here: https://superuser.com/questions/482787/gzip-all-files-without-deleting-them
Changes include:
skipping .gz files
adding -9 and -n to make the files smaller
deleting files that ended up larger (unfortunately this means they will be retried every time you run the script.)
made sure the owner, permissions, and timestamp on the compressed file matches the original
only works on files that are readable by everyone
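The deletion part of the question (removing a .gz file whose original has since been deleted) is not covered by the script above; a minimal sketch of that cleanup pass, in the same style, could be (note this assumes every .gz in the tree was produced by the script):
# Remove orphaned .gz files whose original no longer exists.
find . -type f -name '*.gz' | \
while read -r g
do
[ -e "${g%.gz}" ] || rm "$g"
done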
Something like this, maybe?
#!/bin/sh
case $1 in
*.gz )
# If it's an orphan, remove it
test -f "${1%.gz}" || rm "$1" ;;
# Otherwise, will be handled when the existing parent is handled
* )
make -f - <<'____HERE' "$1.gz"
%.gz: %
# Make sure you have literal tab here!
gzip -9 <$< >$@
____HERE
;;
esac
If you have a Makefile already, by all means use a literal file rather than a here document.
Integrating with find left as an exercise. You might want to accept multiple target files and loop over them, if you want to save processes.
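One possible way to drive it from find, one process per file (the script name gz-maint.sh is just a placeholder), would be:
find . -type f -exec ./gz-maint.sh {} \;
Each .gz file gets the orphan check, and every other file is (re)compressed by make only when it is newer than its existing .gz counterpart.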
