Keep only one version of each file (bash)

I want to remove redundant files in a folder. Something like
cat_1.jpg
cat_2.jpg
cat_3.jpg
dog_10.jpg
dog_100.jpg
reduced to
cat_3.jpg
dog_100.jpg
That is, take only the version of each file with the highest number suffix and delete the rest.
This is very much like
list the files with minimum sequence
but the bash answer there uses a "for ... in ..." loop, and I have thousands of file names.
EDIT:
Got the file name convention wrong. There may be other underscores (e.g. cat_and_dog_100.jpg). I need it to take only the number after the last underscore.

Assuming your filenames are always in the form <name>_<numbers>.jpg, with no underscores in <name>, here's a quick hack:
while IFS= read -r filename; do
    prefix=${filename%_*}                    # text before the last underscore
    if [ "$prev_prefix" != "$prefix" ]; then # we see a new prefix
        echo "Keeping $filename"
        prev_prefix=$prefix
    else                                     # same prefix
        echo "Deleting $filename"
        rm -- "$filename"
    fi
done < <(find . -maxdepth 1 -name "*.jpg" | sort -t'_' -k1,1 -k2,2rn)
How this works:
Sorts all *.jpg files first by <name> (ascending) and then by <numbers> (descending).
All files with the same prefix are therefore grouped together, with the highest <number> appearing first.
Iterates through the list of filenames and deletes every file except the first of each new <name>, which, given the sort order, is the one with the highest <number>.
Note that find is used instead of ls *.jpg so we can better handle a large number of files.
Disclaimer: This is a rather fragile way of dealing with files and versioning, and should not be adopted as a long term solution. Do heed the comments posted on the question.
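If the <name> part can itself contain underscores (per the question's edit), the field-based sort above breaks down. Here is a minimal sketch of an alternative, not from the original answer, assuming only that every filename ends in _<number>.jpg and that names contain no newlines; it keys on everything before the last underscore:
declare -A maxnum keepfile        # highest number seen, and its file, per prefix
while IFS= read -r f; do
    base=${f%.jpg}                # strip the extension
    prefix=${base%_*}             # everything before the LAST underscore
    num=$((10#${base##*_}))       # number after the last underscore (force base 10)
    if (( num > ${maxnum[$prefix]:--1} )); then
        # a new maximum: the previously kept file is now redundant
        [ -n "${keepfile[$prefix]}" ] && echo rm -- "${keepfile[$prefix]}"
        maxnum[$prefix]=$num
        keepfile[$prefix]=$f
    else
        echo rm -- "$f"
    fi
done < <(find . -maxdepth 1 -name '*.jpg')
The rm commands are only echoed as a dry run; remove the echo once the output looks right.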

Related

In bash, how can I remove multiple versions of the same file?

This may be a very specific case, but I know very little about bash and I need to remove "duplicate" files. I've been downloading totally legal videogame roms these past few days, and I noticed that a lot of packs have many different versions of the same game, like this:
Awesome Golf (1991).lnx
Awesome Golf (1991) [b1].lnx
Baseball Heroes (1991).lnx
Baseball Heroes (1991) [b1].lnx
Basketbrawl (1992).lnx
Basketbrawl (1992) [a1].lnx
Basketbrawl (1992) [b1].lnx
Batman Returns (1992).lnx
Batman Returns (1992) [b1].lnx
How can I make a bash script that removes the duplicates? A duplicate would be any file that has the same name, and the name would be the string before the first parenthesis. The script should parse all the files and grab their names, see which names match to detect duplicates, and remove all files except the first one (first being the first that comes up in alphabetical order).
Would you please try the following:
#!/bin/bash
dir="dir"                        # the directory where the rom files are located
declare -A seen                  # associative array to detect the duplicates
while IFS= read -r -d "" f; do   # loop over filenames by assigning "f" to each
    name=${f%(*}                 # extract the "name" by removing the left paren and following characters
    name=${name%.*}              # remove the extension, in case the filename has no parens
    name=${name%[*}              # remove the left square bracket and following characters, as above
    name=${name% }               # remove the trailing space, if any
    if (( seen[$name]++ )); then # if the name duplicates...
        # remove "echo" if the output looks good
        echo rm -- "$f"          # then remove the file
    fi
done < <(find "$dir" -type f -name "*.lnx" -print0 | sort -z -t "." -k1,1)
# the sort puts the list of filenames in alphabetical order
Please modify the first dir= line to point at the directory that holds the rom files.
The echo command just prints the filenames to be removed, as a dry run. If the output looks good, remove echo and execute the real thing.
[Explanation]
An associative array seen maps each extracted "name" to a count of its
appearances. If the counter is already non-zero, the file is a duplicate
and can be removed (as long as the files are properly sorted).
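To see the (( seen[$name]++ )) counter idiom in isolation, here is a toy example (not part of the original answer): the arithmetic expression post-increments the counter, so it is true only from the second occurrence onward.
declare -A seen
for name in foo bar foo foo; do
    if (( seen[$name]++ )); then
        echo "duplicate: $name"
    else
        echo "first: $name"
    fi
done
# prints: first: foo, first: bar, duplicate: foo, duplicate: foo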
The -print0 option to find, the -z option to sort, and the -d ""
option to read all use a null character as the filename delimiter,
so filenames containing special characters such as whitespace, tabs,
or newlines are handled correctly.

Rename batch of specific files using bash

Say I have a folder which contains files like this: a constant prefix and then an underscore and some description which is different for every file:
constantnamehere_description1.doc
constantnamehere_description2.doc
.
.
etc
Here description1, description2, etc. just symbolize the different descriptions, not the literal numbers 1, 2, etc.
How can I rename these files to just this?
constantnamehere1.doc
constantnamehere2.doc
.
.
etc
Here the numbers 1, 2, ..., etc. symbolize the actual sequential ending that I want my files to have after the renaming.
The sequential ending (1,2,3,...,end) is very important.
Till now I have tried:
for i in *.doc; do mv "$i" "{i/_*.doc/ .doc}"; done
example actual file names
1003407_cc_1.vtk
1003407_cc_2.vtk
1003407_cc_3.vtk
1003407_cv.left.right.vtk
1003407_thalamo_frontal.left.vtk
I want to be like:
1003407_1.vtk
1003407_2.vtk
1003407_3.vtk
1003407_4.vtk
1003407_5.vtk
To make it extremely clear: I want everything after the first underscore to be removed and replaced with sequential numbers, keeping the ".vtk" extension of the file.
Using an answer to Capturing Groups From a Grep RegEx, we can build a regex for these file names and then rename using the captured groups:
$ regex="([^_]*)_[^0-9]*([0-9]*)\.([a-z]*)"
$ for f in *doc
do
[[ $f =~ $regex ]]
echo "mv $f --> ${BASH_REMATCH[1]}${BASH_REMATCH[2]}.${BASH_REMATCH[3]}"
done
The regex says: capture everything up to _, then expect some non-digit characters until a digit is found. Capture that run of digits, then expect a dot followed by the extension, which is captured as well.
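For example, running the match by hand on a filename in the question's original format (a quick sanity check; the filename is hypothetical):
$ f="constantnamehere_description2.doc"
$ [[ $f =~ $regex ]] && echo "${BASH_REMATCH[1]}${BASH_REMATCH[2]}.${BASH_REMATCH[3]}"
constantnamehere2.doc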
Use rename:
i=1
for file in *_*.vtk
do
    rename "s/_[^.]*/_${i}/" "$file"
    i=$(( i + 1 ))
done
This replaces everything from the underscore up to the first . with a sequential number (keeping the underscore), for all files matching the *_*.vtk pattern. If your filenames contain more than one ., the pattern needs to be adapted, as in the sketch after the edit note below.
EDIT: Solution modified according to modified question.
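For the .vtk names with multiple dots shown in the question, one possible adaptation, as a sketch, is to anchor the match to the .vtk extension instead of the first dot (this assumes the Perl-style rename used above):
i=1
for file in *_*.vtk
do
    # replace everything from the first underscore to the end with _<i>.vtk
    rename "s/_.*\.vtk$/_${i}.vtk/" "$file"
    i=$(( i + 1 ))
done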
I solved it like this:
i=1;
for file in *.vtk; do mv "${file}" 1003407_"${i}".vtk; i=$((i+1)); done

Iterating a group of folders and files while removing certain files that are contained in a list

I have a set of files that I download that contain files I want to remove. I would like to keep the removal patterns in a list of some form; the script should support globbing so I can be pretty aggressive with file removal without getting into the complexities of using regex within the list of files.
I am also stumped in that I put a sleep command within the loop of my script, and it is not run after each iteration, but only once at the end of the run.
Here is the script
# Get to the place where all the durty work happens
cd /Volumes/Videos
FILES=".DS_Store
*.txt
*.sample
*.sample.*
*.samples"
if [ "$(pwd)" == "/Volumes/Videos" ]; then
echo "You are currently in $(pwd)"
echo "You would not have read the above if this script were operating anywhere else"
# Delete files from the list above
for f in "$FILES"
do
echo "Removing $f";
rm -f "$f";
echo "$f has been deleted";
sleep 10;
echo "";
echo "";
done
# See if dir is empty, ask if we want to delete it or keep it
# Iterate over every movie file, see if we want to nuke contents. Maybe use part of last opened to help find those files fast
else
# Not in the correct directory
echo "This script is trying to alter files in a location that it should not be working"
echo "Script is currently trying to work in $(pwd)"
exit 1
fi
The main thing that has me completely stumped is the sleep command. It is run once, not once per file iteration. If I have 100 files to go through I get 10 seconds of sleep, not 100*10.
I will be adding some other features, like deleting a file if it is smaller than x bytes. These files will have spaces and other odd characters in their names; am I creating my variables correctly so that this script handles those scenarios and stays as POSIX compliant as possible? I will change the shebang from bash to sh and try to add set -o nounset and set -o errexit, though I tend to have a lot of trouble when I do that.
Is there a better form of list I should be using? I am not opposed to storing the pattern-match list in a separate file. I can include it, or read it in with any of a few commands.
These are also nested files, a dir, that contains files, or a dir that contains a dir that contains some files. Something like this:
/Volumes/Videos:
The Great guy in a tree
The Great guy in a tree S01e01
sample.avi
readme.txt
The Great guy in a tree S01e01.mpg
The Great guy in a tree S01e02
The Great guy in a tree S01e02.mpg
The Great guy in a tree S01e03
The Great guy in a tree S01e03.mpg
The Great guy in a tree S01e04
The Great guy in a tree S01e04.mpg
Thank you.
The reason that your script is not working as you expect is because your for loop is written incorrectly. This example shows what is going on:
$ i=0
$ FILES=".DS_Store
*.txt
*.sample
*.sample.*
*.samples"
$ for f in "$FILES"; do echo $((++i)) "$f"; done
1 .DS_Store
*.txt
*.sample
*.sample.*
*.samples
Note that only one number is output, indicating that the loop is only going around once. Also, no pathname expansion has occurred.
In order to make your script work as you expect, you can remove the quotes around "$FILES". This means that each word in your string will be evaluated separately, rather than all at once. It also means that pathname expansion of the wildcards that you are using will occur, so all files ending in .txt will be removed, which I guess is what you meant.
Instead of using a string to store your list of expressions, you might prefer to make use of an array:
FILES=( '.DS_Store' '*.txt' '*.sample' '*.sample.*' '*.samples' )
The quotes around each element prevent expansion (so the array only has 5 elements, not the fully expanded list). You could then change your loop to for f in ${FILES[@]} (again, no double quotes, so that each element of the list is expanded), as in the sketch below.
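Putting that together, the deletion loop might look like this. A minimal sketch, keeping the original script's live rm and sleep; run it under the same directory guard as before:
FILES=( '.DS_Store' '*.txt' '*.sample' '*.sample.*' '*.samples' )
for f in ${FILES[@]}; do    # deliberately unquoted so each pattern glob-expands
    echo "Removing $f"
    rm -f "$f"              # quoted: an expanded filename may contain spaces
    sleep 10
done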
Although removing the quotes fixes your script, I would agree with @hek2mgl's suggestion of using find. It allows you to find files by name, size, date modified and a lot more in one line. If you want to pause between the deletion of each file, you could use something like this:
find \( -name "*.sample" -o -name "*.txt" \) -delete -exec sleep 10 \;
You can use find:
find . -type f \( -name '.DS_Store' -o -name '*.txt' -o -name '*.sample.*' -o -name '*.samples' \) -delete
The parentheses are needed so that -type f and -delete apply to every -name alternative, not just the last one.

How to rename files with ordering using Bash {thisfile.jpg -> newfile1.jpg, thatfile.jpg -> newfile2.jpg}

Let's say I have 100 jpg files.
DSC_0001.jpg
DSC_0002.jpg
DSC_0003.jpg
....
DSC_0100.jpg
And I want to rename them like
summer_trip_1.jpg
summer_trip_2.jpg
summer_trip_3.jpg
.....
summer_trip_100.jpg
So I want these properties to be modified:
1. filename
2. order of the file(as the order by date files were created)
How could I achieve this by using bash? Like:
for file in *.jpg ; do mv blah blah blah ; done
Thank you!
It's very simple: have a variable and increment it at each step.
Example:
cur_number=1
prefix="summer_trip_"
suffix=""
for file in *.jpg ; do
    echo mv "$file" "${prefix}${cur_number}${suffix}.jpg"
    let cur_number=$cur_number+1 # or : cur_number=$(( $cur_number + 1 ))
done
and once you think it's ready, take out the echo to let the mv occur.
If you prefer them to be ordered by file date (useful, for example, when mixing photos from several cameras, or if on yours the "numbers" rolled over):
change
for file in *.jpg ; do
into
for file in $( ls -1tr *.jpg ) ; do
Note that this second example will only work if your original filenames don't have spaces (or other weird characters) in them, which is fine with almost all cameras I know about. The -r makes ls list the oldest file first, so the numbering follows the order in which the photos were taken.
Finally, instead of ${cur_number} you could use $(printf '%04d' "${cur_number}") so that the new number has leading zeros, making sorting much easier.
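For instance, the zero-padding looks like this:
$ cur_number=7
$ printf '%04d\n' "$cur_number"
0007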
How about using rename ?
rename DSC_ summer_trip_ *.jpg
See the rename man page.
This works if your original numbers are all padded with zeroes to the same length:
i=0; for f in DSC_*.jpg; do mv "$f" "summer_trip_$((++i)).jpg"; done
If I understand your goal correctly:
So I want these properties to be modified: 1. filename 2. order of the file(as the order by date files were created)
if numbers of renamed files shall increment in order by file creation date, then use the following for loop:
for file in $(ls -t -r *.jpg); do
-t sorts by mtime (last modification time, not exactly creation time, newest first) and -r reverses the order listing oldest first. Just in case if original .jpg file numbers are not in the same order as pictures were taken.
As it was mentioned previously, this won't work if file names have whitespaces. If your files have spaces please try modifying IFS variable before for loop:
IFS=$'\n'
It forces splitting of the ls output on newlines only (not on all whitespace). It would still fail if a file name contains a newline (rather exotic, IMHO :). Changing IFS may have subtle effects further along in your script, so you can save the old value and restore it after the loop, as sketched below.
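A sketch of that save-and-restore pattern wrapped around the loop (the mv is echoed as a dry run):
old_ifs=$IFS
IFS=$'\n'       # split the ls output on newlines only
i=1
for file in $(ls -t -r *.jpg); do
    echo mv "$file" "summer_trip_${i}.jpg"
    i=$(( i + 1 ))
done
IFS=$old_ifs    # restore the original IFS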

Bash: Check all files in a location against another for existence

I'm after a little help with some Bash scripting (on OSX). I want to create a script that takes two parameters - source folder and target folder - and checks all files in the source hierarchy to see whether or not they exist in the target hierarchy. i.e. Given a data DVD check whether the files contained on it are already on the internal drive.
What I've come up with so far is
#!/bin/bash
if [ $# -ne 2 ]
then
    echo "Usage is command sourcedir targetdir"
    exit 0
fi
source="$1"
target="$2"
for f in "$( find $source -type f -name '*' -print )"
do
I'm now not sure how it's best to obtain the filename without its path and then see if it exists. I am really a beginner at scripting.
Edit: The answers given so far are all very efficient in terms of compact code. However, I need to be able to look for files from anywhere in the source hierarchy, anywhere within the target hierarchy. If a file is found, I would like to compare checksums, last-modified dates, etc. and report on them; if not found, I would like to note that. The purpose is to check whether files on external media have been uploaded to a file server.
This should give you some ideas:
#!/bin/bash
DIR1="tmpa"
DIR2="tmpb"
function sorted_contents
{
    cd "$1"
    find . -type f | sort
}
DIR1_CONTENTS=$(sorted_contents "$DIR1")
DIR2_CONTENTS=$(sorted_contents "$DIR2")
diff -y <(echo "$DIR1_CONTENTS") <(echo "$DIR2_CONTENTS")
In my test directories, the output was:
[user@host so]$ ./dirdiff.sh
./address-book.dat ./address-book.dat
./passwords.txt ./passwords.txt
./some-song.mp3 <
./the-holy-grail.info ./the-holy-grail.info
> ./victory.wav
./zzz.wad ./zzz.wad
If it's not clear: "some-song.mp3" was only in the first directory, while "victory.wav" was only in the second. The rest of the files were common.
Note that this only compares the file names, not the contents. If you like where this is headed, you could play with the diff options (maybe --suppress-common-lines if you want cleaner output).
But this is probably how I'd approach it -- offload a lot of the work onto diff.
EDIT: I should also point out that something as simple as:
[user@host so]$ diff tmpa tmpb
would also work:
Only in tmpa: some-song.mp3
Only in tmpb: victory.wav
... but not feel as satisfying as writing a script yourself. :-)
To list only files in $source_dir that do not exist in $target_dir:
comm -23 <(cd "$source_dir" && find .|sort) <(cd "$target_dir" && find .|sort)
You can limit it to just regular files with -f on the find commands, etc.
The comm command (short for "common") finds lines in common between two text files and outputs three columns: lines only in the first file, lines only in the second file, and lines common to both. The numbers suppress the corresponding column, so the output of comm -23 is only the lines from the first file that don't appear in the second.
The process substitution syntax <(command) is replaced by the pathname to a named pipe connected to the output of the given command, which lets you use a "pipe" anywhere you could put a filename, instead of only stdin and stdout.
The commands in this case generate lists of files under the two directories - the cd makes the output relative to the directories being compared, so that corresponding files come out as identical strings, and the sort ensures that comm won't be confused by the same files listed in different order in the two folders.
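As a toy illustration of comm -23 (not from the original answer): given two sorted lists, only the lines unique to the first are printed.
$ comm -23 <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')
a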
A few remarks about the line for f in "$( find $source -type f -name '*' -print )":
Make that "$source". Always use double quotes around variable substitutions. Otherwise the result is split into words that are treated as wildcard patterns (a historical oddity in the shell parsing rules); in particular, this would fail if the value of the variable contains spaces.
You can't iterate over the output of find that way. Because of the double quotes, there would be a single iteration through the loop, with $f containing the complete output from find. Without double quotes, file names containing spaces and other special characters would trip the script (a safer pattern is sketched after these remarks).
-name '*' is a no-op, it matches everything.
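A safer pattern for iterating over find results, as a sketch: it uses null delimiters, like the -print0/read -d "" combination shown in an earlier answer, so arbitrary filenames are handled.
while IFS= read -r -d '' f; do
    printf 'found: %s\n' "$f"
done < <(find "$source" -type f -print0)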
As far as I understand, you want to look for files by name independently of their location, i.e. you consider /dvd/path/to/somefile to be a match for /internal-drive/different/path-to/somefile. So make a list of files on each side indexed by name. You can do this by massaging the output of find a little. The code below can cope with any character in file names except newlines.
list_files () {
    find . -type f -print |
        sed 's:^\(.*\)/\(.*\)$:\2/\1/\2:' |
        sort
}
source_files="$(cd "$1" && list_files)"
dest_files="$(cd "$2" && list_files)"
join -t / -v 1 <(echo "$source_files") <(echo "$dest_files") |
sed 's:^[^/]*/::'
The list_files function generates a list of file paths and prepends the bare file name to each entry, so e.g. /mnt/dvd/some/dir/filename.txt will appear as filename.txt/./some/dir/filename.txt. It then sorts the list, which groups entries by file name.
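You can check the keying transform on a single path:
$ echo './some/dir/filename.txt' | sed 's:^\(.*\)/\(.*\)$:\2/\1/\2:'
filename.txt/./some/dir/filename.txt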
The join command prints out lines like filename.txt/./some/dir/filename.txt when there is a file called filename.txt in the source hierarchy but not in the destination hierarchy. We finally massage its output a little since we no longer need the filename at the beginning of the line.
