In bash, how can I remove multiple versions of the same file?

This may be a very specific case, but I know very little about bash and I need to remove "duplicate" files. I've been downloading totally legal videogame roms these past few days, and I noticed that a lot of packs have many different versions of the same game, like this:
Awesome Golf (1991).lnx
Awesome Golf (1991) [b1].lnx
Baseball Heroes (1991).lnx
Baseball Heroes (1991) [b1].lnx
Basketbrawl (1992).lnx
Basketbrawl (1992) [a1].lnx
Basketbrawl (1992) [b1].lnx
Batman Returns (1992).lnx
Batman Returns (1992) [b1].lnx
How can I make a bash script that removes the duplicates? A duplicate would be any file that has the same name, and the name would be the string before the first parenthesis. The script should parse all the files and grab their names, see which names match to detect duplicates, and remove all files except the first one (first being the first that comes up in alphabetical order).

Would you please try the following:
#!/bin/bash
dir="dir" # the directory where the rom files are located
declare -A seen # associative array to detect the duplicates
while IFS= read -r -d "" f; do # loop over filenames by assigning "f" to it
name=${f%(*} # extract the "name" by removing left paren and following characters
name=${name%.*} # remove the extension considering the case the filename doesn't have parens
name=${name%[*} # remove the left square bracket and following characters considering the case as above
name=${name%"${name##*[! ]}"} # remove all trailing spaces, if any
if (( seen[$name]++ )); then # if the name duplicates...
# remove "echo" if the output looks good
echo rm -- "$f" # then remove the file
fi
done < <(find "$dir" -type f -name "*.lnx" -print0 | sort -z -t "." -k1,1)
# sort the list of filenames in alphabetical order
Please modify the first dir= line to your directory path which holds the rom files.
The echo command just prints the rm commands that would be run, as a dry run. If the output looks good, remove echo and run the script for real.
[Explanation]
An associative array seen associates the extracted "name" with a
counter of appearance. If the counter is non-zero, the file is a duplicated
one and can be removed (as long as the files are properly sorted).
The -print0 option to find, the -z option to sort, and the -d ""
option to read make the null character the delimiter between filenames,
so that filenames containing special characters such as spaces,
tabs, or newlines are handled correctly.
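The counting idea can also be tried without touching any files. The following is only a sketch: game_name and list_duplicates are hypothetical helper names, not part of the script above, and the sketch assumes bash 4 or later (for associative arrays) and an argument list that is already sorted.

```bash
#!/bin/bash
# Hypothetical helper (not from the answer): derive the game "name" from a path.
game_name() {
    local n=${1##*/}          # drop any directory part
    n=${n%.*}                 # drop the extension
    n=${n%%\(*}               # cut at the first "("
    n=${n%%\[*}               # cut at the first "["
    n=${n%"${n##*[! ]}"}      # trim trailing spaces
    printf '%s\n' "$n"
}

# Print every argument whose name was already seen earlier in the list.
# Assumes the arguments are already sorted, so the first one is the keeper.
list_duplicates() {
    local -A seen=()
    local f key
    for f in "$@"; do
        key=$(game_name "$f")
        [[ -n $key ]] || continue      # skip names that reduce to nothing
        if [[ -n ${seen[$key]} ]]; then
            printf '%s\n' "$f"         # seen before: this one is a duplicate
        fi
        seen[$key]=1
    done
}
```

Calling list_duplicates with "Awesome Golf (1991).lnx" and "Awesome Golf (1991) [b1].lnx" prints only the second name, which is the file the real script would pass to rm.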

Related

bash: rename files dropping a specific delimited part of the filename

I've been trying to find an efficient way to rename lots of files, by removing a specific component of the filename, in bash shell in linux. Filenames are like:
DATA_X3.A2022086.40e50s.231.2022087023101.csv
I want to remove the 2nd to last element entirely, resulting in:
DATA_X3.A2022086.40e50s.231.csv
I've seen suggestions to use perl-rename, that might handle this (I'm not clear), but this system does not have perl-rename available. (Has GNU bash 4.2, and rename from util-linux 2.23)
I like extended globbing and parameter parsing for things like this.
$: shopt -s extglob
$: n=DATA_X3.A2022086.40e50s.231.2022087023101.csv
$: echo ${n/.+([0-9]).csv/.csv}
DATA_X3.A2022086.40e50s.231.csv
So ...
for f in *.csv; do mv "$f" "${f/.+([0-9]).csv/.csv}"; done
This assumes all the files in the local directory, and no other CSV files with similar formatting you don't want to rename, etc.
edit
In the more general case where the .csv is not immediately following the component to be removed, is there a way to drop the nth dot-separated component in the filename, without a more complicated sequence to string-split in bash (which always seems cumbersome) and rebuild the filename?
There is usually a way. If you know which field needs to be removed -
$: ( IFS=. read -ra line <<< "$n"; unset line[4]; IFS=".$IFS"; echo "${line[*]}" )
DATA_X3.A2022086.40e50s.231.csv
Breaking that out:
( # open a subshell to localize IFS
IFS=. read -ra line <<< "$n"; # inline set IFS to . to parse to fields
unset line[4]; # unset the desired field from the array
IFS=".$IFS"; # prepend . as the OUTPUT separator
echo "${line[*]}" # reference with * to reinsert
) # closing the subshell restores IFS
I will confess I am not certain why the inline setting of IFS doesn't work on the reassembly. /shrug
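For what it's worth, a command-prefix assignment such as IFS=. echo "${line[*]}" does not affect the join because the shell expands "${line[*]}" before the temporary assignment takes effect. A function with a local IFS sidesteps both that and the subshell. This is only a sketch; drop_field is a made-up name, and the field index is 0-based as in the array above.

```bash
#!/bin/bash
# drop_field STRING N: print STRING with its Nth (0-based) dot-separated
# field removed. Hypothetical helper, not part of the answer above.
drop_field() {
    local IFS=.                 # local IFS: used for both the split and the join
    local -a parts
    read -ra parts <<< "$1"     # split the string on "."
    unset "parts[$2]"           # drop the requested field
    printf '%s\n' "${parts[*]}" # [*] joins with the first character of IFS
}
```

For example, drop_field "$n" 4 with the filename above gives DATA_X3.A2022086.40e50s.231.csv.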
This is a simple split/drop-field/reassemble, but I think it may be an X/Y Problem
If what you are doing is dropping the one field that has the date/timestamp info, then as long as the format of that field is consistent and unique, it's probably easier to use a version of the first approach.
Is it possible you meant for the 5th field of DATA_X3.A2022086.40e50s.231.2022087023101.csv to be 20220807023101, i.e., August 7th of 2022 @ 02:31:01 AM? If so, and the field is supposed to be 14 digits instead of 13, and it is the only field that is always exactly 14 digits, then you don't need shopt and can leave the field position floating -
$: n=DATA_X3.A2022086.40e50s.231.20220807023101.csv
$: echo ${n/.[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]./.}
DATA_X3.A2022086.40e50s.231.csv

Extract number from filename

I'm writing a bash script that can automatically run simulations for me. In order to start the simulation, another script needs an input, which should be dictated by the name of the folder.
So if I have a folder named No200, then I want to extract the number 200. So far, what I have is
PedsDirs=`find . -type d -maxdepth 1`
for dir in $PedsDirs
do
if [ $dir != "." ]; then
NoOfPeds = "Number appearing in the name dir"
fi
done
$ dir="No200"
$ echo "${dir#No}"
200
In general, to remove a prefix use ${variable-name#prefix}; to remove a suffix: ${variable-name%suffix}.
Bonus tip: avoid using find here. Looping over its unquoted output causes problems, especially when your files/directories contain whitespace. Use bash's builtin globbing instead:
for dir in No*/ # Loops over all directories starting with 'No'.
do
dir="${dir%/}" # Removes the trailing slash from the directory name.
NoOfPeds="${dir#No}" # Removes the 'No' prefix.
done
Also, try to always use quotes around variable expansions to avoid accidental word splitting and globbing (i.e. use "$dir" instead of just $dir).
Be careful, as you have to join the = to the variable name in bash. To get just the number, you can do something like:
NoOfPeds=$(echo "$dir" | tr -d -c 0-9)
(that is, delete every character that is not a digit). All the digits in the name will then be in NoOfPeds.
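If the folder name could contain more than one group of digits, the tr approach glues them all together. A regex match that takes only the first run of digits is one alternative. This is a sketch; extract_number is a hypothetical name.

```bash
#!/bin/bash
# extract_number NAME: print the first run of digits found in NAME,
# or return non-zero if NAME contains no digits. Hypothetical helper.
extract_number() {
    if [[ $1 =~ [0-9]+ ]]; then
        printf '%s\n' "${BASH_REMATCH[0]}"
    else
        return 1
    fi
}
```

Used in the loop above, it would read NoOfPeds=$(extract_number "$dir").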

Rename batch of specific files using bash

Say I have a folder which contains files like this: a constant prefix and then an underscore and some description which is different for every file:
constantnamehere_description1.doc
constantnamehere_description2.doc
.
.
etc
Here description1, description2 etc just symbolized the different descriptions and not the actual number 1,2 etc..
How can I rename these files to just this?
constantnamehere1.doc
constantnamehere2.doc
.
.
etc
Here the numbers 1,2,..,etc symbolize the actual sequential ending that i want my files to have after the renaming.
The sequential ending (1,2,3,...,end) is very important.
Till now I have tried:
for i in *.doc; do mv "$i" "{i/_*.doc/ .doc}"; done
example actual file names
1003407_cc_1.vtk
1003407_cc_2.vtk
1003407_cc_3.vtk
1003407_cv.left.right.vtk
1003407_thalamo_frontal.left.vtk
I want to be like:
1003407_1.vtk
1003407_2.vtk
1003407_3.vtk
1003407_4.vtk
1003407_5.vtk
To make it extremely clear: I want everything to be removed after the first underscore and to be replaced with sequential numbers keeping the ".vtk" extension of the file
Using an answer to Capturing Groups From a Grep RegEx, we can generate a regex for these file names and then rename by using the captured groups:
$ regex="([^_]*)_[^0-9]*([0-9]*)\.([a-z]*)"
$ for f in *doc
do
[[ $f =~ $regex ]]
echo "mv $f --> ${BASH_REMATCH[1]}${BASH_REMATCH[2]}.${BASH_REMATCH[3]}"
done
The regex says: get everything up to _, then expect some characters until a digit is found. Catch that set of digits and then expect a dot followed by the extension.
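To check what the capture groups actually hold before renaming anything, the match can be wrapped in a small function. This is a sketch: parse_name is a hypothetical name, and the pattern is anchored with ^ and $ and has the dot escaped, which is slightly stricter than the regex shown above.

```bash
#!/bin/bash
# parse_name FILE: print the prefix, the number and the extension captured
# from a name like prefix_description1.doc. Hypothetical helper.
parse_name() {
    local regex='^([^_]*)_[^0-9]*([0-9]*)\.([a-z]*)$'
    [[ $1 =~ $regex ]] || return 1     # no match: report failure
    printf '%s %s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
}
```

Checking the return status matters in a rename loop, because BASH_REMATCH is empty when the match fails.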
Use rename:
i=1
for file in *_*.vtk
do
rename "s/_[^.]*/${i}/" "$file"
i=$(( i + 1 ))
done
This removes everything between the underscore and the first . from all files matching the *_*.vtk pattern. If your filenames contain more than one ., the pattern needs to be adapted.
EDIT: Solution modified according to modified question.
I solved it like this:
i=0;
for file in *.vtk; do mv "${file}" 100307_"${i}".vtk; i=$((i+1)); done
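A variant that takes the prefix from each file name instead of hard-coding it, and only prints the planned renames, might look like this. It is a sketch: plan_renames is a hypothetical name, and numbering from 1 and keeping the text after the last dot as the extension are assumptions based on the desired output in the question.

```bash
#!/bin/bash
# plan_renames FILE...: print "old -> new" for each file, where new is the
# text before the first underscore, a sequence number and the extension.
# Hypothetical dry-run helper; it does not rename anything.
plan_renames() {
    local i=1 f prefix ext
    for f in "$@"; do
        prefix=${f%%_*}     # everything before the first underscore
        ext=${f##*.}        # everything after the last dot
        printf '%s -> %s_%d.%s\n' "$f" "$prefix" "$i" "$ext"
        i=$((i + 1))
    done
}
```

Running plan_renames *_*.vtk shows the mapping first; once it looks right, the printf can be swapped for an mv with the same two names.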

Keep only one version of each file (bash)

I want to remove redundant files in a folder. Something like
cat_1.jpg
cat_2.jpg
cat_3.jpg
dog_10.jpg
dog_100.jpg
reduced to
cat_3.jpg
dog_100.jpg
That is, take only the version of each file with the highest number suffix and delete the rest.
This is very much like
list the files with minimum sequence
but the bash answer there has a "for ... in ... ". I have thousands of file names.
EDIT:
Got the file name convention wrong. There may be other underscores (ex. cat_and_dog_100.jpg). I need it to only take the number after the last underscore.
Assuming your filenames are always in the form <name>_<numbers>.jpg, here's a quick hack:
while IFS= read -r filename; do
prefix=${filename%%_*} # Get text before the first underscore
if [ "$prev_prefix" != "$prefix" ]; then # we see a new prefix
echo "Keeping $filename"
prev_prefix=$prefix
else # same prefix
echo "Deleting $filename"
rm -- "$filename"
fi
done < <(find . -maxdepth 1 -name "*.jpg" | sort -t'_' -k1,1 -k2,2nr)
How this works:
Sorts all *.jpg files first by <name> and then by <numbers>.
All files with the same prefix are grouped together, with the highest <number> appearing first.
Iterates through the list of filenames and deletes each file except when a new <name> is found (which should be the one with the highest <number>).
Note that find is used instead of ls *.jpg so we can better handle a large number of files.
Disclaimer: This is a rather fragile way of dealing with files and versioning, and should not be adopted as a long term solution. Do heed the comments posted on the question.
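For names with extra underscores, as in the question's edit (cat_and_dog_100.jpg), one option is to split at the last underscore and remember the highest number per name. The sketch below only prints the files that would be deleted. older_versions is a hypothetical name, and bash 4 or later is assumed for the associative arrays.

```bash
#!/bin/bash
# older_versions FILE...: print every file that is not the highest-numbered
# version of its name. The number is the text between the last underscore
# and the extension. Hypothetical helper; it deletes nothing.
older_versions() {
    local -A best=() bestfile=()
    local f stem name num
    for f in "$@"; do
        stem=${f%.*}            # strip the extension
        name=${stem%_*}         # everything before the last underscore
        num=${stem##*_}         # everything after the last underscore
        [[ $num =~ ^[0-9]+$ ]] || continue   # ignore names without a number
        if [[ -z ${best[$name]+set} ]] || (( 10#$num > 10#${best[$name]} )); then
            if [[ -n ${bestfile[$name]} ]]; then
                printf '%s\n' "${bestfile[$name]}"   # previous best is now old
            fi
            best[$name]=$num
            bestfile[$name]=$f
        else
            printf '%s\n' "$f"
        fi
    done
}
```

The 10# prefix forces base 10, so a suffix such as 08 is not read as an invalid octal number. The input order does not matter here, so no sort is needed.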

Bash: Check all files in a location against another for existence

I'm after a little help with some Bash scripting (on OSX). I want to create a script that takes two parameters - source folder and target folder - and checks all files in the source hierarchy to see whether or not they exist in the target hierarchy. i.e. Given a data DVD check whether the files contained on it are already on the internal drive.
What I've come up with so far is
#!/bin/bash
if [ $# -ne 2 ]
then
echo "Usage is command sourcedir targetdir"
exit 0
fi
source="$1"
target="$2"
for f in "$( find $source -type f -name '*' -print )"
do
I'm now not sure how it's best to obtain the filename without its path and then see if it exists. I am really a beginner at scripting.
Edit: The answers given so far are all very efficient in terms of compact code. However I need to be able to look for files found within the total source hierarchy anywhere within the target hierarchy. If found I would like to compare checksums and last modified dates etc and comment or, if not found, I would like to note this. The purpose is to check whether files on external media have been uploaded to a file server.
This should give you some ideas:
#!/bin/bash
DIR1="tmpa"
DIR2="tmpb"
function sorted_contents
{
cd "$1"
find . -type f | sort
}
DIR1_CONTENTS=$(sorted_contents "$DIR1")
DIR2_CONTENTS=$(sorted_contents "$DIR2")
diff -y <(echo "$DIR1_CONTENTS") <(echo "$DIR2_CONTENTS")
In my test directories, the output was:
[user@host so]$ ./dirdiff.sh
./address-book.dat ./address-book.dat
./passwords.txt ./passwords.txt
./some-song.mp3 <
./the-holy-grail.info ./the-holy-grail.info
> ./victory.wav
./zzz.wad ./zzz.wad
If it's not clear, "some-song.mp3" was only in the first directory while "victory.wav" was only in the second. The rest of the files were common.
Note that this only compares the file names, not the contents. If you like where this is headed, you could play with the diff options (maybe --suppress-common-lines if you want cleaner output).
But this is probably how I'd approach it -- offload a lot of the work onto diff.
EDIT: I should also point out that something as simple as:
[user@host so]$ diff tmpa tmpb
would also work:
Only in tmpa: some-song.mp3
Only in tmpb: victory.wav
... but not feel as satisfying as writing a script yourself. :-)
To list only files in $source_dir that do not exist in $target_dir:
comm -23 <(cd "$source_dir" && find .|sort) <(cd "$target_dir" && find .|sort)
You can limit it to just regular files with -type f on the find commands, etc.
The comm command (short for "common") finds lines in common between two text files and outputs three columns: lines only in the first file, lines only in the second file, and lines common to both. The numbers suppress the corresponding column, so the output of comm -23 is only the lines from the first file that don't appear in the second.
The process substitution syntax <(command) is replaced by the pathname to a named pipe connected to the output of the given command, which lets you use a "pipe" anywhere you could put a filename, instead of only stdin and stdout.
The commands in this case generate lists of files under the two directories - the cd makes the output relative to the directories being compared, so that corresponding files come out as identical strings, and the sort ensures that comm won't be confused by the same files listed in different order in the two folders.
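Wrapped in a function, the same pipeline might look like the sketch below. only_in_source is a hypothetical name. Restricting to regular files with -type f and forcing the C locale so that sort and comm agree on ordering are my additions, not part of the one-liner above.

```bash
#!/bin/bash
# only_in_source SRC DST: list regular files (as paths relative to SRC)
# that exist under SRC but not at the same relative path under DST.
# Hypothetical helper.
only_in_source() {
    LC_ALL=C comm -23 \
        <(cd "$1" && find . -type f | LC_ALL=C sort) \
        <(cd "$2" && find . -type f | LC_ALL=C sort)
}
```

Like the original, this compares one path per line, so file names containing newlines would confuse it.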
A few remarks about the line for f in "$( find $source -type f -name '*' -print )":
Make that "$source". Always use double quotes around variable substitutions. Otherwise the result is split into words that are treated as wildcard patterns (a historical oddity in the shell parsing rules); in particular, this would fail if the value of the variable contains spaces.
You can't iterate over the output of find that way. Because of the double quotes, there would be a single iteration through the loop, with $f containing the complete output from find. Without double quotes, file names containing spaces and other special characters would trip the script.
-name '*' is a no-op, it matches everything.
As far as I understand, you want to look for files by name independently of their location, i.e. you consider /dvd/path/to/somefile to be a match to /internal-drive/different/path-to/somefile. So make a list of files on each side indexed by name. You can do this by massaging the output of find a little. The code below can cope with any character in file names except newlines.
list_files () {
find . -type f -print |
sed 's:^\(.*\)/\(.*\)$:\2/\1/\2:' |
sort
}
source_files="$(cd "$1" && list_files)"
dest_files="$(cd "$2" && list_files)"
join -t / -v 1 <(echo "$source_files") <(echo "$dest_files") |
sed 's:^[^/]*/::'
The list_files function generates a list of file names with paths, and prepends the file name in front of the files, so e.g. /mnt/dvd/some/dir/filename.txt will appear as filename.txt/./some/dir/filename.txt. It then sorts the files.
The join command prints out lines like filename.txt/./some/dir/filename.txt when there is a file called filename.txt in the source hierarchy but not in the destination hierarchy. We finally massage its output a little since we no longer need the filename at the beginning of the line.
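The same lookup by bare file name can also be sketched with an associative array and null-delimited find output, which copes with newlines in names too. Note the assumptions: missing_by_name is a hypothetical name, and associative arrays need bash 4 or later, which is newer than the /bin/bash that ships with OS X, so this would need a separately installed bash there.

```bash
#!/bin/bash
# missing_by_name SRC DST: print each regular file under SRC whose bare
# file name does not occur anywhere under DST. Hypothetical helper.
missing_by_name() {
    local src=$1 dst=$2 f name
    local -A have=()
    # Index every file name found in the destination hierarchy.
    while IFS= read -r -d '' f; do
        name=${f##*/}
        have[$name]=1
    done < <(find "$dst" -type f -print0)
    # Report source files whose name is not in the index.
    while IFS= read -r -d '' f; do
        name=${f##*/}
        if [[ -z ${have[$name]} ]]; then
            printf '%s\n' "$f"
        fi
    done < <(find "$src" -type f -print0)
}
```

Files that are found by name could then be passed to a checksum tool for the content comparison mentioned in the question's edit.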
