Bash command to (re)move duplicate files [eg photos] - bash

If I have a directory with a bunch of photos and some of them are duplicates [in everything except name], is there a way I can get a list of uniques and move them to another dir?
Eg
find . -type f -print0 | xargs -0 md5sum
that will give me a list of "md5 filename"
Now I just want to look at uniques based on that...
eg pipe that to sort -u.
After that I want to mv all of those files somewhere else, but I can worry about that later...

You can use fdupes:
fdupes -r .
to get a list of duplicates. The move should be possible with some command chaining.
fdupes -r -f .
shows you only the duplicated files, so if you have an image twice you'll get one entry instead of both duplicated paths.
To move you could do:
fdupes -r -f . | grep -v '^$' | while read -r file; do
    mv "$file" duplicated-files/
done
But be aware of name clashes..
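If clashes are likely, one way to keep every copy is GNU mv's numbered backups. A minimal sketch, assuming GNU coreutils (duplicated-files/ is just a placeholder name):
# --backup=numbered keeps clashing names as photo.jpg.~1~, photo.jpg.~2~, ...
mkdir -p duplicated-files
fdupes -r -f . | grep -v '^$' | while read -r file; do
    mv --backup=numbered "$file" duplicated-files/
done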

From there:
sort | uniq -w 32
will compare only the first 32 characters, which is exactly the length of the md5sum.
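Putting the question's pieces together, a rough sketch of the whole pipeline (assuming GNU md5sum output, filenames without newlines, and a placeholder target directory unique-files/):
# md5sum prints "checksum  name": 32 hex characters, two spaces, then the
# filename, so the name starts at column 35. uniq -w 32 keeps the first
# file seen for each checksum.
mkdir -p unique-files
find . -type f -print0 | xargs -0 md5sum | sort | uniq -w 32 | cut -c 35- |
while IFS= read -r file; do
    mv "$file" unique-files/
done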

Related

Batch rename all files in a directory to basename-sequentialnumber.extension

I have a directory containing .jpg files, currently named photo-1.jpg, photo-2.jpg etc. There are about 20,000 of these files, sequentially numbered.
Sometimes I delete some of these files, which creates gaps in the file naming convention.
Can you guys help me with a bash script that would sequentially rename all the files in the directory to eliminate the gaps? I have found many posts about renaming files and tried a bunch of things, but can't quite get exactly what I'm looking for.
For example:
photo-1.jpg
photo-2.jpg
photo-3.jpg
Delete photo-2.jpg
photo-1.jpg
photo-3.jpg
run script to sequentially rename all files
photo-1.jpg
photo-2.jpg
done
With find and sort.
First check the output of
find directory -type f -name '*.jpg' | sort -nk2 -t-
If the output is not what you expected, meaning the sort order is not correct, it might have something to do with your locale. Add LC_ALL=C before the sort.
find directory -type f -name '*.jpg' | LC_ALL=C sort -nk2 -t-
To record the output in a file, add a | tee output.txt after the sort.
Add the LC_ALL=C before the sort in the code below if it is needed.
#!/bin/sh
counter=1
find directory -type f -name '*.jpg' |
    sort -nk2 -t- | while read -r file; do
        ext=${file##*[0-9]} filename=${file%-*}
        [ ! -e "$filename-$counter$ext" ] &&
            echo mv -v "$file" "$filename-$counter$ext"
        counter=$((counter+1))
    done # 2>&1 | tee log.txt
Change the directory to the actual name of your directory that contains the files that you need to rename.
If your sort has the -V flag/option then that should work too.
sort -nk2 -t-: the -n means numeric sort, -k2 means sort on the second field, and -t- means the delimiter/separator is a dash - (it can also be written as -t -). Caveat: if the directory name contains a dash - as well, the sort is expected to fail; adjust the value of -k and it should work.
ext=${file##*[0-9]} is a parameter expansion; it keeps only the .jpg.
filename=${file%-*} is also a parameter expansion; it keeps only the photo part, plus the directory name before it.
[ ! -e "$filename-$counter$ext" ] will trigger the mv ONLY if the target file does not already exist.
If you want some record or log, remove the comment # after the done.
Remove the echo once you are satisfied that the output is correct.
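As a concrete illustration of those two expansions (hypothetical path, just to show what each one strips):
file=directory/photo-7.jpg
echo "${file##*[0-9]}"   # prints .jpg (longest prefix ending in a digit removed)
echo "${file%-*}"        # prints directory/photo (shortest suffix starting at the last - removed)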

Faster way to list files with similar names (using bash)?

I have a directory with more than 20K files all with a random number prefix (eg 12345--name.jpg). I want to find files with similar names and remove all but one. I don't care which one because they are duplicates.
To find duplicated names I've used
find . -type f \( -name "*.jpg" \) | | sed -e 's/^[0-9]*--//g' | sort | uniq -d
as the list of a for/next loop.
To find all but one to delete, I'm currently using
rm $(ls -1 *name.jpg | tail -n +2)
This operation is pretty slow. I want to speed this up. Any suggestions?
I would do it like this.
Note that you are dealing with the rm command, so make sure you have a backup of the existing directory in case something goes south.
Create a backup directory and take a backup of the existing files. Once done, check that all the files are there.
mkdir bkp_dir; cp *.jpg bkp_dir/
Create another temp directory where we will keep only one file for each similar name, so all unique file names will be here.
$ mkdir tmp
$ for i in $(ls -1 *.jpg|sed 's/^[[:digit:]].*--\(.*\.jpg\)/\1/'|sort|uniq);do cp $(ls -1|grep "$i"|head -1) tmp/ ;done
Explanation of the command is at the end. Once executed, check the tmp/ directory to confirm you got unique instances of the files.
Remove all *.jpg files from the main directory. Again, please verify that all files have been backed up before executing the rm command.
rm *.jpg
Copy the unique instances back from the temp directory.
cp tmp/*.jpg .
Explanation of command in step 2.
Command to get unique file names for step 2 will be
for i in $(ls -1 *.jpg|sed 's/^[[:digit:]].*--\(.*\.jpg\)/\1/'|sort|uniq);do cp $(ls -1|grep "$i"|head -1) tmp/ ;done
$(ls -1 *.jpg|sed 's/^[[:digit:]].*--\(.*\.jpg\)/\1/'|sort|uniq) will get the unique file names like file1.jpg , file2.jpg
for i in $(...);do cp $(ls -1|grep "$i"|head -1) tmp/ ;done will copy one file for each filename to tmp/ directory.
You should not be using ls in scripts and there is no reason to use a separate file list like in userunknown's reply.
keepone () {
    shift
    rm "$@"
}
keepone *name.jpg
If you are running find to identify the files you want to isolate anyway, traversing the directory twice is inefficient. Filter the output from find directly.
find . -type f -name "*.jpg" |
awk '{ f=$0; sub(/^\.\//, "", f); sub(/^[0-9]*--/, "", f); if (a[f]++) print }' |
xargs echo rm
Take out the echo if the results look like what you expect.
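For a quick illustration of what the awk filter keeps and what it prints for removal (hypothetical names, just to show the grouping):
# The first occurrence of each stripped name is kept silently; later ones
# are printed so they end up in the rm list.
printf '%s\n' ./111--cat.jpg ./222--cat.jpg ./333--dog.jpg |
awk '{ f=$0; sub(/^\.\//, "", f); sub(/^[0-9]*--/, "", f); if (a[f]++) print }'
# output: ./222--cat.jpg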
As an aside, the /g flag to sed is useless for a regex which can only match once. The flag says to replace all occurrences on a line instead of the first occurrence on a line, but if there can be only one, the first is equivalent to all.
Assuming no subdirectories and no whitespace-in-filenames involved:
find . -type f -name "*.jpg" | sed -e 's/^\.\///' -e 's/^[0-9]*--//' | sort | uniq -d > namelist
removebutone () { shift; echo rm "$@"; }; cat namelist | while read -r n; do removebutone *--"$n"; done
or, better readable:
removebutone () {
    shift
    echo rm "$@"
}
cat namelist | while read -r n; do removebutone *--"$n"; done
Shift takes the first parameter from $* off.
Note that the parens around the -name parameter are superfluous, and that there shouldn't be two pipes before sed. Maybe you had something else there which needed to be covered.
If it looks promising, you have, of course, to remove the 'echo' in front of 'rm'.

sorting output of find before running the command in -exec

I have a series of directories containing multiple mp3 files with filenames 001.mp3, 002.mp3, ..., 030.mp3.
What I want to do is to put them all together in order into a single mp3 file and add some meta data to that.
Here's what I have at the moment (removed some variable definitions, for clarity):
#!/bin/bash
for d in */; do
cd $d
find . -iname '*.mp3' -exec lame --decode '{}' - ';' | lame --tt "$title_prefix$name" --ty "${name:5}" --ta "$artist" --tl "$album" -b 64 - $final_path"${d%/}".mp3
cd ..
done
Sometimes this works and I get a single file with all the "tracks" in the correct order.
However, more often than not I get a single file with all the "tracks" in reverse order, which really isn't good.
What I can't understand is why the order varies between different runs of the script, as all the directories contain the same set of filenames. I've pored over the man page and can't find a sort option for find.
I could run find . -iname '*.mp3' | sort -n >> temp.txt to put the files in a temporary file and then try and loop through that, but I can't get that to work with lame.
Is there any way I can put a sort in before find runs the exec? I can find plenty of examples here and elsewhere of doing this with -exec ls but not where one needs to execute something more complicated with exec.
find . -iname '*.mp3' -print0 | sort -zn | xargs -0 -I '{}' lame --decode '{}' - | lame --tt "$title_prefix$name" --ty "${name:5}" --ta "$artist" --tl "$album" -b 64 - $final_path"${d%/}".mp3
Untested but might be worth a try.
Normally xargs appends the arguments to the end of the command you give it. The -I option tells it to replace the given string instead ({} in this case).
Edit: I've added -print0, -z, -0 to make sure the pipeline still works even if your filenames contain newlines.
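A tiny standalone demo of that -I behaviour, with throwaway input:
# With -I, xargs runs the command once per input item and substitutes {}
# wherever it appears, instead of appending all items at the end.
printf '%s\n' one two three | xargs -I '{}' echo "track: {}"
# track: one
# track: two
# track: three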

Perform deletion of rows for all files in a directory

Currently I am trying to delete the last two rows of every file in a specific directory I am working on. I know how to delete the last two rows in it:
head -n -2 myfile.txt
but how can I perform that task for all files in my dir without retyping it and listing the filenames?
You can use find and do it one by one:
find -exec head -n -2 {} \;
This line uses find to find all the files in the directory, and then executes (-exec) the command head -n -2 {}, where {} stands for each filename found; the \; at the end is essential, it means run the command once per file.
UPDATE:
I didn't notice that the way you use head does not actually write back to the file. If you want to delete the last two lines in place, that is not as easy; a one-liner could be subtle:
find . -type f -exec bash -c 'c=$(wc -l < "$1"); sed -i "$((c-1)),${c}d" "$1"' bash {} \;
It sounds better to use a for loop:
for file in folder/*; do
    c=$(wc -l < "$file")
    sed -i "$((c-1)),${c}d" "$file"
done
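If the wc/sed arithmetic feels fragile, another hedged option is to let head do the trimming via a temporary file (assumes GNU head, which accepts a negative line count, and files with at least two lines):
# Write everything except the last two lines to a temp file, then replace the original.
for file in folder/*; do
    head -n -2 "$file" > "$file.tmp" && mv "$file.tmp" "$file"
done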

unix command to find most recent directory created

I want to copy the files from the most recent directory created. How would I do so in unix?
For example, if I have the directories names as date stamp as such:
/20110311
/20110318
/20110325
This is the answer to the question I think you are asking.
When I deal with many directories that have date/time stamps in the name, I always take the approach you have, which is YYYYMMDD - the great thing about that is that date order is then also alphabetical order. In most shells (certainly in bash, and I am 90% sure of the others), the '*' expansion is done alphabetically, and by default 'ls' returns alphabetical order. Hence
ls | head -1
ls | tail -1
Give you the earliest and the latest dates in the directory.
This can be extended to only keep the last 5 entries etc.
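So for the copying part of the question, a minimal sketch on top of that (assuming the date-stamped directories are the only entries, and /some/target is a placeholder destination):
newest=$(ls | tail -1)      # alphabetically last = most recent for YYYYMMDD names
cp "$newest"/* /some/target/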
lastdir=`ls -tr <parentdir> | tail -1`
After some experimenting, I came up with the following:
The unix stat command is useful here. The '-t' option causes stat to print its output in terse mode (all in one line), and the 13th element of that terse output is the unix timestamp (seconds since epoch) for the last-modified time. This command will list all directories (and sub-directories) in order from newest-modified to oldest-modified:
find -type d -exec stat -t {} \; | sort -r -n -k 13,13
Hopefully the "terse" mode of stat will remain consistent in future releases of stat !
Here's some explanation of the command-line options used:
find -type d # only find directories
find -exec [command] {} \; # execute given command against each *found* file.
sort -r # reverse the sort
sort -n # numeric sort (100 should not appear before 2!)
sort -k M,N # only sort the line using elements M through N.
Returning to your original request, to copy files, maybe try the following. To output just a single directory (the most recent), append this to the command (notice the initial pipe), and feed it all into your 'cp' command with backticks.
| head --lines=1 | sed 's/\ .*$//'
The trouble with the ls based solutions is that they are not filtering just for directories. I think this:
cp `find . -mindepth 1 -maxdepth 1 -type d -exec stat -c "%Y %n" {} \; |sort -n -r |head -1 |awk '{print $2}'`/* /target-directory/.
might do the trick, though note that that will only copy files in the immediate directory. If you want a more general answer for copying anything below your newest directory over to a new directory I think you would be better off using rsync like:
rsync -av `find . -mindepth 1 -maxdepth 1 -type d -exec stat -c "%Y %n" {} \; |sort -n -r |head -1 |awk '{print $2}'`/ /target-directory/
but it depends a bit which behaviour you want. The explanation of the stuff in the backticks is:
. - the current directory (you may want to specify an absolute path here)
-mindepth/-maxdepth - restrict the find command only to the immediate children of the current directory
-type d - only directories
-exec stat .. - outputs the modified time and the name of the directory from find
sort -n -r |head -1 | awk '{print $2}' - date orders the directory and outputs the name of the most recently modified
If your directories are named YYYYMMDD like your question suggests, take advantage of the alphabetic globbing.
Put all directories in an array, and then pick the first one:
dirs=(*/); first_dir="$dirs";
(This is actually a shortcut for first_dir="${dirs[0]}";.)
Similarly, for the last one:
dirs=(*/); last_dir="${dirs[$((${#dirs[@]} - 1))]}";
Ugly syntax, but this is what it breaks down to:
# Create an array of all directories inside the working directory.
dirs=(*/);
# Get the number of entries in the array.
num_dirs=${#dirs[@]};
# Calculate the index of the last entry.
last_index=$(($num_dirs - 1));
# Get the value at the last index.
last_dir="${dirs[$last_index]}";
I know this is an old question with an accepted answer, but I think this method is preferable as it does everything in Bash. No reason to spawn extra processes, let alone parse the output of ls. (Which, admittedly, should be fine in this particular case of YYYYMMDD names.)
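Tying that back to the copying question, a hedged one-liner built on the same array idea (the destination path is a placeholder):
dirs=(*/)                                  # alphabetical, which is date order for YYYYMMDD names
cp "${dirs[${#dirs[@]} - 1]}"* /some/target/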
Please try the following command:
ls -1tr | tail -1
find ~ -type d -exec ls -ltrd {} +
This one is simple and useful, which I learned recently. It shows the directories sorted by modification time, oldest first, so the most recent ones are at the bottom. Note that piping find into ls does not work, because ls ignores its standard input; the names have to be handed to it with -exec.
I wrote a command that can be used to identify which folder or file in a directory is the newest. That seems clean :)
#!/bin/sh
path=/var/folder_name
newest=`find "$path" -maxdepth 1 -exec stat -t {} \; | sed 1d | sort -rn -k 14 | head -1 | awk '{print $1}' | sed 's/\.\///g'`
find "$path" -maxdepth 1 | sed 1d | grep -v "$newest"
