How to find duplicated jpgs by content?

I'd like to find and remove an image in a series of folders. The problem is that the image names are not necessarily the same.
What I did was to copy an arbitrary string from the images bytecode and use it like
grep -ir 'YA'uu�KU���^H2�Q�W^YSp��.�^H^\^Q��P^T' .
But since there are thousands of images, this method takes forever. Also, some of the images were created from the original with ImageMagick, so I cannot use file size to find them all.
So I'm wondering: what is the most efficient way to do this?

Updated Answer
If you have the checksum of a specific file in mind that you want to compare with, you can checksum all files in all subdirectories and find the one that is the same:
find . -name \*.jpg -exec bash -c 's=$(md5 < {}); echo $s {}' \; | grep "94b48ea6e8ca3df05b9b66c0208d5184"
Or this may work for you too:
find . -name \*.jpg -exec md5 {} \; | grep "94b48ea6e8ca3df05b9b66c0208d5184"
Original Answer
The easiest way is to generate an md5 checksum once for each file. Depending on how your md5 program works, you would do something like this:
find . -name \*.jpg -exec bash -c 's=$(md5 < {}); echo $s {}' \;
94b48ea6e8ca3df05b9b66c0208d5184 ./a.jpg
f0361a81cfbe9e4194090b2f46db5dad ./b.jpg
c7e4f278095f40a5705739da65532739 ./c.jpg
Or maybe you can use
md5 -r *.jpg
94b48ea6e8ca3df05b9b66c0208d5184 a.jpg
f0361a81cfbe9e4194090b2f46db5dad b.jpg
c7e4f278095f40a5705739da65532739 c.jpg
Now you can sort this output and use uniq to find all duplicates.
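For instance, a minimal sketch of that sort/uniq step (assuming GNU coreutils, where the checksum tool is md5sum and uniq supports -w/-D; on macOS you would use md5 -r instead):

```shell
# Build a small test tree with two identical files (assumption:
# GNU coreutils md5sum/uniq; on macOS substitute "md5 -r").
tmp=$(mktemp -d)
printf 'same' > "$tmp/a.jpg"
printf 'same' > "$tmp/b.jpg"
printf 'diff' > "$tmp/c.jpg"

# Checksum every file, sort so identical hashes become adjacent,
# then let uniq print every line whose first 32 characters (the
# hash) repeat: those lines are the duplicates.
dupes=$(find "$tmp" -name '*.jpg' -exec md5sum {} + | sort | uniq -w32 -D)
echo "$dupes"

rm -rf "$tmp"
```

Only the two identical files survive the filter; the unique one is dropped.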

Related

sorting output of find before running the command in -exec

I have a series of directories containing multiple mp3 files with filenames 001.mp3, 002.mp3, ..., 030.mp3.
What I want to do is to put them all together in order into a single mp3 file and add some meta data to that.
Here's what I have at the moment (removed some variable definitions, for clarity):
#!/bin/bash
for d in */; do
cd $d
find . -iname '*.mp3' -exec lame --decode '{}' - ';' | lame --tt "$title_prefix$name" --ty "${name:5}" --ta "$artist" --tl "$album" -b 64 - $final_path"${d%/}".mp3
cd ..
done
Sometimes this works and I get a single file with all the "tracks" in the correct order.
However, more often than not I get a single file with all the "tracks" in reverse order, which really isn't good.
What I can't understand is why the order varies between different runs of the script, as all the directories contain the same set of filenames. I've pored over the man page and can't find a sort option for find.
I could run find . -iname '*.mp3' | sort -n >> temp.txt to put the files in a temporary file and then try and loop through that, but I can't get that to work with lame.
Is there any way I can put a sort in before find runs the exec? I can find plenty of examples here and elsewhere of doing this with -exec ls but not where one needs to execute something more complicated with exec.
find . -iname '*.mp3' -print0 | sort -zn | xargs -0 -I '{}' lame --decode '{}' - | lame --tt "$title_prefix$name" --ty "${name:5}" --ta "$artist" --tl "$album" -b 64 - $final_path"${d%/}".mp3
Untested but might be worth a try.
Normally xargs appends the arguments to the end of the command you give it. The -I option tells it to replace the given string instead ({} in this case).
Edit: I've added -print0, -z, -0 to make sure the pipeline still works even if your filenames contain newlines.
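To see the difference -I makes, here is a tiny sketch (plain echo stands in for the lame command):

```shell
# Without -I, xargs appends all arguments after the command,
# running it once for the whole batch:
out1=$(printf 'a\nb\n' | xargs echo prefix)

# With -I {}, xargs runs the command once per input line and
# substitutes the line wherever {} appears:
out2=$(printf 'a\nb\n' | xargs -I {} echo "item: {}")

echo "$out1"
echo "$out2"
```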

(mogrify) ln -s creating copies of files

Running the following script:
for i in $(find dir -name "*.jpg"); do
ln -s $i
done
incredibly makes symbolic links for 90% of the files and makes a copy of the remaining 10%. How's that possible?
Edit: what's happening after is relevant:
Those are links to images that I rotate through mogrify e.g.
mogrify -rotate 90 link_to_image
It seems that mogrify applied to a link silently replaces it with a copy of the image; a debatable choice, but that's how it behaves.
Skip the first paragraph if you only want to know about processing files with spaces in their names.
At first it was not clear what the root of the problem was, and our assumption was that spaces in the filenames were to blame: that files containing them were not being processed correctly.
The real problem was mogrify, which, applied to the created links, processed them and replaced them with real files.
It was not about spaces in the filenames at all.
Processing of files with spaces in their names
That happens because of spaces in the names of the files.
You can write something like this:
find dir -name \*.jpg | while IFS= read i
do
ln -s "$i"
done
(IFS= is used here to avoid stripping leading spaces; thanks to @Alfe for the tip).
Or use xargs.
If it is possible that names contain newlines ("\n"), it's better to use -print0:
find dir -name \*.jpg -print0 | xargs -0 -n1 ln -s
Of course, you can use other methods also, for example:
find dir -name '*.jpg' -exec ln -s "{}" \;
ln -s "$(find dir -name '*.jpg')" .
(Note that this last form only works correctly when find matches a single file.)
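A quick sketch of why the unquoted for loop misbehaves while the read loop does not (using a hypothetical filename containing a space):

```shell
tmp=$(mktemp -d)
touch "$tmp/has space.jpg"

# Naive loop: word splitting breaks the single path into two bogus
# arguments, so the loop body runs twice for one file.
naive=0
for i in $(find "$tmp" -name '*.jpg'); do naive=$((naive + 1)); done

# Safe loop: read consumes one whole line per filename, so the body
# runs exactly once (iterations counted by echoing a marker per file).
safe=$(find "$tmp" -name '*.jpg' | while IFS= read -r i; do echo x; done | wc -l)

echo "naive=$naive safe=$safe"
rm -rf "$tmp"
```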
(ImageMagick) mogrify applied on a link deletes the link and makes a copy of the image
Try with single quotes:
find dir -name '*.jpg' -exec ln -s "{}" \;

Grepping from a text file list

I know I can find specific types of files and then grep them in one shot, i.e.
find . -type f -name "*.log" -exec grep -o "some-pattern" {} \;
But I need to do this in two steps. This is because the find operation is expensive (there are lots of files and subdirectories to search). I'd like to save down the file-list to a text file, and then repeatedly grep for different patterns on this precomputed set of files whenever I need to. The first part is easy:
find . -type f -name "*.log" > my-file-list.txt
Now I have a file that looks like this:
./logs/log1.log
./logs/log2.log
etc
What does the grep look like? I've tried a few combinations but can't get it right.
xargs grep "your pattern" < my-file-list.txt
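Put together, the two-step workflow looks like this (a sketch; the -d '\n' flag is GNU xargs and keeps paths containing spaces intact):

```shell
tmp=$(mktemp -d)
mkdir "$tmp/logs"
printf 'error: disk full\n' > "$tmp/logs/log1.log"
printf 'all good\n'         > "$tmp/logs/log2.log"

# Step 1: pay the expensive find cost once.
find "$tmp" -type f -name '*.log' > "$tmp/my-file-list.txt"

# Step 2: grep the precomputed list as often as needed;
# -l prints only the names of files that match.
hits=$(xargs -d '\n' grep -l 'error' < "$tmp/my-file-list.txt")
echo "$hits"

rm -rf "$tmp"
```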

How to find files bigger than some size, and sort them by last modification date?

I need to write a script which lists every file in a selected directory that is bigger than some size. I also need to sort the files by size, by name, and by last modification date.
So I have made the first two cases:
Sort by size
RESULTS=`find $CATALOG -size +$SIZE | sort -n -r | sed 's_.*/__'`
Sort by name
RESULTS=`find $CATALOG -size +$SIZE | sed 's_.*/__' | sort -n `
But I have no idea how to sort results by last modification date.
Any help would be appreciated.
One of the best approaches, provided you don't have too many files, is to use ls to do the sorting itself.
Sort by name and print one file per line:
find $CATALOG -size +$SIZE -exec ls -1 {} +
Sort by size and print one file per line:
find $CATALOG -size +$SIZE -exec ls -S1 {} +
Sort by modification time and print one file per line:
find $CATALOG -size +$SIZE -exec ls -t1 {} +
You can also play with the ls switches: Sort by modification time (small first) with long listing format, with human-readable sizes:
find $CATALOG -size +$SIZE -exec ls -hlrt {} +
Oh, you might want to only find the files (and ignore the directories):
find $CATALOG -size +$SIZE -type f -exec ls -hlrt {} +
Finally, some remarks: avoid capitalized variable names in bash (it's considered bad practice) and avoid backticks; use $(...) instead. E.g.,
results=$(find "$catalog" -size +$size -type f -exec ls -1rt {} +)
Also, you probably don't want to put all the results in a string like the previous line. You probably want to put the results in an array. In that case, use mapfile like this:
mapfile -t results < <(find "$catalog" -size +$size -type f -exec ls -1rt {} +)
Try xargs (which runs a command, treating its STDIN as a list of arguments) together with the -t and -r flags to ls.
i.e. something like this:
find $CATALOG -size +$SIZE | xargs ls -ltr
That will give you the files sorted by last modification date.
Sorting by multiple attributes at once is going to be really awkward to do with shell utilities and pipes, though; I think you'll need to use a scripting language (Ruby, Perl, PHP, whatever), unless your shell fu is strong.
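If you have GNU find, one more option worth noting is to avoid parsing ls output altogether: -printf can emit the modification time as a sortable numeric key (a sketch; %T@ is GNU-specific):

```shell
tmp=$(mktemp -d)
printf 'aaaa'  > "$tmp/old.txt"; touch -t 202001010000 "$tmp/old.txt"
printf 'bbbbb' > "$tmp/new.txt"; touch -t 202101010000 "$tmp/new.txt"

# Print "<epoch mtime> <path>", sort numerically on the key,
# then strip the key off again.
by_mtime=$(find "$tmp" -type f -size +0c -printf '%T@ %p\n' | sort -n | cut -d' ' -f2-)
echo "$by_mtime"

rm -rf "$tmp"
```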

How can I convert JPG images recursively using find?

Essentially what I want to do is search the working directory recursively, then use the paths given to resize the images. For example, find all *.jpg files, resize them to 300x300 and rename to whatever.jpg.
Should I be doing something along the lines of $(find | grep *.jpg) to get the paths? When I do that, the output is directories not enclosed in quotation marks, meaning that I would have to insert them before it would be useful, right?
I use mogrify with find.
Let's say I need everything inside my nested folder/another/folder/*.jpg to be converted to *.png:
find . -name "*.jpg" -print0 | xargs -0 -I {} mogrify -format png {}
With a bit of explanation:
find . -name "*.jpg" -- to find all the jpgs inside the nested folders.
-print0 -- to print each filename without any nasty surprises (e.g. filenames containing spaces).
xargs -0 -I {} -- to process the files one by one with mogrify.
And lastly, the {} is just a placeholder for each filename that comes from find.
You can use something like this with GNU find:
find . -iname \*jpg -exec /your/image/conversion/script.sh {} +
This will be safer in terms of quoting, and spawn fewer processes. As long as your script can handle the length of the argument list, this solution should be the most efficient option.
If you need to handle really long file lists, you may have to pay the price and spawn more processes. You can modify find to handle each file separately. For example:
find . -iname \*jpg -exec /your/image/conversion/script.sh {} \;
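For completeness, a minimal sketch of what /your/image/conversion/script.sh could contain (a hypothetical script; it assumes ImageMagick's mogrify is installed, so here it is only written out and syntax-checked, not run):

```shell
tmp=$(mktemp -d)

# Hypothetical conversion script: every argument is a path handed
# over by find's {} +, and each image is resized in place.
cat > "$tmp/script.sh" <<'EOF'
#!/bin/bash
for f in "$@"; do
    mogrify -resize 300x300 "$f"   # assumes ImageMagick is installed
done
EOF
chmod +x "$tmp/script.sh"

# Syntax-check the script before wiring it into find.
bash -n "$tmp/script.sh" && echo "script OK"
```

Because the script loops over "$@", it works with both the batching form (`{} +`) and the one-at-a-time form (`{} \;`).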
