Duplicate photo searching by comparing only pure image data, plus image similarity?

I have approximately 600 GB of photos collected over 13 years, now stored on a FreeBSD/ZFS server.
The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo manipulation programs (iPhoto, Picasa, HP and many others :( ), spread across deep subdirectories - in short: a TERRIBLE MESS with many duplicates.
So as a first pass I:
searched the tree for files with the same size (fast) and computed MD5 checksums for those
collected the duplicate images (same size + same MD5 = duplicate) - a rough sketch of that pipeline is below
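Roughly this kind of pipeline (a reconstruction, not my exact commands; it assumes GNU find/xargs/uniq, e.g. from ports - the base FreeBSD tools need slightly different flags, and /tank/photos is a placeholder path):
# list size and path for every file, then keep only files whose size occurs more than once
find /tank/photos -type f -printf '%s\t%p\n' | sort -n > sizes.txt
awk -F'\t' 'NR==FNR { c[$1]++; next } c[$1] > 1 { print $2 }' sizes.txt sizes.txt > candidates.txt
# checksum only those candidates and group identical sums together
xargs -d '\n' md5sum < candidates.txt | sort | uniq -w32 --all-repeated=separate > duplicate-groups.txt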
This helped a lot, but there are still MANY MANY duplicates:
photos that differ only in the EXIF/IPTC data added by some photo management software, but where the image itself is the same (or at least "looks the same" and has the same dimensions)
or that are only resized versions of the original image
or that are "enhanced" versions of the originals, etc.
Now the questions:
how can I find duplicates by checksumming only the "pure image bytes" in a JPG, without the EXIF/IPTC and similar meta information? In other words, I want to filter out the photo duplicates that differ only in their EXIF tags while the image itself is the same (so file checksumming doesn't work, but image checksumming could...). This is (I hope) not very complicated - but I need some direction.
Which Perl module can extract the "pure" image data from a JPG file in a form usable for comparison/checksumming?
More complex:
how to find "similar" images that are only
resized versions of the originals
or "enhanced" versions of the originals (from some photo manipulation programs)
Is there already an algorithm available as a Unix command or a Perl module (XS?) that I can use to detect these special "duplicates"?
I'm able to write complex scripts in BASH and "+-" :) know Perl. I can use FreeBSD/Linux utilities directly on the server, and over the network I can use OS X (but working with 600 GB over the LAN is not the fastest way)...
My rough idea:
delete images only at the end of the workflow
use an Image::ExifTool script to collect candidate duplicates based on image creation date and camera model (maybe other EXIF data too) - see the sketch after this list
make a checksum of the pure image data (or extract the histogram - identical images should have identical histograms) - not sure about this
use some similarity detection to find duplicates based on resizing and photo enhancement - no idea how to do this...
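A rough, untested sketch of what I mean for points 2 and 3, using the exiftool command (from Image::ExifTool) and ImageMagick's identify - paths and the choice of tags are just placeholders, and if I read the docs right the %# format escape hashes only the decoded pixel data, ignoring EXIF/IPTC:
# collect capture date + camera model + path for grouping candidate duplicates
exiftool -r -f -p '$DateTimeOriginal|$Model|$Directory/$FileName' /tank/photos > exif-index.txt
# checksum only the pixel data: identical images give identical signatures regardless of metadata
identify -format '%# %i\n' /tank/photos/some-dir/*.jpg | sort > pixel-sums.txt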
Any idea, any help, any software/algorithm hint on how to bring order to this chaos?
PS:
Here is a nearly identical question: Finding Duplicate image files, but I'm already done with that answer (md5) and am looking for more precise checksumming and image-comparison algorithms.

Assuming you can work with a locally mounted FS:
rmlint : the fastest tool I've ever used to find exact duplicates (example invocations for it and findimagedupes below)
findimagedupes : automates the whole ImageMagick approach (apparently much like Randal Schwartz's script, which I haven't tested)
Detecting Similar and Identical Images Using Perceptual Hashes goes all the way (a great reference post)
dupeguru-pe (GUI) : dedicated tool that is fast and does an excellent job
geeqie (GUI) : I find it fast/excellent for finishing the job, using its granular deduplication options. It can also generate an ordered collection of images so that similar images are next to each other, allowing you to 'flip' between two of them to see the changes.
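Typical invocations of the two command-line tools, for reference (flags quoted from memory - double-check the man pages):
rmlint /tank/photos            # finds exact duplicates and writes an rmlint.sh removal script
findimagedupes -t 90% *.jpg    # perceptual fingerprints; prints groups of visually similar files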

Have you looked at this article by Randal Schwartz? He uses a Perl script with ImageMagick to build resized (4x4 RGB grid) versions of the pictures, which he then compares in order to flag "similar" pictures.
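This is not his script, but the core idea can be approximated in a few lines of shell with ImageMagick (an untested sketch; the glob and output file are placeholders):
# shrink each photo to a 4x4 RGB thumbnail and dump the 48 raw bytes as a hex fingerprint;
# identical images collapse to identical fingerprints, near-duplicates to nearly identical ones
for f in *.jpg; do
  printf '%s %s\n' "$(convert "$f" -strip -resize '4x4!' -depth 8 rgb:- | od -An -tx1 | tr -d ' \n')" "$f"
done | sort > fingerprints.txt
# exact collisions are almost certainly duplicates; for "similar" images, compare the byte
# values with a small tolerance instead of testing for exact equality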

You can remove EXIF data with mogrify -strip from the ImageMagick toolset. So you could, for each image, make a copy without EXIF, md5sum it, and then compare the md5sums.
When it comes to visually similar images - you can, for example, use compare (also from the ImageMagick toolset) to produce a black/white diff map, as described here, then make a histogram of the difference and check whether there is "enough" white for the images to count as different.
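An untested sketch of both ideas (filenames and the fuzz threshold are only examples):
# 1) metadata-insensitive checksum: write a stripped copy, then hash it
mkdir -p /tmp/stripped
for f in *.jpg; do convert "$f" -strip "/tmp/stripped/$f"; done
md5sum /tmp/stripped/*.jpg | sort > stripped-sums.txt
# (or skip the temp copies and let ImageMagick hash the decoded pixels directly)
identify -format '%# %f\n' *.jpg | sort
# 2) visual difference: the AE metric is the number of differing pixels, printed on stderr
compare -metric AE -fuzz 5% a.jpg b.jpg null: 2>&1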

I had a similar dilemma - several hundred gigs of photos and videos spread and duplicated over about a dozen drives. I know this may not be the exact approach you are looking for, but the FSlint Janitor application (on Ubuntu 16.x, then 18.x) was a lifesaver for me. I took the project in chunks, eventually cleaning it all up, and ended up with three complete sets (I wanted two off-site backups).
FSlint Janitor:
sudo apt install fslint

Related

How can I cluster multiple images in a folder?

Due to a hard drive failure I lost my sorted photos. I then recovered them using image recovery software, but now all the images are in one folder - possibly over 500 of them in that single folder.
The images also have customized names.
The images are not all the same file size.
The images are not all the same dimensions either.
Clustering them and separating them into new folders manually is too time-consuming. So, is there any online solution or software to automatically cluster them and move them into folders?
For example, take three image sets (the sample images are omitted here): Image set 1, Image set 2, Image set 3.
Within each set, every image has the same background, so those images should be clustered together and put into one folder.
Along these lines, is there any solution, or an API-level solution, to simplify the manual work?
If they are JPEG images, you can try running jhead on them; it should be able to find the dates in the files. See jhead.
It can then rename the files based on the date for you, and then you could separate them by their names/dates.
It may also tell you the GPS latitude/longitude, so you could move them into folders based on their proximity to each other.
Try the -v option to see the full information in a file:
jhead -v recovered123.jpg
Get the time information from the EXIF metadata.
Use this to automatically name and sort the images. Since you likely did not operate two cameras at two different events at the same time, this will work extremely well - unless you managed to destroy that metadata.
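For example (the format string is just a suggestion - see the jhead man page for the exact codes):
jhead -n%Y%m%d-%H%M%S recovered*.jpg   # rename each file after its EXIF capture time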

overlay one pdf with another from the command line: pdftk alternative?

I use a bash script to auto-generate a PDF calendar each month. I use the wonderful remind program as the basis for this routine. Great as the calendars from that program are, I need a more detailed header for the calendar (more than just the name of the month and the year). I couldn't puzzle out a way to get the remind program to enhance the header, but I was able to get the enhanced results I wanted by creating a second PDF containing the header enhancements I need, then overlaying that PDF onto the calendar I produce with remind, via the pdftk utility (pdftk calendar.pdf stamp calendar_overlay.pdf output MONTH-YEAR-cal.pdf). Unfortunately, I recently lost the ability to use pdftk, since keeping it on my system would have required me to stop doing other system updates. In short, I had to remove it in order to continue updating my system.
So now I'm looking for some alternative that I can incorporate into my bash script. I am not finding any utility that will let me overlay one PDF with another the way pdftk does. It seems I may be able to do something like what I'm after using ImageMagick (convert), though I would likely need to overlay the PDF with an image file like a .jpg rather than with another PDF. Another possible solution may be to use TeX/LaTeX to insert text into the PDF, as described at https://rsmith.home.xs4all.nl/howto/adding-text-or-graphics-to-a-pdf-file.html.
I wanted to ask here, before investing a lot of time and effort into pursuing one or the other of the two potential options I've identified, whether there is some other way, using command-line tools that can be incorporated into a bash script, to overlay one PDF with another in the manner described? Input will be appreciated.
LATER EDIT: another link with indications of how to do such things using LaTeX: https://askubuntu.com/questions/712691/batch-add-header-footer-to-pdf-files
Assuming for simplicity that both of your files are of size 500pt x 200pt,
you can use pdfjam with nup and delta options to trick it into overlaying your source pdf files.
pdfjam bottom.pdf top.pdf --outfile merged.pdf \
  --nup "1x2" \
  --noautoscale true \
  --delta "0 -200pt" \
  --papersize "{500pt, 200pt}"
Unfortunately, I've found in my tests that I needed to increase the y delta by one point to get perfect alignment.
pdftk-java is a Java-based port of pdftk that appears to be in active development. Given that its only real requirement is Java 7+, it should work even in environments such as yours that no longer support pdftk's requirements, as long as a Java runtime is installed.
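Once installed (the package usually provides the same pdftk command name, but check your distribution), your original pipeline should work unchanged:
pdftk calendar.pdf stamp calendar_overlay.pdf output MONTH-YEAR-cal.pdf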

Separating icons from pictures in a bunch of random PNGs

A long time ago I screwed up my HDD and had to recover all my data, but I couldn't recover the files' names.
I used a tool to sort all these files by extension, and another to sort the JPGs by date, since the date a JPG was created is stored in the file itself. Unfortunately, I can't do that with PNGs.
So I have a lot of PNGs, but most of them are just icons or assets formerly used as data by the software I ran at the time. But I know there are other, "real" pictures among them that are valuable to me, and I would really love to get them back.
I'm looking for any tool, or any approach, anything you can think of, that would help me separate the trash from the good in this bunch of pictures - it would really be amazing of you.
Just so you know, I'm talking about 230 thousand files, for ~2 GB of data.
As an example of what I call trash: icons, interface graphics, and all that kind of image (the sample images are omitted here).
I'd like these to be separated from pictures of landscapes / people / screenshots - the kind of pictures you could have in your phone's gallery...
Thanks for reading, I hope you'll be able to help!
This simple ImageMagick command will tell you the:
height
width
number of colours
name
of every PNG in the current directory, separated by colons for easy parsing:
convert *.png -format "%h:%w:%k:%f\n" info:
Sample Output
600:450:5435:face.png
600:450:17067:face_sobel_magnitude.png
2074:856:2:lottery.png
450:450:1016:mask.png
450:450:7216:result.png
600:450:5435:scratches.png
800:550:471:spectrum.png
752:714:20851:z.png
If you are on macOS or Linux, you can easily run it under GNU Parallel to process 16 at a time, and you can parse the results easily with awk - but you may be on Windows.
You may want to change the \n at the end to \r\n under Windows if you are planning to parse the output.
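For example, a hypothetical awk filter that keeps only PNGs at least 500 px in both dimensions and with more than 10000 colours, i.e. likely real photos rather than icons (tune the thresholds to your collection):
convert *.png -format "%h:%w:%k:%f\n" info: | awk -F: '$1 >= 500 && $2 >= 500 && $3 > 10000 { print $4 }'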

Minimize disc activity with rmagick or imagemagick

I'm generating animated GIF files from multiple source images using Ruby. I need to maximize throughput / minimize the time spent creating each GIF. I'd prefer to keep the source images in memory (probably Memcached) rather than read them from disc every time I need them. I've been using convert in backticks to execute ImageMagick commands directly from Ruby, e.g.
`convert -delay #{delay} -page #{w}x#{h}+0+0 src01.gif... etc`
I slightly prefer this over RMagick as I've found more examples, and I can reference the ImageMagick docs directly. It seems that images passed to the convert command need to be paths to images on disc. Additionally, it seems like the output of the convert command is a file path, so the generated image is written to disc by ImageMagick and I'd need to read it back off disc with Ruby to access the resulting image data. So I'm apparently making ImageMagick read the source images from disc each time and write the generated GIF to disc each time. I think this is likely to be a bottleneck, and unnecessary, since I don't need to persist the generated images - I just need to access their image data in Ruby momentarily.
I noticed that RMagick methods can take Magick::Image objects as parameters instead of file paths, so I could keep the source images in memory in that case. Additionally, RMagick returns the generated image as data to Ruby, which is what I need - I don't need it written to disc.
I'm thinking of using RMagick instead of
`convert...`
to reduce disc activity.
So question 1: Does this make sense? Since RMagick presumably wraps ImageMagick, does RMagick actually read and write to disc under the hood, or does it have some way of using ImageMagick without disc activity?
And question 2: Is there any way to get image data into and out of ImageMagick's convert command without disc activity?
Hope this makes sense. Just trying to wrap my head around this; apologies if I'm unclear.
Does this make sense though?
Not really. We can argue about open fds and the cost of shell environments versus a direct API, but there wouldn't be any disk I/O benefit between the convert utility and RMagick.
Is there any way to get image data in and out of ImageMagick's convert command without disc activity?
ImageMagick ships with the stream utility. There's not much usage documentation, but it could be leveraged to extract the image data to a blob that can then be distributed via memcached.
There's also the mpr: protocol for label-based in-memory access, but that might not be the distributed solution you're looking for. Plus, the data is removed when the process completes.
Personally, Mark's comment about a RAM disk is what I would recommend. A simple memory/tmpfs mount is easy to set up on a system, and then it's just a matter of updating the policy.xml configuration to use that mount as the temporary directory.
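Something along these lines (untested; the mount point and size are arbitrary):
sudo mkdir -p /mnt/magick-tmp
sudo mount -t tmpfs -o size=2g tmpfs /mnt/magick-tmp
export MAGICK_TEMPORARY_PATH=/mnt/magick-tmp     # or set the temporary-path resource in policy.xml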

How to convert .epsf to .eps?

I'm looking for a method of converting .epsf to .eps for a publication I'm submitting. The submission site requires .eps (even though my understanding is that modern renderers should be able to read .epsf as well; the site is archaic, and I have to upload all 100 images individually). My co-author sent me the zipped files to upload (and now to convert) - I didn't make them myself. Furthermore, the programs that made these images may exist on my co-author's computer, but where exactly is uncertain.
I've tried this in Mathematica 8 with reasonable but not full success: colored files become black and white, and files with duplicate entries (e.g. Fig11a.eps and Fig11a.epsf both exist but are different; it seems the .eps is the background and the .epsf is the foreground layer) convert incorrectly. My approach was to import the .epsf files into Mathematica and export them as .eps.
I've also tried using a middle-man format - e.g. GIF/TIFF/PNG/JPG - with similar results. I haven't been able to find a free program that could do this (I assume Photoshop could pull it off), and I'd also like to do it as a batch. A method that requires Python/Mathematica or an XP/Linux OS would be fine. Thanks.
You do not need to convert anything. Encapsulated PostScript files can have either extension (both EPS and EPSF). If your publisher refuses to accept files with an EPSF extension, just rename them to EPS.
Any processing/conversion you do on the files (using Ghostscript, Mathematica, etc.) carries the risk of corrupting the graphics in some way - and there's no need for it. Send them as they are, or rename them if you prefer.
(If you have any doubt, you can check the EPS Format Specification from 1992, which says that on the Macintosh the recommended file extension is .epsf, while on DOS it's .EPS.)
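If you do want to rename the whole batch, a simple copy-and-rename loop is enough (the file contents are untouched):
for f in *.epsf; do cp -- "$f" "${f%.epsf}.eps"; done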
