How can I cluster multiple images in a folder?

Due to a hard disk failure I lost my sorted photos. I then recovered them using image recovery software, but now all the images are in one folder, possibly over 500 of them.
The images have customized names, and they vary in size and dimensions.
Clustering them and separating them into new folders manually is time consuming. So, is there any online solution or software to automatically cluster them and move them into folders?
For example:
Image set 1, Image set 2, Image set 3 (example images omitted).
Within each of the above sets, every image has the same background, so those images should be clustered together and put into one folder.
Along these lines, is there any solution, or an API-level solution, to simplify the manual work?

If they are JPEG images, you can try running jhead on them and it should be able to find the dates in the files. See jhead.
It can then rename the files based on the date for you, then you could separate them by their names/dates.
It may also tell you the GPS latitude/longitude, so you could move them to folders based on their proximity to each other.
Try the -v option to see the full information in a file:
jhead -v recovered123.jpg
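If the dates look right, jhead can also do the renaming in one pass; a hedged sketch, assuming a reasonably recent jhead (its -n option takes strftime-style date codes):
# Rename each JPEG to its EXIF date/time, e.g. 20140131-154502.jpg
jhead -n%Y%m%d-%H%M%S *.jpg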

Get the time information from the EXIF metadata.
Use this to automatically name and sort the images. Since you are unlikely to have operated two cameras at two different events at the same time, this will work extremely well, unless you managed to destroy this metadata.
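A minimal sketch of that idea, assuming exiftool is installed and the DateTimeOriginal EXIF tag survived the recovery (files without the tag are left where they are):
# Move every image into a YYYY-MM-DD folder derived from its EXIF date,
# recursing through the current directory.
exiftool '-Directory<DateTimeOriginal' -d %Y-%m-%d -r .
Images shot on the same day then land in the same folder, which gets you most of the way to the clustering asked for above.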

Separating icons from pictures in a bunch of random PNGs

A long time ago I screwed with my HDD and had to recover all my data, but I couldn't recover the files' names.
I used a tool to sort all these files by extension, and another to sort the JPGs by date, since the date when a JPG was created is stored in the file itself. I can't do this with PNGs, though, unfortunately.
So I have a lot of PNGs, but most of them are just icons or assets formerly used as data by software I ran at the time. But I know there are other, "real" pictures that are valuable to me, and I would really love to get them back.
I'm looking for any tool, or any way, just anything you can think of, that would help me separate the trash from the good in this bunch of pictures, it would really be amazing of you.
Just so you know, I'm speaking of 230 thousand files, for ~2GB of data.
As an example, what I call trash is small icons and interface assets (example images omitted), and all images of that kind.
I'd like these to be separated from pictures of landscapes / people / screenshots, the kind of pictures you could have in your phone's gallery...
Thanks for reading, I hope you'll be able to help !
This simple ImageMagick command will tell you the:
height
width
number of colours
name
of every PNG in the current directory, separated by colons for easy parsing:
convert *.png -format "%h:%w:%k:%f\n" info:
Sample Output
600:450:5435:face.png
600:450:17067:face_sobel_magnitude.png
2074:856:2:lottery.png
450:450:1016:mask.png
450:450:7216:result.png
600:450:5435:scratches.png
800:550:471:spectrum.png
752:714:20851:z.png
If you are on macOS or Linux, you can easily run it under GNU Parallel to get 16 done at a time and you can parse the results easily with awk, but you may be on Windows.
You may want to change the \n at the end for \r\n under Windows if you are planning to parse the output.
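For example, a sketch combining the two (the colour-count threshold of 64 is an assumption to tune by eye; find is used to avoid hitting the shell's argument-length limit with 230k files):
# Run 16 jobs at a time; print the names of PNGs with few colours,
# a crude heuristic for icons and interface assets.
find . -name '*.png' | parallel -j16 'convert {} -format "%h:%w:%k:%f\n" info:' | awk -F: '$3 < 64 {print $4}'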

How to reuse a path in PDF generation?

I am learning the PDF "syntax" and trying to create various PDF documents manually (on Windows 7, using Notepad++ to write the first unreferenced, broken file, then running it through pdftk to produce a valid file with updated references, as explained here...).
My learning materials:
PDF Reference 6th edition, version 1.7
+ various online resources.
My question: I would like to create a path only once in the document, then reuse it many times in other parts of the same document. E.g. I could define a logo once, then reuse it on different pages, maybe several times at different positions on the same page, maybe with different scaling factors... What is the best way to achieve this?
To give a better idea, I could define a logo (here a cross) that way :
3 0 obj
<< /Length 32>>
stream
10 0 10 30 re f
0 10 30 10 re S
endstream
endobj
And I would like to reuse the same "shape" (possibly with different scalings) in other places in the document, without having to specify the path again.
I am not looking for software to do the job (e.g. Acrobat...). I want to learn how to write this manually (then ask pdftk to fix the file).
You can look in this recently created GitHub repository for some examples of handcoded PDF files.
This file specifically re-uses a (very small, 2x2 pixels) image multiple times on one page, at different positions, with different skews, scalings and rotations.
(Screenshot of the rendered PDF omitted.)
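For a path specifically, the mechanism the PDF Reference provides is the Form XObject (section 4.9 of the 1.7 reference): wrap the content in an XObject once, then paint it with the Do operator under whatever transformation matrix is current. A minimal hand-written sketch (object numbers and the /Logo name are illustrative, not taken from the linked repository):
4 0 obj
<< /Type /XObject /Subtype /Form /BBox [0 0 30 30] /Length 32 >>
stream
10 0 10 30 re f
0 10 30 10 re S
endstream
endobj
Register it in the page's resource dictionary as << /XObject << /Logo 4 0 R >> >>, then paint it as many times as you like from the page's content stream:
q 1 0 0 1 100 700 cm /Logo Do Q
q 0.5 0 0 0.5 300 700 cm /Logo Do Q
The first line draws the cross at full size at (100, 700); the second draws it again at half scale at (300, 700).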

Duplicate photo search: comparing only pure image data, and image similarity?

I have approximately 600 GB of photos collected over 13 years, now stored on a FreeBSD ZFS server.
The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo manipulation programs (iPhoto, Picasa, HP and many others :( ), spread over several deep subdirectories; in short: a TERRIBLE MESS with many duplicates.
So as a first pass I:
searched the tree for files of the same size (fast) and computed MD5 checksums for those
collected the duplicated images (same size + same MD5 = duplicate)
This helped a lot, but there are still MANY MANY duplicates:
photos that differ only in the EXIF/IPTC data added by some photo management software, but where the image is the same (or at least "looks the same" and has the same dimensions)
or that are only resized versions of the original image
or that are "enhanced" versions of the originals, etc.
Now the questions:
how can I find duplicates by checksumming only the "pure image bytes" of a JPG, without EXIF/IPTC and similar meta information? I want to filter out the photo duplicates that differ only in EXIF tags but where the image itself is the same (file checksumming therefore doesn't work, but image checksumming might...). This is (I hope) not very complicated, but I need some direction.
Which Perl module can extract the "pure" image data from a JPG file in a form usable for comparison/checksumming?
More complex:
how can I find "similar" images that are only
resized versions of the originals
"enhanced" versions of the originals (from some photo manipulation program)?
Is there any algorithm already available as a Unix command or Perl module (XS?) that I can use to detect these special "duplicates"?
I'm able to write complex scripts in bash, and "+-" :) I know Perl. I can use FreeBSD/Linux utilities directly on the server, and over the network I can use OS X (but working with 600 GB over the LAN is not the fastest way)...
My rough idea:
delete images only at the end of the workflow
use an Image::ExifTool script to collect candidate duplicates based on image-creation date and camera model (maybe other EXIF data too)
make a checksum of the pure image data (or extract a histogram; identical images should have identical histograms) - not sure about this
use some similarity detection to find duplicates that come from resizing and photo enhancement - no idea how to do this...
Any ideas, help, or (software/algorithm) hints on how to bring order to the chaos?
PS:
Here is a nearly identical question: Finding Duplicate image files, but I'm already done with that answer (MD5) and am looking for more precise checksumming and image comparison algorithms.
Assuming you can work with a locally mounted FS:
rmlint: the fastest tool I've ever used to find exact duplicates (a minimal run is shown after this list)
findimagedupes: automates the whole ImageMagick approach (similar, it seems, to Randal Schwartz's script mentioned below, which I haven't tested)
Detecting Similar and Identical Images Using Perceptual Hashes goes all the way (a great reference post)
dupeguru-pe (GUI): a dedicated tool that is fast and does an excellent job
geeqie (GUI): I find it fast/excellent for finishing the job, using the granular deduplication options. You can then generate an ordered collection of images such that similar images are next to each other, allowing you to "flip" between two of them to see the changes.
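For the first item in the list, a minimal run is simply:
rmlint /path/to/photos
By default it doesn't delete anything itself; it writes a shell script (rmlint.sh) listing the duplicates for you to review and execute.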
Have you looked at this article by Randal Schwartz? He uses a Perl script with ImageMagick to create resized (4x4 RGB grid) versions of the pictures, which he then compares in order to flag "similar" pictures.
You can remove EXIF data with mogrify -strip from the ImageMagick toolset. So you could, for each image, make a copy without EXIF, md5sum it, and then compare the md5sums.
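A variant of the same idea that avoids writing stripped copies: ImageMagick's identify can print a signature computed from the decoded pixel values alone (the %# format escape), so metadata differences don't change it. A sketch, assuming GNU uniq:
# Pixel-data-only signature (64 hex characters) followed by the file name;
# then print groups of files whose signatures collide.
identify -quiet -format "%# %f\n" *.jpg | sort | uniq -w64 --all-repeated=separate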
When it comes to visually similar images, you can, for example, use compare (also from the ImageMagick toolset) to produce a black/white diff map, as described here, then make a histogram of the difference and check whether there is "enough" white for the images to count as different.
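A rough sketch of that comparison for a single pair (compare needs both inputs at the same dimensions, hence the resize; the 64x64 size and any RMSE threshold you pick are assumptions to tune):
# Normalise both images to a common size, then print the RMSE distance;
# values near zero suggest the same picture.
convert first.jpg -resize 64x64! /tmp/a.png
convert second.jpg -resize 64x64! /tmp/b.png
compare -metric RMSE /tmp/a.png /tmp/b.png null: 2>&1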
I had a similar dilemma - several hundred gigs of photos and videos spread and duplicated over about a dozen drives. I know this may not be the exact way you are looking for, but the FSlint Janitor application (on Ubuntu 16.x, then 18.x) was a lifesaver for me. I took the project in chunks, eventually cleaning it all up and ended up with three complete sets (I wanted two off-site backups).
FSLint Janitor:
sudo apt install fslint

Compressing a large number of small files in different folders

In our application, we save images in different folders like:
1
2
3
4
...
500
...
And inside each folder there is a large number of images, each about 5-20 KB in size.
We have found that when we try to transfer these files, we have to compress them first using WinRAR; however, this takes far too long: around two hours to compress one parent folder.
In fact, the images in the application are map images, like Google Maps tiles (example tiles omitted).
So I wonder: is there a good way to store/transfer this large number of small files?
Images like that are likely to be compressed already, so you will gain little bandwidth (and hence transfer speed) from the compression step.
If the compression process is taking a long time and keeping your CPU busy, then instead try just creating a plain tar file (which joins all the files into one archive without applying any compression). I don't know about WinRAR, but most other compression tools (like 7-Zip) can generate a tar file, so I'm guessing WinRAR can too.
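For illustration, on a Unix-like system that is a one-liner (the folder name is a placeholder):
# One uncompressed archive; no CPU is wasted recompressing already-compressed images.
tar -cf tiles.tar parent_folder/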
If you regularly transfer the whole set of files but only a small number are added/changed each time, you might want to look into other transfer methods like rsync. You don't describe either of your environments, so I can't tell whether it is likely to be available to you, but if it is, rsync does an excellent job of transferring only the changes (speeding up the transfer significantly). It also always uses a single connection, so you don't get hit by the per-file latency of FTP and other protocols: one file follows the previous one down the same connection, as if the transferred parts had been tarred together, so you don't need that extra step of packing the files at one end (and unpacking them at the other).
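A hedged example of such a transfer (host and paths are placeholders):
# -a preserves the directory layout; on repeat runs only new/changed tiles are sent.
rsync -a parent_folder/ user@remotehost:/data/tiles/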
Those images are already compressed. However, to increase transfer speed, you might try using rar in 'store' mode. This does the same thing as tar: it concatenates all the files into one big file. Don't apply any compression in your archive format.
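If I recall the switches correctly, store mode is selected with -m0 (check your rar version's help):
# -m0 = store: concatenate the files without compressing them
rar a -m0 tiles.rar parent_folder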
Maybe you could use a fast compression library like Snappy. However, it can only compress a single file, and you surely don't want to transfer each file separately; I'd create an uncompressed TAR archive for that.

Millions of small graphics files and how to overcome slow file system access on XP

I'm rendering millions of tiles which will be displayed as an overlay on Google Maps. The files are created by GMapCreator from the Centre for Advanced Spatial Analysis at University College London. The application renders files into a single folder at a time, and in some cases I need to create about 4.2 million tiles. I'm running it on Windows XP using an NTFS filesystem; the disk is 500 GB and was formatted using the default operating system options.
I'm finding that the rendering of tiles gets slower and slower as the number of rendered tiles increases. I have also seen that if I try to look at the folders in Windows Explorer or from the command line, the whole machine effectively locks up for a number of minutes before it recovers enough to do anything again.
I've been splitting the input shapefiles into smaller pieces, running on different machines and so on, but the issue is still causing me considerable pain. I wondered if the cluster size on my disk might be hindering the thing or whether I should look at using another file system altogether. Does anyone have any ideas how I might be able to overcome this issue?
Thanks,
Barry.
Update:
Thanks to everyone for the suggestions. The eventual solution involved writing a piece of code which monitored the GMapCreator output folder, moving files into a directory hierarchy based upon their filenames; so a file named abcdefg.gif would be moved into \a\b\c\d\e\f\g.gif. Running this at the same time as GMapCreator overcame the filesystem performance problems. The hint about the generation of DOS 8.3 filenames was also very useful; as noted below, I was amazed how much of a difference this made. Cheers :-)
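A minimal shell sketch of that mover (illustrative only; the real code ran on Windows and also watched the folder for new files):
# Turn abcdefg.gif into a/b/c/d/e/f/g.gif
for f in *.gif; do
  name="${f%.gif}"                           # abcdefg
  dir=$(echo "${name%?}" | sed 's|.|&/|g')   # a/b/c/d/e/f/
  mkdir -p "$dir"
  mv "$f" "$dir${name: -1}.gif"              # a/b/c/d/e/f/g.gif
done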
There are several things you could/should do:
Disable automatic NTFS short file name generation (see the fsutil command after the example below)
Or restrict file names to the 8.3 pattern (e.g. i0000001.jpg, ...)
In any case, try to make the first six characters of each filename as unique/different as possible
If you reuse the same folder over and over (say adding files, removing files, re-adding files, ...):
use contig to keep the directory's index file as unfragmented as possible (check this for an explanation)
especially when removing many files, consider using the folder-remove trick to reduce the directory index file size
As already posted, consider splitting the files across multiple directories,
e.g. instead of
directory/abc.jpg
directory/acc.jpg
directory/acd.jpg
directory/adc.jpg
directory/aec.jpg
use
directory/b/c/abc.jpg
directory/c/c/acc.jpg
directory/c/d/acd.jpg
directory/d/c/adc.jpg
directory/e/c/aec.jpg
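For the first suggestion in the list above, the switch on Windows XP is (run from an administrator prompt; it only affects files created after the change):
fsutil behavior set disable8dot3 1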
You could try an SSD....
http://www.crucial.com/promo/index.aspx?prog=ssd
Use more folders and limit the number of entries in any given folder. The time to enumerate the entries in a directory goes up (exponentially? I'm not sure) with the number of entries, and if you have millions of small files in the same directory, even doing something like dir folder_with_millions_of_files can take minutes. Switching to another FS or OS will not solve the problem; Linux had the same behavior last time I checked.
Find a way to group the images into subfolders of no more than a few hundred files each. Make the directory tree as deep as it needs to be in order to support this.
The solution is most likely to restrict the number of files per directory.
I had a very similar problem with financial data held in ~200,000 flat files. We solved it by storing the files in directories based on their names, e.g.
gbp97m.xls
was stored in
g/b/p97m.xls
This works fine provided your files are named appropriately (we had a good spread of characters to work with). The resulting tree of directories and files wasn't optimal in terms of distribution, but it worked well enough to reduce each directory to hundreds of files and relieve the disk bottleneck.
One solution is to implement haystacks. This is what Facebook does for photos, as the metadata lookups and random reads required to fetch a file are quite expensive and add no value for a simple data store.
Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.
