Separating icons from pictures in a bunch of random PNGs

A long time ago I messed up my HDD and had to recover all my data, but I couldn't recover the files' names.
I used a tool to sort all these files by extension, and another to sort the JPGs by date, since the date a JPG was created is stored in the file itself. Unfortunately, I can't do the same with PNGs.
So I have a lot of PNGs, but most of them are just icons or assets formerly used as data by the software I had at the time. But I know there are other, "real" pictures among them that are valuable to me, and I would really love to get them back.
I'm looking for any tool or any method, just anything you can think of, that would help me separate the trash from the good in this bunch of pictures; it would really be amazing of you.
Just so you know, I'm speaking of 230 thousand files, for ~2 GB of data.
As an example, this is what I call trash: [example icon images], and all those kinds of images.
I'd like these to be separated from pictures of landscapes / people / screenshots, the kind of pictures you could have in your phone's gallery...
Thanks for reading, I hope you'll be able to help!

This simple ImageMagick command will tell you the:
height
width
number of colours
name
of every PNG in the current directory, separated by colons for easy parsing:
convert *.png -format "%h:%w:%k:%f\n" info:
Sample Output
600:450:5435:face.png
600:450:17067:face_sobel_magnitude.png
2074:856:2:lottery.png
450:450:1016:mask.png
450:450:7216:result.png
600:450:5435:scratches.png
800:550:471:spectrum.png
752:714:20851:z.png
If you are on macOS or Linux, you can easily run it under GNU Parallel to process 16 files at a time, and you can parse the results easily with awk; but you may be on Windows.
You may want to change the \n at the end for \r\n under Windows if you are planning to parse the output.
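For example, here is a rough sketch of that Parallel + awk pipeline on macOS/Linux (the 256-pixel and 1000-colour thresholds are invented for illustration and will need tuning against your collection):
printf '%s\0' *.png | parallel -0 -q -j16 identify -format '%h:%w:%k:%f\n' {} \
  | awk -F: '$1 >= 256 && $2 >= 256 && $3 > 1000 {print $4}' > keepers.txt
Feeding the names through printf as a null-delimited stream keeps 230 thousand file names from overflowing the shell's argument-length limit, and the awk filter prints only files big and colourful enough to plausibly be photos.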

Related

Converting text file with spaces between CR & LF

I've never seen this line ending before and I am trying to load the file into a database.
The lines all have a fixed width. After the CSV text that contains the data (whose length varies line by line), there is a CR followed by multiple spaces and then a LF. The spaces provide the padding to equalize the line width.
Line1,Data 1,Data 2,Data 3,4,5       0D 20 20 20 20 20 0A
Line2,Data 11,Data 21,Data 31,41,51  0D 20 20 20 0A
Line3,Data12,Data22,Data 32,42,52    0D 20 20 20 20 0A
I am about to handle this with a stream reader / writer in C#, but there are 40 files that come in each month and if there is a way to convert them all at once instead of one line at a time, I would rather do that.
Any thoughts?
Line-by-line processing of a stream doesn't have to be a bottleneck if you implement it at the right point in your overall process.
When I've had to do this kind of preprocessing I put a folder watch on the inbound folder, then automatically pick up each file and process it upon arrival, putting the original into an archive folder and writing the processed file into another location from which data will be parsed or loaded into the database. Unless you have unusual real-time requirements, you'll never notice this kind of overhead. If you do have real-time requirements, this issue will pale in comparison to all the other issues you'll face with batched data files :)
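If you do go the preprocessing route, the transformation itself is a one-liner. A sketch assuming GNU sed is available (the inbound path and *.csv glob are hypothetical); the pattern \r *$ matches the CR plus the padding spaces at the end of each line, and one command handles all 40 files in place:
sed -i 's/\r *$//' /inbound/*.csv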
But you may not even have to go through a preprocessing step at all. You didn't indicate which database you will be using or how you plan to load the data, but many databases include utilities to process fixed-length records. In the past, fixed-format files came in every imaginable kind of bizarre layout (and contained all kinds of stuff that had to be stripped out or converted), so those utilities tend to be very efficient at this kind of task. In my experience they can easily be at least an order of magnitude faster than line-by-line processing, which can make a real difference on larger bulk loads.
If your database doesn't have good bulk import tools, there are a number of open-source or freeware utilities already written that do pretty much exactly what you need. You can find them on GitHub and other places. For example, NPM replace is here and zzzprojects findandreplace is here.
For a quick-and-dirty approach that lets you preview all the changes while you develop a more robust solution, many text editors can find and replace across multiple files. I've used that approach successfully in the past. For example, Notepad++ has a Find in Files window that lets you use a regex to remove or change whatever you like in all files matching defined criteria.
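For this particular file format, a sketch of such a replacement (in Notepad++, set the Search Mode to "Regular expression"):
Find what:    \r +\n
Replace with: \n
This drops the CR and the padding spaces while keeping the line break.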

overlay one pdf with another from the command line: pdftk alternative?

I use a bash script to auto-generate a pdf calendar each month. I use the wonderful remind program as the basis for this routine. As great as the calendars I get from that program are, I need a more detailed header for the calendar (than just the name of the month and the year). I couldn't puzzle out a way to get the remind program to enhance the header, but I was able to get the enhanced results I wanted by creating a second pdf containing the header enhancements I need, then overlaying that pdf onto the calendar I produce with remind, via the pdftk utility (pdftk calendar.pdf stamp calendar_overlay.pdf output MONTH-YEAR-cal.pdf). Unfortunately, I recently lost the ability to use pdftk, since keeping it on my system would have required me to stop doing other system updates. In short, I had to remove it in order to continue updating my system.
So now I'm looking for some alternative that I can incorporate into my bash script. I am not finding any utility that will allow me to overlay one pdf with another, like pdftk does. It seems I may be able to do something like what I'm after using ImageMagick (convert), though I would likely need to overlay the pdf with an image file like a .jpg rather than with a pdf. Another possible solution may be to use TeX/LaTeX to insert text into the pdf as described at https://rsmith.home.xs4all.nl/howto/adding-text-or-graphics-to-a-pdf-file.html.
I wanted to ask here, before investing a lot of time and effort into pursuing one or the other of the two potential options I've identified, whether there is some other way, using command-line options that can be incorporated into a bash script, of overlaying one pdf with another in the manner described. Input will be appreciated.
LATER EDIT: another link with indications of how to do such things using LaTeX: https://askubuntu.com/questions/712691/batch-add-header-footer-to-pdf-files
Assuming for simplicity that both of your files are of size 500pt x 200pt,
you can use pdfjam with the --nup and --delta options to trick it into overlaying your source pdf files.
pdfjam bottom.pdf top.pdf --outfile merged.pdf \
--nup "1x2" \
--noautoscale true \
--delta "0 -200pt" \
--papersize "{500pt, 200pt}"
Unfortunately, I've found in my tests that I needed to increase the y delta by one point to get perfect alignment.
pdftk-java is a Java-based port of pdftk which looks to be in active development. Given that its only real requirement appears to be Java 7+, it should work even in environments such as yours that no longer support the requirements of the original pdftk, so long as a Java runtime is installed.
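Since pdftk-java aims to be a drop-in replacement, your original invocation should work unchanged once its pdftk wrapper is on your PATH:
pdftk calendar.pdf stamp calendar_overlay.pdf output MONTH-YEAR-cal.pdf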

How do I decompress a .astc file with an additional .ccz extension? How do I view .ita files?

First, full disclosure: I'm very new to coding and very new to file dissecting, but it's something I anticipate studying in school very soon, so please pardon my ignorance in future interactions.
As a project I've decided to dissect the files of a mobile app I greatly enjoy. This app is Futurama: Worlds of Tomorrow. I'm a big fan of the cartoon, even spent money on the stuff, so I figured it was natural for me to pick.
Extracting the .apk file was easy; I found some of the assets they use in the game, like the music, the soundbites, and some .pngs. All simple stuff.
However, there are two files I'm absolutely baffled by: files with an .astc.czz extension, and an .ita file that is not an Italian read-me; the developers informed me that those are animation files.
Allow me to go into what I know and what I don't know:
Filename.astc.czz
Example file here
I recognize .astc as a compressed texture format and was informed that .astc files are common for mobile games. Fair enough, but the real extension is .czz, and that leads me to a dead end. I've found the ASTC Evaluation Codec by ARM-Software on GitHub, so I tried that. I renamed the file to end in .astc, and also tried keeping .czz, but the codec gives me an error every time. This is where I show my ignorance: I didn't know the right way to do this, so I'm showing you every combination I tried. I replaced my name with user.
C:\Users\user\Downloads\astc-encoder-master\Binary\Win32
λ astcenc -d C:\Users\user\Downloads\astc-encoder-master\Binary\Win32\AC0001-dialogue1-003#2x.astc C:\Users\user\Downloads\astc-encoder-master\Binary\Win32\AC0001-dialogue1-003#2x.tga
File C:\Users\user\Downloads\astc-encoder-master\Binary\Win32\AC0001-dialogue1-003#2x.astc not recognized
C:\Users\user\Downloads\astc-encoder-master\Binary\Win32
λ astcenc -d AC0001-dialogue1-003#2x.astc AC0001-dialogue1-003#2x.tga
File AC0001-dialogue1-003#2x.astc not recognized
C:\Users\user\Downloads\astc-encoder-master\Binary\Win32
λ astcenc -d C:\Users\user\Downloads\astc-encoder-master\Binary\Win32\AC0001-dialogue1-003#2x.astc.czz C:\Users\user\Downloads\astc-encoder-master\Binary\Win32\AC0001-dialogue1-003#2x.tga
Failed to open file C:\Users\user\Downloads\astc-encoder-master\Binary\Win32\AC0001-dialogue1-003#2x.astc.czz
C:\Users\user\Downloads\astc-encoder-master\Binary\Win32
λ astcenc -d AC0001-dialogue1-003#2x.astc.czz AC0001-dialogue1-003#2x.tga
Failed to open file AC0001-dialogue1-003#2x.astc.czz
No success there.
So then I learned that .czz files are apparently associated with visECAD Viewer, so I downloaded that, and the .astc.czz files became associated with the program. I tried opening them, but visECAD says it can't open them because they are "outdated." So that's another dead end.
Right, so that's all I know.
Filename.ita
Example file here
Out of curiosity I actually emailed the developers about this file (and the astc ones too), and they said those are the game's animations. They couldn't send me a viewer, which is perfectly fine, but I don't even know what .ita files are associated with that aren't Italian read-mes. Any insight would be appreciated; the animations are great and I would love to see them.
For full disclosure here are snippets of what the developers sent me:
Those strange file types are actually compressed files (like
".astc.ccz"). Different devices use different compression methods, so
we support many types to maintain low storage and memory usage. Some
devices don't use compression and just use .png versions of the same
file names.
The .lta files are the game's animations. I wish I could help you out
with viewing them, but there's no way for me to send you a viewer. :(
Well that's all folks, sorry it was so long, and thank you so much in advance. I'm grateful already!
I realise this is a few months old, but in case you're still interested, I've just cracked it. Basically, it's a compressed texture, the ccz part being the compression, and the astc being the texture format. I managed to decompress the file using QuickBMS (http://aluigi.altervista.org/quickbms.htm), using the following script for ccz files (copy the following into a txt file):
endian big                   # CCZ header fields are big-endian
comtype zlib_dynamic         # the payload is zlib-compressed
get ZSIZE asize              # total file size...
math ZSIZE - 0x10            # ...minus the 16-byte CCZ header = compressed size
get NAME basename            # name the output after the input file
idstring "\x43\x43\x5a\x21"  # verify the "CCZ!" magic bytes
goto 0xc                     # the uncompressed size is stored at offset 0xC
get SIZE long
clog NAME 0x10 ZSIZE SIZE    # inflate ZSIZE bytes at offset 0x10 into SIZE bytes
When you run QuickBMS, it will first ask for a script; point it to your new txt file. It will then ask for the file you want to decompress; point it at your ccz file. Finally, it will ask where you want to save your astc file.
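QuickBMS can also be driven non-interactively, which avoids the prompts when you have many files. A sketch, assuming the script above was saved as ccz.bms and out is an existing output folder:
quickbms ccz.bms AC0001-dialogue1-003#2x.astc.czz out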
Now you will need a program that can open astc files! I used this one, Noesis: http://www.richwhitehouse.com/index.php?content=inc_projects.php&showproject=91
Find your astc file (the interface is quite straightforward), then from there you can double click the file to open it, then right-click and export to a variety of formats. For proof of concept, here is the extracted pf0001-action5-001#4xout (PF being Philip Fry I assume). https://www.dropbox.com/s/t2l3mesi2psbd1p/pf0001-action5-001%404xout.png?dl=0
Both programs allow for batch processing as well, so you should have everything you need! The lta files are skeletal animations, I believe, so unfortunately the character animations are all in pieces; I'm looking into that next. Hope this helps!
EDIT: The above information is useful for your specific query, i.e. decompressing and reading the contents of those files. HOWEVER, if your end goal is to view the assets of the game, it's worth knowing that many of the assets are only downloaded AFTER the game is run, so looking in the "com.tinyco.futurama" folder on your Android device will show all kinds of assets not present in the apk file. Many of them will be already extracted, ready for gameplay, so I would highly recommend copying the contents of this folder periodically. I think it re-compresses unused assets as well, so I would copy out the ccz files too; either way you should reap the maximum benefits.

Duplicate photo searching: comparing only pure image data, and image similarity?

I have approximately 600 GB of photos, collected over 13 years, now stored on a FreeBSD ZFS server.
The photos come from family computers, from several partial backups to different external USB HDDs, from images reconstructed after disk disasters, and from different photo manipulation programs (iPhoto, Picasa, HP and many others :( ), in several deep subdirectories. In short: a TERRIBLE MESS with many duplicates.
So first I:
searched the tree for files of the same size (fast) and made MD5 checksums of those,
collected the duplicated images (same size + same MD5 = duplicate).
This helped a lot, but there are still MANY MANY duplicates:
photos that differ only in the EXIF/IPTC data added by some photo management software, but the image is the same (or at least "looks the same" and has the same dimensions),
or that are only resized versions of the original image,
or that are "enhanced" versions of the originals, etc..
Now the questions:
how to find duplicates by checksumming only the "pure image bytes" of a JPG, without the EXIF/IPTC and similar meta information? I want to filter out the photo duplicates that differ only in their EXIF tags while the image itself is the same (so file checksumming doesn't work, but image checksumming might...). This is (I hope) not very complicated, but I need some direction.
What Perl module can extract the "pure" image data from a JPG file in a form usable for comparison/checksumming?
More complex
how to find "similar" images, what are only the
resized versions of the originals
"enchanced" versions of the originals (from some photo manipulation programs)
is here already any algorithm available in a unix command form or perl module (XS?) what i can use to detect these special "duplicates"?
I'm able make complex scripts is BASH and "+-" :) know perl.. Can use FreeBSD/Linux utilities directly on the server and over the network can use OS X (but working with 600GB over the LAN not the fastest way)...
My rough idea:
delete images only at the end of the workflow,
use an Image::ExifTool script to collect duplicate candidates based on image-creation date and camera model (maybe other EXIF data too),
make a checksum of the pure image data (or extract the histogram; the same images should have the same histogram) - not sure about this,
use some similarity detection to find duplicates produced by resizing and photo enhancement - no idea how to do that...
Any idea, any help, any (software/algorithm) hint on how to bring order to the chaos?
PS:
Here is a nearly identical question: Finding Duplicate image files, but I'm already done with that answer (MD5) and am looking for more precise checksumming and image-comparison algorithms.
Assuming you can work with a locally mounted FS:
rmlint: the fastest tool I've ever used to find exact duplicates
findimagedupes: automates the whole ImageMagick approach (similar, it seems, to Randal Schwartz's script, which I haven't tested)
Detecting Similar and Identical Images Using Perceptual Hashes goes all the way (a great reference post)
dupeguru-pe (gui): a dedicated tool that is fast and does an excellent job
geeqie (gui): I find it fast/excellent for finishing the job, using its granular deduplication options. You can also generate an ordered collection of images such that similar images are next to each other, allowing you to 'flip' between two to see the changes.
Have you looked at this article by Randal Schwartz? He uses a Perl script with ImageMagick to reduce each picture to a small fingerprint (a 4x4 RGB grid), which he then compares in order to flag "similar" pictures.
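A rough sketch of that fingerprint idea using plain ImageMagick (the 4x4 size follows the article; comparing fingerprints with a per-byte tolerance is left to a wrapper script):
convert photo.jpg -resize '4x4!' rgb:- | xxd -p
This prints 48 bytes of hex (16 pixels x 3 channels); identical images give identical output, and near-duplicates give numerically close output.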
You can remove EXIF data with mogrify -strip from the ImageMagick toolset. So you could, for each image, make a copy without the EXIF data, md5sum it, and then compare the md5sums.
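A variant of that idea which avoids writing stripped copies at all: ImageMagick's %# format escape prints a hash computed from the decoded pixel values only, so files that differ only in metadata produce the same signature. A sketch assuming GNU uniq (-w64 matches the 64-character digest of current ImageMagick versions):
identify -quiet -format '%#  %f\n' *.jpg | sort | uniq -w64 -D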
When it comes to visually similar images, you can, for example, use compare (also from the ImageMagick toolset) to produce a black/white diff map, as described here, then make a histogram of the difference and check whether there is "enough" white to mean the images are really different.
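The same idea in a single step, using compare's AE (absolute error) metric, which prints the number of differing pixels (the 5% fuzz is a guessed tolerance for recompression noise, and this assumes the two images have the same dimensions; a.jpg and b.jpg are placeholders):
compare -metric AE -fuzz 5% a.jpg b.jpg null: 2>&1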
I had a similar dilemma: several hundred gigs of photos and videos spread and duplicated over about a dozen drives. I know this may not be the exact approach you are looking for, but the FSlint Janitor application (on Ubuntu 16.x, then 18.x) was a lifesaver for me. I took the project in chunks, eventually cleaned it all up, and ended up with three complete sets (I wanted two off-site backups).
FSLint Janitor:
sudo apt install fslint

How to convert .epsf to .eps?

I'm looking for a method of converting .epsf to .eps for a publication I'm submitting. The submission site requires .eps (even though my understanding is that modern renderers should be able to read .epsf as well; the site is archaic, and I have to upload all 100 images individually). My co-author sent me the zipped files to upload (and now to convert); I didn't make them myself. Further, the programs that made these images may exist on my co-author's computer, but exactly where is uncertain.
I've tried this in Mathematica 8 with reasonable but not full success: colored files become black and white, and files with duplicate entries (as in, Fig11a.eps and Fig11a.epsf both exist though they are different; it seems the .eps is the background and the .epsf the foreground layer) convert incorrectly. My approach was to import the .epsf files into Mathematica and export them as .eps.
Also, I've tried using a middleman format - e.g. gif/tiff/png/jpg - with similar results. I haven't been able to find a free program that could pull this off (I assume Photoshop could), and I'd also like to do it as a batch. A method that requires Python/Mathematica or an XP/Linux OS would be fine. Thanks.
You do not need to convert anything. Encapsulated PostScript files can have either extension (both EPS and EPSF). If your publisher refuses to accept files with an EPSF extension, just rename them to EPS.
Any processing/conversion you do on the files (using GhostScript, Mathematica, etc.) carries the risk of corrupting the graphics in some way. But there's no need to do it. Send them as they are or rename them if you prefer.
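If you do rename them, a minimal bash loop handles all 100 in one go (a sketch, assuming the files sit in the current directory; -n keeps mv from overwriting the distinct .eps files you mentioned, such as Fig11a.eps):
for f in *.epsf; do mv -n -- "$f" "${f%.epsf}.eps"; done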
(If you have any doubt, you can check the EPS Format Specification from 1992, which says that on the Macintosh the recommended file extension is .epsf, while on DOS it's .EPS.)
