How to find image files without extensions (on macOS 10.8)

I have an app that has decided to die, which had a library of images it stored on my hard drive in a series of GUID-like folders. The files themselves have no file extensions; there must have been an internal database (now unrecoverable/corrupt) that associated each file with its name/extension/MIME type. So, to get my stuff back out, I'd like to be able to search the disk and at least identify which of the files are images (JPEG and PNG files). I know that both JPEG and PNG have particular byte sequences in the first few bytes of the file. Is there a grep command that can match these known byte sequences in the first few bytes of each file in the massively nested file system structure that I have (e.g. folders 0 through f, each containing folders 0 through f, nested several levels deep, with files with UID filenames)?

Starting at the current directory .:
find . -type f -print0 | xargs -J fname -0 -P 4 identify -ping fname 2>|/dev/null
This will print the files that ImageMagick can identify, which are mostly images, though there are exceptions (it also recognizes some text files). ImageMagick is not particularly fast at this either, so depending on what you have available there may be faster alternatives. For instance, the PIL package for Python would be quicker simply because it supports a smaller set of image formats, which might still be enough for your task.
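If you'd rather stick with the byte-signature idea from the question than rely on ImageMagick, the stock file utility performs exactly that check (it reads the leading magic bytes). A sketch, assuming the library sits under the current directory and that ./recovered is merely a hypothetical destination for the renamed copies:
mkdir -p recovered
find . -type f -not -path './recovered/*' -print0 |
while IFS= read -r -d '' f; do
    # file -b --mime-type inspects the first bytes and prints e.g. image/jpeg or image/png
    case "$(file -b --mime-type "$f")" in
        image/jpeg) cp "$f" "recovered/$(basename "$f").jpg" ;;
        image/png)  cp "$f" "recovered/$(basename "$f").png" ;;
    esac
done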

Related

How to mix files of compressed and stored types in the same zip file

I am looking for a shell command (preferably a one-liner) that will create a zip file with both compressed and stored content (by stored, I mean uncompressed, as stated in the official documentation, link 1).
The .ZIP File Format Specification allows mixing different compression methods, including simply storing files:
4.1.8 Each data file placed into a ZIP file MAY be compressed, stored, encrypted or digitally signed independent of how other
data files in the same ZIP file are archived.
If confirmation were needed, this technical possibility is also reflected in the media type registered in the IANA registry under application/zip:
A. Local file header:
local file header signature 4 bytes (0x04034b50) ..
compression method 2 bytes
So far I have tried several zip parameters without success (-f, -u, -U, ...).
Ideally the command would compress text files and store binary content, differentiated by file extension (for example: html, css, js would be considered text, and jpg, ico, jar binary).
Are you looking for the -n flag?
-n suffixes
--suffixes suffixes
Do not attempt to compress files named with the given suffixes. Such files are simply
stored (0% compression) in the output zip file, so that zip doesn't waste its time trying
to compress them. The suffixes are separated by either colons or semicolons. For example:
zip -rn .Z:.zip:.tiff:.gif:.snd foo foo
will copy everything from foo into foo.zip, but will store any files that end in .Z,
.zip, .tiff, .gif, or .snd without trying to compress them.
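Applied to the extensions from the question, a single invocation along these lines would compress html/css/js (and anything else) while simply storing the binaries; site.zip and site/ are hypothetical names:
zip -r -n .jpg:.ico:.jar site.zip site/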
Adding to @cody's answer, you can also do this on a per-file (group) basis with -g and -0. Something like:
zip archive.zip compressme.txt
zip -g archive.zip -0 dontcompressme.jpg
-#
(-0, -1, -2, -3, -4, -5, -6, -7, -8, -9)
Regulate the speed of compression using the specified digit #, where -0
indicates no compression (store all files), -1 indicates the fastest
compression speed (less compression) and -9 indicates the slowest
compression speed (optimal compression, ignores the suffix list).
The default compression level is -6.
-g
--grow
Grow (append to) the specified zip archive, instead of creating a new one.
If this operation fails, zip attempts to restore the archive to its original
state. If the restoration fails, the archive might become corrupted.
This option is ignored when there's no existing archive or when at least
one archive member must be updated or deleted.
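To confirm which entries ended up stored versus deflated with either approach, Info-ZIP's unzip can list the compression method per entry:
unzip -v archive.zip
# the "Method" column shows "Stored" for uncompressed entries and "Defl:N" for deflated ones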

(OS X Shell) Copying files based on text file containing file names without extensions

Preface: I’m not much of a shell-scripter, in fact not a shell-scripter at all.
I have a folder (folder/files/) with many thousand files in it, with varying extensions and random names. None of the file names have spaces in them. There are no subfolders.
I have a plain text file (filelist.txt) with a few hundred file names, all of them without extensions. All the file names have corresponding files in folder/files/, but with varying extensions. Some may have more than one corresponding file in folder/files/ with different extensions.
An example from filelist.txt:
WP_20160115_15_11_20_Pro
P1192685
100-1373
HPIM2836
These might, for example, correspond to the following files in folder/files/:
WP_20160115_15_11_20_Pro.xml
P1192685.jpeg
100-1373.php
100-1373.docx
HPIM2836.avi
(Note the two files named 100-1373 with different extensions.)
I am working on an OS X (10.11) machine. What I need to do is copy all the files in folder/files/ that match a file name in filelist.txt into folder/copiedfiles/.1
I’ve been searching and Googling like mad for a bit now, and I’ve found bucketloads of people explaining how to extract file names without extensions, find and copy all files that have no extension, and various tangentially related issues, but I can’t find anything that really helps me figure out how to do this in particular. Doing a cp `cat filelist.txt` folder/copiedfiles/ would work (as far as I can tell) if the file names in the text file included extensions; but they don’t, so it doesn’t.
What is the simplest (and preferably fastest) way to do this?
1 What I need to do is exactly the same as in this question, but that one is specifically asking about batch-file, which is a very different kettle of sea-dwellers.
This should do it:
while read -r filename
do
    find /path/to/folder/files/ -maxdepth 1 -type f \
        -name "$filename*" -exec cp {} /path/to/folder/copiedfiles/ \;
done < /path/to/filelist.txt
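Since folder/files/ has no subdirectories, an alternative sketch is to let the shell glob do the matching and avoid launching find once per name; run it from the directory that contains folder/ and filelist.txt (names with no matching file are silently skipped because of the 2>/dev/null):
while IFS= read -r name; do
    cp folder/files/"$name".* folder/copiedfiles/ 2>/dev/null
done < filelist.txt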

find files in huge directory - very slow

I have a directory with files. The archive is very big and has 1.5 million PDF files inside.
The directory is stored on an IBM i server running OS V7R1, and the machine is new and very fast.
The files are named like this:
invoice_[custno]_[year]_[invoice_number].pdf
invoice_081500_2013_7534435564.pdf
Now I try to find files with the find command using the shell.
find . -name 'invoice_2013_*.pdf' -type f | ls -l > log.dat
The command took a long time so I aborted the operation with no result.
If I try it with smaller directories all works fine.
Later I want to have a job that runs every day and finds the files created in the last 24 hours, but it always runs so slowly that I can forget about it.
That invocation would never work because ls does not read filenames from stdin.
Possible solutions are:
Use the find utility's built-in list option:
find . -name 'invoice_2013_*.pdf' -type f -ls > log.dat
Use the find utility's -exec option to execute ls -l for each matching file:
find . -name 'invoice_2013_*.pdf' -type f -exec ls -l {} \; > log.dat
Pipe the filenames to the xargs utility and let it execute ls -l with the filenames as parameters:
find . -name 'invoice_2013_*.pdf' -type f | xargs ls -l > log.dat
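For the daily job mentioned at the end of the question, the same idea can be restricted to files modified in the last 24 hours with -mtime -1 (a sketch, assuming your find supports it); note also that with names like invoice_[custno]_[year]_[invoice_number].pdf the pattern needs a wildcard before the year as well:
find . -type f -name 'invoice_*_2013_*.pdf' -mtime -1 | xargs ls -l > log.dat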
A pattern search of 1.5 million files in a single directory is going to be inefficient on any filesystem.
For looking only at a list of new entries in the directory, you might consider journaling the directory. You would specify INHERIT(*NO) to prevent journaling all the files in the directory as well. Then you could simply extract the recent journal entries with DSPJRN to find out what objects had been added.
I don't think I'd put more than maybe 15k files in a single directory. Some QShell utilities run into trouble at around 16k files. But I'm not sure I'd store them in a directory in any case, except maybe for ones over 16MB if that's a significant fraction of the total. I'd possibly look to store them in CLOBs/BLOBs in the database first.
Storing as individual streamfile objects brings ownership/authority problems that need to be addressed. Some profile is getting entries into its owned-objects table, and I'd expect that profile to be getting pretty large. Perhaps getting to one or more limits.
By storing in the database, you drop to a single owned object.
Or perhaps a few similar objects... There might be a purging/archiving process that moves rows off to a secondary or tertiary table. Hard to guess how that might need to be structured, if at all.
Saves could also benefit, especially SAVSECDTA and SAV saves. Security data is greatly reduced. And saving a 4GB table is faster than saving a thousand 4MB objects (or whatever the breakdown might be).
Other than determining how the original setup and implementation would go in your environment, the big tricky part could involve volatility. If these are stable objects with relatively few changes and few deletions, it should be okay. But if BLOBs are often modified, it can bring trouble when the table takes up a significant fraction of DASD capacity. It gets particularly rough when it exceeds the size of DASD free space and a re-org is needed. With low volatility, that's much less of a concern.
Typically what is done in such cases is to create subdirectories -- perhaps keyed on the first letter of each file name. For example, the file
abcsdsjahdjhfdsfds.xyz would be stored in
/something/a/abcsdsjahdjhfdsfds.xyz
which would cut down on the size of each subdirectory.
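A rough shell sketch of sharding an existing flat directory that way (/something is the hypothetical parent directory from the example above):
cd /something
for f in *; do
    [ -f "$f" ] || continue
    # first character of the name; for names that all share a prefix (like invoice_...),
    # key on a more distinguishing part instead
    first=$(printf '%s' "$f" | cut -c1)
    mkdir -p "$first"
    mv "$f" "$first/"
done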

shellscript to convert .TIF to a .PDF

I want to go through a directory's subdirectories and either convert the .TIF images or place them into a PDF. I have a directory structure like this:
folder
item_one
file1.TIF
file2.TIF
...
fileN.TIF
item_two
file1.TIF
file2.TIF
...
...
I'm working on a Mac and considered using sips to change my .TIF files to .PNG files and then use pdfjoin to join all the .PNG files into a single .PDF file per folder.
I have used:
for filename in *; do sips -s format png "$filename" --out "$filename.png"; done
but this only works for the .TIF files in a single directory. How would one write a shellscript to progress through a series of directories as well?
Once the .PNG files were created I'd do essentially the same thing, but using:
pdfjoin --a4paper --fitpaper false --rotateoversize false *.png
Is this a valid way of doing this? Is there a better, more efficient way of performing such an action? Or am I being an idiot and should be doing this with some sort of software, like ImageMagick or something?
Try using the find command with the exec switch to call your image conversion solution. Alternatively, instead of using the exec switch, you could pipe the output of find to xargs. There is lots of information online about using find. Here's one example from StackOverflow.
As far as the image conversion, I think that really depends on your requirements for speed and efficiency. If you've verified the process you described, and this is a one-time process, and it only takes seconds or minutes to run, then you're probably fine. On the other hand, if you need to do this frequently, then it might be worth investing the time to find a one-step conversion solution that takes less time than your current, two-pass solution.
Note that, instead of two passes, you may be able to pipe the output of sips to pdfjoin; however, that would require some investigation to verify.
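For what it's worth, a sketch of the two-pass approach driven across each item_* subdirectory might look like the following; it assumes sips and pdfjoin behave as in your single-directory test and that your pdfjoin accepts --outfile, and the per-folder output name item.pdf is just a placeholder:
for dir in folder/*/; do
    (
        cd "$dir" || exit
        # convert every TIF in this folder to PNG
        for tif in *.TIF; do
            sips -s format png "$tif" --out "${tif%.TIF}.png"
        done
        # join the PNGs (glob order is lexical) into one PDF for this folder
        pdfjoin --a4paper --fitpaper false --rotateoversize false --outfile item.pdf *.png
    )
done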

Bash PDF-merging misses files

I'm trying to merge many PDF files in chunks of 3000 or so files. After many tries, this script seemed to do the trick. (of course I was wrong)
#!/bin/bash
basepath='/home/lemonidas/pdfstuff';
datename=`date "+%Y%m%d%H%M.%S"`;
start=`date "+%s"`;
echo "parsing pdf list to file..."
find $basepath/input/ -name "*.pdf" | xargs -I {} ls {} >> $basepath/tmp/biglist$datename.txt
split -l 3000 $basepath/tmp/biglist$datename.txt $basepath/tmp/splitfile
rm $basepath/tmp/biglist$datename.txt
echo "deleting big file..."
echo "done splitting!"
declare -i x
x=1
for f in $basepath/tmp/splitfile*
do
linenum=`cat $f | wc -l`;
echo "Processing $f ($linenum lines)..."
# merge to one big PDF
cat $f | xargs gs -q -sstdout=$basepath/error.log -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=$basepath/output/$x.big.pdf 2>$basepath/error.log
echo "Completed PDF $x"
(( x++ ))
# delete the list file
rm $f
echo "Deleted processed file $f"
done
end=`date "+%s"`;
echo "Started: $start"
echo "Finished: $end"
The problem is, I have 22000 2-page PDFs, each output file (except the last one) should be 6000 pages (since we have 3000 PDFs in each merge list, as verified by the "wc -l" before parsing), and I only get around 658 pages or so.
No errors are reported except this by gs:
Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.
This file had errors that were repaired or ignored.
The file was produced by: >>>> Powered By Crystal
Please notify the author of the software that produced this file that it does not conform to Adobe's published PDF specification.
over and over (but not 22000 times though)
When I try it with 300-400 files, it runs smoothly, but when I try the full run, after 2.5 hours, I get much less than half the files merged.
My next thought is to convert each 2-page PDF into .pgm files, but then I have no idea how to remake them as PDFs (so that no font-embedding issues arise).
Am I missing something? (probably)
You would probably do better to use a tool better suited to the task. pdfwrite (the Ghostscript device for emitting PDF files) is not, in my opinion, the right tool for this.
In order to 'merge' PDF files, Ghostscript fully interprets the input into marking operations, then rewrites those marking operations as a PDF file. While building that list of operations, a great deal of information (fonts, images, other things) needs to be held and compared against new input to see whether a copy is already present. As the input grows larger, it takes longer to scan that list, and of course memory consumption increases. You may find that Ghostscript is already swapping memory.
Now, I'm not sure whether this is your actual problem, or whether you are saying that after you 'merge' the files there are pages missing. That should not happen. You also don't say which version of Ghostscript you are using.
All the same, I would expect a tool like pdftk to be faster at this kind of merge, though the final PDF file might well be larger/less efficient than the one pdfwrite would produce.
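As a concrete (hedged) illustration, the gs line inside the loop from the question's script could be swapped for a pdftk call along these lines, reusing the same variable names; this is a sketch, not a tested drop-in replacement. Note also that if a 3000-line chunk exceeds xargs's command-line limit, xargs will invoke gs more than once, and each invocation overwrites the same -sOutputFile, which by itself could explain an output far shorter than expected.
# inside the loop over $basepath/tmp/splitfile*, instead of the cat | xargs gs line
# (assumes the listed paths contain no spaces)
pdftk $(cat "$f") cat output "$basepath/output/$x.big.pdf"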
