Control file order with tiffcp

I want to use my tried-and-true script to combine all the .tif files in a directory into a single multipage TIFF:
tiffcp *.tif out.tif
but I want the files in the reverse of alphabetical order, e.g. 003.tif, 002.tif, 001.tif. Is there a flag in tiffcp? Do I need to rename all the files?

You can do this with some bash:
tiffcp `ls *.tif | sort -r` out.tif
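If any of the file names contain spaces, the command substitution above will split them; here is a small bash sketch that reverses the glob in an array instead:
files=(*.tif)   # the glob expands in alphabetical order
for ((i=${#files[@]}-1; i>=0; i--)); do reversed+=("${files[$i]}"); done
tiffcp "${reversed[@]}" out.tif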

Related

sort files before converting to pdf in imagemagick

I have a folder full of image files I need to convert to a pdf. I used wget to download them. The problem is that the order Linux lists the files in isn't the actual order of the pages. This is an example of the file ordering:
100-52b69f4490.jpg
101-689eb36688.jpg
10-1bf275d638.jpg
102-6f7dc2def9.jpg
103-2da8842faf.jpg
104-9b01a64111.jpg
105-1d5e3862d8.jpg
106-221412a767.jpg
...
I can convert these images to a pdf using imagemagick, with the command
convert *.jpg output.pdf
but it'll put the pages into that PDF in the above order, not in human-readable numerical order: 1-blahblahblah.jpg, 2-blahblahblah.jpg, 3-blahblahblah.jpg, etc.
Is the easiest way to do this to pipe the output of sort to convert? Or to pipe my wget output and add each file to the PDF as I download it?
convert $(ls -1v *.jpg) book.pdf
worked for me
There are several options:
The simplest is as follows, but may overflow your command-line length if you have too many pages:
convert $(ls *jpg | sort -n) result.pdf
Next up is feeding the list of files on stdin like this:
ls *jpg | sort -n | convert @- result.pdf
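A related variation is to write the sorted names to a file and let convert read them with the same @ syntax (a sketch; pages.txt is just an arbitrary name here, and some ImageMagick security policies restrict @ file reads):
ls *jpg | sort -n > pages.txt
convert @pages.txt result.pdf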
Here is a bash script to do it:
#!/bin/bash
sort -n < list.txt > sorted_list.tmp
readarray -t list < sorted_list.tmp
convert "${list[#]}" output.pdf
rm sorted_list.tmp
exit
You can get list.txt by first listing your directory with ls > list.txt.
The sort -n (numeric sort) puts your entries in numerical order.
The sorted list is saved in the .tmp file and deleted at the end.
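As a side note, bash process substitution lets you skip the temporary file entirely (a sketch based on the same script):
readarray -t list < <(sort -n list.txt)
convert "${list[@]}" output.pdf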

ImageMagick convert tiffs to pdf with sequential file suffix

I have the following scenario and I'm not much of a coder (nor do I know bash well). I don't even have a base working bash script to share, so any help would be appreciated.
I have a file share that contains tiffs (thousands) of a document management system. The goal is to convert and combine the multiple TIFF files into single-file PDFs (preferably in PDF/A-1a format).
The directory format:
/Document Management Root # This is root directory
./2009/ # each subdirectory represents a year
./2010/
./2011/
....
./2016/
./2016/000009.001
./2016/000010.001
# files are stored flat - just thousands of files per year directory
The document management system stores TIFFs with sequentially numbered file names along with sequential file suffixes:
000009.001
000010.001
000011.002
000012.003
000013.001
Where each page of a document is represented by the suffix. The suffix restarts when a new, non-related document is created. In the example above, 000009.001 is a single page tiff. Files 000010.001, 000011.002, and 000012.003 belong to the same document (i.e. the pages are all related). File 000013.001 represents a new document.
I need to preserve the file name for the first file of a multipage document so that the filename can be cross referenced with the document management system database for metadata.
The pseudo code I've come up with is:
for each file in {tiff directory}
while file extension is "001"
convert file to pdf and place new pdf file in {pdf directory}
else
convert multiple files to pdf and place new pdf file in {pdf directory}
But this seems like it will have the side effect of converting all 001 files regardless of what the next file is.
Any help is greatly appreciated.
EDIT - Both answers below work. The second answer worked, however it was my mistake in not realizing that the data set I tested against was different than my scenario above.
So, save the following script in your login ($HOME) directory as TIFF2PDF
#!/bin/bash
ls *[0-9] | awk -F'.' '
/001$/ { if(NR>1)print cmd,outfile; outfile=$1 ".pdf"; cmd="convert " $0;next}
{ cmd=cmd " " $0}
END { print cmd,outfile}'
and make it executable (necessary just once) by going in Terminal and running:
chmod +x TIFF2PDF
Then copy a few documents from any given year into a temporary directory to try things out... then go to the directory and run:
~/TIFF2PDF
Sample Output
convert 000009.001 000009.pdf
convert 000010.001 000011.002 000012.003 000010.pdf
convert 000013.001 000013.pdf
If that looks correct, you can actually execute those commands like this:
~/TIFF2PDF | bash
or, preferably if you have GNU Parallel installed:
~/TIFF2PDF | parallel
The script says... "Generate a listing of all files whose names end in a digit and send that list to awk. In awk, use the dot as the separator between fields, so if the file is called 000011.002, then $0 will be 000011.002, $1 will be 000011 and $2 will be 002. Now, if the filename ends in .001, print the accumulated command and append the output filename. Then save the filename prefix with a .pdf extension as the output filename of the next PDF and start building up the next ImageMagick convert command. On subsequent lines (which don't end in .001), add the filename to the list of filenames to include in the PDF. At the end, output any accumulated command and append the output filename."
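As a quick illustration of the field splitting the script relies on (the file name here is just an example):
echo "000011.002" | awk -F'.' '{print "prefix:", $1, "suffix:", $2}'
# prefix: 000011 suffix: 002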
As regards the ugly black block at the bottom of your image, it happens because there are some tiny white specks in there that prevent ImageMagick from removing the black area. I have circled them in red:
If you blur the picture a little (to diffuse the specks) and then get the size of the trim-box, you can apply that to the original, unblurred image like this:
trimbox=$(convert original.tif -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert original.tif -crop $trimbox result.tif
I would recommend you do that first to A COPY of all your images, then run the PDF conversion afterwards. As you will want to save a TIFF file but keep the extension .001, .002, etc., you will need to tell ImageMagick to crop and force the output filetype to TIF:
original=XYZ.001
trimbox=$(convert $original -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert $original -crop $trimbox TIF:$original
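To apply the same crop to a whole directory of copies before the PDF step, a small loop along those lines should work (a sketch; it matches the same *[0-9] names as the script above):
for f in *[0-9]; do
  trimbox=$(convert "$f" -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
  convert "$f" -crop "$trimbox" TIF:"$f"
done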
As @AlexP. mentions, there can be issues with globbing if there is a large number of files. On OSX, ARG_MAX is very high (262144) and your filenames are around 10 characters, so you may hit problems if there are more than around 26,000 files in one directory. If that is the case, simply change:
ls *[0-9] | awk ...
to
ls | grep "[0-9]$" | awk ...
The following command would convert the whole /Document Management Root tree (assuming that is its actual absolute path), properly processing all subfolders even when their names include whitespace characters, and properly skipping all other files that don't match the 000000.000 naming pattern:
find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{6\}.001$' -exec bash -c 'p="{}"; d="${p:0: -10}"; n=${p: -10:6}; m=10#$n; c[1]="$d$n.001"; for i in {2..999}; do k=$((m+i-1)); l=$(printf "%s%06d.%03d" "$d" $k $i); [[ -f "$l" ]] || break; c[$i]="$l"; done; echo -n "convert"; printf " %q" "${c[@]}" "$d$n.pdf"; echo' \; | bash
To do a dry run just remove the | bash in the end.
Updated to match the 00000000.000 pattern (and split to multiple lines for clarity):
find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{8\}.001$' -exec bash -c '
pages[1]="{}"
p1num="10#${pages[1]: -12:8}"
for i in {2..999}; do
nextpage=$(printf "%s%08d.%03d" "${pages[1]:0: -12}" $((p1num+i-1)) $i)
[[ -f "$nextpage" ]] || break
pages[i]="$nextpage"
done
echo -n "convert"
printf " %q" "${pages[#]}" "${pages[1]:0: -3}pdf"
echo
' \; | bash

Use exiv2 or imagemagick to remove EXIF data from stdin and output to stdout

How can I pipe an image into exiv2 or imagemagick, strip the EXIF tag, and pipe it out to stdout for more manipulation?
I'm hoping for something like:
exiv2 rm - - | md5sum
which would output an image supplied via stdin and calculate its md5sum.
Alternatively, is there a faster way to do this?
Using exiv2
I was not able to find a way to get exiv2 to output to stdout -- it only wants to overwrite the existing file. You could use a small bash script to make a temporary file and get the md5 hash of that.
image.sh:
#!/bin/bash
cat <&0 > tmp.jpg # Take input on stdin and dump it to temp file.
exiv2 rm tmp.jpg # Remove EXIF tags in place.
md5sum tmp.jpg # md5 hash of stripped file.
rm tmp.jpg # Remove temp file.
You would use it like this:
cat image.jpg | ./image.sh
Using ImageMagick
You can do this using ImageMagick instead by using the convert command:
cat image.jpg | convert -strip - - | md5sum
Caveat:
I found that stripping an image of EXIF tags using convert resulted in a smaller file-size than using exiv2. I don't know why this is and what exactly is done differently by these two commands.
From man exiv2:
rm Delete image metadata from the files.
From man convert:
-strip strip image of all profiles and comments
Using exiftool
ExifTool by Phil Harvey
You could use exiftool (I got the idea from https://stackoverflow.com/a/2654314/3565972):
cat image.jpg | exiftool -all= - -out - | md5sum
This too, for some reason, produces a slightly different image size from the other two.
Conclusion
Needless to say, all three methods (exiv2, convert, exiftool) produce outputs with different md5 hashes. Not sure why this is. But perhaps if you pick a method and stick to it, it will be consistent enough for your needs.
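If you just need consistency rather than a canonical stripped image, a quick sanity check is to run your chosen method twice on the same input and confirm the hashes match (a sketch using the convert variant above; the two runs should normally print the same hash):
cat image.jpg | convert -strip - - | md5sum
cat image.jpg | convert -strip - - | md5sum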
I tested with a NEF file. It seems only
exiv2 rm
works; exiftool and convert can't remove all metadata from the .nef file.
Notice that the output file of exiv2 rm can no longer be displayed by most image viewers. But I only need the MD5 hash to stay the same after I update any metadata of the .NEF file, so it works perfectly for me.

Build sorted and annotated pdf from images

I am trying to build a PDF from a set of image files in the same folder, using bash. So far I've got this code:
ls *.jpg | sort > files.txt
ls *.jpg | sort | tr '\n' ' ' | sed 's/$/\ data_graphs.pdf/' | xargs convert -gravity North -annotate @files.txt
rm files.txt
This code combines the images into a PDF, but they are not properly sorted, and the annotation is the same for every image (the first one in the list).
Here is the ls *.jpg | sort output for reference.
$ ls *.jpg | sort
01.20.2014_A549_void.jpg
01.20.2014_EPOR_full_sorter.jpg
01.20.2014_EPOR_trunc_sorter.jpg
01.20.2014_WTGFP_sorter.jpg
01.27.2014_A549_void.jpg
01.27.2014_EPOR_full_I10412.jpg
01.27.2014_EPOR_full_sorter.jpg
01.27.2014_EPOR_trunc_I10412.jpg
01.27.2014_EPOR_trunc_sorter.jpg
01.27.2014_WTGFP_I10412.jpg
01.27.2014_WTGFP_sorter.jpg
02.03.2014_A549_void.jpg
02.03.2014_EPOR_full_sorter.jpg
02.03.2014_EPOR_trunc_sorter.jpg
02.03.2014_WTGFP_sorter.jpg
How about this? There is no need to generate the temporary file files.txt:
convert `ls *.jpg | sort -t. -k3,3n -k1,1n -k2,2n` -gravity North -annotate +0+10 '%f' data_graphs.pdf
According to the comments, these jpg files have a time-stamp in the file name (MM.DD.YYYY), so I updated the sort command.
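If you want to double-check the page order before building the PDF, run the sort on its own first; the keys sort by year (field 3), then month (field 1), then day (field 2):
ls *.jpg | sort -t. -k3,3n -k1,1n -k2,2n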
Another way: convert each jpg file to pdf first, then use pdftk to merge them. I have used pdftk for years and know the software can do the job easily. Here is the pdftk server URL: pdflabs.com/tools/pdftk-server.
The script below will convert the jpg files to pdf one by one:
for file in *jpg
do
convert -gravity North -annotate "$file" "$file".pdf
done
Then run the pdftk command. If you have a huge number of PDFs, you can merge every 10-20 of them into a small PDF with pdftk, then merge the small PDFs into the final PDF. For example:
pdftk 1.pdf 2.pdf 3.pdf output m1.pdf
Then you will get mXXX.pdf files; run pdftk again:
pdftk m1.pdf m2.pdf m3.pdf output final.pdf
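If the shell can expand all the names in one go, a single pdftk call that reuses the same date sort also works (a sketch; it assumes the intermediate PDFs keep the original jpg-based names):
pdftk $(ls *.pdf | sort -t. -k3,3n -k1,1n -k2,2n) cat output final.pdf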

convert a directory of images into a single PDF

I have a directory of images:
path/to/directory/
image01.jpg
image02.jpg
...
and would like to convert it into a single PDF file:
path/to/directory.pdf
This is what I managed to code so far:
#!/bin/bash
echo Directory $1
out=$(echo $1 | sed 's|/$|.pdf|')
echo Output $out
mkdir tmp
for i in $(ls $1)
do
# MAC hates sed with "I" (ignore case) - thanks SO for the perl solution!
# I want to match "jpg, JPG, Jpg, ..."
echo $1$i $(echo "tmp/$i" | perl -C -e 'use utf8;' -pe 's/jpg$/pdf/i')
convert $1$i $(echo "tmp/$i" | perl -C -e 'use utf8;' -pe 's/jpg$/pdf/i')
done
pdftk tmp/*.pdf cat output $out
rm -rf tmp
So the idea was to convert each image into a pdf file with imagemagick, and use pdftk to merge it into a single file. Thanks to the naming of the files I don't have to bother about the ordering.
Since I'm a newbie to this I'm sure there are many refinements one can do:
only iterate over image-files in the directory (in case there is some Readme.txt,...)
including the extensions png, jpeg, ...
using the trailing "/" is not elegant I admit
etc.
Currently my main problem is, however, that there are cases where my directories and image files contain spaces in their names. The for-loop then iterates over sub-strings of the filename and I imagine that the line with convert will also fail.
I have tried out some things but haven't succeeded so far and hope someone will be able to help me here.
If anyone has ideas to address the issues I listed above as well I would be very glad to hear them too.
convert can do this in one go:
convert *.[jJ][pP][gG] output.pdf
Or to answer several of your other questions and replace your script:
#!/bin/bash
shopt -s nullglob nocaseglob
convert "$1"/*.{png,jpg,jpeg} "${1%/}.pdf"
will iterate over all the given extensions in the first argument, regardless of capitalization, and write to yourdir.pdf. It will not break on spaces.
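Save it as, say, dir2pdf.sh (the name is just an example), make it executable, and pass the directory as the first argument:
chmod +x dir2pdf.sh
./dir2pdf.sh path/to/directory    # writes path/to/directory.pdf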
