Build sorted and annotated pdf from images

I am trying to build a pdf from a set of image files in the same folder, from bash. So far I've got this code:
ls *.jpg | sort > files.txt
ls *.jpg | sort | tr '\n' ' ' | sed 's/$/\ data_graphs.pdf/' | xargs convert -gravity North -annotate @files.txt
rm files.txt
This code does produce the pdf, but the images are not properly sorted, and the annotation is the same for every image (the first one in the list).
Here is the ls *.jpg | sort output for reference:
$ ls *.jpg | sort
01.20.2014_A549_void.jpg
01.20.2014_EPOR_full_sorter.jpg
01.20.2014_EPOR_trunc_sorter.jpg
01.20.2014_WTGFP_sorter.jpg
01.27.2014_A549_void.jpg
01.27.2014_EPOR_full_I10412.jpg
01.27.2014_EPOR_full_sorter.jpg
01.27.2014_EPOR_trunc_I10412.jpg
01.27.2014_EPOR_trunc_sorter.jpg
01.27.2014_WTGFP_I10412.jpg
01.27.2014_WTGFP_sorter.jpg
02.03.2014_A549_void.jpg
02.03.2014_EPOR_full_sorter.jpg
02.03.2014_EPOR_trunc_sorter.jpg
02.03.2014_WTGFP_sorter.jpg

How about this? There is no need to generate the temporary file files.txt (the %f escape annotates each image with its own file name):
convert $(ls *.jpg | sort -t. -k3n -k1n -k2n) -gravity North -annotate +0+20 '%f' data_graphs.pdf
According to the comments, these jpg files have a timestamp in the file name (MM.DD.YYYY), so I updated the sort command to order by year, then month, then day.
Another way: convert each jpg file to pdf first, then use pdftk to merge them. I have used pdftk for many years and know the software can do the job easily. Here is the pdftk server URL: pdflabs.com/tools/pdftk-server.
The script below converts the jpg files to pdf one by one:
for file in *.jpg
do
convert "$file" -gravity North -annotate +0+20 "$file" "$file".pdf
done
Then run the pdftk command. If you have a huge number of PDFs, you can merge every 10~20 of them into a small pdf with pdftk, then merge the small pdfs into the final pdf. For example:
pdftk 1.pdf 2.pdf 3.pdf output m1.pdf
then you will get mXXX.pdf files; run pdftk again on those:
pdftk m1.pdf m2.pdf m3.pdf output final.pdf
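A scripted version of this two-stage merge could look like the sketch below (the batch size of 20 and the mNNN.pdf naming are illustrative choices, not anything fixed by pdftk):
#!/bin/bash
# Merge single-page PDFs in batches of 20, then merge the batches.
batch=(); i=0
for f in *.jpg.pdf; do
    batch+=("$f")
    if [ "${#batch[@]}" -eq 20 ]; then
        i=$((i+1))
        pdftk "${batch[@]}" output "$(printf 'm%03d.pdf' "$i")"
        batch=()
    fi
done
# Flush the last, possibly partial, batch
if [ "${#batch[@]}" -gt 0 ]; then
    i=$((i+1))
    pdftk "${batch[@]}" output "$(printf 'm%03d.pdf' "$i")"
fi
pdftk m[0-9][0-9][0-9].pdf output final.pdf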


Batch resize images when one side is too large (linux)

I know that image resizing on the command line is something ImageMagick and similar tools can do. Unfortunately, I only have very basic bash scripting abilities, so I wonder if this is even possible:
check all directories and subdirectories for all files that are an image
check width and height of the image
if either exceeds X pixels, resize it to X while keeping the aspect ratio.
replace old file with new file (old file shall be removed/deleted)
Thank you for any input.
The implementation might not be so trivial, even for advanced users. As a one-liner:
find \
  ~/Downloads \
  -type f \
  -exec file {} \; |
awk -F: '{if ($2 ~ /image/) print $1}' |
while IFS= read -r file_path; do
  mogrify -resize '1024x1024>' "$file_path";
done
Lines 1-4 are an invocation of the find command:
Specify a directory to scan.
Specify you need files only.
For each found item, run the file command. Example output per file:
/Downloads/391A6 625.png: PNG image data, 1024 x 810, 8-bit/color RGB, interlaced
/Downloads/STRUCTURED NODES IN UML 2.0 ACTIVITES.pdf: PDF document, version 1.4
Note how the file names are delimited from their info by : and how the info for the PNG contains the word image. This will also be true for other image formats.
Use awk to keep only those files which have the word image in their info; this gives us image files only. Here, -F: specifies that the delimiter is :, so $1 contains the original file name and $2 the file info. We search for the word image in the file info and print the file name if it is present.
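If your version of file supports it, matching on MIME types instead of the free-form description is a slightly stricter filter (a sketch, not part of the original one-liner):
find ~/Downloads -type f -exec file --mime-type {} \; |
awk -F: '$2 ~ /image\// {print $1}'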
This one is a bit tricky. Lines 6-8 read the output of awk line by line and invoke the mogrify command to resize the images. Here we do not use piping into xargs: if file paths contain spaces or other characters which must be escaped, we get xargs unterminated quote errors, and it's a pain to handle that.
Invoke the mogrify command of ImageMagick. Unlike convert, which is also an ImageMagick command, mogrify changes files in place without creating new ones. Here, '1024x1024>' asks for a maximum size of 1024x1024, and the > flag means "only shrink images that are larger than this". The aspect ratio is preserved, so the final image's largest side will be 1024px; the other side will be smaller than that, unless the original image is square. Pay attention to the ; after the mogrify call, as it's needed inside loops.
Note that it's safe to run mogrify several times over the same file: if a file's dimensions already correspond to your target, it will not be resized again. It will, however, still update the file's modification time.
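If the modification times matter to you, one option is to check the dimensions with identify first and only call mogrify when a side is actually too large (a sketch; the 1024 threshold matches the example above, and multi-frame images would need extra care):
while IFS= read -r file_path; do
  read -r w h <<< "$(identify -format '%w %h' "$file_path")"
  if [ "$w" -gt 1024 ] || [ "$h" -gt 1024 ]; then
    mogrify -resize '1024x1024>' "$file_path"
  fi
done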
Additionally, you may need not only to resize images but to compress them as well. Please refer to my gist to see how this can be done: https://gist.github.com/oblalex/79fa3f85f05924017d25004496493adb
If your goal is just to shrink big images, e.g. those bigger than 300K, you may run:
find /path/to/dir -type f -size +300k
and, as before, combine it with mogrify -strip -interlace Plane -format jpg -quality 85 -define jpeg:extent=300KB "$FILE_PATH"
In this case new jpg files will be created for non-jpg originals, and the originals will need to be removed. Refer to the gist to see how this can be done.
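Putting those two pieces together might look like the sketch below. Note that find -size does not check file types, so this assumes the directory contains only images; the cleanup of non-jpg originals mirrors what the gist describes:
find /path/to/dir -type f -size +300k | while IFS= read -r f; do
    mogrify -strip -interlace Plane -format jpg -quality 85 \
            -define jpeg:extent=300KB "$f"
    # mogrify -format jpg writes a new .jpg next to a non-jpg original,
    # so remove the original once the new file exists
    case "$f" in
        *.jpg) ;;
        *) [ -f "${f%.*}.jpg" ] && rm -- "$f" ;;
    esac
done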
You can do that with a bash shell script looping over your directories. You must identify all the file formats you want, such as jpg and png. Then, for each directory, loop over each file matching the given list of formats and use ImageMagick to resize it.
cd
dirlist="path2/directory1 path2/directory2 ...."
for dir in $dirlist; do
(
cd "$dir" || exit
imglist=`ls | grep -i ".jpg\|.png"`
for img in $imglist; do
convert "$img" -resize "200x200>" "$img"
done
)
done
See https://www.imagemagick.org/script/command-line-processing.php#geometry

Sort files before converting to PDF in ImageMagick

I have a folder full of image files that I need to convert to a pdf. I used wget to download them. The problem is that the ordering Linux gives the files isn't the actual order of the pages; this is an example of the file ordering:
100-52b69f4490.jpg
101-689eb36688.jpg
10-1bf275d638.jpg
102-6f7dc2def9.jpg
103-2da8842faf.jpg
104-9b01a64111.jpg
105-1d5e3862d8.jpg
106-221412a767.jpg
...
I can convert these images to a pdf using imagemagick, with the command
convert *.jpg output.pdf
but it'll put the pages into that pdf in the above order, not in human-readable numerical order: 1-blahblahblah.jpg, 2-blahblahblah.jpg, 3-blahblahblah.jpg, etc.
Is the easiest way to do this to pipe the output of sort to convert? Or to pipe my wget so each file is added to the pdf as I'm getting it?
convert $(ls -1v *.jpg) book.pdf
worked for me
There are several options:
The simplest is as follows, but may overflow your command-line length if you have too many pages:
convert $(ls *jpg | sort -n) result.pdf
Next up is feeding the list of files on stdin like this:
ls *jpg | sort -n | convert @- result.pdf
Here is a bash script that does it:
#!/bin/bash
sort -n < list.txt > sorted_list.tmp
readarray -t list < sorted_list.tmp
convert "${list[#]}" output.pdf
rm sorted_list.tmp
exit
You can get list.txt by first listing your directory with ls > list.txt.
The sort -n (numerical sort) "normalizes" your entries.
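For example, sort -n compares the leading page numbers numerically instead of sorting character by character (a quick check with two of the names from the question):
$ printf '%s\n' 100-52b69f4490.jpg 10-1bf275d638.jpg | sort -n
10-1bf275d638.jpg
100-52b69f4490.jpg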
The sorted list is saved in the .tmp file and deleted at the end.

ImageMagick convert tiffs to pdf with sequential file suffix

I have the following scenario and I'm not much of a coder (nor do I know bash well). I don't even have a base working bash script to share, so any help would be appreciated.
I have a file share that contains tiffs (thousands) from a document management system. The goal is to convert and combine the tiffs, multiple files per document, into single-file pdfs (preferably PDF/A-1a format).
The directory format:
/Document Management Root # This is root directory
./2009/ # each subdirectory represents a year
./2010/
./2011/
....
./2016/
./2016/000009.001
./2016/000010.001
# files are stored flat - just thousands of files per year directory
The document management system stores tiffs with sequential number file names along with sequential file suffixes:
000009.001
000010.001
000011.002
000012.003
000013.001
Where each page of a document is represented by the suffix. The suffix restarts when a new, non-related document is created. In the example above, 000009.001 is a single page tiff. Files 000010.001, 000011.002, and 000012.003 belong to the same document (i.e. the pages are all related). File 000013.001 represents a new document.
I need to preserve the file name for the first file of a multipage document so that the filename can be cross referenced with the document management system database for metadata.
The pseudo code I've come up with is:
for each file in {tiff directory}
while file extension is "001"
convert file to pdf and place new pdf file in {pdf directory}
else
convert multiple files to pdf and place new pdf file in {pdf directory}
But this seems like it will have the side effect of converting all 001 files regardless of what the next file is.
Any help is greatly appreciated.
EDIT - Both answers below work. The second answer worked, however it was my mistake in not realizing that the data set I tested against was different than my scenario above.
So, save the following script in your login ($HOME) directory as TIFF2PDF
#!/bin/bash
ls *[0-9] | awk -F'.' '
/001$/ { if(NR>1)print cmd,outfile; outfile=$1 ".pdf"; cmd="convert " $0;next}
{ cmd=cmd " " $0}
END { print cmd,outfile}'
and make it executable (necessary just once) by going in Terminal and running:
chmod +x TIFF2PDF
Then copy a few documents from any given year into a temporary directory to try things out... then go to the directory and run:
~/TIFF2PDF
Sample Output
convert 000009.001 000009.pdf
convert 000010.001 000011.002 000012.003 000010.pdf
convert 000013.001 000013.pdf
If that looks correct, you can actually execute those commands like this:
~/TIFF2PDF | bash
or, preferably if you have GNU Parallel installed:
~/TIFF2PDF | parallel
The script says... "Generate a listing of all files whose names end in a digit and send that list to awk. In awk, use the dot as the separator between fields, so if the file is called 000011.002, then $0 will be 000011.002, $1 will be 000011 and $2 will be 002. Now, if the filename ends in .001, print the accumulated command and append the output filename. Then save the filename prefix with a PDF extension as the output filename of the next PDF, and start building up the next ImageMagick convert command. On subsequent lines (which don't end in .001), add the filename to the list of filenames to include in the PDF. At the end, output any accumulated command and append the output filename."
As regards the ugly black block at the bottom of your image, it happens because there are some tiny white specks in there that prevent ImageMagick from removing the black area.
If you blur the picture a little (to diffuse the specks) and then get the size of the trim-box, you can apply that to the original, unblurred image like this:
trimbox=$(convert original.tif -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert original.tif -crop $trimbox result.tif
I would recommend you do that first to A COPY of all your images, then run the PDF conversion afterwards. As you will want to save a TIFF file but keep the extension .001, .002 and so on, you will need to tell ImageMagick to trim and force the output filetype to TIF:
original=XYZ.001
trimbox=$(convert $original -blur x2 -bordercolor black -border 1 -fuzz 50% -format %@ info:)
convert $original -crop $trimbox TIF:$original
As @AlexP. mentions, there can be issues with globbing if there is a large number of files. On OSX, ARG_MAX is very high (262144) and your filenames are around 10 characters, so you may hit problems if there are more than around 26,000 files in one directory. If that is the case, simply change:
ls *[0-9] | awk ...
to
ls | grep '[0-9]$' | awk ...
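You can check the limit on your own system with getconf; the value shown below is the OSX one quoted above:
$ getconf ARG_MAX
262144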
The following command would convert the whole /Document Management Root tree (assuming that is its actual absolute path), properly processing all subfolders, even those with names including whitespace characters, and properly skipping all other files not matching the 000000.000 naming pattern:
find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{6\}\.001$' -exec bash -c 'p="{}"; d="${p:0: -10}"; n=${p: -10:6}; m=10#$n; c[1]="$d$n.001"; for i in {2..999}; do k=$((m+i-1)); l=$(printf "%s%06d.%03d" "$d" $k $i); [[ -f "$l" ]] || break; c[$i]="$l"; done; echo -n "convert"; printf " %q" "${c[@]}" "$d$n.pdf"; echo' \; | bash
To do a dry run, just remove the | bash at the end.
Updated to match the 00000000.000 pattern (and split to multiple lines for clarity):
find '/Document Management Root' -type f -regextype sed -regex '.*/[0-9]\{8\}\.001$' -exec bash -c '
pages[1]="{}"
p1num="10#${pages[1]: -12:8}"
for i in {2..999}; do
nextpage=$(printf "%s%08d.%03d" "${pages[1]:0: -12}" $((p1num+i-1)) $i)
[[ -f "$nextpage" ]] || break
pages[i]="$nextpage"
done
echo -n "convert"
printf " %q" "${pages[#]}" "${pages[1]:0: -3}pdf"
echo
' \; | bash

Pass .txt list of .jpgs to convert (bash)

I'm currently working on an exercise that requires me to write a shell script whose function is to take a single command-line argument that is a directory. The script takes the given directory, and finds all the .jpgs in that directory and its sub-directories, and creates an image-strip of all the .jpgs in order of modification time (newest on bottom).
So far, I've written:
#!/bin/bash
dir=$1 #the first argument given will be saved as the dir variable
#find all .jpgs in the given directory
#then ls is run for the .jpgs, with the date format %s (in seconds)
#then sed squeezes runs of spaces so 'cut' can split on single spaces
#fields 6 and 7 (the name and the time stamp) are then cut and sorted by modification date
#then, field 2 (the file name) is selected from that input
#Finally, the entire sorted output is saved in a .txt file
find "$dir" -name "*.jpg" -exec ls -l --time-style=+%s {} + | sed 's/ */ /g' | cut -d' ' -f6,7 | sort -n | cut -d' ' -f2 > jgps.txt
The script correctly outputs the directory's .jpgs in order of modification time. The part I am currently struggling with is how to give the list in the .txt file to the convert -append command that will create the image strip for me. (For those who aren't aware of that command, you would input: convert -append image1.jpg image2.jpg image3.jpg IMAGESTRIP.jpg, with IMAGESTRIP.jpg being the name of the completed image strip made from the previous 3 images.)
I can't quite figure out how to pass the .txt list of files and their paths to this command. I've been scouring the man pages for a possible solution, but no viable one has arisen.
xargs is your friend:
find "$dir" -name "*.jpg" -exec ls -l --time-style=+%s {} + | sed 's/ */ /g' | cut -d' ' -f6,7 | sort -n | cut -d' ' -f2 | xargs -I files convert -append files IMAGESTRIP.jpg
Explanation
The basic use of xargs is:
find . -type f | xargs rm
That is, you specify a command to xargs; it appends the arguments it receives from standard input and then executes the command. The above line would execute:
rm file1 file2 ...
But here you also need a fixed final argument (the output file) after the variable ones. Note that xargs' -I option is not suitable for this: -I implies one command invocation per input line, so convert would run once per image, each run overwriting IMAGESTRIP.jpg. Wrapping the command in sh -c solves it: xargs appends the file names as positional parameters after the trailing sh (which only fills in $0), "$@" expands to all of them, and IMAGESTRIP.jpg stays at the end:
xargs sh -c 'convert -append "$@" IMAGESTRIP.jpg' sh
Put the list of filenames in a file called filelist.txt and pass its name to convert prefixed with an at sign (@):
convert @filelist.txt -append result.jpg
Here's a little example:
# Create three blocks of colour
convert xc:red[200x100] red.png
convert xc:lime[200x100] green.png
convert xc:blue[200x100] blue.png
# Put their names in a file called "filelist.txt"
echo "red.png green.png blue.png" > filelist.txt
# Tell ImageMagick to make a strip
convert @filelist.txt +append strip.png
As there's always some image with a pesky space in its name...
# Make the pesky one
convert -background black -pointsize 128 -fill white label:"Pesky" -resize x100 "image with pesky space.png"
# Whack it in the list for IM
echo "red.png green.png blue.png 'image with pesky space.png'" > filelist.txt
# IM do your stuff
convert @filelist.txt +append strip.png
By the way, it is generally poor practice to parse the output of ls, in case there are spaces in your filenames. If you want to find a list of images across directories and sort them by time, look at something like this:
# Find image files only - ignoring case, so "JPG", "jpg" both work
find . -type f -iname \*.jpg
# Now exec `stat` to get the file ages and quoted names
... -exec stat --format "%Y:%N" {} \;
# Now sort that, and strip the times and colon at the start
... | sort -n | sed 's/^.*://'
# Put it all together
find . -type f -iname \*.jpg -exec stat --format "%Y:%N" {} \; | sort -n | sed 's/^.*://'
Now you can either redirect all that to filelist.txt and call convert like this:
find ...as above... > filelist.txt
convert @filelist.txt +append strip.jpg
Or, if you want to avoid intermediate files and do it all in one go, you can build this monster, where convert reads the file list from its standard input stream:
find ...as above... | sed 's/^.*://' | convert @- +append strip.jpg

Regexp lines from file and run command

I have a file with output from the identify command, which looks like this (format: FILENAME FORMAT SIZE METADATA):
/foo/bar.jpg JPEG 2055x1381 2055x1381+0+0 8-bit DirectClass
/foo/ham spam.jpg JPEG 855x781 855x781+0+0 8-bit DirectClass
...
Note that the filenames can contain spaces! What I want to do is to basically run this on each of those lines:
convert -size <SIZE> -colors 1 xc:black <FILENAME>
In other words, creating blank images with the dimensions of existing ones. I've tried doing this with cat/sed/xargs, but it's making my head explode. Any hints? Preferably a command-line solution.
Assuming that the filename is the string before " JPEG":
LINE="/foo/ham spam.jpg JPEG 855x781 855x781+0+0 8-bit DirectClass"
You can get the file name as:
FILENAME=$(echo "$LINE" | sed 's/\(.*\) JPEG.*/\1/')
cat data_file | sed -e 's/\(.*\) JPEG \([^ ]*\) .*/convert -size \2 -colors 1 xc:black "\1"/' | bash
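If you would rather not generate shell code with sed and pipe it into bash, a plain while-read loop does the same job and is easier to debug (a sketch; data_file stands for your file of identify output):
while IFS= read -r line; do
    filename=${line% JPEG*}
    size=$(echo "$line" | sed 's/.* JPEG \([^ ]*\) .*/\1/')
    convert -size "$size" -colors 1 xc:black "$filename"
done < data_file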
You can do what Michał suggests. Also, if the metadata has a fixed number of words, you can do it easily like the following (supposing you process every line):
FILENAME=`echo "$LINE" | rev | cut -d' ' -f6- | rev`
(That is, reverse the line and take everything from the sixth field on, which skips the five reversed metadata fields; then reverse again to obtain the filename proper.)
If not, you can use the fact that all the images have an extension, that the extension itself doesn't contain spaces, and match everything up to the first space after the extension:
FILENAME=`echo "$LINE" | sed 's/\(.*\.[^ ]*\) .*/\1/'`
