sort files before converting to pdf in imagemagick - sorting

I have a folder full of image files that I need to convert to a PDF. I used wget to download them. The problem is that the order Linux lists the files in isn't the actual page order; this is an example of the file listing:
100-52b69f4490.jpg
101-689eb36688.jpg
10-1bf275d638.jpg
102-6f7dc2def9.jpg
103-2da8842faf.jpg
104-9b01a64111.jpg
105-1d5e3862d8.jpg
106-221412a767.jpg
...
I can convert these images to a pdf using imagemagick, with the command
convert *.jpg output.pdf
but it'll put the pages into that pdf in the above order, not in human readable numerical order 1-blahblahblah.jpg, 2-blahblahblah.jpg, 3-blahblahblah.jpg etc.
Is the easiest way to do this to pipe the output of sort into convert? Or to pipe my wget output so each file is added to a PDF as it is downloaded?

convert $(ls -1v *.jpg) book.pdf
worked for me
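If the file names might contain spaces, reusing ls output gets fragile. A minimal spaces-safe sketch of the same idea, assuming bash 4+ (for readarray) and GNU sort:
# version-sort the glob so 2-foo.jpg comes before 10-bar.jpg, without word splitting
readarray -t pages < <(printf '%s\n' *.jpg | sort -V)
convert "${pages[@]}" book.pdf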

There are several options:
The simplest is as follows, but may overflow your command-line length if you have too many pages:
convert $(ls *jpg | sort -n) result.pdf
Next up is feeding the list of files on stdin like this:
ls *jpg | sort -n | convert @- result.pdf

Here is a bash script to do it:
#!/bin/bash
sort -n < list.txt > sorted_list.tmp
readarray -t list < sorted_list.tmp
convert "${list[#]}" output.pdf
rm sorted_list.tmp
exit
You can get list.txt by first listing your directory with ls > list.txt.
The sort -n (numerical sort) "normalizes" your entries.
The sorted list is saved in the .tmp file and deleted at the end.
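If you would rather skip the temporary file, the same idea works with process substitution; a minimal sketch, assuming bash 4+ so readarray is available:
# read the numerically sorted listing straight into an array, no .tmp file needed
readarray -t list < <(sort -n list.txt)
convert "${list[@]}" output.pdf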

Related

Control file order with tiffcp

I want to use my tried-and-true script to combine all tif in a directory into a single multipage tiff
tiffcp *.tif out.tif
but I want the files in the reverse of alphabetical order, e.g. 003.tif, 002.tif, 001.tif. Is there a flag in tiffcp? Do I need to rename all the files?
You can do this with some bash:
tiffcp `ls *.tif | sort -r` out.tif
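If the file names may contain spaces, here is a sketch that avoids word splitting, assuming bash 4+ and GNU sort:
# build a reverse-sorted array from the glob instead of relying on backtick word splitting
readarray -t tifs < <(printf '%s\n' *.tif | sort -r)
tiffcp "${tifs[@]}" out.tif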

Is it possible to grep using an array as pattern?

TL;DR
How to filter an ls/find output using grep
with an array as a pattern?
Background story:
I have a pipeline which I have to rerun for datasets which run into an error.
Which datasets ran into an error is recorded in a tab-separated file.
I want to delete the files where the pipeline has run into an error.
To do so I extracted the dataset names from another file containing the finished datasets and saved them in a bash array (ds1 ds2 ...), but now I am stuck because I cannot figure out how to exclude the datasets in the array from my deletion step.
This is the folder structure (X=1-30):
datasets/dsX/results/dsX.tsv
Not excluding the finished datasets, i.e. deleting the folders of both the failed and the finished datasets, works like a charm:
#1. move content to a trash folder
ls /datasets/*/results/*|xargs -I '{}' mv '{}' ./trash/
#2. delete the empty folders
find /datasets/*/. -type d -empty -delete
But since I want to exclude the finished datasets I thought it would be clever to save them in an array:
#find finished datasets by extracting the dataset names from a tab separated log file
mapfile -t -s 1 finished < <(awk '{print $2}' $path/$log_pf)
echo "${finished[@]}"
which works as expected but now I am stuck in filtering the ls output using that array:
*pseudocode
#trying to ignore the dataset in the array - not working
ls -I${finished[@]} -d /datasets/*/
#trying to reverse grep for the finished datasets - not working
ls /datasets/*/ | grep -v {finished}
What do you think about my current ideas?
Is this possible using bash only? I guess in python I could do that easily
but for training purposes, I want to do it in bash.
grep can get the patterns from a file using the -f option. Note that file names containing newlines will cause problems.
If you need to process the input somehow, you can use process substitution:
grep -f <(process the input...)
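Applied to your layout, that could look roughly like the sketch below. It assumes the finished array from your question holds plain dataset names (ds1, ds2, ...) and that no name is a substring of another, since -F matches fixed strings anywhere in the line:
# list candidate dataset directories, then drop those whose name appears in the array
ls -d /datasets/*/ | grep -vFf <(printf '%s\n' "${finished[@]}")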
I must admit I'm confused about what you're doing but if you're just trying to produce a list of files excluding those stored in column 2 of some other file and your file/directory names can't contain spaces then that'd be:
find /datasets -type f | awk 'NR==FNR{a[$2]; next} !($0 in a)' "$path/$log_pf" -
If that's not all you need then please edit your question to clarify your requirements and add concise testable sample input and expected output.

My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?

I have a file.gz (not a .tar.gz!) or file.zip file. It contains one file (20GB-sized text file with tens of millions of lines) named 1.txt.
Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
The resulting .txt files must not exceed a predefined limit, say, one million lines.
That is, if there are 3.5M lines in 1.txt that match those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the latter will contain 500K lines), that's all.
I tried to make use of something like
gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.
You can perhaps use zgrep.
zgrep [ grep_options ] [ -e ] pattern filename.gz ...
NOTE: zgrep is a wrapper script (installed with gzip package), which essentially uses the same command internally as mentioned in other answers.
However, it reads more cleanly in a script and is easier to type manually.
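For the exact requirement in the question (keep lines matching one regex, drop lines matching another, cap each output file at a million lines), zgrep combines naturally with split; a sketch, assuming GNU grep's -P and GNU split:
# decompress, filter, then cut into numbered chunks of at most 1,000,000 lines each
zgrep -P 'my regex' path/to/test/file.gz | grep -vP 'other regex' | split -d -l 1000000 - part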
I'm afraid it's impossible; quoting from the gzip man page:
If you wish to create a single archive file with multiple members so
that members can later be extracted independently, use an archiver
such as tar or zip.
UPDATE: After the edit, if the gz only contains one file, a one-step tool like awk should be fine:
gzip -cd path/to/test/file.gz | awk 'BEGIN{global=1} /my regex/{count+=1; print $0 > ("part" global ".txt"); if (count==1000000){close("part" global ".txt"); count=0; global+=1}}'
split is also a good choice but you will have to rename files after it.
Your solution is almost good. The problem is that you have to tell gzip what to do: to decompress, use -d. So try:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But with this you will get a bunch of files like xaa, xab, xac, ... I suggest using the PREFIX argument and numeric suffixes to create better output:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file
In this case the result files will look like file00, file01, file02, etc.
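If you want names closer to the part1.txt, part2.txt scheme from the question, GNU split can also use a longer numeric suffix and append an extension; a sketch (the -a and --additional-suffix options are GNU-specific):
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -d -a 3 --additional-suffix=.txt -l 1000000 - part
which produces part000.txt, part001.txt, and so on.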
If you also want to drop lines that match another Perl-style regex, you can try something like this:
gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file
I hope this helps.

Use exiv2 or imagemagick to remove EXIF data from stdin and output to stdout

How can I pipe an image into exiv2 or imagemagick, strip the EXIF tag, and pipe it out to stdout for more manipulation?
I'm hoping for something like:
exiv2 rm - - | md5sum
which would output an image supplied via stdin and calculate its md5sum.
Alternatively, is there a faster way to do this?
Using exiv2
I was not able to find a way to get exiv2 to output to stdout -- it only wants to overwrite the existing file. You could use a small bash script to make a temporary file and get the md5 hash of that.
image.sh:
#!/bin/bash
cat <&0 > tmp.jpg # Take input on stdin and dump it to temp file.
exiv2 rm tmp.jpg # Remove EXIF tags in place.
md5sum tmp.jpg # md5 hash of stripped file.
rm tmp.jpg # Remove temp file.
You would use it like this:
cat image.jpg | image.sh
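A small variation, in case a fixed tmp.jpg name worries you (parallel runs, leftovers); a sketch assuming GNU mktemp for the --suffix option, keeping a .jpg extension just to be safe:
#!/bin/bash
tmp=$(mktemp --suffix=.jpg)   # unique temp file with a .jpg extension
trap 'rm -f "$tmp"' EXIT      # clean up even if a step fails
cat > "$tmp"                  # stdin to temp file
exiv2 rm "$tmp"               # strip EXIF tags in place
md5sum "$tmp"                 # md5 hash of the stripped file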
Using ImageMagick
You can do this using ImageMagick instead by using the convert command:
cat image.jpg | convert -strip - - | md5sum
Caveat:
I found that stripping an image of EXIF tags using convert resulted in a smaller file-size than using exiv2. I don't know why this is and what exactly is done differently by these two commands.
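One way to inspect what each tool actually dropped is to compare the EXIF properties before and after stripping; a sketch, assuming ImageMagick's identify is available (the %[EXIF:*] format escape lists EXIF tags):
# count EXIF entries in the original and in a stripped copy
identify -format '%[EXIF:*]' image.jpg | wc -l
convert image.jpg -strip - | identify -format '%[EXIF:*]' - | wc -l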
From man exiv2:
rm Delete image metadata from the files.
From man convert:
-strip strip image of all profiles and comments
Using exiftool
ExifTool by Phil Harvey
You could use exiftool (I got the idea from https://stackoverflow.com/a/2654314/3565972):
cat image.jpg | exiftool -all= - -out - | md5sum
This too, for some reason, produces a slightly different image size from the other two.
Conclusion
Needless to say, all three methods (exiv2, convert, exiftool) produce outputs with different md5 hashes. Not sure why this is. But perhaps if you pick a method and stick to it, it will be consistent enough for your needs.
I tested with a NEF file. It seems only
exiv2 rm
works well here; exiftool and convert can't remove all the metadata from a .nef file.
Note that the output file of exiv2 rm can no longer be displayed by most image viewers, but I only need the MD5 hash to stay the same after I update any metadata of the .NEF file, so this works perfectly for me.

convert a directory of images into a single PDF

I have a directory of images:
path/to/directory/
image01.jpg
image02.jpg
...
and would like to convert it into a single PDF file:
path/to/directory.pdf
This is what I managed to code so far:
#!/bin/bash
echo Directory $1
out=$(echo $1 | sed 's|/$|.pdf|')
echo Output $out
mkdir tmp
for i in $(ls $1)
do
# MAC hates sed with "I" (ignore case) - thanks SO for the perl solution!
# I want to match "jpg, JPG, Jpg, ..."
echo $1$i $(echo "tmp/$i" | perl -C -e 'use utf8;' -pe 's/jpg$/pdf/i')
convert $1$i $(echo "tmp/$i" | perl -C -e 'use utf8;' -pe 's/jpg$/pdf/i')
done
pdftk tmp/*.pdf cat output $out
rm -rf tmp
So the idea was to convert each image into a pdf file with imagemagick, and use pdftk to merge it into a single file. Thanks to the naming of the files I don't have to bother about the ordering.
Since I'm a newbie to this I'm sure there are many refinements one can do:
only iterate over image-files in the directory (in case there is some Readme.txt,...)
including the extensions png, jpeg, ...
using the trailing "/" is not elegant, I admit
etc.
Currently my main problem is, however, that there are cases where my directories and image files contain spaces in their names. The for-loop then iterates over sub-strings of the filename and I imagine that the line with convert will also fail.
I have tried out some things but haven't succeeded so far and hope someone will be able to help me here.
If anyone has ideas to address the issues I listed above as well I would be very glad to hear them too.
convert can do this in one go:
convert *.[jJ][pP][gG] output.pdf
Or to answer several of your other questions and replace your script:
#!/bin/bash
shopt -s nullglob nocaseglob
convert "$1"/*.{png,jpg,jpeg} "${1%/}.pdf"
will iterate over all the given extensions in the first argument, regardless of capitalization, and write to yourdir.pdf. It will not break on spaces.
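For example, saved as dir2pdf.sh (the name is just for illustration) and made executable, it could be run like this:
./dir2pdf.sh "path/to/my directory/"
# -> path/to/my directory.pdf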
