Bash PDF-merging misses files - bash

I'm trying to merge many PDF files in chunks of 3000 or so files. After many tries, this script seemed to do the trick (of course, I was wrong):
#!/bin/bash
basepath='/home/lemonidas/pdfstuff';
datename=`date "+%Y%m%d%H%M.%S"`;
start=`date "+%s"`;
echo "parsing pdf list to file..."
find $basepath/input/ -name "*.pdf" | xargs -I {} ls {} >> $basepath/tmp/biglist$datename.txt
split -l 3000 $basepath/tmp/biglist$datename.txt $basepath/tmp/splitfile
rm $basepath/tmp/biglist$datename.txt
echo "deleting big file..."
echo "done splitting!"
declare -i x
x=1
for f in $basepath/tmp/splitfile*
do
linenum=`cat $f | wc -l`;
echo "Processing $f ($linenum lines)..."
# merge to one big PDF
cat $f | xargs gs -q -sstdout=$basepath/error.log -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=$basepath/output/$x.big.pdf 2>$basepath/error.log
echo "Completed PDF $x"
(( x++ ))
# delete the list file
rm $f
echo "Deleted processed file $f"
done
end=`date "+%s"`;
echo "Started: $start"
echo "Finished: $end"
The problem is, I have 22000 2-page PDFs, so each output file (except the last one) should be 6000 pages, since each merge list holds 3000 PDFs (as verified by wc -l before parsing), yet I only get around 658 pages or so.
No errors are reported except this by gs:
Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.Warning: Embedded symbolic TT fonts must contain a cmap for Platform=1 Encoding=0.
This file had errors that were repaired or ignored.
The file was produced by: >>>> Powered By Crystal
Please notify the author of the software that produced this file that it does not conform to Adobe's published PDF specification.
over and over (though not 22000 times).
When I try it with 300-400 files, it runs smoothly, but when I try the full run, after 2.5 hours, I get much less than half the files merged.
My next thought is to convert each 2-page PDF into .pgm files, but then I have no idea how to turn those back into a PDF (so that no font-embedding issues arise).
Am I missing something? (probably)

You would probably do better to use a tool more suited to the task. pdfwrite (the Ghostscript device for emitting PDF files) is not, in my opinion, the right tool for this.
In order to 'merge' PDF files, Ghostscript fully interprets the input into marking operations, then rewrites those marking operations as a PDF file. While building that list of operations, a great deal of information has to be held (fonts, images, other things) and compared against new input to see whether a copy is already present. As the input grows larger, scanning that list takes longer, and of course memory consumption increases. You may find that Ghostscript is already swapping memory.
Now I'm not sure this is your actual problem, or if you are saying that after you 'merge' the files there are pages missing. That should not happen. You also don't say what version of Ghostscript you are using.
All the same, I would think that a tool like pdftk would be faster at this kind of merge, though the final PDF might well be larger/less efficient than the one pdfwrite would produce.
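For example, the per-chunk merge inside the loop above could be handed to pdftk along these lines (a sketch, assuming pdftk is installed and the paths in the split files contain no spaces):
# inside the loop, in place of the gs invocation:
# pdftk takes the input files first, then the 'cat output <file>' clause,
# so expand the whole chunk onto the command line (3000 short paths stays
# well under the usual ARG_MAX limit)
pdftk $(cat "$f") cat output "$basepath/output/$x.big.pdf"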

Related

pdftk update_info command raising a warning which I don't understand

I'm trying to use the update_info command in order to add some bookmarks to an existing PDF's metadata, using pdftk and PowerShell.
I first dump the metadata into a file as follows:
pdftk .\test.pdf dump_data > test.info
Then I edit the test.info file, adding the bookmarks (I believe I am using the right syntax). I save the test.info file and attempt to write the metadata to a new PDF using update_info:
pdftk test.pdf update_info test.info output out.pdf
Unfortunately, I get a warning as follows:
pdftk Warning: unexpected case 1 in LoadDataFile(); continuing
out.pdf is generated, but contains no bookmarks. Just to be sure it was not a syntax problem, I also ran it without editing the metadata file, simply writing back the same metadata. I still got the same warning.
Why is this warning occurring? Why are no bookmarks getting written to my resulting pdf?
Using redirection in that fashion,
pdftk .\test.pdf dump_data > test.info
causes this known problem: the redirection builds a file with the wrong structure (PowerShell's > operator re-encodes the output, typically as UTF-16, which pdftk cannot read back). So change it to
pdftk .\test.pdf dump_data output test.info
In addition, check that your alterations are correctly balanced (and contain no unusual characters), then save the edited output file in the same encoding.
Note: you may need to use dump_data_utf8 and update_info_utf8 in order to properly handle characters in scripts other than Latin (e.g. CJK).
I used pdftk --help >pdftk-help.txt to find the answer.
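For reference, a correctly balanced bookmark entry in the dumped data file looks like this (the title and page number below are only placeholders):
BookmarkBegin
BookmarkTitle: Chapter 1
BookmarkLevel: 1
BookmarkPageNumber: 1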
With credit to the previous answer, the following creates a text file of the information parameters: pdftk aaa.pdf dump_data output info.txt
Edit the info.txt file as needed.
The pdftk update_info option creates a new pdf file, leaving the original pdf untouched. Use: pdftk aaa.pdf update_info info.txt output bbb.pdf

Grep Zip files in windows - Have a process that works, but could this be faster?

I have seen posts about zipgrep for Linux.
For example - grep -f on files in a zipped folder
rem zipgrep -s "pattern" TestZipFolder.zip
rem zipgrep [egrep_options] pattern file[.zip] [file(s) ...] [-x xfile(s) ...]
Using Google, I did find http://www.info-zip.org/mans/zipgrep.html, but looking in their archives I don't see zipgrep in there. It also seems the Info-ZIP binaries/code have not been updated in quite a while. I suppose I could grab some of their source and compile.
I also looked on the Cygwin site and see they are toying with this as well.
Here is what I am using today. Just wondering if I could make this faster:
D:\WORK\Scripts\unzip -c D:\Logs\ArchiveTemp\%computername%-04-07-2014-??-00-00-compressed.zip server.log.* | D:\WORK\Scripts\grep -i ">somestring<" >> somestring.txt
A couple of issues with the code I have posted:
* Does not show which log file the string is in
* Does not show which zip file the string is in
While the command I posted works, it has a lot of room for improvement.
There is not much headroom for optimization, but it is worth noting that different implementations of unzip vary in performance. For speed on Windows, decompress the zip file using 7-Zip or the Cygwin unzip utility (obtained via setup -nqP unzip, or through the setup GUI).
After unzipping, grep the directory tree recursively using fgrep -r.
In summary:
1) copy the zip file to fooCopy.zip
2) unzip fooCopy.zip
3) fgrep -r "pattern" fooCopy
Rationale: because the file is compressed, you would have to incrementally decompress the pieces to grep them anyway. Doing it as one batch job is faster, and clearer for someone else to understand.
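Put together, and assuming the Cygwin unzip and grep binaries are on the PATH, the three steps look roughly like this; because a recursive grep prints the name of each matching file, it also answers the "which log file is the string in" concern:
rem extracting each archive into its own folder keeps track of which zip a hit came from
copy TestZipFolder.zip fooCopy.zip
unzip -q fooCopy.zip -d fooCopy
grep -ri ">somestring<" fooCopy >> somestring.txt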

shellscript to convert .TIF to a .PDF

I want to walk through a directory's subdirectories and either convert or place the .TIF images into a PDF. I have a directory structure like this:
folder
    item_one
        file1.TIF
        file2.TIF
        ...
        fileN.TIF
    item_two
        file1.TIF
        file2.TIF
        ...
    ...
I'm working on a Mac and considered using sips to change my .TIF files to .PNG files and then use pdfjoin to join all the .PNG files into a single .PDF file per folder.
I have used:
for filename in *; do sips -s format png $filename --out $filename.png; done
but this only works for the .TIF files in a single directory. How would one write a shell script to work through a series of directories as well?
Once the .PNG files were created, I'd do essentially the same thing, but using:
pdfjoin --a4paper --fitpaper false --rotateoversize false *.png
Is this a valid way of doing this? Is there a better, more efficient way of performing such an action? Or am I being an idiot and should be doing this with some sort of software, like ImageMagick or something?
Try using the find command with the -exec switch to call your image-conversion step. Alternatively, instead of -exec, you could pipe the output of find to xargs. There is lots of information online about using find; here's one example from Stack Overflow.
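As a sketch of that approach, reusing the same sips and pdfjoin invocations already shown in the question and assuming the directory layout above (one output PDF per item folder):
# walk every subdirectory of 'folder', convert its TIFs, then join the PNGs
find folder -mindepth 1 -type d | while read -r dir; do
    for tif in "$dir"/*.TIF; do
        [ -e "$tif" ] || continue    # skip folders with no .TIF files
        sips -s format png "$tif" --out "${tif%.TIF}.png"
    done
    pdfjoin --a4paper --fitpaper false --rotateoversize false "$dir"/*.png
done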
As far as the image conversion goes, I think that really depends on your requirements for speed and efficiency. If you've verified the process you described, this is a one-time job, and it only takes seconds or minutes to run, then you're probably fine. On the other hand, if you need to do this frequently, it might be worth investing the time to find a one-step conversion that takes less time than your current two-pass solution.
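One candidate for such a one-step conversion is ImageMagick, which the question already mentions: it can write a multi-page PDF directly from the TIF files (a sketch, assuming ImageMagick's convert is installed, e.g. via Homebrew or MacPorts):
# one PDF per subdirectory, straight from the TIF files, using find -exec
find folder -mindepth 1 -type d -exec sh -c 'convert "$1"/*.TIF "$1.pdf"' _ {} \;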
Note that, instead of two passes, you may be able to pipe the output of sips to pdfjoin; however, that would require some investigation to verify.

how to find image files without extensions (on macos 10.8)

I have an app that has decided to die, which kept a library of images on my hard drive in a series of GUID-like folders. The files themselves have no file extensions; there must have been an internal database (unrecoverable/corrupt) that associated each file with its name/extension/MIME type. To get my stuff back out, I'd like to search the disk and at least identify which of the files are images (JPEG and PNG files). I know that both JPEG and PNG have particular byte sequences in the first few bytes of the file. Is there a grep command that can match these known byte sequences in the first few bytes of each file in the massively nested file structure that I have (e.g. folders 0 through f, each containing folders 0 through f, nested several levels deep, with files with UID filenames)?
Starting at the current directory .:
find . -type f -print0 | xargs -J fname -0 -P 4 identify -ping fname 2>|/dev/null
This will print the files that ImageMagick can identify, which are mostly images, though there are exceptions (plain text files, for instance). ImageMagick is not particularly fast at this task either, so depending on what you have available there might be faster alternatives. For instance, the PIL package for Python would be faster simply because it supports fewer image formats, which might still be enough for your task.
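Another option that maps directly onto the "known byte sequences in the first few bytes" idea is the standard file(1) utility, which inspects exactly those magic bytes and is typically faster than running a full image decoder (a sketch):
# list every file whose magic bytes identify it as JPEG or PNG
find . -type f -print0 | xargs -0 file | grep -E 'JPEG image data|PNG image data'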

What is the fastest way to unzip textfiles in Matlab during a function?

I would like to scan the text of text files in Matlab with the textscan function. Before I can open a text file with fid = fopen('C:\path'), I need to unzip the files first. The files have the extension *.gz.
There are thousands of files which I need to analyze and high performance is important.
I have two ideas:
(1) Use an external program and call it from the command line in Matlab
(2) Use a Matlab 'zip' toolbox. I have heard of gunzip, but I don't know about its performance.
Does anyone know a way to unzip these files as quickly as possible from within Matlab?
Thanks!
You could always try the Matlab unzip() function:
unzip
Extract contents of zip file
Syntax
unzip(zipfilename)
unzip(zipfilename, outputdir)
unzip(url, ...)
filenames = unzip(...)
Description
unzip(zipfilename) extracts the archived contents of zipfilename into the current folder and sets the files' attributes, preserving the timestamps. It overwrites any existing files with the same names as those in the archive if the existing files' attributes and ownerships permit it. For example, files from rerunning unzip on the same zip filename do not overwrite any of those files that have a read-only attribute; instead, unzip issues a warning for such files.
Internally, this uses Java's zip library org.apache.tools.zip. If your zip archives each contain many text files, it might be faster to drop down into Java and extract them entry by entry, without explicitly creating unzipped files. Look at the source of unzip.m to get some ideas, and also at the Java documentation.
I've found 7zip-commandline(Windows) / p7zip(Unix) to be somewhat speedier for this.
[edit] From some quick testing, it seems making a system call to gunzip is faster than using MATLAB's native gunzip. You could give that a try as well.
Just write a new function that imitates basic MATLAB gunzip functionality:
function [] = sunzip(fullfilename,output_dir)
if ~exist('output_dir','var'), output_dir = fileparts(fullfilename); end
app_path = '/usr/bin/7za';
switches = ' e'; %extract files ignoring directory structure
options = [' -o' output_dir];
system([app_path switches options '_' fullfilename]);
Then use it as you would use gunzip:
sunzip('/data/time_1000.out.gz',tmp_dir);
With MATLAB's toc timer, I get the following extraction times with 6 uncompressed 114MB ASCII files:
gunzip: 10.15s
sunzip: 7.84s
This worked well; I just needed a minor change to Max's syntax for calling the executable:
system([app_path switches ' ' fullfilename options ]);
