I use pdftk Server to automate various tasks. Recently I ran into a problem where pdftk crashed while merging a large number of PDFs, showing the error window:
Fatal error in gc: Too many heap sections
After receiving this error I ran some tests and confirmed that the same error occurs, regardless of what task pdftk is performing on the PDFs, whenever its memory usage exceeds 512 MB.
I was hoping someone could help me understand what this error means, and whether there's a way to set up pdftk to handle these larger jobs.
If it's just a limitation of the program, does anyone have a suggestion for a similarly functioning program without this limitation?
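In case it helps others hitting the same ceiling, one workaround is to merge in batches so that no single pdftk run has to hold every file at once. Below is a minimal Python sketch of that idea; the two-pass approach, file names, and batch size are my own assumptions, not anything pdftk documents:

```python
import subprocess

def chunk(files, size):
    """Split the input list into consecutive batches of at most `size` items."""
    return [files[i:i + size] for i in range(0, len(files), size)]

def merge_in_batches(files, out="merged.pdf", batch_size=50):
    """Merge PDFs in two passes so no single pdftk run holds every file at once."""
    partials = []
    for i, batch in enumerate(chunk(files, batch_size)):
        part = f"partial_{i:03d}.pdf"  # hypothetical intermediate file name
        subprocess.run(["pdftk", *batch, "cat", "output", part], check=True)
        partials.append(part)
    # Second pass: merge the (much smaller number of) partial files.
    subprocess.run(["pdftk", *partials, "cat", "output", out], check=True)
```

Whether this stays under the 512 MB ceiling depends on the batch size and the size of the partial files, so some tuning would be needed.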
I used pdftk for dumping data such as bookmarks and contents; it is very useful.
However, I ran into similar problems.
Ghostscript may be helpful.
Ghostscript can rewrite the original PDF file into a new one and decrease its size.
Ghostscript can also split one large PDF file into smaller PDF files.
My commands are:
gswin64c -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dBATCH -dQUIET -dNOPAUSE -dDOPDFMARKS -dFirstPage=(some number) -dLastPage=(some number) -sOutputFile=newfile.pdf originalfile.pdf
Parameters such as -dFirstPage=(some number) -dLastPage=(some number) may be omitted if the transformed PDF file is small enough.
I hope the above is helpful.
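To automate that page-range splitting across many ranges, a small wrapper can generate one Ghostscript invocation per chunk. A sketch (the page count, chunk size, and output names are just examples):

```python
import subprocess

def gs_split_cmd(src, dst, first, last, gs="gswin64c"):
    """Build the Ghostscript command for copying pages first..last of src to dst."""
    return [gs, "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
            "-dBATCH", "-dQUIET", "-dNOPAUSE", "-dDOPDFMARKS",
            f"-dFirstPage={first}", f"-dLastPage={last}",
            f"-sOutputFile={dst}", src]

# Example: split a (hypothetical) 1000-page file into ten 100-page parts.
for i, first in enumerate(range(1, 1001, 100)):
    cmd = gs_split_cmd("originalfile.pdf", f"part_{i:02d}.pdf", first, first + 99)
    # subprocess.run(cmd, check=True)  # uncomment once gswin64c is on PATH
```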
So I know it's not perfect, but I just wrote my own implementation using iTextSharp to fulfill my requirements. If anyone else runs into this post and needs it, it's available at this GitHub link under the AGPL license.
This is what I am trying to achieve:
I have several hundred small PDF files of varying size. I need to merge them into chunks of close to, but no more than, a certain target file size.
I am familiar with gs as well as pdftk (though I prefer to use gs).
Does anyone know a way of predicting the file size of the merged output PDF beforehand, so that I can use it to select the files to be included in the next chunk?
I am not aware of something like a --dry-run option for gs...
(If there is no other way, I guess I would have to estimate based on the sum of the input file sizes and go for trial and error.)
Thank you in advance!
I have an account on Bluehost, it's a shared machine. I have been able to run most custom scripts with no problem, but image processing scripts are killed mysteriously after about 20 seconds. No output file is created. Sometimes I can get the command line below to run if I restrict it to monochrome.
I tried ulimit and nice, but I feel I am just guessing. Is there a more methodical way to look into this? Yes, I am also contacting Bluehost support.
~]# gs -q -dQUIET -dSAFER -dBATCH -dNOPAUSE -dNOPROMPT \
> -sDEVICE=png48 -sOutputFile=11006.png 11006.pdf
Killed
~]# echo $?
137
~]#
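For what it's worth, that exit status already hints at the cause: a shell reports 128 plus the signal number for a killed process, and 137 - 128 = 9 is SIGKILL, which is what a host's resource limiter (or the kernel OOM killer) sends. A quick check:

```python
import signal

status = 137            # the exit status echoed by the shell above
sig = status - 128      # shells report 128 + signal number for a killed process
assert sig == signal.SIGKILL == 9
print(signal.Signals(sig).name)  # → SIGKILL
```

So the process was killed from outside rather than exiting on its own, which is consistent with a shared host enforcing memory or CPU limits.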
Problem 1 is that there is no png48 device in standard Ghostscript. The most likely problem after that, I'd guess, is that it's using too much memory, so you need to use controls which limit the memory usage and instead use the clist (a display-list implementation); this will take longer but reduce the memory footprint.
Available PNG output devices are png16, png16m, png256, pngalpha, pnggray, pngmono, pngmonod
Use the -dMaxBitmap switch to control the maximum page buffer size; if the page exceeds this, Ghostscript will use the clist, which results in the page being rendered in bands. This is slower to process but uses much less memory. You can also use -dNumRenderingThreads if your system has multiple cores.
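As an illustration of those switches, a command along these lines might work (the MaxBitmap value and thread count are arbitrary examples, and png16m is substituted for the unavailable png48):

```python
def gs_render_cmd(src, dst, max_bitmap=10_000_000, threads=2, gs="gs"):
    """Build a gs command that caps the page buffer at max_bitmap bytes.

    When a page needs more than -dMaxBitmap bytes, Ghostscript falls back
    to the clist (banded) path: slower, but a much smaller footprint.
    """
    return [gs, "-q", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-dNOPROMPT",
            "-sDEVICE=png16m",
            f"-dMaxBitmap={max_bitmap}",
            f"-dNumRenderingThreads={threads}",
            f"-sOutputFile={dst}", src]

# e.g. subprocess.run(gs_render_cmd("11006.pdf", "11006.png"), check=True)
```

On a shared host you would lower max_bitmap until the process survives the 20-second window; the right value depends on the host's (unknown) limit.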
What version of Linux is this, what version of Ghostscript, and where did the installed Ghostscript come from (e.g. did you build it yourself from source)?
If it's a very old version of Ghostscript, it may simply be that it has bugs.
I have a service which produces PDF files, and PS-compatible printers. To print the PDF files, I use Ghostscript to convert them to PS and copy them to a shared (Windows) print queue. Most of the PDF files contain just a few pages (<10) and don't cause any trouble.
From time to time I have to print large files (100+, 500+, 5000+) pages and there I observe the following:
converting to ps is fast for the first couple of pages, then slows down. The further the progress, the longer the time for a single page.
after conversion, copying to the print queue works without problems
when copying is finished and it comes to sending the document to the printer, I observe more or less the same phenomenon: the further the progress, the slower the transfer.
Here is how I convert pdf to ps:
"C:\Program Files\gs\gs9.07\bin\gswin64c.exe" \
-dSAFER -dNOPAUSE -DBATCH \
-sOutputFile=D:\temp\testGS\test.ps \
-sDEVICE=ps2write \
D:\temp\testGS\test.pdf
After this conversion I simply copy it to the print queue
copy /B test.ps \\printserver\myPSQueue
What possibilities do I have to print large files this way?
My first idea was to do the following:
"C:\Program Files\gs\gs9.07\bin\gswin64c.exe" \
-dSAFER -dNOPAUSE -DBATCH \
-sOutputFile=D:\temp\testGS\test%05d.ps \
-sDEVICE=ps2write \
D:\temp\testGS\test.pdf
Working with single pages speeds up the conversion (it doesn't slow down after every page), and printing is fast when I copy each page to the printer as its own PS file. But there is one problem I will encounter sooner or later: when I copy the single PS files, they become separate print jobs. Even if they are sorted in the correct order, if someone else starts a print job on the same printer in between, the printouts will all get mixed up.
The other idea was using gsprint, which works considerably fast, but gsprint needs the printer to be installed locally, which is not manageable in my environment with 300+ printers at different locations.
Can anyone tell me exactly, what happens? Is this a bad way to print? Does any have a suggestion how to solve the task of printing such documents in such an environment?
Without seeing an example PDF file it's difficult to say much about why it should print slowly. However, the most likely explanation is that the PDF is being rendered to an image, probably because it contains transparency.
This will result in a large image, created at the default resolution of the device (720 dpi), which is almost certainly higher than required for your printer(s). This means that a large amount of time is spent transmitting extra data to the printer, which the PostScript interpreter in the printer then has to discard.
Using gsprint renders the file to the resolution of the device; assuming this is less than 720 dpi, the resulting PostScript will be smaller, therefore requiring less time to transmit, less time to decompress on the printer, and less time spent throwing away extra data.
One reason the speed decreases is the way ps2write works: it maintains much of the final content in temporary files and stitches the main file back together from those files. It also maintains a cross-reference table which grows with the number of objects in the file. Unless you need the files to be continuous, you could create a number of print files using the -dFirstPage and -dLastPage options, so that only a subset of the final printout is created in each; this might improve the performance.
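A sketch of that chunked approach, reusing the gswin64c path from the question (the pages-per-chunk value is an arbitrary example, and you would still need to know the total page count, e.g. from Ghostscript itself or another tool):

```python
GS = r"C:\Program Files\gs\gs9.07\bin\gswin64c.exe"  # path from the question

def ps2write_chunk_cmds(src, total_pages, pages_per_chunk=100, gs=GS):
    """Build one ps2write command per page range, so each run's temporary
    files and cross-reference table stay small."""
    cmds = []
    for i, first in enumerate(range(1, total_pages + 1, pages_per_chunk)):
        last = min(first + pages_per_chunk - 1, total_pages)
        cmds.append([gs, "-dSAFER", "-dNOPAUSE", "-dBATCH",
                     "-sDEVICE=ps2write",
                     f"-dFirstPage={first}", f"-dLastPage={last}",
                     f"-sOutputFile=chunk_{i:03d}.ps", src])
    return cmds

# Each command can then be run with subprocess.run(cmd, check=True); every
# chunk is still a standalone multi-page PostScript file for the queue.
```

Unlike splitting into single pages, each chunk is a normal multi-page print file, though copying chunks to the queue separately still risks interleaving with other jobs.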
Note that ps2write does not render the incoming file to an image, while gsprint definitely does; the PostScript emerging from gsprint simply defines a big bitmap. This doesn't maintain colours (everything goes to RGB) and doesn't maintain vector objects as vectors, so it doesn't scale well.
However, if you want to use gsprint to print to a remote printer, you can set up a 'virtual printer' using RedMon. You can have RedMon send the output from a port to a totally different printer, even a remote one. So you use gsprint to print to (e.g.) 'local instance of MyPrinter' on RedMon1: and have the RedMon port set up to capture the print stream to disk and then send the PostScript file to 'MyPrinter on another PC'. Though I'd guess that's probably not going to be any faster.
My suggestion would be to set the resolution of ps2write lower; -r300 should be enough for any printer, and lower may be possible. The resolution will only affect rendered output, everything else remains as vectors and so scales nicely. Rendered images will print perfectly well at half the resolution of the printer, in general.
I can't say why the printer becomes so slow with the Ghostscript generated PostScript, but you might want to give other converters a try, like pdftops from the Poppler utils (I found a Windows download here as you seem to be using Windows).
I'm using a command similar to this:
gswin32c.exe -dNOPAUSE -dBATCH -q -dSAFER -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile="path/output.pdf" <PSfiles>
This gives me a single PDF document with each PS document represented as a page. However, the page sizes do not translate well. The original PS files are all different sizes, and each resulting PDF page is cut off to the same size, which looks like landscape A4.
When I convert a single PS file with the exact same command, the page size is preserved. So it seems that since all the PS files are being sent to the same PDF, they must all have the same page size, and I lose content. Is there any way to preserve the document sizes while still using a single command?
Update: I was originally using GS 8.63, but I downloaded 9.06 and have the same issue.
Additionally, I've narrowed the problem down. It seems there is one specific PS file (call it problemFile.ps) that causes the problem, as I can run the command successfully as long as I exclude problemFile.ps. And it only causes a problem if it is the last file included on the command line. I can't post the entire file, but are there any potential problem areas I should look at?
Update 2: Okay, I was wrong in saying there is one specific problem file. It appears that the page size of the last file included on the command line sets the maximum page size for all the resultant pages.
As long as each PostScript file (or indeed each page) actually requests a different media size then the resulting PDF file will honour the requests. I know this at least used to work, I've tested it.
However there are some things in your command line which you might want to reconsider:
1) When investigating problems with GS, don't use -q, this will prevent Ghostscript telling you potentially useful things.
2) DON'T use -dPDFSETTINGS unless you have read the relevant documentation and understand the implications of each parameter setting.
3) You may want to turn off AutoRotatePages, or at least set it to /PageByPage
My guess is that your PostScript files don't request a media size and therefore use the default media. Of course I can't tell without seeing an example.
NB you also don't say what version of Ghostscript you are using.
I have several large PDF reports (>500 pages) with grid lines and background shading overlay that I converted from postscript using GhostScript's ps2pdf in a batch process. The PDFs that get created look perfect in the Adobe Reader.
However, when I go to print the PDF from Adobe Reader I get about 4-5 ppm from our Dell laser printer, with long, 10+ second pauses between each page. The same report PDF generated from another proprietary process (not Ghostscript) yields a fast 25+ ppm on the same printer.
The PDF file sizes on both are nearly the same at around 1.5 MB each, but when I print both versions of the PDF to file (i.e. postscript), the GhostScript generated PDF postscript output is about 5 times larger than that of the other (2.7 mil lines vs 675K) or 48 MB vs 9 MB. Looking at the GhostScript output, I see that the background pattern for the grid lines/shading (referenced by "/PatternType1" tag) is defined many thousands of times throughout the file, where it is only defined once in the other PDF output. I believe this constant re-defining of the background pattern is what is bogging down the printer.
Is there a switch/setting to force Ghostscript to define a pattern/image only once? I've tried using the -r and -dPDFSETTINGS=/print switches with no relief.
Patterns (and indeed images) and many other constructs should only be emitted once, you don't need to do anything to have this happen.
Forms, however, do not get reused, and it's possible that this is the source of your actual problem. As Kurt Pfeifle says above, it's not possible to tell without seeing a file which causes the problem.
You could raise a bug report at http://bugs.ghostscript.com which will give you the opportunity to attach a file. If you do this, please do NOT attach a 500+ page file; it would be appreciated if you would try to find the time to create a smaller file which shows the same kind of size inflation.
Without seeing the PostScript file I can't make any suggestions at all.
I've looked at the source PostScript now, and as suspected the problem is indeed the use of a form. This is a comparatively unusual area of PostScript, and it's even more unusual to see it actually being used properly.
Because of its rare usage, we have had no impetus to implement preserving forms in the output PDF, and this is what results in the large PDF. The way the pattern is defined inside the form doesn't help either. You could try defining the pattern separately; at least that way pdfwrite might be able to detect the multiple pattern usage and only emit it once (the pattern contains an imagemask, so this may be worthwhile).
This construction:
GS C20 setpattern 384 151 32 1024 RF GR
GS C20 setpattern 384 1175 32 1024 RF GR
is inefficient: you keep re-instantiating the pattern, which is expensive. This:
GS C20 setpattern
384 151 32 1024 RF
384 1175 32 1024 RF
GR
is more efficient.
In any event, there's nothing you can do with pdfwrite to really reduce this problem.
'[...] when I print both versions of the PDF to file (i.e. postscript), the GhostScript generated PDF postscript output is about 5 times larger than that of the other (2.7 mil lines vs 675K) or 48 MB vs 9 MB.'
Which version of Ghostscript do you use? (Try gs -v or gswin32c.exe -v or gswin64c.exe -v to find out.)
How exactly do you 'print to file' the PDFs? (Which OS platform, which application, which kind of settings?)
Also, ps2pdf may not be your best option for the batch process. It's a small shell/batch script anyway, which internally calls a Ghostscript command.
Using Ghostscript directly will give you much more control over the result (though its commandline 'usability' is rather inconvenient and awkward -- that's why tools like ps2pdf are so popular...).
Lastly, without direct access to one of your PS input samples for testing (as well as the PDF generated by the proprietary converter) it will not be easy to come up with good suggestions.