PostScript to PDF conversion / slow print issue [Ghostscript]

I have several large PDF reports (>500 pages) with grid lines and a background shading overlay that I converted from PostScript using Ghostscript's ps2pdf in a batch process. The PDFs that get created look perfect in Adobe Reader.
However, when I print the PDF from Adobe Reader I get only about 4-5 ppm from our Dell laser printer, with long (10+ second) pauses between pages. The same report PDF generated by another, proprietary process (not Ghostscript) yields a fast 25+ ppm on the same printer.
The PDF file sizes are nearly the same at around 1.5 MB each, but when I print both versions of the PDF to file (i.e. to PostScript), the Ghostscript-generated PDF's PostScript output is about 5 times larger than the other's (2.7 million lines vs. 675K, or 48 MB vs. 9 MB). Looking at the Ghostscript output, I see that the background pattern for the grid lines/shading (referenced by a "/PatternType1" tag) is defined many thousands of times throughout the file, whereas it is defined only once in the other PDF's output. I believe this constant re-defining of the background pattern is what is bogging down the printer.
Is there a switch/setting to force Ghostscript to define a pattern/image only once? I've tried the -r and -dPDFSETTINGS=/printer switches with no relief.

Patterns (and indeed images) and many other constructs should only be emitted once; you don't need to do anything to make this happen.
Forms, however, do not get reused, and it's possible that this is the source of your actual problem. As Kurt Pfiefle says above, it's not possible to tell without seeing a file which causes the problem.
You could raise a bug report at http://bugs.ghostscript.com, which will give you the opportunity to attach a file. If you do, please do NOT attach a 500+ page file; it would be appreciated if you could take the time to create a smaller file that shows the same kind of size inflation.
Without seeing the PostScript file I can't make any suggestions at all.

I've looked at the source PostScript now, and as suspected the problem is indeed the use of a form. This is a comparatively unusual area of PostScript, and it's even more unusual to see it actually being used properly.
Because this is so rarely used, there has been no impetus to implement preservation of forms in the output PDF, and this is what results in the large PDF. The way the pattern is defined inside the form doesn't help either. You could try defining the pattern separately; at least that way pdfwrite might be able to detect the multiple uses of the pattern and emit it only once (the pattern contains an imagemask, so this may be worthwhile).
This construction:
GS C20 setpattern 384 151 32 1024 RF GR
GS C20 setpattern 384 1175 32 1024 RF GR
is inefficient: you keep re-instantiating the pattern, which is expensive. This:
GS C20 setpattern
384 151 32 1024 RF
384 1175 32 1024 RF
GR
is more efficient.
In any event, there's nothing you can do with pdfwrite to really reduce this problem.

'[...] when I print both versions of the PDF to file (i.e. postscript), the GhostScript generated PDF postscript output is about 5 times larger than that of the other (2.7 mil lines vs 675K) or 48 MB vs 9 MB.'
Which version of Ghostscript do you use? (Try gs -v or gswin32c.exe -v or gswin64c.exe -v to find out.)
How exactly do you 'print to file' the PDFs? (Which OS platform, which application, which kind of settings?)
Also, ps2pdf may not be your best option for the batch process. It's a small shell/batch script anyway, which internally calls a Ghostscript command.
Using Ghostscript directly will give you much more control over the result (though its commandline 'usability' is rather inconvenient and awkward -- that's why tools like ps2pdf are so popular...).
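For example, a direct invocation roughly equivalent to what ps2pdf does for you might look like this (a minimal sketch; the input/output names are placeholders, and you would add whatever pdfwrite parameters your reports need):
gs -dSAFER -dNOPAUSE -dBATCH \
   -sDEVICE=pdfwrite \
   -sOutputFile=report.pdf \
   report.ps
From there you can experiment with individual -d/-s parameters instead of relying on whatever defaults the ps2pdf wrapper passes along.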
Lastly, without direct access to one of your PS input samples for testing (as well as the PDF generated by the proprietary converter) it will not be easy to come up with good suggestions.

Related

Batch Convert Sony raw ".ARW" image files to .jpg with raw image settings on the command line

I am looking to convert 15 million 12.8 MB Sony .ARW files to .jpg.
I have figured out how to do it using sips on the command line, but what I need is to make adjustments to the raw image settings: Contrast, Highlights, Blacks, Saturation, Vibrance, and most importantly Dehaze. I would be applying the same settings to every single photo.
It seems like ImageMagick should work if I can figure out how to incorporate Dehaze, but I can't seem to get ImageMagick to work.
I have done benchmark testing comparing Lightroom Classic / Photoshop / Bridge / RAW Power / and a few other programs. RAW Power is fastest by far (on an M1 Mac mini with 16 GB RAM), but RAW Power doesn't allow me to process multiple folders at once.
I do a lot of scripting / actions with Photoshop, but in this case Photoshop is by far the slowest option. I believe this is because it opens each photo.
That's 200TB of input images, without even allowing any storage space for output images. It's also 173 solid days of 24 hr/day processing, assuming you can do 1 image per second - which I doubt.
You may want to speak to Fred Weinhaus #fmw42 about his Retinex script (search for "hazy" on that page), which does a rather wonderful job of haze removal. Your project sounds distinctly commercial.
© Fred Weinhaus - Fred's ImageMagick scripts
If/when you get a script that does what you want, I would suggest using GNU Parallel to get decent performance. I would also suggest considering porting, or having someone port, Fred's algorithm to C++ or Python to run with OpenCV rather than ImageMagick.
So, say you have a 24-core MacPro, and a bash script called ProcessOne that takes the name of a Sony ARW image as parameter, you could run:
find . -iname \*.arw -print0 | parallel --progress -0 ProcessOne {}
and that will recurse through the current directory, finding all Sony ARW files and passing them to GNU Parallel, which will then keep all 24 cores busy until the whole lot are done. You can specify fewer or more jobs in parallel with, say, parallel -j 8 ...
Note 1: You could also list the names of additional servers in your network and it will spread the load across them too. GNU Parallel is capable of transferring the images to remote servers along with the jobs, but I'd have to question whether it makes sense to do that for this task - you'd probably want to put a subset of the images on each server with its own local disk I/O and run the servers independently yourself rather than distributing from a single point globally.
Note 2: You will want your disks well configured to handle multiple, parallel I/O streams.
Note 3: If you do write a script to process an image, write it so that it accepts multiple filenames as parameters, then you can run parallel -X and it will pass as many filenames as your sysctl parameter kern.argmax allows. That way you won't need a whole bash or OpenCV C/C++ process per image.
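To make Note 3 concrete, here is a rough sketch of what such a ProcessOne script could look like. It assumes ImageMagick 7 (the magick command) with a working raw delegate for ARW input; the adjustment values are placeholders, and dehaze is not a built-in ImageMagick operator, so that step would still need something like Fred's Retinex script.
#!/bin/bash
# ProcessOne -- hypothetical sketch: convert each Sony ARW file given as an
# argument to JPEG, applying the same global contrast/saturation tweaks to all.
for f in "$@"; do
    magick "$f" -brightness-contrast 0x10 -modulate 100,115,100 \
        -quality 90 "${f%.*}.jpg"
done
Because it accepts multiple filenames, you can drive it with parallel -X exactly as described above:
find . -iname \*.arw -print0 | parallel --progress -0 -X ./ProcessOne {}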

Why do unix compress and Go's compress/lzw produce different files, each unreadable by the other's decoder?

I compressed a file in a terminal with compress file.txt and got (as expected) file.txt.Z
When I pass that file to ioutil.ReadFile in Go,
buf0, err := ioutil.ReadFile("file.txt.Z")
I get the error (the line above is 116):
finder_test.go:116: lzw: invalid code
I found that Go would accept the file if I compressed it using the compress/lzw package; I just used code from a website that does that. I only modified the line
outputFile, err := os.Create("file.txt.lzw")
I changed the .lzw to .Z, then used the resulting file.txt.Z in the Go code at the top, and it worked fine, no error.
Note: file.txt is 16.0 kB, unix-compressed file.txt.Z is 7.8 kB, and go-compressed file.txt.Z is 8.2 kB
Now, I was trying to understand why this happened. So, I tried to run
uncompress.real file.txt.Z
and it did not work. I got
file.txt.Z: not in compressed format
I need to use a compressor (preferably unix compress) to compress files using LZW compression, then use the same compressed files with two different algorithms, one written in C and the other in Go, because I intend to compare the performance of the two. The C program will only accept files compressed with unix compress, and the Go program will only accept files compressed with Go's compress/lzw.
Can someone explain why that happened? Why are the two .Z files not equivalent? How can I overcome this?
Note: I am working on Ubuntu installed in VirtualBox on a Mac.
A .Z file does not contain only LZW-compressed data; there is also a 3-byte header that the Go LZW code does not generate, because it is meant to compress data, not to produce a .Z file.
Presuming you only want to test the performance of your own (or some third-party) algorithms, and not the compression algorithms themselves, you could write a shell script which calls the compress command on the required files/directories and then call that script from your C/Go program. This is one way to overcome the problem, though it leaves open the other parts of your question about the correct way to use the compression libraries.
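As a minimal sketch (assuming a Unix-like environment with compress on the PATH; the names are placeholders), such a wrapper could be as simple as:
#!/bin/sh
# compress_all.sh -- hypothetical wrapper: run Unix compress over every file
# passed as an argument; -c streams to stdout so the original file is kept
# and a .Z copy is written alongside it.
for f in "$@"; do
    compress -c "$f" > "$f.Z"
done
Both the C and the Go program can then shell out to the same script, so the compression step is identical on both sides and drops out of the performance comparison.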
There is an ancient bug known as "alignment bit groups" behind this question. I've described it in the Wikipedia section "Special output format"; please read it.
I've implemented a new library, lzws. It has all possible options:
--without-magic-header (-w) - disable magic header
--max-code-bit-length (-b) - set max code bit length (9-16)
--raw (-r) - disable block mode
--msb (-m) - enable most significant bit
--unaligned-bit-groups (-u) - enable unaligned bit groups
You can use any options in all possible combinations; all combinations have been tested. I am sure that you can find a combination suitable for the Go lzw implementation.
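For example, an invocation might look like the line below. I'm assuming lzws works as a stdin/stdout filter the way compress -c does, and I'm not claiming this particular flag combination is the one Go's compress/lzw expects; check the lzws README and experiment with the options listed above.
lzws --without-magic-header --msb < file.txt > file.txt.lzw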
You can use the ruby-lzws binding if you prefer to work in Ruby.

How can an executable be this small in file size?

I've been generating payloads with Metasploit and experimenting with the different templates, and one of the templates you can use for your payload is exe-small. The payload I've been generating is windows/meterpreter/reverse_tcp; using the normal exe template it has a file size of around 72 KB, but exe-small outputs a payload of about 2.4 KB. Why is this? And how could I apply this to my own programming?
The smallest possible PE file is just 97 bytes - and it does nothing (it just returns).
The smallest runnable executable today is 133 bytes, because Windows requires kernel32 to be loaded; executing a PE file with no imports is not possible.
At that size it can already download a payload from the Internet by specifying a UNC path in the import table.
To achieve such a small executable, you have to:
implement it in assembler, mainly to get rid of the C runtime
decrease the file alignment, which is 1024 by default
remove the DOS stub that prints the message "This program cannot be run in DOS mode"
merge some of the PE parts into the MZ header
remove the data directory
The full description is available in a larger research blog post called TinyPE.
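If you want to inspect these fields on your own binaries, one quick way (assuming the Visual Studio tools are installed; payload.exe is a placeholder name) is:
dumpbin /headers payload.exe | findstr /i "alignment"
which prints the section and file alignment values, so you can see how much padding the default alignment costs a tiny program.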
For EXEs this small, most of the space is typically taken up by the icon. The icon is usually included in several sizes and color depths, which you could strip out if you don't mind having an "old, rusty" icon, or no icon at all.
There is also about 4 KB of space used when you sign the EXE.
As an example of a small EXE, see never10 by grc. There is a details page which highlights the above points:
https://www.grc.com/never10/details.htm
in the last paragraph:
A final note: I'm a bit annoyed that “Never10” is as large as it is at 85 kbyte. The digital signature increases the application's size by 4k, but the high-resolution and high-color icons Microsoft now requires takes up 56k! So without all that annoying overhead, the app would be a respectable 25k. And, yes, of course I wrote it in assembly language.
Disclaimer: I am not affiliated with grc in any way.
There is little need for an executable to be big, except when it contains what I call code spam: code not actually critical to the functionality of the program/exe. This is valid for other files too. Look at a manually written HTML page compared to one produced by FrontPage. That's spam code.
I remember my good old DOS files that were all a few KB in size and performed practically any task needed in the OS. One of my .exes (actually a .com) was only 20 bytes in size.
Just think of it this way: just as a large majority of the files contained in a Windows OS can be removed and the OS will still function perfectly, it's the same with .exe files: large parts of the code are either useless, serve a purpose other than the program's actual objective, or are added intentionally (see below).
The peak of this aberration is the code added nowadays to the .exe files of some games that use advanced copy protection, which can make the files as large as dozens of MB. The code actually needed to run the game is often under 10% of the total.
A file size of 72 KB, as in your example, can be quite sufficient to do practically anything on a Windows OS.
To apply this to your programming, i.e. to make very small .exes, keep things simple. Don't add unnecessary code just for the looks of it, or because you think you might use that part of the program at some point.

Slow ghostscript conversion and slow printing with large ps files

I have a service which produces PDF files, and I have PostScript-compatible printers. To print the PDF files, I use Ghostscript to convert them to PS and copy them to a shared (Windows) print queue. Most of the PDF files contain just a few pages (<10) and don't cause any trouble.
From time to time I have to print large files (100+, 500+, 5000+) pages and there I observe the following:
converting to ps is fast for the first couple of pages, then slows down. The further the progress, the longer the time for a single page.
after conversion, copying to the print queue works without problems
when copying is finished and it comes to sending the document to the printer, I observe more or less the same phenomenon: the further the progress, the slower the transfer.
Here is how I convert pdf to ps:
"C:\Program Files\gs\gs9.07\bin\gswin64c.exe" \
-dSAFER -dNOPAUSE -DBATCH \
-sOutputFile=D:\temp\testGS\test.ps \
-sDEVICE=ps2write \
D:\temp\testGS\test.pdf
After this conversion I simply copy it to the print queue
copy /B test.ps \\printserever\myPSQueue
What possibilities do I have to print large files this way?
My first idea was to do the following:
"C:\Program Files\gs\gs9.07\bin\gswin64c.exe" \
-dSAFER -dNOPAUSE -DBATCH \
-sOutputFile=D:\temp\testGS\test%05d.ps \
-sDEVICE=ps2write \
D:\temp\testGS\test.pdf
Working with single pages speeds up the conversion (it doesn't slow down after each page), and printing is also fast when I copy every page to the printer as its own PS file. But there is one problem I will encounter sooner or later: the single PS files become separate print jobs. Even when they are sorted in the correct order, if someone else starts a print job on the same printer in between, the printouts will all get mixed up.
The other idea was using gsprint, which works considerably faster, but gsprint needs the printer to be installed locally, which is not manageable in my environment with 300+ printers at different locations.
Can anyone tell me exactly what is happening? Is this a bad way to print? Does anyone have a suggestion for how to solve the task of printing such documents in such an environment?
Without seeing an example PDF file it's difficult to say much about why it should print slowly. However, the most likely explanation is that the PDF is being rendered to an image, probably because it contains transparency.
This will result in a large image, created at the default resolution of the device (720 dpi), which is almost certainly higher than required for your printer(s). This means that a large amount of time is spent transmitting extra data to the printer, which the PostScript interpreter in the printer then has to discard.
Using gsprint renders the file to the resolution of the device; assuming this is less than 720 dpi, the resulting PostScript will be smaller, therefore requiring less time to transmit, less time to decompress on the printer and less time spent throwing away extra data.
One reason the speed decreases is the way ps2write works: it maintains much of the final content in temporary files and stitches the main file back together from those files. It also maintains a cross-reference table which grows with the number of objects in the file. Unless you need the output to be a single continuous file, you could create a number of print files by using the -dFirstPage and -dLastPage options so that each run only creates a subset of the final printout; this might improve the performance.
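For example (a sketch only; the page ranges and paths reuse your command above), splitting a 500-page job into two chunks might look like:
"C:\Program Files\gs\gs9.07\bin\gswin64c.exe" \
-dSAFER -dNOPAUSE -dBATCH \
-dFirstPage=1 -dLastPage=250 \
-sOutputFile=D:\temp\testGS\test_part1.ps \
-sDEVICE=ps2write \
D:\temp\testGS\test.pdf
with a second run using -dFirstPage=251 -dLastPage=500. Each chunk is still a single print job, so pages within a chunk cannot be interleaved with someone else's output.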
Note that ps2write does not render the incoming file to an image, while gsprint definitely does; the PostScript emerging from gsprint simply defines a big bitmap. This doesn't maintain colours (everything goes to RGB) and doesn't maintain vector objects as vectors, so it doesn't scale well. However... if you want to use gsprint to print to a remote printer, you can set up a 'virtual printer' using RedMon. You can have RedMon send the output from a port to a totally different printer, even a remote one. So you use gsprint to print to (e.g.) 'local instance of MyPrinter' on RedMon1: and have the RedMon port set up to capture the print stream to disk and then send the PostScript file to 'MyPrinter on another PC'. Though I'd guess that's probably not going to be any faster.
My suggestion would be to set the resolution of ps2write lower; -r300 should be enough for any printer, and lower may be possible. The resolution will only affect rendered output; everything else remains as vectors and so scales nicely. In general, rendered images will print perfectly well at half the resolution of the printer.
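In your conversion command that just means adding the resolution switch (treat 300 as a starting point to verify against your printers):
"C:\Program Files\gs\gs9.07\bin\gswin64c.exe" \
-dSAFER -dNOPAUSE -dBATCH \
-r300 \
-sOutputFile=D:\temp\testGS\test.ps \
-sDEVICE=ps2write \
D:\temp\testGS\test.pdf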
I can't say why the printer becomes so slow with the Ghostscript-generated PostScript, but you might want to give other converters a try, like pdftops from the Poppler utils (I found a Windows download here, as you seem to be using Windows).
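A minimal pdftops invocation for the file from the question would be something like the line below (check which PostScript level your printers prefer; -level2 and -level3 are both available):
pdftops -level2 D:\temp\testGS\test.pdf D:\temp\testGS\test.ps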

How to convert multiple, different-sized PostScript files to a single PDF?

I'm using a command similar to this:
gswin32c.exe -dNOPAUSE -dBATCH -q -dSAFER -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -sOutputFile="path/output.pdf" <PSfiles>
This gives me a single PDF document with each PS document represented as a page. However, the page sizes do not translate well. The original PS files are all different sizes, and each resulting PDF page is cut off to the same size, which looks like landscape A4.
When I convert a single PS file with the exact same command, the page size is preserved. So it seems like, since all the PS files are being sent to the same PDF, they must all have the same page size and I lose content. Is there any way to preserve the document sizes while still using a single command?
Update: I was originally using GS 8.63, but I downloaded 9.06 and have the same issue.
Additionally, I've narrowed the problem down. It seems like there is one specific PS file (call it problemFile.ps) that causes the problem, as I can run the command successfully as long as I exclude problemFile.ps. And it only causes a problem if it is the last file included on the command line. I can't post the entire file, but are there any potential problem areas I should look at?
Update 2: Okay, I was wrong in saying there is one specific problem file. It appears that the page size of the last file included on the command line sets the maximum page size for all the resulting pages.
As long as each PostScript file (or indeed each page) actually requests a different media size, the resulting PDF file will honour the requests. I know this at least used to work; I've tested it.
However there are some things in your command line which you might want to reconsider:
1) When investigating problems with GS, don't use -q; this will prevent Ghostscript from telling you potentially useful things.
2) DON'T use -dPDFSETTINGS unless you have read the relevant documentation and understand the implications of each parameter setting.
3) You may want to turn off AutoRotatePages, or at least set it to /PageByPage
My guess is that your PostScript files don't request a media size and therefore use the default media. Of course I can't tell without seeing an example.
NB you also don't say what version of Ghostscript you are using.
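For point 3 above, AutoRotatePages can be set directly on the command line. A sketch based on your original command, with -q and -dPDFSETTINGS dropped as suggested; treat it as one reasonable combination rather than a prescription:
gswin32c.exe -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -dAutoRotatePages=/PageByPage -sOutputFile="path/output.pdf" <PSfiles>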
