I tried to convert different PDF files (which contain only text) to PS. After which some of them are small and some are huge.
For example:
169 kb PDF => 409 kb PS
82 kb PDF => 5749 kb PS
I tried these ghostscript options:
-dASCII85EncodePages=false
-r300
-dBATCH
-dPDFFitPage
-dFIXEDMEDIA
-dNOTRANSPARENCY
-dNOINTERPOLATE
-dDEVICEWIDTHPOINTS=595
-dDEVICEHEIGHTPOINTS=841
-dcupsBitsPerColor=8
-dcupsColorOrder=0
-dcupsColorSpace=1
-dcupsCompression=1
-scupsPageSizeName=A4
-sDEVICE=ps2write
Probably a problem with the original PDF, but what exactly I can not understand. I can send the source files, just where?
The problem here is due to a limitation of the ps2write device, and a quirk of the way the files are created.
Both PDF files use CIDFonts, the ps2write device only implements basic level 2 PostScript output, and CIDFonts were not included in the original level 2 specification (they were added in a supplement).
This means the ps2write device cannot currently output the CIDFonts defined in the PDF file, it must convert them to 'something else'. What it does is render the glyph shapes to bitmaps, and construct type 3 bitmap fonts, which it then uses in the body of the PostScript program. This is, of course, less than ideal, as bitmap fonts are poorer quality than vector descriptions of the glyph shapes. Since you've fixed the rendering at 300 dpi, that reduces the quality of the glyph bitmaps, but they should be acceptable for printing on a desktop printer.
So, given that both files contain CIDFonts, why is one much larger ? This is the 'quirk' of the way the files are created. The file test1.pdf has the fonts it uses embedded in the PDF file, and they are embedded as subsets, that is each font only contains the shapes descriptions of the glyphs it uses. The other file uses Arial and Arial-Bold (IIRC) but does not embed either CIDFont.
Fonts can only address up to 255 glyphs, while CIDFonts have a more or less unlimited range. When ps2write creates type 3 fonts to represent the embedded CIDFonts, it can't create a Font which has the full addressable range that a CIDFont would, so if more than 255 glyphs are used from a given font, it has to create multiple type 3 fonts.
So we fill up the font as we go, until we reach 255 glyphs, then we start a new font. The problem is that if we then encounter a glyph that was used earlier, and therefore defined in a previous font, we can't simply switch fonts (this is a limitation of the way ps2write works). So we have to include that glyph in the font we are currently creating as well. This means we end up with two copies of the bitmap in the output; one for the first font that uses it and one for the second font.
And that's exactly what is happening here. Test1.pdf uses embedded subsets, so we essentially never overflow the type 3 fonts, and the output PostScript program contains ~300 bitmaps for the glyphs. Test2.pdf overflows the 255 limit several times and so the output PostScript program contains ~2120 bitmaps for the glyphs.
There is nothing you can do about that, unless you can control the production of the input PDF files, so the question then becomes 'is this a problem for you, and if so, why ?'
[EDIT]
Well I doubt the size has any real impact on the printing time. PostScript is streamed to the printer so it starts processing the data as soon as it is sent.
Of course converting PDF to PostScript first is a time consuming step. Assuming your printer(s) cannot print PDF directly, the most obvious answer would be to print each page to a separate PostScript file using the -sOutputFile=out%d.ps syntax.
Overall the performance would be reduced, because multiple files need to be opened and closed, the prolog will be written to every file, and there can be no reuse of resources between pages, so shared resources will have to be written to each file. This means the total size of al the pages written as separate files will be considerably larger than the total written as one largen file.
However the advantage is that you do not have to process all the pages from the PDF file before you send the first page to the printer. As soon as page 1 is complete, the file out1.ps will be closed and ready to send to the printer, so you can start sending page 1 while Ghostscript continues to produce pages 2 onwards.
Related
Noticed that images sometimes are sliced up in PDFs.
Steps:
insert an image with a high resoultion (3000x1800) into a .docx
use "Microsoft Print to PDF" option of Word to convert to PDF
extracting all images with pdfimages or pymupdf
Result:
Image is sliced horizontally into three images
Questions:
What exactly happens in the in the transition from .docx to pdf (or in generell in the process to pdf) that makes the converter slice it up into three images instead of one?
Do the individuell XObjects of the sliced images contain information which says that these three images belong to originally one?
How do I know how the images are sliced (horizontally / vertically) and what if originally there were two images inserted into the .docx file and both of them are sliced. Can you tell if slice x belongs to original image y or z?
So, as you have found out: because the code which generates the PDF choose to do so.
The technical reasons may be various - it could be that historically there were printers which would only have so much memory, and would need to get limiterd size-images when printing, and someone at some point when writing the PDF export code present in Microsoft Office choose to apply this limit.
Anyway, technically, as put in the comments, an image in a PDF file could be composed of unlimited smaller images collated together.
Now, the second part, and your actual question: to know whether images ibn a PDF file belong together in a single original image one would need a custom extractor tool to check the geometry of all images in the document and find out which images have no margins or boundaries with others - it would not be that hard to do for well behaved files (which we can't know if MS Office generated files are: there are ways to obfuscate image positioning by making it indirectly). The metadata in the image-parts may or may not contain information that would allow one to recompose the original image: it would be up to the code generating the PDF to include this metadata or not - but the geometry can't lie in this case: if the final document presents a single image visually, it is possible to detect that when fetching the images.
I am trying to do exactly what is described in the following thread:
AppleScript/Automator: renaming PDF with extracted text content of this PDF
So I am using the Chino22's version and there are two issues with it:
First, instead of the contents of the pdf, theFileContentsText gets some metadata stuff.
Second, althought the script runs to the end, I get the following error for the last step:
error "The variable thisFile is not defined." number -2753 from "thisFile"
So, how do I get the text contents instead, and how do I define thisFile to the current pdf that is being processed in the loop?
Thanks in advance!
I would not expect the linked script to work.
Except for document metadata, extracting text content from PDF is notoriously difficult and unreliable, and not a road you want to go down if you can possibly avoid it. Adobe’s PDF file format is designed for printing, not for data processing. PDF files contain blocks of Postscript-like page drawing instructions, typically compressed, and while it’s possible for PDFs also to include the original plain text for accessibility use, most PDF generators do not do this so the only way to get the original text is by reconstructing it from those low-level drawing instructions—not a trivial job.
AppleScript’s read command only reads that raw file data; it does not parse it into drawing instructions, never mind translating those drawing instructions back into plain text. Change a PDF file’s extension to .txt and open it in a plain text editor, and you’ll see what I mean. Nasty.
If you need to work with the PDF’s original content (text, images, whatever), your best solution is to get those files before they were converted into a PDF.
If you must extract content from a PDF file, use an existing tool that knows how to do it.
For instance, if you’re lucky enough to have PDFs that contain XFDF (XML form) or accessibility data, there are 3rd-party apps and libraries to extract that content in readable form. I can’t think offhand of any that are AppleScriptable (Adobe Acrobat has only minimal AS support) so you’ll probably need to find one you can run from command line (do shell script in AS).
Or, if the PDFs have a consistent visual structure, a 3rd-party library such as Python’s PDFMiner (which I’ve used in the past) can identify blocks of characters by position and convert those back into strings with varying degrees of reliability (it has to convert font glyphs back into Unicode characters, guess at which characters are close enough to constitute a word, and where to insert space and return characters between those words). You’ll have to write some Python code to extract the bits you want, so look for tutorials to get started (or pay someone to write it for you).
But again, if you can possibly avoid having to extract text from PDF, you should. You will save yourself a lot of trouble.
The Problem
I am loading the classic serife.fon file from Microsoft Windows using FreeType.
Here is how I set the size:
FT_Set_Pixel_Sizes(face, 0, fontHeight);
I use 0 for the fontWidth so that it will be auto-calculated based on the height.
How do I find the correct value for fontHeight such that the resulting font will be exactly 9 pixels tall?
Notes
Using trial and error, I know that the correct value is 32 - but I don't understand why.
I am not sure how relevant this is for bitmap fonts, but according to the docs:
pixel_size = point_size * resolution / 72
Substituting in the values:
point_size = 32
resolution = 96 (from FT_Get_WinFNT_Header)
gives:
pixel_size = 42.6666666
This is a long way from our target height of 9!
The docs do go on to say:
pixel_size computed in the above formula does not directly relate to the size of characters on the screen. It simply is the size of the EM square if it was to be displayed. Each font designer is free to place its glyphs as it pleases him within the square.
But again, I am not sure if this is relevant for bitmap fonts.
fon files are exe files with a fnt payload, where the fnt payload can be a vector or raster font. If this is a raster font (which is most likely) then the dfPixHeight value in the fnt header will tell you what size it's meant to be, which is exposed by FreeType2 as the pixel_height field of the FT_WinFNT_Header.
(And of course, note that using any size other than "the actual raster-size of the FNT" is going to lead to hilarious headaches because bitmap scaling is the kind of madness that's so bad, OpenType instead went with "just embed as many bitmaps as you need, at however many sizes you need, because that's the only way your bitmaps are going to look good")
The FNT-specific FT2 documentation can be found over on https://www.freetype.org/freetype2/docs/reference/ft2-winfnt_fonts.html but you may need to read it in conjunction with https://jeffpar.github.io/kbarchive/kb/065/Q65123 (or https://web.archive.org/web/20120215123301/http://support.microsoft.com/kb/65123) to find any further mappings that you might need between names/fields as defined in the FNT spec and FT2's naming conventions.
Is there an option for to me to ask Ghostscript to indent the Postscript it creates?
Everything starts at the beginning of a line and I find it difficult to follow.
Alternatively, I am using Emacs and ps-mode.
If anyone know how to indent code in this mode I would appreciate a tip (apologize because this may not be relevant to this StackExchange)
No, there is no option for indenting the output.
PostScript is pretty much regarded as a write-only language anyway, and the output of ps2write (which is what I assume you are using though you don't say) is particularly difficult since it fundamentally outputs PDF syntax with a PostScript program on the front to parse it into PostScript operations.
Why do you want to read it ?
[EDIT]
You can always edit your question, you don't need to post a new answer.
I'm afraid what you want to do isn't as simple as you might think.
It might be possible for this use case if the PDF files you receive are always created the same way, but there are significant problems.
The font you use as a substitute for the missing font must be encoded the same way. Say for example the font in the PDF file is encoded so that 0x41 is 'A', you need to make sure that the replacement font is also encoded so that 0x41 is an 'A'. So just the findfont, scalefont, setfont sequence is not always going to be sufficient, sometimes you will need to re-encode the font.
CIDFonts will be a major stumbling block. Firstly because ps2write simply doesn't emit CIDFonts at all. These were not part of level 2 PostScript. As a result all text in a CIDFont will be embedded as bitmaps. If your original file doesn't contain the CIDFont then you'll get the fallback CIDFont bitmapped.
Secondly CIDFonts can use multiple-byte character codes, of variable length. You can't simply replace a CIDFont with a Font, it just won't work.
The best solution, obviously, is to have the PDF files created with the fonts required embedded. This is best practice. If you can't get that, then I'd suggest that rather than trying to hand edit PostScript, you use the fontmap.GS and cidfmap files which Ghostscript uses to find font.
Ghostscript already has a load of code to do font substitution automatically, using both Fonts and CIDFonts as substitutes, and it does all the hard work of re-encoding the fonts or building CMaps as required. If you are on Windows much of this may already be done for you, when you install Ghostscript it will ask if you want to create font mappings. If you said yes then it will
Add the font substitutions you want to use in those files (they have comments explaining the layout) and then use the pdfwrite device to make a new PDF file. Set EmbedAllFonts to true (you may need to add a AlwayEmbed font array as well, listing the fonts specifically) and SubsetFonts to false.
That should create a new PDF file where the missing fonts have been replaced by your defined substitutes, those substitutes will have been embedded in the new PDF file and they have will not been subset (Acrobat will generally refuse to edit text in a subset font).
The switches I mentioned above are standard Adobe Distiller parameters, but they are documented for pdfwrite here. There's some documentation on adding fonts here and here and specifically for CIDFonts here.
Basically I'd suggest you define your substitutions and let Ghostscript do the work for you.
This is not an answer to the problem but rather an answer to KenS's question about "Why do you want to read it?"
I tried to put it in the comment box but it was too long.
I am a retired engineer with a strong programming background.
I would like to read and understand the postscript code for the reason shown below.
I play duplicate bridge as a hobby. I recieve a PDF file of what is know as a convention card (a single page document of bridge agreements).
Frequently I would like to edit these files.
When I open with Adobe Illustrator I have to spend a significant amount of time replacing fonts that are not on my system with fonts that I do have.
I can take the PDF and export it as a postscript file using Ghostscript.
I was going to write a little program to replace the embedded fonts with the fonts that I use to replace them.
I was going to leave the postscript file unaltered and insert things like
/HelveticaMonospacedPro-RG findfont
12 scalefont setfont
just above where the text is written.
I was planning on using the fonts that I have on my system (e.g., HelveticaMonospacedPro-RG).
I have a PostScript file with image and text. I want to print it to a laser printer so that the images print in a halftone round pattern.
I am trying to print it from the command line. I would prefer a PDF output first and then I'll print to the laser printer.
I have an HP Laserjet P2015 printer installed in Windows 10.
gswin64c.exe -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=output.pdf test.ps -c "<< /HalftoneType 1 /Frequency 37 /Angle 45 /SpotFunction {180 mul cos exch 180 mul cos add 2 div} >> sethalftone"
The PDF file is generated. However, the images does not appear to be in Round Halftone format. I see no change in the image.
This is the original Image
I want the printout to look like this: Required output
For some reason the output is the same as the original.
pdfwrite doesn't create monochrome output, so no, the output won't be halftoned. The whole point of the pdfwrite device is to maintain the output as close in quality as possible to the input.
There is no way to get monochrome output from pdfwrite unless the monochrome is in the input.
If you instead use a monochrome output device (e.g., tiffg3 or tiffg4) you should see a difference, unless the input file (which you haven't supplied) also sets a screen, in which case the last one encountered will take effect. That is, if the PostScript input file specifies a halftone screen, that's the one that will get used.
What you are asking for is what I might generally describe as a hard problem, probably much harder than you expect.
Your best bet is to send the PostScript directly to the printer, since it supports PostScript.
Problem 1; your PostScript file may already contain a halftone in it which overrides the one you want to use. You can usually prevent that by redefining the sethalftone and setscreen operators:
/sethalftone {pop} bind def
/setscreen {pop pop pop} bind def
You do that after you'[ve defined your own halftone, obviously. Or you can just edit the PostScript file and remove the halftone definition, its usually not that hard to find.
Problem 2 is that printer manufacturers sometimes 'tweak' the interpreter to always use their preferred halftone and prevent you overriding it in PostScript. The only way to find out if that's happening is to try a quick test; print a simple image with the default screen, then set a really coarse screen, say 10 lpi and print the same image. If there's no difference then you know that's not a viable option. If there is, then you can just set the screen you want.
Now, if the printer won't let you change the halftone then the only solution is to render the PostScript to an image using Ghostscript (which will let you change the halftone) and then send the resulting image to the printer. You will need to use a 1 bpp device, something like tiffg3 or tiffg4 should work fine.
The problem here is that you need to ensure that the image fits on the media of the printer, because if it doesn't then the likelihood is that the printer will slightly scale down the image so that it does fit, thereby squashing the halftone cells.
Even more subtly, the printable area of the media in the printer may not be the same as the size of the media. Paper handling can mean that there are some areas of the paper that can't be printed on, and the variability in the feeding of the paper can mean that the printer will refuse to print right to the edges anyway. Because if it did, and there was, say, a 1 mm deviation in the paper handling, then 1 mm of the edge would 'fall off' and there would be a 1 mm white gap at the other edge...