Why does combining PDF pages with CGContextDrawPDFPage create very large output files? - cocoa

I ran into this trying to throw together a simple Automator script to combine several one-page PDF files. I had 88 files to combine, each just about exactly 300KB, so I expected the final product to be about 30MB; the resulting PDF file, using the Combine PDFs Automator action, was 300+MB.
Poking around, the Automator action uses a Python script, with Foundation bindings, to create the new PDF document with the CoreGraphics PDF APIs. Nothing seems out of place. Basically, it's doing this (simplified, but these are the high points):
writeContext = CGPDFContextCreateWithURL(outURL, None, None)
for url in inURLs:
    doc = CGPDFDocumentCreateWithURL(url)
    page = CGPDFDocumentGetPage(doc, 1)
    mediaBox = CGPDFPageGetBoxRect(page, kCGPDFMediaBox)
    CGContextBeginPage(writeContext, mediaBox)
    CGContextDrawPDFPage(writeContext, page)
    CGContextEndPage(writeContext)
CGPDFContextClose(writeContext)
I can't imagine that CGContextDrawPDFPage, when drawing to a PDF context, would do anything but copy the PDF data for that page (with some window-dressing).
Even when "combining" just one PDF, the output is 2.8MB, compared to the 300KB original one-page PDF.
The resulting PDFs look exactly the same page-by-page as the original pages: text is selectable in the same places, graphics look identical, the pages are exactly the same size.
Any ideas?

Do the input PDFs contain the same set of fonts, or different sets? Maybe if the originals don't contain embedded fonts, but the output does, that could account for some of the growth.
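One rough way to test the embedded-font theory is to count the font-program keys that are visible in the raw bytes of the original and of the combined file. The sketch below is only a heuristic: the function name is made up for illustration, and it will miss font descriptors stored inside compressed object streams, so a count of zero does not prove that no fonts are embedded.

```python
import re

# /FontFile, /FontFile2 and /FontFile3 are the font-descriptor keys that
# point at an embedded font program.
_FONT_FILE_KEY = re.compile(rb"/FontFile[23]?(?![A-Za-z0-9])")


def count_font_file_keys(pdf_bytes: bytes) -> int:
    """Count embedded-font keys visible in the uncompressed part of a PDF.

    Heuristic only: keys inside compressed object streams are not seen.
    """
    return len(_FONT_FILE_KEY.findall(pdf_bytes))
```

Reading both files with open(path, "rb").read() and comparing the two counts gives a first hint; a much higher count in the output would suggest that fonts are being embedded again for every page.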

Related

Why are images in pdf sometimes sliced into multiple images?

Noticed that images sometimes are sliced up in PDFs.
Steps:
insert an image with a high resolution (3000x1800) into a .docx
use "Microsoft Print to PDF" option of Word to convert to PDF
extract all images with pdfimages or pymupdf
Result:
Image is sliced horizontally into three images
Questions:
What exactly happens in the transition from .docx to PDF (or in general in the conversion to PDF) that makes the converter slice the image into three images instead of one?
Do the individual XObjects of the sliced images contain information saying that these three images originally belonged to one image?
How do I know how the images are sliced (horizontally / vertically)? And what if two images were originally inserted into the .docx file and both of them are sliced: can you tell whether slice x belongs to original image y or z?
So, as you have found out: because the code that generates the PDF chose to do so.
The technical reasons may vary. It could be that, historically, some printers had only so much memory and needed size-limited images when printing, and at some point whoever wrote the PDF export code in Microsoft Office chose to apply this limit.
Anyway, technically, as noted in the comments, an image in a PDF file can be composed of any number of smaller images tiled together.
Now, the second part, and your actual question: to know whether images in a PDF file belong together in a single original image, one would need a custom extractor tool that checks the geometry of all images in the document and finds which images sit flush against others, with no margin between them. That would not be hard to do for well-behaved files (and we can't know whether files generated by MS Office are well behaved: image positioning can be obfuscated by specifying it indirectly). The metadata in the image parts may or may not contain information that would allow one to recompose the original image; it is up to the code generating the PDF whether to include it. But the geometry can't lie in this case: if the final document presents a single image visually, it is possible to detect that when fetching the images.
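The geometry check described above can be sketched in a few lines. This is a minimal illustration, not a finished extractor: it assumes you already have one bounding box per image on a page (for example from PyMuPDF's Page.get_image_rects, if your version provides it), and the function names and the tolerance value are made up for the example. Boxes that share a full edge are merged into one group with a small union-find.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def _touch(a: Box, b: Box, tol: float) -> bool:
    """True if a and b share a full edge (stacked vertically or side by side)."""
    same_cols = abs(a[0] - b[0]) <= tol and abs(a[2] - b[2]) <= tol
    same_rows = abs(a[1] - b[1]) <= tol and abs(a[3] - b[3]) <= tol
    stacked = same_cols and (abs(a[3] - b[1]) <= tol or abs(b[3] - a[1]) <= tol)
    side_by_side = same_rows and (abs(a[2] - b[0]) <= tol or abs(b[2] - a[0]) <= tol)
    return stacked or side_by_side


def group_slices(boxes: List[Box], tol: float = 0.5) -> List[List[int]]:
    """Group box indices so that each group is one visually contiguous image."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if _touch(boxes[i], boxes[j], tol):
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With three slices stacked on top of each other and one unrelated image elsewhere on the page, the slices end up in one group and the unrelated image in its own. Whether a group was cut horizontally or vertically follows from which coordinates its boxes share.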

AppleScript: renaming PDF with content of PDF

I am trying to do exactly what is described in the following thread:
AppleScript/Automator: renaming PDF with extracted text content of this PDF
So I am using Chino22's version, and there are two issues with it:
First, instead of the contents of the PDF, theFileContentsText gets some metadata stuff.
Second, although the script runs to the end, I get the following error for the last step:
error "The variable thisFile is not defined." number -2753 from "thisFile"
So, how do I get the text contents instead, and how do I define thisFile to the current pdf that is being processed in the loop?
Thanks in advance!
I would not expect the linked script to work.
Except for document metadata, extracting text content from PDF is notoriously difficult and unreliable, and not a road you want to go down if you can possibly avoid it. Adobe’s PDF file format is designed for printing, not for data processing. PDF files contain blocks of PostScript-like page drawing instructions, typically compressed, and while it’s possible for PDFs also to include the original plain text for accessibility use, most PDF generators do not do this, so the only way to get the original text is by reconstructing it from those low-level drawing instructions, which is not a trivial job.
AppleScript’s read command only reads that raw file data; it does not parse it into drawing instructions, never mind translating those drawing instructions back into plain text. Change a PDF file’s extension to .txt and open it in a plain text editor, and you’ll see what I mean. Nasty.
If you need to work with the PDF’s original content (text, images, whatever), your best solution is to get those files before they were converted into a PDF.
If you must extract content from a PDF file, use an existing tool that knows how to do it.
For instance, if you’re lucky enough to have PDFs that contain XFDF (XML form) or accessibility data, there are 3rd-party apps and libraries to extract that content in readable form. I can’t think offhand of any that are AppleScriptable (Adobe Acrobat has only minimal AS support) so you’ll probably need to find one you can run from command line (do shell script in AS).
Or, if the PDFs have a consistent visual structure, a 3rd-party library such as Python’s PDFMiner (which I’ve used in the past) can identify blocks of characters by position and convert those back into strings with varying degrees of reliability (it has to convert font glyphs back into Unicode characters, guess at which characters are close enough to constitute a word, and where to insert space and return characters between those words). You’ll have to write some Python code to extract the bits you want, so look for tutorials to get started (or pay someone to write it for you).
But again, if you can possibly avoid having to extract text from PDF, you should. You will save yourself a lot of trouble.
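If you do go down the PDFMiner road, a minimal sketch might look like the following. It assumes the pdfminer.six package is installed; the two function names, the 60-character limit and the "first non-empty line" rule are illustrative choices, not something taken from the linked script. The file-name cleanup is plain Python and does not depend on PDFMiner.

```python
import re


def extract_first_page_text(pdf_path: str) -> str:
    """Return the text PDFMiner reconstructs from the first page."""
    # Assumption: pdfminer.six is installed (pip install pdfminer.six).
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path, maxpages=1)


def filename_from_text(text: str, max_len: int = 60) -> str:
    """Build a safe file name from the first non-empty line of extracted text."""
    for line in text.splitlines():
        # Keep letters, digits, underscores, hyphens and spaces only.
        cleaned = re.sub(r"[^\w\- ]+", "", line).strip()
        cleaned = re.sub(r"\s+", " ", cleaned)
        if cleaned:
            return cleaned[:max_len].rstrip() + ".pdf"
    return "untitled.pdf"
```

A script built from these two functions could be called from AppleScript with do shell script, passing the POSIX path of each PDF in the loop and renaming the file to the returned name.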

What is the best way to convert PDF pairs into single pages?

I need to take an existing PDF (created with Prawn), and combine pairs after page 1 (the cover) into single pages. I would also like to add a vertical line in the center of the joined pages. The pages are to be printed in books, and the goal is to make single PDF pages that are similar to the side by side view in Acrobat. I know I can convert them to images, do what I need to with ImageMagick, then put them back into a PDF format, but I am trying to minimize the number of conversions so I can save as much quality as possible.
I also realize I can do this from the start with Prawn, but I am trying to avoid that as it would require a very large change to our application.
It is possible to do this with Ghostscript and the pdfwrite device, but it's by no means simple. You need to write some PostScript to do the job.
You would need to add BeginPage and EndPage procedures. The BeginPage procedure would need to check the current page number (which you would need to track yourself). If it's page 1, process normally. If it's an even page, throw away the current PageSize and replace it with one that covers a pair of pages. Process the even page. Do not transmit the content.
If the page is odd (and not 1), translate the origin so that it's offset to the right by the width of the page. Process the odd page. Use moveto, lineto and stroke to draw the required line between the two pages. Transmit the page.
This assumes that all the pages are the same size and orientation, or at least that the sizes of each page are known in advance. It would be possible to retrieve those programmatically as well, but that is more complex.
It's definitely non-trivial, but if you rummage through my answers in the PostScript tags and look for anything with the word 'imposition' you'll probably find program outlines to do the job.
I did a quick look and here's an answer I wrote some time back. It uses a different approach from the one outlined above: it copies some of the guts of the PDF interpreter and repurposes them. It does a chunk of what you want though.
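The page bookkeeping described above (cover alone, then even/odd pairs on a double-width sheet with a divider) can be written down independently of PostScript. The following is only a planning sketch in Python, with made-up names; it does not draw anything. It also assumes that all pages share one size and that a trailing unpaired even page is placed on the left half of a double-width sheet without a divider, which the answer does not specify.

```python
from typing import Dict, List


def plan_spreads(n_pages: int, width: float, height: float) -> List[Dict]:
    """Describe each output sheet: its size, page placements and divider line.

    Each placement is (source_page_number, x_offset). Page numbers are 1-based.
    """
    sheets = []
    if n_pages >= 1:
        # Page 1 (the cover) is processed normally on a single-width sheet.
        sheets.append({"size": (width, height), "pages": [(1, 0.0)], "divider_x": None})
    for left in range(2, n_pages + 1, 2):
        placements = [(left, 0.0)]  # even page: origin unchanged
        divider = None
        if left + 1 <= n_pages:
            # Odd page: origin translated right by one page width.
            placements.append((left + 1, width))
            # Vertical line from (width, 0) to (width, height).
            divider = width
        sheets.append({"size": (2 * width, height), "pages": placements, "divider_x": divider})
    return sheets
```

For a five-page document of US Letter pages this yields three sheets: the cover, pages 2 and 3, and pages 4 and 5. The same plan could drive whichever tool does the actual drawing.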

How to detect multiple barcodes/QR codes in a TIFF image and return their value + position?

I'm currently trying to achieve this:
I have a very large TIFF image, which contains scanned documents. The image contains invoices with barcodes/QR codes, followed by multiple other scanned documents related to the invoice which preceded them. This can be repeated multiple times ( the TIFF image may look like [invoice] + [documents] + [invoice] + [documents] ... )
I need a program (it doesn't really matter in which language, but I'd prefer Java, JavaScript, PHP, C++ or Python) that takes said TIFF image, scans all the barcodes and returns their values and their positions in the image (either the page each one is on or its absolute position; the page is preferable, and I know for certain that there won't be multiple barcodes on one page). The goal is to split this TIFF image into multiple PDF files, each containing only one invoice and all of the documents that belong to the invoice.
I have the latter part done already. I intend to use ImageMagick to split the TIFF file into multiple files (tested, works). I have also tried multiple barcode scanning methods, but met critical problems at every one. And that's the point of my question:
Are any of my assumptions wrong? Is there a better way/library/SW that you know about that could work?
Libraries/SW I tried so far:
ZXing port for PHP: Can't work with TIFF files
ZXing github
Quagga for JavaScript: Can't work with TIFF files either.
Quagga github
ZBar code reader: The best looking one by far. I managed to scan multiple QR codes in one TIFF image using CMD (Windows), but didn't find a way to get their positions. Also found out that C++ and Python versions exist, but didn't get to try them out just yet.
Thanks for any ideas/corrections.
The best one I have heard of (that is subjective, of course) is Barcode Rendering Framework.
I'm not sure if it can detect multiple barcodes on a page, but it can detect many different types of barcodes.
It's also open source.
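Since the question already got ZBar to read the codes, here is a sketch of the Python route rather than of the framework named in this answer. It assumes the Pillow and pyzbar packages (pyzbar wraps ZBar) are installed, that the TIFF is a multi-page file, and that there is at most one code per page, as the question states. The function names are made up for the example. The second function is plain Python: it turns "which pages carry a barcode" into page ranges for splitting, and it drops any pages that come before the first barcode.

```python
from typing import Dict, List, Tuple


def scan_tiff_pages(tiff_path: str) -> Dict[int, str]:
    """Return {page_index: barcode_value} for pages on which a code was decoded."""
    # Assumptions: Pillow and pyzbar are installed.
    from PIL import Image, ImageSequence
    from pyzbar.pyzbar import decode

    found = {}
    with Image.open(tiff_path) as tiff:
        for index, frame in enumerate(ImageSequence.Iterator(tiff)):
            results = decode(frame.convert("L"))  # grayscale copy of the page
            if results:
                # results[0].rect also holds the position within the page.
                found[index] = results[0].data.decode("utf-8", errors="replace")
    return found


def split_into_bundles(n_pages: int, barcode_pages: Dict[int, str]) -> List[Tuple[str, List[int]]]:
    """Group page indices so each bundle starts at a page that carries a barcode."""
    bundles: List[Tuple[str, List[int]]] = []
    for page in range(n_pages):
        if page in barcode_pages:
            bundles.append((barcode_pages[page], [page]))
        elif bundles:
            bundles[-1][1].append(page)
    return bundles
```

Each bundle's page list can then be handed to ImageMagick (or any other tool) to write one PDF per invoice.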

Extract Images and Words with coordinates and sizes from PDF

I've read a lot about PDF extraction and libraries (such as iText), but I just haven't found a solution to extract images and text (with coordinates) from a PDF.
The task is to scan PDF with catalog of products and extract each image. There is an image code printed next to each image and also a list of product codes for products that are shown on the image.
I know that there is no way to extract structured info from a PDF like this but with coordinates of all image and text objects I could write code to identify linked text by its distance from the image. Then I could split text using a RegExp and find out what is a product code, what is an image code etc.
Could you recommend a good and working solution for the task?
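Use Xpdf (http://www.foolabs.com/xpdf/)
It can extract all the characters in the PDF with coordinates (pdftotext -bbox [sourcefile] [outputfile]; note that the -bbox option comes from the Poppler fork of the Xpdf tools, so an Xpdf build may not have it) and also all the images and SVGs in the PDF.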
It's open source (GPLv2) and supports a lot of additional extraction functionalities as well.
Several Java libraries can do this. Have you looked at JPedal or PdfBox?
If a commercial library is an option for you, you could try Amyuni PDF Creator .Net or Amyuni PDF Creator ActiveX. You could use the method IacDocument.GetObjectsInRectangle to retrieve all the "graphic objects" of your interest, then use the ObjectType attribute to separate images from text. The library already provides an algorithm for putting close text together. From the documentation:
IacDocument.GetObjectsInRectangle Method
The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.
Usual disclaimer applies.
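Whichever extractor supplies the coordinates, the distance-based linking the question describes can be sketched in plain Python. Everything below is illustrative: the function names are made up, the product-code pattern ("P-" followed by digits) is a hypothetical stand-in for the real catalog's format, and distance between box centers is a crude measure that a real catalog layout may need to refine.

```python
import math
import re
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

# Hypothetical pattern: product codes look like "P-12345".
PRODUCT_CODE = re.compile(r"^P-\d+$")


def _center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def link_words_to_images(words: List[Tuple[str, Box]], images: List[Box]) -> Dict[int, List[str]]:
    """Assign each word to the image whose center is closest to the word's center."""
    linked: Dict[int, List[str]] = {i: [] for i in range(len(images))}
    if not images:
        return linked
    for text, box in words:
        wx, wy = _center(box)

        def distance(i: int) -> float:
            ix, iy = _center(images[i])
            return math.hypot(wx - ix, wy - iy)

        linked[min(range(len(images)), key=distance)].append(text)
    return linked


def product_codes(texts: List[str]) -> List[str]:
    """Keep only the strings that match the (hypothetical) product-code pattern."""
    return [t for t in texts if PRODUCT_CODE.match(t)]
```

Calling link_words_to_images with the word boxes and image boxes of one page returns, per image index, the words nearest to it; product_codes then separates product codes from image codes and other text.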
