Extract Images and Words with coordinates and sizes from PDF - image

I've read a lot about PDF extraction and libraries (such as iText), but I haven't found a way to extract images and text (with coordinates) from a PDF.
The task is to scan a PDF containing a product catalog and extract each image. An image code is printed next to each image, along with a list of product codes for the products shown in the image.
I know that there is no way to extract structured info from a PDF like this, but with the coordinates of all image and text objects I could write code to identify linked text by its distance from the image. Then I could split the text using a RegExp and work out what is a product code, what is an image code, etc.
Could you recommend a good and working solution for the task?
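The distance-based linking described in the question could look roughly like the sketch below. It assumes the bounding boxes have already been obtained from some extractor; the `Box` type, the `max_gap` threshold, and both code patterns (`IMG-001`, `AB1234`) are hypothetical and only serve to illustrate the idea.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical layout)."""
    x0: float
    y0: float
    x1: float
    y1: float


def gap(a: Box, b: Box) -> float:
    """Shortest distance between two boxes; 0 if they touch or overlap."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return math.hypot(dx, dy)


def link_text_to_images(images, words, max_gap=50.0):
    """Attach each word to its closest image; drop words farther than max_gap."""
    linked = {name: [] for name in images}
    for text, box in words:
        nearest = min(images, key=lambda name: gap(images[name], box))
        if gap(images[nearest], box) <= max_gap:
            linked[nearest].append(text)
    return linked


# Assumed code formats, invented for this example only.
IMAGE_CODE = re.compile(r"^IMG-\d+$")
PRODUCT_CODE = re.compile(r"^[A-Z]{2}\d{4}$")


def classify(text):
    """Label a piece of text as an image code, a product code, or other."""
    if IMAGE_CODE.match(text):
        return "image_code"
    if PRODUCT_CODE.match(text):
        return "product_code"
    return "other"


images = {"img1": Box(0, 0, 100, 100), "img2": Box(300, 0, 400, 100)}
words = [
    ("IMG-001", Box(0, 105, 40, 115)),    # just below img1
    ("AB1234", Box(110, 10, 150, 20)),    # just right of img1
    ("CD5678", Box(300, 105, 340, 115)),  # just below img2
    ("Page 3", Box(180, 700, 220, 710)),  # footer, far from both images
]
linked = link_text_to_images(images, words)
```

Measuring the gap between box edges rather than between box centers keeps a long caption from looking "far away" just because it is wide. The threshold would need tuning against the real catalog layout.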

Use XPDF (http://www.foolabs.com/xpdf/)
It can extract all the characters in the PDF with co-ordinates (pdftotext -bbox [sourcefile] [outputfile]) and also all the images and SVGs in the PDF.
It's open source (GPLv2) and supports a lot of additional extraction functionalities as well.
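As a rough sketch of how the `-bbox` output might be consumed: the snippet below assumes the tool writes XHTML in which each word is a `<word>` element with `xMin`, `yMin`, `xMax` and `yMax` attributes inside a `<page>` element. That is the layout I would expect from builds that support the flag, but it is an assumption and may differ between versions. The `SAMPLE` string is hand-written for illustration, not real tool output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed output layout.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page width="595.000000" height="842.000000">
    <word xMin="56.800000" yMin="57.208000" xMax="118.533000" yMax="72.364000">IMG-001</word>
    <word xMin="125.000000" yMin="57.208000" xMax="170.250000" yMax="72.364000">AB1234</word>
  </page>
</doc>
</body>
</html>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}word'."""
    return tag.rsplit("}", 1)[-1]


def parse_bbox(xhtml):
    """Return one dict per word with its text, page number and bounding box."""
    words = []
    page_no = 0
    # iter() walks the tree in document order, so each <page> is seen
    # before the <word> elements it contains.
    for element in ET.fromstring(xhtml).iter():
        name = local_name(element.tag)
        if name == "page":
            page_no += 1
        elif name == "word":
            words.append({
                "text": element.text or "",
                "page": page_no,
                "x_min": float(element.get("xMin")),
                "y_min": float(element.get("yMin")),
                "x_max": float(element.get("xMax")),
                "y_max": float(element.get("yMax")),
            })
    return words


words = parse_bbox(SAMPLE)
```

In practice the string would be read from the output file named on the command line instead of `SAMPLE`.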

Several Java libraries can do this. Have you looked at JPedal or PDFBox?

If a commercial library is an option for you, you could try Amyuni PDF Creator .Net or Amyuni PDF Creator ActiveX. You could use the method IacDocument.GetObjectsInRectangle to retrieve all the "graphic objects" of your interest, then use the ObjectType attribute to separate images from text. The library already provides an algorithm for putting close text together. From the documentation:
IacDocument.GetObjectsInRectangle Method
The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.
Usual disclaimer applies.

Related

Why are images in pdf sometimes sliced into multiple images?

I noticed that images are sometimes sliced up in PDFs.
Steps:
Insert an image with a high resolution (3000x1800) into a .docx
Use the "Microsoft Print to PDF" option of Word to convert it to PDF
Extract all images with pdfimages or pymupdf
Result:
The image is sliced horizontally into three images
Questions:
What exactly happens in the transition from .docx to PDF (or, in general, in the process of producing a PDF) that makes the converter slice it up into three images instead of one?
Do the individual XObjects of the sliced images contain information which says that these three images originally belonged to one image?
How do I know how the images are sliced (horizontally / vertically)? And what if two images were originally inserted into the .docx file and both of them are sliced: can you tell whether slice x belongs to original image y or z?
So, as you have found out: because the code which generates the PDF chose to do so.
The technical reasons may vary: it could be that historically some printers had only so much memory and needed limited-size images when printing, and someone writing the PDF export code in Microsoft Office at some point chose to apply this limit.
Anyway, technically, as put in the comments, an image in a PDF file could be composed of any number of smaller images collated together.
Now, the second part, and your actual question: to know whether images in a PDF file belong together as a single original image, one would need a custom extractor tool that checks the geometry of all images in the document and finds which images sit flush against others with no margin or gap between them. That would not be hard to do for well-behaved files (we can't know whether files generated by MS Office are: there are ways to obfuscate image positioning by specifying it indirectly). The metadata in the image parts may or may not contain information that would allow one to recompose the original image; it is up to the code generating the PDF to include this metadata or not. But the geometry can't lie in this case: if the final document presents a single image visually, it is possible to detect that when fetching the images.
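A minimal sketch of that geometry check, assuming each image's placement rectangle on the page is already available as an `(x0, y0, x1, y1)` tuple from whatever extractor is in use. The function names, the tolerance and the sample rectangles are all made up for illustration. Two rectangles count as slices of one picture when they span the same columns and meet along a horizontal edge, or span the same rows and meet along a vertical edge.

```python
def touching(a, b, tol=0.5):
    """True if rectangles a and b share one full edge (within tol)."""
    same_cols = abs(a[0] - b[0]) <= tol and abs(a[2] - b[2]) <= tol
    same_rows = abs(a[1] - b[1]) <= tol and abs(a[3] - b[3]) <= tol
    # Stacked: one rectangle's bottom edge is the other's top edge.
    stacked = same_cols and (abs(a[3] - b[1]) <= tol or abs(b[3] - a[1]) <= tol)
    # Side by side: one rectangle's right edge is the other's left edge.
    side_by_side = same_rows and (abs(a[2] - b[0]) <= tol or abs(b[2] - a[0]) <= tol)
    return stacked or side_by_side


def group_slices(rects, tol=0.5):
    """Group rectangle indices into sets of slices that abut each other."""
    parent = list(range(len(rects)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if touching(rects[i], rects[j], tol):
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(rects)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical page: one picture cut into three strips, another cut into two.
rects = [
    (72, 500, 372, 560),
    (72, 560, 372, 620),
    (72, 620, 372, 680),
    (72, 100, 372, 280),
    (72, 280, 372, 460),
]
groups = group_slices(rects)
```

The edge that a group shares also answers the direction question: strips that meet along horizontal edges were sliced horizontally, and strips that meet along vertical edges were sliced vertically. Two separate pictures placed flush against each other with identical widths would be merged by this check, so it is a heuristic, not a proof.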

Extract Images and their Labels from PDFs

I am facing a problem extracting images together with their labels (not metadata!) from PDF files. By label I mean the text that is assigned to an image to describe it, regardless of its position, underneath or above. I've tried a lot of well-known parsers, such as iText, Tika, PDFBox and pdf2html, but I found no way to do this. Any suggestions?

How to detect multiple barcodes/QR codes in a TIFF image and return their value + position?

I'm currently trying to achieve this:
I have a very large TIFF image, which contains scanned documents. The image contains invoices with barcodes/QR codes, followed by multiple other scanned documents related to the invoice which preceded them. This can be repeated multiple times ( the TIFF image may look like [invoice] + [documents] + [invoice] + [documents] ... )
I need a program (it doesn't really matter in which language, but I'd prefer Java, JavaScript, PHP, C++ or Python) that takes said TIFF image, scans all the barcodes and returns their values and their positions in the image (either which page each one is on or its absolute position, but the page is preferable; I know for certain that there won't be multiple barcodes on one page). The goal is to split this TIFF image into multiple PDF files, each containing only one invoice and all of the documents that belong to that invoice.
I have the latter part done already. I intend to use ImageMagick to split the TIFF file into multiple files (tested, works). I have also tried multiple barcode scanning methods, but ran into critical problems with each one. And that's the point of my question:
Are any of my assumptions wrong? Is there a better way/library/SW that you know about that could work?
Libraries/SW I tried so far:
ZXing port for PHP: Can't work with TIFF files
ZXing github
Quagga for JavaScript: Can't work with TIFF files either.
Quagga github
ZBar code reader: The best looking one by far. I managed to scan multiple QR codes in one TIFF image using CMD (Windows), but didn't find a way to get their positions. Also found out that C++ and Python versions exist, but didn't get to try them out just yet.
Thanks for any ideas/corrections.
The best one I have heard of (that is subjective, of course) is Barcode Rendering Framework.
I'm not sure whether it can detect multiple barcodes on a page, but it can detect many different types of barcodes.
It is also open source.
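For the splitting step described in the question, one possible approach is to decode each TIFF page separately, so that the page index itself serves as the barcode's position. The sketch below covers only the grouping logic and assumes some decoder has already produced a mapping from page index to barcode value. The function name and the sample values are hypothetical.

```python
def split_by_barcodes(page_count, barcode_pages):
    """Group consecutive pages so that each group starts at a barcode page.

    barcode_pages maps a zero-based page index to the decoded barcode value.
    Pages before the first barcode go into a group whose invoice is None.
    """
    groups = []
    for page in range(page_count):
        if page in barcode_pages:
            # A barcode marks the first page of a new invoice.
            groups.append({"invoice": barcode_pages[page], "pages": [page]})
        elif groups:
            # No barcode: the page belongs to the invoice that precedes it.
            groups[-1]["pages"].append(page)
        else:
            # Leading pages with no invoice in front of them.
            groups.append({"invoice": None, "pages": [page]})
    return groups


# Hypothetical scan: 7 pages, barcodes found on pages 0 and 3.
groups = split_by_barcodes(7, {0: "INV-1", 3: "INV-2"})
```

Each group's page list can then be handed to the tool that writes the per-invoice PDF.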

AsciiDoc: How can I place graphical hints on an image

I am using AsciiDoc with Asciidoctor Gradle Plugin to generate technical documentation as PDF.
When I used M$ Word, I could easily place forms on an image, for example
colored rectangles,
boxes with numbers or
even links to sections within the document,
to better point out interesting areas within the image.
Example:
On the example image I have placed two rectangles, and each one contains a link (starting with the word «Dialogbereich») leading to another section within the document.
Is it possible to achieve something like this (directly) in AsciiDoc?
Note that the answers to asciidoc: how to add callouts asciidoc to image do not apply here as the Asciidoctor PDF backend does not use DocBook to generate the PDF.
I know I could create a layered image in GIMP to at least place the rectangles. However, that wouldn't help me with the links.

Mosaicking PDF documents?

I have (or, rather, will soon have) a number of maps created in ArcGIS 10.0 and exported as PDF documents. The maps all show contiguous areas, being rather like the pages in a map book. There will also be a smaller-scale map depicting the entire area (let's call it the "study area"), but with less detail, rather like that page of a map atlas that shows what page depicts what area.
I wonder if there is any way to create thumbnails of the larger-scale maps and mosaic them so as to create an index map of the study area. A user would then be able to see, for a particular point on the smaller-scale map, which of the larger-scale maps depicts that part of the study area. (And perhaps see that map by clicking on the larger map?) Does anyone have any ideas on how I can implement this? I would prefer exporting the maps in PDF format, but if I can't do all of the above with PDF, then any other format to which a map can be exported from ArcGIS, such as JPG or TIF, will work.
You should be able to create a PDF which does this.
What you need to do is render each page to a small image.
Then collect each of these images and add them as a mosaic to an index page.
Then put links from each small image back to the original PDF page.
If the hierarchy was more than one level deep you could repeat the process.
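The layout part of these steps can be sketched independently of any PDF library. The function below is hypothetical: it only computes where each thumbnail would sit on the index page and which original page its link should point to, assuming a plain grid in page order and page units in points. Rendering the thumbnails and writing the link annotations would be up to the PDF component. Arranging the tiles geographically instead would need the map extents, which are not covered here.

```python
import math


def mosaic_layout(n, page_w, page_h, thumb_aspect, margin=36.0, gutter=8.0):
    """Return one placement rectangle and link target per thumbnail.

    thumb_aspect is width / height of the source pages.
    Rectangles are (x0, y0, x1, y1) with the origin at the top-left corner.
    """
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    cell_w = (page_w - 2 * margin - (cols - 1) * gutter) / cols
    cell_h = (page_h - 2 * margin - (rows - 1) * gutter) / rows
    # Fit the thumbnail inside its cell without distorting it.
    width = min(cell_w, cell_h * thumb_aspect)
    height = width / thumb_aspect

    placements = []
    for i in range(n):
        row, col = divmod(i, cols)
        x = margin + col * (cell_w + gutter)
        y = margin + row * (cell_h + gutter)
        placements.append({
            "target_page": i + 1,  # assumed 1-based number of the original page
            "rect": (x, y, x + width, y + height),
        })
    return placements


# Hypothetical case: four square map pages on a US Letter index page.
placements = mosaic_layout(4, 612, 792, thumb_aspect=1.0)
```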
You need a PDF component to do this. What you want in terms of features is something which does decent PDF rendering. It's an easy thing to do badly and a difficult thing to do well.
ABCpdf .NET does good quality rendering so it's what I would suggest, but then I would because I work on it. :-)
