Extracting images from PDF with page and screen coordinate information

I want to extract images from PDFs while retaining knowledge of their context (page number and coordinates on the page). (Some tools, e.g. pdfminer, only emit image files with non-semantic names such as Img0.bmp.) I can do this with PDFBox (Java), but I'd ideally like a Python tool.
My current (arbitrary) design is to create filenames of the form:
image_<page>_<serial_in_page>_<x1>_<x2>__<y1>_<y2>.png
Currently pdfplumber exposes coordinates, but with a PDFStream and encoding information rather than an image. Code to convert the stream to a *.png would solve the problem.
(NOTE: the pdfplumber approach of rendering to the screen and capturing the known rectangle (which I use) is not a solution, as the image is often degraded and frequently overwritten with text.)
(NOTE: I have had problems with several Python tools (pdfminer.six, PyMuPDF) extracting images, as they make the background black, which obscures black text, etc. PDFBox (Java) doesn't have this problem.)
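For reference, here is a minimal sketch of the coordinate/naming side of this scheme using pdfplumber's page.images (illustrative only; converting each image's PDFStream to a *.png remains the missing piece):
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:  # placeholder path
    for page in pdf.pages:
        for serial, img in enumerate(page.images):
            # pdfplumber reports image placement in page coordinates
            x1, x2 = round(img["x0"]), round(img["x1"])
            y1, y2 = round(img["top"]), round(img["bottom"])
            print(f"image_{page.page_number}_{serial}_{x1}_{x2}__{y1}_{y2}.png")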

Python tools are likely to have similar problems to any other tools, even those that need only a single command line to manipulate images or extract their details.
Extracting the images with one command line shows the layout of all the compressed images in the file. Here the individual object references have been converted into normal TIFF or JPG (other tools may use PBM and PGM, especially for OCR, but the result is generally similar). The greyscale alpha softmask (B&W) transparency components are not necessarily tied directly to a page or an image other than by internal references, and usually appear like negatives.
What you may note is that objects that were most likely inserted as one PNG are broken in two when embedded in the PDF, and only their scaled placement is defined. Note that a raw PNG (whatever its original resolution) retains its number of dots, but its scale when inserted into the PDF can be totally different horizontally and vertically; thus the only meaningful data is W x H in pixel values.
It is not trivial to overlay the mask on the RGB component once they are extracted separately, though doing so can allow for colour changes if desired.
So PDFBox is one of the simpler/better tools for blending to a suitable output (as you have discovered), but for Python it is generally the higher-end libraries that can identify the placement of the two images and combine them into a suitable alpha output such as a new PNG.
For many suggestions see Extract images from PDF without resampling, in python?.
The related part of your question was knowing where those components are placed on each page, since one image (and its alpha mask) could be placed multiple times, such as a heading logo on each page. Again, it is easy with a single command line to see which pages are referenced by a group of images, but to see which image is placed where requires analysing each page's resources, again requiring a library interrogation of page contents, thus best done via powerhouse libraries such as iText or others like PDFTron for Python.
For a related command in PyMuPDF see https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_image_rects
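As a concrete illustration, here is a minimal sketch (not a definitive implementation) using PyMuPDF's documented Page.get_images, Page.get_image_rects and Document.extract_image calls to write each placed image under your naming scheme; note that extract_image returns the raw embedded stream without applying any soft mask, so masked images may still need recombining afterwards:
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder path
for page_number, page in enumerate(doc, start=1):
    serial = 0
    for img in page.get_images(full=True):
        xref = img[0]
        # one xref may be placed several times on a page; one rect per placement
        for rect in page.get_image_rects(xref):
            info = doc.extract_image(xref)  # raw embedded bytes plus extension
            x1, y1, x2, y2 = (round(v) for v in rect)
            name = f"image_{page_number}_{serial}_{x1}_{x2}__{y1}_{y2}.{info['ext']}"
            with open(name, "wb") as fh:
                fh.write(info["image"])
            serial += 1
doc.close()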

I don't have a solution in Python but here is a small script using Ruby and HexaPDF:
require 'hexapdf'

class ImageBorderProcessor < HexaPDF::Content::Processor
  def initialize(page, index)
    super()
    @page = page
    @index = index
    @count = 0
  end

  def paint_xobject(name)
    super
    xobject = resources.xobject(name)
    return unless xobject[:Subtype] == :Image

    w, h = xobject.width, xobject.height
    llx, lly = graphics_state.ctm.evaluate(0, 0)
    lrx, lry = graphics_state.ctm.evaluate(1, 0)
    urx, ury = graphics_state.ctm.evaluate(1, 1)
    ulx, uly = graphics_state.ctm.evaluate(0, 1)
    # If the image is rotated, you will need all 4 coordinates, not just the 2
    filename = "image_#{@index}_#{@count}_#{llx}_#{urx}_#{lly}_#{ury}"
    xobject.write(filename) rescue puts "Can't write image #{@index}-#{@count}"
    @count += 1
  end
end

doc = HexaPDF::Document.open(ARGV[0])
doc.pages.each_with_index do |page, index|
  processor = ImageBorderProcessor.new(page, index)
  page.process_contents(processor)
end
It will iterate over all pages of the input document provided on the command line and create files using your file naming scheme. Since HexaPDF doesn't currently support writing all types of PDF images, you might get some error messages for those that can't be written.
If a supported image has an associated image mask defined, it will automatically be used to create a transparent image.
The script will output all images found, even repeated ones. This could easily be changed so that just a soft link is created for repeated images.

Related

How to set MRI orientation after niftiwrite (Matlab)?

I have combined two images in MATLAB (a 3D image and a binary mask). I imported both using niftiread, and after combining them I wrote the result using niftiwrite. However, the orientation seems to be wrong for the newly created image. Has anyone encountered this before?
I tried permute, rot, and flip, but none of them solved the problem.
The issue is that niftiread only loads the image itself and not the associated metadata that specifies the image orientation (among other things). If you then use niftiwrite without specifying this information, you get default header values.
If the original images on your disk are "3D.nii" and "mask.nii", you would want to do something like:
threeD_info = niftiinfo('3D.nii'); % load metadata for 3D image
threeD_data = niftiread(threeD_info); % load 3D image by specifying info
mask_data = niftiread('mask.nii'); % load mask image by specifying filename
output_data = threeD_data .* mask_data; % multiply images (or other operation of your choice)
niftiwrite(output_data,'3Dmasked.nii',threeD_info); % write output image to 3Dmasked.nii including metadata
Note: Depending on what type of "combination" you are performing, you might need to update some of the fields in threeD_info accordingly, such as the datatype.

Why are images in pdf sometimes sliced into multiple images?

I noticed that images are sometimes sliced up in PDFs.
Steps:
insert an image with a high resolution (3000x1800) into a .docx
use "Microsoft Print to PDF" option of Word to convert to PDF
extract all images with pdfimages or PyMuPDF
Result:
Image is sliced horizontally into three images
Questions:
What exactly happens in the transition from .docx to PDF (or in general in the conversion to PDF) that makes the converter slice the image into three images instead of one?
Do the individual XObjects of the sliced images contain information indicating that these three images originally belonged to one?
How do I know how the images are sliced (horizontally/vertically)? And what if originally there were two images inserted into the .docx file and both of them are sliced: can you tell whether slice x belongs to original image y or z?
So, as you have found out: because the code which generates the PDF chose to do so.
The technical reasons may vary: it could be that historically there were printers with only so much memory, which needed limited-size images when printing, and at some point whoever wrote the PDF export code in Microsoft Office chose to apply this limit.
Anyway, technically, as put in the comments, an image in a PDF file can be composed of any number of smaller images collated together.
Now, the second part, and your actual question: to know whether images in a PDF file belong together in a single original image, one would need a custom extractor tool to check the geometry of all images in the document and find out which images have no margins or boundaries with others. It would not be that hard to do for well-behaved files (and we can't know whether MS Office generated files are: there are ways to obfuscate image positioning by doing it indirectly). The metadata in the image parts may or may not contain information that would allow one to recompose the original image: it is up to the code generating the PDF to include this metadata or not. But the geometry can't lie in this case: if the final document presents a single image visually, it is possible to detect that when fetching the images.
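To make the geometric check concrete, here is a rough sketch (an assumption, not part of the original answer's tooling) using PyMuPDF's get_image_rects to flag images on a page whose rectangles line up edge to edge, which is how horizontal slices of one original picture usually sit:
from itertools import combinations
import fitz  # PyMuPDF

def touching(a, b, tol=1.0):
    # same left/right edges and one rect sitting directly on top of the other
    aligned = abs(a.x0 - b.x0) < tol and abs(a.x1 - b.x1) < tol
    stacked = abs(a.y1 - b.y0) < tol or abs(b.y1 - a.y0) < tol
    return aligned and stacked

doc = fitz.open("sliced.pdf")  # placeholder path
for page in doc:
    rects = []
    for img in page.get_images(full=True):
        xref = img[0]
        for rect in page.get_image_rects(xref):
            rects.append((xref, rect))
    for (xa, ra), (xb, rb) in combinations(rects, 2):
        if xa != xb and touching(ra, rb):
            print(f"page {page.number + 1}: xref {xa} and xref {xb} look like slices of one picture")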

How to detect multiple barcodes/QR codes in a TIFF image and return their value + position?

I'm currently trying to achieve this:
I have a very large TIFF image, which contains scanned documents. The image contains invoices with barcodes/QR codes, followed by multiple other scanned documents related to the invoice which preceded them. This can be repeated multiple times ( the TIFF image may look like [invoice] + [documents] + [invoice] + [documents] ... )
I need a program (doesn't really matter in which language, but I'd prefer Java, JavaScript, PHP, C++ or Python) that takes said TIFF image, scans all the barcodes and returns their values and their position in the image (either which page it is on or its absolute position, but the page is preferable; I know for certain that there won't be multiple barcodes on one page). The goal is to split this TIFF image into multiple PDF files, each containing only one invoice and all of the documents that belong to that invoice.
I have the latter part done already. I intend to use ImageMagick to split the TIFF file into multiple files (tested, works). I have also tried multiple barcode scanning methods, but ran into critical problems with every one. And that's the point of my question:
Is any of my presumptions false? Is there a better way/library/SW that you know about that could work?
Libraries/SW I tried so far:
ZXing port for PHP: Can't work with TIFF files
ZXing github
Quagga for JavaScript: Can't work with TIFF files either.
Quagga github
ZBar code reader: The best looking one by far. I managed to scan multiple QR codes in one TIFF image using CMD (Windows), but didn't find a way to get their positions. Also found out that C++ and Python versions exist, but didn't get to try them out just yet.
Thanks for any ideas/corrections.
The best one I have heard of (that is subjective, of course) is the Barcode Rendering Framework.
I'm not sure if it can detect multiple barcodes on a page, but it can detect many different types of barcodes.
And it's also open source.
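For the ZBar route mentioned in the question, here is a minimal sketch (assuming the pyzbar binding and Pillow are installed; not part of the answer above) that reports each code's value, its bounding box, and the TIFF page it sits on:
from PIL import Image, ImageSequence
from pyzbar.pyzbar import decode

tiff = Image.open("scans.tiff")  # placeholder filename
for page_number, frame in enumerate(ImageSequence.Iterator(tiff), start=1):
    # decode() accepts a PIL image; convert each TIFF frame to greyscale first
    for code in decode(frame.convert("L")):
        print(f"page {page_number}: value={code.data.decode()} type={code.type} rect={code.rect}")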

Any CLI tool to perform 3d texture mapping on the fly

I'm currently looking for a way to create a 'configurator' for an upholsterer, similar to http://digitaldraping.com/configurator/furniture-sofa/?Cushions_Plain-Cream.png,Sofa_Stripe-Orange.png - you select your fabrics and they are 'drawn' on the sofa automatically.
Unfortunately, all the sites I've looked at seem to use pre-rendered transparent PNGs that are overlaid on each other to build up the full picture. The problem here is that we've figured out we'd require over 120,000 different images to cover all models, fabrics, etc.!
I've looked at a few 3D texture tools such as http://www.arahne.si/products/arah-drape.html, hoping that one of them would have a CLI option where you give it a pre-created wireframe and a fabric to overlay, and it generates the required image on the fly, but so far everything seems to require real-time use of the GUI.
So, is there a CLI tool that would do what I'm after, or can anyone suggest a way to manipulate the GUI automatically? (from a tech point of view, I'm comfortable with C, Bash, Python or PHP as a solution!)
Thanks!
ArahDrape 2.2 can now work from a command line without any GUI interface. You can also call ArahDrape as a C library. In this way, it can be used in a web server to create texture mapped images on the fly. The command line options are explained below.
ArahDrape 2.2j command line version, ©2015 Arahne
usage:
adCommand -o /tmp/outputImage.png -tN /home/user/texture.png [-hidemodel] [-divide 2] [-filterPNG] [-compressPNG 2] [-m /home/user/model.png] -owner name -activation 174b3cfb49e9 /home/user/project.drape
Input and output images can have png, .tif or .jpg extensions
-o output_image_file
-tN texture_image_file [N goes from 0 to 199]
-hidemodel will render all areas not in region as white
-divide N [N goes from 2 to 5] divide resulting image pixel size
-filterPNG if you do not filter it, rendering is faster
-compressPNG N [N goes from 0 to 9] lower number saves faster, but bigger files
-m model_image_file use this if you want to replace model image from the project; must have same pixel size
-owner owner_name pass the given owner name
-activation activation_code pass the given activation code
last parameter should be ArahDrape project file
All files should be entered with full path.
If you need spaces in filenames, use quotes "" around the filename.
If you provide only Owner name, without activation code, program returns registration code.
ArahDrape supports batch export.
Open an ArahDrape project, click on the texture you wish to replace, put all your textures in a directory, and select from the menu
Textures > Browse textures; as you click a texture to load it, the program will save the draped picture. If you have thousands of images, use the keyboard shortcut = and the program will automatically do them all.
Alpha channel transparency is supported in loading model images or textures, and saving the draped images, as long as you use PNG or TIFF.
Please check this video to see how ArahDrape works in batch mode.
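If the goal is to drive this from one of the languages mentioned in the question, here is a minimal sketch (assuming the adCommand binary documented above is installed and licensed; all paths and the activation details are placeholders taken from the usage line) calling the CLI from Python with subprocess:
import subprocess

subprocess.run(
    [
        "adCommand",
        "-o", "/tmp/outputImage.png",     # rendered output image
        "-t0", "/home/user/fabric.png",   # texture for slot 0 (-tN, N = 0..199)
        "-compressPNG", "2",              # smaller files, slightly slower save
        "-owner", "name",
        "-activation", "174b3cfb49e9",
        "/home/user/project.drape",       # ArahDrape project file goes last
    ],
    check=True,  # raise if the render fails
)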
We (http://digitaldraping.com/) can do just what you are asking. We have two options: creating images, or rendering a meshed image on the fly. Just get in touch if you still need this solution.

Embedding matlab plot in pdf for printing: Sizes

I'm currently creating my figures in MATLAB to embed them via LaTeX into a PDF for later printing. I save the figures via the script export_fig. Now I wonder which is the best way to go:
Which size of the MATLAB figure window to choose
Which -m option to take for the script? It will change the resolution and the size of the image...
I'm wondering about this with regard to the following two points:
When choosing a bigger figure size, more tick marks are shown and the single point markers are more visible
When using a small figure with a big -m option, I still get only a few tick marks
When I generate an image which is quite huge (e.g. resolution 300 and still 2000x2000 px) and then embed it into the document: does this look ugly? Will it be embedded with nice scaling, or is it the same ugliness as uploading a 1000x1000 px image to a homepage and embedding it via the width and height attributes in HTML, where the browser displays it badly because it doesn't do a real resize, so it looks unsharp and ugly?
Thanks in advance!
The MATLAB plots are internally described as vector graphics, and PDF files are also described using vector graphics. Rendering the plot to a raster format is a bad idea, because you end up having to choose a resolution and you end up with bigger files.
Just save the plot to EPS format, which can be directly embedded into a PDF file using latex. I usually save my MATLAB plots for publication using:
saveas(gcf, 'plot.eps', 'epsc');
and embed them directly into my latex file using:
\includegraphics[width=0.7\linewidth]{plot.eps}
Then, you only need to choose the proportion of the line the image is to take (in this case, 70%).
Edit: IrfanView and others (XnView) don't display EPS very well. You can open them in Adobe Illustrator to get a better preview of what it looks like. I always insert my plots this way and they always look exactly the same in the PDF as in MATLAB.
One bonus you also get with EPS is that you can actually specify a font size so that the text is readable even when you resize the image in the document.
As for the number of ticks, you can look at the axes properties in the MATLAB documentation. In particular, the XTick and YTick properties are very useful for manually controlling how many ticks appear, no matter what the window resolution is.
Edit (again): If you render the image to a raster format (such as PNG), it is preferable to choose the exact same resolution as the one used in the document. Rendering a large image (by using a big window size) and making it small in the PDF will yield bad results mainly because the size of the text will scale directly with the size of the image. Rendering a small image will obviously make for a very bad effect because of stretching.
That is why you should use a vector image format. However, the default MATLAB settings for figures produce some of the same problems as raster images: text size is not specified as a font size and the number of ticks varies with the window size.
To produce optimal plots in the final render, follow the given steps:
Set the figure's font size to a decent setting (e.g. 11pt)
Render the plot
Decide on number of ticks to get a good effect and set the ticks manually
Render the image to color EPS
In MATLAB code, this should look somewhat like the following:
function [] = nice_figure ( render )
%
    % invisible figure, good for batch renders.
    f = figure('Visible', 'Off');
    % make plots look nice in output PDF.
    set(f, ...
        'DefaultAxesFontSize', 11, ...
        'DefaultAxesLineWidth', 0.7, ...
        'DefaultLineLineWidth', 0.8, ...
        'DefaultPatchLineWidth', 0.7);
    % actual plot to render.
    a = axes('Parent', f);
    % show whatever it is we need to show.
    render(a);
    % save file.
    saveas(f, 'plot.eps', 'epsc');
    % collect garbage.
    close(f);
end
Then, you can draw some fancy plot using:
function [] = some_line_plot ( a )
%
    % render data.
    x = -3 : 0.001 : +3;
    y = expm1(x) - x - x.^2;
    plot(a, x, y, 'g:');
    title('f(x)=e^x-1-x-x^2');
    xlabel('x');
    ylabel('f(x)');
    % force use of 'n' ticks.
    n = 5;
    xlimit = get(a, 'XLim');
    ylimit = get(a, 'YLim');
    xticks = linspace(xlimit(1), xlimit(2), n);
    yticks = linspace(ylimit(1), ylimit(2), n);
    set(a, 'XTick', xticks);
    set(a, 'YTick', yticks);
end
And render the final output using:
nice_figure(@some_line_plot);
With such code, you don't need to worry about the window size at all. Notice that I haven't even shown the window for you to play with its size. Using this code, I always get beautiful output and small EPS and PDF file sizes (much smaller than when using PNG).
The only thing this solution does not address is adding more ticks when the plot is made larger in the LaTeX code, but that can't be done anyway.
