Extract plots from PDFs

We have a PDF page containing one or more figures that are two-dimensional plots of experimental results. The figures may or may not be embedded in text. Each figure has x and y axes with their labels and units marked, and inside each figure are one or more plots, each in a different color.
How can we convert each plot into a table of corresponding x and y values (say, for 100 points)?
I have already tried WebPlotDigitizer but it works only when the input is a standalone picture of a plot.
What I think I'll have to do is extract the plots from the PDF and process them further. However, I am not able to find a tool for doing that. I have attached a sample PDF from which the plots have to be extracted.
Note that the two plots on the last page of the PDF are images and can be extracted readily (I've found a couple of programs for those). The other plots are not images, and those programs are not able to extract them.
Is there any open source software that can achieve that?

The plots in the PDF file you have provided are made with vector drawings, so the only way to extract them is to convert the PDF into images (i.e., render the pages). Try ImageMagick's convert command line; see this answer.
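For example, a hedged one-liner (the input filename is hypothetical; -density sets the rendering resolution in DPI, and one PNG is written per page):
convert -density 300 input.pdf page-%d.png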

As Photoshop is very scriptable, it is actually possible to extract images from a PDF programmatically (as opposed to whole pages; see the Photoshop JavaScript documentation).
You then have a whole set of instruments to adjust the images, so that further processing (interpretation) is easier to accomplish.

Related

Why does everybody convert images to grayscale before performing operations in OpenCV?

I've been trying to find the answer to why everybody converts an image to grayscale before processing it.
For example, this website with instructions teaching people how to build a simple scanning program converts the photo to grayscale first before passing commands to manipulate the image itself.
In the second example, this Stack Overflow thread shows a person who also converts the image to grayscale before extracting text from it.
Does this process make the image easier to manipulate? Or does it give better results when extracting text? If so, shouldn't a binary image give the best result in the case of extracting text?
More often than not, grayscale has all the relevant information to complete a particular task. So reducing the image to grayscale greatly simplifies calculations and removes redundancies.
A binary image is great too, but it sacrifices too much information to be useful in many cases. And most libraries support a minimum of 8-bit image processing anyway, so a true 1-bit binary data structure is rarely useful.
Imagine having to create a program to recognize text on paper. Having a color image doesn't help you read the text any better. The text can be in various colors, but you can read it even in black and white. You can argue that a binary image should give the same performance, and that is true IF there is no noise such as a shadow on the paper.
Once noise elements exist in the image, you need more information to separate text from noise, and that is when grayscale is useful.
Moreover, the most used and reliable information for advanced image processing is edges and textures, both of which can be obtained from a grayscale image.
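As a minimal sketch of that pipeline (in MATLAB, to match the code used elsewhere on this page; the OpenCV equivalents are cvtColor, threshold, and Canny, and the filename is hypothetical):
% Reduce a color photo to grayscale: 3 channels become 1, so there is
% less data, but the intensity information needed to read text survives.
img = imread('scanned_page.png');
gray = rgb2gray(img);
% A binary image only works well when the page is clean (no shadows);
% imbinarize applies a global Otsu-style threshold.
bw = imbinarize(gray);
% Edges and texture, the workhorses of later processing stages,
% are computed from the grayscale image, not the binary one.
edges = edge(gray, 'Canny');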

Equalize contrast and brightness across multiple images

I have roughly 160 images for an experiment. Some of the images, however, have clearly different levels of brightness and contrast compared to others. For instance, I have something like the two pictures below:
I would like to equalize the two pictures in terms of brightness and contrast (probably by finding some level in the middle rather than equating one image to the other, though that would be okay if it makes things easier). Would anyone have any suggestions on how to go about this? I'm not really familiar with image analysis in MATLAB, so please bear with my follow-up questions should they arise. There is already a question on here about equalizing luminance, brightness and contrast for a set of images, but the code doesn't make much sense to me (due to my lack of experience working with images in MATLAB).
Currently I use GIMP to manipulate images, but it's time-consuming with 160 images, and going by subjective eye judgment isn't very reliable. Thank you!
You can use histeq to perform histogram specification, where the algorithm tries its best to make the target image match the distribution of intensities / histogram of a source image. This is also called histogram matching, and you can read up on it in my previous answer.
In effect, the distribution of intensities between the two images should hopefully be the same. If you want to take advantage of this using histeq, you can specify an additional parameter that specifies the target histogram. Therefore, the input image would try and match itself to the target histogram. Something like this would work assuming you have the images stored in im1 and im2:
out = histeq(im1, imhist(im2));
However, imhistmatch is the better function to use. You call it almost the same way as histeq, except you don't have to manually compute the histogram. You just specify the actual image to match against:
out = imhistmatch(im1, im2);
Here's a running example using your two images. Note that I'll opt to use imhistmatch instead. I read the two images directly from Stack Overflow, perform histogram matching so that the first image matches the intensity distribution of the second image, and show the results all in one window.
% Read both images directly from Stack Overflow
im1 = imread('http://i.stack.imgur.com/oaopV.png');
im2 = imread('http://i.stack.imgur.com/4fQPq.png');
% Match the intensity distribution of im1 to that of im2
out = imhistmatch(im1, im2);
% Show both originals and the matched result in one window
figure;
subplot(1,3,1); imshow(im1);   % first image
subplot(1,3,2); imshow(im2);   % second image (the reference)
subplot(1,3,3); imshow(out);   % first image matched to the second
This is what I get:
Note that the first image now more or less matches in distribution with the second image.
We can also flip it around and make the first image the source and we can try and match the second image to the first image. Just flip the two parameters with imhistmatch:
out = imhistmatch(im2, im1);
Repeating the above code to display the figure, I get this:
That looks a little more interesting. We can definitely see the shape of the second image's eyes, and some of the facial features are more pronounced.
As such, what you can finally do is choose a good representative image that has the best brightness and contrast, then loop over each of the other images and call imhistmatch each time using this image as the reference, so that every other image matches its distribution of intensities to it. I can't really write exact code for this because I don't know how you are storing these images in MATLAB. If you share some of that code, I'd love to write more.
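In the meantime, here is a minimal sketch of that loop, assuming the images sit on disk as PNG files (all folder and file names below are hypothetical):
% Hand-picked reference image with the best brightness and contrast
ref = imread('best_image.png');
% Match every experiment image to the reference and save the result
files = dir(fullfile('experiment', '*.png'));
mkdir('equalized');
for k = 1:numel(files)
    img = imread(fullfile('experiment', files(k).name));
    out = imhistmatch(img, ref);   % match intensities to the reference
    imwrite(out, fullfile('equalized', files(k).name));
end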

Can pdfbox extract vector images?

As per my understanding:
1. .eps format images are vector images.
2. When we draw something in Word (like a flowchart), it is stored as a vector image.
I am almost sure about the first, not sure about the second. Please correct me if I am wrong.
Assuming these two things, when a LaTeX file (with .eps images inserted) or a Word file (containing vector images) is converted into PDF, do the images get converted into raster images?
Also, I think PDFBox/xpdf can only extract raster images from the PDF (as they are embedded as XObjects), not vector images. Is that understanding correct? This Stack Overflow question is related, but has not been answered yet.
Your point 1 is incorrect: .eps files are PostScript programs; they may contain vector information, text, image data, or all of the above.
As for point 2: in PDF there isn't a 'vector image'. An image means a bitmap and therefore cannot be vector.
If you convert a PostScript program to a PDF file, then the result depends entirely on the conversion program you use. In general vectors will be retained as vectors, and text as text. However it is entirely possible that an application might render the entire PostScript program and insert the result as an image in the PDF.
So the answer to your first question ("do the images get converted into raster images") is 'maybe, but probably not'.
I'm afraid I have no idea about the capabilities of PDFBox/xpdf, but since collections of vectors may not be arranged as 'images' in any atomic fashion (they could be held as Form XObjects, or Patterns), there isn't any obvious way to know when to stop extracting. And what format would you store the result in anyway?

Optimize the SVG output from Gnuplot

I've been trying to plot a dataset containing about 500,000 values using gnuplot. Although the plotting went well, the SVG file it produced was too large (about 25 MB) and takes ages to render. Is there some way I can improve the file size?
I have a vague understanding of the SVG file format, and I realize this is because SVG is a vector format and thus has to store the 500,000 points individually.
I also tried Scour and re-printing the SVG without any success.
The time it takes to render your SVG file is proportional to the amount of information in it. Thus, the only way to speed up rendering is to reduce the amount of data.
I think it is a little tedious to fiddle with an already-generated SVG file. I would suggest reducing the amount of data gnuplot has to plot.
Maybe gnuplot's every keyword, or some other reduction of the data, can help, like splitting the data into multiple plots; see the sketch below.
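For instance, a hedged one-liner (every 10 plots only every tenth data point; tune the stride against how much detail you can afford to lose):
plot 'data.dat' every 10 with lines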
I would recommend keeping it in vector graphic format and then choosing a resolution for the document that you put it in later.
The main reason for doing this is that you might one day use the image in a poster (for example) and print it at hundreds of times the current resolution.
I normally convert my final PDF into DjVu format.
pdf2djvu --dpi=600 -o my_file_600.djvu my_file.pdf
This lets me specify the resolution of the document as a whole (including the text), rather than different resolutions scattered throughout.
On the downside, it does mean having a large PDF for the original document. However, this can be mitigated if you are using LaTeX to make your original PDF, since you can use the draft option until you have finished, so that images are not imported in your day-to-day editing of the text (where rendering large images would be annoying).
Did you try printing to PDF and then converting to SVG?
In Linux, you can do that with ImageMagick, which you may even be able to use to reduce the size of your original SVG file.
Or there are online converters, such as http://image.online-convert.com/convert-to-svg

Image Compression Algorithm - Breaking an Image Into Squares By Color

I'm trying to develop a mobile application, and I'm wondering about the easiest way to convert an image into a text file, and then be able to recreate the image later in memory from said text. The image(s) in question will contain no more than 16 or so colors, so it would work out fine.
Basically, brute-forcing this solution would require saving each individual pixel's color data into a file. However, this would result in a HUGE file. I know there's a better way: for instance, if a huge portion of the image consists of the same color, break the area up into smaller squares and rectangles and save their coordinates and sizes to the file.
Here's an example. The image is supposed to be just black/white. The big color boxes represent theoretical 'data points' in the outputted text file. These boxes would really state their origin, size, and what color they should be.
E.g., the top box has an origin of 0,0, a size of 359,48, and it represents the color black.
Saved in a text file, the data would be 0,0,359,48,0.
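For illustration, a hedged decoding sketch in MATLAB (language chosen only to match the code elsewhere on this page; one x, y, width, height, color record per line, and the filename and canvas size are assumptions):
recs = dlmread('image.txt');          % one row per record: x, y, w, h, color
img = zeros(480, 360, 'uint8');       % hypothetical canvas: 480 rows, 360 columns
for k = 1:size(recs, 1)
    x = recs(k, 1); y = recs(k, 2);
    w = recs(k, 3); h = recs(k, 4);
    img(y+1 : y+h, x+1 : x+w) = recs(k, 5);   % +1 because MATLAB is 1-based
end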
What kind of algorithm would this be?
NOTE: The SDK that I am using cannot return a pixel's color from an X,Y coordinate. However, I can load external information into the program from a text file and manipulate it that way. This data that I need to export to a text file will be from a different utility that will have the capability to get a pixel's color from X,Y coordinates.
EDIT: Added a picture
EDIT2: Added constraints
Could you elaborate on why you want to save an image (or its parts) as plain text? Can't you use a binary representation instead? Also, if your images typically have lots of contiguous runs of pixels of the same color, you may want to use so-called run-length encoding (RLE). Alternatively, one of the Lempel-Ziv family of compression algorithms could be used (LZ77, LZ78, LZW).
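To make the RLE idea concrete, a minimal sketch in MATLAB (again just to match this page's code style; img is assumed to be a matrix of palette indices):
px = double(img(:));                   % flatten column by column (double avoids uint8 saturation in diff)
starts = find([true; diff(px) ~= 0]);  % index where each run of equal colors begins
lens = diff([starts; numel(px) + 1]);  % length of each run
vals = px(starts);                     % color of each run
rle = [vals, lens];                    % each row: (color, run length)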
Encode the image in a compressed format (e.g., JPEG, PNG, or GIF) and then save it as a .txt file or whatever. To recreate the image, just read the file into your program using whatever library function suits your particular needs.
If it's necessary that the .txt file have some textual meaning, then you may be in some trouble.
In CS there is a spatial-index algorithm that recursively subdivides a plane into 4 tiles. When the cells are all split to the same size, it looks like a quadtree. If you want to subdivide a plane into patterns (of colors), you can use this tiling idea and dynamically change the size of the cells. A good place to start looking is a Z-order curve or a Hilbert curve.
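As it happens, MATLAB's Image Processing Toolbox ships a quadtree decomposition, so here is a hedged sketch of the idea (qtdecomp requires a square, power-of-two-sized, single-channel image; the filename and threshold are assumptions):
I = imread('drawing.png');             % hypothetical 512x512 image
if size(I, 3) == 3
    I = rgb2gray(I);                   % qtdecomp wants a single channel
end
S = qtdecomp(I, 0.1);                  % split blocks whose intensity range exceeds ~10% (about 26 gray levels for uint8)
% S is sparse: S(r, c) == k means a homogeneous k-by-k block starts at (r, c).
% Each block becomes one (origin, size, color) record for the text file.
[r, c, k] = find(S);
blocks = [c - 1, r - 1, k, k];         % x, y, width, height (0-based, as in the question)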
