Unable to extract text and images from specific PDF - ruby

Can anyone please tell me how I can extract all the text and images from a PDF? I am able to extract images in a simple scenario: a PDF with a few lines of text and 2 PNG images that I created using Google Docs. But I am unable to extract images from a sample PDF.
I have tried the following:
In Ruby:
1) "pdf-reader" gem: it supports extracting only a few image formats.
2) "docsplit" gem: it can extract text, but not images.
Command-line utility:
1) "pdfimages" tool: it supports extracting only a few image formats.
Java library:
1) "pdfbox" library: it supports extracting only a few image formats.

1. Extracting text:
pdftotext -layout the.pdf -
Extract all pages' text to <stdout>.
pdftotext -layout -nopgbrk the.pdf the.txt
Extract all pages' text to file the.txt, and don't insert these pesky ^L characters signifying new pages.
pdftotext -f 3 -l 5 -layout the.pdf the-3-5.txt
Extract the text of pages 3-5 to the-3-5.txt.
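If the page breaks are kept (that is, -nopgbrk is not passed), the ^L form feed characters can be used to split the captured text per page. A minimal Ruby sketch, assuming the text has already been read from pdftotext; split_pages is a hypothetical helper name, not part of any gem:

```ruby
# Split pdftotext output into one string per page.
# Page boundaries are marked by a form feed ("\f", displayed as ^L).
def split_pages(text)
  text.split("\f")
end

# The sample string stands in for real output, which could be captured
# with something like: text = `pdftotext -layout the.pdf -`
sample = "first page\n\fsecond page\n\f"
pages = split_pages(sample)
puts pages.length
```

String#split drops trailing empty strings, so a form feed after the last page does not produce an empty extra page.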
2. Extracting images:
pdfimages -f 4 -l 7 -j the.pdf myprefix--
Extract all images from pages 4 through 7 as JPEGs (if possible!) and name them with the prefix myprefix-- (pdfimages appends a running number and the file extension).
If extracting as JPEGs is not possible, the images will be extracted as pure raster PPM or PGM.
The latest versions of pdfimages (Poppler fork) let you specify -png (and more) to get all images as PNGs.
Using the latest version of pdfimages gives you these options:
$ pdfimages -h
pdfimages version 0.33.0
Copyright 2005-2015 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdfimages [options] <PDF-file> <image-root>
-f <int> : first page to convert
-l <int> : last page to convert
-png : change the default output format to PNG
-tiff : change the default output format to TIFF
-j : write JPEG images as JPEG files
-jp2 : write JPEG2000 images as JP2 files
-jbig2 : write JBIG2 images as JBIG2 files
-ccitt : write CCITT images as CCITT files
-all : equivalent to -png -tiff -j -jp2 -jbig2 -ccitt
-list : print list of images instead of saving
-opw <string> : owner password (for encrypted files)
-upw <string> : user password (for encrypted files)
-p : include page numbers in output file names
-q : don't print any messages or errors
[....]
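Since the question is about Ruby, here is a rough sketch of how the command line could be assembled from a script. It only uses flags that appear in the help output above; pdfimages_args is a hypothetical helper name, and actually running the command assumes Poppler's pdfimages is installed and on the PATH:

```ruby
# Build an argument list for pdfimages from the flags listed in its help.
# Passing an array to system avoids shell quoting problems with file names.
def pdfimages_args(pdf, prefix, first: nil, last: nil, format: "-all")
  args = ["pdfimages"]
  args += ["-f", first.to_s] if first   # first page to convert
  args += ["-l", last.to_s] if last     # last page to convert
  args << format                        # e.g. "-png", "-j" or "-all"
  args + [pdf, prefix]
end

args = pdfimages_args("the.pdf", "myprefix--", first: 4, last: 7, format: "-png")
puts args.join(" ")
# To run it for real (not done here): system(*args)
```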
What more image formats do you want? If you need other formats use ImageMagick's convert command.
Also, there are no other "formats" embedded in PDFs.
Basically, the only compression methods for images embedded in PDFs are:
JPEG (then the /DCTDecode filter is mentioned as a decompression hint to the PDF viewer),
JBIG2 (/JBIG2Decode),
Fax compression (/CCITTFaxDecode) and
JPEG2000 (/JPXDecode).
All other images embedded in PDFs basically are pure raster data anyway (PPM or PGM), and their PDF-internal compression is one of the other standard compression methods available for general stream compression:
/FlateDecode (ZIP/Deflate algorithm),
/LZWDecode (Lempel-Ziv-Welch algorithm) and
/RunLengthDecode.
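To make the distinction concrete, here is a small Ruby sketch that sorts a stream filter name into "image-specific, can be saved as-is" versus "generic compression around raw raster data". The helper name image_kind and the file extensions are assumptions for illustration only:

```ruby
# Image-specific filters: the stream is already a complete image encoding.
# The extensions are illustrative, not taken from any particular tool.
NATIVE_IMAGE_FILTERS = {
  "DCTDecode"      => "jpg",
  "JPXDecode"      => "jp2",
  "JBIG2Decode"    => "jb2",
  "CCITTFaxDecode" => "ccitt"
}.freeze

# Generic stream filters: after decoding, what remains is raw pixel data.
GENERIC_FILTERS = %w[FlateDecode LZWDecode RunLengthDecode].freeze

def image_kind(filter)
  name = filter.delete_prefix("/")
  return [:native, NATIVE_IMAGE_FILTERS[name]] if NATIVE_IMAGE_FILTERS.key?(name)
  return [:raster, nil] if GENERIC_FILTERS.include?(name)

  [:unknown, nil]
end

p image_kind("/DCTDecode")
p image_kind("/FlateDecode")
```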
Update
I only now had time to look at your linked sample PDF, sorry.
As @mkl wrote in his comment, what looks like an image isn't always an image in PDF technical parlance. For example, on your PDF's page 7 there is the (famous) tiger head. This is completely composed of vector elements, which are placed inline into the page's /Contents stream.
The same is true for the depicted chess board.
I believe the tiger image was designed with the help of some vector graphics program a few decades ago (Adobe Illustrator?) when it had freshly been released, and exported to EPS. A PDF viewer in many cases has no way to distinguish inline vector elements (which could be simple horizontal lines) from other contents, unless these vector elements are "grouped" into an XObject (which pdfimages would not be able to extract either, but which would help with manual isolation and extraction...).
These vector elements cannot be automatically extracted by any (Free and Open Source Software, or gratis closed source software) tool I know.
A "real" image in PDF parlance is a rectangle of pixel data. These are the only type of images which can be extracted by a tool like pdfimages.

Related

Converting PDF to images of original size

I have a PDF file which is made of photographs of a book combined into a single PDF file. I'm trying to convert it back to single images in PNG format. Every tool I tried asks me to set a DPI, which alters the size of the resulting images. Is there a way to get images of the exact same pixel size as the original images?
Most PDFs of books contain a single image per page and depending on the scanner these images can basically be in three different formats: JPEG, JPEG2000 or TIFF. JPEG2000 is rarely used, so your PDF probably contains JPEG and/or TIFF images.
The good thing about JPEG (and JPEG2000) images is that they can be embedded as-is into a PDF! So you can extract the images as they are stored in the PDF. With TIFF this is also sometimes possible (but I don't think always...).
As mentioned by Tim Roberts you should try using pdfimages or hexapdf images to view and extract the images stored in the PDF. This will give you the best result.

How to recognize an image file format using its contents?

If an image file is of format .png then it will contain ‰PNG at the beginning of the file (when read in text mode).
If an image file is of format .bmp then it will contain BM at the beginning of the file (when read in text mode).
I know that image formats contain data of a certain size (in bytes) at the beginning of the file, which is used as metadata of the image file.
My Questions are:-
Is this behavior same in all image file formats (or formats in general)?
Could a image file (of no extension) be recognized just using this data?
Is there information available on how this metadata is broken down? By that I mean, data at which position in the metadata has what meaning?
Is this behavior same in all image file formats (or formats in
general)?
For most of them, yes. There are some proprietary formats (e.g. for games) that might have very short or no metadata. Also, metadata might be in another file (e.g. animations together with XML metadata).
Could a image file (of no extension) be recognized just using this
data?
Yes. In fact, most image viewers will warn you if an image file has an incorrect extension and ask you if they should fix it.
On Unix systems, there's a file command that identifies files based on their metadata. There is a better tool specific for images called identify (part of ImageMagick) that returns more detailed information on resolution, bitdepth, etc.
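The same idea can be sketched in a few lines of Ruby by comparing the first bytes of a file against known signatures. This is only an illustration covering four common formats; sniff_format and the SIGNATURES table are hypothetical names, and the file command knows far more formats than this:

```ruby
# Well-known leading byte sequences ("magic numbers") of some image formats.
SIGNATURES = {
  "PNG"  => "\x89PNG\r\n\x1A\n".b,
  "JPEG" => "\xFF\xD8\xFF".b,
  "GIF"  => "GIF8".b,
  "BMP"  => "BM".b
}.freeze

# Return the format name for the given leading bytes, or nil if unknown.
def sniff_format(bytes)
  data = bytes.b # compare as raw bytes, independent of text encoding
  SIGNATURES.each { |name, magic| return name if data.start_with?(magic) }
  nil
end

# For a real file, the leading bytes could come from File.binread(path, 8).
puts sniff_format("\x89PNG\r\n\x1A\n\x00\x00\x00\rIHDR".b)
puts sniff_format("BM\x36\x00".b)
```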
Is there information available on how this metadata is broken down? By
that I mean, data at which position in the metadata has what meaning?
There are books about (image) file formats and for most formats, this information is available in official specifications (e.g. RFC 2083 for PNG). They list all of the (optional) file contents, describe the compressions and what a viewer/decoder/encoder can/must/should do with the data. A good starting point might be the Wikipedia list of image file formats.
Note that based on the examples you gave I suppose you opened the files with a text editor, which is not the ideal tool for that task. It's better to use a hex editor for this. Text editors won't show most bytes (e.g. 255) by default and interpret others (e.g. tab or line feed). They might be good enough to see magic text strings like "BM" and "PNG", but with a hex editor you can see both these text parts and their numerical representation, e.g. allowing you to extract image width and height. For this, some tool to convert hexadecimal values to decimal is useful; most calculators can do this.
As an example, let's look at the beginning of a PNG file with a resolution of 6146 x 14293 in both a text editor and a hex editor:
You can see that the file is a PNG image in both of them, that's correct. But the marked part in the hex editor view will show the width and height of the image (matching the PNG chunk specification of the "IHDR" part) - 0x00001802 is 6146 in decimal, 0x000037D5 is 14293. There's no way to do this in the text editor.
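If no hex editor is at hand, the same two numbers can be read programmatically. Below is a minimal Ruby sketch, assuming the standard PNG layout (8-byte signature, 4-byte chunk length, the 4-byte chunk type "IHDR", then width and height as big-endian 32-bit integers); png_dimensions is a hypothetical helper name, and the header bytes are constructed by hand to match the example values above:

```ruby
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# Return [width, height] from the leading bytes of a PNG, or nil otherwise.
def png_dimensions(bytes)
  data = bytes.b
  return nil if data.bytesize < 24
  return nil unless data.start_with?(PNG_SIGNATURE) && data[12, 4] == "IHDR".b

  data[16, 8].unpack("N2") # two big-endian unsigned 32-bit integers
end

# Hand-built header: signature, chunk length 13, "IHDR", width, height.
header = PNG_SIGNATURE + [13].pack("N") + "IHDR".b +
         [0x00001802, 0x000037D5].pack("N2")
width, height = png_dimensions(header)
puts "#{width} x #{height}" # 6146 x 14293
```

For a real file, File.binread(path, 24) would supply the bytes.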
Also note that even if you don't know an image format, you might be lucky with just guessing it's uncompressed data (this often works for some game image file formats, most notably Unity's "assets"). E.g. if you rename files to ".raw", the image viewer IrfanView will give you a dialog (see the screenshot below) where you can guess width, height and bit depth of the image and see if the result looks good. This requires some experience in interpreting the outcome though: if width and bit depth don't match, images will look like noise, warped, or have wrong colors.
This "image geometry guessing" can be improved/automated by trying different widths and computing the correlation coefficient between two lines. The tool raw2tiff can do this. Quote from the site:
There is no magic, it is just a mathematical statistics, so it can be
wrong in some cases. But for most ordinary images guessing method will
work fine.
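To illustrate the idea (this is not raw2tiff's actual implementation, just a sketch of the statistics behind it), the following Ruby code tries several candidate widths on 8-bit grayscale data and keeps the one where neighbouring rows correlate best. The names pearson and guess_width and the synthetic test data are made up for the example:

```ruby
# Pearson correlation coefficient of two equally long numeric arrays.
def pearson(a, b)
  n = a.size.to_f
  mean_a = a.sum / n
  mean_b = b.sum / n
  cov = a.zip(b).sum { |x, y| (x - mean_a) * (y - mean_b) }
  var_a = a.sum { |x| (x - mean_a)**2 }
  var_b = b.sum { |y| (y - mean_b)**2 }
  return 0.0 if var_a.zero? || var_b.zero? # flat rows carry no information

  cov / Math.sqrt(var_a * var_b)
end

# Pick the candidate width whose consecutive rows are most similar on average.
def guess_width(pixels, candidates)
  candidates.max_by do |width|
    rows = pixels.each_slice(width).select { |row| row.size == width }
    next -1.0 if rows.size < 2

    pairs = rows.each_cons(2).to_a
    pairs.sum { |a, b| pearson(a, b) } / pairs.size
  end
end

# Synthetic 7-pixel-wide image: the same bright vertical stripe in every row.
row = [0, 10, 50, 200, 50, 10, 0]
pixels = row * 6
puts guess_width(pixels, 5..9) # 7
```

With a wrong width the stripe drifts sideways from row to row, so the correlation drops; with the right width the rows line up.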
Using ImageMagick, you can get that information (if available) for formats that ImageMagick can read, from the "magick" data in the file header, as follows:
convert image -format "%m\n" info:
For example:
convert lena.png -format "%m\n" info:
PNG
convert lena.jpg -format "%m\n" info:
JPEG
convert lena.pnm -format "%m\n" info:
PPM
Even if the suffix is removed, this still works:
convert lena_copy -format "%m\n" info:
PNG

Generate all the files (.vtt + sprite) for the Tooltip Thumbnails options of Jwplayer

What is the best way to generate the ".VTT" file and the JPG sprite attached to it for the Tooltip Thumbnails of JW Player (http://www.jwplayer.com/blog/building-tooltip-thumbnails-with-encodingcom/)?
I know how to make an image sprite with PHP, but I don't know how to take the screenshots of each video at a given time in seconds. I think there must be a server tool to do all these tasks, but I can't find it.
Thanks
I wrote a script to do this task. Given a video file (MP4 or M4v), generate thumbnail images, compress into a sprite, and generate a VTT file compatible with JWPlayer tooltip thumbnails. All of the image manipulation uses tools from ffmpeg, ImageMagick, and optionally sips and optipng. The WebVTT generation part, I had to write.
You will have to install ffmpeg & imagemagick, at a minimum to use this.
Github code is here: https://github.com/vlanard/videoscripts (under sprites/).
The basic gist is:
Create a bunch of thumbnails, e.g. every 45th second from a video
ffmpeg -i ../archive/myvideofile.mp4 -f image2 -bt 20M -vf fps=1/45 thumbs/myvideofile/tv%03d.png
Resize those thumbnails to be small, e.g. 100pixels wide
sips --resampleWidth 100 thumbs/myvideofile/tv001.png thumbs/myvideofile/tv002.png thumbs/myvideofile/tv003.png
OR if sips not available, use imageMagick utility:
mogrify -geometry 100x thumbs/myvideofile/tv001.png thumbs/myvideofile/tv002.png thumbs/myvideofile/tv003.png
Get the height & width dimensions of one of the thumbnails to use as the basis of our grid coordinates, using ImageMagick utility
identify -format "%g - %f" thumbs/myvideofile/tv001.png
which returns output like :
100x55+0+0 - tv001.png
from which we parse 100 and 55 as our Width & Height, and the general geometry of each thumbnail (W, H, X, Y)
We then generate our single spritemap from the individual thumbnails. We determine the target grid size (e.g. 2x2, 8x8) to suit the number of thumbnails we generated for this video, as well as passing in the sprite geometry, using an ImageMagick utility
montage thumbs/myvideofile/tv*.png -tile 2x2 -geometry 100x55+0+0 thumbs/myvideofile/myvideofile_sprite.png
Optionally we can run an extra compression step here to make the sprite smaller
optipng thumbs/myvideofile/myvideofile_sprite.png
We then generate a VTT file based on the number of thumbnails we created, using
the interval that we used to space out the thumbnails to label each time segment, and
using the known coordinates of each consecutive image within our sprite that maps to
the associated segment.
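The last step can be sketched in Ruby roughly as follows. This is not the code from the linked repository, only an illustration under a few assumptions: thumbnails are laid out row by row in the sprite, each cue points at its region with the #xywh=x,y,w,h media fragment, and the helper names parse_geometry, timestamp and build_vtt are made up for the example:

```ruby
# Parse identify output such as "100x55+0+0 - tv001.png" into [width, height].
def parse_geometry(line)
  match = line.match(/\A(\d+)x(\d+)\+\d+\+\d+/)
  raise ArgumentError, "unexpected geometry: #{line}" unless match

  [match[1].to_i, match[2].to_i]
end

# Format whole seconds as a WebVTT timestamp (HH:MM:SS.mmm).
def timestamp(seconds)
  format("%02d:%02d:%02d.000", seconds / 3600, (seconds % 3600) / 60, seconds % 60)
end

# Build the VTT text: one cue per thumbnail, each covering `interval` seconds.
def build_vtt(sprite_name, count:, interval:, width:, height:, columns:)
  cues = (0...count).map do |i|
    x = (i % columns) * width   # column position inside the sprite
    y = (i / columns) * height  # row position inside the sprite
    "#{timestamp(i * interval)} --> #{timestamp((i + 1) * interval)}\n" \
      "#{sprite_name}#xywh=#{x},#{y},#{width},#{height}"
  end
  "WEBVTT\n\n" + cues.join("\n\n") + "\n"
end

w, h = parse_geometry("100x55+0+0 - tv001.png")
vtt = build_vtt("myvideofile_sprite.png", count: 3, interval: 45,
                width: w, height: h, columns: 2)
puts vtt
```

The columns value has to match the first number passed to montage -tile, otherwise the coordinates will point at the wrong thumbnails.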
I've developed a Ruby gem to easily create a .VTT file and a sprite of thumbnails.
Thanks for the inspiration, @randalv!
You can take a look at it here:
https://github.com/scaryguy/jwthumbs
Usage
Instantiate your video file:
movie = Jwthumbs::Movie.new("YOUR_VIDEO.mp4")
Jwthumbs::Movie.new accepts an options hash as its second parameter. You can configure several things at the same time you instantiate your video, like this:
movie = Jwthumbs::Movie.new("YOUR_VIDEO.mp4", seconds_between: 60, sprite_name: "my_sprite_name.jpg")
or, after you have instantiated your video, you can use the Jwthumbs::Movie object to configure things:
movie = Jwthumbs::Movie.new("YOUR_VIDEO.mp4")
movie.seconds_between = 60
movie.sprite_name = "my_sprite_name.jpg"
and then to create your thumbnails and .VTT file just run this command.
movie.create_thumbs!
I know this is already a few years old but I had the same problem and found a command line tool which generates sprites pretty fast and since 1.0.6 supports WebVTT creation out of the box. The name is mt and you can check it here.
Quoting from their documentation you can use it like this:
just run mt and provide any video file as args: mt video.avi
Some of the settings can be changed through runtime flags provided
directly to mt for more information just run mt --help
Option 1 :
You can use the encoding.com's API and tell them to export vtt file too
I recommend to read "How can I create time synced thumbnails for use in JW player?" explanation from encoding.com's Knowledge base
Option 2 :
Use movie thumbnailer (mtn), a command line tool that runs on UNIX and Windows systems. But you will have to write a custom script to generate the corresponding VTT file.
Super fast! Thanks to FFmpeg's libavcodec.
Command line program: can be used on remote connections to co-location servers, or used in scripts.
Batch mode: recursively search directories for movie files. Run at lower priority (nice 10 on Linux, idle on Windows) by default.
To run at normal priority use -n option.
Thumbnails are grouped together in one jpeg file and can be saved individually too (-I option).
Works fine with Unicode filenames in both Linux & Windows (might need to change the font with -f fontfile).

Ghostscript Stamp Image on PDF

Is there any way to stamp or overlay a TIFF image on an existing PDF file and output the result using Ghostscript?
I have two PDFs which I want to merge into a result PDF, with one placed over the other, using Ghostscript. I want to know if this can be done and how, or if it may work with one PDF converted to a TIFF image on top of the base PDF.
Can ghostscript make this stamp using layers in the PDF?
Thank you for your answers
The pdfwrite device in Ghostscript doesn't really support layers, so you can't use that. Also, it's unclear why you think layers would help.
TIFF isn't part of PostScript (or PDF), so you can't directly read a TIFF file into GS. I have elsewhere posted a PostScript program which reads TIFF files and renders them for output. You could use that to read a TIFF file.
However, you would have to mess about with either the PDF interpreter or a custom EndPage procedure in order to read and render the TIFF file. And unless you take specific kinds of action, it will be opaque, which may well not be what you want.
The Ghostscript PDF interpreter doesn't really lend itself to this kind of manipulation. Have you considered using pdftk instead?

How to convert an image (i.e. pdf) for use in a LaTeX document?

What is the preferred way to convert various images, bitmap and vector, for use in a LaTeX and PDFLaTeX document?
There are many ways to do this, some make use of standard inclusions in the various LaTeX packages, others give better results.
You can include a PDF image directly into a LaTeX document if you want to produce your final output using pdflatex, but not if you want to produce a dvi file.
pdflatex can use PDF, PNG, and JPEG
latex/dvips can use PS, EPS
See more details:
Including images in LaTeX files
Watch what you name graphics files in LaTeX
I convert bitmaps into PNG, and vector graphics (e.g. SVG) into PDF. pdflatex understands both PNG and PDF.
If you have an image "as PDF", and you don't want to include it as pdf, you may want to extract the complete image data first with pdfimages. Other conversions may render the image only with reduced resolution.
My current preferred way is using bmeps and epstopdf, both included in MiKTeX, to generate pdf and eps versions of a png.
In a file called convertimage.bat,
bmeps -p3 -c -e8f -tpng %1.png > %1.eps
epstopdf %1.eps
Use it by putting it on the path and running convertimage.bat filenameminusextension
Include the result in your documents using,
\begin{figure}[h]
\begin{center}
\includegraphics[scale=0.25]{path/to/fileminusextension}
\caption{My caption here}
\label{somelabelforreference}
\end{center}
\end{figure}
I only use Encapsulated PostScript (.eps) figures (converting bitmaps with NetPBM first), since I always use dvips + ps2pdf anyway, and then I do \includegraphics{file}.
As John D. Cook says, your available image formats depend on whether you are using latex or pdflatex.
I find ImageMagick a useful tool for converting images between formats. Handles bitmap images, plus ps/pdf/eps (with ghostscript) and a zillion others. Available through apt, macports, etc.
I use a mac so I use GraphicConverter to load images and export as PDFs.
When I draw diagrams, I use Omnigraffle which lets me export as PDFs.
On windows I used to use Visio which supported EPSs which I also had no problems embedding.
The basic issues are that a) you want to handle raster and vector images differently and b) this introduces potential pitfalls.
The "right" thing to do depends a bit on your final output.
If your final output is going to be a .pdf file, and you don't need pstricks or anything like that, these days you're probably better off just using pdflatex to directly produce the file.
In this case:
store all vector figures as .pdf
store all raster figures as .png (or jpeg if they were originally jpeg)
use graphicx package and \includegraphics{filename-without-suffix}
If you don't do the above, your raster figures will be converted to jpegs and may gain compression artifacts. png is the best bet if you can choose output.
If you are headed for .dvi file you're going to want .eps for everything. (You can gzip these files as long as you generate a bounding box file).
If you're careful you can do both. I store all vector figures as (compressed) .eps because there are a few things .pdf can't do that .eps can. I store all raster figures as .png. Using make, I can have temporary copies of these canonical versions generated on the fly for .dvi or .pdf output as needed.
Someone above pointed out the filename issue. You want to avoid "." in the file names, and avoid suffixes always in your latex file itself.
I always include images in PNG format.
If you compile your code with pdflatex, then you can also use \includegraphics to include images in PDF format (you have to load the graphicx package).

Resources