Image Conversion library: Word, PDF, Excel to Images - image

We have a requirement to convert incoming documents, which may be in Excel, PDF, or Word format, to images. Any recommendations?
I am not sure whether ImageMagick would do this. My understanding is that it is only for converting between image formats, and I guess it handles PDF as well. What about Excel and Word?
Thanks in advance.

You could convert everything to pdf first using:
$ libreoffice --headless --invisible --convert-to pdf *.libreofficeextension
and then use ImageMagick to turn the resulting PDFs into images.
You might run into some formatting issues with Word and especially with PowerPoint files.
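As a rough sketch of the two-step route above (office document to PDF with LibreOffice, then PDF to images with ImageMagick), the Ruby below builds and runs both commands. The function names, the output naming pattern and the default density are assumptions made for illustration, and both tools are assumed to be on the PATH.

```ruby
require 'shellwords'

# Build the LibreOffice command that converts one office document to PDF.
# The PDF is written to out_dir under the input's base name.
def office_to_pdf_command(input_path, out_dir)
  ['libreoffice', '--headless', '--convert-to', 'pdf', '--outdir', out_dir, input_path]
end

# Build the ImageMagick command that rasterizes every page of a PDF.
# %03d in the output name is replaced by the zero-based page index.
def pdf_to_images_command(pdf_path, out_dir, density: 150)
  base = File.basename(pdf_path, '.*')
  ['convert', '-density', density.to_s, pdf_path, File.join(out_dir, "#{base}-%03d.png")]
end

# Run one command and raise if the tool is missing or exits with an error.
def run!(cmd)
  system(*cmd) or raise "command failed: #{cmd.shelljoin}"
end

# Convert one document (Word, Excel, or PDF) into one PNG per page.
def convert_document(input_path, out_dir)
  pdf_path = input_path
  unless File.extname(input_path).downcase == '.pdf'
    run!(office_to_pdf_command(input_path, out_dir))
    pdf_path = File.join(out_dir, File.basename(input_path, '.*') + '.pdf')
  end
  run!(pdf_to_images_command(pdf_path, out_dir))
end
```

Calling convert_document('report.docx', 'out') would then be expected to leave out/report.pdf plus one PNG per page in out/. Passing the arguments as an array avoids shell quoting problems with file names that contain spaces.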

You're correct: ImageMagick won't handle the MS Office formats, because it only converts between image formats.
For PDFs, you can just use ImageMagick directly:
convert -density 400 filename.pdf filename.jpeg
It will give you one file per page:
filename-0.jpeg
filename-1.jpeg
...
filename-(N-1).jpeg
where N is the number of pages in your document. pdf2ps will achieve the same thing, but you'll need to play around with the command-line parameters to get the same output quality.
For the MS Office products, I remember that there is some sort of API that allows you access to the suite's features (this was MS Office 2007, from memory), like opening a file and exporting it to PDF. If you can get things out to PDF, then you can use the method above to convert it to images. Some negative points:
This was many years ago at my previous job, and I can't remember what exactly it was called or how to use it.
I remember the output PDF formatting wasn't great (not 100% like it appears on the screen), but it was readable. This may have improved since I last used it.
I have a vague recollection of it firing up an Excel window in the background, so it's not entirely a command-line solution (and may be unsuitable for servers).

This is quite an old question, but here is how I solved it:
Use a Windows machine.
Install the MS Office suite.
Use https://officetopdf.codeplex.com/ for converting any office format to PDF
Use ImageMagick to convert the PDF to an image format.
Hope it helps someone.

Related

Separator line between columns in MS-Access report

I have successfully built a two column report in MS-Access (2013, if that matters). What I can't seem to find out, is how to get a separator line between the columns.
I have tried to draw a vertical line in the details area, but this is not working when some fields in the detail area can grow. The line does not grow, and there doesn't seem to be a "grow" option for lines.
What am I missing? If MS-Access does not offer this, is there a way to do this programmatically?
I think your easiest solution is to create an image - about the size of your page - with a vertical line in the middle, then use that image as a page background for your report.
From my experience, the best quality-to-file-size ratio comes from an image in EMF (Enhanced Metafile) format, a Microsoft vector format.
This format is not supported by many tools. I use Inkscape (free) when I need it. And from my limited experience, (free) Inkscape gives better results than (expensive) Illustrator for that use case.
Alternatively, you can create such a background image as a Word document, save it as PDF, then find an online PDF to EMF converter.
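One way to produce the page-sized image with a centered vertical line is to generate it as SVG, which Inkscape opens natively and can then save as EMF. The Ruby below is only a sketch: the function name, the A4 dimensions and the stroke width are assumptions, so adjust them to the report's actual page size.

```ruby
# Return an SVG document for a page-sized image with one vertical line
# down the middle. Dimensions are in millimetres (A4 by default, an assumption).
def separator_svg(width_mm: 210, height_mm: 297, stroke_mm: 0.3)
  x = width_mm / 2.0 # horizontal centre of the page
  <<~SVG
    <?xml version="1.0" encoding="UTF-8"?>
    <svg xmlns="http://www.w3.org/2000/svg" width="#{width_mm}mm" height="#{height_mm}mm" viewBox="0 0 #{width_mm} #{height_mm}">
      <line x1="#{x}" y1="0" x2="#{x}" y2="#{height_mm}" stroke="black" stroke-width="#{stroke_mm}"/>
    </svg>
  SVG
end
```

Writing the result with File.write('separator.svg', separator_svg) gives a file that can be opened in Inkscape and saved as EMF for use as the report's page background.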

Difference between text as image and graphics as image

The question may seem odd, but I need to ask it because I am seeing quite interesting output when I compare text stored as an image with graphics stored as an image.
I am in the process of identifying a tool or algorithm that compares two PDFs and generates output highlighting the differences between them.
Some of the PDFs contain text in image form (legacy text on paper that was converted to PDF).
We are migrating those legacy PDFs, and at the end we compare the legacy PDF with the converted PDF output.
I am evaluating a couple of tools for comparing two PDFs, such as Adobe Acrobat Pro DC, i-net PDFC, and Power PDF.
While evaluating them, I can see that graphic images are compared (though not accurately) on both sides of the PDFs, whereas text images are completely ignored. All the tools give the same result.
But I am more interested in text as images, since we mostly deal with legacy text PDFs.
Attached below is the graphic image comparison result, where the tool was able to capture the differences between the images.
But when I compare text images, the differences are not highlighted in the tool.
What I understand from this is that text images are not compared the way graphic images are, and the tool skips that comparison completely. I would like clarification on whether my assumption is correct.
Secondly, I would like to know how to compare text images in PDFs to generate the differences.
I work for the company that is the author of i-net PDFC, so I'll answer your first question as well:
Your assumption is correct. i-net PDFC is able to compare images and shapes, but it cannot detect when some content has completely changed its meaning, e.g. a line shape that is used to draw a letter or, in your case, an image that has to be recognized as text. Recognizing ASCII art as an image won't work for the same reason either. Such cases will always be detected as differences even though their visual appearance is similar.
On your second question: Using an OCR conversion tool for one or both documents is a common solution to this problem. A simple image comparison of the compared pages is unlikely to work due to the different font styles and line wrappings in the converted file.
Please note that most OCR applications will use the rendered page images for the recognition. This may lead to incorrect recognition results even if there are no images in the PDF file.
i-net Software is aware of this general issue and an OCR module is currently in development. It'll provide an option to apply the recognition solely to the images in the PDF files.
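Once both documents have been run through OCR, the comparison reduces to a text diff that ignores line wrapping. The Ruby below is a minimal sketch of that idea using a word-level longest-common-subsequence diff. It assumes the OCR text of each page is already available as a string (from whichever OCR tool you use), and the function name is made up for illustration.

```ruby
# Word-level diff that ignores line wrapping and repeated whitespace.
# Returns an array of [kind, word] pairs, where kind is :same, :removed or :added.
def word_diff(old_text, new_text)
  a = old_text.split # split on any whitespace, so re-wrapped lines compare equal
  b = new_text.split

  # lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  lcs = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  (a.size - 1).downto(0) do |i|
    (b.size - 1).downto(0) do |j|
      lcs[i][j] = if a[i] == b[j]
                    lcs[i + 1][j + 1] + 1
                  else
                    [lcs[i + 1][j], lcs[i][j + 1]].max
                  end
    end
  end

  # Walk the table from the start and emit the edit script.
  diff = []
  i = j = 0
  while i < a.size && j < b.size
    if a[i] == b[j]
      diff << [:same, a[i]]
      i += 1
      j += 1
    elsif lcs[i + 1][j] >= lcs[i][j + 1]
      diff << [:removed, a[i]]
      i += 1
    else
      diff << [:added, b[j]]
      j += 1
    end
  end
  diff.concat(a[i..-1].map { |w| [:removed, w] })
  diff.concat(b[j..-1].map { |w| [:added, w] })
end
```

Filtering out the :same entries leaves only the words that differ between the legacy and converted page. OCR misreads will show up as differences too, so the output still needs a human check.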

markdown or markup to powerpoint?

I need to maintain some slides in both LaTeX Beamer and PowerPoint. (This is to make the slides available to instructors elsewhere too, 90% of whom do not know how to use LaTeX and are unwilling to learn it. I am a LaTeX guy on Linux.)
I have tried the route via LibreOffice (and OpenDocument), but this did not come out well. Right now, the best method I have found is to author the PDF in Beamer, then run it through a Nuance OCR program to get MS Word output, which does not even go all the way to PowerPoint (which is where I really need to be).
If I only had a markup language that produced nice PowerPoint, I could probably code a Perl translator from Markdown to this intermediate markup language. (Going from Markdown to LaTeX Beamer is relatively easy.)
I don't think this exists, but hope springs eternal. After all, it is almost 2014 now. Does anyone know of a solution?
One solution is to use odpdown: it converts Markdown to the OpenDocument Presentation (ODP) format used by OpenOffice Impress, which can be imported into PowerPoint.
It is not yet complete (table support is missing, for example) and it may not run on certain Windows setups, but it could nevertheless be a start. You may well have Linux running, where it seems to work.
Steve Rindsberg's answer in the comments works on PowerPoint 2007! Let me repeat it here:
I suspect that PowerPoint is the likeliest solution. ;-) But what sort
of slides are you creating? If they're simple heading and bullet point
slides, all you need to produce is a simple text file. Any text that
starts in the left column will be the heading of a new slide. Indent
one tab and it becomes a first-level bullet point under the current
heading; indent two tabs, it becomes a second level bullet point and
so on. Simply use File | Open on the text file to pull it into PPT.
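To illustrate the outline format described in that comment, here is a small Ruby sketch that turns a simple Markdown outline into the tab-indented text file. The function name and the input conventions (a "#" heading starts a slide, bullets nest by two spaces per level) are assumptions for this example, not something PowerPoint defines.

```ruby
# Convert a simple Markdown outline into tab-indented outline text:
# slide titles start in the left column, bullets get one tab per level.
def markdown_to_ppt_outline(markdown)
  lines = markdown.each_line.map(&:chomp).reject { |line| line.strip.empty? }
  out = lines.map do |line|
    if (m = line.match(/\A#+\s+(.*)\z/))
      m[1]                                # heading: title of a new slide
    elsif (m = line.match(/\A( *)[-*]\s+(.*)\z/))
      level = m[1].length / 2 + 1         # top-level bullet gets one tab
      "\t" * level + m[2]
    else
      "\t" + line.strip                   # other text: first-level bullet
    end
  end
  out.join("\n") + "\n"
end
```

Writing the returned string to a .txt file and opening it with File | Open in PowerPoint should then give one slide per heading, as the comment describes.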
Steve: Is this all that PP converts? Or is there a reference of other "sneaky" markup that PP knows about?
(pandoc: unfortunately, the conversion from libreoffice to powerpoint is pretty poor when I tried it last. I also tried to save and understand the powerpoint xml format, but that was REAL bad.)
The easiest way to handle this is to work with:
RStudio (and R if not already installed)
RMarkdown
Pandoc 2.0.5 (minimum)
Install those 3 (or 4) items, then read: https://bookdown.org/yihui/rmarkdown/powerpoint-presentation.html
The installation time is worth the time saved copy-pasting everything from scratch.
I am also a Linux guy, and I also use LaTeX engines to create nice documents. Based on my experience, here's what you should do:
Stop writing directly in LaTeX and start using org-mode to write documents instead. (I spent years writing in LaTeX and now that's over, except when I use the moderncv package.)
Org supports latex math formulas and .org files are easily exported in .tex files
Org can also be easily exported in markdown
Once you have your markdown, there are several tools that will allow you to create a PowerPoint. Two of them are pandoc and md2pptx
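For the pandoc route, the final step can be sketched in Ruby as below. Pandoc picks the PowerPoint writer from the .pptx extension of the output file, and --reference-doc optionally supplies a template presentation. The function name and file names are placeholders, and pandoc (2.0.5 or later) is assumed to be installed.

```ruby
# Build the pandoc command that turns a Markdown file into a .pptx file.
# reference_doc is an optional PowerPoint template used for styling.
def pandoc_pptx_command(markdown_path, pptx_path, reference_doc: nil)
  cmd = ['pandoc', markdown_path, '-o', pptx_path]
  cmd += ['--reference-doc', reference_doc] if reference_doc
  cmd
end

# Run the conversion and raise if pandoc is missing or reports an error.
def markdown_to_pptx(markdown_path, pptx_path, reference_doc: nil)
  cmd = pandoc_pptx_command(markdown_path, pptx_path, reference_doc: reference_doc)
  system(*cmd) or raise "pandoc failed for #{markdown_path}"
end
```

For example, markdown_to_pptx('slides.md', 'slides.pptx') would run pandoc slides.md -o slides.pptx.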

How to import GIF files into Beamer presentation?

I need to import animations from Maple into my LaTeX/Beamer presentation. I save the animation as a GIF file, but then I have problems converting that file to PNG: all I get is a static PNG file and I can't proceed. What's the full code to do that in LaTeX?
You can use the animate package to animate a series of PNGs. To get the series of PNGs from an animated GIF, use a tool like ImageMagick's convert.
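A rough sketch of both steps, written as Ruby helpers: one builds the ImageMagick command that splits the GIF into numbered PNG frames, the other builds the matching \animategraphics line for the animate package. The helper names, the frame prefix and the frame rate are assumptions. The document preamble also needs \usepackage{animate}.

```ruby
# ImageMagick command that splits an animated GIF into numbered PNG frames
# (prefix0.png, prefix1.png, ...). -coalesce renders each frame on the full
# canvas, which matters for GIFs that only store per-frame changes.
def gif_to_frames_command(gif_path, prefix)
  ['convert', gif_path, '-coalesce', "#{prefix}%d.png"]
end

# LaTeX line for the animate package:
#   \animategraphics[options]{frame rate}{file basename}{first}{last}
def animategraphics_line(prefix, frame_count, fps: 10)
  "\\animategraphics[autoplay,loop,width=\\linewidth]{#{fps}}{#{prefix}}{0}{#{frame_count - 1}}"
end
```

Running system(*gif_to_frames_command('plot.gif', 'frame-')) and then pasting the output of animategraphics_line('frame-', 24, fps: 12) into a Beamer frame would embed a 24-frame animation. The result plays only in PDF viewers that support the animate package's JavaScript.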
Does this help: LINK? (This is the same answer as marcog... just wanted to provide a reference to it being asked previously -- the solution was the same: the animate package).
Also, your OS will matter. I don't know that Linux (not saying you're using it) has any ability to play animated PDFs. I've tried embedding movies using LaTeX and while it "works," you can't actually view them in anything Linux offers yet. Okular is working on it, but last I checked (couple months?) it's not possible yet.
Anyway, just wanted to add that just in case you were doing everything completely right and by chance are not seeing the fruits of your labor since you're using a Linux viewer. Check your work with Acrobat on Windows to be sure.

Convert a .doc or .pdf to an image and display a thumbnail in Ruby?

Does anyone know how to generate document thumbnails in Ruby (or C, Python...)?
A simple RMagick example to convert a PDF to a PNG would be:
require 'rmagick'
pdf = Magick::ImageList.new("doc.pdf")     # loads one image per PDF page
thumb = pdf.first.resize_to_fit(300, 300)  # first page, aspect ratio preserved
thumb.write "doc.png"
Converting an MS Word document won't be as easy. Your best option may be to first convert it to a PDF before generating the thumbnail. Your options for generating the PDF depend heavily on the OS you're running on. One might be to use OpenOffice and the Python Open Document Converter. There are also online conversion services you could try, including http://Zamzar.com.
Sample code to answer the comment by @aisensiy above:
require 'rmagick'
pdf_path = "/path/to/interesting/file.pdf"
page_index_path = pdf_path + "[0]" # first page in PDF
pdf_page = Magick::Image.read( page_index_path ).first # first item in Magick::ImageList
pdf_page.write( "/tmp/indexed-page.png" ) # implicit conversion based on file extension
Based on the path clue in answer to another question :
https://stackoverflow.com/a/6369524/765063
Not sure about .doc support in any open source library but ImageMagick (and the RMagick gem) can be compiled with pdf support (I think it's on by default)
PDF support is a little buggy in ImageMagick, but it's by far the best open source option for Ruby. There's also a Google Summer of Code project for pure Ruby PDF support.
I've read stuff about using OpenOffice without the GUI to transform .doc files - but it'll be complicated at best.
As the 2 previous posters said, ImageMagick's probably the easiest way to generate the thumbnails.
You could exec something like:
`convert doc.pdf[0] -thumbnail 300x300 doc.png`
(The backquotes tell Ruby to shell it out).
If you don't want to shell out for the conversion, you could use the RMagick gem to do it for you, but it's probably a bit more code.
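If you do shell out, passing the arguments as an array to system avoids quoting problems with odd file names. The sketch below assumes ImageMagick's convert is on the PATH; the function names and the default size are made up for illustration.

```ruby
# Build the ImageMagick command that renders one PDF page as a thumbnail.
# "file.pdf[0]" selects the first page; -thumbnail scales the image to fit
# inside the given box while keeping the aspect ratio.
def pdf_thumbnail_command(pdf_path, png_path, size: '300x300', page: 0)
  ['convert', "#{pdf_path}[#{page}]", '-thumbnail', size, png_path]
end

# Run the command without going through a shell; raise if it fails.
def make_pdf_thumbnail(pdf_path, png_path, size: '300x300')
  system(*pdf_thumbnail_command(pdf_path, png_path, size: size)) or
    raise "thumbnail generation failed for #{pdf_path}"
end
```

A call such as make_pdf_thumbnail('doc.pdf', 'doc.png') would then write a thumbnail of the first page.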
If you don't mind paying for Imgix, it handles PDFs too. You get all the benefits of a fast CDN with it.

Resources