Pandoc - Images in Word file are not extracted into media folder at the time of the filter execution - pandoc

I have some MS Word files (docx) that I convert to Markdown. Later, those Markdown files are converted to PDF and HTML. All of the conversions are done with pandoc.
When a Word file is converted to Markdown, my Python pandoc filter needs the width and height (in inches) of each image from the AST. This works fine; I can read that information from the AST:
{
  "t": "Image",
  "c": [
    [
      "",
      [],
      [
        ["width", "5.113165354330708in"],
        ["height", "3.063299212598425in"]
      ]
    ],
    [],
    ["media/image1.png", ""]
  ]
}
The filter also needs to open the actual image file with the Pillow library and read its size (in pixels) and DPI from the file system for some calculations.
The problem is that when the filter, which runs during the docx-to-Markdown conversion, tries to open the image with Pillow, it fails with:
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/mertcan.segmen/Desktop/doc/media/image1.png'
This probably means that pandoc does not extract the images from the Word file before executing the filter. Is this normal? If not, any advice on how I can achieve what I have in mind?
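
The calculation the filter needs can be sketched roughly as follows, assuming the Image node has exactly the shape shown above and that the pixel size has already been read from the file (for example with Pillow's Image.open(path).size). The helper names parse_inches, image_info and effective_dpi are hypothetical and are not part of pandoc or Pillow; the pixel width in the example is an assumed value.

```python
def parse_inches(value):
    """Convert a pandoc dimension string such as '5.11in' to a float."""
    if not value.endswith("in"):
        raise ValueError(f"expected a dimension in inches, got {value!r}")
    return float(value[:-2])


def image_info(node):
    """Return (path, width_in, height_in) for a pandoc JSON Image node."""
    if node.get("t") != "Image":
        raise ValueError("not an Image node")
    # An Image node's content is [attr, alt-text inlines, [url, title]].
    attr, _alt_inlines, target = node["c"]
    _identifier, _classes, key_values = attr
    attributes = dict(key_values)
    return (
        target[0],
        parse_inches(attributes["width"]),
        parse_inches(attributes["height"]),
    )


def effective_dpi(pixels, inches):
    """Pixels per inch at which the image is laid out in the document."""
    return pixels / inches


node = {
    "t": "Image",
    "c": [
        ["", [], [["width", "5.113165354330708in"],
                  ["height", "3.063299212598425in"]]],
        [],
        ["media/image1.png", ""],
    ],
}

path, width_in, height_in = image_info(node)
# Assumed pixel width for illustration; the real filter would get it from Pillow.
assumed_pixel_width = 1024
print(path, effective_dpi(assumed_pixel_width, width_in))
```

The sketch only covers dimensions given in inches, as in the AST above; other units would need their own handling.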

I found a workaround: I run pandoc --extract-media ./ MyDocxFile.docx right before converting my docx to Markdown. This extracts the images from the docx file into the media folder, and then I run my pandoc command for the conversion. Since the images have already been extracted, my filter has access to them.
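
An alternative way to pre-extract the images without a separate pandoc run is sketched below. It relies on the fact that a docx file is a zip archive, and it assumes that the path media/image1.png in the AST corresponds to word/media/image1.png inside the archive. The function name extract_docx_media is hypothetical.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_docx_media(docx_path, dest_dir="."):
    """Copy every file under word/media/ in the docx into dest_dir/media/."""
    media_dir = Path(dest_dir) / "media"
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip everything outside word/media/ and any directory entries.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            # Keep only the file name so nothing is written outside media_dir.
            target = media_dir / PurePosixPath(name).name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(archive.read(name))
            extracted.append(target)
    return extracted
```

Calling extract_docx_media("MyDocxFile.docx", ".") before the conversion would place the images where the filter expects them, provided the assumption about the archive layout holds for the document.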

Related

Pandoc .docx to .md with math and images

I am trying to export a .docx file to a .md file with math and images. However, only the image is present in the new file, not the math (as MathJax). Any idea how to solve this?
The command I use is: pandoc --extract-media=. -s file.docx -t markdown -o file.md

How to ensure that image files from Sphinx Documentation are copied "Automatically" in LaTeX pdf

In my Sphinx documentation project, I am using images like this:
.. image:: /_static/carousel_filling.png
   :width: 300px
   :height: 450px
   :scale: 100 %
   :alt: Image here
   :align: right
In the generated Sphinx HTML docs, the images are displayed perfectly. However, when I generate PDF documents using make latexpdf, I get the following error:
'LaTeX Warning: File `{carousel_filling}.png' not found on input line ...'
I tried to find documentation related to outputting images, but I only came up with this:
Excerpts from:
latex_additional_files A list of file names, relative to the
configuration directory, to copy to the build directory when building
LaTeX output. This is useful to copy files that Sphinx doesn’t copy
automatically, e.g. if they are referenced in custom LaTeX added in
latex_elements. Image files that are referenced in source files (e.g.
via .. image::) are copied automatically.
So, according to this, the image files should be added to the output PDF file automatically. But this is not happening. In the PDF file, only a blank rectangle appears where the image should be.
Interestingly, I can see that the image file has been copied to the folder _build/latex, so pdflatex should be able to access the image file.
Question
How do I correctly output the images included in my Sphinx documentation in generated pdf file?
Edit 1:
In the terminal I can see the following warning:
LaTeX Warning: File `{carousel_filling}.png' not found on input line 931.
! Package pdftex.def Error: File `"""{carousel_filling}".png' not found: using draft setting.
See the pdftex.def package documentation for explanation.
Type H <return> for immediate help.
...
l.931 ...t=450\sphinxpxdimen]{{carousel_filling}.png}
?
[21]
Edit 2:
In place of the image (where the rectangle outline has been output in pdf file) I can see this:
"""{carousel_filling}".png
Don't place your images under _static. It is a special-purpose folder, not meant for images. For example, create img/ at the level of your rst files, move the image there, and reference it with .. image:: img/my-image.png.

Using wildcards with "convert". Or "convert"ing a group of files

I regularly scan in my homework for class. My scanner exports raw jpg files to USB, and from there I can use GIMP to edit and save the files as a pdf. One time saver I've found is to export my multi-page homework as a .mng file and then use the convert command to turn it into a pdf. I do it this way because GIMP automatically merges all layers when exporting to a pdf.
convert HW.mng HW.pdf
This works well for individual files, but at the end of every week I can have dozens of files to convert.
I have tried using wildcards in the filenames for convert:
convert *.mng *.pdf
This always runs successfully and never throws an error, but never produces any pdfs.
Both
convert HW*.mng HW*.pdf
and
convert "HW*.mng" "HW*.pdf"
yield the error
convert: unable to open image `HW*.pdf': Invalid argument @ error/blob.c/OpenBlob/2712.
which I think means the problem lies in writing the output with a wildcard.
Is there any way to convert all of a specific file type to another using convert? Or should I try using a different program?
You can see this StackExchange post. The accepted answer basically does what you want.
for file in *.mng; do convert "$file" "${file/%mng/pdf}"; done
For convert in particular, use mogrify (which is part of ImageMagick as well) as suggested by Mark Setchell in a comment. mogrify can be used to edit/convert files in batches. The command for your case would be
mogrify -format pdf -- *.mng
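
The same per-file loop can be sketched in Python, for cases where the batch job is driven from a script. This is only an illustration: the function names conversion_pairs and convert_all are hypothetical, and convert_all assumes ImageMagick's convert is available on the PATH.

```python
import subprocess
from pathlib import Path


def conversion_pairs(directory, src_ext=".mng", dst_ext=".pdf"):
    """Pair each source file with an output path that differs only in extension."""
    sources = sorted(Path(directory).glob(f"*{src_ext}"))
    return [(src, src.with_suffix(dst_ext)) for src in sources]


def convert_all(directory):
    """Run convert once per input file, so each input gets its own output name."""
    for src, dst in conversion_pairs(directory):
        subprocess.run(["convert", str(src), str(dst)], check=True)
```

Running one command per file avoids the output wildcard entirely, because every output name is derived from its input name before convert is called.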

Pandoc convert docx to markdown with embedded images

When converting a .docx file to markdown, the embedded image is not extracted from the docx archive, yet the output contains ![](media/image1.png){width="6.291666666666667in" height="3.1083333333333334in"}
Is there a parameter that needs to be set in order to get the embedded pictures extracted?
pandoc --extract-media ./myMediaFolder input.docx -o output.md
From the manual:
--extract-media=DIR Extract images and other media contained in or linked from the source document to the path DIR, creating it if necessary, and adjust the images references in the document so they point to the extracted files. Media are downloaded, read from the file system, or extracted from a binary container (e.g. docx), as needed. The original file paths are used if they are relative paths not containing "..". Otherwise filenames are constructed from the SHA1 hash of the contents.
Referring to the comment by gridtrak and the problem of an unnecessarily deep directory structure (e.g. media/media/image2.jpeg): use the current directory as path DIR, so that a folder media is created within the current directory (e.g. media/image2.jpeg):
pandoc --extract-media=. input.docx -o output.md
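
When the conversion is launched from a script, the command above can be assembled as an argument list. A minimal sketch, using only the options quoted above; the function names pandoc_docx_to_markdown_args and convert_docx are hypothetical, and convert_docx assumes pandoc is on the PATH.

```python
import subprocess


def pandoc_docx_to_markdown_args(docx_path, markdown_path, media_parent="."):
    """Build the pandoc command line; media ends up in media_parent/media."""
    return [
        "pandoc",
        f"--extract-media={media_parent}",
        docx_path,
        "-o",
        markdown_path,
    ]


def convert_docx(docx_path, markdown_path, media_parent="."):
    """Run pandoc and raise CalledProcessError if it exits with an error."""
    args = pandoc_docx_to_markdown_args(docx_path, markdown_path, media_parent)
    subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.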

openFile with pandoc 1.13.2 - Windows 8.1

Sorry for my English (this is my first post on this forum, and my question is perhaps stupid).
I have a problem converting an HTML file to a PDF file with pandoc.
Here is what I type in the console:
set Path=%Path%;C:\Users\nicolas\AppData\Local\Pandoc
(adding the Pandoc directory to the Path)
followed by
pandoc --data-dir=C:\Users\nicolas\Desktop essai.html -o essai.pdf
As indicated, my file is on the Desktop, but I get the following error:
pandoc: essai.html: openFile: does not exist (No such file or directory)
I get the same error if I run the following (with the file essai.html in the same folder as pandoc.exe):
pandoc essai.html -o essai.pdf
Do you have any idea what causes my problem? (To be precise, the name of the file I want to convert is correct.)
Remark: My original goal was to create a PDF faithful to the beautiful HTML file generated by IPython Notebook via pandoc, but I encounter the same kind of problem when I try to convert an .ipynb file to PDF with nbconvert.
I finally solved my problem by giving the full paths to my files. (In the end I used wkhtmltopdf, which is simpler to use and gives a good result.)
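
The full-path fix can be made explicit in a small helper that resolves the input file before pandoc is called. This is a sketch under assumptions: the function name resolve_input is hypothetical. The background is that pandoc looks up a relative input file name against the current working directory, while --data-dir only tells it where to find its own data files such as templates.

```python
from pathlib import Path


def resolve_input(path, base_dir=None):
    """Return an absolute path to an existing input file, or raise."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    candidate = Path(path)
    if not candidate.is_absolute():
        # Relative names are interpreted against base, not the data directory.
        candidate = base / candidate
    candidate = candidate.resolve()
    if not candidate.is_file():
        raise FileNotFoundError(f"input file not found: {candidate}")
    return candidate
```

The resolved path can then be passed to pandoc in place of the bare file name, so a missing file is reported before the conversion starts.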
