With Pandoc, how to convert between different formats with additional rules? - pandoc

I have some existing Mediawiki format texts that contain categories tokens like
[[Category:XXX]]
[[Category:YYY]]
I'd like to convert them to Markdown texts. The basic command for doing that with Pandoc is
pandoc -f mediawiki -t markdown -s mytext.mediawiki -o mytext.md
The resultant Markdown text is mostly usable except that it converts the category tokens to
<Category:XXX> <Category:YYY>
which isn't really what I need. Instead, I need
[[!tag XXX YYY]]
because I'm using the resultant Markdown files as source files in a special content management system called Ikiwiki, which has its own idiosyncratic format for tags. How can I do that with Pandoc?

It's probably easiest to do this as a second step with a search and replace on <Category:XXX>. Note that pandoc without the -o option writes to standard output, so you can pipe it directly to some custom post-processing script.
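A post-processing step of this kind might look like the following minimal Python sketch. The function name categories_to_ikiwiki_tags is made up for illustration, and the sketch assumes that category names contain no spaces and that all tags should be collected into a single [[!tag ...]] line at the end of the text.

```python
import re

# Matches the autolinks pandoc emits for category tokens, e.g. <Category:XXX>.
CATEGORY_RE = re.compile(r"<Category:([^>]+)>")


def categories_to_ikiwiki_tags(markdown):
    """Replace all <Category:NAME> tokens with one trailing [[!tag ...]] line."""
    names = CATEGORY_RE.findall(markdown)
    if not names:
        return markdown
    stripped = CATEGORY_RE.sub("", markdown)
    # Drop the whitespace left behind where the tokens used to be.
    lines = [line.rstrip() for line in stripped.splitlines()]
    body = "\n".join(lines).rstrip()
    return body + "\n\n[[!tag " + " ".join(names) + "]]\n"


sample = "Some text.\n\n<Category:XXX> <Category:YYY>\n"
print(categories_to_ikiwiki_tags(sample), end="")
```

To use it in a pipe, one would pass sys.stdin.read() to the function and write the result to standard output.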
[[Category:XXX]] is converted by pandoc internally to a link along the lines of Category:XXX (try pandoc -f mediawiki -t native).
So generally, additional rules for elements are implemented through custom scripts that match on Pandoc's internal data types; see Pandoc scripting. You could match on those kinds of links. It's more work (the first time), but it makes sure you don't replace false positives.
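As a rough illustration of matching on the internal representation, here is a Python sketch that works on pandoc's JSON output (pandoc -t json) using only plain dictionaries. It assumes the JSON shape used by recent pandoc versions, where a link is {"t": "Link", "c": [attr, inlines, [url, title]]}, and it assumes that the category links sit in a paragraph of their own. The function names are hypothetical.

```python
def is_category_link(inline):
    """True for a Link inline whose target URL starts with 'Category:'."""
    # Assumed shape: {"t": "Link", "c": [attr, inlines, [url, title]]}
    return inline.get("t") == "Link" and inline["c"][2][0].startswith("Category:")


def rewrite_categories(doc):
    """Drop paragraphs made only of category links; append one [[!tag ...]]."""
    names, kept = [], []
    for block in doc["blocks"]:
        inlines = block["c"] if block.get("t") == "Para" else []
        links = [i for i in inlines if is_category_link(i)]
        others = [i for i in inlines
                  if not is_category_link(i)
                  and i.get("t") not in ("Space", "SoftBreak")]
        if links and not others:
            names += [l["c"][2][0][len("Category:"):] for l in links]
        else:
            # Paragraphs that mix text and category links are left untouched.
            kept.append(block)
    if names:
        tag = "[[!tag " + " ".join(names) + "]]"
        # Raw markdown is meant to be emitted as-is by the markdown writer.
        kept.append({"t": "Para", "c": [{"t": "RawInline", "c": ["markdown", tag]}]})
    doc["blocks"] = kept
    return doc


def link(name):
    """Build a sample category link for the demonstration below."""
    url = "Category:" + name
    return {"t": "Link", "c": [["", [], []], [{"t": "Str", "c": url}], [url, ""]]}


doc = {"blocks": [
    {"t": "Para", "c": [{"t": "Str", "c": "Text"}]},
    {"t": "Para", "c": [link("XXX"), {"t": "SoftBreak"}, link("YYY")]},
]}
print(rewrite_categories(doc)["blocks"][-1]["c"][0]["c"][1])
```

In a real pipeline the document would come from pandoc -f mediawiki -t json, be loaded with json.load, rewritten, dumped again, and fed to pandoc -f json -t markdown.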

Related

Why does Pandoc not include abstract when LaTeX is converted to plain text?

I have a LaTeX document (example.tex):
\documentclass{article}
\begin{document}
\begin{abstract}
This does not get exported.
\end{abstract}
But this is fine.
\end{document}
and I want to export it to plain text:
pandoc --to=plain example.tex
However, only the body of the document gets exported, the abstract is excluded from the export.
How do I make pandoc export my abstract too?
I believe this is answered somewhere in the docs, but I have been unable to find it for quite a while.
This is a fun question that goes deep into the way pandoc represents documents. tl;dr: abstracts are treated as metadata, which is not included in plain output by default.
Here are the details: pandoc uses a simple document model that consists of the main text plus additional metadata such as title and author. We can see this by running pandoc -s -t gfm example.tex, which outputs GitHub Flavored Markdown and includes the metadata in a YAML block.
---
abstract: |
  This does not get exported.
---
But this is fine.
The abstract is there, but in the YAML metadata. Whether and how the metadata is included in the output depends on the output format and the template that pandoc uses. Pandoc fills the template with data from the document. One can check the default template that's used by running pandoc -D plain, but there is some magic behind the scenes that adds special variables, so the output is only moderately instructive.
What's important for us is that we can use a custom template to include the abstract:
ABSTRACT
--------
$abstract$
$body$
Then if we run pandoc with
pandoc -t plain --template=OUR-TEMPLATE.plain example.tex
we get what we want:
ABSTRACT
--------
This does not get exported.
But this is fine.
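The effect of the template can be mimicked with a toy model in Python. This is not pandoc's template engine, only a sketch of the idea that output is produced by filling $name$ placeholders from the document's metadata and body; fill_template and the two template strings are invented for illustration.

```python
import re


def fill_template(template, variables):
    """Replace each $name$ with variables[name], or with nothing if unset."""
    return re.sub(r"\$(\w+)\$", lambda m: variables.get(m.group(1), ""), template)


# The document as a set of variables: metadata plus the rendered body.
document = {
    "abstract": "This does not get exported.",
    "body": "But this is fine.",
}

body_only_template = "$body$\n"
custom_template = "ABSTRACT\n--------\n$abstract$\n$body$\n"

print(fill_template(body_only_template, document))
print(fill_template(custom_template, document))
```

A template that never mentions $abstract$ simply has no place to put it, which is why the abstract disappears from the plain output.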

Can pandoc generate a bibliography with the references in order of citation?

Pandoc noob here. I am trying to convert a LaTeX file into a Word document for submission to a picky journal. They are requiring that my references appear in the bibliography in the order in which they are cited. This is no problem in LaTeX, but when I use Pandoc to convert to Word my references appear in alphabetical order. I am using the basic command:
pandoc my.tex --bibliography=my.bib -o my.docx
Is there any way to force Pandoc to print the references in the order in which they appear in-text? Ideally, the references would appear in-text as numbers (bracketed, superscripted, I don't care) and the list of references would be numbered accordingly.
Any help in the direction of reducing the amount of manual work I will have to do is much appreciated.

Broken cross-document links with pandoc when converting markdown to other formats

When converting markdown files with cross-document links to html, docx or pdf, the links get broken in the process.
I use pandoc 1.19.1 and MiKTeX.
This is my testcase:
File1: doc1.md
[link1](/doc2.md)
File2: doc2.md
[link2](/doc1.md)
The result in html with this call to pandoc:
pandoc doc1.md doc2.md -o test.html
looks like this:
<p>link1 link2</p>
In the PDF a link is created, but it does not work. The docx export behaves the same way.
I would have assumed that when multiple files are processed and concatenated into the same output file, the result would contain page-internal links, such as anchor links in HTML output. Instead, each link is created in the output file exactly as it was in the input files. Even the original file extension .md is preserved in the created links.
What am I doing wrong?
My problem looks a bit like this:
pandoc command line parameters for resolving internal links
In the comments of this question the bug is said to be fixed by a pull request in May. But the bug still seems to exist.
Greetings
Georg
I had a similar problem when trying to export a Gitlab wiki to PDF. There, links between pages look like filename-of-page#anchor-name and links within a page look like #anchor-name. I wrote a (finicky and fragile) pandoc filter that solved the problem for me; maybe it is useful to others.
Example files
To explain my solution I'll have two test files, 101-first-page.md:
# First page // Gitlab automatically creates an anchor here named #first-page
Some text.
## Another section // Gitlab automatically creates an anchor here named #another-section
A link to the [first section](#first-page)
and 102-second-page.md:
# Second page // Gitlab automatically creates an anchor here named #second-page
Some text and [a link to the first page](101-first-page#first-page).
When concatenating them to render as one document in pandoc, links between pages break as anchors change. Below is the concatenated file with the anchors in comments.
# First page // anchor=#first-page
Some text.
## Another section // anchor=#another-section
A link to the [first section](#first-page)
# Second page // anchor=#second-page
Some text and [a link to the first page](101-first-page#first-page). // <-- this anchor no longer exists.
The link from the second to the first page breaks as the link target is incorrect.
Solution
By pre-processing all markdown files first individually via a pandoc filter, and then concatenating the resulting json files I was able to get all links working.
Requirements
pandoc
latex
python
pandocfilters
Every file should start with a level 1 header that matches the filename (except for the number at the beginning). E.g. the file 101-A file on the wiki.md should have a first level one header named A file on the wiki.
Filter
The filter itself (together with the pandoc script) is available in this gist.
What it does is:
It gets the label of the first level 1 header, e.g. first-page
It prepends that label to all other labels in the same file, e.g. first-page-another-section.
It renames all links to the same file such that the prefix is taken into account, e.g. #first-page becomes #first-page-first-page.
It renames all links to other files such that the (assumed) prefix of the other files is taken into account, e.g. 101-first-page#first-page becomes #first-page-first-page.
After running every markdown file through this filter individually and converting each one to a JSON file, the script concatenates the JSON files and converts the result to a PDF.
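The link-renaming rules above can be sketched in Python as follows. This is a reconstruction from the description, not the code from the gist; the function names are hypothetical, and the slug rule (lowercase, spaces to hyphens) is an assumption about how the header labels are derived from file names.

```python
import re


def prefix_from_page(page):
    """'101-first-page' -> 'first-page': drop the leading number, slugify."""
    name = re.sub(r"^\d+-", "", page)
    return re.sub(r"\s+", "-", name.strip()).lower()


def rewrite_link(url, own_prefix):
    """Map a wiki link to the anchor it would have in the combined document."""
    if re.match(r"^[a-z][a-z0-9+.-]*:", url, re.IGNORECASE):
        return url  # external link such as https://..., leave untouched
    page, sep, anchor = url.partition("#")
    if not sep:
        return url  # no anchor given; the described rules do not cover this
    prefix = own_prefix if page == "" else prefix_from_page(page)
    return "#" + prefix + "-" + anchor


# A link within first-page, then a link from second-page to first-page.
print(rewrite_link("#first-page", "first-page"))
print(rewrite_link("101-first-page#first-page", "second-page"))
```

Both example links end up pointing at #first-page-first-page, matching the renamed anchor of the first header.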
As the pandoc README states:
If multiple input files are given, pandoc will concatenate them all (with blank lines between them) before parsing.
So as far as pandoc's parser is concerned, it is one document, and you have to construct the links in your files as if they were all in one file; see also this answer for details.

Pandoc, Markdown to Doc, how to use variables?

AFAIK, variables can be defined in an external YAML file or inside the Markdown file in a header.
Then they can be used in the document. I have found examples with two different syntaxes:
$variable$ will convert variable to math mode, which is great (i.e. I want to keep that behaviour).
#{variable} does nothing.
Questions:
Is it possible to use variables in the pandoc conversion from markdown to .docx?
If so, how?
Pandoc variables can only be used in pandoc templates, not the document itself (there's an open issue about that).
For that you should check out a preprocessor like gpp or use a pandoc filter like pandoc-mustache or this lua-filter.
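To show what such a preprocessing step does in principle, here is a toy Python sketch that substitutes mustache-style {{name}} placeholders before the text is handed to pandoc. It is not the implementation of any of the tools named above; the function name and the placeholder syntax are assumptions for illustration. Math such as $x$ is left alone because it does not match the placeholder pattern.

```python
import re


def substitute(text, variables):
    """Replace {{name}} with variables[name]; leave unknown names as they are."""
    def repl(match):
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", repl, text)


source = "Release {{ version }} solves $x$ for {{unknown}}."
print(substitute(source, {"version": "1.2"}))
```

The substituted text can then be converted as usual, for example to .docx.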

Convert HTML and inline Mathjax math to LaTeX with pandoc ruby

I'm building a Rails app and I'm looking for a way to convert database entries with html and inline MathJax math (TeX) to LaTeX for pdf creation.
I found similar questions like mine:
Convert html mathjax to markdown with pandoc
How to convert HTML with mathjax into latex using pandoc?
and I see two options here:
Create a Haskell executable which leaves stuff like \(y=f(x)\) alone when converting html to LaTeX
Write a ruby method which does the following things:
Take the string and split it into an array with a regex (string.split(regex))
Loop through the resulting array and convert the parts that do not contain inline math to LaTeX with PandocRuby.html(string).to_latex
concatenate everything back together (array.join)
I would prefer the ruby method solution because I'm hosting my application on Heroku and I don't like to check binaries into git.
(Note: the pandoc binary is implemented this way: http://www.petekeen.net/introduction-to-heroku-buildpacks)
So my question is: what should the regex look like to split the string by \(math\).
E.g. string can look like this: text \(y=f(x) \iff \log_{10}(b)\) and \(a+b=c\) text
And for the sake of completeness, how should the Haskell script be written to leave \(math\) alone when converting to LaTeX, in case the ruby method is not a viable solution?
Get the very latest version of pandoc (1.12.2). Then you can do
pandoc -f html+tex_math_dollars+tex_math_single_backslash -t latex
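For the splitting approach asked about in the question, a regex along the following lines may work. The sketch is in Python rather than Ruby; the pattern itself should carry over, and Ruby's String#split likewise keeps captured groups in the result. It assumes that \(...\) spans are not nested and that no escaped \) occurs inside the math.

```python
import re

# Non-greedy match from \( up to the first following \); the capture group
# makes re.split keep the math spans in the result.
MATH_RE = re.compile(r"(\\\(.*?\\\))", re.DOTALL)


def split_inline_math(text):
    """Split text into alternating plain and \\(...\\) math parts."""
    return [part for part in MATH_RE.split(text) if part]


s = r"text \(y=f(x) \iff \log_{10}(b)\) and \(a+b=c\) text"
for part in split_inline_math(s):
    print(repr(part))
```

Parts that start with \( can then be passed through unchanged, and only the remaining parts need to be converted.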
