Why does Pandoc not include abstract when LaTeX is converted to plain text?

I have a latex document (example.tex):
\documentclass{article}
\begin{document}
\begin{abstract}
This does not get exported.
\end{abstract}
But this is fine.
\end{document}
And want to export it to plain text:
pandoc --to=plain example.tex
However, only the body of the document gets exported; the abstract is left out.
How do I make pandoc export my abstract too?
I believe this is answered somewhere in the docs, but I have been unable to find where for quite a while.

This is a fun question and goes deeply into the way pandoc represents documents. TL;DR: abstracts are treated as metadata, which is not included in plain output by default.
Here are the details: pandoc uses a simple document model which is built from the main text plus additional metadata like title and author. We can see this when running pandoc -s -t gfm example.tex, which will output GitHub Flavored Markdown and include the metadata in a YAML block.
---
abstract: |
  This does not get exported.
---
But this is fine.
The abstract is there, but in the YAML metadata. Whether and how the metadata is included in the output depends on the output format and the template that pandoc uses. Pandoc fills the template with data from the document. One can check the default template that's used by running pandoc -D plain, but there is some magic behind the scenes that adds special variables, so the output is only moderately instructive.
What's important for us is that we can use a custom template to include the abstract:
ABSTRACT
--------
$abstract$
$body$
Then if we run pandoc with
pandoc -t plain --template=OUR-TEMPLATE.plain example.tex
we get what we want:
ABSTRACT
--------
This does not get exported.
But this is fine.

Related

Writing thanks and keywords from YAML header in Markdown file to docx document through Pandoc conversion

After reading the online Pandoc manual and browsing pages such as knitr-pandoc-article-template-supporting-keywords and Keywords in Pandoc 2, I haven't figured out yet how to write the values of the thanks and keywords YAML fields from the header of a Markdown file to a docx document through Pandoc conversion. My working version of Pandoc is 2.18.
I have thought that a Lua filter might be the way to proceed, but my knowledge of both Lua and the Pandoc framework at the programmatic level is quite limited.
Any help in this regard would be greatly appreciated.
Although my actual setup is more complex, the following Markdown lines with a YAML header should do for an MWE:
---
title: The Title
author: The Author
thanks: |
  The author wishes to thank certain support.
abstract: |
  This is the abstract.
keywords: [one, two, three, four]
---
# A Heading
Text body.
The answer to this depends a little on how you'd want the thanks to be viewed. E.g., if you'd like it to be presented as a footnote to the author, you'd use a Lua filter like this:
function Meta (meta)
  -- Append the thanks text as a footnote attached to the author.
  meta.author = meta.author .. {pandoc.Note(meta.thanks)}
  return meta
end
The approach can be adapted to match different requirements.

Pandoc, Markdown to Doc, how to use variables?

AFAIK, variables can be defined in a YAML external file or inside the Markdown file in a header.
Then they can be used in the document. I have found examples with two different syntaxes:
$variable$ will convert variable to math mode, which is great (i.e. I want to keep that behaviour).
#{variable} does nothing.
Questions:
Is it possible to use variables in the pandoc conversion from markdown to .docx?
If so, how?
Pandoc variables can only be used in pandoc templates, not the document itself (there's an open issue about that).
For that you should check out a preprocessor like gpp or use a pandoc filter like pandoc-mustache or this lua-filter.
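As a rough illustration of the preprocessor route, here is a minimal Ruby sketch that fills #{name} placeholders in the body from the YAML header before the file is handed to pandoc. This is not a pandoc feature: the function names split_front_matter and expand_variables are made up for this example, and it assumes a flat header with simple scalar values. $...$ math and unknown placeholders are left untouched.

```ruby
require "yaml"

# Split a leading YAML header from the body.
# Returns [metadata_hash, body]; the hash is empty when there is no header.
def split_front_matter(source)
  match = source.match(/\A---\s*\n(.*?)\n---\s*\n/m)
  return [{}, source] unless match

  [YAML.safe_load(match[1]) || {}, match.post_match]
end

# Replace #{name} in the body with the header value of the same name.
# Placeholders without a matching header key are kept as they are.
def expand_variables(source)
  meta, body = split_front_matter(source)
  header = source[0, source.length - body.length]
  expanded = body.gsub(/#\{(\w+)\}/) { meta.key?($1) ? meta[$1].to_s : $& }
  header + expanded
end
```

The expanded text can then be passed to pandoc as usual, for example by writing it to a temporary file or piping it into pandoc -f markdown -o out.docx.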

With Pandoc, how to convert between different formats with additional rules?

I have some existing Mediawiki format texts that contain categories tokens like
[[Category:XXX]]
[[Category:YYY]]
I'd like to convert them to Markdown texts. The basic command for doing that with Pandoc is
pandoc -f mediawiki -t markdown -s mytext.mediawiki -o mytext.md
The resultant Markdown text is mostly usable except that it converts the category tokens to
<Category:XXX> <Category:YYY>
which isn't really what I need. Instead, I need
[[!tag XXX YYY]]
because I'm using the resultant Markdown files as source files in a special content management system called Ikiwiki which has its idiosyncratic format for tags. How to do that with Pandoc?
It's probably easiest to do this as a second step with a search and replace on <Category:XXX>. Note that pandoc without the -o option writes to standard output, so you can pipe it directly into a custom post-processing script.
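A post-processing step of that kind could look roughly like the following Ruby sketch. The function name categories_to_tags is hypothetical, and the sketch assumes that category names contain neither > nor spaces and that a single tag directive at the end of the file is acceptable.

```ruby
# Matches the <Category:NAME> autolinks that pandoc emits in Markdown.
CATEGORY = /<Category:([^>]+)>/

# Remove every <Category:NAME> and append one Ikiwiki tag directive
# that lists all collected names. Text without categories is returned as is.
def categories_to_tags(markdown)
  tags = markdown.scan(CATEGORY).flatten.map(&:strip)
  return markdown if tags.empty?

  # Drop the autolinks, then the trailing whitespace they leave behind.
  body = markdown.gsub(CATEGORY, "").gsub(/[ \t]+$/, "").rstrip
  "#{body}\n\n[[!tag #{tags.join(' ')}]]\n"
end
```

To use it in a pipeline, read pandoc's output from standard input and print categories_to_tags($stdin.read).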
[[Category:XXX]] is converted by pandoc internally to a link along the lines of Category:XXX (try pandoc -f mediawiki -t native).
So generally, additional rules for elements are implemented through custom scripts that match on Pandoc's internal data types, see Pandoc scripting. So you could match on those kinds of links. It's more work (the first time), but it makes sure you don't replace false positives.

Convert HTML and inline Mathjax math to LaTeX with pandoc ruby

I'm building a Rails app and I'm looking for a way to convert database entries with html and inline MathJax math (TeX) to LaTeX for pdf creation.
I found similar questions like mine:
Convert html mathjax to markdown with pandoc
How to convert HTML with mathjax into latex using pandoc?
and I see two options here:
Create a Haskell executable which leaves stuff like \(y=f(x)\) alone when converting html to LaTeX
Write a ruby method which does the following things:
Take the string and split it into an array with a regex (string.split(regex))
loop through the resulting array and convert only the parts that do not contain inline math to LaTeX with PandocRuby.html(string).to_latex
concatenate everything back together (array.join)
I would prefer the ruby method solution because I'm hosting my application on Heroku and I don't like to checkin binaries into git.
(Note: the pandoc binary is implemented this way: http://www.petekeen.net/introduction-to-heroku-buildpacks)
So my question is: what should the regex look like to split the string by \(math\)?
E.g. string can look like this: text \(y=f(x) \iff \log_{10}(b)\) and \(a+b=c\) text
And for the sake of completeness, how should the Haskell script be written to leave \(math\) alone when converting to LaTeX and the ruby method is not a possible solution?
Get the very latest version of pandoc (1.12.2). Then you can do
pandoc -f html+tex_math_dollars+tex_math_single_backslash -t latex
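For the pure-Ruby route described in the question, the split could be sketched as follows. The names INLINE_MATH and convert_around_math are made up for this example, and the converter is passed in as a block, so the sketch does not depend on any particular gem; in the question's setup the block would call PandocRuby.html(chunk).to_latex. It assumes math spans are not nested and that \) never appears inside a span.

```ruby
# One inline math span \( ... \), matched non-greedily.
# The capture group makes String#split keep the math spans in the result.
INLINE_MATH = /(\\\(.*?\\\))/m

# Pass every non-math chunk to the given block and leave math chunks as they are.
def convert_around_math(text)
  text.split(INLINE_MATH).map { |part|
    if part.empty? || part.match?(/\A\\\(.*\\\)\z/m)
      part          # math span (or empty chunk): keep verbatim
    else
      yield(part)   # ordinary text: hand it to the converter
    end
  }.join
end
```

A converter may trim or add whitespace around each chunk, so the joined result can need some cleanup at the chunk boundaries.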

Markdown to plain text in Ruby?

I'm currently using BlueCloth to process Markdown in Ruby and show it as HTML, but in one location I need it as plain text (without some of the Markdown). Is there a way to achieve that?
Is there a markdown-to-plain-text method? Is there an html-to-plain-text method that I could feed the result of BlueCloth?
The Redcarpet gem has a Redcarpet::Render::StripDown renderer which "turns Markdown into plaintext".
Copy and modify it to suit your needs.
Or use it like this:
require 'redcarpet/render_strip'
Redcarpet::Markdown.new(Redcarpet::Render::StripDown).render(markdown)
Converting HTML to plain text with Ruby is not a problem, but of course you'll lose all markup. If you only want to get rid of some of the Markdown syntax, it probably won't yield the result you're looking for.
The bottom line is that unrendered Markdown is intended to be used as plain text, therefore converting it to plain text doesn't really make sense. All Ruby implementations that I have seen follow the same interface, which does not offer a way to strip syntax (only including to_html, and text, which returns the original Markdown text).
It's not ruby, but one of the formats Pandoc now writes is 'plain'. Here's some arbitrary markdown:
# My Great Work
## First Section
Here we discuss my difficulties with [Markdown](http://wikipedia.org/Markdown)
## Second Section
We begin with a quote:
> We hold these truths to be self-evident ...
then some code:
    #! /usr/bin/bash
That's *all*.
(Not sure how to turn off the syntax highlighting!) Here's the associated 'plain':
My Great Work
=============
First Section
-------------
Here we discuss my difficulties with Markdown
Second Section
--------------
We begin with a quote:
  We hold these truths to be self-evident ...
then some code:
    #! /usr/bin/bash
That's all.
You can get an idea of what it does with the different elements it parses out of documents from the definition of plainify in pandoc/blob/master/src/Text/Pandoc/Writers/Markdown.hs in the GitHub repository; there is also a tutorial that shows how easy it is to modify the behavior.
