create table of contents in EPUB when converting from docx [pandoc] - pandoc

I want to automate the process of converting some EPUBs to EPUBs with a TOC.
I first convert them to docx and, with some tricks, change some paragraphs to Heading 1. When I convert them back to EPUB, the headings keep their heading style, but nothing appears in the table of contents.
If I create a Word document from scratch and use its default Heading 1, pandoc converts it to EPUB and creates the TOC as it should.
Now I want a way to convert pandoc styles to something like Word styles.
This is my process:
$ pandoc base.epub -o temp.docx
$ python do_magic_to_change_paragraphs_to_heading.py temp.docx # using docx package and I'm sure it's not the problem
$ pandoc temp.docx -o final.epub --toc

Related

Converting from docx to markdown how to get rid of span underline in links?

Since a recent pandoc update (I'm now at 2.2.1), the links in a docx document are converted to [<span class="underline">graphic novel hero</span>](https://www.amazon.com/exec/obidos/ASIN/1596432594/braipick-20), adding an unneeded span to link labels. Is there any black magic (besides adding a sed call to the pipeline) to get rid of them and return to pure commonmark?
The pandoc options I use are: pandoc -f docx --atx-headers --wrap=none --extract-media=. -t commonmark-smart myFile.docx
Thanks for clarifying!
If you use -t commonmark the spans that the docx-reader generates are converted to raw HTML, so you could use:
pandoc -t commonmark-raw_html
Alternatively, use the markdown-writer, which is more flexible in terms of extensions (but as of 2018 not yet 100%-commonmark-compliant):
pandoc -t markdown-bracketed_spans-raw_html-native_spans
See the MANUAL for more details.
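If you are stuck on a setup where changing the writer options is not possible, the post-processing route the question mentions (the sed call) can also be done in a few lines of Python. This is only a sketch: the function name strip_underline_spans is made up here, and it assumes the spans look exactly like the one in the question and are not nested.

```python
import re

# Matches <span class="underline">...</span> as emitted in the question.
# Assumption: spans are not nested inside each other.
UNDERLINE_SPAN = re.compile(r'<span class="underline">(.*?)</span>', re.DOTALL)


def strip_underline_spans(markdown: str) -> str:
    """Replace each underline span with just its inner text."""
    return UNDERLINE_SPAN.sub(r"\1", markdown)


sample = (
    '[<span class="underline">graphic novel hero</span>]'
    "(https://www.amazon.com/exec/obidos/ASIN/1596432594/braipick-20)"
)
print(strip_underline_spans(sample))
```

Dropping the raw_html extension as shown above is still the cleaner fix, since it avoids editing the output after the fact.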

Can I set command line arguments using the YAML metadata

Pandoc supports a YAML metadata block in markdown documents. This can set the title and author, etc. It can also manipulate the appearance of the PDF output by changing the font size, margin width and the frame sizes given to figures that are included. Lots of details are given here.
I'd like to use the metadata block to remember the command line arguments that I'm supposed to be using, such as --toc and --number-sections. I tried this, adding the following to the top of my markdown:
---
title: My Title
toc: yes
number-sections: yes
---
Then I used the command line:
pandoc -o guide.pdf articheck_guide.md
This did produce a table of contents, but didn't number the sections. I wondered why this was, and if there is a way I can specify this kind of thing from the document so that I don't need to add it on the command line.
YAML metadata are not passed to pandoc as arguments, but as variables. When you call pandoc on your MWE, it does not produce this:
pandoc -o guide.pdf articheck_guide.md --toc --number-sections
as we might think it would. Rather, it calls:
pandoc -o guide.pdf articheck_guide.md -V toc:yes -V number-sections:yes
Why, then, does your MWE produce a TOC? Because the default latex template makes use of a toc variable:
~$ pandoc -D latex | grep toc
$if(toc)$
\setcounter{tocdepth}{$toc-depth$}
So setting toc to any value should produce a table of contents, at least in latex output. In this template, there is no number-sections variable, so this one doesn't work. However, there is a numbersections variable:
~$ pandoc -D latex | grep number
$if(numbersections)$
Setting numbersections to any value will produce numbering in latex output with the default template:
---
title: My Title
toc: yes
numbersections: yes
---
The trouble with this solution is that it only works with some output formats. I thought I had read somewhere on the pandoc mailing list that we would soon be able to use metadata in YAML blocks as intended (i.e. as arguments rather than variables), but I can't find it anymore, so maybe it won't happen very soon.
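In the meantime, a small wrapper script can read the block itself and turn the keys into real command line flags. Below is a rough sketch of that idea, not something pandoc ships: the names FLAG_FOR_KEY, read_front_matter and build_pandoc_command are invented for this example, and the parser assumes flat key: value lines only (no lists, nesting or quoting). The flags themselves, --toc and --number-sections, are the real ones from the question.

```python
# Hypothetical mapping from front-matter keys to real pandoc flags.
FLAG_FOR_KEY = {
    "toc": "--toc",
    "number-sections": "--number-sections",
}

TRUE_VALUES = {"yes", "true", "on"}


def read_front_matter(text: str) -> dict:
    """Parse flat `key: value` lines from a leading --- ... --- block.

    Assumption: no nested YAML, lists or quoting. A real wrapper
    would use a proper YAML parser instead.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() in ("---", "..."):
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def build_pandoc_command(source: str, text: str, output: str) -> list:
    """Return the argv list for pandoc, adding flags switched on in the block."""
    fields = read_front_matter(text)
    command = ["pandoc", "-o", output, source]
    for key, flag in FLAG_FOR_KEY.items():
        if fields.get(key, "").lower() in TRUE_VALUES:
            command.append(flag)
    return command
```

The returned list can be handed to subprocess.run as is, so the document stays the single place where the options are recorded.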
Have a look at panzer (GitHub repository).
This was recently announced and released by Mark Sprevak: a piece of software that adds the notion of 'styles' to Pandoc.
It's basically a wrapper around Pandoc. It exploits the concept of YAML metadata blocks to the maximum.
The 'styles' provide a way to set all options for a Pandoc document conversion process with one line ("I want this document to be an article/CV/notes/letter.").
You can regard this as more general abstraction than Pandoc templates. Styles are combinations of...
...Pandoc command line options,
...metadata settings,
...templates,
...instructions to run filters, and
...instructions to run pre/postprocessors.
These settings can be customized on a per output type as well as a per document basis. Styles can be...
...combined and
...can bear inheritance relations to each other.
panzer styles simplify Makefiles: they bundle everything concerning the look of a document in one place -- the YAML metadata (a block in the Markdown file, or a separate file).
You just add one line of metadata (style: ...) to your document, and it will be treated as a letter/article/CV/notebook or whatever.

How can I suppress the date when using pandoc to convert md to pdf?

I would like to create a simple pdf file from a markdown file with a title and author but no date. I cannot figure out how to suppress the date without having to edit an intermediate tex file.
---
title: Test Doc
author: My Name
---
# Some Heading Here
Text here.
When you run the command pandoc test.md -o test.pdf, the date always appears in the PDF. I have tried setting the date: field in the YAML block to all sorts of spaces, blanks, and other combinations, but cannot figure out how to get it to be blank.
Thank you.
Pandoc uses templates. To generate PDFs, by default it uses a LaTeX template, which you can print with pandoc -D latex. In an older pandoc version, this template contained:
$if(date)$
\date{$date$}
$endif$
which causes your issue because LaTeX's \maketitle falls back to today's date when no \date{} command is given. So either upgrade your pandoc version or modify your template manually to contain just
\date{$date$}
or use ConTeXt instead of LaTeX:
pandoc -s -t context test.md -o test.tex && context test.tex
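If you go the template route, the manual edit can be scripted. Here is a minimal sketch, assuming the old template contains the three-line block exactly as quoted above. The function name force_date_command and the file name custom.latex are made up for this example.

```python
# The conditional block from the older default template, as quoted above.
OLD_DATE_BLOCK = "$if(date)$\n\\date{$date$}\n$endif$"
# An unconditional \date{...}: with no date set it renders as \date{},
# which tells LaTeX to print no date at all.
NEW_DATE_LINE = "\\date{$date$}"


def force_date_command(template: str) -> str:
    """Replace the conditional date block with an unconditional one.

    Templates that do not contain the old block are returned unchanged.
    """
    return template.replace(OLD_DATE_BLOCK, NEW_DATE_LINE)
```

To use it, save the default template with pandoc -D latex > custom.latex, run the function over the file contents, write the result back, and then call pandoc test.md -o test.pdf --template custom.latex.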

What can I control with YAML header options in pandoc?

Only by chance did I see an example document using the toc: true line in their YAML header options in a Markdown file to be processed by Pandoc. And the Pandoc docs didn't mention this option to control table of contents using the YAML header. Furthermore, I see somewhat arbitrary lines in example documents on the same Pandoc readme site.
Main question:
What Pandoc options are available using the YAML header?
Meta-question:
What determines the available Pandoc options that are available to set using the YAML header?
Note: my workflow is to write Markdown files (.md) and process them through Pandoc to get PDF files, for hierarchically organized manuscripts with math. For example:
pandoc --standalone --smart \
--from=markdown+yaml_metadata_block \
--filter pandoc-citeproc \
my_markdown_file.md \
-o my_pdf_file.pdf
Almost everything set in the YAML metadata has an effect only through the pandoc template in use.
Pandoc templates may contain variables. For example in your HTML template, you could write:
<title>$title$</title>
These template variables can be set with the --variable KEY[=VAL] option.
However, they are also set from the document metadata, which in turn can be set either by using:
the --metadata KEY[=VAL] option,
a YAML metadata block, or
the --metadata-file option.
The --variable option inserts strings verbatim into the template, while --metadata escapes strings. Strings in YAML metadata (also when using --metadata-file) are interpreted as markdown, which you can circumvent by using pandoc markdown's generic raw attributes. For example, for HTML output:
`<script>alert()</script>`{=html}
See this table for a schematic:
| | --variable | --metadata | YAML metadata and --metadata-file |
|------------------------|-------------------|-------------------|-----------------------------------|
| values can be… | strings and bools | strings and bools | also YAML objects and lists |
| strings are… | inserted verbatim | escaped | interpreted as markdown |
| accessible by filters: | no | yes | yes |
To answer your question: the template determines what fields in the YAML metadata block have an effect. To view, for example, the default latex template, use:
$ pandoc -D latex
To see some variables that are set automatically by pandoc, see the Manual. Finally, other behaviours of pandoc (such as markdown extensions, etc) can only be set as command-line options (except when using a wrapper script).
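To make the verbatim versus escaped distinction from the table concrete, here is a toy renderer. It is not pandoc's template engine, only a sketch of the idea for HTML output: the function name render is invented, it handles nothing but plain $name$ placeholders, and letting variables win over metadata when both are set is an arbitrary choice made for this example.

```python
import html
import re

# Plain $name$ placeholders only; conditionals and loops are ignored.
PLACEHOLDER = re.compile(r"\$([A-Za-z][\w-]*)\$")


def render(template: str, variables: dict, metadata: dict) -> str:
    """Fill $name$ placeholders: variables verbatim, metadata HTML-escaped."""

    def substitute(match):
        name = match.group(1)
        if name in variables:
            return variables[name]              # inserted verbatim
        if name in metadata:
            return html.escape(metadata[name])  # escaped for HTML output
        return ""                               # unset fields render empty

    return PLACEHOLDER.sub(substitute, template)


template = "<title>$title$</title>"
print(render(template, {"title": "A <b>"}, {}))  # markup passes through as is
print(render(template, {}, {"title": "A <b>"}))  # markup is escaped
```

The same string ends up as live markup in the first call and as harmless text in the second, which is why untrusted values are safer passed as metadata.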
It is a rather long list that you can browse by running man pandoc in the command line and navigating to "Variables set by pandoc" section under "TEMPLATES."
The top of the list includes the following among many other options:
Variables set by pandoc
    Some variables are set automatically by pandoc. These vary somewhat depending on the
    output format, but include metadata fields as well as the following:
    title, author, date
        allow identification of basic aspects of the document. Included in PDF metadata
        through LaTeX and ConTeXt. These can be set through a pandoc title block, which
        allows for multiple authors, or through a YAML metadata block:
            ---
            author:
            - Aristotle
            - Peter Abelard
            ...
    subtitle
        document subtitle; also used as subject in PDF metadata
    abstract
        document summary, included in LaTeX, ConTeXt, AsciiDoc, and Word docx
    keywords
        list of keywords to be included in HTML, PDF, and AsciiDoc metadata; may be
        repeated as for author, above
    header-includes
        contents specified by -H/--include-in-header (may have multiple values)
    toc
        non-null value if --toc/--table-of-contents was specified
    toc-title
        title of table of contents (works only with EPUB and docx)
    include-before
        contents specified by -B/--include-before-body (may have multiple values)
    include-after
        contents specified by -A/--include-after-body (may have multiple values)
    body
        body of document
You can see the pandoc documentation for a clue: http://pandoc.org/getting-started.html
But to know exactly where a field is used, you can look at pandoc's template sources: https://github.com/jgm/pandoc-templates
For example, for the html5 output the file is: https://github.com/jgm/pandoc-templates/blob/master/default.html5
Here's a section of the code:
<title>$if(title-prefix)$$title-prefix$ - $endif$$pagetitle$</title>
As you can see, it has title-prefix and pagetitle.
You can look at the documentation, but the best solution is to look at the source code for the version you are using.
The pandoc main page now contains a list of options and explanations for them:
https://pandoc.org/MANUAL.html#variables
It seems to be the same list that man pandoc shows.

How to specify numbered sections in Pandoc's front matter?

I would like to specify numbered sections via Pandoc's support for YAML front matter. I know that the flag for the command-line usage is --number-sections, but something like
---
title: Test
number-sections: true
---
doesn't produce the desired result. I know that I am close because you can do this with the geometry package (e.g. geometry: margin=2cm). I wish there was a definitive guide to how Pandoc handles YAML front matter. For example, the following is very useful (avoids templates), but its discoverability is low:
header-includes:
- \usepackage{some latex package}
In order to turn on numbered-sections in latex output you need to use numbersections in your YAML block. If you ever want to "discover" things like this with pandoc just poke around the templates:
$ grep -i number default.latex
$if(numbersections)$
$ grep -i number default.html*
$
As you can see this option does not work with html.
Markdown and YAML I tested with:
---
title: Test
numbersections: true
---
# blah
Text is here.
## Double Blah
Twice the text is here
If you need it to work with more than beamer, latex, context and opendoc, you will need to file a bug on GitHub.
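The grep trick above can also be wrapped in a small helper that lists every name a template tests with $if(...)$. This is only a sketch: the function names template_conditions and matching are made up, and it looks at $if(...)$ conditionals only, not at every place a variable is interpolated.

```python
import re

# Captures the name inside $if(name)$.
IF_VARIABLE = re.compile(r"\$if\(([^)]+)\)\$")


def template_conditions(template: str) -> list:
    """Return the sorted, de-duplicated names tested by $if(...)$."""
    return sorted(set(IF_VARIABLE.findall(template)))


def matching(template: str, needle: str) -> list:
    """Case-insensitive filter over those names, like `grep -i needle`."""
    return [
        name
        for name in template_conditions(template)
        if needle.lower() in name.lower()
    ]
```

Feed it the text printed by pandoc -D latex (or the contents of any other default template) and search for "number" to see which spelling that template expects.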
To show section numbers in the PDF output, there are two choices.
In YAML front matter
Add the following setting to the beginning of the markdown file:
---
numbersections: true
---
On the command line
We can also use a command line option to generate a PDF with numbered sections. According to the Pandoc documentation, the correct option is --number-sections or simply -N:
pandoc test.md -o test.pdf --number-sections
# pandoc test.md -o test.pdf -N

Resources