Compile multiple files into one with title blocks - pandoc

I'd like to know how to compile multiple pandoc files into one output document, where each input file has a title block.
E.g. suppose I have two files:
ch1.md:
% Chapter 1
% John Doe
% 1 Jan 2014
Here is chapter 1.
ch2.md:
% Chapter 2
% Jane Smith
% 3 Jan 2014
Here is chapter 2.
Typically with multiple input files you can compile them by providing them to pandoc:
pandoc ch1.md ch2.md --standalone -o output.html
However pandoc concatenates the input files before compiling, meaning only the first title block (from ch1.md) is styled appropriately.
I would like each title block to be styled appropriately (e.g. in html, the first line of the title block is styled with <h1 class="title">, the second <h2 class="author"> and so on).
(Note: I have also tried compiling each chapter as standalone separately, then concatenating these together using pandoc. This removes the title styling for chapters after 1, though keeps styling for the authors/date).
Why? So that I can:
compile each chapter in its own separate document and the author/title/date is marked up appropriately
compile the entire document together and author/title/date is marked up appropriately for each chapter (can use the --chapters option)
I could just specify the heading with '#' (h1), author with '##' (h2), and date with '###' (h3) in each chapter file directly but this means pandoc doesn't "know" what the title/heading/date of my document are, so (e.g.) if I compile to latex it won't use the \date{} or \author{} tags appropriately.

I wrote a pandoc filter that, when run on each individual chapter's file, inserts the title block as headings (level 1 for title, level 2 for author, level 3 for date), which is what the HTML writer does.
This lets you run pandoc on each chapter individually (to produce the pandoc'd output plus the formatted title block), and then run pandoc on all the chapters together to compile the single document.
The filter is here on gist (I take no responsibility for malfunctioning code, etc): https://gist.github.com/mathematicalcoffee/e4f25350449e6004014f
You could modify it if you want different formatting. For example, as written, the author and date appear in the table of contents because they are headings, which is not quite right (though that is a separate problem, as it also happens with the default HTML writer).
My workflow is now something like this:
FORMAT=latex # as understood by -t <format> in pandoc
FLAGS=--toc # other flags for pandoc, --smart, etc
OUT=pdf # output extension
for f in Chapter*.md; do \
pandoc $FLAGS -t $FORMAT --filter ./chapter.hs "$f"; \
echo ""; \
done | pandoc $FLAGS --standalone -o thesis.$OUT
where I've chmod +x chapter.hs and it's in the current directory.
(I additionally have a title.txt, containing the title block for the entire thesis rather than for a single chapter, that I put at the front.)
I received some help from the pandoc google group which was great.
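For a rough idea of what the filter's output looks like, here is a small awk-based approximation. This is not the code from the gist: it works on the raw text rather than on pandoc's document model, it only handles single-line title/author/date entries, and the function name titleblock_to_headings is made up for this sketch.

```shell
# Rewrite a leading "% title / % author / % date" block as headings
# (level 1, 2 and 3). Hypothetical helper, a text-level approximation only.
titleblock_to_headings() {
  awk '
    BEGIN { in_block = 1; level = 0 }
    # Only the first (up to three) "% " lines at the top of the file count.
    in_block && /^% / && level < 3 {
      level++
      hashes = substr("###", 1, level)
      sub(/^% */, "")        # drop the "% " prefix
      print hashes " " $0
      print ""               # blank line so each heading stands alone
      next
    }
    # Anything else ends the title block and is passed through unchanged.
    { in_block = 0; print }
  ' "$1"
}
```

It could stand in for the filter step in a loop like the one above, e.g. `for f in Chapter*.md; do titleblock_to_headings "$f"; echo ""; done | pandoc --standalone -o thesis.html`. Unlike the filter, this throws the metadata away, so pandoc no longer knows the title, author or date of each chapter.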

You can't do this with the % title blocks, but you can do it with the new YAML title blocks.
Start each document like this:
---
title: Chapter One
author: Me
date: June 4
...
When the documents are concatenated together, the first value set will take precedence over the others, so the subsequent YAML lines using the same parameter (e.g. "title:") will be ignored. (See the readme under "Extension: yaml_metadata_block".)

Related

Smart Quotes and Ligatures in pandoc

I have a file text.txt which contains very basic latex/markdown. For example, it might be the following.
Here is some basic maths: $f(x) = ax + b$ defines a straight line, often called a "linear" function---but it's not _actually_ a linear function, eg $f(0) \ne 0$.
I would like to convert this into html using WebTeX. However, I don't want smart quotes (" should be output as a plain straight quote, not curled at either end) or smart dashes (--- should stay literally three dashes, not become an em-dash).
It seems that the smart option is good for this: pandoc manual, github 1, github 2. However, I can't quite work out the correct syntax. I have tried, for example, the following.
pandoc text.txt -f markdown-smart -t markdown-smart -s --webtex -o tex.html
Unfortunately this doesn't work.
I solved this while writing the question, so I'll post the answer below! (Spoiler alert: simply remove -t markdown-smart.)
Simply remove -t markdown-smart.
pandoc text.txt -f markdown-smart -s --webtex -o tex.html
I believe that -t markdown-smart means "write markdown, with the smart extension disabled". We are not trying to output markdown, but rather html. If you look at the output of the version with -t, you can see the markdown code for embedding the various images; pasted into a markdown editor, they should show up.
To get html, simply remove this.

Is there a way to position new text within an existing pdf such that it's still editable afterwards?

I want to create a pdf with pictures and text formatted a certain way — they all have to be arbitrarily positioned. I want to create such pdf on the server with terminal commands.
I searched for a solution but didn't find one, so I combined a few ideas to solve the problem; however, I am still looking for a more direct approach. I used the PostScript compiler ENSCRIPT as a filter and then created a pdf using the PS2PDF tool. Then I used the PDFJAM tool to place the text at an arbitrary position within the existing pdf. The following is the code I use to produce such a document. Although even this solution may help people who have the same problem, I welcome suggestions for more efficient ways to handle this.
#!/bin/bash
# the following creates PDF with text
# options:
# -B — no header
# -f — font family and size
# --columns — make text grouped in columns
# -o - — to pass output
echo "Quick brown fox jumps over the lazy dog" | \
enscript -B -f "Times-Roman12" --columns=5 -o - | \
ps2pdf - pdf_with_text.pdf
# the following tool creates pdf_with_text-pdfjam.pdf from the pdf_with_text.pdf moving it in the exact position of the page
# options:
# -q — quiet mode
# --scale — scaling pdf_with_text.pdf if you need that
# --paper — resulting paper type
# --landscape — resulting paper orientation
# --offset — the most important part, set the offset of the top left corner of the source document in the resulting document. 0,0 point is in the middle (weird!)
pdfjam -q --scale 0.20 pdf_with_text.pdf --paper letter \
--landscape --offset "2cm -3cm"
# then we combine a new document and the existing one:
pdftk ./existing.pdf stamp pdf_with_text-pdfjam.pdf output new_combined.pdf
=========================
The result is the combined pdf, new_combined.pdf, containing the text I echoed in the first command, placed at the given offset from the middle of the page.

Can I set command line arguments using the YAML metadata

Pandoc supports a YAML metadata block in markdown documents. This can set the title and author, etc. It can also manipulate the appearance of the PDF output by changing the font size, margin width and the frame sizes given to figures that are included. Lots of details are given here.
I'd like to use the metadata block to remember the command line arguments that I'm supposed to be using, such as --toc and --number-sections. I tried this, adding the following to the top of my markdown:
---
title: My Title
toc: yes
number-sections: yes
---
Then I used the command line:
pandoc -o guide.pdf articheck_guide.md
This did produce a table of contents, but didn't number the sections. I wondered why this was, and if there is a way I can specify this kind of thing from the document so that I don't need to add it on the command line.
YAML metadata are not passed to pandoc as arguments, but as variables. When you call pandoc on your MWE, it does not run this:
pandoc -o guide.pdf articheck_guide.md --toc --number-sections
as you might expect. Rather, it runs the equivalent of:
pandoc -o guide.pdf articheck_guide.md -V toc:yes -V number-sections:yes
Why, then, does your MWE produce a toc? Because the default latex template makes use of a toc variable:
~$ pandoc -D latex | grep toc
$if(toc)$
\setcounter{tocdepth}{$toc-depth$}
So setting toc to any value should produce a table of contents, at least in latex output. In this template, there is no number-sections variable, so that one doesn't work. However, there is a numbersections variable:
~$ pandoc -D latex | grep number
$if(numbersections)$
Setting numbersections to any value will produce numbering in latex output with the default template:
---
title: My Title
toc: yes
numbersections: yes
---
The trouble with this solution is that it only works with some output formats. I thought I had read somewhere on the pandoc mailing list that we would soon be able to use metadata in YAML blocks as intended (i.e. as arguments rather than variables), but I can't find it anymore, so maybe it won't happen very soon.
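In the meantime, a small wrapper can read the flags back out of the document. The sketch below is only an illustration under some assumptions: the function name flags_from_front_matter is made up, it only recognises the two keys from the question (toc and number-sections), and it only treats yes or true as "on". It does no real YAML parsing.

```shell
# Print one pandoc flag per line for the boolean keys found in the
# leading YAML block of a markdown file. Hypothetical helper: only
# "toc" and "number-sections" with the values yes/true are recognised.
flags_from_front_matter() {
  awk '
    NR == 1 && $0 != "---" { exit }                    # no front matter
    NR > 1 && ($0 == "---" || $0 == "...") { exit }    # end of the block
    /^(toc|number-sections):[ \t]*(yes|true)[ \t]*$/ {
      sub(/:.*/, "")           # keep only the key name
      printf "--%s\n", $0      # e.g. "toc" becomes "--toc"
    }
  ' "$1"
}
```

It would be used as `pandoc $(flags_from_front_matter articheck_guide.md) -o guide.pdf articheck_guide.md`, leaving the command substitution unquoted so that each flag becomes its own argument. This keeps working for every output format, because the flags are real command-line options rather than template variables.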
Have a look at panzer (GitHub repository).
It is a piece of software, recently announced and released by Mark Sprevak, that adds the notion of 'styles' to Pandoc.
It's basically a wrapper around Pandoc. It exploits the concept of YAML metadata blocks to the maximum.
The 'styles' provide a way to set all options for a Pandoc document conversion process with one line ("I want this document to be an article/CV/notes/letter.").
You can regard this as a more general abstraction than Pandoc templates. Styles are combinations of...
...Pandoc command line options,
...metadata settings,
...templates,
...instructions to run filters, and
...instructions to run pre/postprocessors.
These settings can be customized on a per output type as well as a per document basis. Styles can be...
...combined and
...can bear inheritance relations to each other.
panzer styles simplify Makefiles: they bundle everything concerning the look of a document in one place -- the YAML metadata (a block in the Markdown file, or a separate file).
You just add one line of metadata (style: ...) to your document, and it will be treated as a letter/article/CV/notebook or whatever.

What can I control with YAML header options in pandoc?

Only by chance did I see an example document using the toc: true line in their YAML header options in a Markdown file to be processed by Pandoc. And the Pandoc docs didn't mention this option to control table of contents using the YAML header. Furthermore, I see somewhat arbitrary lines in example documents on the same Pandoc readme site.
Main question:
What Pandoc options are available using the YAML header?
Meta-question:
What determines the available Pandoc options that are available to set using the YAML header?
Note: my workflow is to write Markdown files (.md) and process them through Pandoc to get PDF files. The manuscript is hierarchically organized and contains math. For example:
pandoc --standalone --smart \
--from=markdown+yaml_metadata_block \
--filter pandoc-citeproc \
my_markdown_file.md \
-o my_pdf_file.pdf
Almost everything set in the YAML metadata has an effect only through the pandoc template in use.
Pandoc templates may contain variables. For example in your HTML template, you could write:
<title>$title$</title>
These template variables can be set with the --variable KEY[=VAL] option.
However, they are also set from the document metadata, which in turn can be set either by using:
the --metadata KEY[=VAL] option,
a YAML metadata block, or
the --metadata-file option.
The --variable option inserts strings verbatim into the template, while --metadata escapes strings. Strings in YAML metadata (also when using --metadata-file) are interpreted as markdown, which you can circumvent by using pandoc markdown's generic raw attributes. For example for HTML output:
`<script>alert()</script>`{=html}
See this table for a schematic:
| | --variable | --metadata | YAML metadata and --metadata-file |
|------------------------|-------------------|-------------------|-----------------------------------|
| values can be… | strings and bools | strings and bools | also YAML objects and lists |
| strings are… | inserted verbatim | escaped | interpreted as markdown |
| accessible by filters: | no | yes | yes |
To answer your question: the template determines what fields in the YAML metadata block have an effect. To view, for example, the default latex template, use:
$ pandoc -D latex
To see some variables that are set automatically by pandoc, see the Manual. Finally, other behaviours of pandoc (such as markdown extensions, etc) can only be set as command-line options (except when using a wrapper script).
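Since the template decides which fields matter, one quick way to get an overview is to list the variables a template tests with $if(...)$. The helper below is a minimal sketch: the name list_template_conditions is made up, it assumes a grep that supports -o (GNU and BSD grep do), and it deliberately ignores variables that only appear in $for(...)$ loops or as plain $name$ interpolations.

```shell
# Read a pandoc template on stdin and print the names of all variables
# used in $if(...)$ conditions, sorted and de-duplicated.
# Hypothetical helper; misses variables used only in $for(...)$ or $name$.
list_template_conditions() {
  grep -o '\$if([^)]*)\$' |
    sed -e 's/^\$if(//' -e 's/)\$$//' |
    sort -u
}
```

For the default latex template this would be run as `pandoc -D latex | list_template_conditions`; any name it prints can be set from the YAML metadata block and will have an effect for that output format.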
It is a rather long list that you can browse by running man pandoc on the command line and navigating to the "Variables set by pandoc" section under "TEMPLATES."
The top of the list includes the following among many other options:
Variables set by pandoc
Some variables are set automatically by pandoc. These vary somewhat depending on the
output format, but include metadata fields as well as the following:
title, author, date
allow identification of basic aspects of the document. Included in PDF metadata
through LaTeX and ConTeXt. These can be set through a pandoc title block, which
allows for multiple authors, or through a YAML metadata block:
---
author:
- Aristotle
- Peter Abelard
...
subtitle
document subtitle; also used as subject in PDF metadata
abstract
document summary, included in LaTeX, ConTeXt, AsciiDoc, and Word docx
keywords
list of keywords to be included in HTML, PDF, and AsciiDoc metadata; may be
repeated as for author, above
header-includes
contents specified by -H/--include-in-header (may have multiple values)
toc
non-null value if --toc/--table-of-contents was specified
toc-title
title of table of contents (works only with EPUB and docx)
include-before
contents specified by -B/--include-before-body (may have multiple values)
include-after
contents specified by -A/--include-after-body (may have multiple values)
body
body of document
The pandoc documentation gives some clues: http://pandoc.org/getting-started.html
But to know exactly where a field is used, you can look at the pandoc template sources: https://github.com/jgm/pandoc-templates
For example, for the html5 output the file is: https://github.com/jgm/pandoc-templates/blob/master/default.html5
Here's a section of the code:
<title>$if(title-prefix)$$title-prefix$ - $endif$$pagetitle$</title>
As you can see, it uses title-prefix and pagetitle.
You can consult the documentation, but the best approach is to look at the source code of the version you are using.
The pandoc main page now contains a list of options and explanations for them:
https://pandoc.org/MANUAL.html#variables
It seems to be the same as the one when looking at man pandoc.

How to specify numbered sections in Pandoc's front matter?

I would like to specify numbered sections via Pandoc's support for YAML front matter. I know that the flag for the command-line usage is --number-sections, but something like
---
title: Test
number-sections: true
---
doesn't produce the desired result. I know that I am close because you can do this with the geometry package (e.g. geometry: margin=2cm). I wish there were a definitive guide to how Pandoc handles YAML front matter. For example, the following is very useful (avoids templates), but its discoverability is low:
header-includes:
- \usepackage{some latex package}
In order to turn on numbered-sections in latex output you need to use numbersections in your YAML block. If you ever want to "discover" things like this with pandoc just poke around the templates:
$ grep -i number default.latex
$if(numbersections)$
$ grep -i number default.html*
$
As you can see this option does not work with html.
Markdown and YAML I tested with:
---
title: Test
numbersections: true
---
# blah
Text is here.
## Double Blah
Twice the text is here
If you need it to work with more than beamer, latex, context and opendoc, you will need to file a bug on GitHub.
To show section numbers in the output pdf, there are two choices.
In YAML front matter
Add the following setting to the beginning of the markdown file:
---
numbersections: true
---
In command line
We can also use a command-line option to generate a pdf with numbered sections. According to the Pandoc documentation, the correct option is --number-sections, or simply -N:
pandoc test.md -o test.pdf --number-sections
# pandoc test.md -o test.pdf -N
