Pandoc HTML to Markdown - Non-HTML tables

I use the following Pandoc command to convert HTML to Markdown:
pandoc -f html -t commonmark myfile.html >myfile.md
It works great, but for some reason it always converts a table to an HTML-coded table rather than a "markdown" table (with no HTML tags in it). Does anyone know how I can force Pandoc to produce a table without HTML markup?

That is expected, because you chose commonmark as the output format. The original Markdown had no table syntax, and anything Markdown did not cover was meant to be written in the surrounding language, which is HTML in this case.
Read https://daringfireball.net/projects/markdown/syntax and you will see that HTML is allowed within Markdown.
To get pandoc's extended Markdown output, as described in the pandoc manual, use the markdown output format instead. This works here:
pandoc -f html -t markdown myfile.html >myfile.md
result:
  --- --- ---
  1   2   3
  1   2   3
  --- --- ---
myfile.html:
<html><body>
<table>
<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>1</td><td>2</td><td>3</td></tr>
</table>
</body></html>
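The comparison described in this answer can be reproduced with a short script. This is a minimal sketch, assuming pandoc is on the PATH; the output file names commonmark.md and markdown.md are made up for illustration, and the exact output may differ between pandoc versions.

```shell
# Recreate the sample input from the answer.
cat > myfile.html <<'EOF'
<html><body>
<table>
<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>1</td><td>2</td><td>3</td></tr>
</table>
</body></html>
EOF

# Run both conversions only if pandoc is installed.
if command -v pandoc >/dev/null 2>&1; then
  # CommonMark target, as in the question.
  pandoc -f html -t commonmark myfile.html > commonmark.md
  # Pandoc's extended Markdown target, as in the answer.
  pandoc -f html -t markdown myfile.html > markdown.md
  # Show how the two results differ (diff exits non-zero when they do).
  diff commonmark.md markdown.md || true
else
  echo "pandoc not found; skipping conversion" >&2
fi
```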

Related

Not able to use titlesec with markdown and pandoc?

When I used titlesec in my markdown document as below:
---
header-includes:
- \usepackage{titlesec}
---
When processing it with pandoc, I got the following error:
pandoc try.md -o try.pdf
! Argument of \paragraph has an extra }.
<inserted text>
\par
l.1290 \ttl#extract\paragraph
pandoc: Error producing PDF
By searching, I found the following workaround for R Markdown:
Can't knit to pdf with custom styles
How can I implement a similar workaround with plain markdown and YAML headers?
I also found and verified that the following approach works:
pandoc --variable=subparagraph try.md -o try.pdf
But it's harder on the user, since one might forget the workaround.
There is some discussion of the workaround at https://www.bountysource.com/issues/40574981-latex-template-incompatible-with-titlesec,
but it's beyond my knowledge.
Thanks for your help
This is because the default LaTeX template redefines \paragraph. To disable this behaviour, you can use the subparagraph variable in pandoc. You could supply this at the command-line:
pandoc --variable subparagraph -o file.pdf file.md
Or you could embed it in the document's YAML metadata, with any non-null value:
---
subparagraph: yes
---
From man pandoc (and the user's guide):
subparagraph
disables default behavior of LaTeX template that redefines (sub)paragraphs as sections, changing the appearance of nested headings in some classes
After this, titlesec.sty should work.
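Putting the YAML variant together with the header-includes line from the question might look like the following. This is a sketch only: it assumes pandoc and a LaTeX installation are available, and the heading and body text in try.md are placeholders added for illustration.

```shell
# Write a document that loads titlesec and sets subparagraph in its YAML block.
# The heading and body text are placeholders, not taken from the question.
cat > try.md <<'EOF'
---
subparagraph: yes
header-includes:
- \usepackage{titlesec}
---

# Heading

Some text.
EOF

# Build the PDF only if pandoc is installed; report a failed build instead of aborting.
if command -v pandoc >/dev/null 2>&1; then
  pandoc try.md -o try.pdf || echo "PDF build failed (is LaTeX installed?)" >&2
else
  echo "pandoc not found; skipping PDF build" >&2
fi
```

Keeping the variable in the document itself means nobody has to remember the extra command-line flag.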

How can I suppress the date when using pandoc to convert md to pdf?

I would like to create a simple pdf file from a markdown file with a title and author but no date. I cannot figure out how to suppress the date without having to edit an intermediate tex file.
---
title: Test Doc
author: My Name
---
# Some Heading Here
Text here.
When I run the command pandoc test.md -o test.pdf, the date always appears in the PDF. I have tried setting the date: field in the YAML block to all sorts of spaces, blanks, and other combinations, but cannot figure out how to get it to be blank.
Thank you.
Pandoc uses templates. To generate PDFs, by default it uses a LaTeX template, which you can print with pandoc -D latex. In an older pandoc version, this template contained:
$if(date)$
\date{$date$}
$endif$
which causes your issue: when the \date{} command is left out entirely, LaTeX's \maketitle falls back to printing today's date. So either upgrade your pandoc version or modify your template manually to contain just
\date{$date$}
or use ConTeXt instead of LaTeX:
pandoc -s -t context test.md -o test.tex && context test.tex
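The template route could be sketched as follows, assuming pandoc and LaTeX are installed. The file name custom.latex is hypothetical, and the edit to the template itself is left as a manual step because the surrounding lines differ between pandoc versions.

```shell
# Recreate the sample document from the question.
cat > test.md <<'EOF'
---
title: Test Doc
author: My Name
---
# Some Heading Here
Text here.
EOF

if command -v pandoc >/dev/null 2>&1; then
  # Save the default LaTeX template under a hypothetical name so it can be edited.
  pandoc -D latex > custom.latex
  # Manual step: in custom.latex, make sure \date{$date$} is emitted
  # unconditionally (not wrapped in $if(date)$ ... $endif$).
  # Build with the edited template instead of the built-in one.
  pandoc --template=custom.latex test.md -o test.pdf \
    || echo "PDF build failed (is LaTeX installed?)" >&2
else
  echo "pandoc not found; skipping PDF build" >&2
fi
```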

What can I control with YAML header options in pandoc?

Only by chance did I see an example document using the toc: true line in their YAML header options in a Markdown file to be processed by Pandoc. And the Pandoc docs didn't mention this option to control table of contents using the YAML header. Furthermore, I see somewhat arbitrary lines in example documents on the same Pandoc readme site.
Main question:
What Pandoc options are available using the YAML header?
Meta-question:
What determines the available Pandoc options that are available to set using the YAML header?
Note: my workflow is to write Markdown files (.md) and process them through Pandoc to get PDF files. The manuscripts are hierarchically organized and contain math. For example:
pandoc --standalone --smart \
--from=markdown+yaml_metadata_block \
--filter pandoc-citeproc \
my_markdown_file.md \
-o my_pdf_file.pdf
Almost everything set in the YAML metadata has an effect only through the pandoc template in use.
Pandoc templates may contain variables. For example in your HTML template, you could write:
<title>$title$</title>
These template variables can be set with the --variable KEY[=VAL] option.
However, they are also set from the document metadata, which in turn can be set either by using:
the --metadata KEY[=VAL] option,
a YAML metadata block, or
the --metadata-file option.
The --variable option inserts strings verbatim into the template, while --metadata escapes strings. Strings in YAML metadata (also when using --metadata-file) are interpreted as markdown, which you can circumvent by using pandoc markdown's generic raw attributes. For example, for HTML output:
`<script>alert()</script>`{=html}
See this table for a schematic:
| | --variable | --metadata | YAML metadata and --metadata-file |
|------------------------|-------------------|-------------------|-----------------------------------|
| values can be… | strings and bools | strings and bools | also YAML objects and lists |
| strings are… | inserted verbatim | escaped | interpreted as markdown |
| accessible by filters: | no | yes | yes |
To answer your question: the template determines what fields in the YAML metadata block have an effect. To view, for example, the default latex template, use:
$ pandoc -D latex
To see some variables that are set automatically by pandoc, see the Manual. Finally, other behaviours of pandoc (such as markdown extensions, etc) can only be set as command-line options (except when using a wrapper script).
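The difference between --variable and --metadata can be observed with a tiny custom template. This is a rough sketch, assuming pandoc is on the PATH; the template file tiny.html and the title value are made up for the example.

```shell
# A tiny hypothetical template with one variable and the document body.
cat > tiny.html <<'EOF'
<title>$title$</title>
$body$
EOF

if command -v pandoc >/dev/null 2>&1; then
  # --metadata: the value is escaped for the output format.
  echo 'Some text.' | pandoc -f markdown -t html --template=tiny.html \
    --metadata title='A & B'
  # --variable: the value is inserted into the template as-is.
  echo 'Some text.' | pandoc -f markdown -t html --template=tiny.html \
    --variable title='A & B'
else
  echo "pandoc not found; skipping" >&2
fi
```

Comparing the two <title> lines in the output shows how the ampersand is treated in each case.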
It is a rather long list that you can browse by running man pandoc on the command line and navigating to the "Variables set by pandoc" section under "TEMPLATES."
The top of the list includes the following among many other options:
Variables set by pandoc
    Some variables are set automatically by pandoc. These vary somewhat depending on the
    output format, but include metadata fields as well as the following:
    title, author, date
        allow identification of basic aspects of the document. Included in PDF metadata
        through LaTeX and ConTeXt. These can be set through a pandoc title block, which
        allows for multiple authors, or through a YAML metadata block:
            ---
            author:
            - Aristotle
            - Peter Abelard
            ...
    subtitle
        document subtitle; also used as subject in PDF metadata
    abstract
        document summary, included in LaTeX, ConTeXt, AsciiDoc, and Word docx
    keywords
        list of keywords to be included in HTML, PDF, and AsciiDoc metadata; may be
        repeated as for author, above
    header-includes
        contents specified by -H/--include-in-header (may have multiple values)
    toc
        non-null value if --toc/--table-of-contents was specified
    toc-title
        title of table of contents (works only with EPUB and docx)
    include-before
        contents specified by -B/--include-before-body (may have multiple values)
    include-after
        contents specified by -A/--include-after-body (may have multiple values)
    body
        body of document
The pandoc documentation gives some clues: http://pandoc.org/getting-started.html
But to know exactly where a variable is used, look at the sources of pandoc's templates: https://github.com/jgm/pandoc-templates
For example, for html5 output the file is: https://github.com/jgm/pandoc-templates/blob/master/default.html5
Here's a section of the code:
<title>$if(title-prefix)$$title-prefix$ - $endif$$pagetitle$</title>
As you can see, it uses title-prefix and pagetitle.
You can consult the documentation, but the best approach is to read the template source for the version you are using.
The pandoc main page now contains a list of options and explanations for them:
https://pandoc.org/MANUAL.html#variables
It seems to be the same as the one when looking at man pandoc.

How to specify numbered sections in Pandoc's front matter?

I would like to specify numbered sections via Pandoc's support for YAML front matter. I know that the flag for the command-line usage is --number-sections, but something like
---
title: Test
number-sections: true
---
doesn't produce the desired result. I know that I am close because you can do this with the geometry package (e.g. geometry: margin=2cm). I wish there were a definitive guide on how Pandoc handles YAML front matter. For example, the following is very useful (avoids templates), but its discoverability is low:
header-includes:
- \usepackage{some latex package}
In order to turn on numbered-sections in latex output you need to use numbersections in your YAML block. If you ever want to "discover" things like this with pandoc just poke around the templates:
$ grep -i number default.latex
$if(numbersections)$
$ grep -i number default.html*
$
As you can see this option does not work with html.
Markdown and YAML I tested with:
---
title: Test
numbersections: true
---
# blah
Text is here.
## Double Blah
Twice the text is here
If you need it to work with more than beamer, latex, context, and opendoc, you will need to file a bug on GitHub.
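The same kind of template search can be done without a checkout of the template files by printing the built-in defaults with pandoc -D. This is a sketch, assuming pandoc (and LaTeX, for the PDF step) is installed; the file name numbered.md is hypothetical, and which templates match depends on the pandoc version.

```shell
# Recreate the test document from the answer under a hypothetical file name.
cat > numbered.md <<'EOF'
---
title: Test
numbersections: true
---
# blah
Text is here.
## Double Blah
Twice the text is here
EOF

if command -v pandoc >/dev/null 2>&1; then
  # Search the built-in default templates for variables mentioning "number".
  for fmt in latex html5; do
    echo "== $fmt =="
    pandoc -D "$fmt" | grep -i number || echo "(no match)"
  done
  # Build the PDF; the YAML field takes the place of --number-sections.
  pandoc numbered.md -o numbered.pdf \
    || echo "PDF build failed (is LaTeX installed?)" >&2
else
  echo "pandoc not found; skipping" >&2
fi
```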
To show section numbers in the generated PDF, there are two choices.
In YAML front matter
Add the following setting to the beginning of the markdown file:
---
numbersections: true
---
On the command line
We can also use a command-line option to generate a PDF with numbered sections. According to the Pandoc documentation, the correct option is --number-sections, or simply -N:
pandoc test.md -o test.pdf --number-sections
# pandoc test.md -o test.pdf -N

Pandoc: no line wrapping when converting to HTML

I am converting from Markdown to HTML like so:
pandoc --columns=70 --mathjax -f markdown input.pdc -t html -Ss > out.html
Everything works fine, except for the fact that the text doesn't get wrapped. I tried different column lengths, with no effect. I removed options, no go. Whatever I tried, the HTML just doesn't get wrapped. I searched the bug tracker, but there don't seem to be any open bugs relating to this issue. I also checked the documentation, but as far as I could glean, the text ought to be line-wrapped... So, have I stumbled into a bug?
I'm using pandoc version 1.12.4.2.
Thanks in advance for your help!
Pandoc puts newlines in the HTML so the source code is easier to read. By default, it doesn't insert <br>-tags.
If you want to preserve line breaks from markdown input:
pandoc -f markdown+hard_line_breaks input.md -o output.html
However, usually a better approach to limit the text width when opening the HTML file in the browser is to adapt the HTML template (pandoc -D html5) and add some CSS, like:
<!DOCTYPE html>
<html$if(lang)$ lang="$lang$"$endif$>
<head>
<style>
body {
width: 46em;
}
</style>
...
It is not clear which text should get wrapped but does not, since you did not provide a sample.
Pandoc supports several line-breaking scenarios in markdown documents.
What you may be looking for is the hard_line_breaks extension.
If so, your command should look like this:
pandoc --columns=70 --mathjax -f markdown+hard_line_breaks input.pdc -t html -Ss > out.html
I'd recommend reading about all the markdown-related options and configuring pandoc to match your input markdown flavor.

Resources