Broken cross-document links with pandoc when converting markdown to other formats - pandoc

When converting Markdown files with cross-document links to HTML, docx, or PDF, the links get broken in the process.
I use pandoc 1.19.1 and MikTex.
This is my testcase:
File1: doc1.md
[link1](/doc2.md)
File2: doc2.md
[link2](/doc1.md)
The result in html with this call to pandoc:
pandoc doc1.md doc2.md -o test.html
looks like this:
<p>link1 link2</p>
As pdf a link is created but it does not work. Exported as docx it looks the same.
I would have assumed that when multiple files are processed and concatenated into the same output file, the result would contain page-internal links, such as anchor links for HTML output. Instead, each link is created in the output file exactly as it was in the input files. Even the original file extension .md is preserved in the created links.
What am I doing wrong?
My problem looks a bit like this:
pandoc command line parameters for resolving internal links
In the comments of this question the bug is said to be fixed by a pull request in May. But the bug still seems to exist.
Greetings
Georg

I had a similar problem when trying to export a Gitlab wiki to PDF. There, links between pages look like filename-of-page#anchor-name and links within a page look like #anchor-name. I wrote a (finicky and fragile) pandoc filter that solved the problem for me; I am sharing it in case it is useful to others.
Example files
To explain my solution I'll have two test files, 101-first-page.md:
# First page // Gitlab automatically creates an anchor here named #first-page
Some text.
## Another section // Gitlab automatically creates an anchor here named #another-section
A link to the [first section](#first-page)
and 102-second-page.md:
# Second page // Gitlab automatically creates an anchor here named #second-page
Some text and [a link to the first page](101-first-page#first-page).
When concatenating them to render as one document in pandoc, links between pages break as anchors change. Below the concatenated file with the anchors in comments.
# First page // anchor=#first-page
Some text.
## Another section // anchor=#another-section
A link to the [first section](#first-page)
# Second page // anchor=#second-page
Some text and [a link to the first page](101-first-page#first-page). // <-- this anchor no longer exists.
The link from the second to the first page breaks as the link target is incorrect.
Solution
By pre-processing all markdown files first individually via a pandoc filter, and then concatenating the resulting json files I was able to get all links working.
Requirements
pandoc
latex
python
pandocfilters
Every file should start with a level 1 header that matches the filename (except for the number at the beginning). E.g. the file 101-A file on the wiki.md should have a first level one header named A file on the wiki.
Filter
The filter itself (together with the pandoc script) is available in this gist.
What it does is:
It gets the label of the first level 1 header, e.g. first-page
It prepends that label to all other labels in the same file, e.g. first-page-another-section.
It renames all links to the same file such that the prefix is taken into account, e.g. #first-page-first-page
It renames all links to other files such that the (assumed) prefix of the other files is taken into account, e.g. 101-first-page#first-page becomes #first-page-first-page.
After every markdown file has been run through this filter individually and converted to a JSON file, the script concatenates the JSON files and converts the result to a PDF.
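The renaming rules listed above can be sketched on plain strings. This is not the filter from the gist (which works on pandoc's JSON output); it is a minimal illustration, and the function names page_prefix, prefixed_anchor and rewrite_link are hypothetical. It assumes that every header identifier in a file, including the first one, receives the prefix, since that is what the example #first-page-first-page implies.

```python
import re


def page_prefix(page_name):
    """Return the anchor prefix assumed for a wiki page.

    '101-first-page.md' and '101-first-page' both give 'first-page'.
    """
    stem = re.sub(r"\.md$", "", page_name)
    return re.sub(r"^\d+-", "", stem)


def prefixed_anchor(anchor, prefix):
    """Prepend the page prefix to a header identifier."""
    return f"{prefix}-{anchor}"


def rewrite_link(target, current_prefix):
    """Rewrite a link target so it resolves in the concatenated document."""
    # Leave external URLs such as https://... or mailto:... untouched.
    if re.match(r"[A-Za-z][A-Za-z0-9+.-]*:", target):
        return target
    page, sep, anchor = target.partition("#")
    if not sep:
        # No anchor given: this sketch does not guess a destination.
        return target
    # An empty page part means the link points into the current file.
    prefix = page_prefix(page) if page else current_prefix
    return "#" + prefixed_anchor(anchor, prefix)
```

With these helpers, rewrite_link("#first-page", "first-page") and rewrite_link("101-first-page#first-page", "second-page") both yield #first-page-first-page, matching the examples in the list.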

As the pandoc README states:
If multiple input files are given, pandoc will concatenate them all (with blank lines between them) before parsing.
So for the parsing done by pandoc, it is all one document. You'll have to construct the links in your multiple files as if they were all in one file; see also this answer for details.
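In practice this means a link from doc1.md to a heading in doc2.md is written as #identifier, where the identifier is the one pandoc derives from the heading text. The sketch below approximates the documented rules of pandoc's auto_identifiers extension; it is an assumption-laden helper (the name pandoc_like_identifier is made up), it ignores inline formatting and the numeric suffixes pandoc adds to duplicate headings, and the exact behaviour may differ between pandoc versions.

```python
import re


def pandoc_like_identifier(heading_text):
    """Approximate the identifier pandoc generates for a heading."""
    # Keep letters, digits, underscores, hyphens, periods and whitespace.
    kept = "".join(
        ch for ch in heading_text
        if ch.isalnum() or ch in "_-." or ch.isspace()
    )
    # Whitespace runs become single hyphens; everything is lowercased.
    ident = re.sub(r"\s+", "-", kept.strip()).lower()
    # Identifiers may not start with a digit or punctuation mark.
    for index, ch in enumerate(ident):
        if ch.isalpha():
            return ident[index:]
    # Nothing usable is left, so fall back to the default identifier.
    return "section"
```

For a hypothetical heading "Second document" in doc2.md, the link in doc1.md would then be written as [link1](#second-document) instead of [link1](/doc2.md).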

Related

Better way to include content as-is with AsciiDoc include directive

Context
I am making a script that dynamically inserts include directives in code blocks on an AsciiDoc file and then generates a PDF out of that. A generated AsciiDoc file could look like this:
= Title
[source,java]
----
include::foo.java[]
----
I want the user to be free to include whatever char-based file he or she wants, even other AsciiDoc files.
Problems
My goal is to show the contents as are of these included files. I run into problems when the included file:
is recognized as AsciiDoc because of its extension, and thus any include directives it contains are interpreted. I don't want nested includes, just to show the include directive in the code block. Example of undesired behaviour:
contains the code block delimiter ----, as seen on the image above when I end up with two code blocks instead of the intended single one. In this case, it does not matter if the file is recognized as an AsciiDoc file, the problem persists.
My workaround
The script I am writing uses AsciidoctorJ and I am leveraging that I can control how the content of each file is included by using an include processor. Using the include processor I wrap each line of each file with the pass:[] macro. Additionally, I activate macro substitution on the desired code block. A demonstration of this idea is shown in the image above.
Is there a better way to show the exact contents of a file? This works, but it seems like a hack. I would much rather prefer not having to change the read lines as I am currently doing.
EDIT for further information
I would like to:
not have to escape the block delimiter. I am not exclusively referring to ----, but whatever the delimiter happens to be. For example, the answer by cirrus still has the problem when a line of the included file has .....
not have to escape the include directives in files recognized as AsciiDoc.
In a general note, I don't want to escape (or modify in any way) any lines.
Problem I found with my workaround:
If the last char of a line is a backslash (\), it escapes the closing bracket of the pass:[] macro.
You can try using a literal block. Based on your above example:
a.adoc:
= Title
....
include::c.adoc[]
....
If you use include:: in c.adoc, asciidoctor will still try to find and include the file. As such you will need to replace include:: with \include::
c.adoc:
\include::foo.txt[]
----
----
Which should output the following pdf:
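If the included files are generated or copied by a script anyway, the replacement of include:: with \include:: can be automated. The following is a minimal sketch, assuming directives start at the beginning of a line; the function name escape_include_directives is hypothetical, and the sketch does modify the file content, which the question would prefer to avoid.

```python
import re

# Matches an include directive only when it starts a line.
INCLUDE_AT_LINE_START = re.compile(r"^include::", flags=re.MULTILINE)


def escape_include_directives(text):
    """Prefix every line-initial include:: with a backslash."""
    # In the replacement, the doubled backslash stands for one literal backslash.
    return INCLUDE_AT_LINE_START.sub(r"\\include::", text)
```

Lines that are already escaped start with a backslash and are therefore left alone, so running the function twice gives the same result as running it once.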

Can a Java source file also be a valid AsciiDoc document with a table of contents?

I have done some experiments with writing programs that are also at the same time valid documentation that can be rendered as README's by e.g. Github - this ensures that code snippets are up to date and valid - and had some very interesting findings with Markdown. Unfortunately that format does not support having an automatically generated table of contents, so we looked into AsciiDoc which does.
I managed to copy an example using the :toc: macro setting (to be able to place the table of contents after the opening summary), and then went on to make it valid Java, which essentially means that the file has to start with the /* characters. Once I do that, I cannot make the table of contents appear any more.
The snippet starts with:
= Asciidoctor PDF Theming Guide
Dan Allen <https://github.com/mojavelinux[#mojavelinux]>
// Settings:
:idprefix:
:idseparator: -
:toc: macro
:experimental:
ifndef::env-github[:icons: font]
ifdef::env-github[]
:outfilesuffix: .adoc
:!toc-title:
:caution-caption: :fire:
:important-caption: :exclamation:
:note-caption: :paperclip:
:tip-caption: :bulb:
:warning-caption: :warning:
endif::[]
:window: _blank
// Aliases:
:conum-guard-yaml: #
ifndef::icons[:conum-guard-yaml: # #]
ifdef::backend-pdf[:conum-guard-yaml: # #]
:url-fontforge: https://fontforge.github.io/en-US/
:url-fontforge-scripting: https://fontforge.github.io/en-US/documentation/scripting/
:url-prawn: http://prawnpdf.org
////
Topics remaining to document:
* line height and line height length (and what that all means)
* title page layout / title page images (logo & background)
////
[.lead]
The theming system in Asciidoctor PDF is used to control the layout and styling of the PDF file
... blurb removed ...
/* (Experiment with asciidoc)
= Dagger 2 Hello World
// (Important: As an experiment Main.java is also a valid markdown file copied unmodified to README.md, so only edit Main.java)
This project is a single file Hello World Dagger-2 Maven project for
Java 8 and later, while also being its own documentation written in AsciiDoc.
toc::[]
My gut feeling is that the TOC only works as expected if the file starts with lines parsed by AsciiDoc where this is set up and configured. If any output is generated before the configuration bits (like the Java comment), then the TOC is silently empty.
Hence I would like to know how I should do this correctly. All I want is a functional toc::[] macro in a file starting with /*
Asciidoc markup files are not Java source files. While I understand that this would be a compelling combination of the formats, that capability does not exist.
To keep source files up-to-date, your Asciidoc source files can use the include directive to include a source file. See: https://asciidoctor.org/docs/user-manual/#include-directive
To include, say, a single method, you can use tags to mark the start and end of the method's implementation, and then you can include that tag-delimited code section like this:
[source,java]
----
include::path/to/source.java[tag="method-x"]
----
Note that if the path to the Java source that you want to include is outside of the current directory, you may have to change the safe mode accordingly: https://asciidoctor.org/docs/user-manual/#running-asciidoctor-securely
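To make the tag mechanism concrete: in the Java file, the region is marked with the comment lines // tag::method-x[] and // end::method-x[]. The sketch below mimics in Python what the include directive selects for such a tag. It is only an illustration of the idea, not Asciidoctor's implementation, and the sample class and the function name extract_tagged_region are made up.

```python
JAVA_SOURCE = """public class Source {
    // tag::method-x[]
    int methodX() {
        return 42;
    }
    // end::method-x[]
}"""


def extract_tagged_region(source, tag):
    """Return the lines between the tag:: and end:: markers for a tag."""
    start_marker = f"tag::{tag}[]"
    end_marker = f"end::{tag}[]"
    selected = []
    inside = False
    for line in source.splitlines():
        if end_marker in line:
            inside = False
        elif start_marker in line:
            inside = True
        elif inside:
            # Marker lines themselves are never part of the result.
            selected.append(line)
    return "\n".join(selected)
```

Calling extract_tagged_region(JAVA_SOURCE, "method-x") returns only the three lines of methodX, which is the part that would appear in the rendered listing.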

How to generate a Table of Contents with page numbers using MMD Parser

I'm writing a shell script (.sh) to:
Convert a markdown file (README.md) to HTML
Convert an HTML file to LaTeX.
Convert a LaTeX file to PDF.
The shell script uses MultiMarkdown v6 (by Fletcher Penney) for steps 1-2 and "pdflatex" for step 3. The files are generated and formatted automatically.
The PDF pages are numbered; however, the entries in the table of contents are not, and question marks appear in the toc instead.
I included the metadata at the very top of the README.md. The script uses metadata to generate the latex file. I created the toc using the usual method for Github readme.md.
MMDv6 provides the "{{TOC}} function" (I did not use it). I could not get my head around this function so I just created a toc using the method I mentioned above.
MultiMarkdown User's Guide has a small section about toc (https://fletcher.github.io/MultiMarkdown-6/MMD_Users_Guide.html#tableofcontents).
Useful info about MMD & Latex
(https://github.com/fletcher/MultiMarkdown/wiki/MultiMarkdown-and-LaTeX)
My table of contents has the following structure:
Table of Contents
=================
<!--ts-->
1. [Abstract](#Abstract)
2. [Table of Contents](#Table-of-Contents)
3. [System Installation](#System-Installation)
4. [System Architecture](#System-Architecture)
etc...
The script runs well, however, I would expect page numbers in the toc.
The toc on the PDF looks like:
Abstract (??)
System Installation (??)
System Architecture (??)
where (??) appears in place of the page number. How can I fix this? Do you have any suggestions?
Thanks

How to show redundant docs on multiple pages in read the docs

In our Read the Docs project we have a use case where we need to show some specific docs on multiple pages in the same version of the docs. As of now, we do this in one of the following ways:
Copy-pasting the content to each page's rst file
Write it in one of the concerned files with a label and use :std:ref: in rest of the files to redirect it to the main file
I would like to write the content in only one file and then show it (without any redirection for the user) in each of the files. Is that possible?
Use the include directive in the parent file.
.. include:: includeme.rst
Note that the included file will be interpreted in the context of the parent file. Therefore section levels (headings) in the included file must be consistent with the parent file, and labels in the included file might generate duplicate warnings.
You can use the include directive for this purpose.
Say that you write the text in dir/text.rst.
The following will include in other documents:
.. include:: /dir/text.rst
where the path is either relative (with no leading slash) or absolute, which is possible in Sphinx (doc):
in Sphinx, when given an absolute include file path, this directive
takes it as relative to the source directory

With Pandoc, how to convert between different formats with additional rules?

I have some existing Mediawiki format texts that contain categories tokens like
[[Category:XXX]]
[[Category:YYY]]
I'd like to convert them to Markdown texts. The basic command for doing that with Pandoc is
pandoc -f mediawiki -t markdown -s mytext.mediawiki -o mytext.md
The resultant Markdown text is mostly usable except that it converts the category tokens to
<Category:XXX> <Category:YYY>
which isn't really what I need. Instead, I need
[[!tag XXX YYY]]
because I'm using the resultant Markdown files as source files in a special content management system called Ikiwiki which has its idiosyncratic format for tags. How to do that with Pandoc?
It's probably easiest to do this as a second step with a search and replace on <Category:XXX>. Note that pandoc without the -o option writes to standard-out, so you can pipe it directly to some custom post-processing script.
[[Category:XXX]] is converted by pandoc internally to a link along the lines of Category:XXX (try pandoc -f mediawiki -t native).
So generally, additional rules for elements are implemented through custom scripts that match on Pandoc's internal data types; see Pandoc scripting. You could match on those kinds of links. It's more work (the first time), but it makes quite sure you don't replace false positives.
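For the simpler search-and-replace route, a post-processing step could look like the sketch below. It is only an illustration under stated assumptions: category names contain no spaces or > characters, all tags are collected into a single directive appended at the end of the text, and the function name categories_to_ikiwiki_tag is made up.

```python
import re

# Matches the <Category:XXX> tokens that pandoc writes into the Markdown.
CATEGORY = re.compile(r"<Category:([^>]+)>")


def categories_to_ikiwiki_tag(markdown):
    """Replace <Category:...> tokens with one Ikiwiki tag directive."""
    tags = CATEGORY.findall(markdown)
    if not tags:
        # Nothing to convert, so return the text unchanged.
        return markdown
    # Drop the tokens and any trailing whitespace they leave behind.
    body = CATEGORY.sub("", markdown).rstrip()
    return body + "\n\n[[!tag " + " ".join(tags) + "]]\n"
```

A small wrapper script that passes sys.stdin.read() to this function and prints the result can then sit at the end of a pipe fed by pandoc -f mediawiki -t markdown -s mytext.mediawiki.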

Resources