iText7 pdfHTML - TOC Generation - itext7

My use case for pdfHTML generation is as follows:
1. I have several individual reports being generated in HTML format by external APIs (say Rep1.html, Rep2.html, ..., Rep10.html).
2. I have an orchestrating API that needs to merge these HTML files into a SINGLE PDF document.
Assume that step 2 can read these files from a source directory.
Question: in the PDF merge process, how do I create a TOC that references all 10 reports? There is an earlier post from Alexey (Adding Table of Content with pdfHTML in iText7) that addresses this for a single HTML file; it entails adding a special [data-toc] tag by parsing the HTML file.
In my case, I can also add the required special tags when the individual HTML files are created. So, given that the individual HTML files will have the special tag, how do I render the TOC with page numbers that reflect where the content ends up in the final PDF document?
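A minimal sketch of one possible approach (not from the original thread): lay out all the reports first, record the first page each one lands on, append the TOC last, and then move the TOC pages to the front. The file names, the reports directory, and the choice to build one TOC entry per report (rather than per [data-toc] heading) are assumptions for illustration; a per-heading TOC would follow Alexey's custom tag worker technique but could track pages the same way.

import com.itextpdf.html2pdf.HtmlConverter;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.AreaBreak;
import com.itextpdf.layout.element.IBlockElement;
import com.itextpdf.layout.element.IElement;
import com.itextpdf.layout.element.Paragraph;
import java.io.File;
import java.io.FileInputStream;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergeReportsWithToc {
    public static void main(String[] args) throws Exception {
        File sourceDir = new File("reports");                        // hypothetical source directory
        String[] reports = {"Rep1.html", "Rep2.html", "Rep10.html"}; // hypothetical file names

        PdfDocument pdfDoc = new PdfDocument(new PdfWriter("merged.pdf"));
        Document doc = new Document(pdfDoc);

        // Phase 1: lay out every report and remember the first page it lands on.
        Map<String, Integer> startPage = new LinkedHashMap<>();
        int pagesSoFar = 0;
        for (String name : reports) {
            if (pagesSoFar > 0) {
                doc.add(new AreaBreak());                    // each report starts on a fresh page
            }
            List<IElement> elements;
            try (FileInputStream html = new FileInputStream(new File(sourceDir, name))) {
                // A ConverterProperties with setBaseUri(...) may be needed if the reports reference CSS/images.
                elements = HtmlConverter.convertToElements(html);
            }
            for (IElement element : elements) {
                if (element instanceof IBlockElement) {
                    doc.add((IBlockElement) element);
                }
            }
            startPage.put(name, pagesSoFar + 1);             // page this report begins on in the merged PDF
            pagesSoFar = pdfDoc.getNumberOfPages();          // last page written so far
        }

        // Phase 2: append the TOC at the end, using the recorded page numbers.
        int tocFirstPage = pdfDoc.getNumberOfPages() + 1;
        doc.add(new AreaBreak());
        doc.add(new Paragraph("Table of Contents"));
        for (Map.Entry<String, Integer> entry : startPage.entrySet()) {
            doc.add(new Paragraph(entry.getKey() + " ..... page " + entry.getValue()));
        }
        doc.close();

        // Phase 3: move the TOC pages to the front of the finished document.
        PdfDocument reordered = new PdfDocument(new PdfReader("merged.pdf"),
                new PdfWriter("merged-with-toc.pdf"));
        int tocPageCount = reordered.getNumberOfPages() - tocFirstPage + 1;
        for (int i = 0; i < tocPageCount; i++) {
            reordered.movePage(tocFirstPage + i, i + 1);
        }
        reordered.close();
    }
}

Because the TOC is laid out after everything else, the recorded numbers already reflect where each report ends up in the merged content. The remaining caveat is that moving the TOC to the front shifts every page by the TOC's own length, which you can compensate for by reserving the TOC pages up front, adjusting the printed numbers, or using named destinations as in Alexey's answer.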

Related

How to create a new document in Sphinx/docutils by API?

I'm writing a new extension for Sphinx as a domain offering multiple directives, roles and indices for the hardware description language VHDL. This extension shall be able to auto-document language constructs. In VHDL we have e.g. entities/architectures or packages/package bodies. These could be documented as pairs in individual docutils documents, so each language construct gets an individual URL (or page in a PDF) in the documentation.
I'm looking for a solution to create new documents in Sphinx/docutils. According to the docutils API description, a document is the root element to a doctree (documentation tree). According to the Sphinx documentation, directives are items in the doctree that get consumed and can emit new nodes to the surrounding doctree. So it's a translation/replacement approach.
None of the documentation seems to offer a way to create new documents.
Looking at the autodoc extensions, there are two sides. There is sphinx.ext.autodoc, which comes with Sphinx. It offers .. auto*** directives to automatically document Python language constructs. It requires a user to place dozens to hundreds of auto-directives into the reStructuredText. Of course it automatically documents e.g. classes or modules, but for a huge project it's a lot of work.
In addition, there is autoapi, which reads the Python code for a given package or module and generates ReST files on the fly when Sphinx is loaded. Then these files - containing auto-directives - are processed.
As I understand it, autoapi works around the problem of creating new docutils documents by writing ReST files, so Sphinx generates document instances with a doctree, and then autodoc from Sphinx jumps in and replaces them with content from docstrings.
So my questions after all the investigation I have done so far:
How can I create docutils or Sphinx documents so that I get one HTML file per item I want to auto-document?
Or do I always need a hack like autoapi from Carlos Jenkins and create ReST files as dummies or with auto-directives, so I can use the replacement capabilities of Sphinx/autodoc from directive to documentation nodes?
Why don't I like the autoapi approach? I have parsed VHDL files as input, in the form of a Code Document Object Model (CodeDOM). I don't want to serialize the parsed VHDL files to ReST only to parse them again and reconstruct a model of my source files, just so I can then translate it to documentation nodes like sections, paragraphs and lists.
I have everything available to generate doc-nodes for docutils, but I need multiple documents so I can distribute the content across hundreds of documentation files (HTML files).

Can a Java source file also be a valid AsciiDoc document with a table of contents?

I have done some experiments with writing programs that are at the same time valid documentation that can be rendered as READMEs by e.g. GitHub - this ensures that code snippets are up to date and valid - and had some very interesting findings with Markdown. Unfortunately that format does not support an automatically generated table of contents, so we looked into AsciiDoc, which does.
I managed to copy an example using the :toc: macro (to be able to place it after the opening summary), and then went on to make it valid Java, which essentially means that you have to start the file with the /* characters, but then I cannot make the table of contents appear any more.
The snippet starts with:
= Asciidoctor PDF Theming Guide
Dan Allen <https://github.com/mojavelinux[#mojavelinux]>
// Settings:
:idprefix:
:idseparator: -
:toc: macro
:experimental:
ifndef::env-github[:icons: font]
ifdef::env-github[]
:outfilesuffix: .adoc
:!toc-title:
:caution-caption: :fire:
:important-caption: :exclamation:
:note-caption: :paperclip:
:tip-caption: :bulb:
:warning-caption: :warning:
endif::[]
:window: _blank
// Aliases:
:conum-guard-yaml: #
ifndef::icons[:conum-guard-yaml: # #]
ifdef::backend-pdf[:conum-guard-yaml: # #]
:url-fontforge: https://fontforge.github.io/en-US/
:url-fontforge-scripting: https://fontforge.github.io/en-US/documentation/scripting/
:url-prawn: http://prawnpdf.org
////
Topics remaining to document:
* line height and line height length (and what that all means)
* title page layout / title page images (logo & background)
////
[.lead]
The theming system in Asciidoctor PDF is used to control the layout and styling of the PDF file
... blurb removed ...
/* (Experiment with asciidoc)
= Dagger 2 Hello World
// (Important: As an experiment Main.java is also a valid markdown file copied unmodified to README.md, so only edit Main.java)
This project is a single file Hello World Dagger-2 Maven project for
Java 8 and later, while also being its own documentation written in AsciiDoc.
toc::[]
My gut feeling is that the TOC only works as expected if the file starts with lines parsed by AsciiDoc where this is set up and configured. If any output is generated before the configuration bits (like the Java comment), then the TOC is silently empty.
Hence I would like to know how I should do this correctly. All I want is a functional toc::[] macro in a file starting with /*
Asciidoc markup files are not Java source files. While I understand that this would be a compelling combination of the formats, that capability does not exist.
To keep the code snippets in your documentation up to date, your AsciiDoc source files can use the include directive to pull in a source file. See: https://asciidoctor.org/docs/user-manual/#include-directive
To include, say, a single method, you can use tags to mark the start and end of the method's implementation, and then you can include that tag-delimited code section like this:
[source,java]
----
include::path/to/source.java[tag="method-x"]
----
Note that if the path to the Java source that you want to include is outside of the current directory, you may have to change the safe mode accordingly: https://asciidoctor.org/docs/user-manual/#running-asciidoctor-securely
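For illustration, the tag="method-x" region referenced in the include above is marked with ordinary line comments inside the Java source; the class and method names below are made up:

public class Source {

    // tag::method-x[]
    // Only the lines between these two tag comments are pulled into the AsciiDoc output.
    public int methodX(int a, int b) {
        return a + b;
    }
    // end::method-x[]
}

Asciidoctor copies just that region into the rendered document, so the snippet shown in the README always matches code that actually compiles.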

Morningstar xpath return empty in Google Sheet (Imported content is empty) [duplicate]

This question is a duplicate of: Scraping data to Google Sheets from a website that uses JavaScript.
I am trying to pull a number from the Morningstar "Cash Flow" page for an arbitrary stock ticker using XPath. I tested the XPath against the Morningstar website with an XPath tester and it returned the desired values. However, when I use this in a Google Sheet, it returns #N/A (Imported content is empty.).
=IMPORTXML("http://financials.morningstar.com/cash-flow/cf.html?t=fb&region=usa&culture=en-US", "//div[@id='data_tts1']/div")
I did a bit of research on this and found out that data on such websites is generated dynamically and the content is downloaded in stages; therefore, the page needs to be fully loaded before any data can be pulled out of it.
I'm wondering if there is any solution to this issue?
Your help would be much appreciated.
It's empty, as it should be, because the content you are trying to scrape is of JavaScript origin. Google Sheets does not support importing JS-generated elements. You can always test this by disabling JS for a given site: only what's left can be scraped.
It might be possible, but you have to prepare a custom sheet to extract the data. Use IMPORTDATA to parse the .json which contains the data:
http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=672024&callback=jsonp1585016592836&_=1585016593002
AFAIK, you can't directly import the .csv version (specific headers are needed, so curl or other specific tools would be required).
http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=764423&denominatorView=raw&number=3
Since this .json is very special (it contains HTML tags), I don't think a custom script for Google Sheets could import it correctly. So once the .json is loaded in Google Sheets, TRANSPOSE the rows to columns and use formulas to locate your data (target the cells which contain data_s1 and data_s2, for example). Use CONCAT to merge the cells of interest, then split the result into columns (use a custom separator). SEARCH for the data you want and clean the results with SUBSTITUTE. The method is dirty, but I think the whole process could be automated.

How to generate a Table of Contents with page numbers using the MMD Parser

I'm writing a shell script (.sh) to:
1. Convert a markdown file (README.md) to HTML.
2. Convert the HTML file to LaTeX.
3. Convert the LaTeX file to PDF.
The shell script uses MultiMarkdown v6 (by Fletcher Penney) for steps 1-2 and pdflatex for step 3. The files are generated and formatted automatically.
The PDF pages are numbered; however, the entries in the table of contents are not, and question marks appear in the TOC instead.
I included the metadata at the very top of the README.md. The script uses the metadata to generate the LaTeX file. I created the TOC using the usual method for a GitHub README.md.
MMDv6 provides the {{TOC}} function (I did not use it). I could not get my head around this function, so I just created a TOC using the method I mentioned above.
MultiMarkdown User's Guide has a small section about toc (https://fletcher.github.io/MultiMarkdown-6/MMD_Users_Guide.html#tableofcontents).
Useful info about MMD & Latex
(https://github.com/fletcher/MultiMarkdown/wiki/MultiMarkdown-and-LaTeX)
My table of contents has the following structure:
Table of Contents
=================
<!--ts-->
1. [Abstract](#Abstract)
2. [Table of Contents](#Table-of-Contents)
3. [System Installation](#System-Installation)
4. [System Architecture](#System-Architecture)
etc...
The script runs well; however, I would expect page numbers in the TOC.
The TOC in the generated PDF looks like:
Abstract (??)
System Installation (??)
System Architecture (??)
where (??) should be the page number. How can I fix this? Do you have any suggestions?
Thanks

Broken cross-document links with pandoc when converting markdown to other formats

When converting markdown files with cross-document links to HTML, docx or PDF, the links get broken in the process.
I use pandoc 1.19.1 and MikTex.
This is my testcase:
File1: doc1.md
[link1](/doc2.md)
File2: doc2.md
[link2](/doc1.md)
The result in html with this call to pandoc:
pandoc doc1.md doc2.md -o test.html
looks like this:
<p>link1 link2</p>
As PDF, a link is created but it does not work. Exported as docx it looks the same.
I would have assumed that when multiple files are processed and concatenated into the same output file, the result would contain document-internal links, e.g. anchor links for HTML output. But instead the link is created in the output file just as it was in the input files; even the original file extension .md is preserved in the created links.
What am I doing wrong?
My problem looks a bit like this:
pandoc command line parameters for resolving internal links
In the comments of this question the bug is said to be fixed by a pull request in May. But the bug still seems to exist.
Greetings
Georg
I had a similar problem when trying to export a GitLab wiki to PDF. There, links between pages look like filename-of-page#anchor-name and links within a page look like #anchor-name. I wrote a (finicky and fragile) pandoc filter that solved that problem for me; maybe it is useful to others.
Example files
To explain my solution I'll have two test files, 101-first-page.md:
# First page // Gitlab automatically creates an anchor here named #first-page
Some text.
## Another section // Gitlab automatically creates an anchor here named #another-section
A link to the [first section](#first-page)
and 102-second-page.md:
# Second page // Gitlab automatically creates an anchor here named #second-page
Some text and [a link to the first page](101-first-page#first-page).
When concatenating them to render as one document in pandoc, links between pages break because the anchors change. Below is the concatenated file, with the anchors in comments.
# First page // anchor=#first-page
Some text.
## Another section // anchor=#another-section
A link to the [first section](#first-page)
# Second page // anchor=#second-page
Some text and [a link to the first page](101-first-page#first-page). // <-- this anchor no longer exists.
The link from the second to the first page breaks as the link target is incorrect.
Solution
By pre-processing all markdown files first individually via a pandoc filter, and then concatenating the resulting json files I was able to get all links working.
Requirements
pandoc
latex
python
pandocfilters
Every file should start with a level 1 header that matches the filename (except for the number at the beginning). E.g. the file 101-A file on the wiki.md should have a level 1 header named A file on the wiki.
Filter
The filter itself (together with the pandoc script) is available in this gist.
What it does is:
It gets the label of the first level 1 header, e.g. first-page
It prepends that label to all other labels in the same file, e.g. first-page-another-section.
It renames all links to the same file such that the prefix is taken into account, e.g. #first-page-first-page
It renames all links to other files such that the (assumed) prefix of the other files is taken into account, e.g. 101-first-page#first-page becomes #first-page-first-page.
After it has run every markdown file through this filter individually and converted them to json files, it concatenates the json's and converts that to a PDF.
As the pandoc README states:
If multiple input files are given, pandoc will concatenate them all (with blank lines between them) before parsing.
So for the parsing done by pandoc, it sees it all as one document... so you'll have to construct your links across multiple files as if they were all in one file; see also this answer for details.
