Partial Markdown parsing - ruby

I have an application that needs to parse a subset of Markdown. I basically only want to support inline elements (bold, italic, links, etc), not block level elements (p, h1, h2, etc).
There are a lot of different libraries, so I need some help narrowing it down (and a code sample would be helpful). I started using RedCarpet until I realized that I can't specify which elements I want to parse.
What Ruby Markdown library can I use to achieve this?

I haven't found a library that allows you to specify on a granular level what parts of Markdown syntax are allowed. RDiscount has some configurability, however it doesn't take into account block level elements.
You could also give Sanitize a try (I know, parsing twice isn't exactly an ideal solution) and strip out the elements you don't want afterward.

Related

How to customize Asciidoctor's DocBook converter?

For example, when the goal is to represent several lines of a program's source code, then the AsciiDoc block markup choices are using
a literal block (....), which is converted to the literallayout DocBook tag
a listing block (----), which is converted to the screen DocBook tag
These are solid defaults, but the proper semantic markup would be programlisting in this case.
Using passthrough blocks is one solution, but at the cost of polymorphism (or convenience, if I decide to pepper my documents with ifdefs / ifndefs):
WARNING
Using passthroughs to pass content (without substitutions) can couple your content to a specific output format, such as HTML. In these cases, you should use conditional preprocessor directives to route passthrough content for different output formats based on the current backend.
I don't mind using the conditionals, but simply wondering if I am missing something obvious or straightforward? Such as passing semantic tag names to a block attrlist or something. Read through the entire AsciiDoc manual, the asciidoctor man page, the Generate DocBook from AsciiDoc article, and tried keyword searches, but couldn't find a thing. (It is highly probably though that I missed it..:)

Rename latexmath macro in asciidoc

Is there a way I can rename or alias latexmath in an AsciiDoc document?
In an ideal world, I'd like to set up an AsciiDoc such that $...$ is interpreted as LaTeX math, and
$$...$$ is interpreted as a block equation. In general, I'm just trying to reduce the number of characters involved in defining a math block since
where $c$ is the speed of light and $m_0$ is the rest mass
is significantly more readable (to someone who's used LaTeX for years) than
where latexmath:[$c$] is the speed of light and latexmath:[$m_0$] is the rest mass
The use case I have is that I'm writing technical documentation for upload to a GitLab repository. I'd like to be able to exploit GitLab's ability to automatically render AsciiDoc format files. However, these documents are math heavy, so I find the large numbers of latexmath:[...] blocks hard to read while editing.
latexmath is the name of the macro that handles parsing the LaTeX markup. If you don't specify latexmath, asciidoctor doesn't know to pass control to an alternate parser.
You could achieve what you're after by writing a pre-processing step that identifies contiguous blocks of LaTeX markup and wraps that markup in latexmath:[...]. The updated markup can then be processed by Asciidoctor like normal (assuming that your LaTeX markup identification is accurate). How you go about implementing that is up to you.
Another way, assuming you have some Ruby skills, would be to modify the extension that implements the latexmath macro such that it was called, say, L. Then your markup would be the more concise:
where L:[$c$] is the speed of light and L:[$m_0$] is the rest mass

How to get list of figures in Asciidoc

I am using asciid for an article. In the end of my document I want to have a list of figures. How to I create a list of figures? Did not find something useful in the documentation for me.
Nope there isn't one at the time of answer. I checked the docs (which you indicated you did as well) and I also grepped the codebase. There is good news though! You should be able to do this with an extension.
Extensions can be written in any JVM language if you're using asciidoctorj, or in Ruby if you're using the core asciidoctor (I'm not sure about JavaScript for asciidoctorjs). You'll need to create two extensions probably: a TreeProcessor extension to go through the whole AST looking for images and pulling them out into a storage structure. Then you'll also need to create either an inline or block macro to actually place it within the page.
I strongly recommend examining the API for the nodes and functions you'll want to make use of. There are some other examples of processors that may also be helpful to examine.

How do I extract a HTML topic heading from a web page?

Given a page like "What popular startup advice is plain wrong?", I'd like to be able to extract the first topic under the topic heading on the upper right hand side, in this case, "Common Misconceptions".
What's the best way for me to do this in Ruby? Is it with Nokogiri or a regex? Presumably I need to do some HTML parsing?
First, you almost never, ever, want to use regular expressions to parse/extract/fold/spindle/mutilate XML or HTML. There are too many ways it can go wrong. Regular expressions are great for some jobs, but XML/HTML extractions are not a good fit.
That said, here's what I'd do using Nokogiri:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.quora.com/What-popular-startup-advice-is-plain-wrong'))
topic = doc.at('span a.topic_name span').content
puts topic
Running that outputs:
Common Misconceptions
The code is taking a couple shortcuts, that should work consistently:
Using Ruby's OpenURI allows easy accessing of Internet resources. It's my go-to for most simple to average apps. There are more powerful tools but none as convenient.
doc.at tells Nokogiri to traverse the document, and find the first occurrence of the CSS accessor 'span a.topic_name span', which should be consistent in that page as the first entry.
Note that Nokogiri supports some variants of searching for a node: at vs. search. at and % and things like css_at find the first occurrence and return a Node, which is an individual tag or text or comment. search, /, and those variants return a NodeSet which is like an array of Nodes. You'll have to walk that list or extract the individual nodes you want using some sort of Array accessor. In the above code I could have said doc.search(...).first to get the node I wanted.
Nokogiri also supports using XPath accessors, but for most things I'll usually go with CSS. It's simpler, and easier to read, but your mileage might vary.

ruby markdown parser with WikiWord support?

I am using git-wiki for my personal note storage. It works very well, except that WikiWords are converted to links before the markdown parsing stage, using a regular expression. This messes up scores of things, for instance links that point to outside wiki pages, or block quotes (if I am quoting something, I do not want a WikiWord to be changed into a link).
Are there ruby-based Markdown parsers that understand WikiLinks?
The best parser around is the C-based one (upskirt/sundown), whose ruby iteration is red carpet:
https://github.com/tanoku/redcarpet
It is better for performance and security reasons.
For the wiki links, pre-process them before sending your text to the markdown parser.

Resources