Reformat Markdown files to a specific code style - shell

I'm working on a book that several people have written and edited. Everything is Markdown. Unfortunately, there is a mix of different styles and line widths. Technically this isn't a problem, but it doesn't look nice.
What is the best way to reformat those files to, e.g., GitHub Markdown style? Is there a shell script for this job?

You might want to look at Pandoc; it understands several flavors of Markdown.
pandoc -f markdown -t gfm foobar.md
Having written a markup converter years ago in Perl, I would not want to approach such a task without a decent lexical analyzer, which is a bit beyond shell scripting.
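To apply the Pandoc suggestion to a whole book rather than one file, something like the following could work. It is only a sketch: it assumes pandoc is on the PATH and supports the gfm writer, and the width of 80 columns is a placeholder. The --wrap and --columns options are what control the line width.

```python
import subprocess
from pathlib import Path


def pandoc_command(path, columns=80):
    """Build the pandoc call that prints one file as re-wrapped GFM."""
    return [
        "pandoc",
        "-f", "markdown",
        "-t", "gfm",
        "--wrap=auto",            # re-wrap paragraphs ...
        f"--columns={columns}",   # ... at this line width
        str(path),
    ]


def reformat_tree(root, columns=80):
    """Rewrite every .md file below root in place (assumed layout)."""
    for path in sorted(Path(root).rglob("*.md")):
        result = subprocess.run(
            pandoc_command(path, columns),
            check=True,
            capture_output=True,
            encoding="utf-8",
        )
        # Only overwrite the file once pandoc has succeeded.
        path.write_text(result.stdout, encoding="utf-8")
```

Calling reformat_tree("book") would then process a directory named book (a made-up name). Run it on a clean git checkout so the diff can be reviewed before committing.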

I wrote a tool called tidy-markdown that will reformat any Markdown (including GFM) according to this styleguide.
$ tidy-markdown < ./ugly-markdown.md > ./clean-markdown.md
It handles conversion of inline HTML to Markdown, normalization of syntactic elements like code blocks (converting them to fenced), lists, block-quotes, front-matter, headers, and will even attempt to standardize code-block language identifiers.

Related

How to use pandoc filter to change a RawBlock to RawInline?

I am using pandoc to convert Markdown to HTML. I would like to convert some inline LaTeX environments to, for example, SVG. I can do this for RawBlocks by using a Pandoc filter to transform RawBlock to Para [Image]. But I have a problem:
➜ pandoc -R -t native
A command \foo{bar}. An environment \begin{test} test \end \end{test} appears here.
\begin{rawblock}
test
\end{rawblock}
[Plain [Str "A",Space,Str "command",Space,RawInline (Format "tex") "\\foo{bar}",Str ".",Space,Str "An",Space,Str "environment"]
,RawBlock (Format "latex") "\\begin{test} test \\end \\end{test}"
,Para [Str "appears",Space,Str "here."]
,RawBlock (Format "latex") "\\begin{rawblock}\ntest\n\\end{rawblock}"]
As shown above, an inline environment is also parsed as RawBlock rather than RawInline, so the text after the inline environment becomes a new paragraph.
So my questions are:
Is it feasible to have an inline LaTeX environment parsed as RawInline, the way Pandoc deals with an inline command?
How can this be implemented using a Pandoc filter (preferably in Python)?
Sorry about this not-really-an-answer, but I can't comment yet.
Pandoc has a predefined list of environments that it recognizes as inline. All other environments default to block-level. Since LaTeX is rather lax with its syntax concerning block-level environments, pandoc really has no way to know if a given environment is inline or block-level.
If you really want to use LaTeX environments, you can, but writing a context-sensitive Python filter is not exactly easy (it's somewhat easier with Haskell, but I assume that's not an option).
There is, however, an easier option: use spans instead of inline LaTeX environments and divs instead of block-level ones. Syntax is a bit more clunky, but writing a filter that will work with spans with a given class is relatively simple with any supported language.
pandoc -t native <<< "Replace inline environments with spans: <span class='span-class'>like this</span>"
[Para [Str "Replace",Space,Str "inline",Space,Str "environments",Space,Str "with",Space,Str "spans:",Space,Span ("",["span-class"],[]) [Str "like",Space,Str "this"]]]
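A filter for the span approach might look roughly like this. It is a sketch, not a finished tool: it uses only the standard library and assumes the JSON AST layout of recent pandoc versions, where every node is an object with a "t" tag and an optional "c" payload. The macro name \mymacro is made up for illustration; in practice this is where the span would be turned into an image or whatever else is needed.

```python
import json
import sys

TARGET_CLASS = "span-class"  # class name taken from the example above


def plain_text(inlines):
    """Flatten Str/Space inlines to a string; other inline types are skipped."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] in ("Space", "SoftBreak"):
            parts.append(" ")
    return "".join(parts)


def transform(node):
    """Recursively replace matching Span nodes with a RawInline."""
    if isinstance(node, list):
        return [transform(child) for child in node]
    if isinstance(node, dict):
        if node.get("t") == "Span":
            (_ident, classes, _attrs), inlines = node["c"]
            if TARGET_CLASS in classes:
                # \mymacro is a hypothetical macro name, for illustration only.
                tex = "\\mymacro{" + plain_text(inlines) + "}"
                return {"t": "RawInline", "c": ["tex", tex]}
        return {key: transform(value) for key, value in node.items()}
    return node


def main():
    """Read a pandoc JSON document on stdin, write the result to stdout."""
    json.dump(transform(json.load(sys.stdin)), sys.stdout)
```

To use it, call main() at the bottom of the file and put the script between two pandoc runs, e.g. pandoc -t json input.md | python span_filter.py | pandoc -f json -t latex (the file names are placeholders).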

Markdown syntax checking for continuous integration?

Short story: I'm using Markdown to write a novel. Long story is here. On this site I typo-check the text using a Perl module (which I also developed), but I'd like to check the Markdown syntax too. However, most Markdown tools seem to be too lenient on errors, letting through things like this:
This is an *error
This would be [another error](
Besides, there is no "check-only" option that returns false when there's an error, so that it can be used in continuous integration tests. The only one that balks at this kind of input is maruku. Kramdown, pandoc, marked, markdown (for nodejs), all of them let it through without complaint.
The question is: is there a Markdown syntax validator or checker, in any language, that I can use easily in CI? Or should I go with maruku, despite it being considered obsolete by its authors?
As pointed out in this answer, "it is impossible to write "invalid" markdown only markdown that wont do what you want it to." Every string is valid markdown.
You could, however, define a subset of Markdown that excludes constructs like the examples you mentioned in the question, and modify an existing parser to adhere to that subset.
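As a rough illustration of what such a subset check could look like without touching a parser, here is a line-based sketch. The two rules are my own assumptions, chosen only to catch the two examples from the question: every line must contain an even number of unescaped asterisks, and every "](" must be closed by ")" on the same line. It is deliberately stricter than Markdown itself, so emphasis that spans lines or a "***" rule would be reported too.

```python
import re

CODE_SPAN = re.compile(r"`[^`]*`")
LIST_MARKER = re.compile(r"^\s*\*\s+")
UNESCAPED_STAR = re.compile(r"(?<!\\)\*")
OPEN_LINK = re.compile(r"\]\((?![^)]*\))")


def check_line(line):
    """Return a list of problems found in one line of Markdown."""
    text = CODE_SPAN.sub("", line)      # ignore inline code
    text = LIST_MARKER.sub("", text)    # ignore a leading "* " bullet
    problems = []
    if len(UNESCAPED_STAR.findall(text)) % 2:
        problems.append("unbalanced '*'")
    if OPEN_LINK.search(text):
        problems.append("link target is never closed")
    return problems


def check(text):
    """Return (line_number, problem) pairs for the whole document."""
    return [
        (number, problem)
        for number, line in enumerate(text.splitlines(), start=1)
        for problem in check_line(line)
    ]


def main(paths):
    """Print problems and return a CI-friendly exit status (0 = clean)."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number, problem in check(handle.read()):
                print(f"{path}:{number}: {problem}")
                status = 1
    return status
```

For CI, the script would end with sys.exit(main(sys.argv[1:])) (after importing sys), so that the build fails whenever a problem is printed.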

Where is a list of Markdown tags supported by the redcarpet gem?

Is there a list of the Markdown tags supported by the redcarpet gem?
For example, some markdown implementations support centering text, some don't. Rather than trial and error experimentation, it seems like such a popular gem would be documented somewhere?
I don't think redcarpet is responsible for the markdown - it's simply a renderer; it uses some libraries to interpret the required code
After some research, it seems all of the markdown interpreters are originally based on the UpSkirt library, which was derived from this Daring Fireball project:
Markdown is a text-to-HTML conversion tool for web writers. Markdown
allows you to write using an easy-to-read, easy-to-write plain text
format, then convert it to structurally valid XHTML (or HTML).
Thus, “Markdown” is two things: (1) a plain text formatting syntax;
and (2) a software tool, written in Perl, that converts the plain text
formatting to HTML. See the Syntax page for details pertaining to
Markdown’s formatting syntax. You can try it out, right now, using the
online Dingus.
You can find the syntax here.

Convert HTML and inline Mathjax math to LaTeX with pandoc ruby

I'm building a Rails app and I'm looking for a way to convert database entries with html and inline MathJax math (TeX) to LaTeX for pdf creation.
I found similar questions like mine:
Convert html mathjax to markdown with pandoc
How to convert HTML with mathjax into latex using pandoc?
and I see two options here:
1. Create a Haskell executable that leaves things like \(y=f(x)\) alone when converting HTML to LaTeX.
2. Write a Ruby method that does the following:
   - Take the string and split it into an array with a regex (string.split(regex)).
   - Loop through the resulting array and convert the parts that do not contain inline math to LaTeX with PandocRuby.html(string).to_latex.
   - Concatenate everything back together (array.join).
I would prefer the Ruby method solution because I'm hosting my application on Heroku and I don't like to check binaries into git.
(Note: the pandoc binary is made available this way: http://www.petekeen.net/introduction-to-heroku-buildpacks)
So my question is: what should the regex look like to split the string by \(math\)?
E.g. string can look like this: text \(y=f(x) \iff \log_{10}(b)\) and \(a+b=c\) text
And for the sake of completeness: if the Ruby method is not a viable solution, how should the Haskell script be written to leave \(math\) alone when converting to LaTeX?
Get the very latest version of pandoc (1.12.2). Then you can do
pandoc -f html+tex_math_dollars+tex_math_single_backslash -t latex
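If you still want the split-and-join route from the question, the regex could look like the sketch below. It is written in Python here; the same pattern should carry over to Ruby's String#split, which also keeps captured delimiters in the result. The convert argument is a stand-in for the real conversion call (PandocRuby.html(string).to_latex in the question). One caveat: converting fragments separately can go wrong when an HTML tag opens before a formula and closes after it.

```python
import re

# Capture group so re.split keeps the math segments in the result.
INLINE_MATH = re.compile(r"(\\\(.*?\\\))", re.DOTALL)


def convert_outside_math(text, convert):
    """Apply convert() to every segment that is not inline math."""
    parts = INLINE_MATH.split(text)
    # With one capture group, odd indices are the math segments.
    return "".join(
        part if index % 2 else convert(part)
        for index, part in enumerate(parts)
    )
```

The non-greedy .*? stops at the first closing \), so several formulas on one line are split separately.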

Markdown to plain text in Ruby?

I'm currently using BlueCloth to process Markdown in Ruby and show it as HTML, but in one location I need it as plain text (without some of the Markdown). Is there a way to achieve that?
Is there a markdown-to-plain-text method? Or is there an html-to-plain-text method that I could feed the result of BlueCloth into?
RedCarpet gem has a Redcarpet::Render::StripDown renderer which "turns Markdown into plaintext".
Copy and modify it to suit your needs.
Or use it like this:
Redcarpet::Markdown.new(Redcarpet::Render::StripDown).render(markdown)
Converting HTML to plain text with Ruby is not a problem, but of course you'll lose all markup. If you only want to get rid of some of the Markdown syntax, it probably won't yield the result you're looking for.
The bottom line is that unrendered Markdown is intended to be used as plain text, therefore converting it to plain text doesn't really make sense. All Ruby implementations that I have seen follow the same interface, which does not offer a way to strip syntax (only including to_html, and text, which returns the original Markdown text).
It's not ruby, but one of the formats Pandoc now writes is 'plain'. Here's some arbitrary markdown:
# My Great Work
## First Section
Here we discuss my difficulties with [Markdown](http://wikipedia.org/Markdown)
## Second Section
We begin with a quote:
> We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's *all*.
(Not sure how to turn off the syntax highlighting!) Here's the associated 'plain':
My Great Work
=============
First Section
-------------
Here we discuss my difficulties with Markdown
Second Section
--------------
We begin with a quote:
We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's all.
You can get an idea of what it does with the different elements it parses out of documents from the definition of plainify in pandoc/blob/master/src/Text/Pandoc/Writers/Markdown.hs in the GitHub repository; there is also a tutorial that shows how easy it is to modify the behavior.

Resources