Markdown to plain text in Ruby? - ruby

I'm currently using BlueCloth to process Markdown in Ruby and show it as HTML, but in one location I need it as plain text (without some of the Markdown). Is there a way to achieve that?
Is there a markdown-to-plain-text method? Is there an html-to-plain-text method that I could feed the result of BlueCloth to?

The Redcarpet gem has a Redcarpet::Render::StripDown renderer which "turns Markdown into plaintext".
Copy and modify it to suit your needs.
Or use it like this:
Redcarpet::Markdown.new(Redcarpet::Render::StripDown).render(markdown)

Converting HTML to plain text with Ruby is not a problem, but of course you'll lose all markup. If you only want to get rid of some of the Markdown syntax, it probably won't yield the result you're looking for.
The bottom line is that unrendered Markdown is intended to be used as plain text, therefore converting it to plain text doesn't really make sense. All Ruby implementations that I have seen follow the same interface, which does not offer a way to strip syntax (only including to_html, and text, which returns the original Markdown text).
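To illustrate the HTML-to-plain-text route mentioned above, here is a minimal sketch that uses only the standard library. The helper name html_to_plain_text is made up for this example, and the regex approach assumes simple, well-formed HTML such as a Markdown renderer's output. For anything messier, a real HTML parser is the safer choice.

```ruby
require "cgi"

# Naive HTML-to-text helper (hypothetical name, not part of BlueCloth).
# It drops tags, decodes entities, and squeezes whitespace. It does not
# treat <script>, <style>, or comments specially.
def html_to_plain_text(html)
  text = html.gsub(/<[^>]*>/, " ")   # replace every tag with a space
  text = CGI.unescapeHTML(text)      # &amp; -> &, &lt; -> <, ...
  text.gsub(/\s+/, " ").strip        # collapse runs of whitespace
end

puts html_to_plain_text("<h1>Title</h1>\n<p>Some <em>emphasised</em> text &amp; more.</p>")
```

Because every tag becomes a space, punctuation that directly follows a closing tag ends up with a space in front of it, which is one reason this is only a rough approach.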

It's not Ruby, but one of the formats Pandoc now writes is 'plain'. Here's some arbitrary markdown:
# My Great Work
## First Section
Here we discuss my difficulties with [Markdown](http://wikipedia.org/Markdown)
## Second Section
We begin with a quote:
> We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's *all*.
(Not sure how to turn off the syntax highlighting!) Here's the associated 'plain':
My Great Work
=============
First Section
-------------
Here we discuss my difficulties with Markdown
Second Section
--------------
We begin with a quote:
We hold these truths to be self-evident ...
then some code:
#! /usr/bin/bash
That's all.
You can get an idea what it does with the different elements it parses out of documents from the definition of plainify in pandoc/blob/master/src/Text/Pandoc/Writers/Markdown.hs in the Github repository; there is also a tutorial that shows how easy it is to modify the behavior.
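Since the question is about Ruby, one way to use this from a Ruby program is to shell out to the pandoc binary. This is only a sketch: it assumes pandoc is installed and on the PATH, and the method name markdown_to_plain and the PANDOC_PLAIN constant are invented for the example.

```ruby
require "open3"

# Command line for Pandoc's 'plain' writer, reading Markdown from stdin.
PANDOC_PLAIN = ["pandoc", "-f", "markdown", "-t", "plain"].freeze

# Pipes `markdown` through an external command and returns its stdout.
# The `command:` parameter is an assumption of this sketch so that the
# program can be swapped out; it defaults to pandoc.
def markdown_to_plain(markdown, command: PANDOC_PLAIN)
  output, status = Open3.capture2(*command, stdin_data: markdown)
  raise "#{command.first} exited with status #{status.exitstatus}" unless status.success?
  output
end

# Usage, assuming pandoc is available:
#   puts markdown_to_plain(File.read("input.md"))
```

Starting a process per document is slow compared to an in-process library, so this probably suits batch jobs better than per-request rendering.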

Related

Reformat Markdown files to a specific code style

I'm working on a book which had a couple of people writing and editing the text. Everything is Markdown. Unfortunately, there is a mix of different styles and lines widths. Technically this isn't a problem but it's not nice in terms of aesthetics.
What is the best way to reformat those files in e.g. GitHub markdown style? Is there a shell script for this job?
You might want to look at Pandoc; it understands several flavors of Markdown.
pandoc -f markdown -t gfm foobar.md
Having written a markup converter years ago in Perl, I would not want to approach such a task without a decent lexical analyzer, which is a bit beyond shell scripting.
I wrote a tool called tidy-markdown that will reformat any Markdown (including GFM) according to this styleguide.
$ tidy-markdown < ./ugly-markdown.md > ./clean-markdown.md
It handles conversion of inline HTML to Markdown, normalization of syntactic elements like code blocks (converting them to fenced), lists, block-quotes, front-matter, headers, and will even attempt to standardize code-block language identifiers.

How can I represent a space at the end of a span of code in markdown?

I'm trying to represent a short inline span of code with a significant space or two at the end, using Markdown. If I were to put it in a stand-alone code block, it might look like this:
cd
Frustratingly, Markdown transforms `cd ` into <code>cd</code>, deleting the space at the end. How can I do it?
Unfortunately, I'm not aware of a way to do this with backticks in vanilla Markdown. If you need to use your current parser, you can use literal <code> HTML tags to generate correct output:
Normal text. <code>Inline code block with a space: </code> Normal text.
If you're able and interested, switching to a more consistent/expanded parser (such as kramdown, which I've tested, but potentially also MultiMarkdown and others) correctly interprets terminal spaces in code blocks and doesn't truncate them.
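If the Markdown source is generated by a program, the literal <code> workaround can be wrapped in a small helper. This is a sketch in Ruby; inline_code is a hypothetical name, and the escaping step is an addition so that characters such as < inside the code do not turn into markup.

```ruby
require "cgi"

# Hypothetical helper: wraps source text in literal <code> tags so that
# leading and trailing spaces are kept exactly as written.
def inline_code(source)
  "<code>#{CGI.escapeHTML(source)}</code>"
end

puts "Type #{inline_code('cd ')}followed by a directory name."
```

The result is raw inline HTML, so it only helps with Markdown processors that pass inline HTML through untouched.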
I don't believe Markdown has anything for that specifically, but if you use &nbsp; it will add a non-breaking space. In your case replace
`cd `
with
`cd&nbsp;`
and it should work as intended.

Markdown syntax checking for continuous integration?

Short story: I'm using Markdown to write a novel. Long story is here. On this site I typo-check the text using a Perl module (which I also developed), but I'd like to check MD syntax too. However, most Markdown tools seem to be too lenient on errors, letting through stuff like this:
This is an *error
This would be [another error](
Besides, there is no "check-only" option that returns false when there's an error, so that it can be used in continuous integration tests. The only one that balks at this stuff is maruku. Kramdown, pandoc, marked, markdown (for nodejs), all of them let it go without a glitch.
Question is, is there a Markdown syntax validator or checker in any language that I can use easily in CI? Or should I go with maruku, despite it being considered obsolete by its authors?
As pointed out in this answer, "it is impossible to write "invalid" markdown only markdown that wont do what you want it to." Every string is valid markdown.
You could, however, define a subset of Markdown that excludes constructs like the examples you mentioned in the question, and modify an existing parser to adhere to that subset.
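As a rough illustration of the "strict subset" idea, the sketch below does not modify a parser. It is a standalone line-based check, written in Ruby, that flags only the two cases from the question. The rules (an odd number of unescaped * on a line, or a ]( with no closing parenthesis on the same line) are assumptions of this example, not part of any Markdown specification, and they will misfire on things like * inside code spans or emphasis that spans lines.

```ruby
# Returns a list of "line N: message" strings for lines that break the
# (made-up) strict rules described above. An empty list means "clean".
def markdown_problems(text)
  problems = []
  text.each_line.with_index(1) do |line, number|
    body = line.chomp.sub(/\A\s*[*+-]\s+/, "")  # ignore a leading list bullet
    stars = body.gsub(/\\\*/, "").count("*")     # do not count escaped \*
    problems << "line #{number}: unbalanced '*'" if stars.odd?
    problems << "line #{number}: unclosed link target" if body.match?(/\]\([^)]*\z/)
  end
  problems
end

sample = "This is an *error\nThis would be [another error](\nThis is *fine*.\n"
puts markdown_problems(sample)
```

In a CI job, a wrapper script could print the returned list and exit with a non-zero status whenever it is non-empty, which gives the "check-only" behaviour asked about.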

Syntax Highlighting in Sublime Text 2

So I have been trying to figure out how to add syntax highlighting for the name of typedef's in c++ files, in sublime text.
For example, if I have typedef long long integer; I want integer to be highlighted (preferably the same color as the other types: int, bool, etc.). I went and looked at the C.tmLanguage file, and tried to add the following regex code ^typedef.*?\s(\w+)\s*; to storage.type.c (line 49), but it didn't work. If I add the word string, it will highlight all instances of the word string. I tried going into the C++.tmLanguage file and adding the regex code to storage.type.c++, but it still did not work.
Does anybody know how to get typedef's highlighted in sublime text?
Also, is there a way to get syntax highlighting for class name? Let's say I declare a string or vector, I would like either string or vector to be highlighted.
That regex would work (I believe) if you had something along the lines of typedef foo; To get the behavior you want, you will have to create a slightly more complex pattern entry in the tmLanguage file. As the language file is based on TextMate's, you will want to have this as a reference (http://manual.macromates.com/en/language_grammars#language_grammars). I would also recommend using PlistJsonConverter (working in JSON is easier for me than working in XML). You will probably need to define begin and end patterns (begin will probably be typedef, end will probably be ;). You can then apply whatever patterns you want to that group.
As for the class name highlighting, I would look to see what, if any scopes are being applied. If none are, you will have to come up with a regex to apply the scope to those. You can then add a color entry, or use a defined one from the color scheme.
Edit:
Actually they don't appear to be JSON. I see () rather than []. JSON is pretty simple to understand. You can look for something more in depth, but wikipedia is a good place to start. What you would probably be interested in are the things under the "Rule Keys" section. I did some searching (because I knew there were some better examples out there), and came across http://docs.sublimetext.info/en/latest/extensibility/syntaxdefs.html . It goes over syntax definitions from scratch, but the most relevant section is probably http://docs.sublimetext.info/en/latest/extensibility/syntaxdefs.html#analyzing-patterns. I don't have a regex to find class names, so you would have to come up with one yourself. If you haven't already though, you may want to search around to see if someone else has implemented a language file in a way that works for you.
You will want to start with the built-in tmLanguage file and convert that from a Plist to JSON. You can then edit that file and convert it back.
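Before editing the grammar, it can help to check the pattern from the question outside the editor. The sketch below does that in Ruby, whose regex engine (Onigmo) is a fork of the Oniguruma engine that TextMate grammars use, so behaviour should be close but is not guaranteed to be identical. The constant name TYPEDEF_NAME is made up for the example.

```ruby
# The pattern from the question, unchanged.
TYPEDEF_NAME = /^typedef.*?\s(\w+)\s*;/

["typedef long long integer;", "typedef foo;"].each do |declaration|
  match = TYPEDEF_NAME.match(declaration)
  # Group 1 is the part the question wants highlighted.
  puts "#{declaration.inspect} -> #{match && match[1].inspect}"
end
```

If group 1 comes out as expected here, the remaining work is in the grammar rule itself: per the TextMate manual linked above, a rule's name applies to the whole match, and a captures entry is what assigns a scope to an individual group.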

Using a modified Nokogiri to parse Wikitext?

Apologies for the length of this question, it's more of a "is this possible" than "how do I do this".
My objective is to remove everything but plain text from Wikipedia markup -- tables, templates, formatting. Whether these are in wikitext markup (e.g. ''bold text'') or HTML (<b>bold text</b>).
Wikipedia text is a mix of custom tags: templates {{ ... }}, tables {| ... |}, links [[ ... ]] and HTML elements. Parsing it is kind of a nightmare. You can't use regular expressions because the tags can be nested, and it can contain HTML so almost anything is possible. Some of the text within the HTML I'd want to keep (stuff within bold text), but other things like tables would need to be stripped entirely.
I thought about re-purposing an XML parser like Nokogiri, adding {{/}} as alternatives to <x>/</x>.
Does anyone who knows Nokogiri (or another Ruby XML parser) know if this is possible or even a good idea?
My alternative is to repurpose an existing parser like WikiCloth for the wiki markup, and then try to remove any leftover HTML via another method.
This sounds like a good idea. However, it would not be possible for you to 'patch' Nokogiri, "adding {{/}} as alternatives to <x>/</x>". This is because the bulk of the work done by Nokogiri—parsing and XPath and generating the string representation of a DOM—is actually done by libxml2 in the back end. You'd have to patch and recompile libxml2 (and then rebuild Nokogiri against your new version)…but at that point I have no idea how Nokogiri would behave.
You might have better luck trying to patch REXML, since that is written in pure Ruby.
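For the nesting problem specifically, a full XML parser may not be needed. A small scanner that counts open delimiters can drop nested {{ ... }} templates. This is a sketch only: strip_templates is a hypothetical helper, it handles templates and nothing else (tables, links, and HTML would each need their own handling), and it makes no attempt to match everything MediaWiki accepts.

```ruby
require "strscan"

# Removes {{ ... }} templates, including nested ones, by tracking how many
# templates are currently open. Text is kept only while the depth is zero.
def strip_templates(wikitext)
  scanner = StringScanner.new(wikitext)
  depth = 0
  output = +""
  until scanner.eos?
    if scanner.scan(/\{\{/)
      depth += 1                       # entering a (possibly nested) template
    elsif depth > 0 && scanner.scan(/\}\}/)
      depth -= 1                       # leaving the innermost open template
    else
      char = scanner.getch
      output << char if depth.zero?    # keep text outside all templates
    end
  end
  output
end

puts strip_templates("Paris {{cite|{{nested|x}}|y}}is a city.")
```

A stray }} outside any template is kept as ordinary text, and an unclosed {{ swallows everything after it. Both choices are arbitrary here and could be changed.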

Resources