ruby markdown parser with WikiWord support?

I am using git-wiki for my personal note storage. It works very well, except that WikiWords are converted to links before the Markdown parsing stage, using a regular expression. This messes up scores of things, for instance links that point to pages outside the wiki, or block quotes (if I am quoting something, I do not want a WikiWord to be changed into a link).
Are there ruby-based Markdown parsers that understand WikiLinks?

The best parser around is the C-based one (Upskirt/Sundown), whose Ruby binding is Redcarpet:
https://github.com/tanoku/redcarpet
It is preferable for performance and security reasons.
For the wiki links, pre-process them before sending your text to the Markdown parser.
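A minimal sketch of that pre-processing step, assuming Redcarpet and a plain CamelCase WikiWord convention (the link target format, and the idea of skipping indented code and quoted lines, are assumptions to adapt to your wiki):

require 'redcarpet'

# Hypothetical pre-processor: turn bare WikiWords into Markdown links,
# but leave indented code blocks and block quotes untouched.
def wikify(text)
  text.each_line.map do |line|
    next line if line =~ /\A(?: {4}|\t|>)/   # skip code blocks and quotes
    line.gsub(/\b([A-Z][a-z]+(?:[A-Z][a-z]+)+)\b/) { "[#{$1}](/#{$1})" }
  end.join
end

markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML.new)
source = "See MyNotesPage for details.\n\n> Quoting SomeBook here.\n"
puts markdown.render(wikify(source))

This keeps the WikiWord conversion out of quoted text, and because it runs before Redcarpet, the generated links go through the normal Markdown pipeline instead of fighting it.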

Related

Where is a list of markdown tags supported by the redcarpet gem?

Is there a list of the markdown tags supported by the redcarpet gem?
For example, some markdown implementations support centering text, some don't. Rather than trial and error experimentation, it seems like such a popular gem would be documented somewhere?
I don't think redcarpet is responsible for the Markdown itself: it's simply a renderer, and it uses underlying libraries to interpret the markup.
After some research, it seems all of these Markdown interpreters are originally based on the Upskirt library, which was derived from this Daring Fireball project:
Markdown is a text-to-HTML conversion tool for web writers. Markdown
allows you to write using an easy-to-read, easy-to-write plain text
format, then convert it to structurally valid XHTML (or HTML).
Thus, “Markdown” is two things: (1) a plain text formatting syntax;
and (2) a software tool, written in Perl, that converts the plain text
formatting to HTML. See the Syntax page for details pertaining to
Markdown’s formatting syntax. You can try it out, right now, using the
online Dingus.
You can find the syntax here.
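As a practical aside, Redcarpet's syntax extensions are enabled by flags when you build the parser object, which doubles as a rough list of what it supports (the flag names below are Redcarpet's documented options):

require 'redcarpet'

renderer = Redcarpet::Render::HTML.new(hard_wrap: true)
markdown = Redcarpet::Markdown.new(renderer,
  autolink: true,             # turn bare URLs into links
  tables: true,               # PHP-Markdown-style tables
  fenced_code_blocks: true,   # ``` code fences
  strikethrough: true,        # ~~deleted~~
  superscript: true)          # 2^10

puts markdown.render("A ~~simple~~ *quick* test: http://example.com")

Centering text, as asked about above, is not among these: that's usually done by embedding raw HTML in the Markdown.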

Is there a Sphinx option to prevent parsing of URLs globally?

related to: How do I prevent sphinx from making a url a hyperlink?
In the question above we learn how to escape individual URLs in reStructuredText to prevent Sphinx from turning them into hyperlinks when converting to HTML. However, I have a lot of URLs and I would like to keep my .rst files as clean as possible. It is API documentation, so adding backslashes or quotes makes it less readable. Is there a config option to prevent Sphinx from parsing URLs altogether?
Unfortunately, I don't think there's any easy way. Implicitly detecting URIs happens at the reST-parsing layer:
https://github.com/qsnake/docutils/blob/68af50cccd2c8bb88264bffad44faa8e47e5d7dc/docutils/parsers/rst/states.py#L627
Sphinx is a set of predefined domains & related tooling on top of docutils' reST implementation, so this is lower-level than the things it provides config options for.
There might be some way of getting the HTML writer to not emit <a> tags on the output side of things, but my guess is that even if it's possible, it's likely to be pretty involved.

Using a modified Nokogiri to parse Wikitext?

Apologies for the length of this question, it's more of a "is this possible" than "how do I do this".
My objective is to remove everything but plain text from Wikipedia markup -- tables, templates, formatting. Whether these are in wikitext markup (e.g. ''bold text'') or HTML (<b>bold text</b>).
Wikipedia text is a mix of custom tags (templates {{ ... }}, tables {| ... |}, links [[ ... ]]) and HTML elements. Parsing it is kind of a nightmare. You can't use regular expressions because the tags can be nested, and since it can contain HTML, almost anything is possible. Some of the text within the HTML I'd want to keep (e.g. the contents of bold text), but other things like tables would need to be stripped entirely.
I thought about re-purposing an XML parser like Nokogiri, adding {{/}} as alternatives to <x>/</x>.
Does anyone who knows Nokogiri (or another Ruby XML parser) know if this is possible or even a good idea?
My alternative is to repurpose an existing parser like WikiCloth for the wiki markup, and then try to remove any leftover HTML via another method.
This sounds like a good idea. However, it would not be possible for you to 'patch' Nokogiri, "adding {{/}} as alternatives to <x>/</x>". This is because the bulk of the work done by Nokogiri (parsing, XPath, and generating the string representation of a DOM) is actually done by libxml2 in the back end. You'd have to patch and recompile libxml2 (and then rebuild Nokogiri against your new version)… but at that point I have no idea how Nokogiri would behave.
You might have better luck trying to patch REXML, since that is written in pure Ruby.
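To illustrate the alternative from the question, here's a rough, untested sketch of the two-pass approach: WikiCloth for the wiki markup, then Nokogiri to reduce the resulting HTML to plain text (the sample input and the choice of which elements to strip are assumptions):

require 'wikicloth'
require 'nokogiri'

wikitext = "''Hello'' [[World]] and some <b>HTML</b>."

# Pass 1: let WikiCloth render the MediaWiki markup to HTML.
html = WikiCloth::Parser.new(:data => wikitext).to_html

# Pass 2: parse the HTML with Nokogiri, drop unwanted subtrees,
# and keep only the text content.
doc = Nokogiri::HTML.fragment(html)
doc.css('table').each(&:remove)   # strip tables entirely
plain = doc.text.squeeze(" \n").strip

puts plain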

Partial Markdown parsing

I have an application that needs to parse a subset of Markdown. I basically only want to support inline elements (bold, italic, links, etc), not block level elements (p, h1, h2, etc).
There are a lot of different libraries, so I need some help narrowing it down (and a code sample would be helpful). I started using Redcarpet until I realized that I can't specify which elements I want to parse.
What Ruby Markdown library can I use to achieve this?
I haven't found a library that allows you to specify, at a granular level, which parts of Markdown syntax are allowed. RDiscount has some configurability; however, its options don't cover block-level elements.
You could also give Sanitize a try (I know, parsing twice isn't exactly an ideal solution) and strip out the elements you don't want afterward.
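One workaround worth sketching: Redcarpet lets you subclass its renderer and override the block-level callbacks so they pass their contents through unwrapped, while inline formatting still works. A minimal, untested sketch (header, paragraph and block_quote are Redcarpet's documented renderer callbacks; override whichever block elements you need):

require 'redcarpet'

# Render inline Markdown only: block-level callbacks return their
# contents without the usual wrapping tags.
class InlineOnly < Redcarpet::Render::HTML
  def header(text, header_level)
    text
  end

  def paragraph(text)
    text
  end

  def block_quote(content)
    content
  end
end

markdown = Redcarpet::Markdown.new(InlineOnly.new)
puts markdown.render("# Hi\n\nSome *emphasis* and a [link](http://example.com).")
# headers and paragraphs come through without <h1>/<p> wrappers,
# while <em> and <a> are still rendered

This doesn't truly disable block parsing (a heading is still recognized as a heading), but it lets you control what ends up in the output without a second sanitizing pass.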

Extracting a Hostname's TLD with a Regular Expression

Extracting an accurate representation of the top-level domain of a hostname is complicated by the fact that each top-level domain registry is free to make up its own policies regarding how domains are issued and what subdomains are defined. As there doesn't appear to be any standards body coordinating these policies, determining the actual TLD is a somewhat complicated affair.
Since web browsers assign cookies only to registered domains, and for security reasons must be vigilant about ensuring cookies cannot be assigned on a broader level, these browsers typically contain a database of all known TLDs in some form. I've found that Firefox has a fairly complete database:
http://hg.mozilla.org/mozilla-central/raw-file/3f91606bd115/netwerk/dns/effective_tld_names.dat
I have two specific questions:
Although it is fairly trivial to convert this listing into a regular expression, is there a gem or reference regexp that's a better solution than rolling your own? The tld gem only provides country-level info for the root-level domain.
Is there a better reference than the Firefox TLD listing? All of the local Google sites are correctly parsed by this specification, but that's hardly an exhaustive test.
If there's nothing out there, is anyone interested in a gem that performs this kind of operation? This sort of thing should be present in the URI module but is apparently missing.
Here's my take on converting this file into a usable Regexp in Ruby:
TLD_SPEC = Regexp.new(
  '[^\.]+\.(' + %q[
    // ***** BEGIN LICENSE BLOCK *****
    // ... (Rest of file)
  ].split(/\n/).collect do |line|
    line.sub(%r[//.*], '').strip          # drop comments and surrounding whitespace
  end.reject(&:empty?).collect do |s|     # empty?, not ActiveSupport's blank?
    Regexp.escape(s).sub(/^\\\*\\\./, '[^\.]+\.')   # turn '*.foo' wildcards into a pattern
  end.join('|') + ')$'
)
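For example, assuming the full file body is pasted into the %q literal above, and co.uk ends up in the alternation (directly or via the *.uk wildcard), matching looks like this:

if (m = TLD_SPEC.match('www.bbc.co.uk'))
  m[1]   # => "co.uk", the effective TLD
end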
You might want to look into using Addressable to see if that has what you need. It's got a lot more features than Ruby's default URI library. In particular, its template ability might help you.
From the docs:
Addressable is a replacement for the URI implementation that is part of Ruby's standard library. It more closely conforms to the relevant RFCs and adds extensive support for IRIs and URI templates.
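For host extraction specifically, usage is straightforward (Addressable::URI.parse and #host are part of its documented API; feeding the host into a TLD regexp like the one above is my own suggestion):

require 'addressable/uri'

uri = Addressable::URI.parse('http://www.bbc.co.uk/news')
uri.host                       # => "www.bbc.co.uk"
TLD_SPEC.match(uri.host)[1]    # => "co.uk", using the regexp built earlier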
With the recent opening of the new TLDs, it's going to be a nightmare for a while. Check out the related list to the right to see how many people are trying to find a solution. Regex to match Domain.CCTLD recommends using a function to break it down into smaller steps and is what I'd do. Trying to do this with a regex assumes you can do it all in one expression, which starts to smell like using regex to parse XML or HTML. The target is too wiggly for a single pattern, or at least for a single maintainable pattern.
That answer mentions the public TLD list. Using the information there you could quickly use Ruby's Regexp.escape and Regexp.union methods to build a reasonably good regex on the fly. It'd be nice if we had Perl's Regexp::Assemble module available to us, but we don't so union will have to do. (See "Is there an efficient way to perform hundreds of text substitutions in Ruby?" for a way to work around this.)
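A minimal sketch of that on-the-fly approach, assuming the public suffix list has been saved locally as public_suffix_list.dat (the filename is an assumption, and wildcard/exception rules such as *.ck and !www.ck are skipped here for brevity):

# Regexp.union escapes each rule string for us.
rules = File.readlines('public_suffix_list.dat')
            .map(&:strip)
            .reject { |l| l.empty? || l.start_with?('//', '*', '!') }

TLD_RE = /(?:\A|\.)(#{Regexp.union(rules)})\z/

def effective_tld(host)
  host[TLD_RE, 1]
end

effective_tld('www.bbc.co.uk')   # => "co.uk" (if "co.uk" is a listed rule)

Because matching scans left to right, a longer listed suffix of the host starts earlier in the string and therefore wins over a shorter one, which is the behaviour you want.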
There is another flat-file DB at http://guava-libraries.googlecode.com/svn-history/r42/trunk/src/com/google/common/net/TldPatterns.java
Perhaps you could combine the two and upload the result somewhere like OData.org, GitHub, SourceForge, etc.
There's a gem called public-suffix-list which provides access to a more formalized version of the Mozilla listing.
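If you go the gem route, usage generally looks like the following; this example uses the public_suffix gem (a similar, widely used library), since I can't vouch for public-suffix-list's exact interface:

require 'public_suffix'

domain = PublicSuffix.parse('www.bbc.co.uk')
domain.tld      # => "co.uk"
domain.domain   # => "bbc.co.uk"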
