Parse Markdown that is separated by `#` with regex pattern - ruby

I'd like to parse Markdown that is separated by # (single hash).
I've been trying to do that with Ruby.
The code below outputs ["# Bob's markdown header 1\n\nsomething here.\n\n", "#", "# kitty's header 1\n\nmeow.\n\n"]
p \
arrayobj = <<-EOS.scan(/^#[^#]*/m)
# Bob's markdown header 1
something here.
## this is markdown header 2
yeah.
# kitty's header 1
meow.
EOS
However, what I wanted is the following:
["# Bob's markdown header 1\n\nsomething here.\n\n## this is markdown header 2\n\nyeah.\n\n", "# kitty's header 1\n\nmeow.\n\n"]
In that case, how do you parse the Markdown?

You may match a line starting with a # that is not followed by another #, and then any number of subsequent lines that do not start with such a standalone # char:
.scan(/^#(?!#).*(?:\R(?!#(?!#)).*)*/)
See the Ruby demo online.
Pattern details
^ - start of a line
#(?!#) - a # not followed by another #
.* - the rest of the line
(?:\R(?!#(?!#)).*)* - zero or more consecutive occurrences of:
\R(?!#(?!#)) - any line break sequence (use \n for old Ruby versions) that is not followed by a standalone #
.* - the rest of the line.
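As a minimal sketch of the pattern above, the snippet below runs it on a shortened version of the sample text from the question (blank lines left out for brevity). The variable names text and sections are only illustrative.

```ruby
# Sample input modelled on the question, without the blank lines.
text = "# Bob's markdown header 1\n" \
       "something here.\n" \
       "## this is markdown header 2\n" \
       "yeah.\n" \
       "# kitty's header 1\n" \
       "meow.\n"

# Each match starts at a single-# header and runs until the line
# before the next single-# header (or the end of the string).
sections = text.scan(/^#(?!#).*(?:\R(?!#(?!#)).*)*/)

sections.each { |section| p section }
```

scan should return two strings here, with the ## header kept inside the first one.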

Related

How to use gsub to simplify regular expressions

I would like to escape the # with \ when they appear in \href commands.
Normally I would write a regex such as s/(\\href\{.*?)#(.*?)\}/\1\\#\2/g, but I imagine gsub would be a good choice here to first extract the \href content and then replace # with \#.
Here is some text with a \href{./file.pdf#section.1.5}{link} to section 1.5.
There can be multiple links in one line.
Question
Can gsub simplify these sorts of problems?
Unless one of the URLs inside \href{..} contains a password part enclosed in quotes, like http://username:"sdkfj#lkn#"#domainname.org/path/file.ext, the only possible place for a # in a URL is near the end, where it delimits the fragment part: ./path/path/file.rb?val=toto#thefragmentpart.
In other words, if I am not mistaken, there is at most one # to escape per \href{...}. In that case you can simply do this:
text.gsub(/\\href{[^#}]*\K#/, "\\#")
The character class [^#}] forbids the character } and ensures that you are always between curly brackets.
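A small sketch of this one-pass approach on a line with two links, assuming each \href{...} contains at most one #. The sample text and the variable names are made up for illustration, and the { is escaped in the regex purely for readability.

```ruby
# Hypothetical input: two \href links on one line plus a bare # outside any link.
text = 'See \href{./a.pdf#section.1.5}{first} and \href{./b.pdf#page.2}{second}; a bare # stays.'

# \K drops everything matched so far from the match, so only the # is replaced.
# "\\#" is a two-character string: a backslash followed by #.
escaped = text.gsub(/\\href\{[^#}]*\K#/, "\\#")

puts escaped
```

Only the # characters inside the first braces of each \href are escaped; the bare # at the end of the line is left alone.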
You could use two gsubs: one with a pattern and a block (to find each href{...}), and one with two arguments (to replace # with \#):
text = %q(Here is some text with a \href{./file.pdf#section.1.5}{link} to section 1.5.)
puts text.gsub(/href{[^}]+}/){ |href| href.gsub('#', '\#') }
#=> Here is some text with a \href{./file.pdf\#section.1.5}{link} to section 1.5.
If you want to launch it from a terminal with ruby -e for a test.txt file, you can use:
ruby -pe '$_.gsub!(/href{[^}]+}/){ |href| href.gsub(%q|#|, %q|\#|) }' test.txt
# Here is some text with a \href{./file.pdf\#section.1.5}{link} to section 1.5.
# Here is some text with a \href{./file.pdf\#section.1.6}{link} to section 1.6.
# Here is some text with a \href{./file.pdf\#section.1.7}{link} to section 1.7.
or
ruby -e 'puts ARGF.read.gsub(/href{[^}]+}/){ |href| href.gsub(%q|#|, %q|\#|) }' test.txt
# Here is some text with a \href{./file.pdf\#section.1.5}{link} to section 1.5.
# Here is some text with a \href{./file.pdf\#section.1.6}{link} to section 1.6.
# Here is some text with a \href{./file.pdf\#section.1.7}{link} to section 1.7.
Do not mix ruby -pe and ARGF.read, it would only read the first line of your file!

Ruby advanced gsub

I've got a string like this one below:
My first LINK
and my second LINK
How do I substitute all the links in this string from href="URL" to href="/redirect?url=URL" so that it becomes
My first LINK
and my second LINK
Thanks!
Given your case, we can construct the following regex:
re = /
href= # Match attribute we are looking for
[\'"]? # Optionally match opening single or double quote
\K # Forget previous matches, as we don't really need them
([^\'" >]+) # Capture group of characters except quotes, space and close bracket
/x
Now you can replace the captured group with the string you need (use \1 to refer to the group):
str.gsub(re, '/redirect?url=\1')
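To see the substitution end to end, here is a hedged sketch. The HTML string is invented for illustration (the original markup is not shown above), and the regex is the same one written on a single line without the x flag.

```ruby
# Hypothetical input; the URLs and anchor markup are assumptions for this example.
html = %q(<a href="http://example.com/a">My first LINK</a> and <a href="http://example.org/b">my second LINK</a>)

# Same pattern as above: \K keeps href= and the opening quote out of the match,
# so only the URL itself is replaced.
re = /href=['"]?\K([^'" >]+)/

result = html.gsub(re, '/redirect?url=\1')
puts result
# <a href="/redirect?url=http://example.com/a">My first LINK</a> and <a href="/redirect?url=http://example.org/b">my second LINK</a>
```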
gsub allows you to match regex patterns and use captured substrings in the substitution:
x = <<-EOS
My first LINK
and my second LINK
EOS
x.gsub(/"(.*)"/, '"/redirect?url=\1"') # the \1 refers to the stuff captured
# by the (.*)

How to add line break in rdoc

I have a code and comments for RDoc like this:
# first line comment
# second line comment
def foo
end
When I generate the documentation with rdoc foo.rb, the line break is ignored in the HTML file.
To add a line break I can write:
# first line comment<br>
# second line comment
or
# first line comment
#
# second line comment
but I feel neither way is simple enough.
Is there another, simpler way to add a line break in RDoc?
Just add two or more spaces to the end of the line and it will work.
#first comment
#second comment
def foo
end
The first line has two spaces after the comment text.
An additional answer, about adding a line break in ordinary text:
Make a heading that consists only of whitespace:
 "===  " # <= the quoted part
At least this works on GitHub.

Remove YAML header from markdown file

How to remove a YAML header like this one from a text file in Ruby:
---
date: 2013-02-02 11:22:33
title: "Some Title"
Foo: Bar
...
---
(The YAML is surrounded by three dashes (-))
I tried
text.gsub(/---(.*)---/, '') # text is the variable which contains the full text of the file
but it didn't work.
The greedy pattern /---(.|\n)*---/ proposed in another answer will match from the first occurrence of --- to the last occurrence of --- and everything in between. That means if --- appears later on in your file, you'll strip out not only the header but also some of the rest of the content.
This regex will only remove the yaml header:
/\A---(.|\n)*?---/
The \A ensures that it starts matching against the very first instance of --- and the ? makes the * be non-greedy, which makes it stop matching at the second instance of ---.
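The difference between the greedy and the non-greedy pattern can be seen in a short sketch. The sample document is made up, and sub is used instead of gsub on the assumption that there is only one header to remove.

```ruby
# Hypothetical file content: a YAML header, then a body that itself contains ---.
text = <<~EOS
  ---
  date: 2013-02-02 11:22:33
  title: "Some Title"
  ---
  Body text.

  ---

  More body text after a horizontal rule.
EOS

# Greedy: runs from the first --- to the LAST ---, eating part of the body.
greedy = text.sub(/---(.|\n)*---/, '')

# Anchored and non-greedy: stops at the first closing ---.
lazy = text.sub(/\A---(.|\n)*?---/, '')

puts lazy
```

With the non-greedy version, "Body text." and the later --- line survive; with the greedy one they are removed along with the header.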
Found a solution; the regex should be:
/---(.|\n)*---/

How can I match a URL but exclude terminators from the match?

I want to match urls in text and replace them with anchor tags, but I want to exclude some terminators just like how Twitter matches urls in tweets.
So far I've got this, but it's obviously not working too well.
(http[s]?\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?)
EDIT: Some example urls. In all cases below I only want to match "http://www.example.com"
http://www.example.com.
http://www.example.com:
"http://www.example.com"
http://www.example.com;
http://www.example.com!
[http://www.example.com]
{http://www.example.com}
http://www.example.com*
I looked into this very issue last year and developed a solution that you may want to look at. See: URL Linkification (HTTP/FTP). This link is a test page for the JavaScript solution, with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and JavaScript (but easily translated to Ruby), is not simple (but neither is the problem, as it turns out). For more information I would also recommend reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Ruby's URI module has an extract method that is used to parse out URLs from text. Parsing the returned values lets you piggyback on the heuristics in the module to extract the scheme and host information from a URL, avoiding reinventing the wheel.
text = '
http://www.example.com.
http://www.example.com:
"http://www.example.com"
http://www.example.com;
http://www.example.com!
[http://www.example.com]
{http://www.example.com}
http://www.example.com*
http://www.example.com/foo/bar?q=foobar
http://www.example.com:81
'
require 'uri'
puts URI::extract(text).map{ |u| uri = URI.parse(u); "#{ uri.scheme }://#{ uri.host[/(^.+?)\.?$/, 1] }" }
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
The only gotcha is that a period '.' is a legitimate character in a host name, so URI#host won't strip it. Trailing periods get caught in the map block where the URL is rebuilt. Note that rebuilding the URL this way also strips off the path and query information.
A pragmatic and easily understandable solution is:
regex = %r!"(https?://[-.\w]+\.\w{2,6})"!
Some notes:
With %r we can choose the start and end delimiter. In this case I used exclamation mark, since I want to use slash unescaped in the regex.
The optional quantifier (i.e. '?') binds only to the preceding expression, in this case 's'. There's no need to put the 's' in a character class [s]?. It's the same as s?.
Inside the character class [-.\w] we don't need to escape dash and dot in order to make them match dot and dash literally. Dash should be first, however, to not mean range.
\w matches [A-Za-z0-9_] in Ruby. It's not exactly the full definition of URL characters, but combined with dash and dot it may be enough for our needs.
Top-level domains are assumed to be between 2 and 6 characters long, e.g. '.se' and '.travel'.
I'm not sure what you mean by "I want to exclude some terminators", but this regex matches only the wanted one in your example.
We want to use the first capture group, e.g. like this:
if input =~ %r!"(https?://[-.\w]+\.\w{2,6})"!
match = $~[1]
else
match = ""
end
What about this?
%r|https?://[-\w.]*\w|
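A quick sketch of how this last pattern behaves on the example URLs from the question. Because the match must end in a \w character, trailing punctuation is left out; the variable names are illustrative.

```ruby
# The sample lines from the question, each with a different terminator.
text = <<~EOS
  http://www.example.com.
  http://www.example.com:
  "http://www.example.com"
  http://www.example.com;
  http://www.example.com!
  [http://www.example.com]
  {http://www.example.com}
  http://www.example.com*
EOS

# [-\w.]* is greedy but backtracks until the final character is a word character.
matches = text.scan(%r|https?://[-\w.]*\w|)

p matches.uniq
# ["http://www.example.com"]
```

Note that this pattern only covers the scheme and host; a path, port, or query string would be cut off, which happens to be what the question asks for.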

Resources