Using regex to find all code indented with 4 spaces - ruby

Given a textarea, similar to StackOverflow, I'd like to wrap code (indented by 4 spaces) with a pre/code block. I'm trying to use the following regex to find the code:
re = / # Match a MARKDOWN CODE section.
(\r?\n) # $1: CODE must be preceded by blank line
( # $2: CODE contents
(?: # Group for multiple lines of code.
(?:\r?\n)+ # Each line preceded by a newline,
(?:[ ]{4}|\t).* # and begins with four spaces or tab.
)+ # One or more CODE lines
\r?\n # CODE followed by blank line.
) # End $2: CODE contents
(?=\r?\n) # CODE followed by blank line.
/x
result = subject.gsub(re, '\1<pre>\2</pre>')
But this isn't working. Here's the example on Rubular:
http://rubular.com/r/l5faSjR8ya
Any suggestions on how to make this regex match the code so that I can wrap pre/code tags around it? Thanks

I think code mode ends at any trailing newline that is not followed by a tab or 4 spaces. I'm not sure, but successive newlines would not be included in the code block.
I don't get Ruby's regex options too well, but this seems to work: http://rubular.com/r/BlbreoO3sn
((?:^(?:[ ]{4}|\t).*$(?:\r?\n|\z))+) Theoretically, it's in multi-line mode.
Just make the replacement <pre>\1</pre>
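A minimal sketch of that suggestion applied with gsub. The sample text and variable names are made up for illustration; the pattern is the one given above, and it relies on Ruby's ^ and $ matching at line boundaries.

```ruby
# Hypothetical sample input: two indented lines surrounded by plain text.
text = "intro\n\n    x = 1\n    y = 2\n\noutro\n"

# The pattern from this answer: one or more consecutive lines that start
# with four spaces or a tab, captured as group 1.
code_block = /((?:^(?:[ ]{4}|\t).*$(?:\r?\n|\z))+)/

# Single quotes keep \1 as a backreference to group 1.
wrapped = text.gsub(code_block, '<pre>\1</pre>')
puts wrapped
```

Note that the indentation itself is kept inside the pre block; trimming it needs the block form of gsub.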
EDIT
#Rachela Meadows - After further examination, this is a fairly difficult regex.
I managed to exactly duplicate the functionality of the <pre><code> block features of the online editor here on SO.
After obtaining each block and before wrapping it in <pre><code>, all markup characters should be converted to entities (i.e., < to &lt;, etc.). That said, I didn't do that step in the Ruby code sample below, although I do have the regexes to do it.
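The escaping step could look roughly like the sketch below. It is not the answerer's own regexes (those are not shown here); the entity table, the sample text, and the simplified block pattern are assumptions chosen for illustration, and the table covers only the three characters that matter most inside a code block.

```ruby
# Illustrative entity table; a minimal stand-in, not a complete HTML escaper.
entities = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }

# Simplified indented-block pattern (four spaces or a tab at line start).
code_block = /((?:^(?:[ ]{4}|\t).*$(?:\r?\n|\z))+)/

# Hypothetical input containing characters that need escaping.
text = "see:\n\n    if a < b && c > d\n"

# gsub with a hash replaces each matched character by its table entry.
html = text.gsub(code_block) do
  "<pre><code>#{$1.gsub(/[&<>]/, entities)}</code></pre>"
end
puts html
```

Escaping inside the block keeps the surrounding prose untouched, so only the code is converted.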
A special note about trimming: the main regex below does not include residual trailing newlines, and neither does the SO functionality, so the code block is correct top to bottom.
However, the leading 4 spaces (or tab) on each line of the body can't be trimmed (and they should be) by the main regex itself. That needs a callback.
Playing around with the gsub block mode, it's easy to trim those leading spaces/tabs.
Let me know if you have any problems with this.
Links -
Rubular (for the regex): http://rubular.com/r/pp9oRLQ0xo
Ideone (for the working Ruby code): http://ideone.com/aA9it
Regex compressed -
(^\s*$\n|\A)(^(?:[ ]{4}|\t).*[^\s].*$\n?(?:(?:^\s*$\n?)*^(?:[ ]{4}|\t).*[^\s].*$\n?)*)
Regex expanded -
(^\s*$\n|\A) # Capt grp 1, block is preceded by a blank line or begin of string
( # Begin "Capture group 2", start of pre/code block
^(?:[ ]{4}|\t) .* [^\s] .* $ \n? # First line of code block (note - lines must contain at least 1 non-whitespace character)
(?: # Start "Optionally, get more lines of code"
(?: ^ \s* $ \n? )* # Many optional blank lines
^(?:[ ]{4}|\t) .* [^\s] .* $ \n? # Another line of code
)* # End "Optionally, get more lines of code", do 0 or more times
) # End "Capture group 2", end of pre/code block
Ruby code -
regex = /(^\s*$\n|\A)(^(?:[ ]{4}|\t).*[^\s].*$\n?(?:(?:^\s*$\n?)*^(?:[ ]{4}|\t).*[^\s].*$\n?)*)/;
data = '
Hello Worldsasdasdffasdfasdf asdf
thisdqweee
asdfasdfasdfasdf
sdfg
#YYYY {
height: 100%;
min-height: 800px;
margin-right: 20px;
position: relative;
}
#ZZZZZZ {
height: 100%;
overflow: hidden;
}';
# ---
result = data.gsub(regex) {
x = $2;
## Construct the return value '\1<pre><code>\2</code></pre>',
## but trim 1 to 4 leading spaces from each line (see the note at the bottom for tabs).
## They are no longer necessary once the lines sit inside a code block.
$1 + '<pre><code>' + x.gsub(/^[ ]{1,4}/, '') + '</code></pre>'
};
# Note - Tabs can be trimmed too, use : x.gsub(/^(?:[ ]{1,4}|\t)/,'') in the above
print result;

If you're looking to match full lines, don't explicitly match for (?:\r?\n)+, rather use ^ and $. Try
(\r?\n)((?:(?:^[ ]{4}|\t).*$)+)(?=\r?\n)

I think your pattern requires two newlines at the beginning to match.
Maybe something like this: ((?:(?:[ ]{4}|\t).*(?:\r?\n|$))+)
($ is there to match when the last line is indented and has no trailing newline.)
http://rubular.com/r/Vg9HnJpjbw
Ruby:
s = "before\n    indent1\n    indent2\nmiddle\n    indent1\nafter"
p s.gsub(/((?:(?:[ ]{4}|\t).*(?:\r?\n|$))+)/x, '<pre>\1</pre>')
Output:
"before\n<pre>    indent1\n    indent2\n</pre>middle\n<pre>    indent1\n</pre>after"

I think one of your newline captures is redundant. You can use ^ and $ to match at end of line rather than only at end of input (in Ruby they always work per line, so no flag is needed); this is a better pattern than trying to match newlines.
Try this pattern:
/(?:^(?:[ ]{4}|\t).*$[\n\r]*)+/
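As a rough illustration of this pattern, the sketch below pulls indented blocks out of a string with scan. The sample text and variable names are invented for the example.

```ruby
# The pattern from this answer; it has no capture groups, so scan
# returns the full text of each match.
indented = /(?:^(?:[ ]{4}|\t).*$[\n\r]*)+/

# Hypothetical input: two indented lines between two plain lines.
text = "a\n    b\n    c\nd\n"

blocks = text.scan(indented)
p blocks
```

Because [\n\r]* accepts any run of line breaks, blank lines that follow an indented line are pulled into the match as well, which may or may not be what you want.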

Related

Matching the word without space and has to include certain start of the word

I am trying to match
driver. in
def fun
driver.find_element(:link_text, "Standard Menu Rates").click
driver.find_element(:id, "jpform:fromStation").send_keys("HOSUR - HSRA")
#driver.find_element(:id, "jpform:toStation").send_keys("SATUR - SRT")
So I have written the following regular expression
^driver.
But driver. has some whitespace in front of it, so it's not matching. How can I allow for that whitespace while still anchoring at the start of the line, so that driver matches but #driver or any other word does not?
Input
def fun
driver.find_element(:link_text, "Standard Menu Rates").click
driver.find_element(:id, "jpform:fromStation").send_keys("HOSUR - HSRA")
#driver.find_element(:id, "jpform:toStation").send_keys("SATUR - SRT")
Output
driver.find_element(:link_text, "Standard Menu Rates").click
driver.find_element(:id, "jpform:fromStation").send_keys("HOSUR - HSRA")
And also,
I know how to match the words inside the double quotes, but how would I match the words that are outside them?
Input
# 0 = {String#3546} "Policy Duration (Days)"
# 1 = {String#3547} "Related Proposal Nr."
Output
# 0 = {String#3546}
# 1 = {String#3547}
As per your comments, you want to match the start of the line, then any number of whitespaces on the same line, then driver and then a dot.
You need to use [[:blank:]]* (it will match any 0+ Unicode horizontal whitespace chars). Note also that the . should be escaped to match a literal dot.
Use
/^[[:blank:]]*driver\./
See the Rubular demo
Details
^ - start of a line
[[:blank:]]* - 0+ horizontal whitespace chars
driver - a literal substring
\. - a dot.
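A small sketch of the pattern in use, filtering lines with Array#grep. The amount of indentation on each sample line is an assumption (the question does not show it exactly), and the variable names are illustrative.

```ruby
# Sample lines modelled on the question; indentation widths are assumed.
lines = [
  'def fun',
  '  driver.find_element(:link_text, "Standard Menu Rates").click',
  "\tdriver.find_element(:id, \"jpform:fromStation\").send_keys(\"HOSUR - HSRA\")",
  '  #driver.find_element(:id, "jpform:toStation").send_keys("SATUR - SRT")'
]

# Keep only lines where driver. follows optional spaces or tabs;
# the commented-out #driver line is rejected.
active = lines.grep(/^[[:blank:]]*driver\./)
puts active
```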
As for the second part, you may remove "..." substrings from the strings using
s.gsub(/[[:blank:]]*"[^"]*"$/, '')
See this Rubular demo
Alternatively, if you want to match a line part up to the first ", you may use
/^[^"\r\n]+/
See this Rubular demo
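Both approaches can be tried side by side. This sketch uses the two sample rows from the question; the variable names are illustrative only.

```ruby
rows = [
  '# 0 = {String#3546} "Policy Duration (Days)"',
  '# 1 = {String#3547} "Related Proposal Nr."'
]

# Approach 1: delete the trailing quoted part together with the blanks before it.
stripped = rows.map { |row| row.gsub(/[[:blank:]]*"[^"]*"$/, '') }

# Approach 2: take everything up to the first double quote.
prefixes = rows.map { |row| row[/^[^"\r\n]+/] }

p stripped
p prefixes
```

The second approach keeps the space that precedes the quote, so a call to rstrip may be needed afterwards.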
You can use the regex
^\s*\bdriver\.
where \b represents a word boundary. Check the regex101 demo.
For the second part, you can replace the quoted string with an empty string; what remains is the required string. See the regex101 demo.

Regular Expression replacement to convert Less mixins to Scss

I'm looking to convert Less mixin calls to their equivalents in Scss:
.mixin(); should become #mixin();
.mixin(0); should become #mixin(0);
.mixin(0; 1; 2); should become #mixin(0, 1, 2);
I'm having the most difficulty with the third example, as I essentially need to match n groups separated by semicolons, and replace those with the same groups separated by commas. I suppose this relies on some sort of repeating groups functionality in regexes that I'm not familiar with.
It's not enough to simply replace semicolons within parentheses - I need a regex that will only match the \.[\w\-]+\(.*\) format of mixins, but obviously with some magic in the second match group to handle the 3rd example above.
I'm doing this in Ruby, so if you're able to provide replacement syntax that's compatible with gsub, that would be awesome. I would like a single regex replacement, something that doesn't require multiple passes to clean up the semicolons.
I suggest adding two capturing groups around the subvalues you need and using an additional gsub inside the first gsub's block to replace the ; with , only in the 2nd group.
See
s = ".mixin(0; 1; 2);"
puts s.gsub(/\.([\w\-]+)(\(.*\))/) { "##{$1}#{$2.gsub(/;/, ',')}" }
# => #mixin(0, 1, 2);
The pattern details:
\. - a literal dot
([\w\-]+) - Group 1 capturing 1 or more word chars ([a-zA-Z0-9_]) or -
(\(.*\)) - Group 2 capturing a (, then any 0+ chars other than linebreak symbols as many as possible up to the last ) and the last ). NOTE: if there are multiple values, use lazy matching - (\(.*?\)) - here.
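To illustrate the note about lazy matching, here is a sketch with two mixin calls on one line. The input string and the variable names are made up; tr is used for the inner replacement simply because it is a one-character swap.

```ruby
# Hypothetical input with two mixin calls on the same line.
less = ".a(0; 1); .b(2; 3);"

# Lazy (\(.*?\)) stops at the first closing paren, so each call is
# matched separately and the semicolons between calls are left alone.
scss = less.gsub(/\.([\w\-]+)(\(.*?\))/) do
  name, args = $1, $2
  '#' + name + args.tr(';', ',')
end
puts scss
```

With the greedy (\(.*\)) the match would run to the last closing paren on the line, and the semicolon separating the two calls would be converted too.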
Here you go:
less_style = ".mixin(0; 1; 2);"
# convert the first period to #
less_style.gsub! /^\./, '#'
# convert the inner semicolons to commas
scss_style = less_style.gsub /(?<=[\(\d]);/, ','
scss_style
# => "#mixin(0, 1, 2);"
The second regex is using positive lookbehinds. You can read about those here: http://www.regular-expressions.info/lookaround.html
I also use this neat web app to play around with regexes: http://rubular.com/
This will get you a single pass through gsub:
".mixin(0; 1; 2);".gsub(/(?<!\));|\./, ";" => ",", "." => "#")
=> "#mixin(0, 1, 2);"
It's an OR regex with a hash for the replacement parameters.
This assumes, from your example, that you only want to replace semicolons that do not follow a closing paren (hence the negative lookbehind): (?<!\));
You can modify/build on this with other expressions. Even add more OR conditions to the regex.
Also, you can use the block version of gsub if you need more options.

Regex matching chars around text

I have a string with chars inside and I would like to match only the chars around a string.
"This is a [1]test[/1] string. And [2]test[/2]"
Rubular http://rubular.com/r/f2Xwe3zPzo
Currently, the code in the link matches the text inside the special chars, how can I change it?
Update
To clarify my question: it should only match if the opening and closing tags have the same number.
"[2]first[/2] [1]second[/2]"
In the code above, only first should match and not second. The text inside the special chars (first), should be ignored.
Try this:
(\[[0-9]\]).+?(\[\/[0-9]\])
Permalink to the example on Rubular.
Update
Since you want to remove the 'special' characters, try this instead:
foo = "This is a [1]test[/1] string. And [2]test[/2]"
foo.gsub /\[\/?\d\]/, ""
# => "This is a test string. And test"
Update, Part II
You only want to remove the 'special' characters when the surrounding tags match, so what about this:
foo = "This is a [1]test[/1] string. And [2]test[/2], but not [3]test[/2]"
foo.gsub /(?:\[(?<number>\d)\])(?<content>.+?)(?:\[\/\k<number>\])/, '\k<content>'
# => "This is a test string. And test, but not [3]test[/2]"
\[([0-9])\].+?\[\/\1\]
([0-9]) is a capture since it is surrounded with parentheses. The \1 tells it to use the result of that capture. If you had more than one capture, you could reference them as well, \2, \3, etc.
Rubular
You can also use a named capture, rather than \1 to make it a little less cryptic. As in: \[(?<number>[0-9])\].+?\[\/\k<number>\]
Here's a way to do it that uses the form of String#gsub that takes a block. The idea is to pull strings such as "[1]test[/1]" into the block, and there remove the unwanted bits.
str = "This is a [1]test[/1] string. And [2]test[/2], plus [3]test[/99]"
r = /
\[ # match a left bracket
(\d+) # capture one or more digits in capture group 1
\] # match a right bracket
.+? # match one or more characters lazily
\[\/ # match a left bracket and forward slash
\1 # match the contents of capture group 1
\] # match a right bracket
/x
str.gsub(r) { |s| s[/(?<=\]).*?(?=\[)/] }
#=> "This is a test string. And test, plus [3]test[/99]"
Aside: When I first heard of named capture groups, they seemed like a great idea, but now I wonder if they really make regexes easier to read than \1, \2....

use of the ampersand here means pre_match?

What is the ampersand doing in the code below?
s.reverse.gsub( /\d{3}(?=\d)/, '\&,' ).reverse
One would think, after attempting to look up such things, that it is a special variable meaning post_match or pre_match, but the docs say nothing about ampersands - only dollar signs either followed by or preceded by a tick mark.
\& refers to the whole substring matched by the regex. See this simplified example:
s = "p1:1 1:1";
print s.gsub( /[a-z]/, '[\&],' ) ## only p is matched
output: [p],1:1 1:1
Similarly, \1 refers to the first captured group (and likewise \2, \3, and so on). An example:
s = "p1:1 1:1";
print s.gsub( /(\d:\d)/, '[\1]' )
output: p[1:1] [1:1]
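Applied to the line from the question, a sketch with a made-up input number shows what \& contributes: each run of three digits that is followed by another digit is re-emitted with a comma appended.

```ruby
s = "1234567"

# Reverse, append a comma to every 3-digit group that has a digit after it,
# then reverse back. '\&' stands for the whole matched text.
with_commas = s.reverse.gsub(/\d{3}(?=\d)/, '\&,').reverse
puts with_commas

# The block form exposes the same whole match as $&.
same = s.reverse.gsub(/\d{3}(?=\d)/) { "#{$&}," }.reverse
puts same
```

Both calls print 1,234,567 for this input. So \& is the matched text itself, not pre_match or post_match.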

How do I fix this multiline regular expression in Ruby?

I have a regular expression in Ruby that isn't working properly in multiline mode.
I'm trying to convert Markdown text into the Textile-esque markup used in Redmine. The problem is in my regular expression for converting code blocks. It should find any lines leading with 4 spaces or a tab, then wrap them in pre tags.
markdownText = '# header
some text that precedes code
    var foo = 9;
    var fn = function() {}
    fn();
some post text'
puts markdownText.gsub!(/(^(?:\s{4}|\t).*?$)+/m,"<pre>\n\\1\n</pre>")
Intended result:
# header
some text that precedes code
<pre>
var foo = 9;
var fn = function() {}
fn();
</pre>
some post text
The problem is that the closing pre tag is printed at the end of the document instead of after "fn();". I tried some variations of the following expression but it doesn't match:
gsub!(/(^(?:\s{4}|\t).*?$)+^(\S)/m, "<pre>\n\\1\n</pre>\\2")
How do I get the regular expression to match just the indented code block? You can test this regular expression on Rubular here.
First, note that 'm' multi-line mode in Ruby is equivalent to 's' single-line mode of other languages. In other words, 'm' mode in Ruby means "dot matches all".
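A quick way to see this for yourself; the sample string and variable names are invented for the sketch.

```ruby
text = "first\nsecond"

# Without /m the dot refuses to cross the newline.
plain = text.match?(/first.second/)

# With /m the dot matches the newline as well.
dotall = text.match?(/first.second/m)

# ^ and $ match at every line boundary with no flag at all.
words = text.scan(/^\w+$/)

p plain, dotall, words
```

This is why a lazy .*? combined with /m in the question's pattern can run across lines all the way to the end of the document.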
This regex will do a pretty good job of matching a markdown-like code section:
re = / # Match a MARKDOWN CODE section.
(\r?\n) # $1: CODE must be preceded by blank line
( # $2: CODE contents
(?: # Group for multiple lines of code.
(?:\r?\n)+ # Each line preceded by a newline,
(?:[ ]{4}|\t).* # and begins with four spaces or tab.
)+ # One or more CODE lines
\r?\n # CODE followed by blank line.
) # End $2: CODE contents
(?=\r?\n) # CODE followed by blank line.
/x
result = subject.gsub(re, '\1<pre>\2</pre>')
This requires a blank line before and after the code section and allows blank lines within the code section itself. It allows for either \r\n or \n line terminations. Note that this does not strip the leading 4 spaces (or tab) before each line. Doing that will require more code complexity. (I am not a ruby guy so can't help out with that.)
I would recommend looking at the markdown source itself to see how it's really being done.
/^(\s{4}|\t)+.+\;\n$/m
works a little better, but it still picks up a newline that we don't want.
Here it is on Rubular.
This is working for me with your sample input.
# Use the \\1 backreference here; "#{$1}" would be interpolated before gsub runs.
markdownText.gsub(/\n?((\s{4}.+)+)/, "\n<pre>\\1\n</pre>")
Here's another one that captures all the indented lines in a single block
((?:^(?: {4}|\t)[^\n]*$\n?)+)
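Combined with the block form of gsub, this pattern can also strip the indentation while wrapping, which is the part the earlier answer left open. This is only a sketch: the sample text is a shortened version of the question's input with the 4-space indentation assumed, and the variable names are illustrative.

```ruby
# The single-block pattern from this answer.
indented_block = /((?:^(?: {4}|\t)[^\n]*$\n?)+)/

# Shortened sample input; the code lines are assumed to be indented by 4 spaces.
markdown = "some text that precedes code\n    var foo = 9;\n    fn();\nsome post text"

textile = markdown.gsub(indented_block) do
  # Remove the leading 4 spaces (or tab) from every captured line.
  code = $1.gsub(/^(?: {4}|\t)/, '')
  "<pre>\n#{code}</pre>\n"
end
puts textile
```

Note that no /m flag is used, so the match stays within the indented lines.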
