How do I fix this multiline regular expression in Ruby? - ruby

I have a regular expression in Ruby that isn't working properly in multiline mode.
I'm trying to convert Markdown text into the Textile-esque markup used in Redmine. The problem is in my regular expression for converting code blocks. It should find any lines leading with 4 spaces or a tab, then wrap them in pre tags.
markdownText = '# header
some text that precedes code
    var foo = 9;
    var fn = function() {}
    fn();
some post text'
puts markdownText.gsub!(/(^(?:\s{4}|\t).*?$)+/m,"<pre>\n\\1\n</pre>")
Intended result:
# header
some text that precedes code
<pre>
var foo = 9;
var fn = function() {}
fn();
</pre>
some post text
The problem is that the closing pre tag is printed at the end of the document instead of after "fn();". I tried some variations of the following expression but it doesn't match:
gsub!(/(^(?:\s{4}|\t).*?$)+^(\S)/m, "<pre>\n\\1\n</pre>\\2")
How do I get the regular expression to match just the indented code block? You can test this regular expression on Rubular here.

First, note that 'm' multi-line mode in Ruby is equivalent to 's' single-line mode of other languages. In other words, 'm' mode in Ruby means "dot matches all"; ^ and $ already match at line boundaries without it.
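A quick way to see the difference, as a minimal sketch (the two-line sample string is made up for illustration): ^ and $ match at every line boundary with no flag at all, while the m flag only changes what the dot is allowed to match.

```ruby
# Hypothetical two-line sample, not taken from the question.
text = "first\nsecond"

# ^ and $ match at each line boundary even without any flag.
line_matches = text.scan(/^\w+$/)

# Without /m the dot refuses to cross the newline ...
without_m = text[/first.second/]

# ... with /m the dot matches the newline too ("dot matches all").
with_m = text[/first.second/m]

p line_matches  # => ["first", "second"]
p without_m     # => nil
p with_m        # => "first\nsecond"
```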
This regex will do a pretty good job of matching a markdown-like code section:
re = / # Match a MARKDOWN CODE section.
(\r?\n) # $1: CODE must be preceded by blank line
( # $2: CODE contents
(?: # Group for multiple lines of code.
(?:\r?\n)+ # Each line preceded by a newline,
(?:[ ]{4}|\t).* # and begins with four spaces or tab.
)+ # One or more CODE lines
\r?\n # CODE followed by blank line.
) # End $2: CODE contents
(?=\r?\n) # CODE followed by blank line.
/x
result = subject.gsub(re, '\1<pre>\2</pre>')
This requires a blank line before and after the code section and allows blank lines within the code section itself. It allows for either \r\n or \n line terminations. Note that this does not strip the leading 4 spaces (or tab) before each line. Doing that will require more code complexity. (I am not a Ruby guy, so I can't help out with that.)
I would recommend looking at the Markdown source itself to see how it's really being done.

/^(\s{4}|\t)+.+\;\n$/m
works a little better, but it still picks up a newline that we don't want.
Here it is on Rubular.

This is working for me with your sample input.
markdownText.gsub(/\n?((\s{4}.+)+)/, "\n<pre>\\1\n</pre>")

Here's another one that captures all the indented lines in a single block
((?:^(?: {4}|\t)[^\n]*$\n?)+)
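As a sketch of how this pattern could be applied to the sample from the question (the variable names and the indentation-stripping step are assumptions added for illustration, not part of the answer), the block form of gsub can wrap each run of indented lines in pre tags:

```ruby
# Sample input modeled on the question; the three code lines are indented by 4 spaces.
markdown_text = [
  "# header",
  "some text that precedes code",
  "    var foo = 9;",
  "    var fn = function() {}",
  "    fn();",
  "some post text"
].join("\n")

# One or more consecutive lines that start with 4 spaces or a tab.
indented_block = /((?:^(?: {4}|\t)[^\n]*$\n?)+)/

textile = markdown_text.gsub(indented_block) do
  block = Regexp.last_match(1)
  # Keep the block's final newline (if any) outside the closing tag.
  trailing = block.end_with?("\n") ? "\n" : ""
  # Assumption: the leading indentation should be dropped inside <pre>.
  code = block.chomp.gsub(/^(?: {4}|\t)/, "")
  "<pre>\n#{code}\n</pre>#{trailing}"
end

puts textile
```

No m flag is needed here, because the pattern uses [^\n]* instead of a dot and relies on ^ and $ matching at line boundaries.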

Related

Matching the word without space and has to include certain start of the word

I am trying to match
driver. in
def fun
  driver.find_element(:link_text, "Standard Menu Rates").click
  driver.find_element(:id, "jpform:fromStation").send_keys("HOSUR - HSRA")
  #driver.find_element(:id, "jpform:toStation").send_keys("SATUR - SRT")
So I have written the following regular expression:
^driver.
But there is some whitespace in front of driver., so it doesn't match. How can I skip the leading whitespace while still anchoring to the start of the line, so that driver matches but #driver or any other word does not?
Input
def fun
  driver.find_element(:link_text, "Standard Menu Rates").click
  driver.find_element(:id, "jpform:fromStation").send_keys("HOSUR - HSRA")
  #driver.find_element(:id, "jpform:toStation").send_keys("SATUR - SRT")
Output
driver.find_element(:link_text, "Standard Menu Rates").click
driver.find_element(:id, "jpform:fromStation").send_keys("HOSUR - HSRA")
And also,
I know to match those words inside the "" but how would I match those words which are outside the double quote?
Input
# 0 = {String#3546} "Policy Duration (Days)"
# 1 = {String#3547} "Related Proposal Nr."
Output
# 0 = {String#3546}
# 1 = {String#3547}
As per your comments, you want to match the start of the line, then any number of whitespaces on the same line, then driver and then a dot.
You need to use [[:blank:]]* (it will match any 0+ Unicode horizontal whitespace chars). Note also that the . should be escaped to match a literal dot.
Use
/^[[:blank:]]*driver\./
See the Rubular demo
Details
^ - start of a line
[[:blank:]]* - 0+ horizontal whitespace chars
driver - a literal substring
\. - a dot.
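A minimal sketch of the pattern in use. The sample lines follow the question, but the two-space indentation and the variable names are assumptions made for illustration:

```ruby
# Sample lines modeled on the question; the two-space indentation is an assumption.
lines = [
  'def fun',
  '  driver.find_element(:link_text, "Standard Menu Rates").click',
  '  driver.find_element(:id, "jpform:fromStation").send_keys("HOSUR - HSRA")',
  '  #driver.find_element(:id, "jpform:toStation").send_keys("SATUR - SRT")'
]

driver_call = /^[[:blank:]]*driver\./

# Enumerable#grep keeps only the elements that match the pattern.
matching = lines.grep(driver_call)
puts matching
```

The commented-out line is skipped because # is not a blank character, so driver is not the first non-blank text on that line.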
As for the second part, you may remove "..." substrings from the strings using
s.gsub(/[[:blank:]]*"[^"]*"$/, '')
See this Rubular demo
Alternatively, if you want to match a line part up to the first ", you may use
/^[^"\r\n]+/
See this Rubular demo
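Both variants can be compared side by side. This is only a sketch using the two sample rows from the question; the variable names are made up:

```ruby
rows = [
  '# 0 = {String#3546} "Policy Duration (Days)"',
  '# 1 = {String#3547} "Related Proposal Nr."'
]

# Variant 1: delete the trailing quoted part together with the blanks before it.
without_quotes = rows.map { |row| row.gsub(/[[:blank:]]*"[^"]*"$/, '') }

# Variant 2: take everything up to the first double quote.
# Note that this keeps the space that precedes the quote.
up_to_quote = rows.map { |row| row[/^[^"\r\n]+/] }

p without_quotes  # => ["# 0 = {String#3546}", "# 1 = {String#3547}"]
p up_to_quote     # => ["# 0 = {String#3546} ", "# 1 = {String#3547} "]
```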
You can use the regex
^\s*\bdriver\.
where \b represents a word boundary. Check the regex101 demo.
For the second part, you can replace the string inside the quotes with an empty string; what remains is the required output. See the regex101 demo.

Regex matching chars around text

I have a string with chars inside and I would like to match only the chars around a string.
"This is a [1]test[/1] string. And [2]test[/2]"
Rubular http://rubular.com/r/f2Xwe3zPzo
Currently, the code in the link matches the text inside the special chars, how can I change it?
Update
To clarify my question. It should only match if the opening and closing has the same number.
"[2]first[/2] [1]second[/2]"
In the code above, only first should match and not second. The text inside the special chars (first), should be ignored.
Try this:
(\[[0-9]\]).+?(\[\/[0-9]\])
Permalink to the example on Rubular.
Update
Since you want to remove the 'special' characters, try this instead:
foo = "This is a [1]test[/1] string. And [2]test[/2]"
foo.gsub(/\[\/?\d\]/, "")
# => "This is a test string. And test"
Update, Part II
You only want to remove the 'special' characters when the surrounding tags match, so what about this:
foo = "This is a [1]test[/1] string. And [2]test[/2], but not [3]test[/2]"
foo.gsub(/(?:\[(?<number>\d)\])(?<content>.+?)(?:\[\/\k<number>\])/, '\k<content>')
# => "This is a test string. And test, but not [3]test[/2]"
\[([0-9])\].+?\[\/\1\]
([0-9]) is a capture since it is surrounded with parentheses. The \1 tells it to use the result of that capture. If you had more than one capture, you could reference them as well, \2, \3, etc.
Rubular
You can also use a named capture, rather than \1 to make it a little less cryptic. As in: \[(?<number>[0-9])\].+?\[\/\k<number>\]
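A small sketch comparing the numbered and the named form on the sample from the question's update (the variable names are illustrative):

```ruby
numbered = /\[([0-9])\].+?\[\/\1\]/
named    = /\[(?<number>[0-9])\].+?\[\/\k<number>\]/

sample = "[2]first[/2] [1]second[/2]"

# Both forms only accept a closing tag whose digit equals the opening one.
p sample[numbered]               # => "[2]first[/2]"
p sample[named]                  # => "[2]first[/2]"

# The named group can be read back by name from the MatchData.
p named.match(sample)[:number]   # => "2"

# A mismatched pair on its own does not match at all.
p "[1]second[/2]"[numbered]      # => nil
```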
Here's a way to do it that uses the form of String#gsub that takes a block. The idea is to pull strings such as "[1]test[/1]" into the block, and there remove the unwanted bits.
str = "This is a [1]test[/1] string. And [2]test[/2], plus [3]test[/99]"
r = /
\[ # match a left bracket
(\d+) # capture one or more digits in capture group 1
\] # match a right bracket
.+? # match one or more characters lazily
\[\/ # match a left bracket and forward slash
\1 # match the contents of capture group 1
\] # match a right bracket
/x
str.gsub(r) { |s| s[/(?<=\]).*?(?=\[)/] }
#=> "This is a test string. And test, plus [3]test[/99]"
Aside: When I first heard of named capture groups, they seemed like a great idea, but now I wonder if they really make regexes easier to read than \1, \2....

Ruby Regex Group Replacement

I am trying to perform regular expression matching and replacement on the same line in Ruby. I have some libraries that manipulate strings in Ruby and add special formatting characters to it. The formatting can be applied in any order. However, if I would like to change the string formatting, I want to keep some of the original formatting. I'm using regex for that. I have the regular expression matching correctly what I need:
mystring.gsub(/[(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))|(\e\[[3,9][0-8]m)]*Text/, 'New Text')
However, what I really want is the matching from the first grouping found in:
(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))
to be appended to New Text and replaced as opposed to just New Text. I'm trying to reference the match in the form of
mystring.gsub(/[(\e\[([1-9]|[1,2,4,5,6,7,8]{2}m))|(\e\[[3,9][0-8]m)]*Text/, '\1' + 'New Text')
but my understanding is that \1 only works when using \d or \k. Is there any way to reference that specific capturing group in my replacement string? Additionally, since I am using an asterisk for the [], I know that this grouping could occur more than once. Therefore, I would like to have the last matching occurrence yielded.
My expected input/output with a sample is:
Input: "\e[1mHello there\e[34m\e[40mText\e[0m\e[0m\e[22m"
Output: "\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m"
Input: "\e[1mHello there\e[44m\e[34m\e[40mText\e[0m\e[0m\e[22m"
Output: "\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m"
So the last grouping is found and appended.
You can use the following regex with back-reference \\1 in the replacement:
reg = /(\\e\[(?:[0-9]{1,2}|[3,9][0-8])m)+Text/
mystring = "\\e[1mHello there\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')
mystring = "\\e[1mHello there\\e[44m\\e[34m\\e[40mText\\e[0m\\e[0m\\e[22m"
puts mystring.gsub(reg, '\\1New Text')
Output of the IDEONE demo:
\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m
\e[1mHello there\e[40mNew Text\e[0m\e[0m\e[22m
Mind that your input has a backslash \ that needs escaping in a regular string literal. To match it inside the regex, we use a double backslash, as we are looking for a literal backslash.
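If the strings contain real escape characters rather than a literal backslash followed by e, the same idea still applies. The following is a sketch under that assumption, with a simplified pattern (\d{1,2}) that is not taken from the answer. It relies on the fact that a repeated capture group keeps only its last iteration:

```ruby
# Assumption: the strings contain real ESC characters ("\e" in a double-quoted literal).
ansi_run = /(\e\[\d{1,2}m)+Text/

inputs = [
  "\e[1mHello there\e[34m\e[40mText\e[0m\e[0m\e[22m",
  "\e[1mHello there\e[44m\e[34m\e[40mText\e[0m\e[0m\e[22m"
]

# A repeated capture group keeps only its last iteration, so \1 is "\e[40m" here.
outputs = inputs.map { |s| s.gsub(ansi_run, '\1New Text') }

outputs.each { |s| p s }
```

The replacement string is single-quoted so that \1 reaches gsub as a backreference instead of being interpreted by the string literal.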

What is a regular expression for finding lines with uncommented Java code?

I'm working on a simple Ruby program that should count the lines of text in a Java file that contain actual Java code. The line gets counted even if it has comments in it, so basically only lines that are just comments won't get counted.
I was thinking of using a regular expression to approach this problem. My program will just iterate line by line and compare it to a "regexp", like:
while line = file.gets
if line =~ regex
count+=1
end
end
I'm not sure what regexp format to use for that, though. Any ideas?
Getting the count for "Lines of code" can be a little subjective. Should auto-generated stuff like imports and package name really count? A person usually didn't write it. Does a line with just a closing curly brace count? There's not really any executing logic on that line.
I typically use this regex for counting Java lines of code:
^(?![ \s]*\r?\n|import|package|[ \s]*}\r?\n|[ \s]*//|[ \s]*/\*|[ \s]*\*).*\r?\n
This will omit:
Blank lines
Imports
Lines with the package name
Lines with just a }
Lines with single line comments //
Opening multi-line comments ((whitespace)/* whatever)
Continuation of multi-line comments ((whitespace)* whatever)
It will also match against either \n or \r\n newlines (since your source code could contain either depending on your OS).
While not perfect, it seems to come pretty close to matching against all, what I would consider, "legitimate" lines of code.
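As a sketch of how this regex could be plugged into the line-by-line loop from the question (the Java snippet and the variable names are made up for illustration), note that each line must still carry its trailing newline, because the pattern ends in \r?\n:

```ruby
# The regex from above, with the forward slashes escaped for a Ruby literal.
code_line = /^(?![ \s]*\r?\n|import|package|[ \s]*}\r?\n|[ \s]*\/\/|[ \s]*\/\*|[ \s]*\*).*\r?\n/

# Hypothetical Java source used only to exercise the pattern.
java_source = <<~'JAVA'
  package demo;

  import java.util.List;

  // A comment-only line
  public class Demo {
      /* block comment
       * continues */
      int x = 1; // trailing comment
  }
JAVA

# each_line keeps the trailing newline that the pattern expects.
count = java_source.each_line.count { |line| line.match?(code_line) }
puts count  # => 2
```

Only the class declaration and the field declaration are counted; the package, import, blank, comment, and closing-brace lines are all rejected by the negative lookahead.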
count = 0
file.each_line do |ln|
# Manage multiline and single line comments.
# Exclude single line if and only if there isn't code on that line
next if ln =~ %r{^\s*(//|/\*[^*]*\*/$|$)} or (ln =~ %r{/\*} .. ln =~ %r{\*/})
count += 1
end
There's only a problem with lines that have a multi-line comment but also code, for example:
someCall(); /* Start comment
this a comment
even this
*/ thisShouldBeCounted();
However:
imCounted(); // Comment
meToo(); /* comment */
/* comment */ yesImCounted();
// i'm not
/* Nor
we
are
*/
EDIT
The following version is a bit more cumbersome but correctly counts all cases.
count = 0
comment_start = false
file.each_line do |ln|
# Manage multiline and single line comments.
# Exclude single line if and only if there isn't code on that line
next if ln =~ %r{^\s*(//|/\*[^*]*\*/$|$)} or (ln =~ %r{^\s*/\*} .. ln =~ %r{\*/}) or (comment_start and not ln.include? '*/')
count += 1 unless comment_start and ln =~ %r{\*/\s*$}
comment_start = ln.include? '/*'
end
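A different technique, shown here only as a sketch and not as the method used above, is to strip the comments first and then count the lines that still contain something. The helper name count_code_lines is hypothetical, and the sketch assumes that comment markers never appear inside Java string literals and that a // comment never contains /*:

```ruby
# Assumption: comment markers never appear inside Java string literals.
def count_code_lines(source)
  # Replace each /* ... */ comment with the newlines it spanned,
  # so that line boundaries are preserved.
  without_blocks = source.gsub(%r{/\*.*?\*/}m) { |comment| "\n" * comment.count("\n") }
  # Drop // comments, then count the lines that still contain something.
  without_blocks.each_line.count { |line| !line.sub(%r{//.*}, "").strip.empty? }
end

# The sample lines from the two examples above, combined.
java_source = <<~'JAVA'
  someCall(); /* Start comment
  this a comment
  even this
  */ thisShouldBeCounted();
  imCounted(); // Comment
  meToo(); /* comment */
  /* comment */ yesImCounted();
  // i'm not
  /* Nor
  we
  are
  */
JAVA

puts count_code_lines(java_source)  # => 5
```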

Using regex to find all code identified with 4 spaces

Given a textarea, similar to StackOverflow, I'd like to wrap code (indented by 4 spaces) with a pre/code block. I'm trying to use the following regex to find the code:
re = / # Match a MARKDOWN CODE section.
(\r?\n) # $1: CODE must be preceded by blank line
( # $2: CODE contents
(?: # Group for multiple lines of code.
(?:\r?\n)+ # Each line preceded by a newline,
(?:[ ]{4}|\t).* # and begins with four spaces or tab.
)+ # One or more CODE lines
\r?\n # CODE followed by blank line.
) # End $2: CODE contents
(?=\r?\n) # CODE followed by blank line.
/x
result = subject.gsub(re, '\1<pre>\2</pre>')
But this isn't working, here's the example in Rubular:
http://rubular.com/r/l5faSjR8ya
Any suggestions on how to make this regex match the code so that I can wrap pre/code tags around it? Thanks
I think there is an escape out of the code mode with any trailing newline not followed by tab or 4 spaces. Not sure but successive newlines would not be included in the code block.
I don't get Ruby's regex options too well, but this seems to work: http://rubular.com/r/BlbreoO3sn
((?:^(?:[ ]{4}|\t).*$(?:\r?\n|\z))+)
Theoretically, it's in multi-line mode.
Just make the replacement <pre>\1</pre>
EDIT
@Rachela Meadows - After further examination, this is a fairly difficult regex.
I managed to exactly duplicate the functionality of the <pre><code> block features of the online editor here on SO.
After obtaining each block and before wrapping it in <pre><code>, all markup entities should be converted (i.e., < to &lt;, etc.). That being said, I didn't do that step in the Ruby code sample below. I do have the regexes to do that, though.
A special note about trimming: The main regex below does not include residual trailing newlines. Nor does the SO functionality. So the code block is correct top to bottom.
However, the leading 4 spaces (or tab) that could be contained in the body can't be trimmed (and they should be) in the main regex. For that it needs a callback.
Playing around with the gsub block mode, it's easy to trim those leading spaces/tab.
Let me know if you have any problems with this.
Links -
Rubular (for the regex): http://rubular.com/r/pp9oRLQ0xo
Ideone (for the working Ruby code): http://ideone.com/aA9it
Regex compressed -
(^\s*$\n|\A)(^(?:[ ]{4}|\t).*[^\s].*$\n?(?:(?:^\s*$\n?)*^(?:[ ]{4}|\t).*[^\s].*$\n?)*)
Regex expanded -
(^\s*$\n|\A) # Capt grp 1, block is preceded by a blank line or begin of string
( # Begin "Capture group 2", start of pre/code block
^(?:[ ]{4}|\t) .* [^\s] .* $ \n? # First line of code block (note - lines must contain at least 1 non-whitespace character)
(?: # Start "Optionally, get more lines of code"
(?: ^ \s* $ \n? )* # Many optional blank lines
^(?:[ ]{4}|\t) .* [^\s] .* $ \n? # Another line of code
)* # End "Optionally, get more lines of code", do 0 or more times
) # End "Capture group 2", end of pre/code block
Ruby code -
regex = /(^\s*$\n|\A)(^(?:[ ]{4}|\t).*[^\s].*$\n?(?:(?:^\s*$\n?)*^(?:[ ]{4}|\t).*[^\s].*$\n?)*)/;
data = '
Hello Worldsasdasdffasdfasdf asdf
thisdqweee
asdfasdfasdfasdf
sdfg
#YYYY {
height: 100%;
min-height: 800px;
margin-right: 20px;
position: relative;
}
#ZZZZZZ {
height: 100%;
overflow: hidden;
}';
# ---
result = data.gsub(regex) do
  # Save both captures first; the nested gsub below overwrites $1 and $2.
  prefix, block = $1, $2
  # Build the return value '\1<pre><code>\2</code></pre>',
  # but trim 1 to 4 leading spaces from each line.
  # They are no longer necessary once the lines sit inside a code block.
  prefix + '<pre><code>' + block.gsub(/^[ ]{1,4}/, '') + '</code></pre>'
end
# Note - tabs can be trimmed too: use block.gsub(/^(?:[ ]{1,4}|\t)/, '') above.
print result
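The entity conversion that the sample above skips could look roughly like this. It is a sketch only: escape_html and wrap_code_block are hypothetical helper names, and only the three characters &, < and > are handled.

```ruby
# Hypothetical helper: escape the three characters that matter inside <pre><code>.
def escape_html(text)
  text.gsub(/[&<>]/, { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" })
end

# Hypothetical helper: trim the leading indentation, escape, then wrap.
def wrap_code_block(block)
  body = block.gsub(/^(?:[ ]{1,4}|\t)/, "")
  "<pre><code>#{escape_html(body)}</code></pre>"
end

p wrap_code_block("    if (a < b && c > d) {\n    }\n")
# => "<pre><code>if (a &lt; b &amp;&amp; c &gt; d) {\n}\n</code></pre>"
```

Passing a hash as the second argument to gsub replaces each match with the value stored under that key, which avoids escaping an ampersand twice.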
If you're looking to match full lines, don't explicitly match for (?:\r?\n)+, rather use ^ and $. Try
(\r?\n)((?:(?:^[ ]{4}|\t).*$)+)(?=\r?\n)
I think your pattern requires two newlines at the beginning to match.
Maybe like this? ((?:(?:[ ]{4}|\t).*(?:\r?\n|$))+)
($ is used so that the last line still matches if it is indented and has no trailing newline.)
http://rubular.com/r/Vg9HnJpjbw
Ruby:
s = "before\n    indent1\n    indent2\nmiddle\n    indent1\nafter"
p s.gsub(/((?:(?:[ ]{4}|\t).*(?:\r?\n|$))+)/x, '<pre>\1</pre>')
Output:
"before\n<pre>    indent1\n    indent2\n</pre>middle\n<pre>    indent1\n</pre>after"
I think one of your newline captures is redundant. You can use ^ and $ (with the s flag turned off) to match the beginning and end of each line, which is a better pattern than trying to match newlines explicitly.
Try this pattern:
/(?:^(?:[ ]{4}|\t).*$[\n\r]*)+/
