regex too longwinded and multiline issues

I'm trying to change text from this:
\v 1 something \f + \xo footnote one \f* whatever \x + \xo footnote two \x* more text \f + \xo footnote three \f* blah blah blah \x + \xo footnote four \x*
\v 2 something \x + \xo footnote one \x*
to this:
\v 1 something \f * \xo footnote one \f* whatever \x ** \xo footnote two \x* more text \f *** \xo footnote three \f* blah blah blah \x $ \xo footnote four \x*
\v 2 something \x * \xo footnote one \x*
So in each footnote, instead of a '+', each will have the next symbol in the sequence (* ** *** $ $$ $$$ £), but the sequence has to reset when it gets to a new verse (\v). There can be up to 7 footnotes in each verse.
I'm new to Ruby, so I know there's a better way to do this. What I've done is very long-winded:
file = File.open('input.txt', 'r+')
contents = file.read
reassign = contents.gsub(/(\\v.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+/m, '\1*\3**\5***\7$\9$$')
.gsub(/(\\v.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+/m, '\1*\3**\5***\7$')
.gsub(/(\\v.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+/m, '\1*\3**\5***')
.gsub(/(\\v.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+/m, '\1*\3**')
.gsub(/(\\v.*?(\\x|\\f) )\+/m, '\1*')
new_file = File.open("output.txt", "w+")
new_file.write(reassign)
new_file.close
If I don't add the m after the regex search, it skips a lot of footnotes that are longer than one line, but if I add it, it skips over the verses altogether and doesn't reset the sequence.
Thanks

I suggest you first split('\v') the string, giving you an array of strings, map each of the strings in the resulting array to a string with the footnote symbols replaced with the appropriate strings, then join('\v') the strings back together.
Code
def map_footnote_symbols(str)
  str.split('\v').map do |s|
    t = %w{ * ** *** $ $$ $$$ £ }
    s.gsub(/./) { |c| (c == '+') ? t.shift : c }
  end.join('\v')
end
Example
str = "\\v 1 something \\f + \\xo footnote one \\f* whatever " +
"\\x + \\xo footnote two \\x* more text \\f + \\xo footnote " +
"three \\f* blah blah blah \\x + \\xo footnote four \\x* \\v 2 " +
"something \\x + \\xo footnote one \\x*"
(I've broken the string into pieces so that it can be viewed without having to scroll horizontally. The backslashes are doubled because, in a double-quoted string, a single backslash would start an escape sequence.)
puts map_footnote_symbols str
#=> \v 1 something \f * \xo footnote one \f* whatever \x ** \xo |
# footnote two \x* more text \f *** \xo footnote three \f* |
# blah blah blah \x $ \xo footnote four \x* \v 2 something |
# \x * \xo footnote one \x*
(I've broken the output string into pieces so that it can be viewed without having to scroll horizontally. The character | indicates where I've broken each line.)
Explanation
a = str.split('\v')
#=> ["",
# " 1 something \\f + \\xo footnote one \\f* whatever \\x + \\xo |
# footnote two \\x* more text \\f + \\xo footnote three \\f* |
# blah blah blah \\x + \\xo footnote four \\x* ",
# " 2 something \\x + \\xo footnote one \\x*"]
(Again, I've broken the second string in the array into pieces so that it can be viewed without having to scroll horizontally.)
map passes each element of a into its block, assigning it to the block variable s. The first is:
s = ""
We then have:
t = %w{ * ** *** $ $$ $$$ £ }
#=> ["*", "**", "***", "$", "$$", "$$$", "£"]
b = "".gsub(/./) { |c| (c == '+') ? t.shift : c }
#=> ""
So "" is (obviously) mapped to "". The next element (string) that map passes into the block is:
s = " 1 something \\f + \\xo footnote one \\f* whatever \\x + \\xo " +
"footnote two \\x* more text \\f + \\xo footnote three \\f* " +
"blah blah blah \\x + \\xo footnote four \\x* "
The regex /./ causes gsub to pass each character of s to its block to determine the substituted value. (c == '+') is false for every character up to the first +, so these characters are all left unchanged (i.e., replaced by c). The first + is replaced by t.shift:
t = ["*", "**", "***", "$", "$$", "$$$", "£"]
t.shift #=> "*"
leaving
t #=> ["**", "***", "$", "$$", "$$$", "£"]
The characters up to the next + are left unchanged and that + is replaced by:
t.shift #=> "**"
leaving
t #=> ["***", "$", "$$", "$$$", "£"]
and so on. As a result:
c = a.map do |s|
  t = %w{ * ** *** $ $$ $$$ £ }
  s.gsub(/./) { |c| (c == '+') ? t.shift : c }
end
#=> ["",
# " 1 something \\f * \\xo footnote one \\f* whatever \\x ** \\xo |
# footnote two \\x* more text \\f *** \\xo footnote three \\f* |
# blah blah blah \\x $ \\xo footnote four \\x* ",
# " 2 something \\x * \\xo footnote one \\x*"]
All that remains is to reassemble the string:
c.join('\v')
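The same split/map/join idea can be written a little more defensively. What follows is only a sketch under stated assumptions, not the code from the answer above: the names renumber_footnotes and SYMBOLS are made up for illustration, a + is replaced only when it directly follows \f or \x and a space (so a + elsewhere in the text is left alone), and a + is kept as-is if a verse has more footnotes than there are symbols.

```ruby
# Hypothetical helper names; the symbol sequence is the one from the question.
SYMBOLS = %w[* ** *** $ $$ $$$ £].freeze

def renumber_footnotes(text)
  # '\v' in single quotes is a backslash followed by the letter v.
  # The -1 limit keeps trailing empty pieces so join restores the text exactly.
  text.split('\v', -1).map do |verse|
    symbols = SYMBOLS.dup # fresh sequence for every verse
    # Only a '+' right after "\f " or "\x " is treated as a footnote caller.
    verse.gsub(/(\\[fx] )\+/) { "#{$1}#{symbols.shift || '+'}" }
  end.join('\v')
end

sample = "\\v 1 something \\f + \\xo footnote one \\f* whatever " \
         "\\x + \\xo footnote two \\x*\n" \
         "\\v 2 something \\x + \\xo footnote one \\x*"
puts renumber_footnotes(sample)
```

Because split works on the whole string, footnotes that span several lines need no /m flag. To apply this to the files from the question, something like File.write('output.txt', renumber_footnotes(File.read('input.txt'))) should work.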

Related

How do I write a regex that eliminates the space between a number and a colon?

I want to replace a space between one or two numbers and a colon followed by a space, a number, or the end of the line. If I have a string like,
line = " 0 : 28 : 37.02"
the result should be:
" 0: 28: 37.02"
I tried as below:
line.gsub!(/(\A|[ \u00A0|\r|\n|\v|\f])(\d?\d)[ \u00A0|\r|\n|\v|\f]:(\d|[ \u00A0|\r|\n|\v|\f]|\z)/, '\2:\3')
# => " 0: 28 : 37.02"
It seems to match the first ":", but the second ":" is not matched. I can't figure out why.

The problem
I'll define your regex with comments (in free-spacing mode) to show what it is doing.
r =
/
( # begin capture group 1
\A # match beginning of string (or does it?)
| # or
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
) # end capture group 1
(\d?\d) # match one or two digits in capture group 2
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
: # match ":"
( # begin capture group 3
\d # match a digit
| # or
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
| # or
\z # match the end of the string
) # end capture group 3
/x # free-spacing regex definition mode
Note that '|' is not a special character ("or") within a character class. It's treated as an ordinary character. (Even if '|' were treated as "or" within a character class, that would serve no purpose, because a character class already matches any one of the characters it contains.)
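A quick way to see the difference is to scan a small string with both forms. This is only an illustrative sketch; the sample string "a|b|c" is made up.

```ruby
# Inside a character class, '|' is just another character to match.
p "a|b|c".scan(/[a|b]/)  #=> ["a", "|", "b", "|"]

# Outside a character class, '|' separates alternatives.
p "a|b|c".scan(/a|b/)    #=> ["a", "b"]
```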
Suppose the string has three leading spaces:
line = "   0 : 28 : 37.02"
Then
line.gsub(r, '\2:\3')
#=> "  0: 28 : 37.02"
$1 #=> " "
$2 #=> "0"
$3 #=> " "
In capture group 1, \A (the beginning of the string) is tried first. It is a zero-width anchor and consumes no characters, so it can only succeed when the string begins with a digit, which is not the case here. The regex engine therefore falls back to the other side of the "or" ('|') and attempts to match one character of the string " \u00A0|\r\n\v\f", which can be any one of the three spaces at the beginning of line.
Next, capture group 2 captures "0". For it to do that, capture group 1 must have captured the space at index 2 of line. Then one more space and a colon are matched and, lastly, capture group 3 takes the space after the colon.
The substring ' 0 : ' is therefore replaced with '\2:\3' #=> '0: ', so gsub returns "  0: 28 : 37.02". Notice that one space before '0' was removed (but should have been retained). The second colon is not matched because gsub resumes scanning at "28", immediately after the first match: the space in front of "28" has already been consumed by capture group 3, so capture group 1 has nothing left to match there.
A solution
Here's how you can remove the last of one or more Unicode whitespace characters that are preceded by one or two digits (and not more) and are followed by a colon at the end of the string or a colon followed by a whitespace or digit. (Whew!)
def trim(str)
  str.gsub(/\d+[[:space:]]+:(?![^[:space:]\d])/) do |s|
    s[/\d+/].size > 2 ? s : s[0,s.size-2] << ':'
  end
end
The regular expression reads, "match one or more digits followed by one or more whitespace characters, followed by a colon (all these characters are matched), not followed (negative lookahead) by a character other than a unicode whitespace or digit". If there is a match, we check to see how many digits there are at the beginning. If there are more than two the match is returned (no change), else the whitespace character before the colon is removed from the match and the modified match is returned.
trim " 0 : 28 : 37.02"
#=> " 0: 28: 37.02"
trim " 0\v: 28 :37.02"
#=> " 0: 28:37.02"
trim " 0\u00A0: 28\n:37.02"
#=> " 0: 28:37.02"
trim " 123 : 28 : 37.02"
#=> " 123 : 28: 37.02"
trim " A12 : 28 :37.02"
#=> " A12: 28:37.02"
trim " 0 : 28 :"
#=> " 0: 28:"
trim " 0 : 28 :A"
#=> " 0: 28 :A"
If, as in the example, the only characters in the string are digits, whitespaces and colons, the negative lookahead is not needed.
You can use Ruby's \p{} construct, \p{Space}, in place of the POSIX expression [[:space:]]. Both match a class of Unicode whitespace characters, including those shown in the examples.
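As a sketch of that substitution, here is the same method with \p{Space} swapped in for [[:space:]]. The name trim_p is hypothetical, chosen only to keep it apart from trim above; the logic is otherwise unchanged.

```ruby
# Same approach as trim, with \p{Space} in place of [[:space:]].
def trim_p(str)
  str.gsub(/\d+\p{Space}+:(?![^\p{Space}\d])/) do |s|
    # Leave runs of more than two digits alone; otherwise drop the
    # whitespace character just before the colon.
    s[/\d+/].size > 2 ? s : s[0, s.size - 2] << ':'
  end
end

p trim_p(" 0\u00A0: 28\n:37.02")  #=> " 0: 28:37.02"
p trim_p(" 123 : 28 : 37.02")     #=> " 123 : 28: 37.02"
```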

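Excluding a third digit can be done with a negative lookbehind, but since the one or two digits to keep are of variable length, you cannot use a positive lookbehind for that part; capture them instead. (Inside a character class, \$ is a literal dollar sign rather than an end-of-line anchor, so the end of the line is written as an alternative outside the class.)
line.gsub(/(?<!\d)(\d{1,2}) (?=:(?:[ \d]|$))/, '\1')
# => " 0: 28: 37.02"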

" 0 : 28 : 37.02".gsub!(/(\d)(\s)(:)/,'\1\3')
=> " 0: 28: 37.02"

How do I write a regex that captures the first non-numeric part of string that also doesn't include 3 or more spaces?

I'm using Ruby 2.4. I want to extract from a string the first run of consecutive non-numeric characters that does not contain three or more consecutive spaces. For example, in this string
str = "123 aa bb cc 33 dd"
The first such occurrence is " aa bb ". I thought the expression below would help me
str.split(/[[:space:]][[:space:]][[:space:]]+/).first[/\p{L}\D+\p{L}\p{L}/i]
but if the string is "123 456 aaa", it fails to return " aaa", which I would want it to.

r = /
(?: # begin non-capture group
[ ]{,2} # match 0, 1 or 2 spaces
[^[ ]\d]+ # match 1+ characters that are neither spaces nor digits
)+ # end non-capture group and perform 1+ times
[ ]{,2} # match 0, 1 or 2 spaces
/x # free-spacing regex definition mode
str = "123 aa bb cc 33 dd"
str[r] #=> " aa bb "
Note that [ ] could be replaced by a space if free-spacing regex definition mode is not used:
r = /(?: {,2}[^ \d]+)+ {,2}/
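Runs of spaces are easy to lose when a string is copied around, so the sketch below builds the test string with an explicit run of three spaces between "bb" and "cc". That spacing is an assumption about the intended input, not something stated in the question.

```ruby
r = /(?: {,2}[^ \d]+)+ {,2}/

# Three spaces between "bb" and "cc" (assumed), written out explicitly.
str = "123 aa bb" + (" " * 3) + "cc 33 dd"

p str[r]            #=> " aa bb  "  (the final {,2} takes two of the three spaces)
p "123 456 aaa"[r]  #=> " aaa"
```

If the trailing spaces are unwanted, calling rstrip on the result would remove them.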

Remove all digits + spaces from the start of a string. Then split with 3 or more whitespaces and grab the first item.
def parse_it(s)
  s[/\A(?:[\d[:space:]]*\d)?(\D+)/, 1].split(/[[:space:]]{3,}/).first
end
puts parse_it("123 aa bb cc 33 dd")
# => aa bb
puts parse_it("123 456 aaa")
# => aaa
The first regex \A(?:[\d[:space:]]*\d)?(\D+) matches:
\A - start of a string
(?:[\d[:space:]]*\d)? - an optional sequence of:
[\d[:space:]]* - 0+ digits or whitespaces
\d - a digit
(\D+) - Group 1, capturing 1 or more non-digits
The splitting regex is [[:space:]]{3,}, it matches 3 or more whitespaces.

It looks like this'd do it:
regex = /(?: {1,2}[[:alpha:]]{2,})+/
"123 aa bb cc 33 dd"[regex] # => " aa bb"
"123 456 aaa"[regex] # => " aaa"
(?: ... ) is a non-capturing group.
{1,2} means "find at least one, and at most two".
[[:alpha:]] is a POSIX definition for alphabet characters. It's more comprehensive than [a-z].
You should be able to figure out the rest, which is all documented in the Regexp documentation and String's [] documentation.

Will this work?
str.match(/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/)[0]
or apparently
str[/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/]
or using Cary's nice space match,
str[/ {,2}(?:[^ 0-9]+ {,2})+/]

Ruby split keep the delimiter before the string

I have the following string :
a = '% abc \n %% abcd \n %% efgh\n '
I would like the output to be
['% abc \n', '%% abcd \n', '%% efgh \n']
If I have
b = '%% abc \n %% efg \n %% ijk \n'
I would like the output to be
['%% abc \n', '%% efg \n', '%% ijk \n']
I use b.split('%%').collect!{|v| '%%' + v } and it works fine for case 2.
but it doesn't work for case 1.
I saw some post of using 'scan' or 'split' to keep the delimiter if its after the string
For example : 'a; b; c' becomes ['a;', 'b;' ,'c']
But I want the opposite ['a', ';b', ';c']
There need not be space between \n and %% since \n depicts a new line.
A solution I made was
test_string = '% asd \n %% asf sdaf \n %% adsasd asdf asd asf '
delimiter = '%%'
index_of_percent = test_string.index(delimiter)
if index_of_percent == 0
  result = test_string.split(delimiter).reject(&:empty?).collect! { |v| delimiter + v }
else
  result = test_string.slice(index_of_percent..-1).split(delimiter).reject(&:empty?).collect! { |v| delimiter + v }
  result.unshift(test_string[0..index_of_percent - 1])
end

(?<=\\n)\s*(?=%%)
You can split on the space using lookarounds. See demo.
https://regex101.com/r/fM9lY3/7
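In Ruby, that pattern could be passed to String#split roughly as follows. This is a sketch that assumes the string holds a literal backslash followed by n (single quotes), as in the question; for real newline characters the lookbehind would be (?<=\n) instead.

```ruby
a = '% abc \n %% abcd \n %% efgh\n '

# Split on the whitespace that sits between a literal \n and the next %%.
# chomp(' ') then drops the single trailing space left on the last piece.
parts = a.split(/(?<=\\n)\s*(?=%%)/).map { |s| s.chomp(' ') }

p parts  #=> ["% abc \\n", "%% abcd \\n", "%% efgh\\n"]
```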

You could do it this way
def splitter(s)
  # reject(&:empty?) added to handle the trailing space in a
  s.lines.map { |n| n.lstrip.chomp(' ') }.reject(&:empty?)
end
#double quotes used to keep ruby from changing
# \n to \\n
a = "% abc \n %% abcd \n %% efgh\n "
b = "%% abc \n %% efg \n %% ijk \n"
splitter(a)
#=> ["% abc \n", "%% abcd \n", "%% efgh\n"]
splitter(b)
#=> ["%% abc \n", "%% efg \n", "%% ijk \n"]
String#lines partitions the string right after each newline character by default and returns an Array. We then call Array#map, passing in each line. Each line goes through lstrip to remove the leading space and chomp(' ') to remove a trailing space without removing the \n. Finally we reject any empty strings, as arises with variable a because of its trailing space.

You can also use
a.split(/\\n\s?/).collect { |e| "#{e}\\n" }
Step by step, the split gives
a.split(/\\n\s?/)
#=> ["% abc ", "%% abcd ", "%% efgh"]
and collect { |e| "#{e}\\n" } then appends \n to each element:
#=> ["% abc \\n", "%% abcd \\n", "%% efgh\\n"]

Building regex to match 2 words only

I'm trying to write a regexp that matches exactly two words with a single space between them. No special symbols, only [a-zA-Z] space [a-zA-Z].
Foo Bar # Match (two words and one space only)
Foo # Mismatch (only one word)
Foo  Bar # Mismatch (2 spaces)
Foo Bar Baz # Mismatch (3 words)

You want ^[a-zA-Z]+\s[a-zA-Z]+$
^ # matches the start of a line (in Ruby, \A matches the start of the string)
+ # quantifier meaning one or more of the previous character class
\s # matches a whitespace character
$ # matches the end of a line (in Ruby, \z matches the end of the string)
The anchors ^ and $ are important here.
Demo:
if "foo bar" =~ /^[a-zA-Z]+\s[a-zA-Z]+$/
  print "match 1"
end
if "foo  bar" =~ /^[a-zA-Z]+\s[a-zA-Z]+$/
  print "match 2"
end
if "foo bar biz" =~ /^[a-zA-Z]+\s[a-zA-Z]+$/
  print "match 3"
end
Output:
match 1
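Since ^ and $ anchor to lines in Ruby, a multi-line string whose first line is two words would still match. A stricter variant, sketched below, anchors to the whole string with \A and \z and uses a literal space instead of \s (which also matches tabs and newlines). The constant names are made up for illustration, and match? needs Ruby 2.4 or later.

```ruby
LINE_ANCHORED   = /^[a-zA-Z]+\s[a-zA-Z]+$/   # pattern from the answer
STRING_ANCHORED = /\A[a-zA-Z]+ [a-zA-Z]+\z/  # whole-string variant

["Foo Bar", "Foo", "Foo  Bar", "Foo Bar Baz", "Foo Bar\nBaz"].each do |s|
  puts "#{s.inspect}: line=#{LINE_ANCHORED.match?(s)} string=#{STRING_ANCHORED.match?(s)}"
end
```

The two patterns agree on the four examples from the question and differ only on the last, multi-line string, which the line-anchored pattern accepts and the string-anchored one rejects.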

How to add string "\n" literally at the end of each line in Ruby?

Here is a string str:
str = "line1
line2
line3"
We would like to add string "\n" to the end of each line:
str = "line1 \n
line2 \n
line3 \n"
A method is defined:
def mod_line(str)
  s = ""
  str.each_line do |l|
    s += l + '\\n'
  end
end
The problem is that '\n' is a line feed and was not added to the end of the str even with escape \. What's the right way to add '\n' literally to each line?

String#gsub/String#gsub! plus a very simple regular expression can be used to achieve that:
str = "line1
line2
line3"
str.gsub!(/$/, ' \n')
puts str
Output:
line1 \n
line2 \n
line3 \n

The platform-independent solution:
str.gsub(/\R/) { " \\n#{$~}" }
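It searches for line breaks (line feeds, carriage returns, or both together) and replaces each with itself, preceded by a space and a literal \n. A last line that does not end with a line break is left unchanged.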

\n needs to be interpreted as a special character. You need to put it in double quotes.
"\n"
Your attempt:
'\\n'
only escapes the backslash, which is actually redundant. With or without escaping on the backslash, it gives you a backslash followed by the letter n.
Also, your method mod_line returns the result of str.each_line, which is the original string str. You need to return the modified string s:
def mod_line(str)
...
s
end
And by the way, be aware that each line of the original string already has "\n" at the end of each line, so you are adding the second "\n" to each line (making it two lines).
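Putting those points together, one possible version of mod_line is sketched below. It assumes the goal is the output shown in the question (a space and a literal backslash-n at the end of every line), and it removes each line's own line break before appending, so no extra blank lines appear.

```ruby
def mod_line(str)
  # chomp drops the line's own line break; ' \n' in single quotes is a
  # space, a backslash and the letter n. The lines are rejoined with real
  # newlines, and the joined string is the method's return value.
  str.each_line.map { |line| line.chomp + ' \n' }.join("\n")
end

str = "line1\nline2\nline3"
puts mod_line(str)
```

Note that join("\n") does not restore a trailing newline if the input ended with one.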

This is the closest I got to it.
def mod_line(str)
  s = ""
  str.each_line do |l|
    s += l
  end
  p s
end
Using p instead of puts leaves the \n on the end of each line.
