Ignore empty captures when splitting string - ruby

I have a string:
"Ayy ***lol* m8\nlol"
I would like to not include the empty capture and produce:
["Ayy ", "**", "*", "lol", "*", " m8", "\n", "lol"]
I am splitting the string by this regex:
/(?x)(\*\*|\*|\n|[.])/
This produces:
["Ayy ", "**", "", "*", "lol", "*", " m8", "\n", "lol"]

Here is a simplified version of your regex, chained with a method to remove empty strings -- which is inevitably necessary here when using String#split, since there is an 'empty result' in the middle of '***':
string = "Ayy ***lol* m8\nlol"
string.split(/(\*{1,2}|\n|\.)/).reject(&:empty?)
#=> ["Ayy ", "**", "*", "lol", "*", " m8", "\n", "lol"]
A few differences from your pattern:
I have removed the (?x); this served no purpose. Extended patterns are useful for ignoring spaces and comments within the regex - neither of which you are doing here.
\*\*|\* can be simplified to \*{1,2} (or \*\*? if you prefer).
[.] is technically fine, but \. is one character shorter and in my opinion shows clearer intent.
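To see where the empty string comes from, here is a minimal sketch. The constant DELIMITERS and the helper split_keeping_delimiters are names made up for this illustration; they are not part of Ruby.

```ruby
# Hypothetical names, for illustration only.
DELIMITERS = /(\*{1,2}|\n|\.)/

# Split on the delimiters, keep them in the result (because of the
# capture group), and drop any empty strings.
def split_keeping_delimiters(string)
  string.split(DELIMITERS).reject(&:empty?)
end

string = "Ayy ***lol* m8\nlol"

# "**" and "*" match back to back, so split emits the empty string
# that sits between the two matches.
raw = string.split(DELIMITERS)

p raw
p split_keeping_delimiters(string)
```

The first array still contains the "" between "**" and "*"; the second does not.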

When splitting with a regex containing capturing groups, consecutive matches always produce empty array items.
Rather than switch to a matching approach, use
arr = arr.reject { |c| c.empty? }
Or any other method, see How do I remove blank elements from an array?
Else, you will have to match the substrings using a regex that matches the delimiters first and then any text that does not start a delimiter (that is, you will need to build a tempered greedy token):
arr = s.scan(/(?x)\*{2}|[*\n.]|(?:(?!\*{2})[^*\n.])+/)
See the regex demo.
Here,
(?x) - a freespacing/comment modifier
\*{2} - ** substring
| - or
[*\n.] - a char that is either *, newline LF or a .
| - or
(?:(?!\*{2})[^*\n.])+ - 1 or more (+) chars that are not *, LF or . ([^*\n.]) that do not start a ** substring.
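As a rough check that the matching approach and the split-and-reject approach agree, the sketch below runs both on a few inputs. The constant names and the extra sample strings are assumptions added for illustration; only the first string comes from the question.

```ruby
# Hypothetical names, for illustration only.
TOKEN_PATTERN = /\*{2}|[*\n.]|(?:(?!\*{2})[^*\n.])+/  # used with scan
SPLIT_PATTERN = /(\*\*|\*|\n|[.])/                    # the question's pattern

samples = ["Ayy ***lol* m8\nlol", "a.b**c", "****"]

results = samples.map do |s|
  scanned  = s.scan(TOKEN_PATTERN)
  splitted = s.split(SPLIT_PATTERN).reject(&:empty?)
  [scanned, splitted]
end

results.each { |scanned, splitted| p scanned == splitted }
```

Because TOKEN_PATTERN has no capturing groups, scan returns a flat array of the matched substrings.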

r = /
[ ]+ # match one or more spaces
| # or
(\*) # match one asterisk in capture group 1
[ ]* # match zero or more spaces
(?!\*) # not to be followed by an asterisk (negative lookahead)
| # or
(\n) # match "\n" in capture group 2
/x # free-spacing regex definition mode
str = "Ayy ***lol* m8\nlol"
str.split r
#=> ["Ayy", "**", "*", "lol", "*", "m8", "\n", "lol"]

Related

Ruby string split with space as input removes more than just spaces

I have some text that has this particular character:
When I call the string split() method (with just ' ' as the input) , the  gets removed. What should I do to keep the ?
That's the expected behavior when passing ' '. According to the docs:
If pattern is a single space, str is split on whitespace, with leading and trailing whitespace and runs of contiguous whitespace characters ignored.
With "whitespace" being space (" ") and \t, \n, \v, \f, \r.
"foo bar\nbaz \f qux".split(' ')
#=> ["foo", "bar", "baz", "qux"]
To split on space (U+0020) only, you have to use a regexp:
"foo bar\nbaz \f qux".split(/ /)
#=> ["foo", "bar\nbaz", "\f", "qux"]
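One more sketch of the difference, using a made-up string with leading, trailing and doubled spaces. It also shows a side effect of the regexp form: empty strings appear where spaces are adjacent or leading.

```ruby
text = " foo  bar\tbaz "

# A single-space string triggers the special whitespace mode:
# leading/trailing whitespace and runs of whitespace are ignored.
awk_style = text.split(' ')

# A regexp splits on exactly what it matches, here one U+0020 at a time,
# so the tab survives and empty strings appear between adjacent spaces.
each_space = text.split(/ /)

# Splitting on runs of spaces avoids the empty string in the middle,
# but the leading empty string (before the first space) is still there.
space_runs = text.split(/ +/)

p awk_style   # ["foo", "bar", "baz"]
p each_space  # ["", "foo", "", "bar\tbaz"]
p space_runs  # ["", "foo", "bar\tbaz"]
```

If the leading empty string is unwanted, something like reject(&:empty?) can be chained on, at the cost of no longer being able to tell where leading spaces were.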

How do I write a regex that eliminates the space between a number and a colon?

I want to replace a space between one or two numbers and a colon followed by a space, a number, or the end of the line. If I have a string like,
line = " 0 : 28 : 37.02"
the result should be:
" 0: 28: 37.02"
I tried as below:
line.gsub!(/(\A|[ \u00A0|\r|\n|\v|\f])(\d?\d)[ \u00A0|\r|\n|\v|\f]:(\d|[ \u00A0|\r|\n|\v|\f]|\z)/, '\2:\3')
# => " 0: 28 : 37.02"
It seems to match the first ":", but the second ":" is not matched. I can't figure out why.
The problem
I'll define your regex with comments (in free-spacing mode) to show what it is doing.
r =
/
( # begin capture group 1
\A # match beginning of string (or does it?)
| # or
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
) # end capture group 1
(\d?\d) # match one or two digits in capture group 2
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
: # match ":"
( # begin capture group 3
\d # match a digit
| # or
[ \u00A0|\r|\n|\v|\f] # match one of the characters in the string " \u00A0|\r\n\v\f"
| # or
\z # match the end of the string
) # end capture group 3
/x # free-spacing regex definition mode
Note that '|' is not a special character ("or") within a character class. It's treated as an ordinary character. (Even if '|' were treated as "or" within a character class, that would serve no purpose because character classes are used to force any one character within it to be matched.)
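A quick way to convince yourself that '|' is literal inside a character class (the variable names are only for illustration):

```ruby
with_pipe    = /[ |\n]/  # space, a literal "|", or newline
without_pipe = /[ \n]/   # space or newline

pipe_matches_with    = "|".match?(with_pipe)
pipe_matches_without = "|".match?(without_pipe)

# Every character of "a|b" is in the class [a|b], including the pipe.
replaced = "a|b".gsub(/[a|b]/, "-")

p pipe_matches_with     # true
p pipe_matches_without  # false
p replaced              # "---"
```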
Suppose
line = " 0 : 28 : 37.02"
Then
line.gsub(r, '\2:\3')
#=> " 0: 28 : 37.02"
$1 #=> " "
$2 #=> "0"
$3 #=> " "
In capture group 1, \A is a zero-width anchor, so using it inside an alternation is valid and raises no exception. It does match at the very start of line, but that alternative is then abandoned because the next token, (\d?\d), finds a space there rather than a digit. The engine therefore falls back to the second alternative, which matches one character of the string " \u00A0|\r\n\v\f"; here that is the space immediately before "0".
Next, capture group 2 captures "0". Then one more space and a colon are matched, and lastly capture group 3 takes the space after the colon.
The substring ' 0 : ' is therefore replaced with '\2:\3' #=> '0: ', so gsub returns " 0: 28 : 37.02". Notice that one space before '0' was removed (but should have been retained). The second colon is not matched because the space in front of "28" has already been consumed by capture group 3, so capture group 1 has nothing left to match there.
A solution
Here's how you can remove the last of one or more Unicode whitespace characters that are preceded by one or two digits (and not more) and are followed by a colon at the end of the string or a colon followed by a whitespace or digit. (Whew!)
def trim(str)
str.gsub(/\d+[[:space:]]+:(?![^[:space:]\d])/) do |s|
s[/\d+/].size > 2 ? s : s[0,s.size-2] << ':'
end
end
The regular expression reads, "match one or more digits followed by one or more whitespace characters, followed by a colon (all these characters are matched), not followed (negative lookahead) by a character other than a unicode whitespace or digit". If there is a match, we check to see how many digits there are at the beginning. If there are more than two the match is returned (no change), else the whitespace character before the colon is removed from the match and the modified match is returned.
trim " 0 : 28 : 37.02"
#=> " 0: 28: 37.02"
trim " 0\v: 28 :37.02"
#=> " 0: 28:37.02"
trim " 0\u00A0: 28\n:37.02"
#=> " 0: 28:37.02"
trim " 123 : 28 : 37.02"
#=> " 123 : 28: 37.02"
trim " A12 : 28 :37.02"
#=> " A12: 28:37.02"
trim " 0 : 28 :"
#=> " 0: 28:"
trim " 0 : 28 :A"
#=> " 0: 28 :A"
If, as in the example, the only characters in the string are digits, whitespaces and colons, the negative lookahead is not needed.
You can use Ruby's \p{} construct, \p{Space}, in place of the POSIX expression [[:space:]]. Both match a class of Unicode whitespace characters, including those shown in the examples.
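A small sketch of that substitution. The method name trim_with_property is invented here, and the comparison with \s is an extra observation rather than something stated above: \s covers only the ASCII whitespace characters, so it does not match a no-break space.

```ruby
nbsp = "\u00A0"  # NO-BREAK SPACE

posix_matches    = nbsp.match?(/[[:space:]]/)
property_matches = nbsp.match?(/\p{Space}/)
ascii_matches    = nbsp.match?(/\s/)

# Hypothetical variant of trim with \p{Space} in place of [[:space:]].
def trim_with_property(str)
  str.gsub(/\d+\p{Space}+:(?![^\p{Space}\d])/) do |s|
    # Leave runs of more than two digits alone; otherwise drop the
    # whitespace character just before the colon.
    s[/\d+/].size > 2 ? s : s[0, s.size - 2] + ':'
  end
end

p posix_matches     # true
p property_matches  # true
p ascii_matches     # false
p trim_with_property(" 0\u00A0: 28\n:37.02")
```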
Excluding a third digit can be done with a negative lookbehind, but since the one or two digits that follow vary in length, you cannot use a positive lookbehind for that part; capture them instead.
line.gsub(/(?<!\d)(\d{1,2}) (?=:(?:[ \d]|$))/, '\1')
# => " 0: 28: 37.02"
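A short check of the lookbehind approach, written as a sketch. The names COLON_GAP and tighten_colons are invented here, and the end-of-line case is expressed as an alternation with $, which is an assumption about the intent based on the question's "or the end of the line".

```ruby
# Hypothetical names, for illustration only.
COLON_GAP = /(?<!\d)(\d{1,2}) (?=:(?:[ \d]|$))/

# Remove the single space between a one- or two-digit number and a colon
# that is followed by a space, a digit, or the end of the line.
def tighten_colons(line)
  line.gsub(COLON_GAP, '\1')
end

p tighten_colons(" 0 : 28 : 37.02")  # " 0: 28: 37.02"
p tighten_colons(" 123 : 28 :")      # " 123 : 28:"
p tighten_colons("7 :A")             # "7 :A"
```

Unlike the [[:space:]] version, this pattern only removes a plain U+0020 space.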
" 0 : 28 : 37.02".gsub!(/(\d)(\s)(:)/,'\1\3')
=> " 0: 28: 37.02"

How do I find any space before "."

I have names "example .png" and "example 2.png". I am trying to convert any space to "_" and any space before "." should be removed.
So far I am doing it like this:
file.gsub(" .",".").gsub(" ", "_").gsub(".tif", "")
Use an rstripped File.basename(filename,File.extname(filename)) and replace spaces with underscores inside it then add an extname:
File.basename(filename,File.extname(filename)).rstrip.gsub(" ", "_") + File.extname(filename)
See the Ruby demo
Details:
File.basename(filename,File.extname(filename)) - get file name without extension
.rstrip - remove whitespace before the extension
.gsub(" ", "_") - replaces spaces with underscores (use the /\s+/ regex instead to replace runs of any whitespace)
File.extname(filename) - a file extension.
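The steps above can be wrapped in a small method. This is a minimal sketch; the name normalize_filename is made up for illustration.

```ruby
# Hypothetical helper: strip whitespace before the extension and
# replace the remaining spaces in the name with underscores.
def normalize_filename(filename)
  ext  = File.extname(filename)                   # e.g. ".png"
  stem = File.basename(filename, ext).rstrip      # name without extension or trailing whitespace
  stem.gsub(" ", "_") + ext
end

p normalize_filename("example .png")   # "example.png"
p normalize_filename("example 2.png")  # "example_2.png"
```

Note that File.basename also drops any leading directory components, so this sketch is meant for bare file names.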
If you prefer a regex way:
s = 'some example 2 .png'
puts s.gsub(/\s+(\.[^.]+\z)|\s/) {
Regexp.last_match(1) ?
Regexp.last_match(1) :
"_"
}
(can be shortened to s.gsub(/\s+(\.[^.]+\z)|\s/) { $1 || "_" } (see Jordan's remark)).
See this Ruby demo.
Here, the pattern matches:
\s+(\.[^.]+\z) - 1 or more whitespaces (\s+) before the extension (\.[^.]+ - a dot followed with 1+ chars other than a dot before the end of string \z), while capturing the extension into Group 1
| - or
\s - any other whitespace symbol (add + after it if you need to replace whole whitespace chunks with underscores).
In the gsub block, a check is performed to test Group 1, and if it matched, only the extension is inserted into the result. Else, a whitespace is replaced with an underscore.

Where did the character go?

I matched a string against a regex:
s = "`` `foo`"
r = /(?<backticks>`+)(?<inline>.+)\g<backticks>/
And I got:
s =~ r
$& # => "`` `foo`"
$~[:backticks] # => "`"
$~[:inline] # => " `foo"
Why is $~[:inline] not "` `foo"? Since $& is s, I expect:
$~[:backticks] + $~[:inline] + $~[:backticks]
to be s, but it is not, one backtick is gone. Where did the backtick go?
It is actually expected. Look:
(?<backticks>`+) - matches 1+ backticks and stores them in the named capture group "backticks" (there are two backticks). Then...
(?<inline>.+) - 1+ characters other than a newline are matched into the "inline" named capture group. It grabs all the string and backtracks to yield characters to the recursed subpattern that is actually the "backticks" capture group. So,...
\g<backticks> - finds 1 backtick that is at the end of the string. It satisfies the condition to match 1+ backticks. The named capture "backticks" buffer is rewritten here.
The matching works like this:
"`` `foo`"
 ||1
   | 2 |
        |3
And then 1 becomes 3, and since 1 and 3 are the same group, you see one backtick.
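To make the overwrite visible, the sketch below compares the subexpression call \g<backticks> with a backreference \k<backticks>. The backreference variant is not part of the question; it is added here only for contrast, since it has to repeat the captured text instead of re-running the group and recapturing.

```ruby
s = "`` `foo`"

call_pattern    = /(?<backticks>`+)(?<inline>.+)\g<backticks>/  # re-runs the group, recaptures
backref_pattern = /(?<backticks>`+)(?<inline>.+)\k<backticks>/  # must repeat the captured text

m_call    = call_pattern.match(s)
m_backref = backref_pattern.match(s)

# Rebuild the string from the captures of each match.
rebuilt_call    = m_call[:backticks] + m_call[:inline] + m_call[:backticks]
rebuilt_backref = m_backref[:backticks] + m_backref[:inline] + m_backref[:backticks]

p m_call[0] == s        # true: the whole string matched
p rebuilt_call == s     # false: the first "``" was overwritten by "`"
p rebuilt_backref == s  # true
```

With the backreference, the group ends up holding a single backtick at the start of the string, so the three pieces add up to s again.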

what would the regular expression to extract the 3 from be?

I basically need to get the bit after the last pipe
"3083505|07733366638|3"
What would the regular expression for this be?
You can do this without regex. Here:
"3083505|07733366638|3".split("|").last
# => "3"
With regex: (assuming it's always going to be integer values)
"3083505|07733366638|3".scan(/\|(\d+)$/)[0][0] # or use \w+ if you want to extract any word after `|`
# => "3"
Try this regex :
.*\|(.*)
It returns whatever comes after LAST | .
You could do that most easily by using String#rindex:
line = "3083505|07733366638|37"
line[line.rindex('|')+1..-1]
#=> "37"
If you insist on using a regex:
r = /
.* # match any number of any character (greedily!)
\| # match pipe
(.+) # match one or more characters in capture group 1
/x # extended mode
line[r,1]
#=> "37"
Alternatively:
r = /
.* # match any number of any character (greedily!)
\| # match pipe
\K # forget everything matched so far
.+ # match one or more characters
/x # extended mode
line[r]
#=> "37"
or, as suggested by @engineersmnky in a comment on @shivam's answer:
r = /
(?<=\|) # match a pipe in a positive lookbehind
\d+ # match one or more digits
\z # match end of string
/x # extended mode
line[r]
#=> "37"
I would use split and last, but you could do
last_field = line.sub(/.+\|/, "")
That removes all characters up to and including the last pipe.
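The approaches agree on the example, but they can differ on edge cases. The sketch below compares a few of them; the method names are invented for illustration, and String#rpartition is not mentioned in the answers above, it is included only as one more option.

```ruby
# Hypothetical helpers, for illustration only.
def by_split(line)
  line.split("|").last
end

def by_sub(line)
  line.sub(/.+\|/, "")
end

def by_rpartition(line)
  line.rpartition("|").last
end

line = "3083505|07733366638|37"
p by_split(line)       # "37"
p by_sub(line)         # "37"
p by_rpartition(line)  # "37"

# With a trailing pipe, split drops the trailing empty field,
# so it reports the field before the last pipe instead.
trailing = "a|b|"
p by_split(trailing)       # "b"
p by_sub(trailing)         # ""
p by_rpartition(trailing)  # ""
```

When the line contains no pipe at all, all three of these return the whole line, whereas the rindex version raises because rindex returns nil.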

Resources