In Ruby, what's the easiest way to split a string in the following manner?
'abc+def' should split to ['abc', '+', 'def']
'abc\*def+eee' should split to ['abc', '\*', 'def', '+', 'eee']
'ab/cd*de+df' should split to ['ab', '/', 'cd', '*', 'de', '+', 'df']
The idea is to split the string about these symbols: ['-', '+', '*', '/'] and also save those symbols in the result at appropriate locations.
Option 1
/\b/ is a word boundary and it has zero-width, so it will not consume any characters
'abc+def'.split(/\b/)
# => ["abc", "+", "def"]
'abc*def+eee'.split(/\b/)
# => ["abc", "*", "def", "+", "eee"]
'ab/cd*de+df'.split(/\b/)
# => ["ab", "/", "cd", "*", "de", "+", "df"]
Option 2
If your string contains other word-boundary characters and you only want to split on -, +, *, and /, then you can use a capture group. If the pattern contains a capture group, String#split also includes the captured strings in the result. (Thanks for pointing this out, @Jordan.) (@Cary Swoveland, sorry, I didn't see your answer when I made this edit.)
'abc+def'.split /([+*\/-])/
# => ["abc", "+", "def"]
'abc*def+eee'.split /([+*\/-])/
# => ["abc", "*", "def", "+", "eee"]
'ab/cd*de+df'.split /([+*\/-])/
# => ["ab", "/", "cd", "*", "de", "+", "df"]
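As a rough illustration of when Option 2 is the safer choice, here is a small comparison on inputs that are not in the question. The decimal number and the leading minus sign are assumptions made for the example, and the variable name `operators` is hypothetical.

```ruby
# The pattern is the one from Option 2; the variable name is hypothetical.
operators = /([+*\/-])/

# \b also matches on both sides of the decimal point, so the number is broken apart.
p '1.5+2'.split(/\b/)       # => ["1", ".", "5", "+", "2"]

# The capture group splits only on the four operators.
p '1.5+2'.split(operators)  # => ["1.5", "+", "2"]

# A leading operator leaves an empty string at the front, which can be dropped.
p '-a+b'.split(operators)                   # => ["", "-", "a", "+", "b"]
p '-a+b'.split(operators).reject(&:empty?)  # => ["-", "a", "+", "b"]
```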
Option 3
Lastly, for those using a language that might not support string splitting with a capture group, you can use two lookarounds. Lookarounds are also zero-width matches, so they will not consume any characters.
'abc+def'.split /(?=[+*\/-])|(?<=[+*\/-])/
# => ["abc", "+", "def"]
'abc*def+eee'.split /(?=[+*\/-])|(?<=[+*\/-])/
# => ["abc", "*", "def", "+", "eee"]
'ab/cd*de+df'.split /(?=[+*\/-])|(?<=[+*\/-])/
# => ["ab", "/", "cd", "*", "de", "+", "df"]
The idea here is to split at any position that is immediately followed by one of your separators, or immediately preceded by one. Here's a little visual:
ab ⍿ / ⍿ cd ⍿ * ⍿ de ⍿ + ⍿ df
The little ⍿ symbols are either preceded or followed by one of the separators. So this is where the string will get cut.
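A quick way to convince yourself that the cuts are zero-width is to check that the pieces join back into the original string. The sketch below does that for the three strings from the question and also compares the result with the capture-group pattern from Option 2. The variable names are hypothetical.

```ruby
# Both patterns come from Options 2 and 3; the variable names are hypothetical.
capture    = /([+*\/-])/
lookaround = /(?=[+*\/-])|(?<=[+*\/-])/

['abc+def', 'abc*def+eee', 'ab/cd*de+df'].each do |s|
  pieces = s.split(lookaround)
  # Nothing is consumed by a zero-width match, so joining restores the input.
  puts "#{s.inspect} -> #{pieces.inspect}"
  puts "  joins back to the input:  #{pieces.join == s}"
  puts "  same as capture group:    #{pieces == s.split(capture)}"
end
```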
Option 4
Maybe your language doesn't have a string split function or sensible ways to interact with regular expressions. It's nice to know you don't have to sit around guessing whether there are clever built-in procedures that magically solve your problems. There's almost always a way to solve your problem using basic instructions.
class String
  def head
    self[0]
  end
  def tail
    self[1..-1]
  end
  def reduce acc, &f
    if empty?
      acc
    else
      tail.reduce yield(acc, head), &f
    end
  end
  def separate chars
    res, acc = reduce [[], ''] do |(res, acc), char|
      if chars.include? char
        [res + [acc, char], '']
      else
        [res, acc + char]
      end
    end
    res + [acc]
  end
end
'abc+def'.separate %w(- + / *)
# => ["abc", "+", "def"]
'abc*def+eee'.separate %w(- + / *)
# => ["abc", "*", "def", "+", "eee"]
'ab/cd*de+df'.separate %w(- + / *)
# => ["ab", "/", "cd", "*", "de", "+", "df"]
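If you'd rather not reopen String, the same "basic instructions" idea can be sketched as a plain method that walks the characters once. This is only an illustration: the method name and parameter names are hypothetical, and it uses a loop instead of recursion, which may matter for long strings because the recursive version makes one nested call per character.

```ruby
# Hypothetical helper: same idea as String#separate above, but as a plain
# method that iterates over the characters and does not modify String.
def separate(str, separators)
  pieces  = []
  current = String.new          # buffer for the piece being built
  str.each_char do |char|
    if separators.include?(char)
      pieces << current << char # close the current piece, keep the separator
      current = String.new
    else
      current << char
    end
  end
  pieces << current             # whatever follows the last separator
end

p separate('abc+def', %w(- + / *))
# => ["abc", "+", "def"]
p separate('ab/cd*de+df', %w(- + / *))
# => ["ab", "/", "cd", "*", "de", "+", "df"]
```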
I see this is close to part of @naomic's answer, but I'll leave it for the small differences.
splitters = ['-', '+', '*', '/']
r = /(#{ Regexp.union(splitters) })/
# => /((?-mix:\-|\+|\*|\/))/
'abc+def'.split r
#=> ["abc", "+", "def"]
"abc\*def+eee".split r
#=> ["abc", "*", "def", "+", "eee"]
'ab/cd*de+df'.split r
#=> ["ab", "/", "cd", "*", "de", "+", "df"]
Notes:
The regex places #{ Regexp.union(splitters) } in a capture group, causing String#split to include the strings that do the splitting (see the last sentence of the third paragraph of the String#split documentation).
The second example string must be in double quotes so that \* is read as a plain *.
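To make the quoting note concrete, here is a small sketch of how the two kinds of string literal treat `\*`. The variable names are hypothetical; the regex is built the same way as above.

```ruby
single = 'abc\*def+eee'   # single quotes keep the backslash
double = "abc\*def+eee"   # double quotes turn \* into a plain *

p single.length  # => 12
p double.length  # => 11

r = /(#{Regexp.union(['-', '+', '*', '/'])})/
p double.split(r)  # => ["abc", "*", "def", "+", "eee"]
p single.split(r)  # => ["abc\\", "*", "def", "+", "eee"]
```

With single quotes the backslash stays attached to the first piece, which is why the double-quoted form is used in the answer.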
I would like to ask you for help. I have keywords in this form "AB10" and I need to split them into "AB" and "10". What is the best way?
Thank you for your help!
One could use String#scan:
def divide_str(s)
s.scan(/\d+|\D+/)
end
divide_str 'AB10' #=> ["AB", "10"]
divide_str 'AB10CD20' #=> ["AB", "10", "CD", "20"]
divide_str '10AB20CD' #=> ["10", "AB", "20", "CD"]
The regular expression /\d+|\D+/ reads, "match one or more (+) digits (\d) or one or more non-digits (\D)".
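One thing to keep in mind is that \D matches anything that is not a digit, including spaces. The sketch below assumes an input with a space in it (not part of the question) and shows a letters-only alternative using the POSIX bracket expression [[:alpha:]].

```ruby
# \D+ keeps the space together with the letters.
p 'AB 10'.scan(/\d+|\D+/)           # => ["AB ", "10"]

# Restricting the second alternative to letters drops the space.
p 'AB 10'.scan(/\d+|[[:alpha:]]+/)  # => ["AB", "10"]
p 'AB10'.scan(/\d+|[[:alpha:]]+/)   # => ["AB", "10"]
```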
Here is another way, one that does not employ a regular expression.
def divide_str(s)
digits = '0'..'9'
s.each_char.slice_when do |x,y|
digits.cover?(x) ^ digits.cover?(y)
end.map(&:join)
end
divide_str 'AB10' #=> ["AB", "10"]
divide_str 'AB10CD20' #=> ["AB", "10", "CD", "20"]
divide_str '10AB20CD' #=> ["10", "AB", "20", "CD"]
See Enumerable#slice_when, Range#cover?, TrueClass#^ and FalseClass#^.
Use split like so:
my_str.split(/(\d+)/)
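As a small sketch of what this call returns: when the string starts with digits, the result begins with an empty string, which you may want to remove. The inputs other than 'AB10' are assumptions for illustration.

```ruby
p 'AB10'.split(/(\d+)/)                   # => ["AB", "10"]

# Digits at the start leave an empty first element.
p '10AB'.split(/(\d+)/)                   # => ["", "10", "AB"]
p '10AB'.split(/(\d+)/).reject(&:empty?)  # => ["10", "AB"]
```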
To split any string on the boundary between digits and letters, use either of these 2 methods:
Use split with regex in capturing parentheses to include the delimiter, here a stretch of digits, into the resulting array. Remove empty strings (if any) using a combination of reject and empty?:
strings = ['AB10', 'AB10CD20', '10AB20CD']
strings.each do |str|
arr = str.split(/(\d+)/).reject(&:empty?)
puts "'#{str}' => #{arr}"
end
Output:
'AB10' => ["AB", "10"]
'AB10CD20' => ["AB", "10", "CD", "20"]
'10AB20CD' => ["10", "AB", "20", "CD"]
Use split with non-capturing parentheses: (?:PATTERN), positive lookahead (?=PATTERN) and positive lookbehind (?<=PATTERN) regexes to match the letter-digit and digit-letter boundaries:
strings.each do |str|
arr = str.split(/ (?: (?<=[A-Za-z]) (?=\d) ) | (?: (?<=\d) (?=[A-Za-z]) ) /x)
puts "'#{str}' => #{arr}"
end
The two methods give the same output for the cases shown.
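The two methods can differ on other inputs. As a sketch, assume a keyword with a hyphen between the letters and the digits (not one of the cases above); the variable names are hypothetical.

```ruby
by_digits   = /(\d+)/
by_boundary = /(?:(?<=[A-Za-z])(?=\d))|(?:(?<=\d)(?=[A-Za-z]))/

str = 'AB-10'

# Splitting on runs of digits still separates the number.
p str.split(by_digits).reject(&:empty?)  # => ["AB-", "10"]

# There is no position where a letter touches a digit, so nothing is split.
p str.split(by_boundary)                 # => ["AB-10"]
```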
irb(main):161:0> "Ready for your my next session?".scan(/[A-Za-z]+|\d+|. /)
=> ["Ready", "for", "your", "my", "next", "session"]
=> ["Ready", "for", "your", "my", "next", "session", "?"] #==> EXPECTED
irb(main):162:0> "yo mr. menon how are you? call at 9 a.m. \"okay\"".scan(/[A-Za-z]+|\d+|. /)
=> ["yo", "mr", ". ", "menon", "how", "are", "you", "? ", "call", "at", "9", "a", "m", ". ", "okay"]
=> ["yo", "mr", ". ", "menon", "how", "are", "you", "? ", "call", "at", "9", "a",".", "m", ".", "``", "okay", "''"] #==> EXPECTED
I am trying to use scan(/[A-Za-z]+|\d+|. /) to tokenize the string, including the punctuation, even when the string contains an escaped quote (\").
But it behaves differently depending on the structure of the string. How can I correct this?
r = /
(?: # begin a non-capture group
\"? # optionally (?) match a double-quote
\p{alpha}+ # match one or more letters
\"? # optionally (?) match a double-quote
) # end non-capture group
| # or
\d+ # match one or more digits
| # or
[.,?!:;] # match a punctuation mark
/x # free-spacing regex definition mode
"yo mr. menon how are you? call at 9 a.m. \"okay\"".scan(r)
#=> ["yo", "mr", ".", "menon", "how", "are", "you", "?", "call", "at", "9",
# "a", ".", "m", ".", "\"okay\""]
puts "\"okay\""
# "okay"
The regular expression is conventionally written
/(?:\"?\p{alpha}+\"?)|\d+|[.,?!:;]/
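For comparison, here is a sketch that runs both the pattern from the question and the one-line pattern above on the first example string. The last alternative of the question's pattern (any character followed by a space) cannot match a "?" at the very end of the string, which seems to be why that token goes missing. The variable names are hypothetical.

```ruby
original = /[A-Za-z]+|\d+|. /
revised  = /(?:\"?\p{alpha}+\"?)|\d+|[.,?!:;]/

s = "Ready for your my next session?"

# The final "?" has no space after it, so the last alternative never matches.
p s.scan(original)
# => ["Ready", "for", "your", "my", "next", "session"]

# Matching punctuation marks directly does not depend on what follows them.
p s.scan(revised)
# => ["Ready", "for", "your", "my", "next", "session", "?"]
```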
I want to split a line by a space, or a "." separating a number in front of it and a non-number behind it. I want to split like:
"10.ABC DEF GHI" # => ["10", "ABC", "DEF", "GHI"]
"10.00 DEF GHI" #=> ["10.00", "DEF", "GHI"]
I have
words = line.strip.split(/(?<=\d)\.|[[:space:]]+/)
But I discovered this doesn't quite do what I want. Although it will split the line:
line = "10.ABC DEF GHI"
words = line.strip.split(/(?<=\d)\.|[[:space:]]+/) # => ["10", "ABC", "DEF", "GHI"]
It will also incorrectly split
line = "10.00 DEF GHI"
line.strip.split(/(?<=\d)\.|[[:space:]]+/) # => ["10", "00", "DEF", "GHI"]
How do I correct my regular expression to only split on the dot if there are non-numbers following the "."?
Add a negative lookahead (?!\d) after \.:
/(?<=\d)\.(?!\d)|[[:space:]]+/
          ^^^^^^
The match fails if the . is followed by a digit.
See the Rubular demo.
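As a quick check, here is the corrected pattern applied to the two lines from the question. The variable name is hypothetical.

```ruby
splitter = /(?<=\d)\.(?!\d)|[[:space:]]+/

# The dot is followed by a letter, so it is a split point.
p "10.ABC DEF GHI".strip.split(splitter)  # => ["10", "ABC", "DEF", "GHI"]

# The dot is followed by a digit, so the number stays in one piece.
p "10.00 DEF GHI".strip.split(splitter)   # => ["10.00", "DEF", "GHI"]
```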
How can I separate different character sets in my string? For example, if I had these charsets:
[a-z]
[A-Z]
[0-9]
[\s]
{everything else}
And this input:
thisISaTEST***1234pie
Then I want to separate the different character sets, for example, if I used a newline as the separating character:
this
IS
a
TEST
***
1234
pie
I've tried this regex, with a positive lookahead:
'thisISaTEST***1234pie'.gsub(/(?=[a-z]+|[A-Z]+|[0-9]+|[\s]+)/, "\n")
But apparently the +s aren't being greedy, because I'm getting:
t
h
# (snip)...
S
T***
1
# (snip)...
e
I snipped out the irrelevant parts, but as you can see each character is counting as its own charset, except the {everything else} charset.
How can I do this? It does not necessarily have to be by regex. Splitting them into an array would work too.
The difficult part is to match whatever does not match the rest of the regex. Forget about that, and think of a way to mix the non-matching parts together with the matching parts.
"thisISaTEST***1234pie"
.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
In the ASCII character set, apart from alphanumerics and space, there are thirty-two "punctuation" characters, which are matched with the property construct \p{punct}.
To split your string into sequences of a single category, you can write
str = 'thisISaTEST***1234pie'
p str.scan(/\G(?:[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+)/)
output
["this", "IS", "a", "TEST", "***", "1234", "pie"]
Alternatively, if your string contains characters outside the ASCII set, you could write the whole thing in terms of properties, like this
p str.scan(/\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/)
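The \G in these patterns anchors each match to the position where the previous one ended, so scan stops at the first character that none of the alternatives covers instead of skipping over it. The sketch below assumes an input with a non-ASCII letter to show the difference with the ASCII-only pattern; the variable names are hypothetical.

```ruby
anchored   = /\G(?:[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+)/
unanchored = /[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+/

str = "abc\u00e9def"   # "abcédef"; é is not covered by any ASCII alternative

# With \G the scan ends where the first uncovered character appears.
p str.scan(anchored)    # => ["abc"]

# Without \G the uncovered character is silently skipped.
p str.scan(unanchored)  # => ["abc", "def"]
```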
Here are two solutions.
String#scan with a regular expression
str = "thisISa\n TEST*$*1234pie"
r = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/
str.scan r
#=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]
Because of the ^ at the beginning of [^a-zA-Z\d\s], that character class matches any character other than letters (lower and upper case), digits and whitespace.
Use Enumerable#slice_when [1]
First, a helper method:
def type(c)
case c
when /[a-z]/ then 0
when /[A-Z]/ then 1
when /\d/ then 2
when /\s/ then 3
else 4
end
end
For example,
type "f" #=> 0
type "P" #=> 1
type "3" #=> 2
type "\n" #=> 3
type "*" #=> 4
Then
str.each_char.slice_when { |c1,c2| type(c1) != type(c2) }.map(&:join)
#=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]
[1] slice_when made its debut in Ruby v2.2.
Non-word, non-space chars can be covered with [^\w\s], so:
"thisISaTEST***1234pie".scan /[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]
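One caveat worth checking: \w includes the underscore, so [^\w\s] does not match it, and none of the other alternatives does either. A short sketch, assuming an input with an underscore (not part of the question):

```ruby
str = 'snake_case'

# The underscore matches no alternative, so scan drops it.
p str.scan(/[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/)        # => ["snake", "case"]

# Spelling out the negated class keeps it as an "everything else" token.
p str.scan(/[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/)  # => ["snake", "_", "case"]
```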
I have the string "111221" and want to match all sets of consecutive equal integers: ["111", "22", "1"].
I know that there is a special regex thingy to do that but I can't remember and I'm terrible at Googling.
Using regex in Ruby 1.8.7+:
p s.scan(/((\d)\2*)/).map(&:first)
#=> ["111", "22", "1"]
This works because (\d) captures any digit, and then \2* captures zero-or-more of whatever that group (the second opening parenthesis) matched. The outer (…) is needed to capture the entire match as a result in scan. Finally, scan alone returns:
[["111", "1"], ["22", "2"], ["1", "1"]]
…so we need to run through and keep just the first item in each array. In Ruby 1.8.6+ (which doesn't have Symbol#to_proc for convenience):
p s.scan(/((\d)\2*)/).map{ |x| x.first }
#=> ["111", "22", "1"]
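To see why the outer group is needed, it may help to compare what scan returns with and without it. When a pattern contains groups, scan returns the captured groups rather than the whole match. A small sketch, assuming s is the string from the question:

```ruby
s = "111221"

# Only one group: each result holds just the captured digit.
p s.scan(/(\d)\1*/)                 # => [["1"], ["2"], ["1"]]

# Outer group added: the first element of each result is the whole run.
p s.scan(/((\d)\2*)/)               # => [["111", "1"], ["22", "2"], ["1", "1"]]
p s.scan(/((\d)\2*)/).map(&:first)  # => ["111", "22", "1"]
```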
With no Regex, here's a fun one (matching any char) that works in Ruby 1.9.2:
p s.chars.chunk{|c|c}.map{ |n,a| a.join }
#=> ["111", "22", "1"]
Here's another version that should work even in Ruby 1.8.6:
p s.scan(/./).inject([]){|a,c| (a.last && a.last[0]==c[0] ? a.last : a)<<c; a }
# => ["111", "22", "1"]
"111221".gsub(/(.)(\1)*/).to_a
#=> ["111", "22", "1"]
This uses the form of String#gsub that takes neither a replacement argument nor a block and therefore returns an enumerator. It appears gsub has offered that form since at least v1.9.
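Because the result is an enumerator over the matched substrings, it can be chained with other Enumerable methods. As a sketch, here is a run-length summary built on the same idea; the variable name and the run-length step are additions for illustration.

```ruby
runs = "111221".gsub(/(.)\1*/)   # no replacement and no block: an Enumerator

p runs.class  # => Enumerator
p runs.to_a   # => ["111", "22", "1"]

# Each element is a whole run, so its first character and length describe it.
p runs.map { |run| [run[0], run.length] }
# => [["1", 3], ["2", 2], ["1", 1]]
```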
I found that this works. The regex first captures a single character in one group, and then captures any run of the same character that follows it in a second group. scan therefore returns an array of two-element arrays: the first element of each is the initial character, and the second is any additional repeats of it (possibly an empty string). Joining each pair gives an array of the repeated-character runs:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
repeated_chars = input.scan(/(.)(\1*)/)
# => [["W", "W"], ["B", ""], ["W", "WWW"], ["B", "BB"], ["W", "WWWWWW"], ["B", ""], ["3", "333"], ["!", "!!!"]]
repeated_chars.map(&:join)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]
As an alternative, I found that I could create a new Regexp object that matches one or more occurrences of each unique character in the input string, as follows:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
regexp = Regexp.new("#{input.chars.uniq.join("+|")}+")
#=> regexp created for this example will look like: /W+|B+|3+|!+/
and then use that Regexp object as an argument for scan to split out all the repeated characters, as follows:
input.scan(regexp)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]
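If the input may contain characters that are special in a regex (such as . or +), each character probably needs to be escaped before it goes into the pattern. A sketch of that variation using Regexp.escape; the input string is an assumption chosen to include such characters.

```ruby
input = "aa..b++"

# Escape every unique character before appending the + quantifier.
pattern = input.chars.uniq.map { |c| "#{Regexp.escape(c)}+" }.join("|")
regexp  = Regexp.new(pattern)

p regexp              # => /a+|\.+|b+|\++/
p input.scan(regexp)  # => ["aa", "..", "b", "++"]
```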
You can try this (C#):
string str = "111221";
string pattern = @"(\d)\1*";
Hope this helps.