Match sequences of consecutive characters in a string - ruby

I have the string "111221" and want to match all sets of consecutive equal integers: ["111", "22", "1"].
I know that there is a special regex thingy to do that but I can't remember and I'm terrible at Googling.

Using regex in Ruby 1.8.7+:
p s.scan(/((\d)\2*)/).map(&:first)
#=> ["111", "22", "1"]
This works because (\d) captures any digit, and then \2* captures zero-or-more of whatever that group (the second opening parenthesis) matched. The outer (…) is needed to capture the entire match as a result in scan. Finally, scan alone returns:
[["111", "1"], ["22", "2"], ["1", "1"]]
…so we need to run through and keep just the first item in each array. In Ruby 1.8.6+ (which doesn't have Symbol#to_proc for convenience):
p s.scan(/((\d)\2*)/).map{ |x| x.first }
#=> ["111", "22", "1"]
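To see why the outer group matters, here is a small sketch that compares the two patterns side by side. The method name digit_runs is made up for illustration; the regex is the one from the answer above.

```ruby
# Hypothetical helper wrapping the scan + map(&:first) idiom shown above.
def digit_runs(s)
  s.scan(/((\d)\2*)/).map(&:first)
end

s = "111221"

# Only the inner group: scan reports just the captured digit of each run.
p s.scan(/(\d)\1*/)
#=> [["1"], ["2"], ["1"]]

# Outer group added: each row is [whole_run, digit].
p s.scan(/((\d)\2*)/)
#=> [["111", "1"], ["22", "2"], ["1", "1"]]

p digit_runs(s)
#=> ["111", "22", "1"]
```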
With no Regex, here's a fun one (matching any char) that works in Ruby 1.9.2:
p s.chars.chunk{|c|c}.map{ |n,a| a.join }
#=> ["111", "22", "1"]
Here's another version that should work even in Ruby 1.8.6:
p s.scan(/./).inject([]){|a,c| (a.last && a.last[0]==c[0] ? a.last : a)<<c; a }
# => ["111", "22", "1"]

"111221".gsub(/(.)(\1)*/).to_a
#=> ["111", "22", "1"]
This uses the form of String#gsub that does not have a block and therefore returns an enumerator. It appears gsub was bestowed with that option in v2.0.
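Since the enumerator yields each whole match as a string, it can be chained with other Enumerable methods. A minimal sketch, assuming you want run lengths rather than the runs themselves (run_lengths is a made-up name):

```ruby
# Hypothetical helper: pairs each run's character with the run's length.
# Note that . does not match a newline unless the /m flag is added.
def run_lengths(str)
  str.gsub(/(.)\1*/).map { |run| [run[0], run.size] }
end

p run_lengths("111221")
#=> [["1", 3], ["2", 2], ["1", 1]]
```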

I found that this works. The regex first matches a single character in one group, then matches any further copies of that character in a second group. The result is an array of two-element arrays: the first element of each is the initial character, and the second is any additional repeats of it (possibly an empty string). Joining each pair gives the array of runs:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
repeated_chars = input.scan(/(.)(\1*)/)
# => [["W", "W"], ["B", ""], ["W", "WWW"], ["B", "BB"], ["W", "WWWWWW"], ["B", ""], ["3", "333"], ["!", "!!!"]]
repeated_chars.map(&:join)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]
As an alternative, I found that I could create a new Regexp object to match one or more occurrences of each unique character in the input string, as follows:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
regexp = Regexp.new("#{input.chars.uniq.join("+|")}+")
#=> regexp created for this example will look like: /W+|B+|3+|!+/
and then use that Regex object as an argument for scan to split out all the repeated characters, as follows:
input.scan(regexp)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]
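One caveat with building the pattern from the input: a character such as . or * is a regex metacharacter, so interpolating it unescaped changes what the pattern matches or stops it from compiling. A hedged variant that passes each character through Regexp.escape first (run_regexp is a made-up name):

```ruby
# Hypothetical helper: builds an alternation like /a+|\.+|b+|\*+/ from the
# unique characters of the input, escaping each one first.
def run_regexp(input)
  alternatives = input.chars.uniq.map { |c| Regexp.escape(c) + "+" }
  Regexp.new(alternatives.join("|"))
end

input = "a..b**a"
p input.scan(run_regexp(input))
#=> ["a", "..", "b", "**", "a"]
```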

You can try this (C# syntax):
string str = "111221";
string pattern = @"(\d)\1*";
Hope this helps.

Related

Ruby split binary string where the previous character is different from the next one

I wonder how I could split a binary string in Ruby.
I want to split the string where the previous character is different from the next one.
For example, if I have the string
@s = "aaaabbabbaa"
I would like to create an array of strings
@array[0] = "aaaa"
@array[1] = "bb"
@array[2] = "a"
@array[3] = "bb"
@array[4] = "aa"
How could I do this?
Enumerable#chunk does that, but it's defined on Enumerable, and String does not include Enumerable. Transform the string into an Array of chars (and glue them back into strings), like:
s = "aaaabbabbaa"
p array = s.chars.chunk(&:itself).map{|a| a.last.join} #=>["aaaa", "bb", "a", "bb", "aa"]
You could use a regular expression with scan:
@array = @s.scan(/((.)\2*)/).map(&:first)
#=> ["aaaa", "bb", "a", "bb", "aa"]
str = "aaaabbabbaa"
r = /
(?<=(.)) # match any character in capture group 1, in positive lookbehind
(?!\1) # do not match capture group 1, negative lookahead
/x # free-spacing regex definition mode
str.split(r)
#=> ["aaaa", "a", "bb", "b", "a", "a", "bb", "b", "aa", "a"]
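Both lookarounds are zero-width, so no characters are lost when splitting on this regular expression. Because the regex contains a capture group, split also returns the captured character after each piece, so the runs themselves are the elements at even indices (for example, via each_slice(2).map(&:first)).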
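The split output above interleaves each run with the character captured by group 1. A short sketch of one way to keep only the runs, using the same regex (split_runs is a made-up name):

```ruby
# Hypothetical helper: split on the zero-width boundary, then drop the
# captured characters that split inserts after every piece.
def split_runs(str)
  boundary = /(?<=(.))(?!\1)/
  str.split(boundary).each_slice(2).map(&:first)
end

p split_runs("aaaabbabbaa")
#=> ["aaaa", "bb", "a", "bb", "aa"]
```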
Using Enumerable#chunk_while:
str = "aaaabbabbaa"
p str.chars.chunk_while(&:==).map(&:join)
#=> ["aaaa", "bb", "a", "bb", "aa"]
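For comparison, here is a sketch that puts the three Enumerable-based approaches from this page next to each other. The method names are made up; all three should give the same runs for this input.

```ruby
# chunk groups consecutive elements by the block's return value.
def runs_by_chunk(str)
  str.chars.chunk(&:itself).map { |_, run| run.join }
end

# slice_when starts a new slice where the block returns true.
def runs_by_slice_when(str)
  str.chars.slice_when { |a, b| a != b }.map(&:join)
end

# chunk_while keeps extending a chunk while the block returns true.
def runs_by_chunk_while(str)
  str.chars.chunk_while { |a, b| a == b }.map(&:join)
end

str = "aaaabbabbaa"
p runs_by_chunk(str)
p runs_by_slice_when(str)
p runs_by_chunk_while(str)
# each prints ["aaaa", "bb", "a", "bb", "aa"]
```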

Ruby string char chunking

I have a string "wwwggfffw" and want to break it up into an array as follows:
["www", "gg", "fff", "w"]
Is there a way to do this with regex?
"wwwggfffw".scan(/((.)\2*)/).map(&:first)
scan is a little funny, as it will return either the match or the subgroups depending on whether there are subgroups; we need to use subgroups to ensure repetition of the same character ((.)\1), but we'd prefer it if it returned the whole match and not just the repeated letter. So we need to make the whole match into a subgroup so it will be captured, and in the end we need to extract just the match (without the other subgroup), which we do with .map(&:first).
EDIT to explain the regexp ((.)\2*) itself:
( start group #1, consisting of
( start group #2, consisting of
. any one character
) and nothing else
\2 followed by the content of the group #2
* repeated any number of times (including zero)
) and nothing else.
So in wwwggfffw, (.) captures w into group #2; then \2* captures any additional number of w. This makes group #1 capture www.
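The same breakdown can be written straight into the pattern with the /x (free-spacing) flag and checked with MatchData. This is just a sketch; RUN_PATTERN is a made-up constant name.

```ruby
# Same regex as ((.)\2*), spread out with comments via the /x flag.
RUN_PATTERN = /
  (      # group 1: the whole run
    (.)  # group 2: any one character
    \2*  # zero or more further copies of group 2
  )
/x

m = "wwwggfffw".match(RUN_PATTERN)
p m[1]  #=> "www"
p m[2]  #=> "w"

p "wwwggfffw".scan(RUN_PATTERN).map(&:first)
#=> ["www", "gg", "fff", "w"]
```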
You can use back references, something like
'wwwggfffw'.scan(/((.)\2*)/).map{ |s| s[0] }
will work
Here's one that's not using regex but works well:
def chunk(str)
  chars = str.chars
  chars.inject([chars.shift]) do |arr, char|
    if arr[-1].include?(char)
      arr[-1] << char
    else
      arr << char
    end
    arr
  end
end
In my benchmarks it's faster than the regex answers here (with the example string you gave, at least).
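One edge case worth knowing: with an empty string, chars.shift is nil, so the method above returns [nil]. Here is a hedged variant of the same idea that returns an empty array instead (chunk_runs is a made-up name, and I have not benchmarked it against the original):

```ruby
# Hypothetical variant: walk the characters and either extend the last run
# or start a new one. An empty string yields an empty array.
def chunk_runs(str)
  str.each_char.with_object([]) do |char, runs|
    if runs.last && runs.last[-1] == char
      runs.last << char   # same character as the current run: extend it
    else
      runs << char.dup    # different character: start a new run
    end
  end
end

p chunk_runs("wwwggfffw")
#=> ["www", "gg", "fff", "w"]
p chunk_runs("")
#=> []
```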
Another non-regex solution, this one using Enumerable#slice_when, which made its debut in Ruby v.2.2:
str.each_char.slice_when { |a,b| a!=b }.map(&:join)
#=> ["www", "gg", "fff", "w"]
Another option is:
str.scan(Regexp.new(str.squeeze.each_char.map { |c| "(#{c}+)" }.join)).first
#=> ["www", "gg", "fff", "w"]
Here the steps are as follows
s = str.squeeze
#=> "wgfw"
a = s.each_char
#=> #<Enumerator: "wgfw":each_char>
This enumerator generates the following elements:
a.to_a
#=> ["w", "g", "f", "w"]
Continuing
b = a.map { |c| "(#{c}+)" }
#=> ["(w+)", "(g+)", "(f+)", "(w+)"]
c = b.join
#=> "(w+)(g+)(f+)(w+)"
r = Regexp.new(c)
#=> /(w+)(g+)(f+)(w+)/
d = str.scan(r)
#=> [["www", "gg", "fff", "w"]]
d.first
#=> ["www", "gg", "fff", "w"]
Here's one more way of doing it without a regex:
'wwwggfffw'.chars.chunk(&:itself).map{ |s| s[1].join }
# => ["www", "gg", "fff", "w"]

Ignoring capture group in Regex that is used for repeating the pattern

/((\w)\2)/ finds repeating letters. I was hoping to avoid the two dimensional array that is produced by ignoring the letter matching second capture group like this: /((?:\w)\2)/. It seems that's not possible. Any ideas why?
Rubular example
You don't need any capture groups:
str = [*'a+'..'z+', *'A+'..'Z+', *'0+'..'9+', '_+'].join('|')
#=> "a+|b+| ... |z+|A+|B+| ... |Z+|0+|1+| ... |9+|_+"
"aaabbcddd".scan(/#{str}/)
#=> ["aaa", "bb", "c", "ddd"]
but if you insist on having one:
"aaabbcddd".scan(/(#{str})/).flatten(1)
#=> ["aaa", "bb", "c", "ddd"]
Is this cheating? You did ask if it was possible.
If you mean you're using String#scan, you can post-process the result with Enumerable#map to return only the first item of each match:
'helloo'.scan(/((\w)\2)/)
# => [["ll", "l"], ["oo", "o"]]
'helloo'.scan(/((\w)\2)/).map { |m| m[0] }
# => ["ll", "oo"]
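A backreference such as \2 needs a capture group to refer to, which is why the group can't simply be made non-capturing. If the goal is only to avoid the nested arrays, one option is to skip scan's return value and collect the whole match instead. A small sketch (doubled_letters is a made-up name):

```ruby
# Hypothetical helper: gsub without a block or replacement returns an
# enumerator of whole matches, so the capture group never shows up.
def doubled_letters(str)
  str.gsub(/(\w)\1/).to_a
end

p doubled_letters('helloo')
#=> ["ll", "oo"]

# Same idea with scan's block form and the last match.
found = []
'helloo'.scan(/(\w)\1/) { found << Regexp.last_match(0) }
p found
#=> ["ll", "oo"]
```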

Format data in string to array?

I need to convert data from a string to an array. The string looks like this:
{a,b,c{1,2,3},d,e,f{11,22,33},g}
The array that I want to receive should look like this:
[a, b, c1, c2, c3, d, e, f11, f22, f33, g]
I tried to use the split method but it works poorly.
arr = str.split(' ');
keys = arr[0][2..-2]
keys = keys.split(',')
Do you have any ideas how it could be implemented?
Here's what I'd use:
string = '{a,b,c{1,2,3},d,e,f{11,22,33},g}'
array = string.scan(/[a-z](?:{.+?})?/).flat_map{ |s|
  if s['{']
    prefix = s[0]
    values = s.scan(/\d+/)
    ([prefix] * values.size).zip(values).map(&:join)
  else
    s
  end
}
array # => ["a", "b", "c1", "c2", "c3", "d", "e", "f11", "f22", "f33", "g"]
Here's how it works:
string.scan(/[a-z](?:{.+?})?/) # => ["a", "b", "c{1,2,3}", "d", "e", "f{11,22,33}", "g"]
returns the string broken into chunks, looking for a single letter followed by an optional string of { with some text then }.
values = s.scan(/\d+/) # => ["1", "2", "3"], ["11", "22", "33"]
As it's running in flat_map, if { is found, the numbers are scanned out.
([prefix] * values.size).zip(values).map(&:join) # => ["c1", "c2", "c3"], ["f11", "f22", "f33"]
And then an array of the prefix, with the same number of elements as there are values is created and zipped together, resulting in:
[["c", "1"], ["c", "2"], ["c", "3"]], [["f", "11"], ["f", "22"], ["f", "33"]]
The join glues those sub-arrays together. And flat_map flattens any subarrays created so the resulting output is a single array.
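The same steps can be condensed a little by capturing the prefix and the brace contents separately, so no second scan or zip is needed. This is only a sketch of an alternative, not the answer's own code; expand_groups is a made-up name.

```ruby
# Hypothetical variant: group 1 is the letter, group 2 is the text inside
# the braces (nil when the letter has no braces after it).
def expand_groups(string)
  string.scan(/([a-z])(?:\{(.+?)\})?/).flat_map do |prefix, values|
    values ? values.split(',').map { |v| prefix + v } : [prefix]
  end
end

p expand_groups('{a,b,c{1,2,3},d,e,f{11,22,33},g}')
#=> ["a", "b", "c1", "c2", "c3", "d", "e", "f11", "f22", "f33", "g"]
```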
You need to arr = str.split(',') in the first step, because there is no whitespace between the values.
Also keep in mind you have {} to handle too.
This worked for me with simple regex and gsubing (though Tin Man's solution is better ruby):
def my_string_to_array(input_string)
  groups = input_string.scan(/\w+\{.*?\}/)
  groups.each do |group|
    modified = group.gsub(',', ",#{group.match(/\w+/)[0]}").delete("{}")
    input_string.gsub!(group, modified)
  end
  created_array = input_string.delete("{}").split(',')
end
string = '{a,b,c{1,2,3},d,e,f{11,22,33},g}'
my_string_to_array(string)
#=> ["a", "b", "c1", "c2", "c3", "d", "e", "f11", "f22", "f33", "g"]
The way it works is that it first finds the groups that have letters followed by braces and digits (like c{1,2,3}).
For each such group, it modifies it by gsubing ',' with ',<letter>' and removing the braces.
Next, it replaces these groups with the modified ones in the original string.
And finally it removes the starting and ending braces in the original string, and converts it into an array.

Partition/split a string by character set in Ruby

How can I separate different character sets in my string? For example, if I had these charsets:
[a-z]
[A-Z]
[0-9]
[\s]
{everything else}
And this input:
thisISaTEST***1234pie
Then I want to separate the different character sets, for example, if I used a newline as the separating character:
this
IS
a
TEST
***
1234
pie
I've tried this regex, with a positive lookahead:
'thisISaTEST***1234pie'.gsub(/(?=[a-z]+|[A-Z]+|[0-9]+|[\s]+)/, "\n")
But apparently the +s aren't being greedy, because I'm getting:
t
h
# (snip)...
S
T***
1
# (snip)...
e
I snipped out the irrelevant parts, but as you can see each character is counting as its own charset, except the {everything else} charset.
How can I do this? It does not necessarily have to be by regex. Splitting them into an array would work too.
The difficult part is to match whatever does not match the rest of the regex. Forget about that, and think of a way to mix the non-matching parts together with the matching parts.
"thisISaTEST***1234pie"
.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
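This works because split keeps the text matched by a capture group in its result, so the matched runs come back as elements and the unmatched leftovers (like ***) come back as the pieces between them. A sketch showing the raw split before the empty strings are rejected (partition_by_charset is a made-up name):

```ruby
# Hypothetical helper wrapping the split + reject idiom from the answer.
def partition_by_charset(str)
  str.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
end

# Raw split: adjacent matches leave empty strings between them.
p "thisISaTEST***1234pie".split(/([a-z]+|[A-Z]+|\d+|\s+)/)
#=> ["", "this", "", "IS", "", "a", "", "TEST", "***", "1234", "", "pie"]

p partition_by_charset("thisISaTEST***1234pie")
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]
```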
In the ASCII character set, apart from alphanumerics and space, there are thirty-two "punctuation" characters, which are matched with the property construct \p{punct}.
To split your string into sequences of a single category, you can write
str = 'thisISaTEST***1234pie'
p str.scan(/\G(?:[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+)/)
output
["this", "IS", "a", "TEST", "***", "1234", "pie"]
Alternatively, if your string contains characters outside the ASCII set, you could write the whole thing in terms of properties, like this
p str.scan(/\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/)
Here are two solutions.
String#scan with a regular expression
str = "thisISa\n TEST*$*1234pie"
r = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/
str.scan r
#=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]
Because of ^ at the beginning of [^a-zA-Z\d\s] that character class matches any character other than letters (lower and upper case), digits and whitespace.
Use Enumerable#slice_when [1]
First, a helper method:
def type(c)
  case c
  when /[a-z]/ then 0
  when /[A-Z]/ then 1
  when /\d/ then 2
  when /\s/ then 3
  else 4
  end
end
For example,
type "f" #=> 0
type "P" #=> 1
type "3" #=> 2
type "\n" #=> 3
type "*" #=> 4
Then
str.each_char.slice_when { |c1,c2| type(c1) != type(c2) }.map(&:join)
#=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]
1. slice_when made its debut in Ruby v2.2.
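The same classification idea also works with Enumerable#chunk, which groups consecutive characters by whatever the block returns. A sketch using symbols instead of numbers for the categories (char_class and split_by_class are made-up names):

```ruby
# Hypothetical helper: names the character set a single character falls in.
def char_class(c)
  case c
  when /[a-z]/ then :lower
  when /[A-Z]/ then :upper
  when /\d/    then :digit
  when /\s/    then :space
  else              :other
  end
end

# Group consecutive characters that share a class, then join each group.
def split_by_class(str)
  str.each_char.chunk { |c| char_class(c) }.map { |_, chars| chars.join }
end

p split_by_class('thisISaTEST***1234pie')
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]
```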
Non-word, non-space chars can be covered with [^\w\s], so:
"thisISaTEST***1234pie".scan /[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]
