Split a string into an array based on runs of contiguous characters

I've been searching for an answer to this in Ruby for a little while now and haven't found a good solution. What I am trying to figure out is how to split a string wherever the next character differs from the previous one and collect the groupings into an array, i.e.
'aaaabbbbzzxxxhhnnppp'
becomes
['aaaa', 'bbbb', 'zz', 'xxx', 'hh', 'nn', 'ppp']
I know I could just iterate over each char in the string and check for a change, but I am curious whether there's anything built in that could tackle this in an elegant manner.

Doable with a simple regex:
'aaaabbbbzzxxxhhnnppp'.scan(/((.)\2*)/).map{|x| x[0]}
=> ["aaaa", "bbbb", "zz", "xxx", "hh", "nn", "ppp"]
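For comparison, here is a rough sketch of a regex-free variant built on Enumerable#slice_when (available since Ruby 2.2). The helper name runs is made up for illustration and is not part of the answer above.

```ruby
# Hypothetical helper: split a string into runs of identical characters.
def runs(str)
  # slice_when starts a new slice whenever the block returns true,
  # here whenever two neighbouring characters differ.
  str.each_char.slice_when { |a, b| a != b }.map(&:join)
end

p runs('aaaabbbbzzxxxhhnnppp')
# => ["aaaa", "bbbb", "zz", "xxx", "hh", "nn", "ppp"]
```

Unlike the regex, whose . does not match a newline by default, this version treats every character the same way.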

Related

Better way to tokenize string in Ruby?

I have a string of arbitrary characters. I would like to turn it into an array where each character is in a single array element, EXCEPT for successive word characters (\w+), which should end up together in one array element. Example:
ab.:u/87z
should become
['ab','.',':','u','/','87z']
My first approach went like this:
mystring.split(/\b/)
Of course this groups together non-word characters:
['ab','.:','u','/','87','z']
I can take them apart in a subsequent step, but I'm looking for a more elegant way. Next I tried these:
mystring.split(/(\w+|\W)/)
mystring.split(/(\b|\W)/)
Both return nearly the desired result, only that they also return array elements containing empty strings, so I have to write something like
mystring.split(/(\b|\W)/).reject(&:empty?)
Now my question: Is there a simpler way to do this?
UPDATE: I made a silly mistake when I explained my example. Of course '87' and 'z' should be together, i.e. '87z'. I fixed my example.

'ab.:u/87z'.scan(/\w+|./) #=>["ab", ".", ":", "u", "/", "87z"]
I'm not exactly sure what you want, because you said word characters (\w+) but split the 87 and the z. \w matches letters, digits and underscores, hence the "87z".
You could always do this to get what you showed there, though:
'ab.:u/87z'.scan(/[A-Za-z]+|\d+|./) #=>["ab", ".", ":", "u", "/", "87", "z"]

Don't use split, use the scan method:
> "ab.:u/87z".scan(/\w+|\W/)
=> ["ab", ".", ":", "u", "/", "87z"]
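The two scan patterns shown in these answers differ in one detail: \W matches a newline, while . does not unless the /m flag is set. A small sketch of the difference follows; the helper name tokenize is only illustrative.

```ruby
# Hypothetical helper: runs of word characters stay together,
# every other character becomes its own token.
def tokenize(str)
  str.scan(/\w+|\W/)
end

p tokenize("ab.:u/87z")   # => ["ab", ".", ":", "u", "/", "87z"]
p tokenize("a\nb")        # => ["a", "\n", "b"]
p "a\nb".scan(/\w+|./)    # => ["a", "b"]  (. skips the newline)
```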

ruby regex to match multiple occurrences of pattern

I am looking to build a ruby regex to match multiple occurrences of a pattern and return them in an array. The pattern is simply: [[.+]]. That is, two left brackets, one or more characters, followed by two right brackets.
This is what I have done:
str = "Some random text[[lead:first_name]] and more stuff [[client:last_name]]"
str.match(/\[\[(.+)\]\]/).captures
The regex above doesn't work because it returns this:
["lead:first_name]] and more stuff [[client:last_name"]
When what I wanted was this:
["lead:first_name", "client:last_name"]
I thought that using a non-capturing group would surely solve the issue:
str.match(/(?:\[\[(.+)\]\])+/).captures
But the non-capturing group returns exactly the same wrong output. Any idea how I can resolve my issue?

The problem with your regex is that the .+ part is "greedy": when the regex can match both a shorter and a longer part of the string, it takes the longer one.
In Ruby (and most regex syntaxes), you can follow the + quantifier with a ? to make it non-greedy. So your regex would become /(?:\[\[(.+?)\]\])+/.
However, you'll notice this still doesn't do what you want. A capture group inside a repeated group only keeps its last match, and match stops after the first overall match anyway. For your problem, you'll need to use scan:
"[[a]][[ab]][[abc]]".scan(/\[\[(.+?)\]\]/).flatten
=> ["a", "ab", "abc"]
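Applied to the string from the question, the scan approach can be sketched as below. The helper name placeholders is hypothetical, and the negated character class [^\]]+ is swapped in as an alternative to the lazy .+? from the answer; it assumes the placeholder text never contains a closing bracket.

```ruby
# Hypothetical helper: return the text inside every [[...]] pair.
def placeholders(str)
  # [^\]]+ cannot run past a closing bracket, so no lazy quantifier is needed.
  str.scan(/\[\[([^\]]+)\]\]/).flatten
end

str = "Some random text[[lead:first_name]] and more stuff [[client:last_name]]"
p placeholders(str)                  # => ["lead:first_name", "client:last_name"]
p str.scan(/\[\[(.+?)\]\]/).flatten  # => same result with the lazy quantifier
```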

Try this:
> str.match(/\[\[(.*)\]\].*\[\[(.*)\]\]/).captures
=> ["lead:first_name", "client:last_name"]
With many occurrences:
> str
=> "Some [[lead:first_name]] random text[[lead:first_name]] and more [[lead:first_name]] stuff [[client:last_name]]"
> str.scan(/\[(\w+:\w+)\]/)
=> [["lead:first_name"], ["lead:first_name"], ["lead:first_name"], ["client:last_name"]]

ruby - %w(%q(words in a row)) returns wrong elements

How can I get this expression to return an array of words?
%w(%q(words in a row))
I thought that %q would give me a string and then %w would give me an array of the words.
But instead I get
["%q(words", "in", "a", "row")"]
This is part of some larger code so just using %w on its own will not help.
I want to be able to interpolate the %q expression first.

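Lower-case %w and %q do not interpolate. You need upper-case %W or %Q, and you need to wrap the expression in #{} to interpolate it.
A version of your code that interpolates:
%W{#{%q{words in a row}}}
Note that %W does not split the result of an interpolation, so this returns ["words in a row"] as a single element. To get the individual words, call split on the string instead: %q{words in a row}.split
However, just as Justin said, I do not understand the point of this. You can put the raw words directly in %w{} without quotes.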

Maybe I am misunderstanding the problem, but if you want to interpolate the string and then convert it to a word array, I believe %W does this; note the capitalization.
%W(words in a row)
#=> ["words", "in", "a", "row"]

I doubt the usefulness of doing such a thing, and there is definitely a smell in your code, but here is a way:
eval("%w(#{%q(words in a row)})")
# => ["words", "in", "a", "row"]
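If the string only becomes available at run time, a plain String#split avoids both eval and the percent literals. This is a sketch under that assumption; the helper name words is made up for illustration.

```ruby
# Hypothetical helper: turn a string into an array of its words.
def words(str)
  str.split   # with no argument, splits on runs of whitespace
end

p words(%q(words in a row))        # => ["words", "in", "a", "row"]

# %W interpolates, but each interpolation stays a single element:
p %W(#{%q(words in a row)} again)  # => ["words in a row", "again"]
```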

Ruby Regex, Only One Capture (Very Simple!)

I guess this will be a silly mistake but for me, the following returns an array containing only "M". See this:
/(.)+?/.match("Many many characters!").captures
=> ["M"]
Why doesn't it return an array of every character? I must have missed something blatantly obvious, because I can't see what's wrong with this.
Edit: Just realised, I don't need the +? but it still doesn't work without it.
Edit: Apologies! I will clarify: my goal is to let users enter a regular expression, some styling and an input text file; wherever there is a match, the text will be wrapped in an HTML element and the styling applied. I am not just splitting the string into characters; I only used the given regex because it was the simplest, although that was stupid on my part. How do I get capture groups from scan(), or is that not possible? I see that $1 contains "!" (the last match?) and not any of the others.
Edit: Gosh, it really isn't my day. As injekt has informed me, the captures are stored in separate arrays. How do I get the offset of these captures in the original string? I would like to be able to get the offset of a capture and then surround it with another string. Or is that what gsub is for? (I thought that only replaced the match, not a capture group.)
Hopefully final edit: Right, let me just start this again :P
So, I have a string. The user will use a configuration file to enter a regular expression, then a style associated with each capture group. I need to be able to scan the entire string and get the start and finish or offset and size of each group match.
So if a user had configured ([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4}) (email address) then I should be able to get:
[ ["elliotpotts", 0, 11],
["sample.", 12, 7],
["com", 19, 3] ]
from the string:
"elliotpotts#sample.com"
If that is not clear, there is simply something wrong with me :P. Thanks a lot so far guys, and thank you for being so patient!

Because your capture is only matching one single character. (.)+ is not the same as (.+)
>> /(.)+?/.match("Many many characters!").captures
=> ["M"]
>> /(.+)?/.match("Many many characters!").captures
=> ["Many many characters!"]
>> /(.+?)/.match("Many many characters!").captures
=> ["M"]
If you want to match every character individually, use String#scan, or String#split if you don't care about capture groups.
Using scan:
"Many many characters!".scan(/./)
#=> ["M", "a", "n", "y", " ", "m", "a", "n", "y", " ", "c", "h", "a", "r", "a", "c", "t", "e", "r", "s", "!"]
Note that the other answers use (.). That's fine if you care about the capture group, but it's a little pointless if you don't, because it returns EVERY CHARACTER in its own separate Array, like this:
[["M"], ["a"], ["n"], ["y"], [" "], ["m"], ["a"], ["n"], ["y"], [" "], ["c"], ["h"], ["a"], ["r"], ["a"], ["c"], ["t"], ["e"], ["r"], ["s"], ["!"]]
Otherwise, just use split: "Many many characters!".split('')
EDIT In reply to your edit:
reg = /([\w-\.]+)#((?:[\w]+\.)+)([a-zA-Z]{2,4})/
str = "elliotpotts#sample.com"
str.scan(reg).flatten.map { |capture| [capture, str.index(capture), capture.size] }
#=> [["elliotpotts", 0, 11], ["sample.", 12, 7], ["com", 19, 3]]
Oh, and you don't need scan here. You're not really scanning, so you don't need to traverse the string, at least not with the example you provided:
str.match(reg).captures.map { |capture| [capture, str.index(capture), capture.size] }
will also work.
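One caveat with str.index(capture): it returns the first place the captured text occurs anywhere in the string, which is not necessarily where the group matched. A sketch that asks the MatchData for the positions instead is shown below. The helper name group_spans is hypothetical, the character class is written as [\w\-.] (the same set as in the question, with the hyphen escaped), and the code assumes every group takes part in the match.

```ruby
# Hypothetical helper: [text, start offset, length] for each capture group
# of the first match, or [] when there is no match.
def group_spans(str, regex)
  m = regex.match(str)
  return [] unless m
  # m.size counts the whole match plus the groups, so groups are 1...m.size.
  (1...m.size).map { |i| [m[i], m.begin(i), m[i].size] }
end

reg = /([\w\-.]+)#((?:\w+\.)+)([a-zA-Z]{2,4})/
p group_spans("elliotpotts#sample.com", reg)
# => [["elliotpotts", 0, 11], ["sample.", 12, 7], ["com", 19, 3]]
p group_spans("com#com.com", reg)
# => [["com", 0, 3], ["com.", 4, 4], ["com", 8, 3]]
```

In the second call, str.index("com") would report 0 for the last group as well, whereas MatchData#begin reports 8.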

Yes, something important was missed ;-)
(...) only introduces ONE capture group: the number of times the group matches is irrelevant, as the index is determined only by the regular expression itself and not by the input.
The key is a "global regular expression", which applies the regular expression repeatedly across the input. In Ruby this is done by switching from Regexp#match to String#scan (many other languages have a "/g" regular expression modifier):
"Many many characters!".scan(/(.)+?/)
# but more simply (or see answers using String#split)
"Many many characters!".scan(/(.)/)
Happy coding

It's only returning one character because that's all you've asked it to match. You probably want to use scan instead:
str = "Many many characters!"
matches = str.scan(/(.)/)

The following code is adapted from "Get index of string scan results in ruby":
[].tap {|results|
  "abab".scan(/a/) {|capture|
    results.push([capture, Regexp.last_match.offset(0)].flatten)
  }
}
=> [["a", 0, 1], ["a", 2, 3]]
offset(0) returns both the start and the end offset of each match.
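If only the start position of each match is wanted, MatchData#begin can be used in place of offset. A minimal sketch, with the hypothetical helper name scan_with_starts:

```ruby
# Hypothetical helper: [matched text, start offset] for every match.
def scan_with_starts(str, regex)
  results = []
  str.scan(regex) do
    m = Regexp.last_match          # MatchData for the current match
    results << [m[0], m.begin(0)]  # begin(0) is the start of the whole match
  end
  results
end

p scan_with_starts("abab", /a/)  # => [["a", 0], ["a", 2]]
```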

Split a string with multiple delimiters in Ruby

Take for instance, I have a string like this:
options = "Cake or pie, ice cream, or pudding"
I want to be able to split the string on three delimiters: or, a comma, and a comma followed by or (", or").
I have been able to do it, but only by handling the comma and ", or" cases first, then splitting each array item at or and flattening the resulting array, like this:
options = options.split(/(?:\s?or\s)*([^,]+)(?:,\s*)*/).reject(&:empty?);
options.each_index {|index| options[index] = options[index].sub("?","").split(" or "); }
The resultant array is as such: ["Cake", "pie", "ice cream", "pudding"]
Is there a more efficient (or easier) way to split my string on those three delimiters?

What about the following:
options.gsub(/ or /i, ",").split(",").map(&:strip).reject(&:empty?)
This:
replaces every " or " with a comma, so the comma is the only delimiter left
splits at each comma
strips each item, since entries such as ice cream may be left with a leading space
removes all blank strings

First of all, your method could be simplified a bit with Array#flatten:
>> options.split(',').map{|x|x.split 'or'}.flatten.map(&:strip).reject(&:empty?)
=> ["Cake", "pie", "ice cream", "pudding"]
I would prefer using a single regex:
>> options.split /\s*, or\s+|\s*,\s*|\s+or\s+/
=> ["Cake", "pie", "ice cream", "pudding"]
You can use | in a regex to give alternatives, and putting , or first guarantees that it won’t produce an empty item. Consuming the surrounding whitespace in the regex is probably best for efficiency, since you don’t have to pass over the array again.
As Zabba points out, you may still want to reject empty items, prompting this solution:
>> options.split(/,|\sor\s/).map(&:strip).reject(&:empty?)
=> ["Cake", "pie", "ice cream", "pudding"]

Since "or" and "," do the same thing, the best approach is to tell the regex that several delimiters in a row should be treated the same as a single one:
options = "Cake or pie, ice cream, or pudding"
regex = /(?:\s*(?:,|or)\s*)+/
options.split(regex)
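One thing to watch with this pattern is that a bare or also matches inside longer words (orange, for example). A hedged variant that adds word boundaries is sketched below; the helper name split_options and the second sample string are made up for illustration.

```ruby
# Hypothetical helper: split on commas and on the standalone word "or".
def split_options(str)
  # \b keeps "or" from matching inside words such as "orange";
  # the outer + merges adjacent delimiters like ", or " into one.
  str.split(/(?:\s*(?:,|\bor\b)\s*)+/)
end

p split_options("Cake or pie, ice cream, or pudding")
# => ["Cake", "pie", "ice cream", "pudding"]
p split_options("orange or corn, pork")
# => ["orange", "corn", "pork"]
```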
