Partition/split a string by character set in Ruby - ruby

How can I separate different character sets in my string? For example, if I had these charsets:
[a-z]
[A-Z]
[0-9]
[\s]
{everything else}
And this input:
thisISaTEST***1234pie
Then I want to separate the different character sets, for example, if I used a newline as the separating character:
this
IS
a
TEST
***
1234
pie
I've tried this regex, with a positive lookahead:
'thisISaTEST***1234pie'.gsub(/(?=[a-z]+|[A-Z]+|[0-9]+|[\s]+)/, "\n")
But apparently the +s aren't being greedy, because I'm getting:
t
h
# (snip)...
S
T***
1
# (snip)...
e
I snipped out the irrelevant parts, but as you can see each character is counting as its own charset, except the {everything else} charset.
How can I do this? It does not necessarily have to be by regex. Splitting them into an array would work too.

The difficult part is matching whatever does not match the rest of the regex. Set that aside, and instead find a way to keep the non-matching parts together with the matching parts.
"thisISaTEST***1234pie"
.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)
# => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
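As a rough illustration of why this works: String#split includes the text matched by a capture group in its result, so both the recognised runs and the leftover text between them survive. The variable names below are only for illustration.

```ruby
str = "thisISaTEST***1234pie"

# Without a capture group, the matched runs act as delimiters and are discarded;
# only the text between them survives (plus empty strings, which we reject).
without_group = str.split(/[a-z]+|[A-Z]+|\d+|\s+/).reject(&:empty?)

# With a capture group, split also returns whatever the group matched.
with_group = str.split(/([a-z]+|[A-Z]+|\d+|\s+)/).reject(&:empty?)

p without_group # => ["***"]
p with_group    # => ["this", "IS", "a", "TEST", "***", "1234", "pie"]
```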

In the ASCII character set, apart from alphanumerics and space, there are thirty-two "punctuation" characters, which are matched with the property construct \p{punct}.
To split your string into sequences of a single category, you can write
str = 'thisISaTEST***1234pie'
p str.scan(/\G(?:[a-z]+|[A-Z]+|\d+|\s+|[\p{punct}]+)/)
output
["this", "IS", "a", "TEST", "***", "1234", "pie"]
Alternatively, if your string contains characters outside the ASCII set, you could write the whole thing in terms of properties, like this
p str.scan(/\G(?:\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+)/)
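A small sketch of the property-based variant on non-ASCII input. The sample string is made up (written with \u escapes so the example does not depend on the source file encoding), every alternative carries a +, and the \G anchor is left out for brevity.

```ruby
# "h\u00E9llo" is "héllo" and "W\u00D6RLD" is "WÖRLD"; the sample text is hypothetical.
str = "h\u00E9llo  W\u00D6RLD 42!?"

pattern = /\p{lower}+|\p{upper}+|\p{digit}+|\p{space}+|[^\p{alnum}\p{space}]+/

runs = str.scan(pattern)
p runs
# => ["héllo", "  ", "WÖRLD", " ", "42", "!?"]
```

Note that letters that are neither upper nor lower case (for example CJK characters) belong to \p{alnum} but match none of the first three alternatives, so they would be skipped by this pattern.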

Here are two solutions.
String#scan with a regular expression
str = "thisISa\n TEST*$*1234pie"
r = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/
str.scan r
#=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]
Because of ^ at the beginning of [^a-zA-Z\d\s] that character class matches any character other than letters (lower and upper case), digits and whitespace.
Use Enumerable#slice_when (see note 1)
First, a helper method:
def type(c)
case c
when /[a-z]/ then 0
when /[A-Z]/ then 1
when /\d/ then 2
when /\s/ then 3
else 4
end
end
For example,
type "f" #=> 0
type "P" #=> 1
type "3" #=> 2
type "\n" #=> 3
type "*" #=> 4
Then
str.each_char.slice_when { |c1,c2| type(c1) != type(c2) }.map(&:join)
#=> ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]
1. slice_when made its debut in Ruby v2.2.
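For comparison, here is a sketch of the same grouping done with Enumerable#chunk, which groups consecutive elements by the block's return value. The helper name char_class and its symbol labels are hypothetical stand-ins for the type method above.

```ruby
# Hypothetical helper: returns a label for the character set that c belongs to.
def char_class(c)
  case c
  when /[a-z]/ then :lower
  when /[A-Z]/ then :upper
  when /\d/    then :digit
  when /\s/    then :space
  else              :other
  end
end

str = "thisISa\n TEST*$*1234pie"

# chunk yields [label, characters] pairs for each run of equal labels.
runs = str.each_char.chunk { |c| char_class(c) }.map { |_label, chars| chars.join }
p runs
# => ["this", "IS", "a", "\n ", "TEST", "*$*", "1234", "pie"]
```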

Non-word, non-space chars can be covered with [^\w\s], so:
"thisISaTEST***1234pie".scan /[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/
#=> ["this", "IS", "a", "TEST", "***", "1234", "pie"]
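One caveat worth checking before relying on [^\w\s]: \w includes the underscore, so an underscore matches none of the alternatives and is silently skipped. A short sketch with a made-up input string:

```ruby
with_w   = /[a-z]+|[A-Z]+|\d+|\s+|[^\w\s]+/
explicit = /[a-z]+|[A-Z]+|\d+|\s+|[^a-zA-Z\d\s]+/

str = "snake_case**42"

skipped = str.scan(with_w)    # the "_" is dropped
kept    = str.scan(explicit)  # the "_" is its own token

p skipped # => ["snake", "case", "**", "42"]
p kept    # => ["snake", "_", "case", "**", "42"]
```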

Related

How can I use regex in Ruby to split a string into an array of the words it contains?

I am trying to create a regex pattern that will split a string into an array of words based on many different patterns and conventions. The rules are as follows:
It must split the string on all dashes, spaces, underscores, and periods.
When multiple of the aforementioned characters show up together, it must only split once (so 'the--.quick' must split to ['the', 'quick'] and not ['the', '', '', 'quick'] )
It must split the string on new capital letters, while keeping that letter with its corresponding word ('theQuickBrown' splits to ['the', 'quick', 'brown'])
It must group multiple uppercase letters in a row together ('LETS_GO' must split to ['lets', 'go'], not ['l', 'e', 't', 's', 'g', 'o'])
It must use only lowercase letters in the split array.
If it is working properly, the following should be true
"theQuick--brown_fox JumpsOver___the.lazy DOG".split_words ==
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
So far, I have been able to get almost there, with the only issue being that it splits on every capital, so "DOG".split_words is ["d", "o", "g"] and not ["dog"]
I also use a combination of regex and maps/filters on the split array to get to the solution, bonus points if you can tell me how to get rid of that and use only regex.
Here's what I have so far:
class String
def split_words
split(/[_,\-, ,.]|(?=[A-Z]+)/).
map(&:downcase).
reject(&:empty?)
end
end
Which when called on the string from the test above returns:
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "d", "o", "g"]
How can I update this method to meet all of the above specs?
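You can slightly change the regex so that it doesn't split on every capital, but only before a sequence of letters that starts with a capital and continues in lowercase. This just involves putting [a-z]+ after the [A-Z]+ in the lookahead. Note that the result below still needs map(&:downcase):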
string = "theQuick--brown_fox JumpsOver___the.lazy DOG"
regex = /[_,\-, ,.]|(?=[A-Z]+[a-z]+)/
string.split(regex).reject(&:empty?)
# => ["the", "Quick", "brown", "fox", "Jumps", "Over", "the", "lazy", "DOG"]
You may use a matching approach to extract chunks of either two or more uppercase letters, or a single letter followed by zero or more lowercase letters:
s.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
See the Ruby demo and the Rubular demo.
The regex matches:
\p{Lu}{2,} - 2 or more uppercase letters
| - or
\p{L} - any letter
\p{Ll}* - 0 or more lowercase letters.
With map(&:downcase), the items you get with .scan() are turned to lower case.
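Put together, a minimal sketch of the scan-based approach might look like this. It is written as a standalone method (the name and signature are assumptions; the question defines split_words on String instead).

```ruby
# Hypothetical standalone helper built on the scan pattern above.
def split_words(str)
  str.scan(/\p{Lu}{2,}|\p{L}\p{Ll}*/).map(&:downcase)
end

words = split_words("theQuick--brown_fox JumpsOver___the.lazy DOG")
p words
# => ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]

# Caveat: an acronym directly followed by a capitalised word is grouped greedily.
acronym = split_words("parseXMLFile")
p acronym
# => ["parse", "xmlf", "ile"]
```

Digits and other non-letter characters are simply dropped by this pattern, since neither alternative can match them.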
r = /
[- _.]+ # match one or more combinations of dashes, spaces,
# underscores and periods
| # or
(?<=\p{Ll}) # match a lower case letter in a positive lookbehind
(?=\p{Lu}) # match an upper case letter in a positive lookahead
/x # free-spacing regex definition mode
str = "theQuick--brown_dog, JumpsOver___the.--lazy FOX for $5"
str.split(r).map(&:downcase)
#=> ["the", "quick", "brown", "dog,", "jumps", "over", "the", "lazy",
"fox", "for", "$5"]
If the string is to be broken on spaces and all punctuation characters, replace [- _.]+ with [ [:punct:]]+. Search for "[[:punct:]]" at Regexp for the reference.
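A brief sketch of that [ [:punct:]]+ variant, using a made-up sample string that contains only commas, periods, underscores and spaces as separators:

```ruby
r = /
  [ [:punct:]]+   # one or more spaces or punctuation characters
  |               # or
  (?<=\p{Ll})     # a position preceded by a lowercase letter...
  (?=\p{Lu})      # ...and followed by an uppercase letter
/x

str = "theQuick, brown_fox...JumpsOver"
words = str.split(r).map(&:downcase)
p words
# => ["the", "quick", "brown", "fox", "jumps", "over"]
```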

How do I split on a "." but only if there are non-numbers following it?

I want to split a line by a space, or a "." separating a number in front of it and a non-number behind it. I want to split like:
"10.ABC DEF GHI" # => ["10", "ABC", "DEF", "GHI"]
"10.00 DEF GHI" #=> ["10.00", "DEF", "GHI"]
I have
words = line.strip.split(/(?<=\d)\.|[[:space:]]+/)
But I discovered this doesn't quite do what I want. Although it will split the line:
line = "10.ABC DEF GHI"
words = line.strip.split(/(?<=\d)\.|[[:space:]]+/) # => ["10", "ABC", "DEF", "GHI"]
It will also incorrectly split
line = "10.00 DEF GHI"
line.strip.split(/(?<=\d)\.|[[:space:]]+/) # => ["10", "00", "DEF", "GHI"]
How do I correct my regular expression to only split on the dot if there are non-numbers following the "."?
Add a negative lookahead (?!\d) after \.:
/(?<=\d)\.(?!\d)|[[:space:]]+/
          ^^^^^^
The match fails if the . is followed by a digit.
See the Rubular demo.
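Applied to the two sample lines from the question, the corrected pattern behaves as follows (the variable names are only for illustration):

```ruby
splitter = /(?<=\d)\.(?!\d)|[[:space:]]+/

# The dot is followed by a letter, so it acts as a separator.
first = "10.ABC DEF GHI".strip.split(splitter)

# The dot is followed by a digit, so the lookahead rejects it and "10.00" stays whole.
second = "10.00 DEF GHI".strip.split(splitter)

p first  # => ["10", "ABC", "DEF", "GHI"]
p second # => ["10.00", "DEF", "GHI"]
```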

ruby regexp to find a word that does not contain digits

I want to use a regular expression to get an enumerator that yields only the words that contain no digits. What is the best way to do that?
I have tried the following:
regexp= /(?=\w+)(?=^(?:(?!\d+).)*$)/
"this is a number 1234".split(regexp) # ["this is a number 1234"]
where I expected (?=\w+) should ensure if that is word or not and I expected (?=^(?:(?!\d+).)*$) to ensure it does not contain any digits.
I expected an output:
["this", "is", "a", "number"]
scan is easier than split for this:
regexp = /\b[[:alpha:]]+\b/
p "this is a number 1234".scan(regexp)
# => ["this", "is", "a", "number"]
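Since the question asks for an enumerator, one possible sketch is to wrap scan with to_enum, which turns the block form of scan into an Enumerator. The second example is an assumption about the intended behaviour: a word with digits attached is excluded entirely because of the \b anchors.

```ruby
regexp = /\b[[:alpha:]]+\b/

# to_enum(:scan, regexp) yields each match, as scan would to a block.
words = "this is a number 1234".to_enum(:scan, regexp)
first_word = words.next
all_words  = words.to_a

p first_word # => "this"
p all_words  # => ["this", "is", "a", "number"]

# "abc123" contains digits, so no part of it is returned.
mixed = "abc123 def".scan(regexp)
p mixed # => ["def"]
```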
Try the following (it assumes the digits come after all of the words):
p "this is a number 1234".scan(/\D+/).first.split(' ')

Splitting string using whitespace while keeping \n as a separate element

I am using Ruby and looking for a way to read in a sample string with the following text:
"This is a test
file, dog cat bark
meow woof woof"
and split elements into an array of characters based on whitespace, but to keep the \n value in the array as a separate element.
I know I can use the string.split(/\n/) to get
["this is a test", "file, dog cat bark", "meow woof woof"]
Also string.split(/ /) yields
["this", "is", "a", "test\nfile,", "dog", "cat", "bark\nmeow", "woof", "woof"]
But I am looking for a way to get:
["this", "is", "a", "test", "\n", "file,", "dog", "cat", "bark", "\n", "meow", "woof", "woof"]
Is there any way to accomplish this using Ruby?
It's a strange thing to do but:
string.split /(?=\n)|(?<=\n)| /
#=> ["This", "is", "a", "test", "\n", "file,", "dog", "cat", "bark", "\n", "meow", "woof", "woof"]
You could turn your logic around a bit and look for what you want instead of looking for the delimiters between what you want. A simple scan like this should do the trick:
>> s.scan(/\S+|\n+/)
=> ["This", "is", "a", "test", "\n", "file,", "dog", "cat", "bark", "\n", "meow", "woof", "woof"]
That assumes that repeated \n should be a single token of course.
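To make that assumption concrete, here is a short sketch comparing \n+ with \n on a made-up string containing a blank line:

```ruby
s = "one two\n\nthree"

merged   = s.scan(/\S+|\n+/) # consecutive newlines become one token
separate = s.scan(/\S+|\n/)  # each newline is its own token

p merged   # => ["one", "two", "\n\n", "three"]
p separate # => ["one", "two", "\n", "\n", "three"]
```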
This isn't particularly elegant, but you could try replacing "\n" with " \n " (note the spaces surrounding \n), and then split the resulting string on / /.
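A minimal sketch of that replace-then-split idea. It has to split on the regex / / rather than the string " ", because split(" ") treats any whitespace run as a separator and would swallow the newlines.

```ruby
text = "This is a test\nfile, dog cat bark\nmeow woof woof"

# Surround each newline with spaces, then split on single spaces only.
tokens = text.gsub("\n", " \n ").split(/ /)
p tokens
# => ["This", "is", "a", "test", "\n", "file,", "dog", "cat", "bark", "\n", "meow", "woof", "woof"]

# For contrast: the awk-style split(" ") drops the newline tokens.
lost = "a \n b".split(" ")
p lost # => ["a", "b"]
```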
This is an odd request, and perhaps, if you told us WHY you want to do that, we could help you do it in a more straightforward and conventional fashion.
It looks like you're trying to split the words and still know where your original line-ends were. Having the lines split into individual words is useful for many things, but keeping the line-ends... not so much in my experience.
When I'm dealing with text and need to break the lines up for processing, I do it this way:
text = "This is a test
file, dog cat bark
meow woof woof"
data = text.lines.map(&:split)
At this point, data looks like:
[["This", "is", "a", "test"],
["file,", "dog", "cat", "bark"],
["meow", "woof", "woof"]]
I know that each sub-array was a separate line, so if I need to process by lines I can do it using an iterator like each or map, or to reconstruct the original text I can join(" ") the sub-array elements, then join("\n") the resulting lines:
data.map{ |a| a.join(' ') }.join("\n")
=> "This is a test\nfile, dog cat bark\nmeow woof woof"

Match sequences of consecutive characters in a string

I have the string "111221" and want to match all sets of consecutive equal integers: ["111", "22", "1"].
I know that there is a special regex thingy to do that but I can't remember and I'm terrible at Googling.
Using regex in Ruby 1.8.7+:
p s.scan(/((\d)\2*)/).map(&:first)
#=> ["111", "22", "1"]
This works because (\d) captures any digit, and then \2* captures zero-or-more of whatever that group (the second opening parenthesis) matched. The outer (…) is needed to capture the entire match as a result in scan. Finally, scan alone returns:
[["111", "1"], ["22", "2"], ["1", "1"]]
…so we need to run through and keep just the first item in each array. In Ruby 1.8.6+ (which doesn't have Symbol#to_proc for convenience):
p s.scan(/((\d)\2*)/).map{ |x| x.first }
#=> ["111", "22", "1"]
With no Regex, here's a fun one (matching any char) that works in Ruby 1.9.2:
p s.chars.chunk{|c|c}.map{ |n,a| a.join }
#=> ["111", "22", "1"]
Here's another version that should work even in Ruby 1.8.6:
p s.scan(/./).inject([]){|a,c| (a.last && a.last[0]==c[0] ? a.last : a)<<c; a }
# => ["111", "22", "1"]
"111221".gsub(/(.)(\1)*/).to_a
#=> ["111", "22", "1"]
This uses the form of String#gsub that does not have a block and therefore returns an enumerator. It appears gsub was bestowed with that option in v2.0.
I found that this works. The regex first captures a single character in one group, then captures any immediately following repeats of that character in a second group. scan therefore returns an array of two-element arrays: the first element of each is the initial character, and the second is the run of additional repeats (possibly empty). Joining each pair gives the array of repeated-character runs:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
repeated_chars = input.scan(/(.)(\1*)/)
# => [["W", "W"], ["B", ""], ["W", "WWW"], ["B", "BB"], ["W", "WWWWWW"], ["B", ""], ["3", "333"], ["!", "!!!"]]
repeated_chars.map(&:join)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]
As an alternative I found that I could create a new Regexp object to match one or more occurrences of each unique characters in the input string as follows:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
regexp = Regexp.new("#{input.chars.uniq.join("+|")}+")
#=> regexp created for this example will look like: /W+|B+|3+|!+/
and then use that Regex object as an argument for scan to split out all the repeated characters, as follows:
input.scan(regexp)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]
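One thing to watch with the generated pattern: characters that are special in a regex (such as ., * or +) would need escaping. A hedged sketch using Regexp.escape follows; the helper name runs_regexp and the sample input are made up for illustration.

```ruby
# Hypothetical helper: builds one "char+" alternative per distinct character,
# escaping each character so regex metacharacters are matched literally.
def runs_regexp(input)
  alternatives = input.chars.uniq.map { |c| Regexp.escape(c) + "+" }
  Regexp.new(alternatives.join("|"))
end

input = "aa..b**+"
runs = input.scan(runs_regexp(input))
p runs
# => ["aa", "..", "b", "**", "+"]
```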
You can try this (C#):
string str = "111221";
string pattern = @"(\d)(\1)*";
Hope this helps.
