I have a string:
"N8383"
I want to split on the character and maintain it to get:
["N", "8383"]
I tried the following:
"N8383".split(/[A-Z]/)
which gives me:
["", "8383"]
I want to match some more example strings like:
N344 344N S555 555S
String#split is a bad fit for this problem for the reasons others have stated. I would approach it like this, using String#scan instead:
str_parts = "N8383".scan(/[[:alpha:]]+/)
num_parts = "N8383".scan(/[[:digit:]]+/)
This will give you something to work with if the strings contain multiple string parts and/or multiple numeric parts.
This expression:
%w[N344 344N S555 555S].map do |str|
  next str.scan(/[[:alpha:]]+/), str.scan(/[[:digit:]]+/)
end
Will return:
[
[["N"], ["344"]],
[["N"], ["344"]],
[["S"], ["555"]],
[["S"], ["555"]]
]
Although you are scanning each string twice, I think it's a better solution than 1. trying to come up with a complex regex that backtracks to return the parts in the right order, or 2. reprocessing the results to put the parts in the right order. Especially if the strings are as short as they are in the examples you've provided. That being said, if scanning each string twice really rankles you, here's another way to do it:
str_parts, num_parts = str.scan(/([[:alpha:]]+)|([[:digit:]]+)/).transpose.each(&:compact!)
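To see why the single-scan version works, it can help to look at the intermediate values. This is only an illustrative walk-through; the sample string "N12M3" is my own choice, not one from the question.

```ruby
str = "N12M3"

# With two capture groups, scan returns one [alpha, digit] pair per match;
# the group that did not take part in the match is nil.
pairs = str.scan(/([[:alpha:]]+)|([[:digit:]]+)/)
# => [["N", nil], [nil, "12"], ["M", nil], [nil, "3"]]

# transpose turns the list of pairs into one row per capture group.
rows = pairs.transpose

# compact! strips the nils in place; each returns the outer array,
# which is then destructured into the two result variables.
str_parts, num_parts = rows.each(&:compact!)
# str_parts => ["N", "M"]
# num_parts => ["12", "3"]
```

Note that for an empty string scan returns [], so both variables end up nil rather than empty arrays.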
Given the examples, you could use the following regex:
/(?=[A-Z])|(?<=[A-Z])/
This will look ahead (?=) for a single character [A-Z] or look behind (?<=) for a single character [A-Z]. Since these are zero-length assertions, the split is placed between the characters rather than consuming a character, e.g.
%w{N8383 N344 344N S555 555S}.map {|s| s.split(/(?=[A-Z])|(?<=[A-Z])/) }
#=> [["N", "8383"], ["N", "344"], ["344", "N"], ["S", "555"], ["555", "S"]]
However, this regex is specific to the given cases and does not generalize beyond them. For example, I have no idea what the desired output for "N344S" is, but right now it will be ["N", "344", "S"], and worse yet, "NSS344S" will be ["N", "S", "S", "344", "S"].
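If runs of letters should stay together, one possible variation is to split only where a letter meets a digit. This is an assumption on my part, since the desired output for such inputs is not specified; BOUNDARY is just a name picked for illustration.

```ruby
# Split only at letter/digit boundaries, so consecutive letters stay together.
# Both alternatives are zero-length, so no characters are consumed.
BOUNDARY = /(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])/

results = %w[N8383 344N N344S NSS344S].map { |s| s.split(BOUNDARY) }
# => [["N", "8383"], ["344", "N"], ["N", "344", "S"], ["NSS", "344", "S"]]
```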
def doit(str)
  str.scan(/\d+|\p{L}+/)
end
doit "N123" #=> ["N", "123"]
doit "123N" #=> ["123", "N"]
doit "N123M" #=> ["N", "123", "M"]
doit "N12M3P" #=> ["N", "12", "M", "3", "P"]
doit "123" #=> ["123"]
doit "NMN" #=> ["NMN"]
doit "" #=> []
I have a string "wwwggfffw" and want to break it up into an array as follows:
["www", "gg", "fff", "w"]
Is there a way to do this with regex?
"wwwggfffw".scan(/((.)\2*)/).map(&:first)
scan is a little funny: it returns either the whole match or the subgroups, depending on whether the pattern has subgroups. We need a subgroup to match repetitions of the same character (the (.) followed by a backreference to it), but we'd prefer to get back the whole match and not just the repeated letter. So we make the whole match into a subgroup so it will be captured, and at the end we extract just that match (without the other subgroup), which we do with .map(&:first).
EDIT to explain the regexp ((.)\2*) itself:
(        start group #1, consisting of
  (      start group #2, consisting of
    .    any one character
  )      and nothing else
  \2     followed by the content of group #2
  *      repeated any number of times (including zero)
)        and nothing else.
So in wwwggfffw, (.) captures w into group #2; then \2* matches any number of additional w characters. This makes group #1 capture www.
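A quick way to see both groups at work is to inspect what scan returns before .map(&:first) is applied (a small illustration added for clarity; the variable names are my own):

```ruby
raw = "wwwggfffw".scan(/((.)\2*)/)
# Each element is [group #1, group #2]: the whole run and its single character.
# => [["www", "w"], ["gg", "g"], ["fff", "f"], ["w", "w"]]

# Keep only group #1 from each pair.
runs = raw.map(&:first)
# => ["www", "gg", "fff", "w"]
```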
You can use back references, something like
'wwwggfffw'.scan(/((.)\2*)/).map{ |s| s[0] }
will work
Here's one that's not using regex but works well:
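def chunk(str)
  return [] if str.empty? # otherwise the seed below would be [nil]
  chars = str.chars
  # Seed the result with the first character, then extend or start a run.
  chars.inject([chars.shift]) do |arr, char|
    if arr[-1].include?(char)
      arr[-1] << char
    else
      arr << char
    end
    arr
  end
end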
In my benchmarks it's faster than the regex answers here (with the example string you gave, at least).
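For anyone who wants to check that claim on their own machine, here is a rough sketch using the standard Benchmark module. The method names and the iteration count are my own choices, the inject version is a slight rewrite rather than the exact code above, and timings will vary with Ruby version and input, so treat any result as indicative only.

```ruby
require 'benchmark'

# Regex-based version from the other answers.
def chunk_regex(str)
  str.scan(/((.)\2*)/).map(&:first)
end

# Non-regex version using inject; it builds new strings and leaves str untouched.
def chunk_inject(str)
  str.each_char.inject([]) do |arr, char|
    if arr.last && arr.last.end_with?(char)
      arr.last << char   # same character as the current run: extend it
    else
      arr << char.dup    # different character: start a new run
    end
    arr
  end
end

sample = "wwwggfffw"
iterations = 10_000 # arbitrary count chosen for illustration

Benchmark.bm(8) do |bm|
  bm.report("regex:")  { iterations.times { chunk_regex(sample) } }
  bm.report("inject:") { iterations.times { chunk_inject(sample) } }
end
```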
Another non-regex solution, this one using Enumerable#slice_when, which made its debut in Ruby v.2.2:
str.each_char.slice_when { |a,b| a!=b }.map(&:join)
#=> ["www", "gg", "fff", "w"]
Another option is:
str.scan(Regexp.new(str.squeeze.each_char.map { |c| "(#{c}+)" }.join)).first
#=> ["www", "gg", "fff", "w"]
Here the steps are as follows
s = str.squeeze
#=> "wgfw"
a = s.each_char
#=> #<Enumerator: "wgfw":each_char>
This enumerator generates the following elements:
a.to_a
#=> ["w", "g", "f", "w"]
Continuing
b = a.map { |c| "(#{c}+)" }
#=> ["(w+)", "(g+)", "(f+)", "(w+)"]
c = b.join
#=> "(w+)(g+)(f+)(w+)"
r = Regexp.new(c)
#=> /(w+)(g+)(f+)(w+)/
d = str.scan(r)
#=> [["www", "gg", "fff", "w"]]
d.first
#=> ["www", "gg", "fff", "w"]
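One caveat with building the pattern from the string itself: characters such as . or + have a special meaning in a regexp, so interpolating them unescaped can produce an invalid or misleading pattern. A hedged variation using Regexp.escape (the sample string is my own, chosen to include such characters):

```ruby
str = "aa..b++"

# Escape each distinct run character before wrapping it in a capture group,
# so "." and "+" are matched literally.
pattern = Regexp.new(str.squeeze.each_char.map { |c| "(#{Regexp.escape(c)}+)" }.join)

runs = str.scan(pattern).first
# => ["aa", "..", "b", "++"]
```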
Here's one more way of doing it without a regex:
'wwwggfffw'.chars.chunk(&:itself).map{ |s| s[1].join }
# => ["www", "gg", "fff", "w"]
/((\w)\2)/ finds repeating letters. I was hoping to avoid the two dimensional array that is produced by ignoring the letter matching second capture group like this: /((?:\w)\2)/. It seems that's not possible. Any ideas why?
Rubular example
You don't need any capture groups:
str = [*'a+'..'z+', *'A+'..'Z+', *'0+'..'9+', '_+'].join('|')
#=> "a+|b+| ... |z+|A+|B+| ... |Z+|0+|1+| ... |9+|_+"
"aaabbcddd".scan(/#{str}/)
#=> ["aaa", "bb", "c", "ddd"]
but if you insist on having one:
"aaabbcddd".scan(/(#{str})/).flatten(1)
#=> ["aaa", "bb", "c", "ddd"]
Is this cheating? You did ask if it was possible.
If you mean you're using String#scan, you can post-process the result with Enumerable#map to return only the first item of each match:
'helloo'.scan(/((\w)\2)/)
# => [["ll", "l"], ["oo", "o"]]
'helloo'.scan(/((\w)\2)/).map { |m| m[0] }
# => ["ll", "oo"]
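As for why: a backreference such as \2 can only refer to a capturing group, so the inner group cannot be made non-capturing, and scan returns the groups whenever the pattern contains any. If a flat array is the goal, one option is the block-less form of String#gsub, which returns an enumerator over the whole matches. A minimal sketch (the variable name is my own):

```ruby
# gsub called with only a pattern returns an Enumerator of the full matches,
# so a single capture group is enough to support the backreference.
doubles = 'helloo'.gsub(/(\w)\1/).to_a
# => ["ll", "oo"]
```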
If you wanted to split a space-separated list of words, you would use
def words(text)
  return text.split.map{|word| word.downcase}
end
similarly to Python's list comprehension:
words("get out of here")
which returns ["get", "out", "of", "here"]. How can I apply a block to every character in a string?
Use String#chars:
irb> "asdf".chars.map { |ch| ch.upcase }
=> ["A", "S", "D", "F"]
Are you looking for something like this?
class String
  def map
    size.times.with_object('') {|i,s| s << yield(self[i])}
  end
end
"ABC".map {|c| c.downcase} #=> "abc"
"ABC".map(&:downcase) #=> "abc"
"abcdef".map {|c| (c.ord+1).chr} #=> "bcdefg"
"abcdef".map {|c| c*3} #=> "aaabbbcccdddeeefff"
I think the short answer to your question is "no, there's nothing like map for strings that operates a character at a time." Previous answerer had the cleanest solution in my book; simply create one by adding a function definition to the class.
BTW, there's also String#each_char which is an iterator across each character of a string. In this case String#chars gets you the same result because it returns an Array which also responds to each (or map), but I guess there may be cases where the distinction would be important.
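To make that distinction concrete, here is a small sketch (variable names are my own): each_char without a block gives an Enumerator, while chars gives an Array, and both can be mapped.

```ruby
enum = "asdf".each_char   # Enumerator: characters are produced on demand
list = "asdf".chars       # Array: all characters are built up front

# Both support map, because Enumerator includes Enumerable.
upper_from_enum = enum.map(&:upcase).join
upper_from_list = list.map(&:upcase).join
# both => "ASDF"

# With a block, each_char iterates without building an intermediate array.
codes = []
"asdf".each_char { |ch| codes << ch.ord }
# codes => [97, 115, 100, 102]
```

Calling join on the mapped result is an easy way to get a String back instead of an Array, without reopening the String class.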
I am using Ruby and looking for a way to read in a sample string with the following text:
"This is a test
file, dog cat bark
meow woof woof"
and split it into an array of words based on whitespace, but keep each \n in the array as a separate element.
I know I can use string.split(/\n/) to get
["This is a test", "file, dog cat bark", "meow woof woof"]
Also string.split(/ /) yields
["This", "is", "a", "test\nfile,", "dog", "cat", "bark\nmeow", "woof", "woof"]
But I am looking for a way to get:
["This", "is", "a", "test", "\n", "file,", "dog", "cat", "bark", "\n", "meow", "woof", "woof"]
Is there any way to accomplish this using Ruby?
It's a strange thing to do but:
string.split(/(?=\n)|(?<=\n)| /)
#=> ["This", "is", "a", "test", "\n", "file,", "dog", "cat", "bark", "\n", "meow", "woof", "woof"]
You could turn your logic around a bit and look for what you want instead of looking for the delimiters between what you want. A simple scan like this should do the trick:
>> s.scan(/\S+|\n+/)
=> ["This", "is", "a", "test", "\n", "file,", "dog", "cat", "bark", "\n", "meow", "woof", "woof"]
That assumes that repeated \n should be a single token of course.
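If each newline should be its own token instead, dropping the + is enough. A short comparison, using a sample string of my own with a blank line in it:

```ruby
s = "one two\n\nthree"

# \n+ groups consecutive newlines into a single token.
grouped = s.scan(/\S+|\n+/)
# => ["one", "two", "\n\n", "three"]

# \n (without +) emits one token per newline.
separate = s.scan(/\S+|\n/)
# => ["one", "two", "\n", "\n", "three"]
```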
This isn't particularly elegant, but you could try replacing "\n" with " \n " (note the spaces surrounding \n), and then split the resulting string on / /.
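A minimal sketch of that idea follows. I split on / +/ rather than / / so that runs of spaces do not produce empty strings; that is my own tweak, not part of the suggestion above.

```ruby
string = "This is a test\nfile, dog cat bark\nmeow woof woof"

# Pad each newline with spaces so it becomes its own space-delimited token.
# A regexp is used for the split because split(" ") treats all whitespace,
# including "\n", as a delimiter and would drop the newlines.
tokens = string.gsub("\n", " \n ").split(/ +/)
# => ["This", "is", "a", "test", "\n", "file,", "dog", "cat", "bark",
#     "\n", "meow", "woof", "woof"]
```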
This is an odd request, and perhaps, if you told us WHY you want to do that, we could help you do it in a more straightforward and conventional fashion.
It looks like you're trying to split the words and still know where your original line-ends were. Having the lines split into individual words is useful for many things, but keeping the line-ends... not so much in my experience.
When I'm dealing with text and need to break the lines up for processing, I do it this way:
text = "This is a test
file, dog cat bark
meow woof woof"
data = text.lines.map(&:split)
At this point, data looks like:
[["This", "is", "a", "test"],
["file,", "dog", "cat", "bark"],
["meow", "woof", "woof"]]
I know that each sub-array was a separate line, so if I need to process by lines I can do it using an iterator like each or map, or to reconstruct the original text I can join(" ") the sub-array elements, then join("\n") the resulting lines:
data.map{ |a| a.join(' ') }.join("\n")
=> "This is a test\nfile, dog cat bark\nmeow woof woof"
I have the string "111221" and want to match all sets of consecutive equal integers: ["111", "22", "1"].
I know that there is a special regex thingy to do that but I can't remember and I'm terrible at Googling.
Using regex in Ruby 1.8.7+:
p s.scan(/((\d)\2*)/).map(&:first)
#=> ["111", "22", "1"]
This works because (\d) captures any digit, and then \2* captures zero-or-more of whatever that group (the second opening parenthesis) matched. The outer (…) is needed to capture the entire match as a result in scan. Finally, scan alone returns:
[["111", "1"], ["22", "2"], ["1", "1"]]
…so we need to run through and keep just the first item in each array. In Ruby 1.8.6+ (which doesn't have Symbol#to_proc for convenience):
p s.scan(/((\d)\2*)/).map{ |x| x.first }
#=> ["111", "22", "1"]
With no Regex, here's a fun one (matching any char) that works in Ruby 1.9.2:
p s.chars.chunk{|c|c}.map{ |n,a| a.join }
#=> ["111", "22", "1"]
Here's another version that should work even in Ruby 1.8.6:
p s.scan(/./).inject([]){|a,c| (a.last && a.last[0]==c[0] ? a.last : a)<<c; a }
# => ["111", "22", "1"]
"111221".gsub(/(.)(\1)*/).to_a
#=> ["111", "22", "1"]
This uses the form of String#gsub that does not have a block and therefore returns an enumerator. It appears gsub was bestowed with that option in v2.0.
I found that this works: it first captures a single character in one group, and then captures any further repetitions of that character in a second group. The result is an array of two-element arrays, where the first element is the initial character and the second is the run of additional repeats (possibly empty). Joining each pair gives an array of runs of repeated characters:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
repeated_chars = input.scan(/(.)(\1*)/)
# => [["W", "W"], ["B", ""], ["W", "WWW"], ["B", "BB"], ["W", "WWWWWW"], ["B", ""], ["3", "333"], ["!", "!!!"]]
repeated_chars.map(&:join)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]
As an alternative, I found that I could create a new Regexp object to match one or more occurrences of each unique character in the input string, as follows:
input = "WWBWWWWBBBWWWWWWWB3333!!!!"
regexp = Regexp.new("#{input.chars.uniq.join("+|")}+")
#=> regexp created for this example will look like: /W+|B+|3+|!+/
and then use that Regexp object as an argument for scan to split out all the repeated characters, as follows:
input.scan(regexp)
# => ["WW", "B", "WWWW", "BBB", "WWWWWWW", "B", "3333", "!!!!"]
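If the input may contain regexp metacharacters such as . or *, joining the raw characters with "+|" can yield a broken pattern. One possible way around that, sketched here with a sample input of my own, is to escape each character and combine the pieces with Regexp.union:

```ruby
input = "WW..**B"

# Build one /c+/ pattern per unique character, escaping metacharacters,
# then combine them into a single alternation.
pattern = Regexp.union(input.chars.uniq.map { |c| /#{Regexp.escape(c)}+/ })

runs = input.scan(pattern)
# => ["WW", "..", "**", "B"]
```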
You can try this:
string str = "111221";
string pattern = @"(\d)(\1)+";
Hope this helps.