How to split a string of repeated characters with uneven amounts? Ruby

If I have a string such as "aabbbbccdddeffffgg" and I wanted to split the string into this array: ["aa", "bbbb", "cc", "ddd", "e", "ffff", "gg"], how would I go about that?
I know of string.split(/.../), or however many periods you put there, but it doesn't account for the groups being of uneven length. The point of the problem I'm working on is to take two strings and see if there are three characters in a row in one string and two in a row in the other. I tried
`letter_count_1 = {}
str1.each_char do |let|
letter_count_1[let] = str1.count(let)
end`
But that gives the total count of each character in the whole string, and some of the inputs are randomized with the same letter in multiple places, like "aabbbacccdba".
So how do you split the string up by character?

You can use a regex with a back reference and the scan() method:
str = "aabbbbccdddeffffgg"
groups = []
str.scan(/((.)\2*)/) { |x| groups.push(x[0]) }
groups will look like this afterwards:
["aa", "bbbb", "cc", "ddd", "e", "ffff", "gg"]

Here is a non-regexp version
str = "aabbbbccdddeffffgg"
p str.chars.chunk(&:itself).map{|x|x.last.join} #=> ["aa", "bbbb", "cc", "ddd", "e", "ffff", "gg"]
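To connect the split back to the original problem (checking for several identical characters in a row), here is a minimal sketch. The helper names `runs` and `run_of_at_least?` are hypothetical, and "three in a row" is assumed to mean "a run of at least three"; adjust the comparison if an exact length is required.

```ruby
# Split a string into runs of identical consecutive characters,
# using the back-reference regex from the answer above.
def runs(str)
  str.scan(/((.)\2*)/).map(&:first)
end

# Hypothetical helper: true if any run in str has at least n characters.
def run_of_at_least?(str, n)
  runs(str).any? { |run| run.size >= n }
end

p runs("aabbbacccdba")                 # => ["aa", "bbb", "a", "ccc", "d", "b", "a"]
p run_of_at_least?("aabbbacccdba", 3)  # => true
p run_of_at_least?("abab", 2)          # => false
```

Because the runs are computed per position rather than per character total, a letter that appears in several places (as in "aabbbacccdba") is handled correctly.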

Related

Ruby split binary string where the previous character is different from the next one

I wonder how I could split a binary string in Ruby.
I want to split the string where the previous character is different from the next one.
For example, if I have the string
s = "aaaabbabbaa"
I would like to create an array of strings
array[0] = "aaaa"
array[1] = "bb"
array[2] = "a"
array[3] = "bb"
array[4] = "aa"
How could I do this?
Enumerable#chunk does that, but it's defined on Enumerable, and String does not include Enumerable. Transform the string into an Array of chars (and glue them back into strings), like:
s = "aaaabbabbaa"
p array = s.chars.chunk(&:itself).map{|a| a.last.join} #=>["aaaa", "bb", "a", "bb", "aa"]
You could use a regular expression with scan:
array = s.scan(/((.)\2*)/).map(&:first)
#=> ["aaaa", "bb", "a", "bb", "aa"]
str = "aaaabbabbaa"
r = /
(?<=(.)) # match any character in capture group 1, in positive lookbehind
(?!\1) # do not match capture group 1, negative lookahead
/x # free-spacing regex definition mode
str.split(r)
#=> ["aaaa", "a", "bb", "b", "a", "a", "bb", "b", "aa", "a"]
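Because both parts of the pattern are lookarounds, each match is zero-width, so no characters are lost when splitting on the regular expression. Note, however, that String#split also returns the contents of capture group 1 after each piece, which is why every run in the output above is followed by a single character.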
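If only the runs are wanted from the split-based approach, one option is to keep every other element of the result. This is a sketch, assuming the result always alternates between a run and the character captured by group 1, as in the output shown above; the variable names are illustrative.

```ruby
str = "aaaabbabbaa"

# Same pattern as above: split where the preceding character differs from the next one.
r = /(?<=(.))(?!\1)/

pieces = str.split(r)

# Keep the runs (even positions) and drop the captured characters (odd positions).
runs_only = pieces.each_slice(2).map(&:first)
p runs_only  # => ["aaaa", "bb", "a", "bb", "aa"]
```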
Using Enumerable#chunk_while:
str = "aaaabbabbaa"
p str.chars.chunk_while(&:==).map(&:join)
#=> ["aaaa", "bb", "a", "bb", "aa"]
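For comparison, the same grouping can be written with Enumerable#slice_when, which is the mirror image of chunk_while: it starts a new slice where the block returns true. This is an alternative sketch, not one of the approaches given in the answers above.

```ruby
str = "aaaabbabbaa"

# Start a new slice whenever two neighbouring characters differ.
grouped = str.chars.slice_when { |a, b| a != b }.map(&:join)
p grouped  # => ["aaaa", "bb", "a", "bb", "aa"]
```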

Finding dictionary words within a source text, using Ruby

Using Ruby, I need to output a list of words, found in a dictionary, that can be formed by eliminating letters from a source text.
E.g., if I input the source text "crazed", I want to get not only words like "craze" and "razed", whose letters are in the same order AND adjacent to each other within the source text, but ALSO words like "rad" and "red", because those words exist and can be found by eliminating select letters from "crazed" while retaining letter order. BUT words like "dare" or "race" should not be in the output list, because the order of their letters is not the same as in "crazed". (If "raed" or "crae" were words in the dictionary, they WOULD be part of the output.)
My thought was to go through the source text in a binary manner
(for "crazed", we'd get:
000001 = "d";
000010 = "e";
000011 = "ed";
000100 = "z";
000101 = "zd";
000111 = "zed";
001000 = "a";
001001 = "ad"; etc.)
and compare each result with words in a dictionary, though I don't know how to code that, nor whether that is most efficient. This is where I would greatly benefit from your help.
Also, the length of the source text would be variable; it wouldn't necessarily be six letters long (like "crazed"). Inputs would potentially be much larger (20-30 characters, possibly more).
I've searched here and found questions about anagrams and about words that can be in any letter order, but not specifically what I'm looking for. Is this even possible in Ruby? Thank you.
First let's read the words of a dictionary into an array, after chomping, downcasing and removing duplicates (if, for example, the dictionary contains both "A" and "a", as does the dictionary on my Mac that I've used below).
DICTIONARY = File.readlines("/usr/share/dict/words").map { |w| w.chomp.downcase }.uniq
#=> ["a", "aa", "aal", "aalii",..., "zyzomys", "zyzzogeton"]
DICTIONARY.size
#=> 234371
The following method generates all combinations of one or more characters of a given word, respecting order, and for each, joins the characters to form a string, checks to see if the string is in the dictionary, and if it is, saves the string to an array.
To check if a string matches a word in the dictionary I perform a binary search, using the method Array#bsearch. This makes use of the fact that the dictionary is already sorted in alphabetical order.
def subwords(word)
  arr = word.chars
  (1..word.size).each.with_object([]) do |n,a|
    arr.combination(n).each do |comb|
      w = comb.join
      a << w if DICTIONARY.bsearch { |dw| w <=> dw }
    end
  end
end
subwords "crazed"
# => ["c", "r", "a", "z", "e", "d",
# "ca", "ce", "ra", "re", "ae", "ad", "ed",
# "cad", "rad", "red", "zed",
# "raze", "craze", "crazed"]
Yes, that particular dictionary contains all those strings (such as "z") that don't appear to be English words.
Another example.
subwords "importance"
#=> ["i", "m", "p", "o", "r", "t", "a", "n", "c", "e",
# "io", "it", "in", "ie", "mo", "mr", "ma", "me", "po", "pa", "or",
# "on", "oe", "ra", "re", "ta", "te", "an", "ae", "ne", "ce",
# "imp", "ima", "ion", "ira", "ire", "ita", "ian", "ice", "mor", "mot",
# "mon", "moe", "man", "mac", "mae", "pot", "poa", "pon", "poe", "pan",
# "pac", "ort", "ora", "orc", "ore", "one", "ran", "tan", "tae", "ace",
# "iota", "ione", "iran", "mort", "mora", "morn", "more", "mote",
# "moan", "mone", "mane", "mace", "port", "pore", "pote", "pone",
# "pane", "pace", "once", "rane", "race", "tane",
# "impot", "moran", "morne", "porta", "ponce", "rance",
# "import", "impone", "impane", "prance",
# "portance",
# "importance"]
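The binary enumeration proposed in the question can be sketched as follows. Each integer from 1 to 2**n - 1 is treated as a bitmask that selects which letters of the source text to keep, in order. The word set `WORDS` and the method name `subsequence_words` are hypothetical stand-ins for illustration; a real run would load a dictionary file instead.

```ruby
require 'set'

# Hypothetical stand-in for a real word list.
WORDS = Set.new(%w[craze razed rad red dare race zed])

# Enumerate every non-empty order-preserving selection of letters via a bitmask
# and keep the ones found in the word set.
def subsequence_words(source, words)
  chars = source.chars
  n = chars.size
  found = []
  (1...(1 << n)).each do |mask|
    # The highest bit corresponds to the first letter, so 000001 selects the last letter.
    picked = chars.each_index.select { |i| mask[n - 1 - i] == 1 }
    candidate = picked.map { |i| chars[i] }.join
    found << candidate if words.include?(candidate)
  end
  found.uniq
end

p subsequence_words("crazed", WORDS).sort
# => ["craze", "rad", "razed", "red", "zed"]
```

The number of masks doubles with every extra letter, so for the 20 to 30 character inputs mentioned in the question this means roughly a million to a billion candidates; the sketch is only practical for short source texts.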
Below is a more extensive solution that also finds words formed from the source letters in any order. The catch with using combination alone to find subwords is that the permutations of each combination are missed. E.g., drawing from 'importance', the combination 'mpa' will arise at some point; since it isn't a dictionary word, it is skipped, which costs us the permutation 'map', a dictionary subword of 'importance'. The solution below finds these additional dictionary words. I agree that my method can be optimized for speed.
#steps
#split string at ''
#find combinations for n=2 all the way to n=word.size
#for each combination
#find the permutations of all the arrangements
#then
#join the array
#check to see if word is in dictionary
#and it's not already collected
#if it is, add to collecting array
require 'set'
Dictionary=File.readlines('dictionary.txt').map(&:chomp).to_set
Dictionary.size #=> 39501
def subwords(word)
  #split string at ''
  arr=word.split('')
  #excluding single letter words
  #you can change 2 to 1 in line below to select for single letter words too
  (2..word.size).each_with_object([]) do |n,a|
    #find combinations for n=2 all the way to n=word.size
    arr.combination(n).each do |comb|
      #for each combination
      #find the permutations of all the arrangements
      comb.permutation(n).each do |perm|
        #join the array
        w=perm.join
        #check to see if word is in dictionary and it's not already collected
        if Dictionary.include?(w) && !a.include?(w)
          #if it is, add to collecting array
          a<<w
        end
      end
    end
  end
end
p subwords('crazed')
#["car", "arc", "rec", "ace", "cad", "are", "era", "ear", "rad", "red", "adz", "zed", "czar", "care", "race", "acre", "card", "dace", "raze", "read", "dare", "dear", "adze", "daze", "craze", "cadre", "cedar", "crazed"]
p subwords('battle')
#["bat", "tab", "alb", "lab", "bet", "tat", "ate", "tea", "eat", "eta", "ale", "lea", "let", "bate", "beat", "beta", "abet", "bale", "able", "belt", "teat", "tale", "teal", "late", "bleat", "table", "latte", "battle", "tablet"]
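One possible way to speed up the any-order search is to reverse the loop: instead of generating every permutation of the source letters, walk the dictionary once and keep each word whose letter counts fit within the source's letter counts. This is a different technique from the permutation approach above, shown only as a sketch. `WORD_LIST` and `formable_words` are hypothetical names, and `Enumerable#tally` assumes Ruby 2.7 or later.

```ruby
# Hypothetical stand-in for a real word list.
WORD_LIST = %w[car arc race dare crazed battle zebra]

# Keep the words (two letters or longer) that can be built from the letters
# of source, using each source letter at most once, in any order.
def formable_words(source, words)
  available = source.chars.tally
  words.select do |word|
    word.size >= 2 &&
      word.chars.tally.all? { |ch, count| available.fetch(ch, 0) >= count }
  end
end

p formable_words("crazed", WORD_LIST)
# => ["car", "arc", "race", "dare", "crazed"]
```

The work here grows with the size of the dictionary rather than with the factorial of the source length.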

Ignoring capture group in Regex that is used for repeating the pattern

/((\w)\2)/ finds repeating letters. I was hoping to avoid the two dimensional array that is produced by ignoring the letter matching second capture group like this: /((?:\w)\2)/. It seems that's not possible. Any ideas why?
Rubular example
You don't need any capture groups:
str = [*'a+'..'z+', *'A+'..'Z+', *'0+'..'9+', '_+'].join('|')
#=> "a+|b+| ... |z+|A+|B+| ... |Z+|0+|1+| ... |9+|_+"
"aaabbcddd".scan(/#{str}/)
#=> ["aaa", "bb", "c", "ddd"]
but if you insist on having one:
"aaabbcddd".scan(/(#{str})/).flatten(1)
#=> ["aaa", "bb", "c", "ddd"]
Is this cheating? You did ask if it was possible.
If you mean you're using String#scan, you can post-process the result to return only the first item of each match using Enumerable#map:
'helloo'.scan(/((\w)\2)/)
# => [["ll", "l"], ["oo", "o"]]
'helloo'.scan(/((\w)\2)/).map { |m| m[0] }
# => ["ll", "oo"]
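As for why the non-capturing version does not work: a back reference can only repeat text that a capturing group has stored, so the group holding the letter has to stay a capturing group, and /((?:\w)\2)/ has no group 2 left to refer to. One way to get just the whole matches without post-processing is String#gsub called without a block or replacement, which returns an enumerator over the matched strings. A minimal sketch:

```ruby
# Only one capture group is needed for the back reference;
# gsub without a block yields each whole match, not the groups.
doubles = 'helloo'.gsub(/(\w)\1/).to_a
p doubles  # => ["ll", "oo"]
```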

Format data in string to array?

I need to convert data from a string to an array. The string looks like this:
{a,b,c{1,2,3},d,e,f{11,22,33},g}
The array that I want to receive should look like this:
[a, b, c1, c2, c3, d, e, f11, f22, f33, g]
I tried to use the split method but it works poorly.
arr = str.split(' ');
keys = arr[0][2..-2]
keys = keys.split(',')
Do you have any ideas how it could be implemented?
Here's what I'd use:
string = '{a,b,c{1,2,3},d,e,f{11,22,33},g}'
array = string.scan(/[a-z](?:{.+?})?/).flat_map{ |s|
  if s['{']
    prefix = s[0]
    values = s.scan(/\d+/)
    ([prefix] * values.size).zip(values).map(&:join)
  else
    s
  end
}
array # => ["a", "b", "c1", "c2", "c3", "d", "e", "f11", "f22", "f33", "g"]
Here's how it works:
string.scan(/[a-z](?:{.+?})?/) # => ["a", "b", "c{1,2,3}", "d", "e", "f{11,22,33}", "g"]
returns the string broken into chunks, looking for a single letter followed by an optional string of { with some text then }.
values = s.scan(/\d+/) # => ["1", "2", "3"], ["11", "22", "33"]
As it's running in flat_map, if { is found, the numbers are scanned out.
([prefix] * values.size).zip(values).map(&:join) # => ["c1", "c2", "c3"], ["f11", "f22", "f33"]
And then an array of the prefix, with the same number of elements as there are values is created and zipped together, resulting in:
[["c", "1"], ["c", "2"], ["c", "3"]], [["f", "11"], ["f", "22"], ["f", "33"]]
The join glues those sub-arrays together. And flat_map flattens any subarrays created so the resulting output is a single array.
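The same idea can be written a little more compactly by letting scan capture the prefix and the brace contents as two groups. This is a variation for illustration, not the answer's code, and it assumes single-letter prefixes and comma-separated values as in the sample string.

```ruby
string = '{a,b,c{1,2,3},d,e,f{11,22,33},g}'

# Group 1 is the letter; group 2 is the text inside the braces, or nil if there are none.
result = string.scan(/([a-z])(?:\{(.+?)\})?/).flat_map do |prefix, inner|
  inner ? inner.split(',').map { |value| prefix + value } : [prefix]
end

p result  # => ["a", "b", "c1", "c2", "c3", "d", "e", "f11", "f22", "f33", "g"]
```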
You need to use arr = str.split(',') in the first step, because there is no whitespace between the values.
Also keep in mind you have {} to handle too.
This worked for me with a simple regex and gsubbing (though Tin Man's solution is better Ruby):
def my_string_to_array(input_string)
  groups = input_string.scan(/\w+\{.*?\}/)
  groups.each do |group|
    modified = group.gsub(',', ",#{group.match(/\w+/)[0]}").delete("{}")
    input_string.gsub!(group, modified)
  end
  created_array = input_string.delete("{}").split(',')
end
string = '{a,b,c{1,2,3},d,e,f{11,22,33},g}'
my_string_to_array(string)
=> ["a", "b", "c1", "c2", "c3", "d", "e", "f11", "f22", "f33", "g"]
The way it works is that it first finds the groups consisting of letters followed by braces containing digits (like c{1,2,3}).
For each such group, it replaces every ',' with ',<letter>' and removes the braces.
Next, it replaces these groups with the modified ones in the original string.
And finally it removes the starting and ending braces in the original string, and converts it into an array by splitting on ','.

Splitting string into pair of characters in Ruby

I have a string (e.g. "AABBCCDDEEFF") and want to split this into an array with each element containing two characters - ["AA", "BB", "CC", "DD", "EE", "FF"].
Try the String object's scan method:
>> foo = "AABBCCDDEEFF"
=> "AABBCCDDEEFF"
>> foo.scan(/../)
=> ["AA", "BB", "CC", "DD", "EE", "FF"]
Depending on your needs, this may work better:
> foo = "AAABBCDEEFF"
=> "AAABBCDEEFF"
> foo.scan(/.{1,2}/)
=> ["AA", "AB", "BC", "DE", "EF", "F"]
Not sure what your input looks like. The first version drops a trailing character that does not have a pair; this one also works on odd-length strings.
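A non-regex alternative, shown here as a sketch rather than as part of the answer above, is to slice the characters into groups of two with each_slice. One difference worth knowing: `.` in a Ruby regex does not match a newline by default, so the two approaches can disagree on multi-line input.

```ruby
foo = "AAABBCDEEFF"

# Take the characters two at a time; the last slice may hold a single character.
pairs = foo.chars.each_slice(2).map(&:join)
p pairs  # => ["AA", "AB", "BC", "DE", "EF", "F"]

# With an embedded newline the regex version skips it, while each_slice keeps it.
p "AB\nCD".scan(/.{1,2}/)                    # => ["AB", "CD"]
p "AB\nCD".chars.each_slice(2).map(&:join)   # => ["AB", "\nC", "D"]
```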
