Ruby Anagram Using String#sum - ruby

I've solved a problem that asks you to write a method for determining what words in a supplied array are anagrams and group the anagrams into a sub array within your output.
I've solved it using what seems to be the typical way that you would which is by sorting the words and grouping them into a hash based on their sorted characters.
When I originally started looking for a way to do this I noticed that String#sum exists which adds the ordinals of each character together.
I'd like to try and work out some way to determine an anagram based on using sum. For example "cars" and "scar" are anagrams and their sum is 425.
Given an input of %w[cars scar for four creams scream racs], the expected output (which I already get using the hash solution) is: [["cars", "scar", "racs"], ["for"], ["four"], ["creams", "scream"]].
It seems like doing something like:
input.each_with_object(Hash.new []) do |word, hash|
hash[word.sum] += [word]
end
is the way to go; that gives you a hash where the value for the key 425 is ['cars', 'scar', 'racs']. What I think I'm missing is moving that into the expected output format.

Unfortunately I don't think String#sum is a robust way to solve this problem.
Consider:
"zaa".sum # => 316
"yab".sum # => 316
Same sum, but not anagrams.
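To see the failure concretely, here is a small check that groups by String#sum and then looks for buckets holding words that are not anagrams of each other. The word list is made up for illustration.

```ruby
words = %w[zaa yab cars scar]

# Bucket the words by the sum of their character codes.
by_sum = words.group_by(&:sum)

# A bucket is a false positive if its words do not all sort to the same letters.
collisions = by_sum.values.select do |group|
  group.map { |w| w.chars.sort }.uniq.size > 1
end

p collisions
```

"zaa" and "yab" end up in the same bucket, while "cars" and "scar" form a genuine anagram group.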
Instead, how about grouping them by the sorted order of their characters?
words = %w[cars scar for four creams scream racs]
anagrams = words.group_by { |word| word.chars.sort }.values
# => [["cars", "scar", "racs"], ["for"], ["four"], ["creams", "scream"]]

words = %w[cars scar for four creams scream racs]
res={}
words.each do |word|
key=word.split('').sort.join
res[key] ||= []
res[key] << word
end
p res.values
# => [["cars", "scar", "racs"], ["for"], ["four"], ["creams", "scream"]]

Actually, I think you could use sums for anagram testing, but not summing the chars' ordinals themselves, but something like this instead:
words = %w[cars scar for four creams scream racs]
# get the length of the longest word:
maxlen = words.map(&:length).max
# => 6
words.group_by{|word|
word.bytes.map{|b|
maxlen ** (b-'a'.ord)
}.inject(:+)
}
# => {118486616113189=>["cars", "scar", "racs"], 17005023616608=>["for"], 3673163463679584=>["four"], 118488792896821=>["creams", "scream"]}
Not sure if this is 100% correct, but I think the logic stands.
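The idea is to map every word to a base-N number, with every digit position representing a different char. N is the length of the longest word in the input set. Strictly, N should be at least one more than that length: a letter repeated N times carries into the next digit, so with N = 6 both "aaaaaa" and "b" map to 6.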
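A hedged variant of the same idea, assuming the input only contains lowercase a-z: using a base one larger than the longest word means no letter count can carry into the next digit. The name signature and the two extra words ("aaaaaa" and "b") are only there for illustration.

```ruby
words = %w[cars scar for four creams scream racs aaaaaa b]

# One more than the longest word, so a digit can never overflow.
base = words.map(&:length).max + 1

# Each letter contributes base**(position in alphabet); order does not matter.
signature = lambda do |word|
  word.bytes.sum { |b| base**(b - 'a'.ord) }
end

groups = words.group_by(&signature).values
p groups
```

With this base, "aaaaaa" and "b" land in separate groups, and the anagram groups are unchanged.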

To get the desired output format, you just need hash.values. But note that just using the sum of the character codes in a word could fail on some inputs. It is possible for the sums of the character codes in two words to be the same by chance, when they are not anagrams.
If you used a different algorithm to combine the character codes, the chances of incorrectly identifying words as "anagrams" could be made much lower, but still not zero. Basically you need some kind of hash algorithm, but with the property that the order of the values being hashed doesn't matter. Perhaps map each character to a different random bitstring, and take the sum of the bitstrings for each character in the string?
That way, the chances of any two non-anagrams giving you a false positive would be approximately 1 / (2 ** bitstring_length).
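A rough sketch of that random-bitstring idea, under a few assumptions of mine: 64-bit codes, a seeded Random so runs are repeatable, and sums reduced modulo 2**64. The names BITS, CODES and anagram_hash are illustrative.

```ruby
BITS  = 64
RNG   = Random.new(42) # fixed seed only so that repeated runs agree
# Lazily assign each character its own random BITS-bit code.
CODES = Hash.new { |hash, ch| hash[ch] = RNG.rand(2**BITS) }

# Addition is commutative, so the result ignores character order.
def anagram_hash(word)
  word.each_char.sum { |ch| CODES[ch] } % 2**BITS
end

words = %w[cars scar for four creams scream racs]
p words.group_by { |w| anagram_hash(w) }.values
```

Anagrams always get the same value; non-anagrams can still collide, just with very low probability, so an exact check (such as comparing sorted characters) is still needed if correctness matters.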

Related

Find the last occurrence of a string being a certain length

I know there is a method to find the largest string in an array
def longest_word(string_of_words)
x = string_of_words.split(" ").max_by(&:length)
end
However, if there are multiple words with the longest length, how do I return the last instance of the word with the longest length? Is there a method for this, or do I need to use indexing?
Benjamin
What if we took advantage of reverse?
"asd qweewe lol qwerty df qwsazx".split.reverse_each.max_by(&:length)
=> "qwsazx"
Simply reverse your words array before applying max_by.
The first longest word from the reversed array will be the last one in your sentence.
can do this way also:
> "asd qweewe lol qwerty df qwsazx".split.sort_by(&:length).last
#=> "qwsazx"
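Note: You can split the words, sort them by length in ascending (default) order, and take the last word. Be aware that sort_by is not guaranteed to be stable, so when several words tie for the longest length this may not return the one that appears last.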
You can use inject which will replace the maximum only if (via <=) it's matched or improved upon. By default inject takes the first element of its receiver.
str.split.inject { |m,s| m.size <= s.size ? s : m }
str.split.max_by.with_index { |e, i| [e.length, i] }
There's no need to convert the string to an array.
def longest_word(str)
str.gsub(/[[:alpha:]]+/).
each_with_object('') {|s,longest| longest.replace(s) if s.size >= longest.size}
end
longest_word "Many dogs love to swim in the sea"
#=> "swim"
Two points.
I've used String#gsub to create an enumerator that will feed words to Enumerable#each_with_object. The string argument is not modified. This is an unusual use of gsub that I've been able to use to advantage in several situations.
Within the block it's necessary to use longest.replace(s) rather than longest = s. That's because each_with_object returns the originally given object (usually modified by the block), but does not update that object on each iteration. longest = s merely returns s (is equivalent to just s) but does not alter the value of the block variable. By contrast, longest.replace(s) modifies the original object.
With regard to the second of these two points, it is interesting to contrast the use of each_with_object with Enumerable#reduce (aka inject).
str.gsub(/[[:alpha:]]+/).
reduce('') {|longest,s| s.size >= longest.size ? s : longest }
#=> "swim"

Parse a string without spaces into an array of individual words

If I have a string "blueberrymuffinsareinsanelydelicious", what is the most efficient way to parse it such that I am left with ["blueberry", "muffins", "are", "insanely", "delicious"]?
I already have my wordlist (mac's /usr/share/dict/words), but how do I ensure that the full word is stored in my array, aka: blueberry, instead of two separate words, blue and berry.
Although there are cases where multiple interpretations are possible and picking the best one can be trouble, you can always approach it with a fairly naïve algorithm like this:
WORDS = %w[
blueberry
blue
berry
fin
fins
muffin
muffins
are
insane
insanely
in
delicious
deli
us
].sort_by do |word|
[ -word.length, word ]
end
WORD_REGEXP = Regexp.union(*WORDS)
def best_fit(string)
string.scan(WORD_REGEXP)
end
This will parse your example:
best_fit("blueberrymuffinsareinsanelydelicious")
# => ["blueberry", "muffins", "are", "insanely", "delicious"]
Note that this skips any non-matching components.
Here's a recursive method which finds the correct sentence in 0.4s on my slowish laptop.
It first imports almost 100K English words and sorts them by decreasing size.
For every word, it checks if text starts with it
If it does, it removes the word from the text, keeps the word in an array and recursively calls itself.
If the text is empty, it means a sentence has been found.
It uses a lazy array to stop at the first found sentence.
text = "blueberrymuffinsareinsanelydeliciousbecausethey'rereallymoistandcolorful"
dictionary = File.readlines('/usr/share/dict/american-english')
.map(&:chomp)
.sort_by{ |w| -w.size }
def find_words(text, possible_words, sentence = [])
return sentence if text.empty?
possible_words.lazy.select{ |word|
text.start_with?(word)
}.map{ |word|
find_words(text[word.size..-1], possible_words, sentence + [word])
}.find(&:itself)
end
p find_words(text, dictionary)
#=> ["blueberry", "muffins", "are", "insanely", "delicious", "because", "they're", "really", "moist", "and", "colorful"]
p find_words('someword', %w(no way to find a combination))
#=> nil
p find_words('culdesac', %w(culd no way to find a combination cul de sac))
#=> ["cul", "de", "sac"]
p find_words("carrotate", dictionary)
#=> ["carrot", "ate"]
For a faster lookup, it could be a good idea to use a Trie.
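A minimal sketch of what the trie version could look like, assuming a trie built from nested hashes with an :end marker for complete words. The helper names (build_trie, prefixes, segment) and the tiny word list are illustrative, not from the answer above.

```ruby
# Build a trie out of nested hashes; :end marks the end of a complete word.
def build_trie(words)
  words.each_with_object({}) do |word, root|
    node = word.each_char.inject(root) { |n, ch| n[ch] ||= {} }
    node[:end] = true
  end
end

# Return every dictionary word that text starts with, longest first.
def prefixes(trie, text)
  found = []
  node = trie
  text.each_char.with_index do |ch, i|
    node = node[ch]
    break unless node
    found << text[0..i] if node[:end]
  end
  found.reverse
end

# Same backtracking search as before, but candidates come from one trie walk
# instead of a scan over the whole dictionary.
def segment(text, trie, sentence = [])
  return sentence if text.empty?
  prefixes(trie, text).each do |word|
    result = segment(text[word.size..-1], trie, sentence + [word])
    return result if result
  end
  nil
end

trie = build_trie(%w[blue berry blueberry muffin muffins fin fins are])
p segment("blueberrymuffinsare", trie)
```

Finding the candidate words at each position now costs at most one step per character of the remaining text. The search itself can still backtrack a lot on awkward inputs; memoizing failed suffixes would be the next step.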

Find anagram words using Regexp in Ruby

An anagram group is a group of words such that any one can be converted into any other just by rearranging the letters. For example, "rats", "tars" and "star" are an anagram group.
Now I have an array of words and I want to find the anagrams among them. To do this I have written the following code. It works for some words, like scar and cars, but it doesn't work for [scar, carts].
temp=[]
words.each do |e|
temp=e.split(//) # make an array of letters
words.each do |z|
if z.match(/#{temp}/) # match to find scar and cars
puts "exp is True"
else
puts "exp is false"
end
end
end
My thinking was that, since [abc] means a or b or c, I could split my words into letters and then look for the other cases in the array.
Your algorithm is incorrect and inefficient (quadratic time complexity). Why regex?
Here's another idea. Define the signature of a word such that all the letters of a word are sorted. For example, the signature of hello is ehllo.
By this definition, anagrams are words that have the same signature, for example, rats, tars and star all have the signature arst. The code to implement this idea is straight-forward.
Two words are anagrams if they contain the same letters. There are several ways to figure out whether they do, the most obvious one is sorting the letters alphabetically. Then you want to separate the words into groups. Here's an idea:
words = %w[cats scat rats tars star scar cars carts]
words.group_by {|word| word.each_char.sort }.values
# => [['cats', 'scat'], ['rats', 'tars', 'star'], ['scar', 'cars'], ['carts']]
The problem is that /#{e.split(//)}/ here is pretty much nonsensical.
To illustrate this, let's see what happens:
word = 'wtf'
letters = word.split(//) # => ["w", "t", "f"]
regex = /#{letters}/ # => /["w", "t", "f"]/
'"'.match(regex) # => 0
','.match(regex) # => 0
' '.match(regex) # => 0
't'.match(regex) # => 0
What happens is that interpolating an object into a regex inserts the result of its to_s method. And since a character set matches any single character listed inside it, you get a regex that matches a double quote, a comma, a space, or any of the letters in the original word.
Therefore, I will unfortunately call your solution unsalvageable.
A very easy way to check if two words are anagrams is to sort their characters and see if the result is the same.
The faster way would be:
def is_anagram? w1, w2
w1.chars.sort == w2.chars.sort
end
You could also do something like this I suppose:
def is_anagram? w1, w2
w2 = w2.chars
w1.chars.permutation.to_a.include?(w2)
end
then run it like this:
is_anagram? "rats", "star"
=> true
Note:
This post has been edited as per Cary Swoveland's advice.
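The permutation check grows factorially with word length, so it is only practical for very short words. As a hedged alternative that avoids both sorting and permutations, two words can be compared by their per-letter counts. The method names here are illustrative.

```ruby
# Count how often each character occurs in the word.
def letter_counts(word)
  word.each_char.with_object(Hash.new(0)) { |ch, counts| counts[ch] += 1 }
end

# Two words are anagrams when they use exactly the same letters
# the same number of times.
def anagram_by_counts?(w1, w2)
  w1.length == w2.length && letter_counts(w1) == letter_counts(w2)
end

p anagram_by_counts?("rats", "star")
p anagram_by_counts?("scar", "carts")
```

The first call returns true and the second false, since "carts" has an extra letter.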
words = ['demo', 'none', 'tied', 'evil', 'dome', 'mode', 'live',
'fowl', 'veil', 'wolf', 'diet', 'vile', 'edit', 'tide',
'flow', 'neon']
groups = words.group_by { |word| word.split('').sort }
groups.each { |x, y| p y }

Searching for single words and combination words in Ruby

I want my output to search and count the frequency of the words "candy" and "gram", but also the combinations of "candy gram" and "gram candy," in a given text (whole_file.)
I am currently using the following code to display the occurrences of "candy" and "gram," but when I aggregate the combinations within the %w, only the word and frequencies of "candy" and "gram" display. Should I try a different way? thanks so much.
myArray = whole_file.split
stop_words= %w{ candy gram 'candy gram' 'gram candy' }
nonstop_words = myArray - stop_words
key_words = myArray - nonstop_words
frequency = Hash.new (0)
key_words.each { |word| frequency[word] +=1 }
key_words = frequency.sort_by {|x,y| x }
key_words.each { |word, frequency| puts word + ' ' + frequency.to_s }
It sounds like you're after n-grams. You could break the text into combinations of consecutive words in the first place, and then count the occurrences in the resulting array of word groupings. Here's an example:
whole_file = "The big fat man loves a candy gram but each gram of candy isn't necessarily gram candy"
[["candy"], ["gram"], ["candy", "gram"], ["gram", "candy"]].each do |term|
terms = whole_file.split(/\s+/).each_cons(term.length).to_a
puts "#{term.join(" ")} #{terms.count(term)}"
end
EDIT: As was pointed out in the comments below, I wasn't paying close enough attention and was splitting the file on each loop, which is obviously not a good idea, especially if it's large. I also hadn't accounted for the fact that the original question may have needed the results sorted by count, although that wasn't explicitly asked.
whole_file = "The big fat man loves a candy gram but each gram of candy isn't necessarily gram candy"
# This is simplistic. You would need to address punctuation and other characters before
# or at this step.
split_file = whole_file.split(/\s+/)
terms_to_count = [["candy"], ["gram"], ["candy", "gram"], ["gram", "candy"]]
counts = []
terms_to_count.each do |term|
terms = split_file.each_cons(term.length).to_a
counts << [term.join(" "), terms.count(term)]
end
# Seemed like you may need to do sorting too, so here that is:
sorted = counts.sort { |a, b| b[1] <=> a[1] }
sorted.each do |count|
puts "#{count[0]} #{count[1]}"
end
Strip punctuation and convert to lower-case
The first thing you probably want to do is remove all punctuation from the string holding the contents of the file and then convert what's left to lower case, the latter so that 'Cat' and 'cat' are counted as the same word. Those two operations can be done in either order.
Changing upper-case letters to lower-case is easy:
text = whole_file.downcase
To remove the punctuation it is probably easier to decide what to keep rather than what to discard. If we only want to keep lower-case letters and whitespace (the latter so that the words stay separated), we can do this:
text = whole_file.downcase.gsub(/[^a-z\s]/, '')
That is, substitute an empty string for all characters other than (^) lowercase letters and whitespace.1
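Here is a short sketch of that normalization step on a made-up sample string. It assumes whitespace is kept so that split can still separate the words.

```ruby
whole_file = "Candy gram, candy GRAM; gram. Candy!"

# Lower-case everything, then drop anything that is not a letter or whitespace.
text  = whole_file.downcase.gsub(/[^a-z\s]/, '')
words = text.split

p text
p words
```

One side effect to be aware of: apostrophes are removed too, so a word like "isn't" becomes "isnt".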
Determine frequency of individual words
If you want to count the number of times text contains the word 'candy', you can use the method String#scan on the string text and then determine the size of the array that is returned:
text.scan(/\bcandy\b/).size
scan returns an array with every occurrence of the string 'candy'; .size returns the size of that array. Here \b ensures 'candy' has a word "boundary" at each end, which could be whitespace or the beginning or end of a line or the file. That's to prevent 'candycane' from being counted.
A second way is to convert the string text to an array of words, as you have done2:
myArray = text.split
If you don't mind, I'd like to call this:
words = text.split
as I find that more expressive.3
The most direct way to determine the number of times 'candy' appears is to use the method Array#count, like this:
words.count('candy')
You can also use the array difference method, Array#-, as you noted:
words.size - (words - ['candy']).size
If you wish to know the number of times either 'candy' or 'gram' appears, you could of course do the above for each and sum the two counts. Some other ways are:
words.size - (words - ['candy', 'gram']).size
words.count { |word| word == 'candy' || word == 'gram' }
words.count { |word| ['candy', 'gram'].include?(word) }
Determine the frequency of all words that appear in the text
Your use of a hash with a default value of zero was a good choice:
def frequency_of_all_words(words)
frequency = Hash.new(0)
words.each { |word| frequency[word] +=1 }
frequency
end
I wrote this as a method to emphasize that words.each... does not return frequency. Often you would see this written more compactly using the method Enumerable#each_with_object, which returns the hash ("object"):
def frequency_of_all_words(words)
words.each_with_object(Hash.new(0)) { |word, h| h[word] +=1 }
end
Once you have the hash frequency you can sort it as you did:
frequency.sort_by {|word, freq| freq }
or
frequency.sort_by(&:last)
which you could write:
frequency.sort_by {|_, freq| freq }
since you aren't using the first block variable. If you wanted the most frequent words first:
frequency.sort_by(&:last).reverse
or
frequency.sort_by {|_, freq| -freq }
All of these will give you an array. If you want to convert it back to a hash (with the largest values first, say):
Hash[frequency.sort_by(&:last).reverse]
or in Ruby 2.1+,
frequency.sort_by(&:last).reverse.to_h
Count the number of times a substring appears
Now let's count the number of times the string 'candy gram' appears. You might think we could use String#scan on the string holding the entire file, as we did earlier4:
text.scan(/\bcandy gram\b/).size
The first problem is that this won't catch 'candy\ngram'; i.e., when the words are separated by a newline character. We could fix that by changing the regex to /\bcandy\sgram\b/. A second problem is that 'candy gram' might have been 'candy. Gram' in the file, in which case you might not want to count it.
A better way is to use the method Enumerable#each_cons on the array words. The easiest way to show you how that works is by example:
words = %w{ check for candy gram here candy gram again }
#=> ["check", "for", "candy", "gram", "here", "candy", "gram", "again"]
enum = words.each_cons(2)
#=> #<Enumerator: ["check", "for", "candy", "gram", "here", "candy",
# "gram", "again"]:each_cons(2)>
enum.to_a
#=> [["check", "for"], ["for", "candy"], ["candy", "gram"],
# ["gram", "here"], ["here", "candy"], ["candy", "gram"],
# ["gram", "again"]]
each_cons(2) returns an enumerator; I've converted it to an array to display its contents.
So we can write
words.each_cons(2).map { |word_pair| word_pair.join(' ') }
#=> ["check for", "for candy", "candy gram", "gram here",
# "here candy", "candy gram", "gram again"]
and lastly:
words.each_cons(2).map { |word_pair|
word_pair.join(' ') }.count { |s| s == 'candy gram' }
#=> 2
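Putting the pieces together, here is a hedged sketch that counts single words and multi-word phrases with the same each_cons approach. The method name phrase_counts and the sample word list are illustrative.

```ruby
# For each phrase, slide a window of the phrase's word count over the words
# and count the windows that match exactly.
def phrase_counts(words, phrases)
  phrases.each_with_object({}) do |phrase, counts|
    target = phrase.split
    counts[phrase] = words.each_cons(target.size).count(target)
  end
end

words = %w[check for candy gram here candy gram again gram candy]
counts = phrase_counts(words, ['candy', 'gram', 'candy gram', 'gram candy'])
p counts
```

Single words are just the window-size-1 case, so "candy" and "candy gram" go through the same code path.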
1 If you also wanted to keep dashes, for hyphenated words, change the regex to /[^-a-z\s]/ or /[^a-z\s-]/.
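2 Note from String#split that .split is the same as .split(' '). It is close to .split(/\s+/), except that the regex form returns a leading empty string when the text starts with whitespace.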
3 Also, Ruby's naming convention is to use lower-case letters and underscores ("snake_case") for variables and methods, such as my_array.

Given a list of strings, how can I determine what the shortest length of differentiation is?

Say I have an array of hash strings, e.g.
['a04a872ff4027233', '8cef496d2a92808c', etc.]
I would like an elegant way to determine what is the shortest uniform-length substring I can use to differentiate between the alternatives.
E.g. if the shortest length substring is 3, then the options could be abbreviated to ['a04', '8ce', etc], and I could then just expand the abbreviation later.
I need a solution in Ruby.
(1...s.first.size).find {|i| !s.map {|j| j[0...i]}.uniq!}
Not very "elegant" per se, but this would work:
strings = ['a04a872ff4027233', '8cef496d2a92808c', .....]
count = 1
count += 1 while strings.map{ |item| item[0...count] }.uniq.length != strings.length
count
# => 3
strings.map{ |item| item[0...count] }
# => ['a04', '8ce', ...]
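Both snippets above assume the strings are distinct; with a duplicate in the list the while loop never terminates. A hedged sketch that wraps the same idea in a method with that guard follows. The method name and the third sample hash are made up for illustration.

```ruby
# Smallest n such that the first n characters of every string are all different.
def shortest_unique_prefix_length(strings)
  unless strings.uniq.size == strings.size
    raise ArgumentError, "strings must be distinct"
  end

  max = strings.map(&:size).max.to_i
  (1..max).find { |n| strings.map { |s| s[0, n] }.uniq.size == strings.size }
end

hashes = %w[a04a872ff4027233 8cef496d2a92808c a04b112233445566]
n = shortest_unique_prefix_length(hashes)
p n
p hashes.map { |s| s[0, n] }
```

Here the first and third strings share "a04", so the abbreviation has to be four characters long.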
