Most common words in string - ruby

I am new to Ruby and trying to write a method that will return an array of the most common word(s) in a string. If there is one word with a high count, that word should be returned. If there are two words tied for the high count, both should be returned in an array.
The problem is that when I pass through the 2nd string, the code only counts "words" twice instead of three times. When the 3rd string is passed through, it returns "it" with a count of 2, which makes no sense, as "it" should have a count of 1.
def most_common(string)
counts = {}
words = string.downcase.tr(",.?!",'').split(' ')
words.uniq.each do |word|
counts[word] = 0
end
words.each do |word|
counts[word] = string.scan(word).count
end
max_quantity = counts.values.max
max_words = counts.select { |k, v| v == max_quantity }.keys
puts max_words
end
most_common('a short list of words with some words') #['words']
most_common('Words in a short, short words, lists of words!') #['words']
most_common('a short list of words with some short words in it') #['words', 'short']

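The problem is how you count occurrences: string.scan(word) matches substrings, so "it" is also found inside "with" and gets counted twice. It also scans the original string rather than the downcased one, which is why the capitalized "Words" in your 2nd string is missed.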
[1] pry(main)> 'with some words in it'.scan('it')
=> ["it", "it"]
There is an easier way, though: you can build a hash of word counts from the array with an each_with_object call, like so:
counts = words.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
This goes through each entry in the array and adds 1 to the value for each word's entry in the hash.
So the following should work for you:
def most_common(string)
words = string.downcase.tr(",.?!",'').split(' ')
counts = words.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
max_quantity = counts.values.max
counts.select { |k, v| v == max_quantity }.keys
end
p most_common('a short list of words with some words') #['words']
p most_common('Words in a short, short words, lists of words!') #['words']
p most_common('a short list of words with some short words in it') #['short', 'words']
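On Ruby 2.7 or later, the counting step could also be sketched with Enumerable#tally. The method name most_common_tally is made up for this illustration, and the punctuation handling simply mirrors the code above.

```ruby
# Hypothetical variant of most_common; requires Ruby 2.7+ for Enumerable#tally.
def most_common_tally(string)
  # Whole-word counts, e.g. {"a"=>1, "short"=>2, ...}
  counts = string.downcase.tr(",.?!", '').split.tally
  max_quantity = counts.values.max
  # Keep every word that reaches the highest count.
  counts.select { |_word, count| count == max_quantity }.keys
end

p most_common_tally('a short list of words with some short words in it')
#=> ["short", "words"]
```

Because each word is counted as a whole array element, "it" is no longer matched inside "with".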

As Nick has answered your question, I will just suggest another way this can be done. As "high count" is vague, I suggest you return a hash with downcased words and their respective counts. Since Ruby 1.9, hashes retain the order that key-value pairs have been entered, so we may want to make use of that and return the hash with key-value pairs ordered in decreasing order of values.
Code
def words_by_count(str)
str.gsub(/./) do |c|
case c
when /\w/ then c.downcase
when /\s/ then c
else ''
end
end.split
.group_by {|w| w}
.map {|k,v| [k,v.size]}
.sort_by(&:last)
.reverse
.to_h
end
words_by_count('Words in a short, short words, lists of words!')
The method Array#to_h was introduced in Ruby 2.1. For earlier Ruby versions, one must use:
Hash[str.gsub(/./)... .reverse]
Example
words_by_count('a short list of words with some words')
#=> {"words"=>2, "of"=>1, "some"=>1, "with"=>1,
# "list"=>1, "short"=>1, "a"=>1}
words_by_count('Words in a short, short words, lists of words!')
#=> {"words"=>3, "short"=>2, "lists"=>1, "a"=>1, "in"=>1, "of"=>1}
words_by_count('a short list of words with some short words in it')
#=> {"words"=>2, "short"=>2, "it"=>1, "with"=>1,
# "some"=>1, "of"=>1, "list"=>1, "in"=>1, "a"=>1}
Explanation
Here is what's happening in the second example, where:
str = 'Words in a short, short words, lists of words!'
str.gsub(/./) do |c|... matches each character in the string and sends it to the block to decide what to do with it. As you can see, word characters are downcased, whitespace is left alone and everything else is removed (replaced with an empty string).
s = str.gsub(/./) do |c|
case c
when /\w/ then c.downcase
when /\s/ then c
else ''
end
end
#=> "words in a short short words lists of words"
This is followed by
a = s.split
#=> ["words", "in", "a", "short", "short", "words", "lists", "of", "words"]
h = a.group_by {|w| w}
#=> {"words"=>["words", "words", "words"], "in"=>["in"], "a"=>["a"],
# "short"=>["short", "short"], "lists"=>["lists"], "of"=>["of"]}
b = h.map {|k,v| [k,v.size]}
#=> [["words", 3], ["in", 1], ["a", 1], ["short", 2], ["lists", 1], ["of", 1]]
c = b.sort_by(&:last)
#=> [["of", 1], ["in", 1], ["a", 1], ["lists", 1], ["short", 2], ["words", 3]]
d = c.reverse
#=> [["words", 3], ["short", 2], ["lists", 1], ["a", 1], ["in", 1], ["of", 1]]
d.to_h # or Hash[d]
#=> {"words"=>3, "short"=>2, "lists"=>1, "a"=>1, "in"=>1, "of"=>1}
Note that c = b.sort_by(&:last), d = c.reverse can be replaced by:
d = b.sort_by { |_,k| -k }
#=> [["words", 3], ["short", 2], ["a", 1], ["in", 1], ["lists", 1], ["of", 1]]
but sort followed by reverse is generally faster.
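As a rough sketch of the two orderings just mentioned (the helper names are hypothetical), both list the pairs from the highest count to the lowest. The relative order of words with equal counts may differ between them, as the two outputs above already show.

```ruby
# Hypothetical helpers comparing the two ways of ordering [word, count] pairs.
def by_count_reversed(pairs)
  pairs.sort_by(&:last).reverse
end

def by_negated_count(pairs)
  pairs.sort_by { |_word, count| -count }
end

pairs = [["words", 3], ["in", 1], ["a", 1], ["short", 2], ["lists", 1], ["of", 1]]

# Both orderings list the counts from highest to lowest.
p by_count_reversed(pairs).map(&:last)  #=> [3, 2, 1, 1, 1, 1]
p by_negated_count(pairs).map(&:last)   #=> [3, 2, 1, 1, 1, 1]
```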

def count_words string
word_list = Hash.new(0)
words = string.downcase.delete(',.?!').split
words.each { |word| word_list[word] += 1 }
word_list
end
def most_common_words string
hash = count_words string
max_value = hash.values.max
hash.select { |k, v| v == max_value }.keys
end
most_common_words 'a short list of words with some words'
#=> ["words"]
most_common_words 'Words in a short, short words, lists of words!'
#=> ["words"]
most_common_words 'a short list of words with some short words in it'
#=> ["short", "words"]

Assuming string is a string containing multiple words.
words = string.split(/[.!?,\s]/)
words.sort_by{|x|words.count(x)}
Here we split the string into an array of words. We then sort the array by how many times each word occurs in it, so the most common words appear at the end. Note that adjacent separators such as ", " produce empty strings in the array.
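A minimal sketch of that idea, assuming the input should be lowercased first and that a run of punctuation or whitespace counts as a single separator (both are assumptions, as is the method name). Calling words.count once per word makes this quadratic in the number of words, so it suits short strings.

```ruby
# Hypothetical helper: unique words ordered from least to most frequent.
def words_by_frequency(string)
  # reject(&:empty?) drops the empty field left by a leading separator
  words = string.downcase.split(/[.!?,\s]+/).reject(&:empty?)
  words.uniq.sort_by { |word| words.count(word) }
end

sorted = words_by_frequency('Words in a short, short words, lists of words!')
p sorted.last  #=> "words"
```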

The same thing can be done in the following way too:
def most_common(string)
counts = Hash.new 0
string.downcase.tr(",.?!",'').split(' ').each{|word| counts[word] += 1}
# For "Words in a short, short words, lists of words!"
# counts ---> {"words"=>3, "in"=>1, "a"=>1, "short"=>2, "lists"=>1, "of"=>1}
max_value = counts.values.max
#max_value ---> 3
return counts.select { |key, value| value == max_value }
#returns ---> {"words"=>3}
end
This is just a shorter solution, which you might want to use. Hope it helps :)

This is the kind of question programmers love, isn't it :) How about a functional approach?
# returns array of words after removing certain English punctuation marks
def english_words(str)
str.downcase.delete(',.?!').split
end
# returns hash mapping element to count
def element_counts(ary)
ary.group_by { |e| e }.inject({}) { |a, e| a.merge(e[0] => e[1].size) }
end
def most_common(ary)
ary.empty? ? nil :
element_counts(ary)
.group_by { |k, v| v }
.sort
.last[1]
.map(&:first)
end
most_common(english_words('a short list of words with some short words in it'))
#=> ["short", "words"]

def firstRepeatedWord(string)
h_data = Hash.new(0)
string.split(" ").each{|x| h_data[x] +=1}
h_data.key(h_data.values.max)
end

def common(string)
counts=Hash.new(0)
words=string.downcase.delete('.,!?').split(" ")
words.each {|k| counts[k]+=1}
p counts.sort_by { |_word, count| count }.reverse[0] # sort by count, not alphabetically by word
end

Related

Counting number of string occurrences in a string/array

I'm expecting to return all words with the max occurrences in a given string. The following code is expected to do so:
t1 = "This is a really really really cool experiment cool really "
frequency = Hash.new(0)
words = t1.split
words.each { |word| frequency[word.downcase] += 1 }
frequency = frequency.map.max_by { |k, v| v }
puts "The words with the most frequencies is '#{frequency[0]}' with
a frequency of #{frequency[1]}."
The output is:
The words with the most frequencies is 'really' with
a frequency of 4.
However, it does not work if, for example, two words are tied for the maximum. If I add more cools to the text so that the count of cool is also four, it still returns the same output.
It would also be nice to know whether these methods work on an array instead of a string.
Try this.
t1 = "This is a really really really cool cool cool"
Step 1: Break your string into an array of words
words = t1.split
#=> ["This", "is", "a", "really", "really", "really", "cool", "cool", "cool"]
Step 2: Compute your frequency hash
frequency = Hash.new(0)
words.each { |word| frequency[word.downcase] += 1 }
frequency
#=> {"this"=>1, "is"=>1, "a"=>1, "really"=>3, "cool"=>3}
Step 3: Determine the maximum frequency
arr = frequency.max_by { |k, v| v }
#=> ["really", 3]
max_frequency = arr.last
#=> 3
Step 4: Create an array containing words having a frequency of max_frequency
arr = frequency.select { |k, v| v == max_frequency }
#=> {"really"=>3, "cool"=>3}
arr.map { |k, v| k }
#=> ["really", "cool"]
Conventional way of writing this in Ruby
words = t1.split
#=> ["This", "is", "a", "really", "really", "really", "cool", "cool", "cool"]
frequency = words.each_with_object(Hash.new(0)) do |word, f|
f[word.downcase] += 1
end
#=> {"this"=>1, "is"=>1, "a"=>1, "really"=>3, "cool"=>3}
max_frequency = frequency.max_by(&:last).last
#=> 3
frequency.select { |k, v| v == max_frequency }.map(&:first)
#=> ["really", "cool"]
Notes
e = [1,2,3].map #=> #<Enumerator: [1, 2, 3]:map>. This tells us that frequency.map.max_by { |k,v| v } is the same as frequency.max_by { |k,v| v }.
In frequency = frequency.map.max_by {|k, v| v }, frequency on the right is a hash; frequency on the left is an array. It's generally considered bad practice to reuse variables in that way.
Often frequency.max_by { |k,v| v } is written frequency.max_by { |_,v| v } or frequency.max_by { |_k,v| v }, mainly to signal to the reader that the first block variable is not used in the block calculation. (As I indicated above, this statement would generally be written frequency.max_by(&:last).) Note _ is a valid local variable.
frequency.max_by { |k, v| v }.last could instead be written frequency.map { |k, v| v }.max, but that has the disadvantage that map produces an intermediate array of frequency.size elements, whereas the former produces an intermediate array of two elements.
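The equivalences in these notes can be checked with a short sketch. The helper sample_frequency is hypothetical and just returns the hash computed above; frequency.values.max is an additional spelling not used in the answer.

```ruby
# Sample data matching the frequency hash above.
def sample_frequency
  { "this" => 1, "is" => 1, "a" => 1, "really" => 3, "cool" => 3 }
end

frequency = sample_frequency

# map without a block returns an Enumerator, so chaining max_by onto it
# gives the same result as calling max_by on the hash directly.
p frequency.map.max_by { |_, v| v } == frequency.max_by { |_, v| v }  #=> true

# Three ways to get the maximum count.
p frequency.max_by(&:last).last   #=> 3
p frequency.map { |_, v| v }.max  #=> 3
p frequency.values.max            #=> 3
```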
You've already found the greatest frequency (max_by returns a [word, count] pair, so take its last element):
greatest_frequency = frequency.max_by { |_, v| v }.last
Let's use it to find all the words which have this frequency:
most_frequent_words = frequency.select { |_, v| v == greatest_frequency }.keys
puts "The words with the most frequencies are #{most_frequent_words.join(', ')} with a frequency of #{greatest_frequency}."
string = 'This is is a really a really a really cool cool experiment a cool cool really'
1). Separate string into array of words
words = string.split.map(&:downcase)
2). Calculate maximum frequency based on unique words
max_frequency = words.uniq.map { |i| words.count(i) }.max
3). Find combinations of word and frequency
combos = words.group_by { |e| e }.map { |k, v| [k, v.size] }.to_h
4). Select most frequent words
most_frequent_words = combos.select { |_, v| v == max_frequency }.keys
Result
puts "The words with the most frequencies are '#{most_frequent_words.join(', ')}' with a frequency of #{max_frequency}."
#=> The words with the most frequencies are 'a, really, cool' with a frequency of 4.
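The question also asks whether this works for an array of words. A hedged sketch: the hypothetical method below accepts either a string or an array of strings and otherwise follows the same steps. Downcasing array elements is an assumption carried over from the string case.

```ruby
# Hypothetical helper: accepts a String or an Array of words.
def most_frequent_words(input)
  words = input.is_a?(String) ? input.split : input
  words = words.map(&:downcase)
  # Map each word to the number of times it occurs.
  counts = words.group_by { |e| e }.map { |k, v| [k, v.size] }.to_h
  max_frequency = counts.values.max
  counts.select { |_, v| v == max_frequency }.keys
end

p most_frequent_words('This is a really really really cool cool cool')
#=> ["really", "cool"]
p most_frequent_words(%w[cool Cool really])
#=> ["cool"]
```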

How do I find the first string of differing case in an array?

I have an array of strings, and all contain at least one letter:
["abc", "FFF", "EEE"]
How do I find the index of the first string that is of a different case than any previous string in the array? The function should give 1 for the above since:
"FFF".eql?("FFF".upcase)
and that condition isn't true for any previous string in the array, whereas:
["P", "P2F", "ccc", "DDD"]
should yield 2 since "ccc" is not capitalized and all its predecessors are.
I know how to find the first string that is capitalized using
string_tokens.find_index { |w| w == w.upcase }
but I can't figure out how to adjust the above to account for differing case.
You could take each consecutive pair of elements and compare their upcase-ness. When they differ, you return the index.
def detect_case_change(ary)
ary.each_cons(2).with_index(1) do |(a, b), idx|
return idx if (a == a.upcase) != (b == b.upcase)
end
nil
end
detect_case_change ["abc", "FFF", "EEE"] # => 1
detect_case_change ["P", "P2F", "ccc", "DDD"] # => 2
This makes some assumptions about your data being composed entirely of 'A'..'Z' and 'a'..'z':
def find_case_mismatch(list)
index = list.each_cons(2).to_a.index do |(a,b)|
a[0].ord & 32 != b[0].ord & 32
end
index && index + 1
end
This compares the character values. 'A' differs from 'a' by one bit, and that bit is always in the same place (0x20).
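The bit trick can be illustrated with a few values. This is only a sketch for ASCII characters, and lowercase_bit is a hypothetical helper name.

```ruby
# Bit 0x20 (decimal 32) is clear for 'A'..'Z' and set for 'a'..'z' in ASCII.
def lowercase_bit(char)
  char.ord & 0x20
end

p lowercase_bit('A')  #=> 0
p lowercase_bit('a')  #=> 32
p 'A'.ord ^ 'a'.ord   #=> 32 (the two codes differ only in that bit)

# Digits and punctuation do not follow the pattern, hence the assumption
# that every string starts with an ASCII letter.
p lowercase_bit('2')  #=> 32
```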
Enumerable#chunk helps a lot for this task.
Enumerates over the items, chunking them together based on the return
value of the block. Consecutive elements which return the same block
value are chunked together.
l1 = ["abc", "FFF", "EEE"]
l2 = ["P", "P2F", "ccc", "DDD"]
p l1.chunk{|s| s == s.upcase }.to_a
# [[false, ["abc"]], [true, ["FFF", "EEE"]]]
p l2.chunk{|s| s == s.upcase }.to_a
# [[true, ["P", "P2F"]], [false, ["ccc"]], [true, ["DDD"]]]
The fact that you need an index makes it a bit less readable, but here's an example. The desired index (if it exists) is the size of the first chunk:
p l1.chunk{|s| s == s.upcase }.first.last.size
# 1
p l2.chunk{|s| s == s.upcase }.first.last.size
# 2
If the case doesn't change at all, it returns the length of the whole array:
p %w(aaa bbb ccc ddd).chunk{|s| s == s.upcase }.first.last.size
# 4
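If returning nil is preferable when the case never changes, the chunk approach could be wrapped as in the sketch below. The method name case_change_index and the nil convention are assumptions, not part of the answer above.

```ruby
# Hypothetical wrapper around the chunk approach that returns nil
# when the case never changes.
def case_change_index(list)
  return nil if list.empty?
  # Size of the first run of same-case strings.
  first_run = list.chunk { |s| s == s.upcase }.first.last.size
  first_run == list.size ? nil : first_run
end

p case_change_index(["abc", "FFF", "EEE"])       #=> 1
p case_change_index(["P", "P2F", "ccc", "DDD"])  #=> 2
p case_change_index(%w(aaa bbb ccc ddd))         #=> nil
```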
I assume that each element (string) of the array contains at least one letter and only letters of the same case.
def first_case_change_index(arr)
s = arr.map { |s| s[/[[:alpha:]]/] }.join
(s[0] == s[0].upcase ? s.swapcase : s) =~ /[[:upper:]]/
end
first_case_change_index ["abc", "FFF", "EEE"] #=> 1
first_case_change_index ["P", "P2F", "ccc"] #=> 2
first_case_change_index ["P", "P2F", "DDD"] #=> nil
The steps are as follows.
arr = ["P", "2PF", "ccc"]
a = arr.map { |s| s[/[[:alpha:]]/] }
#=> ["P", "P", "c"]
s = a.join
#=> "PPc"
s[0] == s[0].upcase
#=> "P" == "P"
#=> true
t = s.swapcase
#=> "ppC"
t =~ /[[:upper:]]/
#=> 2
Here is another way.
def first_case_change_index(arr)
look_for_upcase = (arr[0] == arr[0].downcase)
arr.each_index.drop(1).find do |i|
(arr[i] == arr[i].upcase) == look_for_upcase
end
end

Search a hash for string with most vowels

So let's say I have a hash full of strings as the values. How would I make a method that will search the hash and return the string with the most vowels in it?
I suggest you use Enumerable#max_by and String#count.
def most_vowel_laden(h)
h.values.max_by { |str| str.count('aeiouAEIOU') }
end
keys = [1, 2, 3, 4, 5, 6]
h = keys.zip(%w| It was the best of times |).to_h
#=> {1=>"It", 2=>"was", 3=>"the", 4=>"best", 5=>"of", 6=>"times"}
most_vowel_laden h
#=> "times"
h = keys.zip(%w| by my dry fly why supercalifragilisticexpialidocious |).to_h
#=> {1=>"by", 2=>"my", 3=>"dry", 4=>"fly", 5=>"why",
# 6=>"supercalifragilisticexpialidocious"}
most_vowel_laden h
#=> "supercalifragilisticexpialidocious"
Alternatively,
def most_vowel_laden(h)
h.max_by { |_,str| str.count('aeiouAEIOU') }.last
end
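String#count treats its argument as a set of characters, so 'aeiouAEIOU' counts vowels of either case. Since max_by returns a single element, a sketch like the following could be used when every string tied for the most vowels is wanted. The method name and the sample data are invented for illustration.

```ruby
# Hypothetical helper returning every value tied for the most vowels.
def most_vowel_laden_all(h)
  best = h.values.map { |str| str.count('aeiouAEIOU') }.max
  h.values.select { |str| str.count('aeiouAEIOU') == best }
end

p 'Audio'.count('aeiouAEIOU')  #=> 4
h = { 1 => "It", 2 => "was", 3 => "the", 4 => "best", 5 => "of", 6 => "times" }
p most_vowel_laden_all(h)      #=> ["times"]
p most_vowel_laden_all({ a: "tree", b: "moon", c: "sky" })  #=> ["tree", "moon"]
```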
result = nil
max = 0
# hash is your hash with strings
hash.values.each do |value|
vowels = value.scan(/[aeiouy]/).size
if vowels > max
max = vowels
result = value
end
end
puts result

How to write a method that finds the most common letter in a string?

This is the question prompt:
Write a method that takes in a string. Your method should return the most common letter in the array, and a count of how many times it appears.
I'm not entirely sure where to go with what I have so far.
def most_common_letter(string)
arr1 = string.chars
arr2 = arr1.max_by(&:count)
end
I suggest you use a counting hash:
str = "The quick brown dog jumped over the lazy fox."
str.downcase.gsub(/[^a-z]/,'').
each_char.
with_object(Hash.new(0)) { |c,h| h[c] += 1 }.
max_by(&:last)
#=> ["e",4]
Hash::new with an argument of zero creates an empty hash whose default value is zero.
The steps:
s = str.downcase.gsub(/[^a-z]/,'')
#=> "thequickbrowndogjumpedoverthelazyfox"
enum0 = s.each_char
#=> #<Enumerator: "thequickbrowndogjumpedoverthelazyfox":each_char>
enum1 = enum0.with_object(Hash.new(0))
#=> #<Enumerator: #<Enumerator:
# "thequickbrowndogjumpedoverthelazyfox":each_char>:with_object({})>
You can think of enum1 as a "compound" enumerator. (Study the return value above.)
Let's see the elements of enum1:
enum1.to_a
#=> [["t", {}], ["h", {}], ["e", {}], ["q", {}],..., ["x", {}]]
The first element of enum1 (["t", {}]) is passed to the block by String#each_char and assigned to the block variables:
c,h = enum1.next
#=> ["t", {}]
c #=> "t"
h #=> {}
The block calculation is then performed:
h[c] += 1
#=> h[c] = h[c] + 1
#=> h["t"] = h["t"] + 1
#=> h["t"] = 0 + 1 #=> 1
h #=> {"t"=>1}
Ruby expands h[c] += 1 to h[c] = h[c] + 1, which is h["t"] = h["t"] + 1. As h #=> {}, h has no key "t", so h["t"] on the right side of the equal sign is replaced by the hash's default value, 0. The next time c #=> "t", h["t"] = h["t"] + 1 will reduce to h["t"] = 1 + 1 #=> 2 (i.e., the default value will not be used, as h now has a key "t").
The next value of enum1 is then passed into the block and the block calculation is performed:
c,h = enum1.next
#=> ["h", {"t"=>1}]
h[c] += 1
#=> 1
h #=> {"t"=>1, "h"=>1}
The remaining elements of enum1 are processed similarly.
A simple way to do that, without worrying about checking empty letters:
letter, count = ('a'..'z')
.map {|letter| [letter, string.count(letter)] }
.max_by(&:last)
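Wrapped in a method, the range-based approach might look like the sketch below. The method name is hypothetical, and downcasing first is an assumption so that upper- and lower-case letters are counted together. A string with no letters yields a count of 0.

```ruby
# Hypothetical wrapper around the ('a'..'z') approach.
def most_common_letter_by_range(string)
  lowered = string.downcase
  # Pair each letter with its number of occurrences, then take the largest.
  ('a'..'z').map { |letter| [letter, lowered.count(letter)] }.max_by(&:last)
end

letter, count = most_common_letter_by_range('Hello, World')
puts "#{letter}: #{count}"  # prints "l: 3"
```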
Here is another way of doing what you want:
str = 'aaaabbbbcd'
h = str.each_char.with_object(Hash.new(0)) { |c,h| h[c] += 1 }
max = h.values.max
output_hash = Hash[h.select { |k, v| v == max}]
puts "most_frequent_value: #{max}"
puts "most frequent character(s): #{output_hash.keys}"
def most_common_letter(string)
string.downcase.scan(/[a-z]/).group_by(&:itself).map { |k, v| [k, v.size] }.max_by(&:last) # scan keeps letters only, so spaces are not counted
end
Edit:
Using hash:
def most_common_letter(string)
chars = {}
most_common = nil
most_common_count = 0
string.downcase.gsub(/[^a-z]/, '').each_char do |c|
count = (chars[c] = (chars[c] || 0) + 1)
if count > most_common_count
most_common = c
most_common_count = count
end
end
[most_common, most_common_count]
end
I'd like to mention a solution with Enumerable#tally, introduced in Ruby 2.7.0:
str =<<-END
Tallies the collection, i.e., counts the occurrences of each element. Returns a hash with the elements of the collection as keys and the corresponding counts as values.
END
str.scan(/[a-z]/).tally.max_by(&:last)
#=> ["e", 22]
Where:
str.scan(/[a-z]/).tally
#=> {"a"=>8, "l"=>9, "i"=>6, "e"=>22, "s"=>12, "t"=>13, "h"=>9, "c"=>11, "o"=>11, "n"=>11, "u"=>5, "r"=>5, "f"=>2, "m"=>2, "w"=>1, "k"=>1, "y"=>1, "d"=>2, "p"=>1, "g"=>1, "v"=>1}
char, count = string.split('').
group_by(&:downcase).
map { |k, v| [k, v.size] }.
max_by { |_, v| v }

How can I sort by word frequency and then sort alphabetically within each frequency in Ruby?

wordfrequency = Hash.new(0)
splitfed.each { |word| wordfrequency[word] += 1 }
wordfrequency = wordfrequency.sort_by {|x,y| y }
wordfrequency.reverse!
puts wordfrequency
I have added the words into a hash table and have gotten it to sort by word frequency, but then order within each frequency is random when I want it to be in alphabetical order. Any quick fixes? Thanks! Much appreciated.
You can use:
wordfrequency = wordfrequency.sort_by{|x,y| [y, x] }
to sort by the value then the key.
In your case,
splitfed = ["bye", "hi", "hi", "a", "a", "there", "alphabet"]
wordfrequency = Hash.new(0)
splitfed.each { |word| wordfrequency[word] += 1 }
wordfrequency = wordfrequency.sort_by{|x,y| [y, x] }
wordfrequency.reverse!
puts wordfrequency.inspect
will output:
[["hi", 2], ["a", 2], ["there", 1], ["bye", 1], ["alphabet", 1]]
which is reverse ordered by the occurrence of the word then the word itself.
Make sure you note (which might be pretty obvious) that wordfrequency is now an array.
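If the goal is highest frequency first with words in ascending alphabetical order inside each frequency, one possible sketch is to negate the count in the sort key instead of reversing afterwards. The method name is hypothetical.

```ruby
# Hypothetical helper: highest frequency first, alphabetical within a frequency.
def sort_by_frequency_then_word(words)
  counts = Hash.new(0)
  words.each { |word| counts[word] += 1 }
  # Negating the count sorts it in descending order while the word
  # stays in ascending order.
  counts.sort_by { |word, count| [-count, word] }
end

splitfed = ["bye", "hi", "hi", "a", "a", "there", "alphabet"]
p sort_by_frequency_then_word(splitfed)
#=> [["a", 2], ["hi", 2], ["alphabet", 1], ["bye", 1], ["there", 1]]
```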
Hashes do not necessarily sort in natural order; it is down to the individual data structure. If you want to pretty print a hash, you need to sort the keys, then iterate over that sorted list of keys, outputting the value for each key as you go.
There are tricks you can do to do this on a single line, or collect the entries from the hash into a sorted array of arrays, but ultimately they all come back to sorting the keys then retrieving the data for the sorted key list.
Some hashes maintain insertion order, some hashes maintain a sorted structure which you can then traverse as you process the hash, but these are exceptions to the rule.
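In Ruby terms, sorting the keys and then looking up each value could be sketched as follows. The helper name and sample data are made up; note that Ruby's own Hash has kept insertion order since 1.9, so the explicit sort is what puts the keys in alphabetical order.

```ruby
# Hypothetical helper: "word: count" lines in alphabetical key order.
def lines_sorted_by_key(hash)
  hash.keys.sort.map { |key| "#{key}: #{hash[key]}" }
end

wordfrequency = { "hi" => 2, "bye" => 1, "a" => 2 }
puts lines_sorted_by_key(wordfrequency)
# a: 2
# bye: 1
# hi: 2
```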
Ruby's group_by is the basis for this:
words = %w[foo bar bar baz]
words.group_by{ |w| w }
# => {"foo"=>["foo"], "bar"=>["bar", "bar"], "baz"=>["baz"]}
words.group_by{ |w| w }.map{ |k, v| [k, v.size ] }
# => [["foo", 1], ["bar", 2], ["baz", 1]]
If you want to sort by the words then by their frequency:
words.group_by{ |w| w }.map{ |k, v| [k, v.size ] }.sort_by{ |k, v| [k, v] }
# => [["bar", 2], ["baz", 1], ["foo", 1]]
If you want to sort by the frequency then by the words:
words.group_by{ |w| w }.map{ |k, v| [k, v.size ] }.sort_by{ |k, v| [v, k] }
# => [["baz", 1], ["foo", 1], ["bar", 2]]

Resources