Build an array of descending match counts? - ruby

I have a hash where the keys are book titles and the values are arrays of the words in each book.
I want to write a method where, if I enter a word, I can search through the hash to find which array has the highest frequency of that word, and then return an array of the book titles in order of most matches.
That is, the method should return an array of titles in descending order of occurrences of the searched word.
This is what I have so far:
def search(query)
  books_names = @book_info.keys
  book_info = {}
  @result.each do |key, value|
    contents = @result[key]
    if contents.include?(query)
      book_info[:key] += 1
    end
  end
end

If book_info is your hash and input_str is the string you want to search for, the following will return a hash ordered by descending frequency of input_str in each book's word array:
Hash[book_info.sort_by{|k, v| v.count(input_str)}.reverse]
If you want the output to be an array of book names, remove the Hash[] wrapper and take the first element of each pair:
book_info.sort_by{|k, v| v.count(input_str)}.reverse.map(&:first)
This is a more compact version (but a little slower), replacing reverse with a negative sort criterion:
book_info.sort_by{|k, v| -v.count(input_str)}.map(&:first)
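Tying that back to the method in the question, a minimal sketch might look like this (assuming, as your code suggests, that @result maps book titles to arrays of words; the names here are just illustrative):
def search(query)
  # Titles sorted by how many times `query` appears in each book's word array,
  # most matches first.
  @result.keys.sort_by { |title| -@result[title].count(query) }
end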

You may want to consider creating a Book class. Here's a book class that will index the words into a word_count hash for quick sorting.
class Book
  attr_accessor :title, :words
  attr_reader :word_count

  @books = []

  class << self
    attr_accessor :books

    def top(word)
      @books.sort_by { |b| b.word_count[word.downcase] }.reverse
    end
  end

  def initialize
    self.class.books << self
    @word_count = Hash.new { |h, k| h[k] = 0 }
  end

  def words=(str)
    str.gsub(/[^\w\s]/, "").downcase.split.each do |word|
      word_count[word] += 1
    end
  end

  def to_s
    title
  end
end
Use it like so:
a = Book.new
a.title = "War and Peace"
a.words = "WELL, PRINCE, Genoa and Lucca are now no more than private estates of the Bonaparte family."
b = Book.new
b.title = "Moby Dick"
b.words = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."
puts Book.top("ago")
result:
Moby Dick
War and Peace
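Because words= builds the index up front, you can also inspect the counts directly without rescanning the text (continuing the example above; the word "and" is just an arbitrary choice):
a.word_count["and"]           #=> 1
b.word_count["and"]           #=> 2
Book.top("and").map(&:title)  #=> ["Moby Dick", "War and Peace"]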

Here is one way to build a hash whose keys are words and whose values are arrays of hashes with keys :title and :count, the hashes ordered by decreasing value of count.
Code
I am assuming we will start with a hash books, whose keys are titles and whose values are all the text in the book in one string.
def word_count_hash(books)
  word_and_count_by_title = books.each_with_object({}) { |(title,words),h|
    h[title] = words.scan(/\w+/)
                    .map(&:downcase)
                    .each_with_object({}) { |w,g| g[w] = (g[w] || 0)+1 } }
  title_and_count_by_word = word_and_count_by_title
    .each_with_object({}) { |(title,words),g| words.each { |w,count|
      g.update({w =>[{title: title, count: count}]}){|_,oarr,narr|oarr+narr}}}
  title_and_count_by_word.keys.each { |w|
    title_and_count_by_word[w].sort_by! { |h| -h[:count] } }
  title_and_count_by_word
end
Example
books = {}
books["Grapes of Wrath"] =
<<_
To the red country and part of the gray country of Oklahoma, the last rains
came gently, and they did not cut the scarred earth. The plows crossed and
recrossed the rivulet marks. The last rains lifted the corn quickly and
scattered weed colonies and grass along the sides of the roads so that the
gray country and the dark red country began to disappear under a green cover.
_
books["Tale of Two Cities"] =
<<_
It was the best of times, it was the worst of times, it was the age of wisdom,
it was the age of foolishness, it was the epoch of belief, it was the epoch of
incredulity, it was the season of Light, it was the season of Darkness, it was
the spring of hope, it was the winter of despair, we had everything before us,
we had nothing before us, we were all going direct to Heaven, we were all
going direct the other way
_
books["Moby Dick"] =
<<_
Call me Ishmael. Some years ago - never mind how long precisely - having little
or no money in my purse, and nothing particular to interest me on shore, I
thought I would sail about a little and see the watery part of the world. It is
a way I have of driving off the spleen and regulating the circulation. Whenever
I find myself growing grim about the mouth; whenever it is a damp, drizzly
November in my soul; whenever I find myself involuntarily pausing before coffin
warehouses
_
Construct the hash:
title_and_count_by_word = word_count_hash(books)
and then look up words:
title_and_count_by_word["the"]
#=> [{:title=>"Grapes of Wrath", :count=>12},
# {:title=>"Tale of Two Cities", :count=>11},
# {:title=>"Moby Dick", :count=>5}]
title_and_count_by_word["to"]
#=> [{:title=>"Grapes of Wrath", :count=>2},
# {:title=>"Tale of Two Cities", :count=>1},
# {:title=>"Moby Dick", :count=>1}]
Note the words being looked up must be entered in (or converted to) lower case.
Explanation
Construct the first hash:
word_and_count_by_title = books.each_with_object({}) { |(title,words),h|
  h[title] = words.scan(/\w+/)
                  .map(&:downcase)
                  .each_with_object({}) { |w,g| g[w] = (g[w] || 0)+1 } }
#=> {"Grapes of Wrath"=>
# {"to"=>2, "the"=>12, "red"=>2, "country"=>4, "and"=>6, "part"=>1,
# ...
# "disappear"=>1, "under"=>1, "a"=>1, "green"=>1, "cover"=>1},
# "Tale of Two Cities"=>
# {"it"=>10, "was"=>10, "the"=>11, "best"=>1, "of"=>10, "times"=>2,
# ...
# "going"=>2, "direct"=>2, "to"=>1, "heaven"=>1, "other"=>1, "way"=>1},
# "Moby Dick"=>
# {"call"=>1, "me"=>2, "ishmael"=>1, "some"=>1, "years"=>1, "ago"=>1,
# ...
# "pausing"=>1, "before"=>1, "coffin"=>1, "warehouses"=>1}}
To see what's happening here, consider the first element of books that Enumerable#each_with_object passes into the block. The two block variables are assigned the following values:
title
#=> "Grapes of Wrath"
words
#=> "To the red country and part of the gray country of Oklahoma, the
# last rains came gently,\nand they did not cut the scarred earth.
# ...
# the dark red country began to disappear\nunder a green cover.\n"
each_with_object has created a hash represented by the block variable h, which is initially empty.
First construct an array of words and convert each to lower-case.
q = words.scan(/\w+/).map(&:downcase)
#=> ["to", "the", "red", "country", "and", "part", "of", "the", "gray",
# ...
# "began", "to", "disappear", "under", "a", "green", "cover"]
We may now create a hash that contains a count of each word for the title "Grapes of Wrath":
h[title] = q.each_with_object({}) { |w,g| g[w] = (g[w] || 0) + 1 }
#=> {"to"=>2, "the"=>12, "red"=>2, "country"=>4, "and"=>6, "part"=>1,
# ...
# "disappear"=>1, "under"=>1, "a"=>1, "green"=>1, "cover"=>1}
Note the expression
g[w] = (g[w] || 0) + 1
If the hash g already has a key for the word w, this expression is equivalent to
g[w] = g[w] + 1
On the other hand, if g does not have this key (word) (in which case g[w] #=> nil), then the expression is equivalent to
g[w] = 0 + 1
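Incidentally, the same counting can be written with a hash whose default value is zero, which removes the need for the || 0 guard (just a stylistic alternative, not part of the method above):
q.each_with_object(Hash.new(0)) { |w,g| g[w] += 1 }
#=> {"to"=>2, "the"=>12, "red"=>2, "country"=>4, "and"=>6, "part"=>1, ...}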
The same calculations are then performed for each of the other two books.
We can now construct the second hash.
title_and_count_by_word =
word_and_count_by_title.each_with_object({}) { |(title,words),g|
words.each { |w,count| g.update({ w => [{title: title, count: count}]}) \
{ |_, oarr, narr| oarr + narr } } }
#=> {"to" => [{:title=>"Grapes of Wrath", :count=>2},
# {:title=>"Tale of Two Cities", :count=>1},
# {:title=>"Moby Dick", :count=>1}],
#=> "the" => [{:title=>"Grapes of Wrath", :count=>12},
# {:title=>"Tale of Two Cities", :count=>11},
# {:title=>"Moby Dick", :count=>5}],
# ...
# "warehouses"=> [{:title=>"Moby Dick", :count=>1}]}
(Note that this operation does not order the hashes for each word by :count, even though that may appear to be the case in this output fragment. The hashes are sorted in the next and final step.)
The main operation here that requires explanation is Hash#update (aka Hash#merge!). We are building a hash denoted by the block variable g, which initially is empty. The keys of this hash are words, the values are hashes with keys :title and :count. Whenever the hash being merged has a key (word) that is already a key of g, the block
{ |_, oarr, narr| oarr + narr }
is called to determine the value for the key in the merged hash. The block variables here are the key (word) (which we have replaced with an underscore because it will not be used), the old array of hashes and the new array of hashes to be merged (of which there is just one). We simply add the new hash to the merged array of hashes.
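Here is that merging behaviour in isolation (a tiny illustration, separate from the method above):
g = { "the" => [{title: "Grapes of Wrath", count: 12}] }
g.update({ "the" => [{title: "Moby Dick", count: 5}] }) { |_, oarr, narr| oarr + narr }
#=> {"the"=>[{:title=>"Grapes of Wrath", :count=>12}, {:title=>"Moby Dick", :count=>5}]}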
Lastly we sort the values of the hash (which are arrays of hashes) on decreasing value of :count.
title_and_count_by_word.keys.each { |w| title_and_count_by_word[w].sort_by! { |h| -h[:count] } }
title_and_count_by_word
#=> {"to"=>
# [{:title=>"Grapes of Wrath", :count=>2},
# {:title=>"Tale of Two Cities", :count=>1},
# {:title=>"Moby Dick", :count=>1}],
# "the"=>
# [{:title=>"Grapes of Wrath", :count=>12},
# {:title=>"Tale of Two Cities", :count=>11},
# {:title=>"Moby Dick", :count=>5}],
# ...
# "warehouses"=>[{:title=>"Moby Dick", :count=>1}]}

Related

Unscrambling a string given the number of splits and words that the sentence can be comprised of

I'm working on a problem in which I'm given a string that has been scrambled. The scrambling works like this:
An original string is chopped into substrings at random positions and a random number of times.
Each substring is then moved around randomly to form a new string.
I'm also given a dictionary of words that are possible words in the string.
Finally, I'm given the number of splits that were made in the string.
The example I was given is this:
dictionary = ["world", "hello"]
scrambled_string = "rldhello wo"
splits = 1
The expected output of my program would be the original string, in this case:
"hello world"
Suppose the initial string
"hello my name is Sean"
with
splits = 2
yields
["hel", "lo my name ", "is Sean"]
and those three pieces are shuffled to form the following array:
["lo my name ", "hel", "is Sean"]
and then the elements of this array are joined to form:
scrambled = "lo my name helis Sean"
Also suppose:
dictionary = ["hello", "Sean", "the", "name", "of", "my", "cat", "is", "Sugar"]
First convert dictionary to a set to speed lookups.
require 'set'
dict_set = dictionary.to_set
#=> #<Set: {"hello", "Sean", "the", "name", "of", "my", "cat", "is", "Sugar"}>
Next I will create a helper method.
def indices_to_ranges(indices, last_index)
  [-1, *indices, last_index].each_cons(2).map { |i,j| i+1..j }
end
Suppose we split scrambled twice (because splits #=> 2), specifically after the 'y' and the 'h':
indices = [scrambled.index('y'), scrambled.index('h')]
#=> [4, 11]
Inside indices_to_ranges, -1 is always prepended to indices and scrambled.size-1 is always appended as the last value before consecutive pairs are converted to ranges.
We may then use indices_to_ranges to convert these indices to ranges of indices of characters in scrambled:
ranges = indices_to_ranges(indices, scrambled.size-1)
#=> [0..4, 5..11, 12..20]
a = ranges.map { |r| scrambled[r] }
#=> ["lo my", " name h", "elis Sean"]
We could of course combine these two steps:
a = indices_to_ranges(indices, scrambled.size-1).map { |r| scrambled[r] }
#=> ["lo my", " name h", "elis Sean"]
Next I will permute the values of a. For each permutation I will join the elements to form a string, then split the string on single spaces to form an array of words. If all of those words are in the dictionary we may claim success and are finished. Otherwise, a different array of indices will be constructed and we try again, continuing until success is realized or all possible arrays of indices have been considered. We can put all this in the following method.
def unscramble(scrambled, dict_set, splits)
  last_index = scrambled.size-1
  (0..scrambled.size-2).to_a.combination(splits).each do |indices|
    indices_to_ranges(indices, last_index).
      map { |r| scrambled[r] }.
      permutation.each do |arr|
        next if arr[0][0] == ' ' || arr[-1][-1] == ' '
        words = arr.join.split(' ')
        return words if words.all? { |word| dict_set.include?(word) }
      end
  end
end
Let's try it.
original string: "hello my name is Sean"
scrambled = "lo my name helis Sean"
splits = 4
unscramble(scrambled, dict_set, splits)
#=> ["my", "name", "hello", "is", "Sean"]
See Array#combination and Array#permutation.
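For reference, here is what those two methods produce on small inputs (just a quick illustration):
[0, 1, 2, 3].combination(2).to_a
#=> [[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]
["a", "b", "c"].permutation.to_a
#=> [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"], ["b", "c", "a"], ["c", "a", "b"], ["c", "b", "a"]]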
bonkers answer (not quite perfect yet ... trouble with single chars):
#
# spaces appear to be important!
@check = {}
@ordered = []

def previous_words(word)
  @check.select { |y,z| z[:previous] == word }.map do |nw,z|
    @ordered << nw
    previous_words(nw)
  end
end

def in_word(dictionary, string)
  # check each word in the dictionary to see if the string is contained in one of them
  dictionary.each do |word|
    if word.include?(string)
      return word
    end
  end
  return nil
end

letters = scrambled.split("")
previous = nil
substr = ""
letters.each do |l|
  if in_word(dictionary, substr+l)
    substr += l
  elsif (l == " ")
    word = in_word(dictionary, substr)
    @check[word] = {found: 1}
    @check[word][:previous] = previous if previous
    substr = ""
    previous = word
  else
    word = in_word(dictionary, substr)
    @check[word] = {found: 1}
    @check[word][:previous] = previous if previous
    substr = l
    previous = nil
  end
end
word = in_word(dictionary, substr)
@check[word] = {found: 1}
@check[word][:previous] = previous if previous
@check.select { |y,z| z[:previous].nil? }.map do |w,z|
  @ordered << w
  previous_words(w)
end
pp @ordered
output:
dictionary = ["world", "hello"]
scrambled = "rldhello wo"
... my code here ...
2.5.8 :817 > @ordered
=> ["hello", "world"]
dictionary = ["hello", "my", "name", "is", "Sean"]
scrambled = "me is Shelleano my na"
... my code here ...
2.5.8 :879 > @ordered
=> ["Sean", "hello", "my", "name", "is"]

Array of strings Group by first common letters [closed]

Is there any way of grouping an array of strings by their common first letters?
For example:
array = [ 'hello', 'hello you', 'people', 'finally', 'finland' ]
So when I do
array.group_by{ |string| some_logic_with_string }
The result should be,
{
'hello' => ['hello', 'hello you'],
'people' => ['people'],
'fin' => ['finally', 'finland']
}
NOTE: Some test cases are ambiguous and their expectations conflict with other tests; you need to fix them.
I guess plain group_by may not work; further processing is needed.
I have come up with the code below, which seems to handle all the given test cases in a consistent manner.
I have left notes in the code to explain the logic. The only way to fully understand it is to inspect the value of h and follow the flow for a simple test case.
def group_by_common_chars(array)
  # We will iteratively group by as many times as there are characters
  # in the largest possible key, which is the max length of all strings.
  max_len = array.max_by { |i| i.size }.size
  # First group by the first character.
  h = array.group_by { |i| i[0] }
  # Now iterate the remaining (max_len - 1) times.
  (1...max_len).each do |c|
    # Let's perform a group by the next set of starting characters.
    t = h.map do |k,v|
      h1 = v.group_by { |i| i[0..c] }
    end.reduce(&:merge)
    # We need to merge the previously generated hash
    # with the hash generated in this iteration. Here things get tricky.
    # If previously we had
    #   {"a" => ["a"], "ab" => ["ab", "abc"]},
    # and now we have
    #   {"a"=>["a"], "ab"=>["ab"], "abc"=>["abc"]},
    # we need to merge the two hashes such that we have
    #   {"a"=>["a"], "ab"=>["ab", "abc"], "abc"=>["abc"]}.
    # Note that `Hash#merge`'s block is called only for common keys, so "abc"
    # will get merged; we can't do much about it now. We will process
    # it later in the loop.
    h = h.merge(t) do |k, o, n|
      if (o.size != n.size)
        diff = [o,n].max - [o,n].min
        if diff.size == 1 && t.value?(diff)
          [o,n].max
        else
          [o,n].min
        end
      else
        o
      end
    end
  end
  # Sort by key length, smallest in the beginning.
  h = h.sort { |i,j| i.first.size <=> j.first.size }.to_h
  # Get rid of those key-value pairs where the value is a single-element array,
  # that single element is already part of another key-value pair, and
  # that other value array has more than one element. This step will allow us
  # to get rid of key-value pairs like "abc"=>["abc"] in the example discussed
  # above.
  h = h.tap do |h|
    keys = h.keys
    keys.each do |k|
      v = h[k]
      if (v.size == 1 &&
          h.key?(v.first) &&
          h.values.flatten.count(v.first) > 1) then
        h.delete(k)
      end
    end
  end
  # Get rid of those keys whose value array consists only of elements that are
  # already part of some other key. Since the hash is ordered by key string
  # size, this process allows us to get rid of those keys which are smaller
  # in length but consist only of elements that are present somewhere else
  # under a key of larger length. For example, it lets us get rid of
  # "a"=>["aba", "abb", "aaa", "aab"] from a hash like
  # {"a"=>["aba", "abb", "aaa", "aab"], "ab"=>["aba", "abb"], "aa"=>["aaa", "aab"]}
  h.tap do |h|
    keys = h.keys
    keys.each do |k|
      values = h[k]
      other_values = h.values_at(*(h.keys-[k])).flatten
      already_present = values.all? do |v|
        other_values.include?(v)
      end
      h.delete(k) if already_present
    end
  end
end
Sample Run:
p group_by_common_chars ['hello', 'hello you', 'people', 'finally', 'finland']
#=> {"fin"=>["finally", "finland"], "hello"=>["hello", "hello you"], "people"=>["people"]}
p group_by_common_chars ['a', 'ab', 'abc']
#=> {"a"=>["a"], "ab"=>["ab", "abc"]}
p group_by_common_chars ['aba', 'abb', 'aaa', 'aab']
#=> {"ab"=>["aba", "abb"], "aa"=>["aaa", "aab"]}
p group_by_common_chars ["Why", "haven't", "you", "answered", "the", "above", "questions?", "Please", "do", "so."]
#=> {"a"=>["answered", "above"], "do"=>["do"], "Why"=>["Why"], "you"=>["you"], "so."=>["so."], "the"=>["the"], "Please"=>["Please"], "haven't"=>["haven't"], "questions?"=>["questions?"]}
Not sure if you can group by all common letters. But if you only want to group by the first letter, then here it is:
array = [ 'hello', 'hello you', 'people', 'finally', 'finland' ]
result = {}
array.each { |st| result[st[0]] = result.fetch(st[0], []) + [st] }
pp result
{"h"=>["hello", "hello you"], "p"=>["people"], "f"=>["finally", "finland"]}
Now result contains your desired hash.
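For what it's worth, the same first-letter grouping can also be written directly with group_by, which is closer to the form in the question (an alternative sketch):
array = [ 'hello', 'hello you', 'people', 'finally', 'finland' ]
array.group_by { |string| string[0] }
#=> {"h"=>["hello", "hello you"], "p"=>["people"], "f"=>["finally", "finland"]}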
Hmm, you're trying to do something that's pretty custom. I can think of two classical approaches that sort of do what you want: 1) Stemming and 2) Levenshtein Distance.
With stemming you're finding the root word to a longer word. Here's a gem for it.
Levenshtein is a famous algorithm which calculates the difference between two strings. There is a gem for it that runs pretty fast due to a native C extension.
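To give a feel for the Levenshtein idea, here is a plain-Ruby sketch of the classic dynamic-programming formulation (this is not the gem's API; the gem's C extension will be much faster):
def levenshtein(a, b)
  m, n = a.length, b.length
  # dist[i][j] = edits needed to turn the first i chars of a into the first j chars of b
  dist = Array.new(m + 1) { Array.new(n + 1, 0) }
  (0..m).each { |i| dist[i][0] = i }
  (0..n).each { |j| dist[0][j] = j }
  (1..m).each do |i|
    (1..n).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      dist[i][j] = [
        dist[i - 1][j] + 1,        # deletion
        dist[i][j - 1] + 1,        # insertion
        dist[i - 1][j - 1] + cost  # substitution
      ].min
    end
  end
  dist[m][n]
end

levenshtein("kitten", "sitting") #=> 3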

Comparing values of one hash to many hashes to get inverse document frequency in ruby

I'm trying to find the inverse document frequency for a categorization algorithm and am having trouble getting it with the way my code is structured (with nested hashes), and more generally with comparing one hash to many hashes.
My training code looks like this so far:
def train!
  @data = {}
  @all_books.each do |category, books|
    @data[category] = {
      words: 0,
      books: 0,
      freq: Hash.new(0)
    }
    books.each do |filename, tokens|
      @data[category][:words] += tokens.count
      @data[category][:books] += 1
      tokens.each do |token|
        @data[category][:freq][token] += 1
      end
    end
    @data[category][:freq].map { |k, v| v = (v / @data[category][:freq].values.max) }
  end
end
Basically, I have a hash with 4 categories (subject to change), and for each have word count, book count, and a frequency hash which shows term frequency for the category. How do I get the frequency of individual words from one category compared against the frequency of the words shown in all categories? I know how to do the comparison for one set of hash keys against another, but am not sure how to loop through a nested hash to get the frequency of terms against all other terms, if that makes sense.
Edit to include the expected outcome:
I'd like to return a hash of nested hashes (one for each category) that shows the word as the key, and the number of other categories in which it appears as the value, i.e. {:category1 => {:word => 3, :other => 2, :third => 1}, :category2 => {:another => 1, ...}}. Alternately, an array of category names as the value, instead of the number of categories, would also work.
I've tried creating a new hash as follows, but it's turning up empty:
def train!
  @data = {}
  @all_words = Hash.new([]) # new hash for all words, default value is an empty array
  @all_books.each do |category, books|
    @data[category] = {
      words: 0,
      books: 0,
      freq: Hash.new(0)
    }
    books.each do |filename, tokens|
      @data[category][:words] += tokens.count
      @data[category][:books] += 1
      tokens.each do |token|
        @data[category][:freq][token] += 1
        @all_words[token] << category # should insert category name if the word appears, right?
      end
    end
    @data[category][:freq].map { |k, v| v = (v / @data[category][:freq].values.max) }
  end
end
If someone can help me figure out why the @all_words hash is empty when the code is run, I may be able to get the rest.
I haven't gone through it all, but you certainly have an error:
@all_words[token] << category # should insert category name if the word appears, right?
Nope. @all_words[token] will return the empty default array, but it will not create a new slot with an empty array, as you're assuming. So that statement doesn't modify the @all_words hash at all.
Try these 2 changes and see if it helps:
@all_words = {} # ditch the default value
...
(@all_words[token] ||= []) << category # lazy-init the array, and append
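Here is the pitfall in isolation, in case it helps (a tiny illustration, independent of the training code):
h = Hash.new([])
h[:x] << "a"   # appends to the single shared default array, not to a slot for :x
h              #=> {}
h[:anything]   #=> ["a"]  (every missing key now returns the mutated default)

g = {}
(g[:x] ||= []) << "a"
g              #=> {:x=>["a"]}
A block-form default, Hash.new { |hash, key| hash[key] = [] }, also works, because it assigns a fresh array to each new key.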

Unique frequency of occurrence

For a class project, we are supposed to take a published paper and create an algorithm that lists all the words in a unit of text while excluding the stop words. I am trying to produce a list of all unique words (in the entire text) along with their frequency of occurrence. This is the algorithm that I created for one line of the text:
x = l[125] # Select a specific line in the text
p = Array.new() # Assign a new array to variable p
p = x.split # Split the line into words
for i in (0...p.length)
  if(p[i] != "the" and p[i] != "to" and p[i] != "union" and p[i] != "political")
    print p[i] + " "
  end
end
puts
The output of this program is one sentence (from line 125) excluding the stop words. Should I use bubble sort? How would I modify it to sort strings of equal length (or is that irrelevant)?
I'd say you have a good start, considering you are new to Ruby. You asked if you should use a bubble sort. I guess you're thinking of grouping multiple occurrences of a word, then going through the array to count them. That would work, but there are a couple of other approaches that are easier and more 'Ruby-like'. (By that I mean they make use of powerful features of the language and at the same time are more natural.)
Let's focus on counting the unique words in a single line. Once you can do that, you should be able to easily generalize that for multiple lines.
First Method: Use a Hash
The first approach is to use a hash. h = {} creates a new empty one. The hash's keys will be words and its values will be the number of times each word is present in the line. For example, if the word "cat" appears 9 times, we will have h["cat"] = 9, just what you need. To construct this hash, we see if each word w in the line is already in the hash. It is in the hash if
h[w] != nil
If it is, we increment the word count:
h[w] = h[w] + 1
or just
h[w] += 1
If it's not in the hash, we add the word to the hash like this:
h[w] = 1
That means we can do this:
if h[w]
  h[w] += 1
else
  h[w] = 1
end
Note that here if h[w] is the same as if h[w] != nil.
Actually, we can use a trick to make this even easier. If we create the hash like this:
h = Hash.new(0)
then any key we add without a value will be assigned a default value of zero. That way we don't have to check if the word is already in the hash; we simply write
h[w] += 1
If w is not in the hash, h[w] returns the default value 0, and += 1 then stores 1 under the key w. Cool, eh?
Let's put all this together. Suppose
line = "the quick brown fox jumped over the lazy brown fox"
We convert this string to an array with the String#split method:
arr = line.split # => ["the", "quick", "brown", "fox", "jumped", \
"over", "the", "lazy", "brown", "fox"]
then
h = Hash.new(0)
arr.each {|w| h[w] += 1}
h # => {"the"=>2, "quick"=>1, "brown"=>2, "fox"=>2, "jumped"=>1, "over"=>1, "lazy"=>1}
We're done!
Second Method: use the Enumerable#group_by method
Whenever you want to group elements of an array, hash or other collection, the group_by method should come to mind.
To apply group_by to the quick, brown fox array, we provide a block that contains the grouping criterion, which in this case is simply the words themselves. This produces a hash:
g = arr.group_by {|e| e}
# => {"the"=>["the", "the"], "quick"=>["quick"], "brown"=>["brown", "brown"], \
# "fox"=>["fox", "fox"], "jumped"=>["jumped"], "over"=>["over"], "lazy"=>["lazy"]}
The next thing to do is convert the hash values to the number of occurrences of the word (e.g., convert ["the", "the"] to 2). To do this, we can create a new empty hash h, and add hash pairs to it:
h = {}
g.each {|k,v| h[k] = v.size}
h # => {"the"=>2, "quick"=>1, "brown"=>2, "fox"=>2, "jumped"=>1, "over"=>1, "lazy"=>1
One More Thing
You have this code snippet:
if(p[i] != "the" and p[i] != "to" and p[i] != "union" and p[i] != "political")
print p[i] + " "
end
Here are a couple of ways you could make this a little cleaner, both using the hash h above.
First Way
skip_words = %w[the to union political] # => ["the", "to", "union", "political"]
h.each {|k,v| (print k + ' ') unless skip_words.include?(k)}
Second Way
h.each do |k,v|
  case k
  when "the", "to", "union", "political"
    next
  else
    puts "The word '#{k}' appears #{v} times."
  end
end
Edit to address your comment. Try this:
p = "The quick brown fox jumped over the quick grey fox".split
freqs = Hash.new(0)
p.each {|w| freqs[w] += 1}
sorted_freqs = freqs.sort_by {|k,v| -v}
sorted_freqs.each {|word, freq| puts word+' '+freq.to_s}
=>
quick 2
fox 2
jumped 1
The 1
brown 1
over 1
the 1
grey 1
Normally, you would not sort a hash; rather, you'd first convert it to an array:
sorted_freqs = freqs.to_a.sort_by {|k,v| v}.reverse
or
sorted_freqs = freqs.to_a.sort_by {|k,v| -v}
Now sorted_freqs is an array, rather than a hash. The last line stays the same. In general, it's best not to rely on a hash's order. In fact, before Ruby 1.9, hashes were not ordered. If order is important, use an array or convert the hash to an array.
Having said that, you can sort smallest-to-largest on the hash values, or (as I have done), sort largest-to-smallest on the negative of the hash values. Note that there is no Enumerable#reverse or Hash#reverse. Alternatively (always many ways to skin a cat with Ruby), you could sort on v and then use Enumerable#reverse_each:
sorted_freqs.reverse_each {|word, freq| puts word+' '+freq.to_s}
Lastly, you could eliminate the temporary variable sorted_freqs (needed because there is no Enumerable#sort_by! method), by chaining the last two statements:
freqs.sort_by {|k,v| -v}.each {|word, freq| puts word+' '+freq.to_s}
You should really look into Ruby's Enumerable module. You very seldom write for x in y in Ruby.
word_list = ["the", "to", "union", "political"]
l[125].split.each do |word|
  print word + " " unless word_list.include?(word)
end
In order to count, sort, and all that stuff, look into the group_by method and perhaps the sort_by method of arrays, as sketched below.
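For instance, something along these lines would count the remaining words and print them in descending order of frequency (a sketch reusing word_list, and assuming l[125] holds the line of text):
l[125].split
      .reject { |word| word_list.include?(word) }
      .group_by { |word| word }
      .sort_by { |_word, occurrences| -occurrences.size }
      .each { |word, occurrences| puts "#{word} #{occurrences.size}" }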

How to make sure certain elements do not get into arrays in Ruby

I have an array, let's say
array1 = ["abc", "a", "wxyz", "ab",......]
How do I make sure that neither, for example, "a" (any 1 character), "ab" (any 2 characters), or "abc" (any 3 characters), nor words like "that", "this", "what" etc., nor any of the foul words are saved in array1?
This removes elements with fewer than 4 characters and words like this, that, what from array1 (if I got it right):
array1.reject! do |el|
  el.length < 4 || ['this', 'that', 'what'].include?(el)
end
This changes array1 in place. If you use reject (without the !), it'll return the result and leave array1 unchanged.
You can open the Array class and add a new method that disallows certain words. Example:
class Array
  def add(ele)
    unless rejects.include?(ele)
      self.push ele
    end
  end

  def rejects
    ['this', 'that', 'what']
  end
end
arr = []
arr.add "one"
puts arr
arr.add "this"
puts arr
arr.add "aslam"
puts arr
Output would be:
one
one
one
aslam
And notice the word "this" was not added.
You could create a stop list. Using a hash for this would be more efficient than an array, as lookup time is constant with a hash, while with an array the lookup time is proportional to the number of elements in the array. If you are going to check for stop words a lot, I suggest using a hash that contains all the stop words. Using your code, you could do the following:
badwords_a = ["abc", "a", "wxyz", "ab"] # Your array of bad words
badwords_h = {} # Initialize and empty hash
badwords_a.each{|word| badwords_h[word] = nil} # Fill the hash
goodwords = []
words_to_process = ["abc","a","Foo","Bar"] # a list of words you want to process
words_to_process.each do |word| # Process new words
if badwords_h.key?(word)
else
goodwords << word # Add the word if it did not match the bad list
end
end
puts goodwords.join(", ")
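A Set gives the same constant-time membership test with slightly clearer intent (an alternative sketch of the same idea):
require 'set'

badwords = Set.new(["abc", "a", "wxyz", "ab"])
words_to_process = ["abc", "a", "Foo", "Bar"]

goodwords = words_to_process.reject { |word| badwords.include?(word) }
puts goodwords.join(", ") # prints "Foo, Bar"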
