Printing the n most frequent words in a file (string) - ruby

The goal:
Write a function that takes two parameters: (1) a String representing a text document and (2) an integer providing the number of items to return. Implement the function such that it returns a list of Strings ordered by word frequency, the most frequently occurring word first. Use your best judgement to decide how words are separated. Your solution should run in O(n) time where n is the number of characters in the document.
My thinking was that, in the worst case, the second argument could equal the total number of words in the document, reducing the problem to sorting the words by their frequencies. This made me think that the lower bound for time complexity would be O(n log n) if I used a comparison sort, so the best approach seemed to be a counting sort. Here is my code.
I would like you to tell me whether my analysis is correct. I've annotated the code with my idea of what the time complexity is, but it could definitely be incorrect. What is the actual time and space complexity of this code? I would also like to hear whether this is in fact a good approach, and whether there are alternate approaches that would be used in practice.
### n is number of characters in string, k is number of words ###
def word_frequencies(string, n)
words = string.split(/\s/) # O(n)
max = 0
min = Float::INFINITY
frequencies = words.inject(Hash.new(0)) do |hash,word| # O(k)
occurrences = hash[word] += 1 # O(1)
max = occurrences if occurrences > max # O(1)
min = occurrences if occurrences < min # O(1)
hash; # O(1)
end
### perform a counting sort ###
sorted = Array.new(max + words.length)
delta = 0
frequencies.each do |word, frequency| #O(k)
p word + "--" + frequency.to_s
index = frequency
if sorted[index]
sorted[index] = sorted[index].push(word) # ??? I think O(1).
else
sorted[index] = [word] # O(1)
end
end
return sorted.compact.flatten[-n..-1].reverse
### Compact is O(k). Flatten is O(k). Reverse is O(k). So O(3k)
end
### Total --- O(n + 5k) = O(n). Correct?
### And the space complexity is O(n) for the hash + O(2k) for the sorted array.
### So total O(n).
text = "hi hello hi my name is what what hi hello hi this is a test test test test hi hi hi what hello these are some words these these"
p word_frequencies(text, 4)

Two ways:
def word_counter(string, max)
string.split(/\s+/)
.group_by{|x|x}
.map{|x,y|[x,y.size]}
.sort_by{|_,size| size} # Have to sort =/
.last(max)
end
def word_counter(string, max)
# Create a Hash and a List to store values in.
word_counter, max_storage = Hash.new(0), []
# Split the string and add each word to the hash:
string.split(/\s+/).each{|word| word_counter[word] += 1}
# Take each word and add it to the list (so that the list_index = word_count)
# I also add the count, but that is not really needed
word_counter.each{|key, val| max_storage[val] = [*max_storage[val]] << [key, val]}
# Higher count will always be at the end, remove nils and get the last "max" elements.
max_storage.compact.flatten(1).last(max)
end

One idea is following:
You are already constructing a hash map that gives the frequency of a given word.
Now iterate through this hash map and create a reverse "hash set". That is the set of words for a given frequency.
Find the maximum frequency and output the set of words for that frequency.
Decrement it, and check for words in the hash set.
Keep doing this till the required number of words.
Scanning downward from the maximum frequency takes O(f) steps, where f is the maximum frequency of any word. That frequency is at most n, the number of characters, as required.
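The steps above can be sketched roughly as follows. This is only an illustration: top_words_by_bucket, words_by_frequency and the whitespace-based split are assumptions made for the example, not part of the original answer.

```ruby
# Sketch of the "reverse hash" idea: frequency => words with that frequency.
def top_words_by_bucket(text, limit)
  # Word => frequency, as in the question.
  counts = Hash.new(0)
  text.split.each { |word| counts[word] += 1 }

  # Reverse index: frequency => list of words that occur that often.
  words_by_frequency = Hash.new { |hash, frequency| hash[frequency] = [] }
  counts.each { |word, frequency| words_by_frequency[frequency] << word }

  # Walk down from the maximum frequency until enough words are collected.
  result = []
  frequency = counts.values.max || 0
  while frequency > 0 && result.size < limit
    result.concat(words_by_frequency[frequency]) if words_by_frequency.key?(frequency)
    frequency -= 1
  end
  result.first(limit)
end

p top_words_by_bucket("a b a c a b", 2)
```

Building both hashes is linear in the number of words, and the downward scan visits at most f frequencies, so the whole sketch stays linear in the input size.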

Sample, quick way :)
#assuming you read from the file and get it to a string called str
h = {}
arr = str.split("\n")
arr.each do |i|
i.split(" ").each do |w|
if h.has_key?(w)
h[w] += 1
else
h[w] = 1
end
end
end
Hash[h.sort_by{|k, v| v}.reverse]
This works, but could be improved.

Related

Feasibility of a bit modified version of Rabin Karp algorithm

I am trying to implement a slightly modified version of the Rabin-Karp algorithm. My idea is that if I compute the hash value of the pattern using a weight associated with each letter position, then I don't have to worry about anagrams. I can just take a part of the string, calculate its hash value, and compare it with the hash value of the pattern. This differs from the traditional approach, where the hash values of the substring and the pattern are compared and then the two are checked to see whether they are actually the same or just an anagram. Here is my code:
string = "AABAACAADAABAABA"
pattern = "AABA"
#string = "gjdoopssdlksddsoopdfkjdfoops"
#pattern = "oops"
#get hash value of the pattern
def gethashp(pattern):
    sum = 0
    #I multiply each letter of the pattern with a weight
    #So for eg CAT will be C*1 + A*2 + T*3 and the resulting
    #value will be unique for the letters CAT and won't match if the
    #letters are rearranged
    for i in range(len(pattern)):
        sum = sum + ord(pattern[i]) * (i + 1)
    return sum % 101 #some prime number 101
def gethashst(string):
    sum = 0
    for i in range(len(string)):
        sum = sum + ord(string[i]) * (i + 1)
    return sum % 101
hashp = gethashp(pattern)
i = 0
def checkMatch(string,pattern,hashp):
    global i
    #check if we actually get first four characters (comes in handy when you
    #are nearing the end of the string)
    if len(string[:len(pattern)]) == len(pattern):
        #assign the substring to string2
        string2 = string[:len(pattern)]
        #get the hash value of the substring
        hashst = gethashst(string2)
        #if both the hash values match
        if hashst == hashp:
            #print the index of the first character of the match
            print("Pattern found at {}".format(i))
        #delete the first character of the string
        string = string[1:]
        #increment the index
        i += 1 #keep a count of the index
        checkMatch(string,pattern,hashp)
    else:
        #if no match or end of string, return
        return
checkMatch(string,pattern,hashp)
The code is working just fine. My question is: is this a valid way of doing it? Can there be any instance where the logic might fail? None of the Rabin-Karp implementations I have come across use this logic; instead, for every hash match, they further check character by character to ensure it's not an anagram. So is it wrong if I do it this way? My opinion is that with this code, as soon as the hash value matches, you never have to check both strings character by character and you can just move on to the next.
Anagrams are not the only strings that can collide with the hash value of the pattern. Any other string with the same hash value collides too. A matching hash value can lie, so a character-by-character comparison is required.
For example, in your case you are taking the hash mod 101, so there are only 101 possible hash values. Take any 102 distinct patterns; by the pigeonhole principle, at least two of them have the same hash. If you use one of them as the pattern, the presence of the other in the string would produce a false match if you skip the character comparison.
Moreover, even with the hash you used, two anagrams can have the same hash value; such a pair can be found by solving linear equations.
For example (using A=1, B=2, C=3, ... as letter values for brevity),
DCE = 4*1 + 3*2 + 5*3 = 25
CED = 3*1 + 5*2 + 4*3 = 25
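The same collision shows up with real character codes. The sketch below ports the question's weighting scheme to Ruby for illustration; weighted_hash is a hypothetical name, and the modulus 101 is taken from the question's code.

```ruby
# Position-weighted hash from the question: sum of ord(char) * position, mod 101.
def weighted_hash(text, modulus = 101)
  sum = 0
  text.each_char.with_index(1) { |char, position| sum += char.ord * position }
  sum % modulus
end

# "DCE" and "CED" are anagrams of each other, yet they hash identically.
p weighted_hash("DCE")
p weighted_hash("CED")
p weighted_hash("DCE") == weighted_hash("CED")
```

Replacing the letter values 4, 3, 5 with the character codes 68, 67, 69 adds the same constant to both weighted sums, so the two sums remain equal before the modulus is even applied.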

Permutations of strings takes too long to solve

I'm creating an array of the unique permutations of the letters in a string, only to sort them alphabetically and find the middle element of the set.
def middle_permutation(string)
length = string.length
permutation_set = string.split("").permutation(length).to_a.map{|item| item.join}.sort
permutation_set.length.even? ? permutation_set[(permutation_set.length)/2-1] : permutation_set[(permutation_set.length/2)+1]
end
For example:
middle_permutation("zxcvbnmasd") should equal "mzxvsndcba"
Even for small strings (N >=10), the calculations take pretty long to finish, and I can forget about anything double that; is there a quicker way?
I'm assuming the letters are unique, as in the OP's question.
Sort the letters.
Pluck the middle letter of the sorted string (rounded down). This is the first letter of the middle permutation.
If the original list had an even number of letters, the rest of the permutation is the reverse sort of the remaining letters.
If not, take the middle letter again. Now the rest of the result is the reverse sort of the remaining letters.
The method below returns the desired permutation directly, without iterating through permutations.
The asker has stated that the string contains no duplicated letters, which is a requirement for this method. I assume the characters of the string are sorted. If they are not, the creation of a sorted string would be the first step:
str = "ebadc".chars.sort.join
#=> "abcde"
Code
def mid_perm(str)
return mid_perm_even_length_strings(str) if str.size.even?
first_char_index = str.size/2
str[first_char_index] << mid_perm_even_length_strings(str[0,first_char_index] +
str[first_char_index+1..-1])
end
def mid_perm_even_length_strings(str)
first_char_index = str.size/2-1
str[first_char_index] + (str[0,first_char_index] + str[first_char_index+1..-1]).reverse
end
Examples
mid_perm 'abcd'
#=> "bdca"
mid_perm 'abcde'
#=> "cbeda"
mid_perm 'abcdefghijklmnopqrstuvwxyz'
#=> "mzyxwvutsrqponlkjihgfedcba"
Explanation
Let's start by defining a method to produce permutations of the letters of a string.
def perms(str)
str.chars.permutation(str.size).map(&:join)
end
Strings containing an even number of characters
Consider
a = perms "abcd"
#=> ["abcd", "abdc", "acbd", "acdb", "adbc", "adcb",
# "bacd", "badc", "bcad", "bcda", "bdac", "bdca",
# "cabd", "cadb", "cbad", "cbda", "cdab", "cdba",
# "dabc", "dacb", "dbac", "dbca", "dcab", "dcba"]
a contains 4! #=> 4*3*2 => 24 elements, 4 being the length of the string.
Notice that since the characters in perms' argument are sorted, the array returned is also sorted [1].
a == a.sort #=>true
As a.size #=> 24, the "middle" element is either a[11] #=> "bdca" or a[12] #=> "cabd" (where 11 = (24-1)/2 and 12 = 24/2), depending on how we want to round. The question stipulates that, for even-length strings, we are to round down, so that would be "bdca".
Now let's slice a into str.size equal arrays, each containing a.size/str.size #=> 24/4 => 6 elements:
b = a.each_slice(a.size/str.size).to_a
#=> [["abcd", "abdc", "acbd", "acdb", "adbc", "adcb"],
# ["bacd", "badc", "bcad", "bcda", "bdac", "bdca"],
# ["cabd", "cadb", "cbad", "cbda", "cdab", "cdba"],
# ["dabc", "dacb", "dbac", "dbca", "dcab", "dcba"]]
The desired element is therefore
b[str.size/2-1][-1]
#=> "bdca"
This value can be computed more directly as follows.
first_char_index = str.size/2-1
#=> 1
first_char = str[first_char_index]
#=> "b"
remaining_chars = (str[0,first_char_index] + str[first_char_index+1..-1]).reverse
#=> "dca"
first_char + remaining_chars
#=> "bdca"
The same logic applies to all strings having an even number of characters. We therefore can write the method mid_perm_even_length_strings shown in the Code section above.
For example (for a 12-character string)
mid_perm_even_length_strings 'abcdefghijkl'
#=> "flkjihgedcba"
Strings containing an odd number of characters
Now consider
str = "abcde"
a = perms str
#=> ["abcde", "abced", "abdce", "abdec", "abecd", "abedc",
# "acbde", "acbed", "acdbe", "acdeb", "acebd", "acedb",
# "adbce", "adbec", "adcbe", "adceb", "adebc", "adecb",
# "aebcd", "aebdc", "aecbd", "aecdb", "aedbc", "aedcb",
# "bacde", "baced", "badce", "badec", "baecd",..., "bedca",
# "cabde", "cabed", "cadbe", "cadeb", "caebd", "caedb",
# "cbade", "cbaed", "cbdae", "cbdea", "cbead", "cbeda",
# "cdabe", "cdaeb", "cdbae", "cdbea", "cdeab", "cdeba",
# "ceabd", "ceadb", "cebad", "cebda", "cedab", "cedba",
# "dabce", "dabec", "dacbe", "daceb", "daebc",..., "decba",
# "eabcd", "eabdc", "eacbd", "eacdb", "eadbc",..., "edcba"]
Here the permutation contains 5! #=> 120 elements, in 5 blocks of 24. (Again, a.each_cons(2).all? { |s1,s2| s1 < s2 } #=> true.)
The middle element of a is clearly the middle element of the block of elements that begin with
str[str.size/2] #=> "c"
That block would be the array
b = a.each_slice(a.size/str.size).to_a[str.size/2]
#=> ["cabde", "cabed", "cadbe", "cadeb", "caebd", "caedb",
# "cbade", "cbaed", "cbdae", "cbdea", "cbead", "cbeda",
# "cdabe", "cdaeb", "cdbae", "cdbea", "cdeab", "cdeba",
# "ceabd", "ceadb", "cebad", "cebda", "cedab", "cedba"]
which would be 'c' plus the middle element of the array
["abde", "abed", "adbe", "adeb", "aebd", "aedb",
"bade", "baed", "bdae", "bdea", "bead", "beda",
"dabe", "daeb", "dbae", "dbea", "deab", "deba",
"eabd", "eadb", "ebad", "ebda", "edab", "edba"]
That array is merely the permutations of the string "abde". Since that string contains an even number of characters, its middle element is
mid_perm_even_length_strings 'abde'
#=> "beda"
It follows that the middle element of the permutations of the letters of "abcde" is therefore
'c' + 'beda'
#=> "cbeda"
This clearly applies to all strings containing an odd number of characters.
1. The doc for Array#permutation states, "The implementation makes no guarantees about the order in which the permutations are yielded." We therefore might need to tack .sort onto the end of the operative line of perms, but with Ruby v2.4 (and, I suspect, earlier versions) that is in fact not necessary here.
I was able to compact it like this:
def middle_permutation(string)
list = string.chars.permutation.map(&:join).sort
list[list.length / 2 - (list.length.even? ? 1 : 0)]
end
Which yields:
middle_permutation('zxcvbnmasd')
# => "mzxvsndcba"
You don't need to generate all permutations. Compute the total number of permutations as PN = N!, where N is the length of the string (of distinct characters), and generate only the needed (PN/2)-th permutation directly from its number, for example using this approach:
public static int[] perm(int n, int k)
{
int i, ind, m=k;
int[] permuted = new int[n];
int[] elems = new int[n];
for(i=0;i<n;i++) elems[i]=i;
for(i=0;i<n;i++)
{
ind=m%(n-i);
m=m/(n-i);
permuted[i]=elems[ind];
elems[ind]=elems[n-i-1];
}
return permuted;
}
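Note that the Java snippet decodes a permutation from a number, but its numbering does not follow sorted order (for n = 3, k = 0 it yields 0, 2, 1). A lexicographic variant can be sketched in Ruby as below. The names kth_permutation and middle_permutation_by_rank are hypothetical, and the sketch assumes a string of at least two distinct characters.

```ruby
# Returns the k-th (0-based) permutation of the string's characters in sorted order,
# using the factorial number system instead of enumerating all permutations.
def kth_permutation(string, k)
  pool = string.chars.sort
  result = []
  until pool.empty?
    # Number of permutations that share each possible next character.
    block = (1..pool.size - 1).reduce(1, :*)
    index, k = k.divmod(block)
    result << pool.delete_at(index)
  end
  result.join
end

# The middle permutation (rounded down) has 0-based rank N!/2 - 1.
def middle_permutation_by_rank(string)
  total = (1..string.size).reduce(1, :*)
  kth_permutation(string, total / 2 - 1)
end

p middle_permutation_by_rank("abcd")
p middle_permutation_by_rank("zxcvbnmasd")
```

Each step removes one character from the pool, so the work is quadratic in the string length at worst (because of delete_at), rather than factorial.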
So it turns out there are two tracks to this, odd strings and even strings.
For odd-length strings, take out the middle element of the sorted array and the one before it, in that order. That leaves two remaining arrays, one to the right and one to the left, both alphabetically sorted. Append the elements of the right array, starting with its last element, then do the same for the left array.
For even-length strings, do the same but take only one character in the first step: the (N/2)-th element.
Here's my solution:
def middle_permutation(string)
string_array = string.chars.sort
mid_string = []
length = string.length
if length.even?
mid_string << string_array[length/2-1]
string_array.delete_at(length/2-1)
(mid_string << string_array.reverse).flatten.join
else
mid_string << string_array[(length/2)-1..length/2].reverse
string_array.slice!((length/2)-1, 2)
(mid_string << string_array.reverse).flatten.join
end
end

sorting programme on Ruby

I've seen the following sorting programme in Ruby, but I don't think I fully understand how it actually works:
def sort arr
rec_sort arr, []
end
def rec_sort unsorted, sorted
if unsorted.length <= 0
return sorted
end
smallest = unsorted.pop
still_unsorted = []
unsorted.each do |tested|
if tested < smallest
still_unsorted.push smallest
smallest = tested
else
still_unsorted.push tested
end
end
sorted.push smallest
rec_sort still_unsorted, sorted
end
puts sort ["satoshi", "Bitcoin", "technology", "universe", "smell"]
=> Bitcoin
satoshi
smell
technology
universe
But when I change the first argument of the "rec_sort" method from "still_unsorted" (as indicated above) to "unsorted", the programme gives :
=> Bitcoin
Bitcoin
Bitcoin
Bitcoin
satoshi
I understand that the each loop selects the word "Bitcoin" first (because it would indeed come first when sorted), and "Bitcoin" would be put into the array "sorted". What I don't quite understand is why there are several "Bitcoin" here, since it should have been excluded from the "unsorted" array in the first iteration of the each loop and, therefore, could not appear in the following iterations, making it impossible for "Bitcoin" to be in the "sorted" array several times.
Could you tell me what makes the two so different?
Any suggestions will be appreciated. Thank you.
The still_unsorted array has the smallest element removed, but the unsorted array only has its last element removed.
As far as I understand, it's a recursive sort in the style of selection sort: each call extracts the smallest remaining element. As for your confusion: unsorted is not modified except by the statement unsorted.pop; its elements are only copied over, so that still_unsorted ends up holding every element except the smallest one.
I am dry running this on [3,1,2] for you
unsorted = [3,1,2]
sm = unsorted.pop # sm = 2 unsorted = [3,1]
still_unsorted = []
#after loop unsorted.each
# still_unsorted = [3,2]
# unsorted = [3,1]
# sm = 1
# sorted = [1]
Do the next 2 iterations yourself and you'll understand what's happening.
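To follow the remaining iterations without doing them by hand, an instrumented copy of rec_sort can record each step. This is only a sketch: rec_sort_trace and the keys of the recorded hashes are hypothetical names, and the input is duplicated so the caller's array is left alone.

```ruby
# Same logic as rec_sort, but records what each recursive call saw.
def rec_sort_trace(unsorted, sorted = [], trace = [])
  return trace if unsorted.empty?

  unsorted = unsorted.dup        # keep the caller's array intact
  smallest = unsorted.pop        # removes only the LAST element
  still_unsorted = []
  unsorted.each do |tested|
    if tested < smallest
      still_unsorted.push smallest
      smallest = tested
    else
      still_unsorted.push tested
    end
  end
  sorted.push smallest
  trace << { smallest: smallest, unsorted_after_pop: unsorted, still_unsorted: still_unsorted }
  rec_sort_trace(still_unsorted, sorted, trace)
end

rec_sort_trace([3, 1, 2]).each { |step| p step }
```

In the first recorded step, unsorted_after_pop still contains the smallest element while still_unsorted does not, which is why recursing on unsorted keeps picking the same value again.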

Replace near working array code with hash code in ruby to find mode

I was attempting to find a mode without using a hash, but I no longer know if it's possible, so I am wondering if someone can help me translate my nearly working array code into hash code to make it work.
I have seen a shorter solution, which I will post, but I do not quite follow it. I'm hoping this translation will help me understand hashes better.
Here is my code, with my comments. I have bolded the part that I know will not work, as I'm comparing a frequency value to the value of an element itself.
@new = [0]
def mode(arr)
  arr.each do |x| #do each element in the array
    freq = arr.count(x) #set freq equal to the result of the count of each element
    if freq > @new[0] && @new.include?(x) == false #if **frequency of element is greater than the frequency of the first element in @new array** and is not already there
      @new.unshift(x) #send that element to the front of the array
      @new.pop #and get rid of the element at the end(which was the former most frequent element)
    elsif freq == @new[0] && @new.include?(x) == false #else **if frequency of element is equal to the frequency of the first element in @new array** and is not already there
      @new << x #send that element to @new array
    end
  end
  if @new.length > 1 #if @new array has multiple elements
    @new.inject(:+)/@new.length.to_f #find the average of the elements
  end
  @new #return the final value
end
mode([2,2,6,9,9,15,15,15])
mode([2,2,2,3,3,3,4,5])
Now I have read this post:
Ruby: How to find item in array which has the most occurrences?
And looked at this code
arr = [1, 1, 1, 2, 3]
freq = arr.inject(Hash.new(0)) { |h,v| h[v] += 1; h }
arr.sort_by { |v| freq[v] }.last
But I don't quite understand it.
What I'd like my code to do, is, as it finds the most frequent element,
to store that element as a key, and its frequency as its value.
And then I'd like to compare the next elements frequency to the frequency of the existing pair,
and if it is equal to the most frequent, store it as well,
if it is greater, replace the existing,
and if it is less than, to disregard and move to the next element.
Then of course, I'd like to return the element which has most frequencies, not the amount of frequencies,
and if two or more elements share the most frequencies, then to find the average of those numbers.
I'd love to see it with some hint of my array attempt, and maybe an explanation of that hash method that I posted above, or one that is broken down a little more simply.
This seems to fit your requirements:
def mode(array)
histogram = array.each_with_object(Hash.new(0)) do |element, histogram|
histogram[element] += 1
end
most_frequent = histogram.delete_if do |element, frequency|
frequency < histogram.values.max
end
most_frequent.keys.reduce(&:+) / most_frequent.size.to_f
end
It creates a hash of frequencies histogram, where the keys are the elements of the input array and the values are the frequency of that element in the array. Then, it removes all but the most frequent elements. Finally, it averages the remaining keys.
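The same three stages (count, keep the most frequent, average) can also be written one step at a time, which may be easier to follow. This is a sketch assuming a non-empty array of numbers; mode_average and its local variable names are illustrative only.

```ruby
def mode_average(array)
  # Stage 1: element => how many times it occurs. Hash.new(0) makes missing keys start at 0.
  counts = Hash.new(0)
  array.each { |element| counts[element] += 1 }

  # Stage 2: find the highest frequency once, then keep the elements that reach it.
  highest = counts.values.max
  modes = counts.select { |_element, count| count == highest }.keys

  # Stage 3: average the most frequent elements (a single mode averages to itself).
  modes.inject(:+) / modes.size.to_f
end

p mode_average([2, 2, 6, 9, 9, 15, 15, 15])
p mode_average([2, 2, 2, 3, 3, 3, 4, 5])
```

Computing the maximum once, outside the filtering block, avoids rescanning all the counts for every element.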

Calculate missing number

Here's the exercise:
You have been given a list of sequential numbers from 1 to 10,000, but
they are all out of order; furthermore, a single number is missing
from the list. The object of the task is to find out which number is
missing.
The strategy for this problem is to sum the elements in the array, sum the range 1 to 10,000, and subtract the first sum from the second. The difference is the missing number. The sum of the range 1..n is n(n+1)/2.
This is my current approach:
def missing_number(array)
sum = 0
array.each do |element|
sum += element
end
((10000*10001)/2) - sum
end
Where I am getting tripped up is the output when I input an array such as this:
puts missing_number(*1..10000) #=> 0
Why does this happen?
Thanks!
No need to sort the array. An array of length N is supposed to have all but one of the numbers 1..(N+1) so the array length + 1 is the basis for figuring out what the grand_sum would be if all values were there.
def missing_number(array)
grand_sum = (array.length + 1) * (array.length + 2) / 2
grand_sum - array.inject(:+)
end
ADDENDUM
This method takes an array as an argument, not a range. You can't use a range directly because there wouldn't be a missing value. Before calling the method you need some mechanism for generating an array which meets the problem description. Here's one possible solution:
PROBLEM_SIZE = 10_000
# Create an array corresponding to the range
test_array = (1..PROBLEM_SIZE).to_a
# Target a random value for deletion -- rand(N) generates values in
# the range 0..N-1, inclusive, so add 1 to shift the range to 1..N
target_value = rand(PROBLEM_SIZE) + 1
# Delete the value and print so we can check the algorithm
printf "Deleting %d from the array\n", test_array.delete(target_value)
# Randomize the order of remaining values, as per original problem description
test_array.shuffle!
# See what the missing_number() method identifies as the missing number
printf "Algorithm identified %d as the deleted value\n", \
missing_number(test_array)
An alternative approach, if performance is not critical, chosen for its readability (note that it returns the missing number inside a one-element array):
def missing_number(array)
(1..10_000).to_a - array
end
Instead of *1..10000, the argument should be (1..10000).to_a.
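The difference between the two call styles can be seen with a pair of toy methods. The names count_arguments and takes_one_array are hypothetical, and a small range is used to keep the example short.

```ruby
# Accepts any number of arguments and reports how many it received.
def count_arguments(*args)
  args.size
end

# Accepts exactly one argument, like missing_number(array).
def takes_one_array(array)
  array.size
end

p count_arguments(*(1..5))      # the splat spreads the range into separate arguments
p takes_one_array((1..5).to_a)  # to_a passes a single array argument

begin
  takes_one_array(*(1..5))      # five arguments for a one-parameter method
rescue ArgumentError => error
  puts "ArgumentError: #{error.message}"
end
```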
You shouldn't be using *1..10000; this just expands to 10,000 separate arguments. Passing (1..10000).to_a will return zero because no element is missing from 1..10000; you need to remove one. Below is some code with a detailed explanation.
def missing_number array
# put elements in order
array.sort!
# get value of last array element
last = array[-1]
# compute the expected total of the numbers
# 1 through last: n(n + 1)/2
# (multiply before dividing so integer division does not truncate when last is odd)
expected = (last + 1) * last / 2
# actual sum
actual = array.inject{|sum,x| sum + x}
# find missing number by subtracting
(expected - actual)
end
test = (1..10000).to_a
test.delete 45
puts "Missing number is: #{missing_number(test)}"

Resources