I'd like to be able to find the 10 most common questions in an array of 300-500 strings, in Ruby.
An example element being
["HI, I'd like your product. I just have one question. How do I change
my password?", "Can someone tell me how I change my password?", "I
can't afford this. How do I cancel my account?", "Account
cancelation?", "I forgot my password, how do I change my password?",
.....]
Basically, I'm going to have an array of a lot of strings, and I have to extract the questions and find the 10 most common ones from that array.
I've tried looking around (checked out n-grams, but it didn't seem too relevant) and have yet to come up with any ideas.
Do you know of any algorithms you'd suggest I take a look at? A link to a couple examples would be terrific!
I would say the first step would be to actually determine which strings (or substrings) are questions. A no-brainer approach would be to look for "?", but depending on your requirements you could enhance that - maybe also look out for "question words". That is probably the easier part of your task.
Once you get a list of strings that are supposedly questions, you need to cluster similar ones and return the 10 largest bins. The best way would be to combine a semantic and a syntax-based approach. You could have a look at this paper, as it seems to tackle the problem of finding similarities between two strings. It presents some compelling reasons why a dual syntactic-semantic approach is required.
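As a rough illustration (not the paper's method, and with an arbitrary similarity threshold you would need to tune), you could filter for question-like strings and then greedily cluster them by word-overlap (Jaccard) similarity; the sample strings below are made up:

```ruby
require 'set'

# Jaccard similarity between the word sets of two strings (0.0..1.0)
def jaccard(a, b)
  wa = Set.new(a.downcase.scan(/\w+/))
  wb = Set.new(b.downcase.scan(/\w+/))
  union = wa | wb
  return 0.0 if union.empty?
  (wa & wb).size.to_f / union.size
end

strings = [
  "HI, I'd like your product. How do I change my password?",
  "how do i change my password?",
  "Account cancelation?",
  "Thanks, great product."
]

# Step 1: keep only question-like strings (naive "?" check)
questions = strings.select { |s| s.include?('?') }

# Step 2: greedy clustering - join the first cluster whose
# representative is similar enough, else start a new cluster
clusters = []
questions.each do |q|
  cluster = clusters.find { |c| jaccard(c.first, q) > 0.5 }
  cluster ? cluster << q : clusters << [q]
end

# Step 3: the biggest clusters are the most common questions
top = clusters.sort_by { |c| -c.size }.first(10)
top.each { |c| puts "#{c.size}x: #{c.first}" }
```

Greedy single-pass clustering like this is order-sensitive and crude, but for a few hundred strings it's cheap and easy to reason about.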
Not sure about special algorithms, but if I were assigned this task:
array = ["my account is locked.", "can i have the account password to my account?", "what's my password?"]
array.map! { |x| x.split(' ') } # split each sentence into an array of words
word_freq = Hash.new(0)
array.each do |words|
  words.each { |word| word_freq[word] += 1 }
end
word_freq.each { |word, count| puts "#{word} appears #{count} times" } # words are now keys with frequency values
print word_freq.keys # an array of key words to mess with
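From a frequency hash like that, the most common entries can be pulled out with max_by, which accepts a count argument (Ruby 2.2+) and returns the N largest, sorted in descending order. The sample hash below is made up:

```ruby
# A hypothetical word-frequency hash, as built above
word_freq = { "account" => 3, "password" => 2, "locked" => 1 }

# The two most frequent words, as [word, count] pairs
top2 = word_freq.max_by(2) { |_word, count| count }
p top2  # => [["account", 3], ["password", 2]]
```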
Related
I'm learning Ruby from Chris Pine's "Learn To Program" book and I've been asked to write a method that sorts a set of given words in alphabetical order either with loops or recursion. I first gave looping a try.
def sort words
  i = 0
  checked = 0
  while true
    if (i + 1 < words.length)
      if (words[i] > words[i + 1])
        temp = words[i]
        words[i] = words[i + 1]
        words[i + 1] = temp
      else
        checked += 1
      end
      i += 1
    elsif (checked == words.length - 1)
      break
    else
      i = 0
      checked = 0
    end
  end
  return words
end
The code works, but I wanted to see if any seasoned Rubyists could offer some input on how to make it more efficient.
Thank You!
The first thing to learn when you're beginning to understand optimization is that the most obvious fixes are often the least productive. For example, you could spend a lot of time here tweaking some of these comparisons or switching to a slightly different way of evaluating the same thing and get a 5-10% performance increase.
You could also use a completely different algorithm and get a 5x-10x increase. Bubble sort, which is what you have here, is nearly the worst-performing sorting algorithm ever made. It's a technique you should learn if only to understand that it's terrible, and you should immediately move on to other methods, like Quicksort, which is not all that hard to implement if you approach the problem systematically.
So in other words, before you start tweaking little things, step back and ask yourself "Am I approaching this problem the right way?" Always consider other angles when you have a performance problem.
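For instance, here is a minimal recursive quicksort sketch (not in-place, first element as pivot - fine for illustration, not a production implementation):

```ruby
# Functional-style quicksort: take the first element as the pivot,
# partition the rest into smaller and greater-or-equal halves,
# and recurse on each half.
def quicksort(arr)
  return arr if arr.length <= 1
  pivot, *rest = arr
  left  = rest.select { |x| x < pivot }
  right = rest.select { |x| x >= pivot }
  quicksort(left) + [pivot] + quicksort(right)
end

p quicksort(%w[ d a c b e ])  # => ["a", "b", "c", "d", "e"]
```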
That being said, here's how to make your code more Ruby-like:
def sort(words)
  # Make a copy so the original isn't mangled
  words = words.dup

  # Iterate over ranges:
  #   (n..m)  goes from N to M inclusive
  #   (n...m) goes from N up to but not including M
  (0...words.length - 1).each do |i|
    (0...words.length - 1 - i).each do |j|
      # Examine the pair of words at this offset using an array slice
      a, b = words[j, 2]

      # If A is ahead of B then...
      if (a > b)
        # ...swap these elements.
        words[j, 2] = [ b, a ]
      end
    end
  end

  words
end
# Quick test function that uses randomized data
p sort(%w[ a c d f b e ].shuffle)
To improve as a developer you should always try to measure your progress somehow. Tools like RuboCop will help identify inefficient coding practices. Test-driven development can help to identify flaws early in your programming and to make sure that changes don't cause regressions. Benchmarking tools help you better understand the performance of your code.
For example:
require 'benchmark'

CHARS = ('a'..'z').to_a

def random_data
  Array.new(1000) { CHARS.sample }
end

count = 100

Benchmark.bm do |bm|
  bm.report('my sort:') do
    count.times do
      sort(random_data)
    end
  end

  bm.report('built-in sort:') do
    count.times do
      random_data.sort
    end
  end
end
# user system total real
# my sort: 19.220000 0.060000 19.280000 ( 19.358073)
# built-in sort: 0.030000 0.000000 0.030000 ( 0.025662)
So this algorithm is 642x slower than the built-in method. I'm sure you can get a lot closer with a better algorithm.
Firstly, you don't have to reinvent the wheel. I mean, see this example:
> ['a', 'abc', 'bac', 'cad'].sort
# => ["a", "abc", "bac", "cad"]
Ruby has an extensive set of libraries, and common tasks are supported very efficiently. You just have to know the language well enough to use its features effectively.
I would recommend going through the Ruby core libraries and learning to combine their features to achieve what you need.
Give Ruby Koans a try: http://rubykoans.com/
It is one of the most effective ways to achieve mastery of the Ruby language.
Here is a list of sorting algorithm examples, by type, on this site: https://www.sitepoint.com/sorting-algorithms-ruby/
You have to choose between algorithms wisely, based on the size of the problem domain and your use cases.
In Ruby, suppose we have a 2-dimensional array, why is this syntax fine:
array.each do |x|
x.each do |y|
puts y
end
end
But this is not:
array.each{|x|.each{|y| puts y}}
Any ideas? Thanks
This should be fine: array.each{|x| x.each{|y| puts y}}
You forgot to reference x first.
That is, . is a left-associative operator: if there is nothing on its left side, that is an error.
If you replace your do...end blocks with {...} carefully you'll find that your second form works the same as your first. But puts array accomplishes the same thing as this whole double loop.
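To spell that out with a small hypothetical array:

```ruby
array = [[1, 2], [3, 4]]

# do...end form
array.each do |x|
  x.each do |y|
    puts y
  end
end

# The same thing with braces - note that x must appear before .each
array.each { |x| x.each { |y| puts y } }

# And puts flattens nested arrays itself, printing one element per line
puts array
```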
If I may offer some polite meta-advice, your two Ruby questions today seem like you maybe were asked to do some things in a language you don't know, and are frustrated. This is understandable. But the good news is that, compared to many other languages, Ruby is built on a very small number of pieces. If you spend a little time getting really familiar with Array and Hash, you'll find the going much smoother thereafter.
I've studied the poignant guide and it really has helped me pick up the language pretty fast. After that, I started solving some coding puzzles using Ruby. It just helps a lot to get used to the language, I feel.
I'm stuck with one such puzzle. I solved it very easily since it is pretty straightforward, but the solution is being rejected (by the host website) with the error 'Time Exceeded'! I know that Ruby cannot compete with the speed of C/C++, but it has got to be able to answer a tiny puzzle on a website which accepts solutions in Ruby?
The puzzle is just a normal sort.
This is my solution
array ||= []
gets.to_i.times do
  array << gets
end
puts array.sort
My question is, is there any other way I can achieve high-speed sorting with Ruby? I'm using the basic Array#sort here, but is there a way to do it faster, even though it means lot more lines of code?
I've solved that problem, and let me tell you: passing it with an O(n log n) algorithm is almost impossible unless you are using a very optimized C/assembly version of it.
You need to explore other algorithms. Hint: an O(n) algorithm will do the trick, even in Ruby.
Good luck.
You're sorting strings when you should be sorting ints. Try:
array << gets.to_i
If there is no need for duplicate values to be repeated:
h = {}
gets.to_i.times{h[gets.to_i] = true}
(0..100000).each{|n| puts(n) if h[n]}
If duplicate values must be repeated:
h = Hash.new(0)
gets.to_i.times{h[gets.to_i] += 1}
(0..100000).each{|n| h[n].times{puts(n)}}
Words like "a", "the", "best", "kind". I am pretty sure there are good ways of achieving this.
Just to be clear, I am looking for
The simplest solution that can be implemented, preferably in Ruby.
I have a high tolerance for errors.
If a library of common phrases is what I need, I'm perfectly happy with that too.
These common words are known as "stop words" - there is a similar Stack Overflow question about this here: "Stop words" list for English?
To summarize:
If you have a large amount of text to deal with, it would be worth gathering statistics about the frequency of words in that particular data set, and taking the most frequent words for your stop word list. (That you include "kind" in your examples suggests to me that you might have quite an unusual set of data, e.g. with lots of colloquial expressions like "kind of", so perhaps you would need to do this.)
Since you say you don't mind much about errors, then it may be sufficient to just use a list of stop words for English that someone else has produced, e.g. the fairly long one used by MySQL or anything else that Google turns up.
If you just put these words into a hash in your program it should be easy to filter any list of words.
Common = %w{ a and or to the is in be }
Uncommon = %{
To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
}.split /\b/
ignore_me, result = {}, []
Common.each { |w| ignore_me[w.downcase] = :Common }
Uncommon.each { |w| result << w unless ignore_me[w.downcase[/\w*/]] }
puts result.join
, not : that question:
Whether 'tis nobler mind suffer
slings arrows of outrageous fortune,
take arms against sea of troubles,
by opposing end them? die: sleep;
No more; by sleep say we end
heart-ache thousand natural shocks
That flesh heir , 'tis consummation
Devoutly wish'd. die, sleep;
sleep: perchance dream: ay, there's rub;
For that sleep of death what dreams may come
This is a variation on DigitalRoss's answer.
str = <<EOF
To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
EOF
common = {}
%w{ a and or to the is in be }.each{|w| common[w] = true}
puts str.gsub(/\b\w+\b/){|word| common[word.downcase] ? '': word}.squeeze(' ')
Also relevant:
What's the fastest way to check if a word from one string is in another string?
Hold on, you need to do some research before you take out stopwords (aka noise words, junk words). Index size and processing resources aren't the only issues. A lot depends on whether end-users will be typing queries, or you will be working with long automated queries.
All search log analysis shows that people tend to type one to three words per query. When that's all a search has to work with, we can't afford to lose anything. For example, a collection might have the word "copyright" on many documents -- making it very common -- but if the word isn't in the index, it's impossible to do exact phrase searches or proximity relevance ranking. In addition, there are perfectly legitimate reasons to search for the most common words: people may be looking for "The Who", or worse, "The The".
So while there are technical issues to consider, and taking out stopwords is one solution, it may not be the right solution for the overall problem that you are trying to solve.
If you have an array of words to remove named stop_words, then you get the result from this expression:
description.scan(/\w+/).reject do |word|
  stop_words.include? word
end.join ' '
If you want to preserve the non-word characters between each word,
description.scan(/(\w+)(\W+)/).reject do |(word, other)|
  stop_words.include? word
end.flatten.join
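For example, with a hypothetical stop-word list and input string:

```ruby
stop_words = %w[ a the is of ]
description = "the cat is on a mat"

# Keep only the words that are not stop words, rejoined with spaces
result = description.scan(/\w+/).reject do |word|
  stop_words.include? word
end.join ' '

puts result  # prints "cat on mat"
```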
The Problem
I'm working on a problem that involves sharding. As part of the problem I need to find the fastest way to partition a large Ruby hash (> 200,000 entries) in two or more pieces.
Are there any non O(n) approaches?
Is there a non-Ruby i.e. C/C++ implementation?
Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.
My concern is that Ruby is too slow to do this kind of work.
The initial approach
This was the first solution I tried. What was appealing about it was:
it didn't need to loop slavishly across the hash
it didn't need to manage a counter to allocate the members evenly among the shards.
it's short and neat looking
Ok, it isn't O(n) but it relies on methods in the standard library which I figured would be faster than writing my own Ruby code.
pivot = s.size / 2
slices = s.each_slice(pivot)
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]
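A self-contained toy version of the same idea, using a made-up four-entry hash (each_slice over a Hash yields arrays of [key, value] pairs, which is why the flatten/Hash[] rebuild works):

```ruby
s = { a: 1, b: 2, c: 3, d: 4 }

pivot = s.size / 2
slices = s.each_slice(pivot)

# Each slice is an array of [key, value] pairs; flattening it gives
# the alternating key/value list that Hash[] expects.
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]

p s1
p s2
```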
A better solution
Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want: it looped over all of the members of the hash and evaluated a conditional as it went. But since he'd taken the time to do the evaluation, I figured that I should try a similar approach and benchmark it. This is my adapted version of his approach (my keys aren't numbers, so I can't take his approach verbatim):
def split_shard(s)
  shard1 = {}
  shard2 = {}

  t = Benchmark.measure do
    n = 0
    pivot = s.size / 2

    s.each_pair do |k, v|
      if n < pivot
        shard1[k] = v
      else
        shard2[k] = v
      end
      n += 1
    end
  end

  $b += t.real
  $e += s.size

  return shard1, shard2
end
The results
In both cases, a large number of hashes are split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.
My initial solution - which I assumed had to be faster because it uses methods in the standard library and minimises the amount of Ruby code (no loop, no conditional) - runs in just over 9s
Mark's approach runs in just over 5s
That's a significant win
Take away
Don't be fooled by 'intuition' - measure the performance of competing algorithms.
Don't worry about Ruby's performance as a language - my initial concern was that if I'm doing ten million of these operations, it could take a significant amount of time in Ruby, but it doesn't really.
Thanks to Mark and Mike who both get points from me for their help.
Thanks!
I don't see how you can achieve this using an unmodified "vanilla" Hash - I'd expect that you'd need to get into the internals in order to make partitioning into some kind of bulk memory-copying operation. How good is your C?
I'd be more inclined to look into partitioning instead of creating a Hash in the first place, especially if the only reason for the 200K-item Hash existing in the first place is to be subdivided.
EDIT: After thinking about it at the gym...
The problem with finding some existing solution is that someone else needs to have (a) experienced the pain, (b) had the technical ability to address it and (c) felt community-friendly enough to have released it into the wild. Oh, and for your OS platform.
What about using a B-Tree instead of a Hash? Hold your data sorted by key and it can be traversed by memcpy(). B-Tree retrieval is O(log N), which isn't much of a hit against Hash most of the time.
I found something here which might help, and I'd expect there'd only be a little duck-typing wrapper needed to make it quack like a Hash.
Still gonna need those C/C++ skills, though. (Mine are hopelessly rusty).
This probably isn't fast enough for your needs (which sound like they'll require an extension in C), but perhaps you could use Hash#select?
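A quick sketch of that idea with made-up integer keys (in Ruby 1.9+, Hash#select returns a Hash; its complement reject gives the other shard):

```ruby
s = { 1 => 'a', 2 => 'b', 3 => 'c', 4 => 'd' }
pivot = s.size / 2

# Each call walks the whole hash, so this is two full passes
s1 = s.select { |k, _v| k <= pivot }
s2 = s.reject { |k, _v| k <= pivot }

p s1
p s2
```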
I agree with Mike Woodhouse's idea. Is it possible for you to construct your shards at the same place where the original 200k-item hash is being constructed? If the items are coming out of a database, you could split your query into multiple disjoint queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
Additional
Hi Chris, I just compared your approach to using Hash#select:
require 'benchmark'

s = {}
1.upto(200_000) { |i| s[i] = i }

Benchmark.bm do |x|
  x.report {
    pivot = s.size / 2
    slices = s.each_slice(pivot)
    s1 = Hash[*slices.entries[0].flatten]
    s2 = Hash[*slices.entries[1].flatten]
  }
  x.report {
    s1 = s.select { |k, v| k < 100_001 }
    s2 = s.select { |k, v| k >= 100_001 }
  }
end
end
It looks like Hash#select is much faster, even though it goes through the entire large hash for each one of the sub-hashes:
# ruby test.rb
user system total real
0.560000 0.010000 0.570000 ( 0.571401)
0.320000 0.000000 0.320000 ( 0.323099)
Hope this helps.