Can you search an array by keyword? - ruby

How do you take an array like ["foo",1,2,3] and turn it into something that can quickly be searched by keyword "foo"?
I'm trying to take a csv file, and sort/filter it based on a condition. For example, given the following csv and criteria:
foo,bar,foobar
1,2,3
4,5,6
7,8,9
@criteria = ["foobar", "foo"]
the output should be the following (order is important):
foobar,foo
3,1
6,4
9,7
I'm using a nested loop to check every item in @criteria against the first element (index 0) of every column of the csv.
require 'csv'
@criteria = ["foobar", "foo"]
@newcsv = []
csv = CSV.read("./foo.csv", headers: true, return_headers: false)
csv = csv.to_a.transpose
@criteria.each do |n|
  csv.each do |i|
    if i[0] == n
      @newcsv.push(i)
    end
  end
end
@newcsv = @newcsv.transpose
CSV.open("./transpose.csv", "wb") do |lines|
  @newcsv.each { |line| lines << line }
end
It works on small matrices, but I'm sure it won't scale. I'm wondering if a hash might give me better performance. How can I get only the columns named in @criteria without using a nested loop?

So this answer was posted by another user and later deleted because he or she "hated it", but I think it at least adds some useful information to the original poster, so I'm reposting it here.
Note that I'm not sure if this code has asymptotic performance that's faster than O(n^2) for an n * n matrix, but the original author disagreed with me. Here, at least, is my reasoning:
If you have an n * n matrix, and you have n - 1 criteria, then wouldn't creating indices take in the worst-case n-1 + n-2 + .. + 2 + 1 = O(n^2) steps, depending on how the criteria and columns of the matrix are sorted?
And then you still end up needing to collect n(n - 1) cells, even if it is by constant-time array index access.
That was my reasoning at least. Maybe I am wrong. If I am, please explain how so, and what the correct asymptotic runtime complexity of the code below is!
Answer from Original Author
Scanning an array for an element is inefficient, but once you have an index, looking up the element at that index is fast.
Given the header line header = ["foo", "bar", "foobar"] and @criteria = ["foobar", "foo"], you can convert them into indices:
indices = @criteria.map{|column| header.index(column)}
# => [2, 0]
Then, using indices, you can map the rows:
[
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
]
.map{|row| row.values_at(*indices)}
which gives:
[
[3, 1],
[6, 4],
[9, 7],
]
This way, most of the computational cost lies in creating indices, which is done only once and takes negligible time. Everything else is element lookup by index, which is cheap, contrary to what a commenter suggests.
Here is some example code using the above methods:
require 'csv'
@criteria = ['foobar', 'foo']
table = CSV.read('./foo.csv', headers: true)
indices = @criteria.map { |column| table.headers.index(column) }
table.map { |row| row.values_at(*indices) }
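A minimal sketch of the same idea, assuming the CSV text is already in memory. The helper name pick_columns and the inline sample data are made up for illustration. It builds a header-to-position hash once, so each criterion costs one hash lookup rather than a scan of the header. Cells come back as strings because CSV.parse does not convert them by default.

```ruby
require 'csv'

# Hypothetical helper: select and reorder CSV columns by header name.
def pick_columns(csv_text, criteria)
  header, *body = CSV.parse(csv_text)
  # Map each header name to its column position, e.g. {"foo"=>0, "bar"=>1, "foobar"=>2}
  position = header.each_with_index.to_h
  # fetch raises KeyError if a criterion is not a column name
  indices = criteria.map { |name| position.fetch(name) }
  [criteria] + body.map { |row| row.values_at(*indices) }
end

csv_text = "foo,bar,foobar\n1,2,3\n4,5,6\n7,8,9\n"
result = pick_columns(csv_text, %w[foobar foo])
result.each { |row| puts row.join(',') }
```

Building the hash is one pass over the header, and every row is then handled by values_at, so the total work is proportional to the number of cells actually copied.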

Related

Efficient way of removing similar arrays in an array of arrays

I am trying to analyze some documents and find similarities in them. After analysis, I have an array, the elements of which are arrays of data from documents considered similar. But sometimes I have two almost similar elements, and naturally I want to leave the biggest of them. For simplification:
data = [[1,2,3,4,5,6], [7,8,9,10], [1,2,3,5,6]...]
How do I efficiently process the data that I get:
data = [[1,2,3,4,5,6], [7,8,9,10]...]
I suppose I could intersect every array, and if the intersected array matches one of the original arrays - I ignore it. Here is a quick code I wrote:
data = [[1,2,3,4,5,6], [7,8,9,10], [1,2,3,5,6], [7,9,10]]
cleaned = []
data.each_index do |i|
similar = false
data.each_index do |j|
if i == j
next
elsif data[i]&data[j] == data[i]
similar = true
break
end
end
unless similar
cleaned << data[i]
end
end
puts cleaned.inspect
Is this an efficient way to go? Also, the current behaviour only drops arrays that are a few elements short of another one, and I might want to merge similar arrays if they occur:
[[1,2,3,4,5], [1,3,4,5,6]] => [[1,2,3,4,5,6]]
You can delete any element in the list if it is fully contained in another element:
data.delete_if do |arr|
data.any? { |a2| !a2.equal?(arr) && arr - a2 == [] }
end
# => [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10]]
This is a bit more efficient than your suggestion since once you decide that an element should be removed, you don't check against it in the next iterations.
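Another way to sketch the containment check is with Set#subset?, processing the longest arrays first so that each array is only compared against arrays already kept. The method name drop_contained is hypothetical. Note that the result comes out ordered by decreasing size rather than in the original order, and of two identical arrays only one is kept.

```ruby
require 'set'

# Hypothetical helper: keep only arrays that are not fully contained in a larger kept array.
def drop_contained(data)
  kept = []
  data.sort_by { |arr| -arr.size }.each do |arr|
    candidate = arr.to_set
    # Skip the candidate if every one of its elements already appears in a kept set
    kept << candidate unless kept.any? { |big| candidate.subset?(big) }
  end
  kept.map(&:to_a)
end

data = [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10], [1, 2, 3, 5, 6], [7, 9, 10]]
cleaned = drop_contained(data)
puts cleaned.inspect
```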

The most idiomatic way to iterate through a Ruby array, exiting when an arbitrary condition met?

I want to iterate through an array, each element of which is an array of two integers (e.g. [3,5]); for each of these elements, I want to calculate the sum of the two integers, exiting the loop when any of these sums exceeds a certain arbitrary value. The source array is quite large, and I will likely find the desired value near the beginning, so looping through all of the unneeded elements is not a good option.
I have written three loops to do this, all of which produce the desired result. My question is: which is more idiomatic Ruby? Or, better yet, is there a better way? I try not to use non-local loop variables, but break statements look kind of hackish to my (admittedly novice) eye.
# Loop A
pairs.each do |pair|
pair_sum = pair.inject(:+)
arr1 << pair_sum
break if pair_sum > arr2.max
end
#Loop B - (just A condensed)
pairs.each { |pair| arr1.last <= arr2.max ? arr1 << pair.inject(:+) : break }
#Loop C
i = 0
pair_sum = 0
begin
pair_sum = pairs[i].inject(:+)
arr1 << pair_sum
i += 1
end until pair_sum > arr2.max
A similar question was asked at escaping the .each { } iteration early in Ruby, but the responses were essentially that, while using .each or .each_with_index and exiting with break when the target index was reached would work, .take(num_elements).each is more idiomatic. In my situation, however, I don't know in advance how many elements I'll have to iterate through, presenting me with what appears to be a boundary case.
This is from a project Euler-type problem I've already solved, btw. Just wondering about the community-preferred syntax. Thanks in advance for your valuable time.
take and drop have a variant take_while and drop_while where instead of providing a fixed number of elements you provide a block. Ruby will accumulate values from the receiver (in the case of take_while) as long as the block returns true. Your code could be rewritten as
array.take_while {|pair| pair.sum < foo}.map(&:sum)
This does mean that you calculate the sum of some of these pairs twice.
In Ruby 2.0 there's Enumerable#lazy which returns a lazy enumerator:
sums = pairs.lazy.map { |a, b| a + b }.take_while { |pair_sum| pair_sum < some_max_value }.force
This avoids calculating the sums twice.
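As a small worked sketch of the lazy version (the sample pairs and the limit value are made up, with limit standing in for arr2.max): take_while stops before the first sum that exceeds the limit, whereas Loop A in the question also keeps that sum, so a plain each with break is shown alongside for comparison.

```ruby
pairs = [[1, 2], [3, 4], [5, 6], [7, 8]] # sample data, not from the question
limit = 10                               # stands in for arr2.max

# Lazy pipeline: each sum is computed once, and iteration stops at the first sum above the limit.
below_limit = pairs.lazy.map { |a, b| a + b }.take_while { |sum| sum <= limit }.to_a

# Variant that also keeps the first sum exceeding the limit, like Loop A in the question.
through_limit = []
pairs.each do |a, b|
  through_limit << a + b
  break if through_limit.last > limit
end

puts below_limit.inspect
puts through_limit.inspect
```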
[[1, 2], [3, 4], [5, 6]].find{|x, y| x + y > 6}
# => [3, 4]
[[1, 2], [3, 4], [5, 6]].find{|x, y| x + y > 6}.inject(:+)
#=> 7

Code to write a random array of x numbers with no duplicates [duplicate]

This is what I have so far:
myArray.map!{ rand(max) }
Obviously, however, sometimes the numbers in the list are not unique. How can I make sure my list only contains unique numbers without having to create a bigger list from which I then just pick the n unique numbers?
Edit:
I'd really like to see this done w/o loop - if at all possible.
(0..50).to_a.sort{ rand() - 0.5 }[0...x]
(0..50).to_a can be replaced with any array.
0 is "minvalue", 50 is "max value"
x is "how many values i want out"
Of course, x cannot be greater than max - min :)
To expand on how this works:
(0..5).to_a ==> [0,1,2,3,4,5]
[0,1,2,3,4,5].sort{ -1 } ==> [0, 1, 2, 4, 3, 5] # constant
[0,1,2,3,4,5].sort{ 1 } ==> [5, 3, 0, 4, 2, 1] # constant
[0,1,2,3,4,5].sort{ rand() - 0.5 } ==> [1, 5, 0, 3, 4, 2 ] # random
[1, 5, 0, 3, 4, 2 ][ 0..2 ] ==> [1, 5, 0 ]
Footnotes:
It is worth mentioning that when this question was originally answered, in September 2008, Array#shuffle was either not available or not yet known to me, hence the approximation with Array#sort.
And there's a barrage of suggested edits to this as a result.
So:
.sort{ rand() - 0.5 }
can be expressed better, and more briefly, on modern Ruby implementations using
.shuffle
Additionally,
[0...x]
can be written more obviously with Array#take as:
.take(x)
Thus, the easiest way to produce a sequence of unique random numbers on a modern Ruby is:
(0..50).to_a.shuffle.take(x)
This uses Set:
require 'set'
def rand_n(n, max)
randoms = Set.new
loop do
randoms << rand(max)
return randoms.to_a if randoms.size >= n
end
end
Ruby 1.9 offers the Array#sample method which returns an element, or elements randomly selected from an Array. The results of #sample won't include the same Array element twice.
(1..999).to_a.sample 5 # => [389, 30, 326, 946, 746]
When compared to the to_a.sort_by approach, the sample method appears to be significantly faster. In a simple scenario I compared sort_by to sample, and got the following results.
require 'benchmark'
range = 0...1000000
how_many = 5
Benchmark.realtime do
range.to_a.sample(how_many)
end
=> 0.081083
Benchmark.realtime do
(range).sort_by{rand}[0...how_many]
end
=> 2.907445
Just to give you an idea about speed, I ran four versions of this:
Using Sets, like Ryan's suggestion.
Using an Array slightly larger than necessary, then doing uniq! at the end.
Using a Hash, like Kyle suggested.
Creating an Array of the required size, then sorting it randomly, like Kent's suggestion (but without the extraneous "- 0.5", which does nothing).
They're all fast at small scales, so I had them each create a list of 1,000,000 numbers. Here are the times, in seconds:
Sets: 628
Array + uniq: 629
Hash: 645
fixed Array + sort: 8
And no, that last one is not a typo. So if you care about speed, and it's OK for the numbers to be integers from 0 to whatever, then my exact code was:
a = (0...1000000).sort_by{rand}
Yes, it's possible to do this without a loop and without keeping track of which numbers have been chosen. It's called a Linear Feedback Shift Register: Create Random Number Sequence with No Repeats
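To illustrate the idea, here is a minimal sketch of a 4-bit Fibonacci LFSR. The register width, the tap positions and the method name lfsr4_sequence are choices made for this example, not taken from the linked answer. With these taps the register steps through all 15 non-zero states before repeating, so the output is a fixed-order sequence of the numbers 1 to 15 with no repeats. Wider ranges need a wider register with suitably chosen taps.

```ruby
# 4-bit Fibonacci LFSR: returns the 15 states visited, starting from the seed.
def lfsr4_sequence(seed = 1)
  raise ArgumentError, "seed must be in 1..15" unless (1..15).cover?(seed)

  state = seed
  Array.new(15) do
    value = state
    bit = (state ^ (state >> 1)) & 1   # XOR of the two lowest bits
    state = (state >> 1) | (bit << 3)  # shift right, feed the new bit in at the top
    value
  end
end

sequence = lfsr4_sequence(1)
puts sequence.inspect
# => [1, 8, 4, 2, 9, 12, 6, 11, 5, 10, 13, 14, 15, 7, 3]
```

The sequence is deterministic for a given seed, so it is only "random-looking". It never produces 0, because the all-zero state would map to itself.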
[*1..99].sample(4) #=> [64, 99, 29, 49]
According to Array#sample docs,
The elements are chosen by using random and unique indices
If you need SecureRandom (which uses computer noise instead of pseudorandom numbers):
require 'securerandom'
[*1..99].sample(4, random: SecureRandom) #=> [2, 75, 95, 37]
How about a play on this? Unique random numbers without needing to use Set or Hash.
x = 0
(1..100).map { |iter| x += rand(1..100) }.shuffle # each step is at least 1, so no value repeats
You could use a hash to track the random numbers you've used so far:
seen = {}
max = 100
(1..10).map { |n|
  x = rand(max)
  while (seen[x])
    x = rand(max)
  end
  seen[x] = true # remember this value so it is not picked again
  x
}
Rather than add the items to a list/array, add them to a Set.
If you have a finite list of possible random numbers (i.e. 1 to 100), then Kent's solution is good.
Otherwise there is no other good way to do it without looping. The problem is you MUST do a loop if you get a duplicate. My solution should be efficient and the looping should not be too much more than the size of your array (i.e. if you want 20 unique random numbers, it might take 25 iterations on average.) Though the number of iterations gets worse the more numbers you need and the smaller max is. Here is my above code modified to show how many iterations are needed for the given input:
require 'set'
def rand_n(n, max)
  randoms = Set.new
  i = 0
  loop do
    i += 1 # count every attempt, including the last one
    randoms << rand(max)
    break if randoms.size >= n
  end
  puts "Took #{i} iterations for #{n} random numbers to a max of #{max}"
  return randoms.to_a
end
I could write this code to LOOK more like Array.map if you want :)
Based on Kent Fredric's solution above, this is what I ended up using:
def n_unique_rand(number_to_generate, rand_upper_limit)
return (0..rand_upper_limit - 1).sort_by{rand}[0..number_to_generate - 1]
end
Thanks Kent.
No loops with this method (note that it does not guarantee unique values):
Array.new(size) { rand(max) }
require 'benchmark'
max = 1000000
size = 5
Benchmark.realtime do
Array.new(size) { rand(max) }
end
=> 1.9114e-05
Here is one solution:
Suppose you want these random numbers to be between r_min and r_max. For each element in your list, generate a random number r, and set list[i] = list[i-1] + r. This gives you random numbers that are strictly increasing, which guarantees uniqueness provided that
r + list[i-1] does not overflow
r > 0
For the first element, you would use r_min instead of list[i-1]. Once you are done, you can shuffle the list so the elements are not so obviously in order.
The only problem with this method is when you go over r_max and still have more elements to generate. In this case, you can reset r_min and r_max to two adjacent elements you have already computed, and simply repeat the process. This effectively runs the same algorithm over an interval where there are no numbers already used. You can keep doing this until you have the list populated.
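The increasing-steps idea can be sketched as follows. This is a variant rather than the exact procedure described: instead of resetting r_min and r_max when the list overshoots, it caps each step so that enough room always remains for the elements still to come. The method name increasing_unique and the rng parameter are assumptions made for this example, and the resulting values are not uniformly distributed over the range.

```ruby
# Hypothetical helper: n unique integers in (r_min, r_max], built from positive random steps.
def increasing_unique(n, r_min, r_max, rng: Random.new)
  raise ArgumentError, "range too small for #{n} values" if r_max - r_min < n

  prev = r_min
  list = Array.new(n) do |i|
    remaining = n - i - 1           # elements still to place after this one
    room = r_max - prev - remaining # largest step that leaves space for the rest
    prev += rng.rand(1..room)       # step is at least 1, so values strictly increase
  end
  list.shuffle(random: rng)         # hide the increasing order
end

numbers = increasing_unique(10, 0, 100, rng: Random.new(42))
puts numbers.inspect
```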
If you know the maximum value in advance, you can do it this way:
class NoLoopRand
  def initialize(max)
    @deck = (0..max).to_a
  end
  def getrnd
    # rand(@deck.length) can pick any remaining position, including the last
    return @deck.delete_at(rand(@deck.length))
  end
end
and you can obtain random data in this way:
aRndNum = NoLoopRand.new(10)
puts aRndNum.getrnd
You'll obtain nil once all the values in the deck are exhausted.
Method 1
Using Kent's approach, it is possible to generate an array of arbitrary length keeping all values in a limited range:
# Generates a random array of length n.
#
# @param n length of the desired array
# @param lower minimum number in the array
# @param upper maximum number in the array
def ary_rand(n, lower, upper)
values_set = (lower..upper).to_a
repetition = n/(upper-lower+1) + 1
(values_set*repetition).sample n
end
Method 2
Another, possibly more efficient, method, adapted from another of Kent's answers:
def ary_rand2(n, lower, upper)
v = (lower..upper).to_a
(0...n).map{ v[rand(v.length)] }
end
Output
puts (ary_rand 5, 0, 9).to_s # [0, 8, 2, 5, 6] expected
puts (ary_rand 5, 0, 9).to_s # [7, 8, 2, 4, 3] different result for same params
puts (ary_rand 5, 0, 1).to_s # [0, 0, 1, 0, 1] repeated values from limited range
puts (ary_rand 5, 9, 0).to_s # [] no such range :)

Get index of array element faster than O(n)

Given I have a HUGE array and a value from it, I want to get the index of the value in the array. Is there any other way to get it, rather than calling Array#index? The problem comes from the need to keep a really huge array and call Array#index an enormous number of times.
After a couple of tries I found that caching indexes inside elements, by storing structs with (value, index) fields instead of the value itself, gives a huge step in performance (a 20x win).
Still, I wonder if there's a more convenient way of finding the index of an element without caching (or whether there's a good caching technique that will boost performance).
Why not use index or rindex?
array = %w( a b c d e)
# get FIRST index of element searched
puts array.index('a')
# get LAST index of element searched
puts array.rindex('a')
index: http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-index
rindex: http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-rindex
Convert the array into a hash. Then look for the key.
array = ['a', 'b', 'c']
hash = Hash[array.map.with_index.to_a] # => {"a"=>0, "b"=>1, "c"=>2}
hash['b'] # => 1
Other answers don't take into account the possibility of an entry listed multiple times in an array. This will return a hash where each key is a unique object in the array and each value is an array of indices that corresponds to where the object lives:
a = [1, 2, 3, 1, 2, 3, 4]
=> [1, 2, 3, 1, 2, 3, 4]
indices = a.each_with_index.inject(Hash.new { Array.new }) do |hash, (obj, i)|
hash[obj] += [i]
hash
end
=> { 1 => [0, 3], 2 => [1, 4], 3 => [2, 5], 4 => [6] }
This allows for a quick search for duplicate entries:
indices.select { |k, v| v.size > 1 }
=> { 1 => [0, 3], 2 => [1, 4], 3 => [2, 5] }
Is there a good reason not to use a hash? Lookups are O(1) vs. O(n) for the array.
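A minimal sketch of the hash approach for the repeated-lookup case, assuming the array does not change between lookups (the variable names and sample data are illustrative). The hash is built in one pass and keeps the first position of each value, matching what Array#index would return.

```ruby
array = %w[a b c b]

# Build the value => first-index table once: O(n)
first_index = {}
array.each_with_index { |value, i| first_index[value] ||= i }

# Each later lookup is a hash access instead of a scan
puts first_index['b'].inspect # => 1
puts first_index['z'].inspect # => nil
```

If the array is modified afterwards, the table has to be rebuilt or updated, which is the cost of this kind of caching.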
If your array has a natural order use binary search.
Use binary search.
Binary search has O(log n) access time.
Here are the steps for using binary search:
What is the ordering of your array? For example, is it sorted by name?
Use bsearch to find elements or indices
Code example
# assume array is sorted by name!
array.bsearch { |each| "Jamie" <=> each.name } # returns element
(0...array.size).bsearch { |n| "Jamie" <=> array[n].name } # returns index
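On Ruby 2.3 and later, Array#bsearch_index returns the position directly. A small sketch with made-up, already sorted data: the block returns 0 on a match, a negative number when the target sorts before the element, and a positive number when it sorts after.

```ruby
names = %w[Alice Bob Jamie Zoe] # must already be sorted

found = names.bsearch_index { |name| "Jamie" <=> name }
missing = names.bsearch_index { |name| "Carl" <=> name }

puts found.inspect   # => 2
puts missing.inspect # => nil
```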
If it's a sorted array you could use a binary search algorithm (O(log n)). For example, extending the Array class with this functionality:
class Array
  def b_search(value, lower_index = 0, upper_index = length - 1)
    return if lower_index > upper_index
    midpoint_index = (lower_index + upper_index) / 2
    return midpoint_index if self[midpoint_index] == value
    if value < self[midpoint_index]
      # halve the range: continue in the part left of the midpoint
      b_search(value, lower_index, midpoint_index - 1)
    else
      # halve the range: continue in the part right of the midpoint
      b_search(value, midpoint_index + 1, upper_index)
    end
  end
end
Taking a combination of @sawa's answer and the comment listed there you could implement a "quick" index and rindex on the Array class.
class Array
  def quick_index el
    # Hash[] keeps the last pair for a duplicate key, so index the reversed array to find the first occurrence
    hash = Hash[self.reverse.map.with_index.to_a]
    hash[el] && length - 1 - hash[el]
  end
  def quick_rindex el
    hash = Hash[self.map.with_index.to_a]
    hash[el]
  end
end
Still I wonder if there's a more convenient way of finding index of an element without caching (or there's a good caching technique that will boost up the performance).
You can use binary search (if your array is ordered and the values you store in the array are comparable in some way). For that to work you need to be able to tell the binary search whether it should be looking "to the left" or "to the right" of the current element. But I believe there is nothing wrong with storing the index at insertion time and then using it if you are getting the element from the same array.
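Storing the index at insertion time could look like the sketch below. The class name IndexedList and its methods are hypothetical, and it assumes an append-only list; removing or reordering elements would require updating the table as well.

```ruby
# Hypothetical append-only list that records each value's first position as it is added.
class IndexedList
  def initialize
    @items = []
    @index_of = {}
  end

  def push(value)
    @index_of[value] ||= @items.size # remember only the first position of a value
    @items << value
    self
  end

  # Constant-time on average, instead of scanning @items
  def index(value)
    @index_of[value]
  end

  def [](position)
    @items[position]
  end
end

list = IndexedList.new
%w[x y z y].each { |value| list.push(value) }
puts list.index('y').inspect # => 1
```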
