Unexpected access performance differences between arrays and hashes - ruby

I have evaluated access times for a two-dimensional array, implemented as
an array of arrays
a hash of arrays
a hash with arrays as keys
My expectation was to see similar access times for all three. I also expected the measurements to yield similar results for MRI and JRuby.
However, for a reason that I do not understand, on MRI accessing elements within an array of arrays or within a hash of arrays is an order of magnitude faster than accessing elements of the flat hash (the hash with arrays as keys).
On JRuby, instead of being 10 times as expensive, hash access was about 50 times as expensive as with an array of arrays.
The results:
MRI (2.1.1):
                       user     system      total        real
hash of arrays:    1.300000   0.000000   1.300000 (  1.302235)
array of arrays:   0.890000   0.000000   0.890000 (  0.896942)
flat hash:        16.830000   0.000000  16.830000 ( 16.861716)

JRuby (1.7.5):
                       user     system      total        real
hash of arrays:    0.280000   0.000000   0.280000 (  0.265000)
array of arrays:   0.250000   0.000000   0.250000 (  0.182000)
flat hash:        77.450000   0.240000  77.690000 ( 75.156000)
Here are two of my benchmarks:
ary = (0...n).map { Array.new(n, 1) }

bm.report('array of arrays:') do
  iterations.times do
    (0...n).each { |x|
      (0...n).each { |y|
        v = ary[x][y]
      }
    }
  end
end

hash = {}
(0...n).each { |x|
  (0...n).each { |y|
    hash[[x, y]] = 1
  }
}

prepared_indices = (0...n).each_with_object([]) { |x, ind|
  (0...n).each { |y|
    ind << [x, y]
  }
}

bm.report('flat hash:') do
  iterations.times do
    prepared_indices.each { |i|
      v = hash[i]
    }
  end
end
All container elements are initialized with a numeric value and have the same total number of elements.
The arrays for accessing the hash are preinitialized in order to benchmark the element access only.
Here is the complete code
I have consulted this thread and this article but still have no clue about the unexpected performance differences.
Why are the results so different from my expectations? What am I missing?

Consider the memory layout of an array of arrays, say with dimensions 3x3... you've got something like this:
memory address         usage/content
base                   [0][0]
base + sizeof(int)     [0][1]
base + 2*sizeof(int)   [0][2]
base + 3*sizeof(int)   [1][0]
base + 4*sizeof(int)   [1][1]
...
Given an array of dimensions [M][N], all that's needed to access an element at indices [i][j] is to add the base memory address to the data element size times (i * N + j)... a tiny bit of simple arithmetic, and therefore extremely fast.
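For illustration, here's a tiny Ruby sketch of that row-major arithmetic (the names n, flat, i and j are made up for this example): a 2-D grid is emulated with one flat array, and element [i][j] is found by pure index arithmetic.
n = 3
flat = Array.new(n * n) { |k| k }  # conceptually a 3x3 grid stored row by row
i, j = 1, 2
p flat[i * n + j]                  # => 5, the element at row 1, column 2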
Hashes are far more complicated data structures and inherently slower. With a hash, you first need to hash the key (and the harder the hash function works to scatter even similar keys randomly across its output range, the slower it tends to be; if it doesn't make that effort, you get more collisions in the hash table and slower performance there). Then the hash value needs to be mapped onto the current hash table size (usually with "%"), and finally you need to compare keys to see whether you've found the hoped-for key, a colliding element, or an empty slot. It's a far more involved process than array indexing. You should probably do some background reading about hash functions and hash table implementations.
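To make those steps concrete, here is a toy Ruby sketch of a hash-table lookup with separate chaining (TABLE_SIZE, bucket_for, insert and lookup are invented for this illustration; real hash tables, including Ruby's, are far more sophisticated):
TABLE_SIZE = 8
table = Array.new(TABLE_SIZE) { [] }           # buckets

def bucket_for(key)
  key.hash % TABLE_SIZE                        # hash the key, map it onto the table size
end

def insert(table, key, value)
  table[bucket_for(key)] << [key, value]
end

def lookup(table, key)
  # scan the bucket, comparing keys to skip colliding entries
  pair = table[bucket_for(key)].find { |k, _| k == key }
  pair && pair.last
end

insert(table, [1, 2], 42)
p lookup(table, [1, 2])                        # => 42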
The reason hashes are so often useful is that the key doesn't need to be numeric (you can always work out some formula to generate a number from arbitrary key data) and the keys need not be near-contiguous for memory efficiency: a hash table with memory capacity for, say, 5 integers could happily store the keys 1, 1000 and 12398239, whereas an array indexed by those values would waste a lot of virtual address space on all the indices in between, which hold no data; and more data packed into a memory page means more cache hits.
Further, you should be careful with benchmarks: when you do clearly repetitive work with unchanging values and keep overwriting the same variable, an optimiser may elide it, and you may not be timing what you think you are. It's better to use some run-time inputs (e.g. storing different values in the containers), accumulate a dependent result (e.g. summing the accessed elements rather than overwriting a variable), and then output the result so any lazy evaluation is forced to complete. With JITs and VMs involved there can also be kinks in your benchmarks as compilation kicks in or branch prediction results are incorporated.
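As a minimal sketch of that advice, adapted to the question's setup (it assumes the same n, iterations and ary as above), you could accumulate a sum of the accessed elements and print it so the reads can't be optimised away:
require 'benchmark'

total = 0
puts Benchmark.measure {
  iterations.times do
    (0...n).each do |x|
      (0...n).each do |y|
        total += ary[x][y]   # accumulate a dependent result
      end
    end
  end
}
puts total                   # force the result to be used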

Related

Maintaining Ruby Set by object ID

I'm developing an algorithm in Ruby with the following properties:
It works on two objects of type Set, where each element is an Array whose elements are all Strings
Each Array involved has the same number of elements
No two Arrays happen to have the same content (when compared with ==)
The algorithm involves many operations of moving an Array from one Set to the other (or back), storing references to certain Arrays, and testing whether or not such a reference is a member of one of the Sets
There is no duplication of the Arrays; every Array keeps its object ID the whole time.
A naive implementation would do something like this (to give you the idea); in practice, the arrays here have longer strings and more elements:
require 'set'

# Set up all Arrays involved
master = [
  %w(a b c d),
  %w(a b c x),
  %w(u v w y),
  # .... and so on
]

# Create initial sets.
x = Set.new
y = Set.new
# ....
x.add(master[0])
x.add(master[2])
y.add(master[1])
# ....

# Operating on the sets.
i = 1
# ...
arr = master[i]
# Move element arr from y to x, if it is in y
if y.member?(arr)
  y.delete(arr)
  x.add(arr)
end

# Do something with the sets
x.each { |arr| pp arr }
This would indeed work, simply because the arrays all differ in content. However, testing for membership means that y.member?(arr) checks whether we already have an object with the same array content as arr in our Set, while it would be sufficient to check whether we already have an element with the same object_id in our Set, so I'm worried about performance. From my understanding, finding the object id of an object is cheap, and since it is just a number, maintaining a set of numbers is more performant than maintaining a set of arrays of strings.
Therefore I could try to define my two sets as sets of object_id, and the membership test would be faster. However, when iterating over such a Set, using the object_id to get back to the array itself is expensive (I would have to search ObjectSpace).
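For illustration, a minimal sketch of that idea, assuming the master array from above (ObjectSpace._id2ref is one way back from an id to the object, though hardly an elegant one):
ids = Set.new
ids.add(master[0].object_id)
ids.member?(master[0].object_id)       # cheap integer comparison
arr = ObjectSpace._id2ref(ids.first)   # recovering the Array is the awkward part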
Another possibility would be to not maintain the set of arrays, but the set of indexes into my master array. My code would then be, for example,
x.add(0) # instead of x.add(master[0])
and iterating over a Set would be, e.g.,
x.each { |i| puts master[i].pretty_print }
I wonder whether there is a better way - for instance that we can somehow "teach" Set.new to use object identity for maintaining its members, instead of equality.
I think you’re looking for Set#compare_by_identity, which makes the set compare its members by object identity (i.e. object ID) rather than by content.
x = Set.new
x.compare_by_identity
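A small sketch of the difference, assuming two distinct Array objects with equal content (the names a and b are invented for this example):
require 'set'

a = %w(a b c)
b = %w(a b c)                 # same content, different object_id

by_content = Set.new([a])
by_content.member?(b)         # => true  (compares by eql?/hash)

by_identity = Set.new.compare_by_identity
by_identity.add(a)
by_identity.member?(b)        # => false (different object)
by_identity.member?(a)        # => true  (same object)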

How can I return hash pairs of keys that sum up to less than a maximum value?

Given this hash:
numsHash = {5=>10, 3=>9, 4=>7, 2=>5, 20=>4}
How can I return the key-value pairs of this hash whose keys sum to no more than a maximum value such as 10?
The expected result would be something like:
newHash = { 5=>10, 3=>9, 2=>5 }
because the sum of these keys equals 10.
I've been obsessing with this for hours now and can't find anything that leads up to a solution.
Summary
In the first section, I provide some context and a well-commented working example of how to solve the defined knapsack problem in a matter of microseconds using a little brute force and some Ruby core classes.
In the second section, I refactor and expand on the code to demonstrate the conversion of the knapsack solution into output similar to what you want, although (as explained and demonstrated below) the correct output when there are multiple results must be a collection of Hash objects rather than a single Hash unless there are additional selection criteria not included in your original post.
Please note that this answer uses syntax and classes from Ruby 3.0, and was specifically tested against Ruby 3.0.3. While it should work on Ruby 2.7.3+ without changes, and with most currently-supported Ruby 2.x versions with some minor refactoring, your mileage may vary.
Solving the Knapsack Problem with Ruby Core Methods
This seems to be a variant of the knapsack problem, where you're trying to optimize filling a container of a given size. This is actually a complex problem that is NP-complete, so a real-world application of this type will have many different solutions and possible algorithmic approaches.
I do not claim that the following solution is optimal or suitable for general purpose solutions to this class of problem. However, it works very quickly given the provided input data from your original post.
Its suitability is primarily based on the fact that you have a fairly small number of Hash keys, and the built-in Ruby 3.0.3 core methods Array#permutation and Enumerable#sum are fast enough to solve this particular problem in anywhere from 44-189 microseconds on my particular machine. That seems more than sufficiently fast for the problem as currently defined, but your mileage and real objectives may vary.
# This is the size of your knapsack.
MAX_VALUE = 10
# It's unclear why you need a Hash or what you plan to do with the values of the
# Hash, but that's irrelevant to the problem. For now, just grab the keys.
#
# NB: You have to use hash rockets or the parser complains about using an
# Integer as a Symbol using the colon notation and raises SyntaxError.
nums_hash = {5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4}
keys = nums_hash.keys
# Any individual element above MAX_VALUE won't fit in the knapsack anyway, so
# discard it before permutation.
keys.reject! { _1 > MAX_VALUE }
# Brute force it by evaluating all possible permutations of your array, dropping
# elements from the end of each sub-array until all remaining elements fit.
keys.permutation.map do |permuted_array|
  loop { permuted_array.sum > MAX_VALUE ? permuted_array.pop : break }
  permuted_array
end
Returning an Array of Matching Hashes
The code above just returns the list of keys that will fit into your knapsack, but per your original post you then want to return a Hash of matching key/value pairs. The problem here is that you actually have more than one set of Hash objects that will fit the criteria, so your collection should actually be an Array rather than a single Hash. Returning only a single Hash would basically return the original Hash minus any keys that exceed your MAX_VALUE, and that's unlikely to be what's intended.
Instead, now that you have a list of keys that fit into your knapsack, you can iterate through your original Hash and use Hash#select to return an Array of unique Hash objects with the appropriate key/value pairs. One way to do this is to use Enumerable#reduce to call Hash#merge on each Hash element in the subarrays, converting the final result to an Array of Hash objects. Next, you should call Array#uniq to remove any Hash that is equivalent except for its internal ordering.
For example, consider this redesigned code:
MAX_VALUE = 10
def possible_knapsack_contents hash
  hash.keys.reject { _1 > MAX_VALUE }.permutation.map do |a|
    loop { a.sum > MAX_VALUE ? a.pop : break }; a
  end.sort
end

def matching_elements_from hash
  possible_knapsack_contents(hash).map do |subarray|
    subarray.map { |i| hash.select { |k, _| k == i } }.
      reduce({}) { _1.merge _2 }
  end.uniq
end
hash = {5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4}
matching_elements_from hash
Given the defined input, this would yield 24 hashes if you didn't address the uniqueness issue. However, by calling #uniq on the final Array of Hash objects, this will correctly yield the 7 unique hashes that fit your defined criteria if not necessarily the single Hash you seem to expect:
[{2=>5, 3=>9, 4=>7},
{2=>5, 3=>9, 5=>10},
{2=>5, 4=>7},
{2=>5, 5=>10},
{3=>9, 4=>7},
{3=>9, 5=>10},
{4=>7, 5=>10}]

Bloom filters and their multiple hash functions

I'm implementing a simple Bloom filter as an exercise.
Bloom filters require multiple hash functions, which for practical purposes I don't have.
Assuming I want to have 3 hash functions, isn't it enough to just take the hash of the object I'm checking membership for (with murmur3), add +1, +2, +3 (for the 3 different hashes), and then hash those values again?
As the murmur3 function has a very good avalanche effect (it really spreads out results), wouldn't this be reasonable for all practical purposes?
Pseudo-code:
function generateHashes(obj) {
    long hash = murmur3_hash(obj);
    long hash1 = murmur3_hash(hash + 1);
    long hash2 = murmur3_hash(hash + 2);
    long hash3 = murmur3_hash(hash + 3);
    (hash1, hash2, hash3)
}
If not, what would be a simple, useful approach to this? I'd like a solution that would allow me to easily scale to more hash functions if need be.
AFAIK, the usual approach is to not actually use multiple hash functions. Rather, hash once and split the resulting hash into 2, 3, or however many parts you want for your Bloom filter. So, for example, create a 128-bit hash and split it into two 64-bit hashes.
https://github.com/Claudenw/BloomFilter/wiki/Bloom-Filters----An-overview
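For illustration, a minimal Ruby sketch of that splitting idea (MD5 is used here only because it conveniently yields 128 bits, and the variable names are invented for this example):
require 'digest'

digest = Digest::MD5.digest("foo")   # 16 bytes = 128 bits
h1, h2 = digest.unpack("Q>Q>")       # two big-endian 64-bit unsigned integers
p [h1, h2]                           # use h1 and h2 as the two hashes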
The hash functions of a Bloom filter should be independent and random enough. MurmurHash is great for this purpose. So your approach is correct, and you can generate as many new hashes that way as you like. For educational purposes it is fine.
But in the real world, running a hash function multiple times is slow, so the usual approach is to derive ad-hoc hashes without actually recomputing the hash function.
To correct #memo, this is done not by splitting the hash into multiple parts, as the width of the hash should remain constant (and you can't split a 64-bit hash into more than 64 parts ;) ). The approach is to get two independent hashes and combine them.
function generateHashes(obj) {
    // initialization phase
    long h1 = murmur3_hash(obj);
    long h2 = murmur3_hash(h1);
    int k = 3; // number of desired hash functions
    long hash[k];

    // generation phase
    for (int i = 0; i < k; i++) {
        hash[i] = h1 + (i * h2);
    }
    return hash;
}
As you can see, this way creating each new hash is a simple multiply-add operation.
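In Ruby, a minimal sketch of this double-hashing scheme might look like the following (bloom_indices, k and m are invented for this example, Digest::SHA256 merely stands in for MurmurHash, and each index is reduced modulo the filter size m):
require 'digest'

def bloom_indices(obj, k:, m:)
  h1 = Digest::SHA256.hexdigest(obj.to_s)[0, 16].to_i(16)   # first base hash
  h2 = Digest::SHA256.hexdigest(h1.to_s)[0, 16].to_i(16)    # second base hash
  (0...k).map { |i| (h1 + i * h2) % m }                      # k derived bit positions
end

p bloom_indices("foo", k: 3, m: 1024)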
It would not be a good approach. Let me try to explain. A Bloom filter allows you to test whether an element most likely belongs to a set, or whether it absolutely doesn't. In other words, false positives may occur, but false negatives won't.
Reference: https://sc5.io/posts/what-are-bloom-filters-and-why-are-they-useful/
Let us consider an example:
You have an input string 'foo' and you pass it to the multiple hash functions. The murmur3 hash gives the output K, and the subsequent hashes on this hash value give x, y and z.
Now assume you have another string 'bar' and, as it happens, its murmur3 hash is also K. The remaining hash values? They will again be x, y and z, because in your proposed approach the subsequent hash functions depend not on the input, but on the output of the first hash function.
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
As explained in the link, the purpose is to perform a probabilistic search in a set. If we search for 'foo' or for 'bar', we would say that it is 'likely' that both of them are present. So the percentage of false positives will increase.
In other words, this Bloom filter will behave like a single hash function. The 'Bloom' aspect of it will not come into the picture, because only the first hash function determines the outcome of the search.
Hope I was able to explain sufficiently. Let me know in comments if you have some more follow-up queries. Would be happy to assist.

How expensive is generating a random number in Ruby?

Say you want to generate a random number between 1 and 1 billion:
rand(1..1_000_000_000)
Will Ruby create an array from that range every time you call this line of code?
Rubocop suggests this approach over rand(1_000_000_000)+1 but it seems there's potential for pain.
Ruby's docs say this:
# When +max+ is a Range, +rand+ returns a random number where
# range.member?(number) == true.
where +max+ is the argument passed to rand, but it doesn't say how that number is generated. I'm also not sure whether calling .member? on a range is performant.
Any ideas?
I can use benchmark but still curious about the inner workings here.
No, Ruby will not create an array from that range, unless you explicitly call the .to_a method on the Range object. In fact, rand() doesn't work on arrays - .sample is the method to use for returning a random element from an array.
The Range class includes Enumerable so you get Enumerable's iteration methods without having to convert the range into an array. The lower and upper limits for a Range are (-Float::INFINITY..Float::INFINITY), although that will result in a Numerical argument out of domain error if you pass it into rand.
As for .member?, that method simply calls a C function called range_cover that calls another one called r_cover_p which checks if a value is between two numbers or strings.
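A tiny illustration of that check from the Ruby side (cover? exposes the same between-the-endpoints test directly):
r = (1..1_000_000_000)
r.member?(5)   # => true, without iterating over the range's elements
r.cover?(5)    # => true, explicitly just an endpoint comparison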
To test the difference in speed between passing a range to rand and calling sample on an array, you can perform the following test:
require 'benchmark'
puts Benchmark.measure { rand(0..10_000_000) }
=> 0.000000 0.000000 0.000000 ( 0.000009)
puts Benchmark.measure { (0..10_000_000).to_a.sample }
=> 0.300000 0.030000 0.330000 ( 0.347752)
As you can see in the first example, passing in a range as a parameter to rand is extremely rapid.
By contrast, calling .to_a.sample on a range is rather slow. This is due to the array creation, which requires allocating memory for every element. The .sample method itself should be relatively fast, as it simply picks a random index into the array and returns that element.
To check out the code for range have a look here.

How fast is Ruby's method for converting a range to an array?

(1..999).to_a
Is this method O(n)? I'm wondering if the conversion involves an implicit iteration so Ruby can write the values one-by-one into consecutive memory addresses.
The method is actually slightly worse than O(n). Not only does it do a naive iteration, but it doesn't check ahead of time what the size will be, so it has to repeatedly allocate more memory as it iterates. I've opened an issue for that aspect and it's been discussed a few times on the mailing list (and briefly added to ruby-core). The problem is that, like almost anything in Ruby, Range can be opened up and messed with, so Ruby can't really optimize the method. It can't even count on Range#size returning the correct result. Worse, some enumerables even have their size method delegate to to_a.
In general, it shouldn't be necessary to make this conversion, but if you really need array methods, you might be able to use Array#fill instead, which lets you populate a (potentially pre-allocated) array using values derived from its indices.
Range.instance_methods(false).include? :to_a
# false
Range doesn't define to_a itself; it inherits it from the Enumerable mix-in, so it builds the array by pushing each value on one at a time. That seems like it would be really inefficient, but I'll let the benchmark speak for itself:
require 'benchmark'
size = 100_000_000
Benchmark.bmbm do |r|
  r.report("range") { (0...size).to_a }
  r.report("fill")  { a = Array.new(size); a.fill { |i| i } }
  r.report("array") { Array.new(size) { |i| i } }
end
# Rehearsal -----------------------------------------
# range 4.530000 0.180000 4.710000 ( 4.716628)
# fill 5.810000 0.150000 5.960000 ( 5.966710)
# array 7.630000 0.250000 7.880000 ( 7.879940)
# ------------------------------- total: 18.550000sec
#
# user system total real
# range 4.540000 0.120000 4.660000 ( 4.660249)
# fill 5.980000 0.110000 6.090000 ( 6.089962)
# array 7.880000 0.110000 7.990000 ( 7.985818)
Isn't that weird? The range conversion is actually the fastest by a significant margin. And manually filling an Array is somehow faster than just using the constructor.
But as is usually the case with Ruby, don't worry too much about this. For reasonably sized ranges the performance difference will be negligible.
