Maintaining Ruby Set by object ID - ruby

I'm developing an algorithm in Ruby with the following properties:
It works on two objects of type Set, where each element is an Array, where all elements are of type String
Each Array involved has the same number of elements
No two arrays happen to be have the same content (when comparing with ==)
The algorithm involves many operations of moving an array from one Set to the other (or back), storing references to certain Arrays, and testing whether or not that reference is part of the Array
There is no duplication of the Arrays; all Arrays keep their object ID during all the time.
A native implementation would do something like this (to give you the idea); in practice, the arrays here have longer strings and more elements:
# Set up all Arrays involved
master=[
%w(a b c d),
%w(a b c x),
%w(u v w y),
# .... and so on
]
# Create initial sets.
x=Set.new
y=Set.new
# ....
x.add(master[0])
x.add(master[2])
y.add(master[1])
# ....
# Operating on the sets.
i=1
# ...
arr=master[i]
# Move element arr from y to x, if it is in y
if(y.member?(arr)
y.delete(arr)
x.add(arr)
end
# Do something with the sets
x.each { |arr| puts arr.pretty_print }
This would indeed work, simply because the arrays are all different in content. However, testing for membership means that y.member?(arr) tests that we don't have already an object with the same array content like arrin our Set, while it would be sufficient to verify to test that we don't have already an element with the same object_id in our Set, so I'm worried about performance. From my understanding, finding the the object id of an object is cheap, and since it is just a number, maintaining a set of numbers is more performant than maintaining a set of arrays of strings.
Therefore I could try to define my two sets as sets of object_id, and membership test would be faster. However when iterating over a Set, using the object_id to find the array itself is expensive (I would have to search ObjectSpace).
Another possibility would be to not maintain the set of arrays, but the set of indexes into my master array. My code would then be, for example,
x.add(0) # instead of x.add(master[0])
and iterating over a Set would be, i.e.
x.each { |i| puts master[i].pretty_print }
I wonder whether there is a better way - for instance that we can somehow "teach" Set.new to use object identity for maintaining its members, instead of equality.

I think you’re looking for Set#compare_by_identity, which makes the set use the object’s identity (i.e. object ID) of its contents.
x = Set.new
x.compare_by_identity

Related

How can I return hash pairs of keys that sum up to less than a maximum value?

Given this hash:
numsHash = {5=>10, 3=>9, 4=>7, 2=>5, 20=>4}
How can I return the key-value pair of this hash if and when the sum of its keys would be under or equal to a maximum value such as 10?
The expected result would be something like:
newHash = { 5=>10, 3=>9, 2=>5 }
because the sum of these keys equals 10.
I've been obsessing with this for hours now and can't find anything that leads up to a solution.
Summary
In the first section, I provide some context and a well-commented working example of how to solve the defined knapsack problem in a matter of microseconds using a little brute force and some Ruby core classes.
In the second section, I refactor and expand on the code to demonstrate the conversion of the knapsack solution into output similar to what you want, although (as explained and demonstrated in the answer below) the correct output when there are multiple results must be a collection of Hash objects rather than a single Hash unless there are additional selection criteria not included in your original post.
Please note that this answer uses syntax and classes from Ruby 3.0, and was specifically tested against Ruby 3.0.3. While it should work on Ruby 2.7.3+ without changes, and with most currently-supported Ruby 2.x versions with some minor refactoring, your mileage may vary.
Solving the Knapsack Problem with Ruby Core Methods
This seems to be a variant of the knapsack problem, where you're trying to optimize filling a container of a given size. This is actually a complex problem that is NP-complete, so a real-world application of this type will have many different solutions and possible algorithmic approaches.
I do not claim that the following solution is optimal or suitable for general purpose solutions to this class of problem. However, it works very quickly given the provided input data from your original post.
Its suitability is primarily based on the fact that you have a fairly small number of Hash keys, and the built-in Ruby 3.0.3 core methods of Hash#permutation and Enumerable#sum are fast enough to solve this particular problem in anywhere from 44-189 microseconds on my particular machine. That seems more than sufficiently fast for the problem as currently defined, but your mileage and real objectives may vary.
# This is the size of your knapsack.
MAX_VALUE = 10
# It's unclear why you need a Hash or what you plan to do with the values of the
# Hash, but that's irrelevant to the problem. For now, just grab the keys.
#
# NB: You have to use hash rockets or the parser complains about using an
# Integer as a Symbol using the colon notation and raises SyntaxError.
nums_hash = {5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4}
keys = nums_hash.keys
# Any individual element above MAX_VALUE won't fit in the knapsack anyway, so
# discard it before permutation.
keys.reject! { _1 > MAX_VALUE }
# Brute force it by evaluating all possible permutations of your array, dropping
# elements from the end of each sub-array until all remaining elements fit.
keys.permutation.map do |permuted_array|
loop { permuted_array.sum > MAX_VALUE ? permuted_array.pop : break }
permuted_array
end
Returning an Array of Matching Hashes
The code above just returns the list of keys that will fit into your knapsack, but per your original post you then want to return a Hash of matching key/value pairs. The problem here is that you actually have more than one set of Hash objects that will fit the criteria, so your collection should actually be an Array rather than a single Hash. Returning only a single Hash would basically return the original Hash minus any keys that exceed your MAX_VALUE, and that's unlikely to be what's intended.
Instead, now that you have a list of keys that fit into your knapsack, you can iterate through your original Hash and use Hash#select to return an Array of unique Hash objects with the appropriate key/value pairs. One way to do this is to use Enumerable#reduce to call Hash#merge on each Hash element in the subarrays to convert the final result to an Array of Hash objects. Next, you should call Enumerable#unique to remove any Hash that is equivalent except for its internal ordering.
For example, consider this redesigned code:
MAX_VALUE = 10
def possible_knapsack_contents hash
hash.keys.reject! { _1 > MAX_VALUE }.permutation.map do |a|
loop { a.sum > MAX_VALUE ? a.pop : break }; a
end.sort
end
def matching_elements_from hash
possible_knapsack_contents(hash).map do |subarray|
subarray.map { |i| hash.select { |k, _| k == i } }.
reduce({}) { _1.merge _2 }
end.uniq
end
hash = {5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4}
matching_elements_from hash
Given the defined input, this would yield 24 hashes if you didn't address the uniqueness issue. However, by calling #uniq on the final Array of Hash objects, this will correctly yield the 7 unique hashes that fit your defined criteria if not necessarily the single Hash you seem to expect:
[{2=>5, 3=>9, 4=>7},
{2=>5, 3=>9, 5=>10},
{2=>5, 4=>7},
{2=>5, 5=>10},
{3=>9, 4=>7},
{3=>9, 5=>10},
{4=>7, 5=>10}]

Looping multiple arrays

I have an array containing multiple arrays (the number may vary) of significant dimensions (an array may contain up to 150 objects).
With the array, I need to find a combination (one element for each sub-array) that matches a condition.
Due to the dimensions, I tried to use Enumerator::Lazy as follows
catch :match do
array[0].product(*array[1..-1]).lazy.each do |combination|
throw :match if ConditionMatcher.match(combination)
end
end
However, I realize when I call each the enumerator is evaluated and it performs very slowly.
I have tried to replace each with methods included in Enumerator::Lazy such as take_while
array[0].product(*array[1..-1]).lazy.take_while do |combination|
return false if ConditionMatcher.match(combination)
end
But also, in this case, the product is evaluated with low performance.
For better performance, even id I don't really like it, I'm thinking to replace product with a nested each loop. Something like
catch :match do
array[0].each do |first|
array[1].each do |second|
array[2].each do |third|
throw :match if ConditionMatcher.match([first, second, third])
end
end
end
end
Due to the fact that the number of sub-arrays changes from time to time. I'm not sure how to implement it.
Moreover, is there a better way to loop through all the sub-arrays without loading the entire set of combinations?
Update 1 -
Each sub-array contains an ActiveRecord::Relation on a polymorphic association. Therefore, each element of each combination responds to the same 2 methods (start_time and end_time) each returning an instance of Time.
The matcher checks if all the objects in the combination don't have overlapping times.
The problem is that Array#product already returns a huge array containing all combinations. With 3 sub-arrays containing 150 items each, it returns a 150 × 150 × 150 = 3,375,000 element array. Calling lazy on that array won't speed up anything.
To make product calculate the Cartesian product lazily (i.e. one combination after the other), you simply have to (directly) pass a block to it:
first, *others = array
first.product(*others) do |combination|
# ...
end

Bloom filters and its multiple hash functions

I'm implementing a simple Bloom filter as an exercise.
Bloom filters require multiple hash functions, which for practical purposes I don't have.
Assuming I want to have 3 hash functions, isn't it enough to just take the hash of the object I'm checking membership for, hashing it (with murmur3) and then add +1, +2, +3 (for the 3 different hashes) before hashing them again?
As the murmur3 function has a very good avalanche effect (really spreads out results) wouldn't this for all purposes be reasonable?
Pseudo-code:
function generateHashes(obj) {
long hash = murmur3_hash(obj);
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
(hash1, hash2, hash3)
}
If not, what would be a simple, useful approach to this? I'd like to have a solution that would allow me to easily scale for more hash functions if needed be.
AFAIK, the usual approach is to not actually use multiple hash functions. Rather, hash once and split the resulting hash into 2, 3, or how many parts you want for your Bloom filter. So for example create a hash of 128 bits and split it into 2 hashes 64 bit each.
https://github.com/Claudenw/BloomFilter/wiki/Bloom-Filters----An-overview
The hashing functions of Bloom filter should be independent and random enough. MurmurHash is great for this purpose. So your approach is correct, and you can generate as many new hashes your way. For the educational purposes it is fine.
But in real world, running hashing function multiple times is slow, so the usual approach is to create ad-hoc hashes without actually calculating the hash.
To correct #memo, this is done not by splitting the hash into multiple parts, as the width of the hash should remain constant (and you can't split 64 bit hash to more than 64 parts ;) ). The approach is to get a two independent hashes and combine them.
function generateHashes(obj) {
// initialization phase
long h1 = murmur3_hash(obj);
long h2 = murmur3_hash(h1);
int k = 3; // number of desired hash functions
long hash[k];
// generation phase
for (int i=0; i<k; i++) {
hash[i] = h1 + (i*h2);
}
return hash;
}
As you see, this way creating a new hash is a simple multiply-add operation.
It would not be a good approach. Let me try and explain. Bloom filter allows you to test if an element most likely belongs to a set, or if it absolutely doesn’t. In others words, false positives may occur, but false negatives won’t.
Reference: https://sc5.io/posts/what-are-bloom-filters-and-why-are-they-useful/
Let us consider an example:
You have an input string 'foo' and we pass it to the multiple hash functions. murmur3 hash gives the output K, and subsequent hashes on this hash value give x, y and z
Now assume you have another string 'bar' and as it happens, its murmur3 hash is also K. The remaining hash values? They will be x, y and z because in your proposed approach the subsequent hash functions are not dependent on the input, but instead on the output of first hash function.
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
As explained in the link, the purpose is to perform a probabilistic search in a set. If we perform search for 'foo' or for 'bar' we would say that it is 'likely' that both of them are present. So the % of false positives will increase.
In other words this bloom filter will behave like a simple hash-function. The 'bloom' aspect of it will not come into picture because only the first hash function is determining the outcome of search.
Hope I was able to explain sufficiently. Let me know in comments if you have some more follow-up queries. Would be happy to assist.

Shared Array with more complicated element type

A simplified example:
nsize = 100
vsize = 10000
varray = [rand(vsize) for i in 1:nsize] #say, I have a set of vectors.
for k in 1:nsize
varray[k] = rand(vsize, vsize) * varray[k]
end
Obviously, the above for loop can be parallelized.
According to Parallel Map and Loops in Julia manual,
I need to used SharedArray. However, ShardArray cannot have Array{Float64,1} as element type.
julia> a = SharedArray(Array{Float64,1}, nsize)
ERROR: ArgumentError: type of SharedArray elements must be bits types, got Array{Float64,1}
in __SharedArray#138__ at sharedarray.jl:45
in SharedArray at sharedarray.jl:116
How can I solve this problem?
Currently, you can't, because a SharedArray requires a contiguous block of memory, which means that its elements must be "bits types," and this isn't true for Array. (Array is implemented in C and has some header information, which makes them not densely-packable.)
However, if all of your "element" arrays have the same size, and you don't absolutely require the ability to modify single elements of the "inner" arrays, you could try using StaticArrays as elements. (Thanks to #Wouter in the comments below for pointing out that this needed updating.)

Count, size, length...too many choices in Ruby?

I can't seem to find a definitive answer on this and I want to make sure I understand this to the "n'th level" :-)
a = { "a" => "Hello", "b" => "World" }
a.count # 2
a.size # 2
a.length # 2
a = [ 10, 20 ]
a.count # 2
a.size # 2
a.length # 2
So which to use? If I want to know if a has more than one element then it doesn't seem to matter but I want to make sure I understand the real difference. This applies to arrays too. I get the same results.
Also, I realize that count/size/length have different meanings with ActiveRecord. I'm mostly interested in pure Ruby (1.92) right now but if anyone wants to chime in on the difference AR makes that would be appreciated as well.
Thanks!
For arrays and hashes size is an alias for length. They are synonyms and do exactly the same thing.
count is more versatile - it can take an element or predicate and count only those items that match.
> [1,2,3].count{|x| x > 2 }
=> 1
In the case where you don't provide a parameter to count it has basically the same effect as calling length. There can be a performance difference though.
We can see from the source code for Array that they do almost exactly the same thing. Here is the C code for the implementation of array.length:
static VALUE
rb_ary_length(VALUE ary)
{
long len = RARRAY_LEN(ary);
return LONG2NUM(len);
}
And here is the relevant part from the implementation of array.count:
static VALUE
rb_ary_count(int argc, VALUE *argv, VALUE ary)
{
long n = 0;
if (argc == 0) {
VALUE *p, *pend;
if (!rb_block_given_p())
return LONG2NUM(RARRAY_LEN(ary));
// etc..
}
}
The code for array.count does a few extra checks but in the end calls the exact same code: LONG2NUM(RARRAY_LEN(ary)).
Hashes (source code) on the other hand don't seem to implement their own optimized version of count so the implementation from Enumerable (source code) is used, which iterates over all the elements and counts them one-by-one.
In general I'd advise using length (or its alias size) rather than count if you want to know how many elements there are altogether.
Regarding ActiveRecord, on the other hand, there are important differences. check out this post:
Counting ActiveRecord associations: count, size or length?
There is a crucial difference for applications which make use of database connections.
When you are using many ORMs (ActiveRecord, DataMapper, etc.) the general understanding is that .size will generate a query that requests all of the items from the database ('select * from mytable') and then give you the number of items resulting, whereas .count will generate a single query ('select count(*) from mytable') which is considerably faster.
Because these ORMs are so prevalent I following the principle of least astonishment. In general if I have something in memory already, then I use .size, and if my code will generate a request to a database (or external service via an API) I use .count.
In most cases (e.g. Array or String) size is an alias for length.
count normally comes from Enumerable and can take an optional predicate block. Thus enumerable.count {cond} is [roughly] (enumerable.select {cond}).length -- it can of course bypass the intermediate structure as it just needs the count of matching predicates.
Note: I am not sure if count forces an evaluation of the enumeration if the block is not specified or if it short-circuits to the length if possible.
Edit (and thanks to Mark's answer!): count without a block (at least for Arrays) does not force an evaluation. I suppose without formal behavior it's "open" for other implementations, if forcing an evaluation without a predicate ever even really makes sense anyway.
I found a good answare at http://blog.hasmanythrough.com/2008/2/27/count-length-size
In ActiveRecord, there are several ways to find out how many records
are in an association, and there are some subtle differences in how
they work.
post.comments.count - Determine the number of elements with an SQL
COUNT query. You can also specify conditions to count only a subset of
the associated elements (e.g. :conditions => {:author_name =>
"josh"}). If you set up a counter cache on the association, #count
will return that cached value instead of executing a new query.
post.comments.length - This always loads the contents of the
association into memory, then returns the number of elements loaded.
Note that this won't force an update if the association had been
previously loaded and then new comments were created through another
way (e.g. Comment.create(...) instead of post.comments.create(...)).
post.comments.size - This works as a combination of the two previous
options. If the collection has already been loaded, it will return its
length just like calling #length. If it hasn't been loaded yet, it's
like calling #count.
Also I have a personal experience:
<%= h(params.size.to_s) %> # works_like_that !
<%= h(params.count.to_s) %> # does_not_work_like_that !
We have a several ways to find out how many elements in an array like .length, .count and .size. However, It's better to use array.size rather than array.count. Because .size is better in performance.
Adding more to Mark Byers answer. In Ruby the method array.size is an alias to Array#length method. There is no technical difference in using any of these two methods. Possibly you won't see any difference in performance as well. However, the array.count also does the same job but with some extra functionalities Array#count
It can be used to get total no of elements based on some condition. Count can be called in three ways:
Array#count # Returns number of elements in Array
Array#count n # Returns number of elements having value n in Array
Array#count{|i| i.even?} Returns count based on condition invoked on each element array
array = [1,2,3,4,5,6,7,4,3,2,4,5,6,7,1,2,4]
array.size # => 17
array.length # => 17
array.count # => 17
Here all three methods do the same job. However here is where the count gets interesting.
Let us say, I want to find how many array elements does the array contains with value 2
array.count 2 # => 3
The array has a total of three elements with value as 2.
Now, I want to find all the array elements greater than 4
array.count{|i| i > 4} # =>6
The array has total 6 elements which are > than 4.
I hope it gives some info about count method.

Resources