Ruby performance: Multi-key hashes - ruby

Suppose I have some lookup table, q(w, x, y, z), where various combinations of keys map to different values; i.e., q(0, 0, 0, 0) = a, q(0, 0, 0, 1) = b, q(15, 16, 23, "b") = c.
What's the best way to implement this structure in Ruby in terms of efficiency? The keys will be generated dynamically and will generally be strings. I can think of three different keying methods with hashes:
Use a string as the key: q["a, b, c, d"] = 0
Use a single array as the key: q[["a", "b", "c", "d"]] = 0
Use a hash of hashes: q["a"]["b"]["c"]["d"] = 0
I'm currently using method 2, and it's a little slower than I would like. These key combinations are generated dynamically—if I were to use a hash that takes a single string, will string concatenation be faster? Should I have started with a hash of hashes in the first place? Will this method take more space in memory?

I would opt for something like your #1: build a single string and use it as the map key. Make sure, however, that this 'surrogate hash key' is unique for every combination of values. That way you only have to build a simple string and need only a single map.
Generally speaking, you want map keys to be as immutable as possible, because a key that mutates after insertion can corrupt the table. That is sometimes messy in Ruby, since strings are mutable, but it is still a worthwhile goal.
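As a rough illustration of the three keying schemes from the question, here is a minimal sketch using only Ruby core classes. The separator character and all variable names are assumptions made for the example, not part of the original post. It shows how each scheme is written and read, and makes no claim about which is fastest.

```ruby
# Three ways to key a four-part lookup; all names here are illustrative.
parts = %w[a b c d]

# 1. Single string key. The separator must never occur inside a part,
#    otherwise distinct combinations could collide ("a,b"+"c" vs "a"+"b,c").
sep = "\0"
by_string = {}
by_string[parts.join(sep)] = 0   # Hash#[]= stores a frozen copy of a String key

# 2. Array key. Freezing guards against mutating the key after insertion.
by_array = {}
by_array[parts.dup.freeze] = 0

# 3. Hash of hashes, with a default block that creates inner levels on demand.
nested = Hash.new { |h, k| h[k] = Hash.new(&h.default_proc) }
nested["a"]["b"]["c"]["d"] = 0

# Look each entry up again from freshly built keys.
string_hit = by_string[%w[a b c d].join(sep)]
array_hit  = by_array[%w[a b c d]]
nested_hit = nested.dig("a", "b", "c", "d")
```

To compare the schemes on real data, each lookup could be wrapped in the standard library's Benchmark module. The result depends on key length and on how often keys are rebuilt.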

Related

Maintaining Ruby Set by object ID

I'm developing an algorithm in Ruby with the following properties:
It works on two objects of type Set, where each element is an Array, where all elements are of type String
Each Array involved has the same number of elements
No two arrays happen to have the same content (when comparing with ==)
The algorithm involves many operations of moving an array from one Set to the other (or back), storing references to certain Arrays, and testing whether or not such a reference is part of a Set
There is no duplication of the Arrays; all Arrays keep their object ID the whole time.
A naive implementation would do something like this (to give you the idea); in practice, the arrays here have longer strings and more elements:
require 'set'
# Set up all Arrays involved
master=[
%w(a b c d),
%w(a b c x),
%w(u v w y),
# .... and so on
]
# Create initial sets.
x=Set.new
y=Set.new
# ....
x.add(master[0])
x.add(master[2])
y.add(master[1])
# ....
# Operating on the sets.
i=1
# ...
arr=master[i]
# Move element arr from y to x, if it is in y
if y.member?(arr)
y.delete(arr)
x.add(arr)
end
# Do something with the sets
x.each { |a| pp a }
This would indeed work, simply because the arrays are all different in content. However, y.member?(arr) tests whether the Set already holds an object with the same array content as arr, while it would be sufficient to test whether it already holds an element with the same object_id, so I'm worried about performance. From my understanding, finding the object id of an object is cheap, and since it is just a number, maintaining a set of numbers is more performant than maintaining a set of arrays of strings.
Therefore I could try to define my two sets as sets of object_id, and membership test would be faster. However when iterating over a Set, using the object_id to find the array itself is expensive (I would have to search ObjectSpace).
Another possibility would be to not maintain the set of arrays, but the set of indexes into my master array. My code would then be, for example,
x.add(0) # instead of x.add(master[0])
and iterating over a Set would be, e.g.
x.each { |i| pp master[i] }
I wonder whether there is a better way - for instance that we can somehow "teach" Set.new to use object identity for maintaining its members, instead of equality.
I think you’re looking for Set#compare_by_identity, which makes the set compare its elements by object identity (i.e. object ID) rather than by equality.
x = Set.new
x.compare_by_identity
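To make the difference from the default behaviour concrete, here is a small sketch built on the question's own setup. The variable names lookalike, same_object_found and lookalike_found are illustrative assumptions.

```ruby
require "set"

master = [%w[a b c d], %w[a b c x], %w[u v w y]]

# compare_by_identity returns the set itself, so it can be chained.
x = Set.new.compare_by_identity
y = Set.new.compare_by_identity
x << master[0] << master[2]
y << master[1]

# Move master[1] from y to x; delete? returns nil when nothing was removed.
arr = master[1]
x << arr if y.delete?(arr)

# An Array with equal content but a different identity is not a member.
lookalike = %w[a b c d]
same_object_found = x.include?(master[0])
lookalike_found   = x.include?(lookalike)
```

With identity comparison, membership tests no longer hash the string contents of each array, which is the cost the question was worried about.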

How can I return hash pairs of keys that sum up to less than a maximum value?

Given this hash:
numsHash = {5=>10, 3=>9, 4=>7, 2=>5, 20=>4}
How can I return the key-value pair of this hash if and when the sum of its keys would be under or equal to a maximum value such as 10?
The expected result would be something like:
newHash = { 5=>10, 3=>9, 2=>5 }
because the sum of these keys equals 10.
I've been obsessing with this for hours now and can't find anything that leads up to a solution.
Summary
In the first section, I provide some context and a well-commented working example of how to solve the defined knapsack problem in a matter of microseconds using a little brute force and some Ruby core classes.
In the second section, I refactor and expand on the code to demonstrate the conversion of the knapsack solution into output similar to what you want, although (as explained and demonstrated in the answer below) the correct output when there are multiple results must be a collection of Hash objects rather than a single Hash unless there are additional selection criteria not included in your original post.
Please note that this answer uses syntax and classes from Ruby 3.0, and was specifically tested against Ruby 3.0.3. While it should work on Ruby 2.7.3+ without changes, and with most currently-supported Ruby 2.x versions with some minor refactoring, your mileage may vary.
Solving the Knapsack Problem with Ruby Core Methods
This seems to be a variant of the knapsack problem, where you're trying to optimize filling a container of a given size. This is actually a complex problem that is NP-complete, so a real-world application of this type will have many different solutions and possible algorithmic approaches.
I do not claim that the following solution is optimal or suitable for general purpose solutions to this class of problem. However, it works very quickly given the provided input data from your original post.
Its suitability is primarily based on the fact that you have a fairly small number of Hash keys, and the built-in Ruby 3.0.3 core methods Array#permutation and Array#sum are fast enough to solve this particular problem in anywhere from 44-189 microseconds on my particular machine. That seems more than sufficiently fast for the problem as currently defined, but your mileage and real objectives may vary.
# This is the size of your knapsack.
MAX_VALUE = 10
# It's unclear why you need a Hash or what you plan to do with the values of the
# Hash, but that's irrelevant to the problem. For now, just grab the keys.
#
# NB: You have to use hash rockets or the parser complains about using an
# Integer as a Symbol using the colon notation and raises SyntaxError.
nums_hash = {5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4}
keys = nums_hash.keys
# Any individual element above MAX_VALUE won't fit in the knapsack anyway, so
# discard it before permutation.
keys.reject! { _1 > MAX_VALUE }
# Brute force it by evaluating all possible permutations of your array, dropping
# elements from the end of each sub-array until all remaining elements fit.
keys.permutation.map do |permuted_array|
loop { permuted_array.sum > MAX_VALUE ? permuted_array.pop : break }
permuted_array
end
Returning an Array of Matching Hashes
The code above just returns the list of keys that will fit into your knapsack, but per your original post you then want to return a Hash of matching key/value pairs. The problem here is that you actually have more than one set of Hash objects that will fit the criteria, so your collection should actually be an Array rather than a single Hash. Returning only a single Hash would basically return the original Hash minus any keys that exceed your MAX_VALUE, and that's unlikely to be what's intended.
Instead, now that you have a list of keys that fit into your knapsack, you can iterate through your original Hash and use Hash#select to return an Array of unique Hash objects with the appropriate key/value pairs. One way to do this is to use Enumerable#reduce to call Hash#merge on each Hash element in the subarrays to convert the final result to an Array of Hash objects. Next, you should call Array#uniq to remove any Hash that is equivalent except for its internal ordering.
For example, consider this redesigned code:
MAX_VALUE = 10
def possible_knapsack_contents hash
hash.keys.reject { _1 > MAX_VALUE }.permutation.map do |a|
loop { a.sum > MAX_VALUE ? a.pop : break }; a
end.sort
end
def matching_elements_from hash
possible_knapsack_contents(hash).map do |subarray|
subarray.map { |i| hash.select { |k, _| k == i } }.
reduce({}) { _1.merge _2 }
end.uniq
end
hash = {5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4}
matching_elements_from hash
Given the defined input, this would yield 24 hashes if you didn't address the uniqueness issue. However, by calling #uniq on the final Array of Hash objects, this will correctly yield the 7 unique hashes that fit your defined criteria if not necessarily the single Hash you seem to expect:
[{2=>5, 3=>9, 4=>7},
{2=>5, 3=>9, 5=>10},
{2=>5, 4=>7},
{2=>5, 5=>10},
{3=>9, 4=>7},
{3=>9, 5=>10},
{4=>7, 5=>10}]
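Since only the set of keys matters and not their order, the candidates can also be enumerated with Array#combination instead of permutations. The sketch below uses that swapped-in technique and is not the method used above. The helper name subsets_within and its max: keyword are hypothetical. Unlike the permutation approach, it also lists smaller subsets that fit, such as {3=>9, 2=>5}.

```ruby
# Hypothetical helper: every non-empty subset of the hash whose keys
# sum to at most max, returned as an Array of Hash objects.
def subsets_within(hash, max:)
  keys = hash.keys.select { |k| k <= max }
  (1..keys.size).flat_map do |n|
    keys.combination(n).select { |combo| combo.sum <= max }
  end.map { |combo| hash.slice(*combo) }
end

nums_hash = { 5 => 10, 3 => 9, 4 => 7, 2 => 5, 20 => 4 }
results = subsets_within(nums_hash, max: 10)

# Narrow down to subsets whose keys add up to exactly the maximum.
exact = results.select { |h| h.keys.sum == 10 }
```

If the goal is the subset whose keys add up to exactly the maximum, as the expected result in the question suggests, the final select narrows the list accordingly.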

Bloom filters and its multiple hash functions

I'm implementing a simple Bloom filter as an exercise.
Bloom filters require multiple hash functions, which for practical purposes I don't have.
Assuming I want to have 3 hash functions, isn't it enough to just take the hash of the object I'm checking membership for, hashing it (with murmur3) and then add +1, +2, +3 (for the 3 different hashes) before hashing them again?
As the murmur3 function has a very good avalanche effect (really spreads out results) wouldn't this for all purposes be reasonable?
Pseudo-code:
function generateHashes(obj) {
long hash = murmur3_hash(obj);
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
(hash1, hash2, hash3)
}
If not, what would be a simple, useful approach to this? I'd like to have a solution that would allow me to easily scale to more hash functions if need be.
AFAIK, the usual approach is to not actually use multiple hash functions. Rather, hash once and split the resulting hash into 2, 3, or how many parts you want for your Bloom filter. So for example create a hash of 128 bits and split it into 2 hashes 64 bit each.
https://github.com/Claudenw/BloomFilter/wiki/Bloom-Filters----An-overview
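The splitting idea can be sketched in Ruby as follows. MD5 from the standard library is used only as a stand-in that happens to produce 128 bits, since Ruby does not ship murmur3. The helper name split_hashes is an assumption.

```ruby
require "digest"

# Split one 128-bit MD5 digest into two unsigned 64-bit integers.
# "Q>2" unpacks two big-endian 64-bit values from the 16 digest bytes.
def split_hashes(value)
  Digest::MD5.digest(value.to_s).unpack("Q>2")
end

h1, h2 = split_hashes("foo")
```

Each half can then be reduced modulo the size of the bit array to obtain an index.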
The hash functions of a Bloom filter should be independent and random enough. MurmurHash is great for this purpose. So your approach is correct, and you can generate as many new hashes as you like that way. For educational purposes it is fine.
But in the real world, running the hash function multiple times is slow, so the usual approach is to create ad-hoc hashes without actually recomputing the hash.
To correct #memo: this is done not by splitting the hash into multiple parts, as the width of the hash should remain constant (and you can't split a 64-bit hash into more than 64 parts ;) ). The approach is to get two independent hashes and combine them.
function generateHashes(obj) {
// initialization phase
long h1 = murmur3_hash(obj);
long h2 = murmur3_hash(h1);
int k = 3; // number of desired hash functions
long hash[k];
// generation phase
for (int i=0; i<k; i++) {
hash[i] = h1 + (i*h2);
}
return hash;
}
As you see, this way creating a new hash is a simple multiply-add operation.
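For a runnable picture of the same idea, here is a minimal Bloom filter sketch in Ruby. The class name TinyBloom, its defaults, and the use of the two halves of an MD5 digest as h1 and h2 are all assumptions for illustration. The pseudo-code above uses murmur3, which Ruby's standard library does not provide.

```ruby
require "digest"

# Minimal Bloom filter using double hashing: index_i = (h1 + i * h2) mod m.
class TinyBloom
  def initialize(bits: 1024, hashes: 3)
    @m = bits
    @k = hashes
    @bitmap = Array.new(bits, false)
  end

  def add(item)
    indexes(item).each { |i| @bitmap[i] = true }
    self
  end

  # false means definitely absent; true means probably present.
  def include?(item)
    indexes(item).all? { |i| @bitmap[i] }
  end

  private

  # The two base hashes are the halves of one MD5 digest (a stand-in
  # for murmur3, chosen only because it ships with Ruby).
  def indexes(item)
    h1, h2 = Digest::MD5.digest(item.to_s).unpack("Q>2")
    h2 |= 1 # odd stride: never 0 modulo a power-of-two size
    Array.new(@k) { |i| (h1 + i * h2) % @m }
  end
end

filter = TinyBloom.new
filter.add("foo").add("bar")
```

Raising the hashes: argument adds more index positions at the cost of one multiply-add each, with no further hashing.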
It would not be a good approach. Let me try to explain. A Bloom filter allows you to test whether an element most likely belongs to a set, or absolutely doesn’t. In other words, false positives may occur, but false negatives won’t.
Reference: https://sc5.io/posts/what-are-bloom-filters-and-why-are-they-useful/
Let us consider an example:
You have an input string 'foo' and we pass it to the multiple hash functions. murmur3 hash gives the output K, and subsequent hashes on this hash value give x, y and z
Now assume you have another string 'bar' and as it happens, its murmur3 hash is also K. The remaining hash values? They will be x, y and z because in your proposed approach the subsequent hash functions are not dependent on the input, but instead on the output of first hash function.
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
As explained in the link, the purpose is to perform a probabilistic search in a set. If we perform search for 'foo' or for 'bar' we would say that it is 'likely' that both of them are present. So the % of false positives will increase.
In other words this bloom filter will behave like a simple hash-function. The 'bloom' aspect of it will not come into picture because only the first hash function is determining the outcome of search.
Hope I was able to explain sufficiently. Let me know in comments if you have some more follow-up queries. Would be happy to assist.

Associatively sorting a table by value in Lua

I have a key => value table I'd like to sort in Lua. The keys are all integers, but aren't consecutive (and have meaning). Lua's only sort function appears to be table.sort, which treats tables as simple arrays, discarding the original keys and their association with particular items. Instead, I'd essentially like to be able to use PHP's asort() function.
What I have:
items = {
[1004] = "foo",
[1234] = "bar",
[3188] = "baz",
[7007] = "quux",
}
What I want after the sort operation:
items = {
[1234] = "bar",
[3188] = "baz",
[1004] = "foo",
[7007] = "quux",
}
Any ideas?
Edit: Based on answers, I'm going to assume that it's simply an odd quirk of the particular embedded Lua interpreter I'm working with, but in all of my tests, pairs() always returns table items in the order in which they were added to the table. (i.e. the two above declarations would iterate differently).
Unfortunately, because that isn't normal behavior, it looks like I can't get what I need; Lua doesn't have the necessary tools built-in (of course) and the embedded environment is too limited for me to work around it.
Still, thanks for your help, all!
You seem to misunderstand something. What you have here is an associative array. Associative arrays have no explicit order; any order you observe comes only from the internal representation.
In short -- in Lua, both of the arrays you posted are the same.
What you would want instead, is such a representation:
items = {
{1004, "foo"},
{1234, "bar"},
{3188, "baz"},
{7007, "quux"},
}
While you can't get them by index now (they are indexed 1, 2, 3, 4, but you can create another index array), you can sort them using table.sort.
A sorting function would be then:
function compare(a,b)
return a[1] < b[1]
end
table.sort(items, compare)
As Kornel said, you're dealing with associative arrays, which have no guaranteed ordering.
If you want key ordering based on its associated value while also preserving associative array functionality, you can do something like this:
function getKeysSortedByValue(tbl, sortFunction)
local keys = {}
for key in pairs(tbl) do
table.insert(keys, key)
end
table.sort(keys, function(a, b)
return sortFunction(tbl[a], tbl[b])
end)
return keys
end
items = {
[1004] = "foo",
[1234] = "bar",
[3188] = "baz",
[7007] = "quux",
}
local sortedKeys = getKeysSortedByValue(items, function(a, b) return a < b end)
sortedKeys is {1234,3188,1004,7007}, and you can access your data like so:
for _, key in ipairs(sortedKeys) do
print(key, items[key])
end
result:
1234 bar
3188 baz
1004 foo
7007 quux
Hmm, I missed the part about not being able to control the iteration.
But in Lua there is usually a way.
http://lua-users.org/wiki/OrderedAssociativeTable
That's a start. Now you would need to replace the pairs() that the library uses. That could be as simple as pairs=my_pairs. You could then use the solution in the link above.
PHP arrays are different from Lua tables.
A PHP array may have an ordered list of key-value pairs.
A Lua table always contains an unordered set of key-value pairs.
A Lua table acts as an array when a programmer chooses to use integers 1, 2, 3, ... as keys. The language syntax and standard library functions, like table.sort offer special support for tables with consecutive-integer keys.
So, if you want to emulate a PHP array, you'll have to represent it using list of key-value pairs, which is really a table of tables, but it's more helpful to think of it as a list of key-value pairs. Pass a custom "less-than" function to table.sort and you'll be all set.
N.B. Lua allows you to mix consecutive-integer keys with any other kinds of keys in the same table—and the representation is efficient. I use this feature sometimes, usually to tag an array with a few pieces of metadata.
Coming to this a few months later, with the same query. The recommended answer seemed to pinpoint the gap between what was required and how this looks in Lua, but it didn't get me exactly what I was after, which was a hash sorted by key.
The first three functions on this page DID however : http://lua-users.org/wiki/SortedIteration
I did a brief bit of Lua coding a couple of years ago but I'm no longer fluent in it.
When faced with a similar problem, I copied my array to another array with keys and values reversed, then used sort on the new array.
I wasn't aware of a possibility to sort the array using the method Kornel Kisielewicz recommends.
The proposed compare function works but only if the values in the first column are unique.
Here is a slightly enhanced compare function: if the values in the current column are equal, it falls back to the next column to decide the order.
Because {1234, "baam"} < {1234, "bar"} is true under this comparison, the entry containing "baam" is sorted before the entry containing "bar".
local items = {
{1004, "foo"},
{1234, "bar"},
{1234, "baam"},
{3188, "baz"},
{7007, "quux"},
}
local function compare(a, b)
  -- compare column by column; the first differing column decides
  for inx = 1, #a do
    if a[inx] ~= b[inx] then
      return a[inx] < b[inx]
    end
  end
  -- all columns equal: a is not less than b
  return false
end
table.sort(items,compare)

Can't sort table with associative indexes

Why can't I use table.sort to sort tables with associative indexes?
In general, Lua tables are pure associative arrays. There is no "natural" order other than as a side effect of the particular hash table implementation used in the Lua core. This makes sense because values of any Lua data type (other than nil) can be used as both keys and values; but only strings and numbers have any kind of sensible ordering, and then only between values of like type.
For example, what should the sorted order of this table be:
unsortable = {
answer=42,
[true]="Beauty",
[function() return 17 end] = function() return 42 end,
[math.pi] = "pi",
[ {} ] = {},
12, 11, 10, 9, 8
}
It has one string key, one boolean key, one function key, one non-integral key, one table key, and five integer keys. Should the function sort ahead of the string? How do you compare the string to a number? Where should the table sort? And what about userdata and thread values which don't happen to appear in this table?
By convention, values indexed by sequential integers beginning with 1 are commonly used as lists. Several functions and common idioms follow this convention, and table.sort is one example. Functions that operate over lists usually ignore any values stored at keys that are not part of the list. Again, table.sort is an example: it sorts only those elements that are stored at keys that are part of the list.
Another example is the # operator. For the above table, #unsortable is 5 because unsortable[5] ~= nil and unsortable[6] == nil. Notice that the value stored at the numeric index math.pi is not counted even though pi is between 3 and 4 because it is not an integer. Furthermore, none of the other non-integer keys are counted either. This means that a simple for loop can iterate over the entire list:
for i = 1,#unsortable do
print(i,unsortable[i])
end
Although that is often written as
for i,v in ipairs(unsortable) do
print(i,v)
end
In short, Lua tables are unordered collections of values, each indexed by a key; but there is a special convention for sequential integer keys beginning at 1.
Edit: For the special case of non-integral keys with a suitable partial ordering, there is a work-around involving a separate index table. The described content of tables keyed by string values is a suitable example for this trick.
First, collect the keys in a new table, in the form of a list. That is, make a table indexed by consecutive integers beginning at 1 with keys as values and sort that. Then, use that index to iterate over the original table in the desired order.
For example, here is foreachinorder(), which uses this technique to iterate over all values of a table, calling a function for each key/value pair, in an order determined by a comparison function.
function foreachinorder(t, f, cmp)
-- first extract a list of the keys from t
local keys = {}
for k,_ in pairs(t) do
keys[#keys+1] = k
end
-- sort the keys according to the function cmp. If cmp
-- is omitted, table.sort() defaults to the < operator
table.sort(keys,cmp)
-- finally, loop over the keys in sorted order, and operate
-- on elements of t
for _,k in ipairs(keys) do
f(k,t[k])
end
end
It constructs an index, sorts it with table.sort(), then loops over each element in the sorted index and calls the function f for each one. The function f is passed the key and value. The sort order is determined by an optional comparison function which is passed to table.sort. It is called with two elements to compare (the keys to the table t in this case) and must return true if the first is less than the second. If omitted, table.sort uses the built-in < operator.
For example, given the following table:
t1 = {
a = 1,
b = 2,
c = 3,
}
then foreachinorder(t1,print) prints:
a 1
b 2
c 3
and foreachinorder(t1,print,function(a,b) return a>b end) prints:
c 3
b 2
a 1
You can only sort tables with consecutive integer keys starting at 1, i.e., lists. If you have another table of key-value pairs, you can make a list of pairs and sort that:
function sortpairs(t, lt)
local u = { }
for k, v in pairs(t) do table.insert(u, { key = k, value = v }) end
table.sort(u, lt)
return u
end
Of course this is useful only if you provide a custom ordering (lt) which expects as arguments key/value pairs.
This issue is discussed at greater length in a related question about sorting Lua tables.
Because they don't have any order in the first place. It's like trying to sort a garbage bag full of bananas.
