Ruby - Check if intersect exists - ruby

I'm trying to speed up a search function in a RoR app w/ Postgres DB. I won't explain how it works currently...just go with an /achieve approach!
I have x number of records (potentially a substantial number) which each have an associated array of Facebook ID numbers...potentially up to 5k. I need to search against this with an individual's list of friend IDs to ascertain if an intersect between the search array and any (and which) of the records' arrays exists.
I don't need to know the result of the intersection, just whether it's true or false.
Any bright ideas?!
Thanks!

Just using pure ruby since you don't mention your datastore:
friend_ids = user.friend_ids
results = records.select { |record| !(record.friend_ids & friend_ids).empty? }
results will contain all records that have at least 1 friend_id in common. This will not be very fast if you have to check a very large number of records.
& is the array intersection operator, which is implemented in C, you can see it here: http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-26

A probably faster version of #ctcherry's answer, especially when user.friend_ids has high cardinality:
require 'set'
user_friend_ids = Set[ user.friend_ids ]
results = records.select { |record|
record.friend_ids.any? { |friend_id| user_friend_ids.include? friend_id }
}
Since this constructs the test set(hash) for user.freind_ids only once, it's probably also faster than the Array#memory_efficient_intersect linked by #Tass.
This may also be faster performed in the db, but without more info on the models, it's hard to compose an approach.

Related

Designing an algorithm to check combinations

I´m having serious performance issues with a job that is running everyday and I think i cannot improve the algorithm; so I´m gonnga explain you what is the problem to solve and the algorithm we have, and maybe you have some other ideas to solve the problem better.
So the problem we have to solve is:
There is a set of Rules, ~ 120.000 Rules.
Every rule has a set of combinations of Codes. Codes are basically strings. So we have ~8 combinations per rule. Example of a combination: TTAAT;ZZUHH;GGZZU;WWOOF;SSJJW;FFFOLL
There is a set of Objects, ~800 objects.
Every object has a set of ~200 codes.
We have to check for every Rule, if there is at least one Combination of Codes that is fully contained in the Objects. It means =>
loop in Rules
Loop in Combinations of the rule
Loop in Objects
every code of the combination found in the Object? => create relationship rule/object and continue with the next object
end of loop
end of loop
end of loop
For example, if we have the Rule with this combination of two codes: HHGGT; ZZUUF
And let´s say we have an object with this codes: HHGGT; DHZZU; OIJUH; ZHGTF; HHGGT; JUHZT; ZZUUF; TGRFE; UHZGT; FCDXS
Then we create a relationship between the Object and the Rule because every code of the combination of the rule is contained in the codes of the object => this is what the algorithm has to do.
As you can see this is quite expensive, because we need 120.000 x 8 x 800 = 750 millions of times in the worst-case scenario.
This is a simplified scenario of the real problem; actually what we do in the loops is a little bit more complicated, that´s why we have to reduce this somehow.
I tried to think in a solution but I don´t have any ideas!
Do you see something wrong here?
Best regards and thank you for the time :)
Something like this might work better if I'm understanding correctly (this is in python):
RULES = [
['abc', 'def',],
['aaa', 'sfd',],
['xyy', 'eff',]]
OBJECTS = [
('rrr', 'abc', 'www', 'def'),
('pqs', 'llq', 'aaa', 'sdr'),
('xyy', 'hjk', 'fed', 'eff'),
('pnn', 'rrr', 'mmm', 'qsq')
]
MapOfCodesToObjects = {}
for obj in OBJECTS:
for code in obj:
if (code in MapOfCodesToObjects):
MapOfCodesToObjects[code].add(obj)
else:
MapOfCodesToObjects[code] = set({obj})
RELATIONS = []
for rule in RULES:
if (len(rule) == 0):
continue
if (rule[0] in MapOfCodesToObjects):
ValidObjects = MapOfCodesToObjects[rule[0]]
else:
continue
for i in range(1, len(rule)):
if (rule[i] in MapOfCodesToObjects):
codeObjects = MapOfCodesToObjects[rule[i]]
else:
ValidObjects = set()
break
ValidObjects = ValidObjects.intersection(codeObjects)
if (len(ValidObjects) == 0):
break
for vo in ValidObjects:
RELATIONS.append((rule, vo))
for R in RELATIONS:
print(R)
First you build a map of codes to objects. If there are nObj objects and nCodePerObj codes on average per object, this takes O(nObj*nCodePerObj * log(nObj*nCodePerObj).
Next you iterate through the rules and look up each code in each rule in the map you built. There is a relation if a certain object occurs for every code in the rule, i.e. if it is in the set intersection of the objects for every code in the rule. Since hash lookups have O(1) time complexity on average, and set intersection has time complexity O(min of the lengths of the 2 sets), this will take O(nRule * nCodePerRule * nObjectsPerCode), (note that is nObjectsPerCode, not nCodePerObj, the performance gets worse when one code is included in many objects).

Bloom filters and its multiple hash functions

I'm implementing a simple Bloom filter as an exercise.
Bloom filters require multiple hash functions, which for practical purposes I don't have.
Assuming I want to have 3 hash functions, isn't it enough to just take the hash of the object I'm checking membership for, hashing it (with murmur3) and then add +1, +2, +3 (for the 3 different hashes) before hashing them again?
As the murmur3 function has a very good avalanche effect (really spreads out results) wouldn't this for all purposes be reasonable?
Pseudo-code:
function generateHashes(obj) {
long hash = murmur3_hash(obj);
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
(hash1, hash2, hash3)
}
If not, what would be a simple, useful approach to this? I'd like to have a solution that would allow me to easily scale for more hash functions if needed be.
AFAIK, the usual approach is to not actually use multiple hash functions. Rather, hash once and split the resulting hash into 2, 3, or how many parts you want for your Bloom filter. So for example create a hash of 128 bits and split it into 2 hashes 64 bit each.
https://github.com/Claudenw/BloomFilter/wiki/Bloom-Filters----An-overview
The hashing functions of Bloom filter should be independent and random enough. MurmurHash is great for this purpose. So your approach is correct, and you can generate as many new hashes your way. For the educational purposes it is fine.
But in real world, running hashing function multiple times is slow, so the usual approach is to create ad-hoc hashes without actually calculating the hash.
To correct #memo, this is done not by splitting the hash into multiple parts, as the width of the hash should remain constant (and you can't split 64 bit hash to more than 64 parts ;) ). The approach is to get a two independent hashes and combine them.
function generateHashes(obj) {
// initialization phase
long h1 = murmur3_hash(obj);
long h2 = murmur3_hash(h1);
int k = 3; // number of desired hash functions
long hash[k];
// generation phase
for (int i=0; i<k; i++) {
hash[i] = h1 + (i*h2);
}
return hash;
}
As you see, this way creating a new hash is a simple multiply-add operation.
It would not be a good approach. Let me try and explain. Bloom filter allows you to test if an element most likely belongs to a set, or if it absolutely doesn’t. In others words, false positives may occur, but false negatives won’t.
Reference: https://sc5.io/posts/what-are-bloom-filters-and-why-are-they-useful/
Let us consider an example:
You have an input string 'foo' and we pass it to the multiple hash functions. murmur3 hash gives the output K, and subsequent hashes on this hash value give x, y and z
Now assume you have another string 'bar' and as it happens, its murmur3 hash is also K. The remaining hash values? They will be x, y and z because in your proposed approach the subsequent hash functions are not dependent on the input, but instead on the output of first hash function.
long hash1 = murmur3_hash(hash+1);
long hash2 = murmur3_hash(hash+2);
long hash3 = murmur3_hash(hash+3);
As explained in the link, the purpose is to perform a probabilistic search in a set. If we perform search for 'foo' or for 'bar' we would say that it is 'likely' that both of them are present. So the % of false positives will increase.
In other words this bloom filter will behave like a simple hash-function. The 'bloom' aspect of it will not come into picture because only the first hash function is determining the outcome of search.
Hope I was able to explain sufficiently. Let me know in comments if you have some more follow-up queries. Would be happy to assist.

Hash Tables and Separate Chaining: How do you know which value to return from the bucket's list?

We're learning about hash tables in my data structures and algorithms class, and I'm having trouble understanding separate chaining.
I know the basic premise: each bucket has a pointer to a Node that contains a key-value pair, and each Node contains a pointer to the next (potential) Node in the current bucket's mini linked list. This is mainly used to handle collisions.
Now, suppose for simplicity that the hash table has 5 buckets. Suppose I wrote the following lines of code in my main after creating an appropriate hash table instance.
myHashTable["rick"] = "Rick Sanchez";
myHashTable["morty"] = "Morty Smith";
Let's imagine whatever hashing function we're using just so happens to produce the same bucket index for both string keys rick and morty. Let's say that bucket index is index 0, for simplicity.
So at index 0 in our hash table, we have two nodes with values of Rick Sanchez and Morty Smith, in whatever order we decide to put them in (the first pointing to the second).
When I want to display the corresponding value for rick, which is Rick Sanchez per our code here, the hashing function will produce the bucket index of 0.
How do I decide which node needs to be returned? Do I loop through the nodes until I find the one whose key matches rick?
To resolve Hash Tables conflicts, that's it, to put or get an item into the Hash Table whose hash value collides with another one, you will end up reducing a map to the data structure that is backing the hash table implementation; this is generally a linked list. In the case of a collision this is the worst case for the Hash Table structure and you will end up with an O(n) operation to get to the correct item in the linked list. That's it, a loop as you said, that will search the item with the matching key. But, in the cases that you have a data structure like a balanced tree to search, it can be O(logN) time, as the Java8 implementation.
As JEP 180: Handle Frequent HashMap Collisions with Balanced Trees says:
The principal idea is that once the number of items in a hash bucket
grows beyond a certain threshold, that bucket will switch from using a
linked list of entries to a balanced tree. In the case of high hash
collisions, this will improve worst-case performance from O(n) to
O(log n).
This technique has already been implemented in the latest version of
the java.util.concurrent.ConcurrentHashMap class, which is also slated
for inclusion in JDK 8 as part of JEP 155. Portions of that code will
be re-used to implement the same idea in the HashMap and LinkedHashMap
classes.
I strongly suggest to always look at some existing implementation. To say about one, you could look at the Java 7 implementation. That will increase your code reading skills, that is almost more important or you do more often than writing code. I know that it is more effort but it will pay off.
For example, take a look at the HashTable.get method from Java 7:
public synchronized V get(Object key) {
Entry<?,?> tab[] = table;
int hash = key.hashCode();
int index = (hash & 0x7FFFFFFF) % tab.length;
for (Entry<?,?> e = tab[index] ; e != null ; e = e.next) {
if ((e.hash == hash) && e.key.equals(key)) {
return (V)e.value;
}
}
return null;
}
Here we see that if ((e.hash == hash) && e.key.equals(key)) is trying to find the correct item with the matching key.
And here is the full source code: HashTable.java

Ruby - Optimize code finding the optimal choice from an array

I asked a question that was basically a knapsack problem - I needed to find the combination of several different array of objects that gave the optimal output. So for example, the highest sum "value" from the objects with respect to a limit on the "cost" of each object. The answer I received here was the following-
a.product(b,c)
.select{ |arr| arr.reduce(0) { |sum,h| sum + h[:cost] } < 30 }
.max_by{ |arr| arr.reduce(0) { |sum,h| sum + h[:value] } }
Which works great, but as I get into 6 arrays with ~40 choices each, the possible combinations get upwards of 4 million and take too long to process. I made some changes to the code that made processing faster -
#creating the array doesn't take too long
combinations = a.product(b,c,d,e)
possibles = []
combinations.each do |array_of_objects|
#max_cost is a numeric parameter, and I can't have the same exact object used twice
if !(array_of_objects.sum(&:salary) > max_cost) or !(array_of_objects.uniq.count < array_of_objects.count)
possibles << array_of_objects
end
end
possibles.max_by{ |ar| ar.sum(&:std_proj) }
Breaking it into two separate arrays helped the performance a lot as I only had to check the max_by for many less possible combinations that fit the criteria.
Does anyone see a way to optimize this code? Since I'm typically dealing with tens of thousands or millions of combinations, any little bit could greatly help. Thanks.
If we are talking about millions of rows, and the operations are like unique and max.
I suggest you to solve it by using DISINCT and MAX() in your query and You can even use WHERE filtering by cost.
Looping over the objects in Ruby, is clearly more expensive.

Count, size, length...too many choices in Ruby?

I can't seem to find a definitive answer on this and I want to make sure I understand this to the "n'th level" :-)
a = { "a" => "Hello", "b" => "World" }
a.count # 2
a.size # 2
a.length # 2
a = [ 10, 20 ]
a.count # 2
a.size # 2
a.length # 2
So which to use? If I want to know if a has more than one element then it doesn't seem to matter but I want to make sure I understand the real difference. This applies to arrays too. I get the same results.
Also, I realize that count/size/length have different meanings with ActiveRecord. I'm mostly interested in pure Ruby (1.92) right now but if anyone wants to chime in on the difference AR makes that would be appreciated as well.
Thanks!
For arrays and hashes size is an alias for length. They are synonyms and do exactly the same thing.
count is more versatile - it can take an element or predicate and count only those items that match.
> [1,2,3].count{|x| x > 2 }
=> 1
In the case where you don't provide a parameter to count it has basically the same effect as calling length. There can be a performance difference though.
We can see from the source code for Array that they do almost exactly the same thing. Here is the C code for the implementation of array.length:
static VALUE
rb_ary_length(VALUE ary)
{
long len = RARRAY_LEN(ary);
return LONG2NUM(len);
}
And here is the relevant part from the implementation of array.count:
static VALUE
rb_ary_count(int argc, VALUE *argv, VALUE ary)
{
long n = 0;
if (argc == 0) {
VALUE *p, *pend;
if (!rb_block_given_p())
return LONG2NUM(RARRAY_LEN(ary));
// etc..
}
}
The code for array.count does a few extra checks but in the end calls the exact same code: LONG2NUM(RARRAY_LEN(ary)).
Hashes (source code) on the other hand don't seem to implement their own optimized version of count so the implementation from Enumerable (source code) is used, which iterates over all the elements and counts them one-by-one.
In general I'd advise using length (or its alias size) rather than count if you want to know how many elements there are altogether.
Regarding ActiveRecord, on the other hand, there are important differences. check out this post:
Counting ActiveRecord associations: count, size or length?
There is a crucial difference for applications which make use of database connections.
When you are using many ORMs (ActiveRecord, DataMapper, etc.) the general understanding is that .size will generate a query that requests all of the items from the database ('select * from mytable') and then give you the number of items resulting, whereas .count will generate a single query ('select count(*) from mytable') which is considerably faster.
Because these ORMs are so prevalent I following the principle of least astonishment. In general if I have something in memory already, then I use .size, and if my code will generate a request to a database (or external service via an API) I use .count.
In most cases (e.g. Array or String) size is an alias for length.
count normally comes from Enumerable and can take an optional predicate block. Thus enumerable.count {cond} is [roughly] (enumerable.select {cond}).length -- it can of course bypass the intermediate structure as it just needs the count of matching predicates.
Note: I am not sure if count forces an evaluation of the enumeration if the block is not specified or if it short-circuits to the length if possible.
Edit (and thanks to Mark's answer!): count without a block (at least for Arrays) does not force an evaluation. I suppose without formal behavior it's "open" for other implementations, if forcing an evaluation without a predicate ever even really makes sense anyway.
I found a good answare at http://blog.hasmanythrough.com/2008/2/27/count-length-size
In ActiveRecord, there are several ways to find out how many records
are in an association, and there are some subtle differences in how
they work.
post.comments.count - Determine the number of elements with an SQL
COUNT query. You can also specify conditions to count only a subset of
the associated elements (e.g. :conditions => {:author_name =>
"josh"}). If you set up a counter cache on the association, #count
will return that cached value instead of executing a new query.
post.comments.length - This always loads the contents of the
association into memory, then returns the number of elements loaded.
Note that this won't force an update if the association had been
previously loaded and then new comments were created through another
way (e.g. Comment.create(...) instead of post.comments.create(...)).
post.comments.size - This works as a combination of the two previous
options. If the collection has already been loaded, it will return its
length just like calling #length. If it hasn't been loaded yet, it's
like calling #count.
Also I have a personal experience:
<%= h(params.size.to_s) %> # works_like_that !
<%= h(params.count.to_s) %> # does_not_work_like_that !
We have a several ways to find out how many elements in an array like .length, .count and .size. However, It's better to use array.size rather than array.count. Because .size is better in performance.
Adding more to Mark Byers answer. In Ruby the method array.size is an alias to Array#length method. There is no technical difference in using any of these two methods. Possibly you won't see any difference in performance as well. However, the array.count also does the same job but with some extra functionalities Array#count
It can be used to get total no of elements based on some condition. Count can be called in three ways:
Array#count # Returns number of elements in Array
Array#count n # Returns number of elements having value n in Array
Array#count{|i| i.even?} Returns count based on condition invoked on each element array
array = [1,2,3,4,5,6,7,4,3,2,4,5,6,7,1,2,4]
array.size # => 17
array.length # => 17
array.count # => 17
Here all three methods do the same job. However here is where the count gets interesting.
Let us say, I want to find how many array elements does the array contains with value 2
array.count 2 # => 3
The array has a total of three elements with value as 2.
Now, I want to find all the array elements greater than 4
array.count{|i| i > 4} # =>6
The array has total 6 elements which are > than 4.
I hope it gives some info about count method.

Resources