Data structure to Filter Data Quickly - data-structures

I am doing a bit of research into making an efficient filtering algorithm when it comes to many properties of specific data. This is kind of a fun project for me to learn new data structures.
For example, say I wanted all RPGs on PlayStation that had English releases.
Now I want to allow for much more complex queries.
Is there a good data structure to handle filtering on attributes like this without having to supply all of the attributes, so that I can give only a few and still find the correct games?
I currently plan to have "buckets", each of which describes an attribute; for example, the game IDs for each genre will be in one bucket, and so forth. Then I will use a hash table to add 1 to a game's count for each bucket it appears in, and only use games which have the correct count after the search.
But I want to try to find a faster or easier method. Any suggestions when it comes to filtering on many attributes to find sets of items?
Thanks,

What do you mean by "without the need to give all of the attributes"? Are you saying you have N attributes and you want to find the items that match l < N of the attributes, or are you saying that you don't want to compute an index for each attribute?
Hashing each attribute into buckets will give you O(1) time at the expense of O(n) space to store each index.
You could sort your list by one or two attributes to make some lookups O(log n), at the expense of having to do the sorting up front in O(n log n) time.
You could get kind of clever with Bloom filters for your attributes and let some attributes overlap. This would lead to some false positives, but you could filter those out after the fact. This gives you constant space with constant-time lookup in the average case (but O(n) time in the worst case).
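As a rough illustration of the "one bucket per attribute value" idea, here is a minimal Python sketch. The class name AttributeIndex, its methods, and the sample games are all made up for this example; it simply keeps a set of item IDs per (attribute, value) pair and intersects the sets for whichever attributes the query supplies.

```python
from collections import defaultdict


class AttributeIndex:
    """Hypothetical index: one bucket (a set of item IDs) per (attribute, value) pair."""

    def __init__(self):
        self._buckets = defaultdict(set)
        self._all_ids = set()

    def add(self, item_id, **attributes):
        # Register the item in the bucket of every attribute value it has.
        self._all_ids.add(item_id)
        for name, value in attributes.items():
            self._buckets[(name, value)].add(item_id)

    def query(self, **criteria):
        # Only the attributes given in the query are consulted.
        buckets = [self._buckets.get(pair, set()) for pair in criteria.items()]
        if not buckets:
            return set(self._all_ids)
        # Intersect starting from the smallest bucket to keep the work small.
        buckets.sort(key=len)
        result = set(buckets[0])
        for bucket in buckets[1:]:
            result &= bucket
        return result


index = AttributeIndex()
index.add(1, genre="RPG", platform="PlayStation", language="English")
index.add(2, genre="RPG", platform="PlayStation", language="Japanese")
index.add(3, genre="Racing", platform="PlayStation", language="English")

print(index.query(genre="RPG", platform="PlayStation", language="English"))  # {1}
```

A query that names only some attributes, such as index.query(platform="PlayStation"), works the same way; it just intersects fewer buckets.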

Related

Indexing by float or double field algorithm

I need to perform fast searches in a huge in-memory array of objects by some of the objects' fields, selecting the subset of objects that satisfies some criteria.
The criteria may be specified as a floating-point value or a range of such values (e.g. 2.5..10).
The problem is that the float property to be searched on is not uniformly distributed; there could be a few objects with values in the range 10-20 (for example), another million objects with values 0-1, and another million with values 100-150.
So, is it possible to build an index for searching those objects efficiently? Code samples are welcome.
If the in-memory array is ordered, then binary search would be my first attempt. The Wikipedia entry has example code as well.
http://en.wikipedia.org/wiki/Binary_search_algorithm
If you're doing lookups only, a single sort followed by multiple binary searches is good.
You could also try a perfect hash algorithm, if you want the ultimate in lookup speed and little more.
If you need more than just lookups, check out treaps and red-black trees. The former are fast on average, while the latter are decent performers with a low operation duration variability.
You could try a range tree, for the range requirement.
I fail to see what the distribution of values has to do with building an index (with the possible exception of exact duplicates). Since the data fits in memory, just extract all the fields with their original position, sort them, and use a binary search as suggested by @MattiLyra.
Are we missing something?
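A minimal Python sketch of the "extract, sort, binary search" suggestion, using the standard bisect module. The helper names build_index and range_query and the sample values are assumptions for illustration only.

```python
from bisect import bisect_left, bisect_right


def build_index(objects, key):
    """Sort (field value, original position) pairs once, up front."""
    pairs = sorted((key(obj), pos) for pos, obj in enumerate(objects))
    keys = [k for k, _ in pairs]
    positions = [p for _, p in pairs]
    return keys, positions


def range_query(index, low, high):
    """Return original positions of all objects with low <= value <= high."""
    keys, positions = index
    start = bisect_left(keys, low)    # first key >= low
    stop = bisect_right(keys, high)   # one past the last key <= high
    return positions[start:stop]


# Made-up sample data; each "object" is just its float field here.
values = [0.5, 120.0, 15.0, 0.25, 2.5, 101.5, 10.0]
index = build_index(values, key=lambda v: v)

print(range_query(index, 2.5, 10))      # [4, 6]
print(range_query(index, 15.0, 15.0))   # [2]
```

Each query costs two binary searches plus the size of the result, regardless of how unevenly the values are spread.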

Best data structure for a given set of operations - Add, Retrieve Min/Max and Retrieve a specific object

I am looking for the optimal (in time and space) data structure for supporting the following operations:
1. Add Persons (name, age) to a global data store of persons
2. Fetch the Person with minimum and maximum age
3. Search for a Person's age given the name
Here's what I could think of:
Keep an array of Persons, and keep adding to end of array when a new Person is to be added
Keep a hash of Person name vs. age, to assist in fetching person's age with given name
Maintain two objects minPerson and maxPerson for Person with min and max age. Update this if needed, when a new Person is added.
Now, although I keep a hash for better performance of (3), I think it may not be the best way if there are many collisions in the hash. Also, addition of a Person would mean an overhead of adding to the hash.
Is there anything that can be further optimized here?
Note: I am looking for the best (balanced) approach to support all these operations in minimum time and space.
You can get rid of the array as it doesn't provide anything that the other two structures can't do.
Otherwise, a hashtable + min/max is likely to perform well for your use case. In fact, this is precisely what I would use.
As to getting rid of the hashtable because a poor hash function might lead to collisions: well, don't use a poor hash function. I bet that the default hash function for strings that's provided by your programming language of choice is going to do pretty well out of the box.
It looks like you need a data structure that offers fast inserts and also supports fast queries on two different keys (name and age).
I would suggest keeping two data structures, one a sorted data structure (e.g. a balanced binary search tree) where the key is the age and the value is a pointer to the Person object, the other a hashtable where the key is the name and the value is a pointer to the Person object. Notice we don't keep two copies of the same object.
A balanced binary search tree would provide O(log(n)) inserts and max/min queries, while the hashtable would give us O(1) (amortized) inserts and lookups.
When we add a new Person, we just add a pointer to it to both data structures. For a min/max age query, we can retrieve the Object by querying the BST. For a name query we can just query the hashtable.
Your question does not ask for updates/deletes, but those are also doable by suitably updating both data structures.
It sounds like you're expecting the name to be the unique identifier; otherwise your operation 3 is ambiguous. (What is the correct return result if you have two entries for John Smith?)
Assuming that the uniqueness of a name is guaranteed, I would go with a plain hashtable keyed by names. Operation 1 and 3 are trivial to execute. Operation 2 could be done in O(N) time if you want to search through the data structure manually, or you can do like you suggest and keep track of the min/max and update it as you add/delete entries in the hash table.
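The "hashtable plus tracked min/max" approach could look roughly like this in Python. PersonStore and its method names are hypothetical, names are assumed to be unique, and deletions are not handled (they would require recomputing the min/max or keeping a sorted structure as suggested above).

```python
class PersonStore:
    """Hypothetical store: a dict keyed by name, plus running min/max by age."""

    def __init__(self):
        self._age_by_name = {}
        self._youngest = None  # (age, name) of the youngest person so far
        self._oldest = None    # (age, name) of the oldest person so far

    def add(self, name, age):
        # Assumption: names are unique identifiers.
        if name in self._age_by_name:
            raise ValueError(f"duplicate name: {name}")
        self._age_by_name[name] = age
        if self._youngest is None or age < self._youngest[0]:
            self._youngest = (age, name)
        if self._oldest is None or age > self._oldest[0]:
            self._oldest = (age, name)

    def age_of(self, name):
        return self._age_by_name[name]

    def youngest(self):
        return self._youngest

    def oldest(self):
        return self._oldest


store = PersonStore()
store.add("Ann", 31)
store.add("Bob", 19)
store.add("Cid", 64)

print(store.age_of("Ann"))  # 31
print(store.youngest())     # (19, 'Bob')
print(store.oldest())       # (64, 'Cid')
```

All three operations are O(1) on average here, which is why the extra array in the question adds nothing.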

Is there a method to generate a single key that remembers all the string that we have come across

I am dealing with hundreds of thousands of files.
I have to process those files one by one.
In doing so, I need to remember the files that have already been processed.
All I can think of is storing the file path of each file in a very long array, and then checking it every time for duplicates.
But I think that there should be some better way.
Is it possible for me to generate a KEY (which is a number) or something that just remembers all the files that have been processed?
You could use some kind of hash function (MD5, SHA1).
Pseudocode:
for each F in filelist
    hash = md5(F name)
    if not hash in storage
        process file F
        store hash in storage to remember
See https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5.
There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course you can get by without doing a linear search of the strings you've seen so far using all sorts of different methods like hash tables, binary trees, etc.
If I understand your question correctly, you want to create a SINGLE key that takes on a specific value, and from that value you should be able to deduce which files have been processed already? I don't think you are going to be able to do that, simply because your space is quite big and generating unique key representations in such a huge space requires a lot of memory.
As mentioned, what you can do is simply to store each path URL in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant time O(1), so it will be quite fast.
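A minimal Python sketch of the set-based bookkeeping, where Python's built-in set plays the role of the HashSet. The function name process_all, the callback parameter, and the sample paths are assumptions for illustration.

```python
def process_all(paths, process):
    """Call process(path) once per distinct path; return the set of paths seen."""
    seen = set()
    for path in paths:
        if path in seen:      # average O(1) membership test
            continue
        process(path)
        seen.add(path)
    return seen


handled = []
seen = process_all(["/data/a.txt", "/data/b.txt", "/data/a.txt"], handled.append)

print(handled)  # ['/data/a.txt', '/data/b.txt']
```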
A Bloom filter can solve your problem.
The idea of a Bloom filter is simple. It begins with an empty array of some length, with all its members set to zero. We also have K hash functions.
Whenever we need to insert an item into the Bloom filter, we hash the item with all K hash functions. These hash functions give K indexes into the array. At each of these indexes, we set the member's value to 1.
To check if an item exists in the Bloom filter, simply hash it with all K hash functions and check the corresponding array indexes. If all of them are 1, the item is (probably) present in the Bloom filter.
Note that a Bloom filter can give false positives, but it never gives false negatives. You need to adapt the algorithm to handle these false-positive cases.
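The description above can be sketched in Python as follows. This is only an illustration under assumptions: the class name, the default sizes, and the choice of deriving the K hash functions from SHA-256 with a different seed prefix each are mine, not part of the answer.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: a bit array plus K seeded hash functions."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)  # all bits start at zero

    def _indexes(self, item):
        # Assumption: K hash functions are simulated by prefixing a seed.
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))


processed = BloomFilter()
processed.add("/data/a.txt")

print(processed.might_contain("/data/a.txt"))  # True
```

If a false positive would mean wrongly skipping a file, a positive answer still needs an exact check against the stored paths, which is the trade-off the pigeonhole argument above points to.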
What you need, IMHO, is some sort of tree-based or hash-based set implementation. It is basically a data structure that supports very fast add, remove and query operations and keeps only one instance of each element (i.e. no duplicates). A few hundred thousand strings (assuming they are not themselves hundreds of thousands of characters long) should not be a problem for such a data structure.
Your programming language of choice probably already has one, so you don't need to write one yourself. C++ has std::set. Java has the Set implementations TreeSet and HashSet. Python has set. They all allow you to add elements and check for the presence of an element very fast (O(1) for hashtable-based sets, O(log(n)) for tree-based sets). Other than those, there are lots of free implementations of sets as well as general-purpose binary search trees and hashtables that you can use.

The algorithm used to generate recommendations in Google News?

I'm studying recommendation engines, and I went through the paper that describes how Google News generates recommendations to users for news items which might be of interest to them, based on collaborative filtering.
One interesting technique that they mention is MinHashing. I went through what it does, but I'm pretty sure that what I have is a fuzzy idea and there is a strong chance that I'm wrong. The following is what I could make out of it:
1. Collect a set of all news items.
2. Define a hash function for a user. This hash function returns the index of the first item from the news items which this user viewed, in the list of all news items.
3. Collect, say, "n" such values, and represent a user with this list of values.
4. Based on the similarity count between these lists, we can calculate the similarity between users as the number of common items. This reduces the number of comparisons a lot.
5. Based on these similarity measures, group users into different clusters.
This is just what I think it might be. In step 2, instead of defining a constant hash function, it might be possible to vary the hash function so that it returns the index of a different element. So one hash function could return the index of the first element from the user's list, another hash function could return the index of the second element from the user's list, and so on. Given that the hash functions must satisfy the min-wise independent permutations condition, this does sound like a possible approach.
Could anyone please confirm whether what I think is correct? Or does the MinHashing portion of Google News recommendations work in some other way? I'm new to the internal implementations of recommendations. Any help is appreciated a lot.
Thanks!
I think you're close.
First of all, the hash function first randomly permutes all the news items, and then for any given person looks at the first item. Since everyone had the same permutation, two people have a decent chance of having the same first item.
Then, to get a new hash function, rather than choosing the second element (which would have some confusing dependencies on the first element), they choose a whole new permutation and take the first element again.
People who happen to have the same hash value 2-4 times (that is, the same first element in 2-4 permutations) are put together in a cluster. This algorithm is repeated 10-20 times, so that each person gets put into 10-20 clusters. Finally, recommendations are given based on (the small number of) other people in the 10-20 clusters. Since all this work is done by hashing, people are put directly into buckets for their clusters, and large numbers of comparisons aren't needed.
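To make the permutation idea concrete, here is a small Python sketch. It is not the Google News implementation: it uses explicit shuffled permutations (real systems typically approximate these with hash functions), and the function names, user names and item IDs are invented for the example.

```python
import random


def make_permutations(items, count, seed=0):
    """Build `count` random orderings of the items, stored as item -> rank maps."""
    rng = random.Random(seed)
    perms = []
    for _ in range(count):
        order = list(items)
        rng.shuffle(order)
        perms.append({item: rank for rank, item in enumerate(order)})
    return perms


def minhash_signature(viewed, perms):
    """For each permutation, take the user's viewed item that comes first in it."""
    return [min(viewed, key=lambda item: rank[item]) for rank in perms]


def estimated_similarity(sig_a, sig_b):
    """Fraction of permutations on which two users share the same first item."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


news = ["n%d" % i for i in range(50)]
perms = make_permutations(news, count=100)

alice = minhash_signature({"n1", "n2", "n3", "n4"}, perms)
bob = minhash_signature({"n2", "n3", "n4", "n5"}, perms)
carol = minhash_signature({"n40", "n41"}, perms)

print(estimated_similarity(alice, alice))  # 1.0
print(estimated_similarity(alice, carol))  # 0.0
print(estimated_similarity(alice, bob))    # an estimate of the overlap of the two sets
```

Users who share no items can never agree on a first item, while users with heavily overlapping histories agree on most permutations; grouping users by a few concatenated signature values is what places them in the same bucket without pairwise comparisons.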

I was asked this in a recent interview

I was asked to stay away from HashMap or any sort of Hashing.
The question went something like this -
Let's say you have PRODUCT IDs of up to 20 decimals, along with Product Descriptions. Without using Maps or any sort of hashing function, what's the best/most efficient way to store/retrieve these product IDs along with their descriptions?
Why is using Maps a bad idea for such a scenario?
What changes would you make to sell your solution to Amazon?
A map is good to use when insert/remove/lookup operations are interleaved. Each operation is amortized O(log n).
In your example you are only doing search operations. You may assume that database updates (inserting/removing a product) won't happen very often. Therefore the interviewer probably wants you to find the best data structure for lookup operations.
In this case I can see only a few options, as already proposed in other answers:
Sorted array (doing a binary search)
Hashmap
Trie
With a trie, if product IDs do not share a common prefix, there is a good chance of finding the product description by looking only at the first character (or the first few characters) of the ID. For instance, let's take this product ID list, with 125 products:
"1"
"2"
"3"
...
"123"
"124"
"1234567"
Let's assume you are looking for the product ID "1234567" in your trie. Looking only at the first characters, "1" then "2" then "3" then "4", will lead to the right product description. There is no need to read the remainder of the product ID, as there are no other possibilities.
Taking the product ID length as n, your lookup will be O(n). But as the example above shows, retrieving the product description could be even faster. As the product ID is limited in size (20 characters), the trie height is limited to 20 levels. That means the lookup never takes more than constant time, as your search never goes beyond the trie height => O(1). BST lookups, by contrast, are at best amortized O(log N), N being the number of items in your tree.
A hashmap could lead to slower lookups, as you need to compute an index with a hash function that probably reads the whole product ID, plus walk a list in case of collisions with other product IDs.
With a binary search on a sorted array, lookup performance depends on the number of items in your database.
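A minimal Python sketch of a digit trie along these lines. The class names are hypothetical, and each node holds a fixed 10-slot child array indexed by digit, so no map or hash function is involved. Unlike the shortcut described above, this simple version walks every digit of the ID rather than stopping as soon as the path becomes unique, which is still at most 20 steps.

```python
class TrieNode:
    __slots__ = ("children", "description")

    def __init__(self):
        self.children = [None] * 10   # one slot per decimal digit
        self.description = None       # set only where a product ID ends


class ProductTrie:
    """Hypothetical trie keyed by the decimal digits of a product ID string."""

    def __init__(self):
        self._root = TrieNode()

    def insert(self, product_id, description):
        node = self._root
        for ch in product_id:
            digit = int(ch)
            if node.children[digit] is None:
                node.children[digit] = TrieNode()
            node = node.children[digit]
        node.description = description

    def lookup(self, product_id):
        node = self._root
        for ch in product_id:
            node = node.children[int(ch)]
            if node is None:
                return None           # no product with this prefix
        return node.description


trie = ProductTrie()
trie.insert("123", "toaster")
trie.insert("124", "kettle")
trie.insert("1234567", "blender")

print(trie.lookup("1234567"))  # blender
print(trie.lookup("12"))       # None
```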
A B-Tree in my opinion. Does that still count as a Map?
Mostly because you can have many items loaded at once in memory. Searching these items in memory is very fast.
Consecutive integer numbers are a perfect fit for a hash map, but it has one problem: it does not support multithreaded access by default. Also, since Amazon was mentioned in your question, I think you need to take concurrency and RAM limitations into account.
What you might do in response to such a question is to explain that since you are disallowed from using any built-in data storage schemes, all you can do is "emulate" one.
So, let's say you have M = 10^20 products with their numbers and descriptions.
You can partition this set into subsets of N elements each.
Then you can organize M/N containers, each of which has a significantly reduced number of elements. Using this idea recursively will give you a way to store the whole set in containers such that access to them has acceptable performance.
To illustrate this idea, consider a smaller example of only 20 elements.
I would like you to imagine a file system with directories "1", "2", "3", "4".
In each directory you store the product descriptions as files in the following way:
folder 1: files 1 to 5
folder 2: files 6 to 10
...
folder 4: files 16 to 20
Then your search would need only two steps to find the file.
First, you find the correct folder by dividing the ID by 5, the number of files per folder, and rounding up (there are 20 / 5 = 4 folders, your M/N).
Then, you use the given ID to read the product description stored in the file.
This is just a very rough description, however, the idea is very intuitive.
So, perhaps this is what your interviewer wanted to hear.
As for myself, when I face such questions in an interview, even if I fail to understand the question correctly (which is the worst case :)) I always try to get the correct answer from the interviewer.
"Best/efficient for what?" would have been my answer.
E.g. for storing them, probably the fastest thing to do is to use two arrays with 20 elements each: one for the IDs, one for the descriptions. Iterating over those is pretty fast too. And it is efficient memory-wise.
Of course the solution is pretty useless for any real application, but so is the question.
There is an interesting alternative to B-Tree: Radix Tree
I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.
If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.
You could use the ProductID as an address, casting it to a pointer, and then get that byte, which might be an offset in another memory for actual data.
I wouldn't do it this way, but perhaps that is the answer they were looking for.
Asaf
I wonder if they wanted you to note that in an ecommerce application (such as Amazon's), a common use case is "reverse lookup": retrieve the product ID using the description. For this, an inverted index is used, where each keyword in a description is an index key, which is associated with a list of relevant product identifiers. Binary trees or skip lists are good ways to index these key words.
Regarding the product identifier index: In practice, B-Trees (which are not binary search trees) would be used for a large, disk-based index of 20-digit identifiers. However, they may have been looking for a toy solution that could be implemented in RAM. Since the "alphabet" of decimal numbers is so small, it lends itself very nicely to a trie.
Hashmaps work really well if the hashing function gives you a very uniform distribution of the hash values of the existing keys. With a really bad hash function, it can happen that the hash values of all 20 of your keys are the same, which pushes the retrieval time to O(n). Binary search, on the other hand, guarantees you O(log n), but inserting data is more expensive.
All of this is very incremental, the bigger your dataset is the less are the chances of a bad key distribution (if you are using a good, proven hash algorithm), and on smaller data sets the difference between O(n) and O(log n) is not much to worry about.
If the size is limited sometimes it's faster to use a sorted list.
When you use Hash-anything, you first have to calculate a hash, then locate the hash bucket, then use equals on all elements in the bucket. So it all adds up.
On the other hand you could use just a simple ArrayList ( or any other List flavor that is suitable for the application), sort it with java.util.Collections.sort and use java.util.Collections.binarySearch to find an element.
But as Artyom has pointed out maybe a simple linear search would be much faster in this case.
On the other hand, from maintainability point of view, I would normally use HashMap ( or LinkedHashMap ) here, and would only do something special here when profiler would tell me to do it. Also collections of 20 have a tendency to become collections of 20000 over time and all this optimization would be wasted.
There's nothing wrong with hashing or B-trees for this kind of situation - your interviewer probably just wanted you to think a little, instead of coming out with the expected answer. It's a good sign, when interviewers want candidates to think. It shows that the organization values thought, as opposed to merely parroting out something from the lecture notes from CS0210.
Incidentally, I'm assuming that "20 decimal product ids" means "a large collection of product ids, whose format is 20 decimal characters"... because if there are only 20 of them, there's no value in considering the algorithm. If you can't use hashing or B-trees, code a linear search and move on. If you like, sort your array, and use a binary search.
But if my assumption is right, then what the interviewer is asking seems to revolve around the time/space tradeoff of hashmaps. It's possible to improve on the time/space curve of hashmaps - hashmaps do have collisions. So you might be able to get some improvement by converting the 20 decimal digits to a number, and using that as an index to a sparsely populated array... a really big array. :)
Selling it to Amazon? Good luck with that. Whatever you come up with would have to be patentable, and nothing in this discussion seems to rise to that level.
20 decimal PRODUCT IDs, along with Product Description
Simple linear search would be very good...
I would create one simple array with ids. And other array with data.
Linear search for a small number of keys (20!) is much more efficient than any binary tree or hash.
I have a feeling based on their answer about product ids and two digits the answer they were looking for is to convert the numeric product ids into a different base system or packed form.
They made a point to indicate the product description was with the product ids to tell you that a higher base system could be used within the current fields datatype.
Your interviewer might be looking for a trie. If you have a [small] constant upper bound on your key, then you have O(1) insert and lookup.
"I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.
If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value."
Unfortunately, 2^64 ≈ 1.8 * 10^19, just slightly below 10^20. Coincidence?
log2(10^20) ≈ 66.44.
Here's a slightly evil proposal.
OK, 2^64 bits can fit inside a memory space.
Assume a bound of N bytes for the description, say N=200. (who wants to download Anna Karenina when they're looking for toasters?)
Commandeer 8*N 64-bit machines with heavy RAM. Amazon can swing this.
Every machine loads in their (very sparse) bitmap one bit of the description text for all descriptions. Let the MMU/virtual memory handle the sparsity.
Broadcast the product tag as a 59-bit number and the bit mask for one byte. (59 = ceil(log2(10^20)) - 8)
Every machine returns one bit from the product description. Lookups are a virtual memory dereference. You can even insert and delete.
Of course paging will start to be a bitch at some point!
Oddly enough, it will work the best if product-id's are as clumpy and ungood a hash as possible.
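The figures quoted in this answer are easy to double-check. A few lines of Python (standard math module only; the variable name is arbitrary) reproduce the 2^64 versus 10^20 comparison, the bit count, and the 59-bit tag width:

```python
import math

bits_needed = math.log2(10**20)       # bits required to address 10^20 IDs

print(2**64 < 10**20)                 # True: a 64-bit space is slightly too small
print(round(bits_needed, 2))          # 66.44
print(math.ceil(bits_needed) - 8)     # 59
```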
