Indexing by float or double field algorithm - algorithm

I have a task to perform fast search in huge in-memory array of objects by some object's fields. I need to select the subset of objects satisfying some criteria.
The criteria may be specified as a floating point value or range of such values (eg. 2.5..10).
The problem is that the float property to be searched on is not quite uniformly distributed; it could contain few objects with value range 10-20 (for example) and another million objects with values 0-1, and another million with values 100-150.
So, how possible is it to build index for effective searching those objects? Code samples are welcome.

If the in memory array is ordered then binary search would be my first attempt. Wikipedia entry has example code as well.
http://en.wikipedia.org/wiki/Binary_search_algorithm

If you're doing lookups only, a single sort followed by multiple binary searches is good.
You could also try a perfect hash algorithm, if you want the ultimate in lookup speed and little more.
If you need more than just lookups, check out treaps and red-black trees. The former are fast on average, while the latter are decent performers with a low operation duration variability.

You could try a range tree, for the range requirement.

I fail to see what the distribution of values has to do with building an index (with the possible exception of exact duplicates). Since the data fits in memory, just extract all the fields with their original position, sort them, and use a binary search as suggested by #MattiLyra.
Are we missing something?

Related

What data structure should I use where the key falls within a range?

TL;DR: What data structure should I use for looking up key-value pairs where the key needs to fall within a range?
I'm looking for something like a Dictionary but with a twist.
I have a HexEditor with lines, say 8 bytes per line (this can and does differ though).
Any byte within the memblock displayed by the hexeditor can have a comment.
One or zero Comments are associated with one byte-address.
Obviously a range of bytes can have multiple comments and if so all comments will be displayed on a line.
I thought about storing the comments in a Dictionary<Int, String> however that will not work, because I need to lookup if the comment falls within a range and a Dict only matches on exact matches.
The range can change dynamically so I can't link to that either.
It is possible to change the number of bytes per line on the fly and I don't want to have to reconstitute the data store/recalculate all my hashes, so using a dictionary with start-end values as the key is out.
I don't want to do a query to the Dict for every byte in a line.
I suspect the answer is "binary tree" but I'm hoping for something a bit more O(1)ish.
Beware of O(1) when there is a high constant time involved, like is the case for hashed dictionaries, as the cost of hashing is never negligible.
Binary search (as in a binary tree or for an ordered list) is only O(log n), and log is a function that grows very slowly.
When looking up an Integer key, odds are you will be able to perform a score of comparisons in the same time it takes to compute a single hash, and a score of comparisons is enough to perform a binary search among a million elements.

How does Lucene filter over a range of continuous values

From a data structure point of view how does Lucene filter over a range of continuous values?
I understand that Lucene relies upon a compressed bit array data structure akin to CONCISE. Conceptually this bit array holds a 0 for every document that doesn't match a term and a 1 for every document that does match a term. But the cool/awesome part is that this array can be highly compressed and is very fast at boolean operations. For example if you want to know which documents contain the terms "red" and "blue" then you grab the bit array corresponding to "red" and the bit array corresponding to "blue" and AND them together to get a bit array corresponding to matching documents.
But Lucene also provides very fast lookup on ranges of continuous values and if Lucene still uses this same compressed bit array approach, I don't understand how this happens efficiently in computation or memory. Here's my assumption, you tell me how close I am: Lucene discretizes the continuous field and creates a bit array for each of these (now discrete) values. Then when you want a range across values you grab the bit arrays corresponding to this range and just AND them together. Since I suspect there would be TONS of arrays to AND together, then maybe you do this at several levels of granularity so that you can AND together big chunks as much as possible.
UPDATE:
Oh! I just realized another alternative that would be faster. Since we know that the bit arrays for the discretized ranges mentioned above will not overlap, then we can store the bit arrays sequentially. If we have a start and end values for our range, then we can keep an index to the corresponding points in this array of bit arrays. At that point we just jump in the the array of bit arrays and start scanning it as if we're scanning a single bit array. Am I close?
Range queries (say 0 to 100) are a union of the lists of all the terms (1, 2, 3, 4, 5...) in the range. The problem is, if the range has to visit many terms, because it means processing many short term lists.
it would be better to process only a few long lists (which is what lucene is optimized for). So when you use a numeric field and index a number (like 4), we actually index it redundantly several times, adding some "fake terms" at lower precision. This allows us to instead process a range like 0 to 100 by processing say 7 terms instead of 100: "0-63", "64-95", 96, 97, 98, 99, 100. In this example "0-63" and "64-95" are the additional redundant terms that represent ranges of values.
I got a chance to study up on this in depth. The tl;dr if you don't want to read these links at the bottom:
At index time, a number such as 1234 this gets converted into multiple tokens, the original number [1234] and several tokens that represent less precise versions of the original number: [123x], [12xx], and [1xxx]. (This is somewhat of a simplification for ease of communicating the idea.)
At query time you take advantage of the fact that the less precise tokens allow us to search over ranges of tokens. So Lucene doesn't do the naive thing - a search by sweeping through the term dictionary of full precision terms, pulling out all matching numbers, and doing a Boolean OR search over all terms. Instead Lucene uses the term that covers the largest ranges possible and ORs together this much smaller set. For example, to search for all numbers that range from 1776 to 2015 Lucene would OR together these tokens: [1776], [1777], [1778], [1779], [18xx], [19xx], [200x], [2010], [2011], [2012], [2013], [2014], [2015].
Here's a nice walk though:
What's an inverted index.
How are they queried? Technically queries with naive techniques are O(n) where n is the num documents, but practically it's still a very fast O(n) and I think you'll see why when you read it.
How can they be queried faster with skip lists? (TBH I haven't read this, but I know what it's going to say.)
Inverted Indices can be used to construct fast numeric range queries when the numbers are indexed properly.
Here's the class that queries the inverted index. And it has good documentation.
Here's the class that indexes the data. and here's the most important spot in the code that does the tokenization.

Is there a method to generate a single key that remembers all the string that we have come across

I am dealing with hundreds of thousands of files,
I have to process those files 1-by-1,
In doing so, I need to remember the files that are already processed.
All I can think of is strong the file path of each file in a lo----ong array, and then checking it every time for duplication.
But, I think that there should be some better way,
Is it possible for me to generate a KEY (which is a number) or something, that just remembers all the files that have been processed?
You could use some kind of hash function (MD5, SHA1).
Pseudocode:
for each F in filelist
hash = md5(F name)
if not hash in storage
process file F
store hash in storage to remember
see https://www.rfc-editor.org/rfc/rfc1321 for a C implementation of MD5
There are probabilistic methods that give approximate results, but if you want to know for sure whether a string is one you've seen before or not, you must store all the strings you've seen so far, or equivalent information. It's a pigeonhole principle argument. Of course you can get by without doing a linear search of the strings you've seen so far using all sorts of different methods like hash tables, binary trees, etc.
If I understand your question correctly, you want to create a SINGLE key that should take on a specific value, and from that value you should be able to deduce which files have been processed already? I don't know if you are going to be able to do that, simply from the point that your space is quite big and generating unique key presentations in such a huge space requires a lot of memory.
As mentioned, what you can do is simply to store each path URL in a HashSet. Putting a hundred thousand entries into the Set is not that bad, and lookup time is amortized constant time O(1), so it will be quite fast.
Bloom filter can solve your problem.
Idea of bloom filter is simple. It begins with having an empty array of some length, with all its members having zero value. We shall have K number of hash functions.
When ever we need to insert an item to the bloom filter, we has the item with all K hash functions. These hash functions would get K indexes on the bloom filter. For these indexes, we need to change the member value as 1.
To check if an item exists in the bloom filter, simply hash it with all of the K hashes and check the corresponding array indexes. If all of them are 1's , the item is present in the bloom filter.
Kindly note that bloom filter can provide false positive results. But this would never give false negative results. You need to tweak the bloom filter algorithm to address these false positive case.
What you need, IMHO, is a some sort of tree or hash based set implementation. It is basically a data structure that supports very fast add, remove and query operations and keeps only one instance of each elements (i.e. no duplicates). A few hundred thousand strings (assuming they are themselves not hundreds of thousands characters long) should not be problem for such a data structure.
You programming language of choice probably already has one, so you don't need to write one yourself. C++ has std::set. Java has the Set implementations TreeSet and HashSet. Python has a Set. They all allow you to add elements and check for the presence of an element very fast (O(1) for hashtable based sets, O(log(n)) for tree based sets). Other than those, there are lots of free implementations of sets as well as general purpose binary search trees and hashtables that you can use.

Choosing a Data structure for very large data

I have x (millions) positive integers, where their values can be as big as allowed (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup intensive program.
So far i thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However am not to sure whether i can implement such large keys and in such large quantity with a hash table (wouldn't that create a >0.8 load factor in addition to be prone for collisions?)
Could i get some advise on which data structure might be suitable for my situation
The choice of structure depends heavily on how much memory you have available. I'm assuming based on the description that you need lookup but not to loop over them, find nearest, or other similar operations.
Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
AVL trees are nice for data sets that are read-intensive but not read-only AND require ordered enumeration, find nearest and similar operations, but they're an annoyingly amount of work to implement correctly. You may get better performance with a B-tree because of CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.
Have you looked into B-trees? The efficiency runs between log_m(n) and log_(m/2)(n) so if you choose m to be around 8-10 or so you should be able to keep your search depth to below 10.
Bit Vector , with the index set if the number is present. You can tweak it to have the number of occurrences of each number. There is a nice column about bit vectors in Bentley's Programming Pearls.
If memory isn't an issue a map is probably your best bet. Maps are O(1) meaning that as you scale up the number of items to be looked up the time is takes to find a value is the same.
A map where the key is the int, and the value is the name.
Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).
If you only need to store the 32-bit integers and not any associated record, use a set and not a map, like hash_set in most C++ libraries. It would use only 4-bytes records plus some constant overhead and a little slack to avoid being 100%. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.
If you need it to be much tighter, just store them sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records it's still just twentysomething steps to get any one of them. In C you have bsearch(), which is as fast as it can get.
edit: just saw in your question you talk about some 'mapped data (a name)'. are those names unique? do they also have to be in memory? if yes, they would definitely dominate the memory requirements. Even so, if the names are the typical english words, most would be 10 bytes or less, keeping the total size in the 'tens of megabytes'; maybe up to a hundred megs, still very manageable.

I was asked this in a recent interview

I was asked to stay away from HashMap or any sort of Hashing.
The question went something like this -
Lets say you have PRODUCT IDs of up to 20 decimals, along with Product Descriptions. Without using Maps or any sort of hashing function, what's the best/most efficient way to store/retrieve these product IDs along with their descriptions?
Why is using Maps a bad idea for such a scenario?
What changes would you make to sell your solution to Amazon?
A map is good to use when insert/remove/lookup operations are interleaved. Every operations are amortized in O(log n).
In your exemple you are only doing search operation. You may consider that any database update (inserting/removing a product) won't happen so much time. Therefore probably the interviewer want you to get the best data structure for lookup operations.
In this case I can see only some as already proposed in other answers:
Sorted array (doing a binary search)
Hasmap
trie
With a trie , if product ids do not share a common prefix, there is good chance to find the product description only looking at the first character of the prefix (or only the very first characters). For instance, let's take that product id list , with 125 products:
"1"
"2"
"3"
...
"123"
"124"
"1234567"
Let's assume you are looking for the product id titled "1234567" in your trie, only looking to the first letters: "1" then "2" then "3" then "4" will lead to the good product description. No need to read the remaining of the product id as there is no other possibilities.
Considering the product id length as n , your lookup will be in O(n). But as in the exemple explained it above it could be even faster to retreive the product description. As the procduct ID is limited in size (20 characters) the trie height will be limited to 20 levels. That actually means you can consider the look up operations will never goes beyond a constant time, as your search will never goes beyong the trie height => O(1). While any BST lookups are at best amortized O(log N), N being the number of items in your tree .
While an hashmap could lead you to slower lookup as you'll need to compute an index with an hash function that is probably implemented reading the whole product id length. Plus browsing a list in case of collision with other product ids.
Doing a binary search on a sorted array, and performance in lookup operations will depends on the number of items in your database.
A B-Tree in my opinion. Does that still count as a Map?
Mostly because you can have many items loaded at once in memory. Searching these items in memory is very fast.
Consecutive integer numbers give perfect choice for the hash map but it only has one problem, as it does not have multithreaded access by default. Also since Amazon was mentioned in your question I may think that you need to take into account concurency and RAM limitation issues.
What you might do in the response to such question is to explain that since
you are dissallowed to use any built-in data storage schemes, all you can do is to "emulate" one.
So, let's say you have M = 10^20 products with their numbers and descriptions.
You can partition this set to the groups of N subsets.
Then you can organize M/N containers which have sugnificantly reduced number of elements. Using this idea recursively will give you a way to store the whole set in containers with such property that access to them would have accepted performance rate.
To illustrate this idea, consider a smaller example of only 20 elements.
I would like you to imagive the file system with directories "1", "2", "3", "4".
In each directory you store the product descriptions as files in the following way:
folder 1: files 1 to 5
folder 2: files 6 to 10
...
folder 4: files 16 to 20
Then your search would only need two steps to find the file.
First, you search for a correct folder by dividing 20 / 5 (your M/N).
Then, you use the given ID to read the product description stored in a file.
This is just a very rough description, however, the idea is very intuitive.
So, perhaps this is what your interviewer wanted to hear.
As for myself, when I face such questions on interview, even if I fail to get the question correctly (which is the worst case :)) I always try to get the correct answer from the interviewer.
Best/efficient for what? Would have been my answer.
E.g. for storing them, probably the fast thing to do are two arrays with 20 elements each. One for the ids, on for the description. Iterating over those is pretty fast to. And it is efficient memory wise.
Of course the solution is pretty useless for any real application, but so is the question.
There is an interesting alternative to B-Tree: Radix Tree
I think what he wanted you to do, and I'm not saying it's a good idea, is to use the computer memory space.
If you use a 64-bit (virtual) memory address, and assuming you have all the address space for your data (which is never the case) you can store a one-byte value.
You could use the ProductID as an address, casting it to a pointer, and then get that byte, which might be an offset in another memory for actual data.
I wouldn't do it this way, but perhaps that is the answer they were looking for.
Asaf
I wonder if they wanted you to note that in an ecommerce application (such as Amazon's), a common use case is "reverse lookup": retrieve the product ID using the description. For this, an inverted index is used, where each keyword in a description is an index key, which is associated with a list of relevant product identifiers. Binary trees or skip lists are good ways to index these key words.
Regarding the product identifier index: In practice, B-Trees (which are not binary search trees) would be used for a large, disk-based index of 20-digit identifiers. However, they may have been looking for a toy solution that could be implemented in RAM. Since the "alphabet" of decimal numbers is so small, it lends itself very nicely to a trie.
The hashmaps work really well if the hashing function gives you a very uniform distribution of the hashvalues of the existing keys. With really bad hash function it can happen so that hash values of your 20 values will be the same, which will push the retrieval time to O(n). The binary search on the other hand guaranties you O(log n), but inserting data is more expensive.
All of this is very incremental, the bigger your dataset is the less are the chances of a bad key distribution (if you are using a good, proven hash algorithm), and on smaller data sets the difference between O(n) and O(log n) is not much to worry about.
If the size is limited sometimes it's faster to use a sorted list.
When you use Hash-anything, you first have to calculate a hash, then locate the hash bucket, then use equals on all elements in the bucket. So it all adds up.
On the other hand you could use just a simple ArrayList ( or any other List flavor that is suitable for the application), sort it with java.util.Collections.sort and use java.util.Collections.binarySearch to find an element.
But as Artyom has pointed out maybe a simple linear search would be much faster in this case.
On the other hand, from maintainability point of view, I would normally use HashMap ( or LinkedHashMap ) here, and would only do something special here when profiler would tell me to do it. Also collections of 20 have a tendency to become collections of 20000 over time and all this optimization would be wasted.
There's nothing wrong with hashing or B-trees for this kind of situation - your interviewer probably just wanted you to think a little, instead of coming out with the expected answer. It's a good sign, when interviewers want candidates to think. It shows that the organization values thought, as opposed to merely parroting out something from the lecture notes from CS0210.
Incidentally, I'm assuming that "20 decimal product ids" means "a large collection of product ids, whose format is 20 decimal characters".... because if there's only 20 of them, there's no value in considering the algorithm. If you can't use hashing or Btrees code a linear search and move on. If you like, sort your array, and use a binary search.
But if my assumption is right, then what the interviewer is asking seems to revolve around the time/space tradeoff of hashmaps. It's possible to improve on the time/space curve of hashmaps - hashmaps do have collisions. So you might be able to get some improvement by converting the 20 decimal digits to a number, and using that as an index to a sparsely populated array... a really big array. :)
Selling it to Amazon? Good luck with that. Whatever you come up with would have to be patentable, and nothing in this discussion seems to rise to that level.
20 decimal PRODUCT IDs, along with Product Description
Simple linear search would be very good...
I would create one simple array with ids. And other array with data.
Linear search for small amount of keys (20!) is much more efficient then any binary-tree or hash.
I have a feeling based on their answer about product ids and two digits the answer they were looking for is to convert the numeric product ids into a different base system or packed form.
They made a point to indicate the product description was with the product ids to tell you that a higher base system could be used within the current fields datatype.
Your interviewer might be looking for a trie. If you have a [small] constant upper bound on your key, then you have O(1) insert and lookup.
I think what he wanted you to do, and
I'm not saying it's a good idea, is to
use the computer memory space.
If you use a 64-bit (virtual) memory
address, and assuming you have all the
address space for your data (which is
never the case) you can store a
one-byte value.
Unfortunately 2^64 =approx= 1.8 * 10^19. Just slightly below 10^20. Coincidence?
log2(10^20) = 66.43.
Here's a slightly evil proposal.
OK, 2^64 bits can fit inside a memory space.
Assume a bound of N bytes for the description, say N=200. (who wants to download Anna Karenina when they're looking for toasters?)
Commandeer 8*N 64-bit machines with heavy RAM. Amazon can swing this.
Every machine loads in their (very sparse) bitmap one bit of the description text for all descriptions. Let the MMU/virtual memory handle the sparsity.
Broadcast the product tag as a 59-bit number and the bit mask for one byte. (59 = ceil(log2(10^20)) - 8)
Every machine returns one bit from the product description. Lookups are a virtual memory dereference. You can even insert and delete.
Of course paging will start to be a bitch at some point!
Oddly enough, it will work the best if product-id's are as clumpy and ungood a hash as possible.

Resources