Data structure to store 2D points clustered near the origin? - data-structures

I need to use a spatial 2D map in my application. The map usually contains a small number of values in the (-200, -200) to (200, 200) rectangle, most of them around (0, 0).
I thought of using a hash map, but then I need a hash function. I considered x * 200 + y, but then adding just (0, 0) and (1, 0) would already require 800 bytes for the hash table alone, and memory is a concern in my application.
The map is immutable after initial setup, so insertion time doesn't matter, but there are a lot of accesses (about 600 per second) and the target CPU isn't very fast.
What are the general memory/access-time trade-offs between a hash map and an ordinary map (I believe an RB-tree in the STL) for small areas? What is a good hash function for small areas?

I think that there are a few things that I need to explain in a bit more detail to answer your question.
For starters, there is a strong distinction between a hash function as it's typically used in a program and the number of buckets used in a hash table. In most implementations, the hash function is some mapping from objects to integers. The hash table is then free to pick any number of buckets it wants and maps those integers back onto the buckets. Commonly, this is done by taking the hash code and modding it by the number of buckets. This means that if you want to store points in a hash table, you don't need to worry about how large the values produced by your hash function are. For example, if the hash table has only three buckets and you produce objects with hash codes 0 and 1,000,000,000, then the first object hashes to the zeroth bucket and the second hashes to bucket 1,000,000,000 % 3 = 1. You wouldn't need 1,000,000,000 buckets. Consequently, you shouldn't worry about picking a hash function like x * 200 + y; unless your hash table is implemented very oddly, you don't need to worry about space usage.
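To make this concrete, here is a minimal Python sketch (not any particular library's implementation; point_hash and the bucket count are made up for illustration) showing that a large hash code is simply reduced modulo the bucket count:
# A minimal sketch showing that the size of a hash code is independent of the
# number of buckets: the table just reduces the code modulo its bucket count.

def point_hash(x, y):
    # Any integer-valued function works as a hash code; the magnitude of the
    # result does not dictate the table size.
    return x * 200 + y

num_buckets = 3  # deliberately tiny

for point in [(0, 0), (1, 0), (5_000_000, 17)]:
    code = point_hash(*point)
    bucket = code % num_buckets
    print(f"point {point}: hash code {code} -> bucket {bucket}")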
If you are creating a hash table in a way where you will be inserting only once and then spending a lot of time doing accesses, you may want to look into perfect hash functions and perfect hash tables. These are data structures that work by trying to find a hash function for the set of points that you're storing such that no collisions ever occur. They take (expected) O(n) time to create, and can do lookups in worst-case O(1) time. Barring the overhead from computing the hash function, this is the fastest way to look up points in space.
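As a rough illustration of the idea (not a production perfect-hash construction), the sketch below keeps drawing random salts until the salted hash is collision-free over the fixed key set; to keep it short the table is quadratically oversized so trials succeed quickly, whereas real schemes such as FKS or CHD (or gperf for strings) get the space down to O(n):
# Toy sketch of perfect hashing for a fixed key set: keep trying random salts
# until the salted hash is collision-free over the keys.
import random

def build_perfect_table(points):
    m = max(1, len(points)) ** 2              # oversized table => few retries
    while True:
        salt = random.getrandbits(64)
        slots = [None] * m
        ok = True
        for p in points:
            i = hash((salt, p)) % m
            if slots[i] is not None:          # collision: try another salt
                ok = False
                break
            slots[i] = p
        if ok:
            return salt, slots                # collision-free for this key set

def contains(table, point):
    salt, slots = table
    return slots[hash((salt, point)) % len(slots)] == point

table = build_perfect_table([(0, 0), (1, 0), (-3, 7)])
print(contains(table, (1, 0)), contains(table, (2, 2)))   # True False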
If you were just to dump everything in a tree-based map like most implementations of std::map, though, you should be perfectly fine. With at most 400 × 400 = 160,000 points, looking up a point takes about lg 160,000 ≈ 18 comparisons. This is unlikely to be a bottleneck in any application, though if you really need all the performance you can get, the aforementioned perfect hash table is likely to be the best option.
However, both of these solutions only work if the queries you are interested in are of the form "does point p exist in the set or not?" If you want to do more complex geometric queries like nearest-neighbor lookups or finding all the points in a bounding box, you may want to look into more complex data structures like the k-d tree, which supports extremely fast (O(log n)) lookups and fast nearest-neighbor and range searches.
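For example, here is a compact sketch of a 2-d k-d tree with a nearest-neighbour query; it is a toy (dictionary nodes, recursive build, no balancing heuristics), not a drop-in library:
# Minimal 2-d k-d tree sketch: build once, then answer nearest-neighbour
# queries by descending the tree and only exploring the far side of a split
# when it could still contain a closer point.

def build(points, depth=0):
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    if node is None:
        return best
    def dist2(p):
        return (p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2
    if best is None or dist2(node["point"]) < dist2(best):
        best = node["point"]
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[near], target, best)
    if diff ** 2 < dist2(best):              # far side could still be closer
        best = nearest(node[far], target, best)
    return best

tree = build([(0, 0), (1, 2), (-3, 4), (5, -1)])
print(nearest(tree, (0.9, 1.8)))             # (1, 2)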
Hope this helps!

Slightly confused by your terminology.
The "map" objects in the standard library are implementations of associative arrays (either via hash tables or binary search trees).
If you're doing 2D spatial processing and are looking to implement a search structure, there are many dedicated data structures - e.g. quadtrees and k-d trees.
Edit: For a few ideas on implementations, perhaps check: https://stackoverflow.com/questions/1402014/kdtree-implementation-c.
Honestly - the data structures aren't that complex - I've always rolled my own.

Related

How does a HashMap retrieve a value with a hashed key?

I'm confused about the HashMap/Hashtable concept, specifically when people say a HashMap is faster than a List. I'm clear on the hashing concept, where the value is stored under a hash code computed from the given key.
But how does retrieval work?
For example, say I'm storing n strings under n different keys in a HashMap.
If I want to retrieve the specific value associated with a specific key, how can it be returned in O(1) time? Won't the hashed key be compared with all the other keys?
Let's go on a word journey: say you have a bunch of weird M&M's printed with all the letters of the alphabet.
Now your job is to vend people M&M's in the letter/color combination of their choosing.
You have some choices about how to organize your shop. (This act of organization will, metaphorically, be our hash function.)
You can sort your M&M's into buckets by color, by letter, or by both. The question follows: what organization gives you the fastest retrieval time for a specific request?
The answer is rather intuitive: the sorting that leaves the fewest different M&M's in each bucket makes for the most efficient querying.
Let's say someone asks whether you have any green Q's; if all your M&M's are in a single bin, list, or otherwise unstructured container, the answer will be far from accessible in O(1) compared to keeping an organized shop.
This analogy relies on the concept of separate chaining, where each hash key corresponds to a container holding multiple elements.
More generally, the idea of hashing is to spread keys uniformly throughout an array so that the amortized performance is constant. Collisions can be resolved through a variety of methods, and the Wikipedia article will tell you all about them; see the sketch after the quote below for a toy separate-chaining table.
http://en.wikipedia.org/wiki/Hash_table
"If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect "

Hash table - why is it faster than arrays?

In cases where I have a key for each element and I don't know the index of the element into an array, hashtables perform better than arrays (O(1) vs O(n)).
Why is that? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
"In cases where I have a key for each element and I don't know the index of the element into an array, hashtables perform better than arrays (O(1) vs O(n))."
The hash table search performs O(1) in the average case. In the worst case, the hash table search performs O(n): when you have collisions and the hash function always returns the same slot. One may think "this is a remote situation," but a good analysis should consider it. In this case you have to iterate through all the elements, just as in an array or a linked list (O(n)).
"Why is that? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?"
You have a key, you hash it, you have the hash: the index into the hash table where the element sits (if it was inserted earlier). At this point you can access the hash table record in O(1). If the load factor is small, it's unlikely you'll see more than one element there, so the first element you see should be the one you're looking for. Otherwise, if there is more than one element, you must compare the elements you find in that position with the element you are looking for. In this case you have O(1) + O(number_of_elements).
In the average case, the hash table search complexity is O(1) + O(load_factor) = O(1 + load_factor).
Remember, load_factor = n in the worst case. So, the search complexity is O(n) in the worst case.
I don't know what you mean with "trick behind the memory disposition". Under some points of view, the hash table (with its structure and collisions resolution by chaining) can be considered a "smart trick".
Of course, the hash table analysis results can be proven by math.
With arrays: if you know the value, you have to search on average half the values (unless sorted) to find its location.
With hashes: the location is generated based on the value. So, given that value again, you can calculate the same hash you calculated when inserting. Sometimes, more than 1 value results in the same hash, so in practice each "location" is itself an array (or linked list) of all the values that hash to that location. In this case, only this much smaller (unless it's a bad hash) array needs to be searched.
Hash tables are a bit more complex. They put elements in different buckets based on their hash % some value. In an ideal situation, each bucket holds very few items and there aren't many empty buckets.
Once you know the key, you compute the hash. Based on the hash, you know which bucket to look for. And as stated above, the number of items in each bucket should be relatively small.
Hash tables are doing a lot of magic internally to make sure buckets are as small as possible while not consuming too much memory for empty buckets. Also, much depends on the quality of the key -> hash function.
Wikipedia provides a very comprehensive description of hash tables.
A hash table does not have to compare every element. It calculates a hash code from the key; for example, if the key is 4, the hash code might be something like 4*x*y. That code tells it exactly which bucket to look in.
Whereas with a plain array, it would have to traverse the whole array to find the element.
Why is [it] that [hashtables perform lookups by key better than arrays (O(1) vs O(n))]? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
Once you have the hash, it lets you calculate an "ideal" or expected location in the array of buckets: commonly:
ideal bucket = hash % num_buckets
The problem is then that another value may have already hashed to that bucket, in which case the hash table implementation has two main choices:
1) try another bucket
2) let several distinct values "belong" to one bucket, perhaps by making the bucket hold a pointer into a linked list of values
For implementation 1, known as open addressing or closed hashing, you jump around other buckets: if you find your value, great; if you find a never-used bucket, then you can store your value there if inserting, or you know you'll never find your value when searching. There's a potential for the searching to be even worse than O(n) if the way you traverse alternative buckets ends up searching the same bucket multiple times; for example, with quadratic probing you try the ideal bucket index +1, then +4, then +9, then +16 and so on - but you must avoid out-of-bounds bucket access using e.g. % num_buckets, so if there are say 12 buckets then ideal+4 and ideal+16 search the same bucket. It can be expensive to track which buckets have been searched, so it can be hard to know when to give up: the implementation can be optimistic and assume it will always find either the value or an unused bucket (risking spinning forever), or it can keep a counter and, after a threshold of tries, either give up or fall back to a linear bucket-by-bucket search.
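Here is a toy sketch of that open-addressing lookup with quadratic probing and a bounded number of tries (the names and the table size of 11 are arbitrary; real implementations also handle deletions and resize well before the table gets this full):
# Toy open-addressing lookup with quadratic probing, bounded by a try counter
# as discussed above. Note that the quadratic sequence does not visit every
# slot, which is exactly the give-up problem described in the text.

def probe_lookup(table, key, max_tries=None):
    n = len(table)                      # table is a list; None marks unused
    max_tries = max_tries or n
    start = hash(key) % n
    for i in range(max_tries):
        slot = (start + i * i) % n      # quadratic probe sequence, wrapped
        if table[slot] is None:
            return None                 # hit a never-used slot: key absent
        if table[slot] == key:
            return slot                 # found it
    return None                         # gave up after max_tries probes

table = [None] * 11
for key in ["red", "green", "blue"]:
    start = hash(key) % 11
    for i in range(11):                 # insert with the same probe sequence
        slot = (start + i * i) % 11
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            break

print(probe_lookup(table, "green") is not None)   # True
print(probe_lookup(table, "purple") is None)      # True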
For implementation 2, known as closed addressing or separate chaining, you have to search inside the container/data-structure of values that all hashed to the ideal bucket. How efficient this is depends on the type of container used. It's generally expected that the number of elements colliding at one bucket will be small, which is true of a good hash function with non-adversarial inputs, and typically true enough of even a mediocre hash function, especially with a prime number of buckets. So, a linked list or contiguous array is often used, despite the O(n) search properties: linked lists are simple to implement and operate on, and arrays pack the data together for better memory cache locality and access speed. The worst possible case, though, is that every value in your table hashed to the same bucket, and the container at that bucket now holds all the values: your entire hash table is then only as efficient as that bucket's container. Some Java hash table implementations have started using binary trees if the number of elements hashing to the same bucket passes a threshold, to make sure complexity is never worse than O(log n).
Python's dict and set are examples of 1, open addressing (closed hashing). C++'s std::unordered_set is an example of closed addressing (separate chaining).
The purpose of hashing is to produce an index into the underlying array, which enables you to jump straight to the element in question. This is usually accomplished by dividing the hash by the size of the array and taking the remainder: index = hash % capacity.
The type/size of the hash is typically that of the smallest integer large enough to index all of RAM. On a 32-bit system this is a 32-bit integer; on a 64-bit system it's a 64-bit integer. In C++ this corresponds to unsigned int and unsigned long long respectively. To be pedantic, C++ technically only specifies minimum sizes for its primitives (at least 16 bits for unsigned int and at least 64 bits for unsigned long long), but that's beside the point. For the sake of making code portable, C++ also provides a size_t primitive which corresponds to the appropriate unsigned integer; you'll see that type a lot in for loops that index into arrays, in well-written code. In the case of a language like Python, the integer primitive grows to whatever size it needs to be. This is typically implemented in the standard libraries of other languages under the name "Big Integer". To deal with this, the Python programming language simply truncates whatever value you return from the __hash__() method down to the appropriate size.
On this score I think it's worth giving a word to the wise. The result of the arithmetic is the same regardless of whether you compute the remainder at the end or at each step along the way. Truncation is equivalent to computing the remainder modulo 2^n, where n is the number of bits you leave intact. Now you might think that computing the remainder at each step would be foolish, since you're incurring an extra computation at every step along the way. However this is not the case, for two reasons. First, computationally speaking, truncation is extraordinarily cheap, far cheaper than generalized division. Second, and this is the real reason (the first alone would be insufficient), taking the remainder at each step keeps the numbers (relatively) small. So instead of something like product = 31*product + hash(array[index]), you'll want something like product = hash(31*product + hash(array[index])). The primary purpose of the inner hash() call is to take something which might not be a number and turn it into one, whereas the primary purpose of the outer hash() call is to take a potentially oversized number and truncate it. Lastly I'll note that in languages like C++, where integer primitives have a fixed size, this truncation step is automatically performed after every operation.
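A minimal sketch of that advice, assuming a 64-bit width (the 31 multiplier and the sequence_hash name are just illustrative):
# Combine element hashes and reduce modulo 2**64 at every step, the way
# fixed-width integers do automatically in C++ or Java.

MASK = (1 << 64) - 1                    # truncate to 64 bits, i.e. mod 2**64

def sequence_hash(items):
    h = 0
    for item in items:
        # hash(item) turns a possibly non-numeric item into a number;
        # the mask keeps the running value from growing without bound.
        h = (31 * h + hash(item)) & MASK
    return h

print(sequence_hash(["a", "b", "c"]))
print(sequence_hash(["a", "b", "c"]) == sequence_hash(("a", "b", "c")))  # True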
Now for the elephant in the room. You've probably realized that hash codes being generally speaking smaller than the objects they correspond to, not to mention that the indices derived from them are again generally speaking even smaller still, it's entirely possible for two objects to hash to the same index. This is called a hash collision. Data structures backed by a hash table like Python's set or dict or C++'s std::unordered_set or std::unordered_map primarily handle this in one of two ways. The first is called separate chaining, and the second is called open addressing. In separate chaining the array functioning as the hash table is itself an array of lists (or in some cases where the developer feels like getting fancy, some other data structure like a binary search tree), and every time an element hashes to a given index it gets added to the corresponding list. In open addressing if an element hashes to an index which is already occupied the data structure probes over to the next index (or in some cases where the developer feels like getting fancy, an index defined by some other function as is the case in quadratic probing) and so on until it finds an empty slot, of course wrapping around when it reaches the end of the array.
Next, a word about load factor. There is of course an inherent space/time trade-off when it comes to increasing or decreasing the load factor. The higher the load factor, the less wasted space the table consumes; however this comes at the expense of increasing the likelihood of performance-degrading collisions. Generally speaking, hash tables implemented with separate chaining are less sensitive to load factor than those implemented with open addressing. This is due to the phenomenon known as clustering, whereby clusters in an open-addressed hash table tend to become larger and larger in a positive feedback loop, as a result of the fact that the larger they become the more likely they are to contain the preferred index of a newly added element. This is actually the reason why the aforementioned quadratic probing scheme, which progressively increases the jump distance, is often preferred. In the extreme case of load factors greater than 1, open addressing can't work at all, as the number of elements exceeds the available space. That being said, load factors greater than 1 are exceedingly rare in general. At time of writing, Python's set and dict classes employ a max load factor of 2/3, Java's java.util.HashSet and java.util.HashMap use 3/4, and C++'s std::unordered_set and std::unordered_map take the cake with a max load factor of 1. Unsurprisingly, Python's hash-table-backed data structures handle collisions with open addressing, whereas their Java and C++ counterparts do it with separate chaining.
Last, a comment about table size. When the max load factor is exceeded, the size of the hash table must of course be grown. Because this requires that every element therein be reindexed, it's highly inefficient to grow the table by a fixed amount; doing so would incur on the order of `size` operations every time a new element is added. The standard fix for this problem is the same as that employed by most dynamic array implementations: at every point where we need to grow the table, we simply increase its size by its current size. This, unsurprisingly, is known as table doubling.
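A toy sketch of table doubling (the DoublingTable class and its 0.75 load-factor limit are invented for illustration):
# When the load factor limit is exceeded, build a bucket array twice as large
# and re-insert (rehash) every element once.

class DoublingTable:
    MAX_LOAD = 0.75

    def __init__(self):
        self.buckets = [[] for _ in range(8)]
        self.count = 0

    def add(self, key):
        if (self.count + 1) / len(self.buckets) > self.MAX_LOAD:
            self._grow()
        bucket = self.buckets[hash(key) % len(self.buckets)]
        if key not in bucket:
            bucket.append(key)
            self.count += 1

    def _grow(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]   # double, not +k
        for bucket in old:
            for key in bucket:                             # reindex everything
                self.buckets[hash(key) % len(self.buckets)].append(key)

t = DoublingTable()
for i in range(100):
    t.add(i)
print(len(t.buckets))        # grew by doubling: 8 -> 16 -> ... -> 256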
I think you answered your own question there. "shouldn't the algorithm compare this hash against every element's hash". That's kind of what it does when it doesn't know the index location of what you're searching for. It compares each element to find the one you're looking for:
E.g. let's say you're looking for an item called "Car" inside an array of strings. You need to go through every item and check item.Hash() == "Car".Hash() to find out that that is the item you're looking for (obviously a real search wouldn't always use the hash, but the example stands). Then you have a hash table. What a hash table does is create a sparse array, or sometimes an array of buckets as mentioned above. It then uses "Car".Hash() to deduce where in the sparse array your "Car" item actually is. This means it doesn't have to search through the entire array to find your item.

When performing a lookup in a hash data structure, why is it a fast operation?

When you perform a lookup in a Hashtable, the key is converted into a hash. Now using that hashed value, does it directly map to a memory location, or are there more steps?
Just trying to understand things a little more under the covers.
And what other key-based lookup data structures are there, and why are they slower than a hash?
Hash tables are not necessarily fast. People consider hash tables a "fast" data structure because the retrieval time does not depend on the number of entries in the table. That is, retrieval from a hash table is an "O(1)" (constant time) operation.
Retrieval time from other data structures can vary depending on the number of entries in the map. For example, for a balanced binary tree, the retrieval time scales with the base-2 logarithm of its size; it's "O(log n)".
However, actually computing a hash code for a single object, in practice, often takes many times longer than comparing that type of object to others. So you may find that for a small map, something like a red-black tree is faster than a hash table. As the maps grow, the hash table retrieval time stays constant, and the red-black tree time slowly grows until it is slower than the hash table.
A Hash (aka Hash Table) implies more than a Map (or Associative Array).
In particular, a Map (or Associative Array) is an Abstract Data Type:
...an associative array (also called a map or a dictionary) is an abstract data type composed of a collection of (key,value) pairs, such that each possible key appears at most once in the collection.
While a Hash table is an implementation of a Map (although it could also be considered an ADT that includes a "cost"):
...a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys [...], to their associated values [...]. Thus, a hash table implements an associative array [or, map].
Thus it is an implementation-detail leaking out: a HashMap is a Map that uses a Hash-table algorithm and thus provides the expected performance characteristics of such an algorithm. The "leaking" of the implementation detail is good in this case because it provides some basic [expected] bound guarantees, such as an [expected] O(1) -- or constant time -- get.
Hint: a hash function is important part of a hash-table algorithm and sets a HashMap apart from other Map implementations such as a TreeMap (that uses a red-black tree) or a ConcurrentSkipListMap (that uses a skip list).
Another form of a Map is an Association List (or "alist", which is common in LISP programming). While association lists are O(n) for get, they can have much less overhead for small n, which brings up another point: Big-Oh describes limiting behavior (as n -> infinity) and does not address the relative performance for a particular [smallish] n:
A description of a function in terms of big O notation usually only provides an upper bound on the growth rate of the function.
Please refer to the links above (including the javadoc) for the basic characteristics and different implementation strategies -- anything else I say here is already said there (or in other SO answers). If there are specific questions, open a new SO post if warranted :-)
Happy coding.
Here is the source for the HashMap implementation in OpenJDK 7. Looking at the put method shows that it uses simple chaining as its collision-resolution method and that the underlying "bucket array" grows by a factor of 2 on each resize (which is triggered when the load factor is reached). The load factor and amortized performance expectations -- including those of the hashing function used -- are covered in the class documentation.
"Key-based" implies a mapping of some sort. You can implement one in a linked list or array, and it would probably be pretty slow (O(n)) for lookups or deletes.
Hashing takes constant time. In the more sophisticated implementations it typically maps to a memory address which stores a list of pointers back to the key objects, in addition to the mapped objects or values, for collision detection and resolution.
The expensive operation is following the list of the "hashed to this location" objects to figure out which one you are really looking for. In theory, this could be O(n) for each lookup! However, if we use a larger space, the probability of this occurring is reduced drastically (although a few collisions are almost inevitable, per the birthday problem).
If you start getting over a certain threshold of collisions, most implementations will expand the size of the hash table, which also takes another O(n) of time. However, on average this happens no more often than once every O(n) inserts, so we have amortized constant time.

Hash table vs Balanced binary tree [closed]

What factors should I take into account when I need to choose between a hash table or a balanced binary tree in order to implement a set or an associative array?
This question cannot be answered, in general, I fear.
The issue is that there are many types of hash tables and balanced binary trees, and their performances vary widely.
So, the naive answer is: it depends on the functionality you need. Use a hash table if you do not need ordering and a balanced binary tree otherwise.
For a more elaborate answer, let's consider some alternatives.
Hash Table (see Wikipedia's entry for some basics)
Not all hash tables use a linked-list as a bucket. A popular alternative is to use a "better" bucket, for example a binary tree, or another hash table (with another hash function), ...
Some hash tables do not use buckets at all: see Open Addressing (they come with other issues, obviously)
There is something called linear re-hashing (it's a quality-of-implementation detail), which avoids the "stop-the-world-and-rehash" pitfall. Basically, during the migration phase you only insert into the "new" table, and on each operation you also move one "old" entry into the "new" table. Of course, the migration phase means double look-ups, etc.
Binary Tree
Re-balancing is costly; you may consider a Skip-List (also better for multi-threaded access) or a Splay Tree.
A good allocator can "pack" nodes together in memory (better caching behavior), even though this does not alleviate the pointer-look-up issue.
B-Tree and variants also offer "packing"
Let's not forget that O(1) is an asymptotic complexity. For few elements, the constant factor is usually more important (performance-wise). This is especially true if your hash function is slow...
Finally, for sets, you may also wish to consider probabilistic data structures, like Bloom Filters.
Hash tables are generally better if there isn't any need to keep the data in any sort of sequence. Binary trees are better if the data must be kept sorted.
A worthy point on a modern architecture: a hash table will usually, if its load factor is low, involve fewer memory reads than a binary tree. Since memory accesses tend to be rather costly compared to burning CPU cycles, the hash table is often faster.
In the following, the binary tree is assumed to be self-balancing, like a red-black tree, an AVL tree, or a treap.
On the other hand, when you decide to extend a hash table you may need to rehash everything, which can be a costly operation when it occurs (though it is amortized). Binary trees do not have this limitation.
Binary trees are easier to implement in purely functional languages.
Binary trees have a natural sort order and a natural way to walk the tree for all elements.
When the load factor of the hash table is low, you may be wasting a lot of memory space, but with two pointers per node, binary trees tend to take up more space.
Hash tables are nearly O(1) (depending on how you handle the load factor) vs. binary trees' O(lg n).
Trees tend to be the "average performer". There is nothing they do particularly well, but then nothing they do particularly badly.
Hash tables have faster lookups:
You need a key that hashes to an even distribution (otherwise you'll collide a lot and have to rely on something other than the hash, like a linear search).
Hashes can use a lot of empty space. You may reserve 256 entries but only need 8 (so far).
Binary trees:
Deterministic. O(log n) I think...
Don't need extra space like hash tables can
Must be kept ordered. Adding an element means finding where it belongs and possibly rebalancing, rather than just appending to a bucket.
A binary search tree requires a total order relationship among the keys. A hash table requires only an equivalence or identity relationship with a consistent hash function.
If a total order relationship is available, then a sorted array has lookup performance comparable to binary trees, worst-case insert performance in the order of hash tables, and less complexity and memory use than both.
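For instance, a sorted array plus binary search (Python's bisect here, purely as an illustration) gives O(log n) lookups with no per-node pointer overhead:
# Binary search over a sorted array: comparable lookup cost to a balanced
# tree, with the keys packed contiguously in memory.
import bisect

keys = sorted([17, 3, 42, 8, 23])

def contains(sorted_keys, key):
    i = bisect.bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

print(contains(keys, 23), contains(keys, 24))   # True False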
The worst-case insertion complexity for a hash table can be left at O(1)/O(log K) (with K the number of elements with the same hash) if it's acceptable to increase the worst-case lookup complexity to O(K) or O(log K) if the elements can be sorted.
Invariants for both trees and hash tables are expensive to restore if the keys change, but less than the O(n log n) needed to re-sort a sorted array.
These are factors to take into account in deciding which implementation to use:
Availability of a total order relationship.
Availability of a good hashing function for the equivalence relationship.
A-priori knowledge of the number of elements.
Knowledge about the rate of insertions, deletions, and lookups.
Relative complexity of the comparison and hashing functions.
If you only need to access single elements, hashtables are better. If you need a range of elements, you simply have no other option than binary trees.
To add to the other great answers above, I'd say:
Use a hash table if the amount of data will not change (e.g. storing constants); but, if the amount of data will change, use a tree. This is due to the fact that, in a hash table, once the load factor has been reached, the hash table must resize. The resize operation can be very slow.
One point that I don't think has been addressed is that trees are much better for persistent data structures, that is, immutable structures. A standard hash table (i.e. one that uses a single array of linked lists) cannot produce an updated version of itself without either mutating in place or copying the whole table. One situation in which this is relevant is if two concurrent functions both have a copy of a hash table, and one of them changes the table (if the table is mutable, that change will be visible to the other one as well). Another situation would be something like the following:
def bar(table):
    # some intern stuck this line of code in
    table["hello"] = "world"
    return table["the answer"]

def foo(x, y, table):
    z = bar(table)
    if "hello" in table:
        raise Exception("failed catastrophically!")
    return x + y + z

important_result = foo(1, 2, {
    "the answer": 5,
    "this table": "doesn't contain hello",
    "so it should": "be ok"
})
# catastrophic failure occurs
With a mutable table, we can't guarantee that the table a function call receives will remain that table throughout its execution, because other function calls might modify it.
So, mutability is sometimes not a pleasant thing. Now, a way around this would be to keep the table immutable, and have updates return a new table without modifying the old one. But with a hash table this would often be a costly O(n) operation, since the entire underlying array would need to be copied. On the other hand, with a balanced tree, a new tree can be generated with only O(log n) nodes needing to be created (the rest of the tree being identical).
This means that an efficient tree can be very convenient when immutable maps are desired.
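A minimal sketch of that path-copying idea, using an unbalanced binary search tree to keep it short (a real persistent map would also rebalance):
# Persistent (immutable) BST: inserting returns a new tree that shares all
# untouched subtrees with the old one, so an update copies only the nodes on
# the path to the insertion point.

class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(node, key, value):
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)   # replace, sharing children

def get(node, key, default=None):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return default

old = None
for k in ["the answer", "b", "c"]:
    old = insert(old, k, k.upper())
new = insert(old, "hello", "world")                  # old is untouched
print(get(old, "hello"), get(new, "hello"))          # None world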
If you'll have many slightly-different instances of sets, you'll probably want them to share structure. This is easy with trees (if they're immutable or copy-on-write). I'm not sure how well you can do it with hashtables; it's at least less obvious.
In my experience, hashtables are always faster because trees suffer too much from cache effects.
To see some real data, you can check the benchmark page of my TommyDS library http://tommyds.sourceforge.net/
Here you can see a comparison of the performance of the most common hashtable, tree, and trie libraries available.
One point to note concerns traversal and the minimum and maximum items. Hash tables don't support any kind of ordered traversal, or access to the minimum or maximum item. If these capabilities are important, the binary tree is a better choice.

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course the obvious solution is to maintain a running hashset of the elements you encounter and count them at the end. This would yield an exact result, but it would require me to carry a potentially large amount of state as I scan through the dataset (i.e. all unique elements encountered so far).
Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there is a statistical approach that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state as I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating-point number). It is assumed that these objects can be hashed (i.e. you can put them in a HashSet if you want to). Typically they will be strings or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
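A sketch of that idea, assuming a simple Bloom filter built from salted SHA-256 hashes (the BloomCounter name and parameters are invented for illustration); false positives can only suppress counts, so the result is a lower bound:
# Count an element only when the filter says it has definitely not been seen
# before; if all its bits were already set, it might be a repeat, so skip it.
import hashlib

class BloomCounter:
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.count = 0

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                new = True                # at least one unset bit: definitely new
                self.bits[byte] |= 1 << bit
        if new:
            self.count += 1

bc = BloomCounter()
for item in ["a", "b", "a", "c", "b"]:
    bc.add(item)
print(bc.count)                           # 3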
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
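A sketch of Linear Counting as described there, assuming the bit array is comfortably larger than the true distinct count (otherwise the log term blows up):
# Hash each element to one bit position, then estimate the distinct count as
# D = -m * ln(unset/m), where m is the total number of bits.
import math, random

def linear_count(stream, num_bits=1 << 16):
    bits = bytearray(num_bits // 8)
    for item in stream:
        pos = hash(item) % num_bits
        bits[pos // 8] |= 1 << (pos % 8)
    unset = sum(8 - bin(b).count("1") for b in bits)
    return -num_bits * math.log(unset / num_bits)

data = [f"user-{random.randrange(5000)}" for _ in range(200_000)]   # ~5000 distinct
print(round(linear_count(data)))                                    # close to 5000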
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
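A sketch of that approach; instead of the first two bits it keeps elements whose top `shift` hash bits are zero and scales by 2**shift (the function name and SHA-256-based hash are illustrative choices):
# Keep only elements whose hash falls in a known fraction of the hash space,
# count distinct survivors exactly, then scale back up.
import hashlib, random

def sampled_distinct_estimate(stream, shift=6):
    kept = set()
    for item in stream:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:4], "big")
        if h >> (32 - shift) == 0:       # top `shift` bits zero: keep 1/2**shift
            kept.add(h)                  # the 32-bit hash stands in for the item
        # everything else is discarded, so state stays ~1/2**shift of a full set
    return len(kept) * (2 ** shift)

data = [f"user-{random.randrange(50_000)}" for _ in range(200_000)]
print(sampled_distinct_estimate(data))   # roughly 50,000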
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.
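For intuition only, here is a toy of the idea HyperLogLog refines (Flajolet-Martin probabilistic counting), not HyperLogLog itself; the real algorithm keeps many registers and combines them with a bias-corrected harmonic mean to get a few-percent error in a few kilobytes:
# A hash value with k leading zero bits shows up roughly once per 2**k
# distinct elements, so the maximum k observed gives a crude estimate.
import hashlib, random

def leading_zeros_32(h):
    return 32 - h.bit_length() if h else 32

def crude_estimate(stream):
    max_zeros = 0
    for item in stream:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:4], "big")
        max_zeros = max(max_zeros, leading_zeros_32(h))
    return 2 ** max_zeros

data = [f"user-{random.randrange(50_000)}" for _ in range(200_000)]
print(crude_estimate(data))   # right order of magnitude, high variance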
