I have read about "probabilistic" data structures like bloom filters and skip lists.
What are the common characteristics of probabilistic data structures and what are they used for?
There are probably a lot of different (and good) answers, but in my humble opinion, the common characteristics of probabilistic data structures is that they provide you with approximate, not precise answer.
How many items are here?
About 1523425 with probability of 99%
Update:
Quick search produced link to decent article on the issue:
https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
If you are interested in probabilistic data structures, you might want to read my recently published book "Probabilistic Data Structures and Algorithms for Big Data Applications" (ISBN: 9783748190486, available at Amazon) where I have explained many of such space-efficient data structures and fast algorithms that are extremely useful in modern Big Data applications.
In this book, you can find the state of the art algorithms and data structures that help to handle such common problems in Big Data processing as
Membership querying (Bloom filter, Counting Bloom filter, Quotient filter, Cuckoo filter).
Cardinality (Linear counting, probabilistic counting, LogLog, HyperLogLog, HyperLogLog++).
Frequency (Majority algorithm, Frequent, Count Sketch, Count-Min Sketch).
Rank (Random sampling, q-digest, t-digest).
Similarity (LSH, MinHash, SimHash).
You can get a free preview and all related information about the book at https://pdsa.gakhov.com
Probabilistic data structures can't give you a definite answer, instead they provide you with a reasonable approximation of the answer and a way to approximate this estimation. They are extremely useful for big data and streaming application because they allow to dramatically decrease the amount of memory needed (in comparison to data structures that give you exact answers).
In majority of the cases these data structures use hash functions to randomize the items. Because they ignore collisions they keep the size constant, but this is also a reason why they can't give you exact values. The advantages they bring:
they use small amount of memory (you can control how much)
they can be easily parallelizable (hashes are independent)
they have constant query time (not even amortized constant like in dictionary)
Frequently used probabilistic data structures are:
bloom filters
count-min sketch
hyperLogLog
There is a list of probabilistic data structures in wikipedia for your reference:
https://en.wikipedia.org/wiki/Category:Probabilistic_data_structures
There are different definitions about what "probabilistic data structure" is. IMHO, probabilistic data structure means that the data structure uses some randomized algorithm or takes advantage of some probabilistic characteristics internally, but they don't have to behave probabilistically or un-deterministically from the data structure user's perspective.
There are many "probabilistic data structures" with probabilistically
behavior such as the bloom filter and HyperLogLog mentioned
by the other answers.
At the same time, there are other "probabilistic data structures"
with determined behavior (from a user's perspective) such as skip
list. For skip list, users can use it similarly as a balanced binary search tree but is implemented with some probability related idea internally. And according to skip list's author William Pugh:
Skip lists are a probabilistic data structure that seem likely to
supplant balanced trees as the implementation method of choice for
many applications. Skip list algorithms have the same asymptotic
expected time bounds as balanced trees and are simpler, faster and use
less space.
Probabilistic data structures allow for constant memory space and extremely fast processing while still maintaining a low error rate with a specified degree on uncertainity.
Some use-cases are
Checking presence of value in a data set
Frequency of events
Estimate approximate size of a data set
Ranking and grouping
Related
According to this StackOverflow question, probabilistic data structures are data structures that give approximate, as opposed to precise, answers. In particular, they have very low time and space complexities and are easily parallelizable, making them very efficient structures to use. Examples provided include Bloom Filters, Count-Min Sketch, and HyperLogLog.
However, all of these data structures are also known as "sketch" data structures - structures that approximate a large set via a compact representation for more efficient (but less precise) operation.
I don't see the difference between a "sketch" and a "probabilistic" data structure.
There are probabilistic data structures that are not approximations, for example the Skip list.
I am trying to find a fast algorithm for finding the (approximate, if need be) nearest neighbours of a given point in a two-dimensional space where points are frequently removed from the dataset and new points are added.
(Relatedly, there are two variants of this problem that interest me: one in which points can be thought of as being added and removed randomly and another in which all the points are in constant motion.)
Some thoughts:
kd-trees offer good performance, but are only suitable for static point sets
R*-trees seem to offer good performance for a variety of dimensions, but the generality of their design (arbitrary dimensions, general content geometries) suggests the possibility that a more specific algorithm might offer performance advantages
Algorithms with existing implementations are preferable (though this is not necessary)
What's a good choice here?
I agree with (almost) everything that #gsamaras said, just to add a few things:
In my experience (using large dataset with >= 500,000 points), kNN-performance of KD-Trees is worse than pretty much any other spatial index by a factor of 10 to 100. I tested them (2 KD-trees and various other indexes) on a large OpenStreetMap dataset. In the following diagram, the KD-Trees are called KDL and KDS, the 2D dataset is called OSM-P (left diagram): The diagram is taken from this document, see bullet points below for more information.
This research describes an indexing method for moving objects, in case you keep (re-)inserting the same points in slightly different positions.
Quadtrees are not too bad either, they can be very fast in 2D, with excellent kNN performance for datasets < 1,000,000 entries.
If you are looking for Java implementations, have a look at my index library. In has implementations of quadtrees, R-star-tree, ph-tree, and others, all with a common API that also supports kNN. The library was written for the TinSpin, which is a framework for testing multidimensional indexes. Some results can be found enter link description here (it doesn't really describe the test data, but 'OSM-P' results are based on OpenStreetMap data with up to 50,000,000 2D points.
Depending on your scenario, you may also want to consider PH-Trees. They appear to be slower for kNN-queries than R-Trees in low dimensionality (though still faster than KD-Trees), but they are faster for removal and updates than RTrees. If you have a lot of removal/insertion, this may be a better choice (see the TinSpin results, Figures 2 and 46). C++ versions are available here and here.
Check the Bkd-Tree, which is:
an I/O-efficient dynamic data structure based on the kd-tree. [..] the Bkd-tree maintains its high space utilization and excellent
query and update performance regardless of the number of updates performed on it.
However this data structure is multi dimensional, and not specialized to lower dimensions (like the kd-tree).
Play with it in bkdtree.
Dynamic Quadtrees can also be a candidate, with O(logn) query time and O(Q(n)) insertion/deletion time, where Q(n) is the time
to perform a query in the data structure used. Note that this data structure is specialized for 2D. For 3D however, we have octrees, and in a similar way the structure can be generalized for higher dimensions.
An implentation is QuadTree.
R*-tree is another choice, but I agree with you on the generality. A r-star-tree implementations exists too.
A Cover tree could be considered as well, but I am not sure if it fits your description. Read more here,and check the implementation on CoverTree.
Kd-tree should still be considered, since it's performance is remarkable on 2 dimensions, and its insertion complexity is logarithic in size.
nanoflann and CGAL are jsut two implementations of it, where the first requires no install and the second does, but may be more performant.
In any case, I would try more than one approach and benchmark (since all of them have implementations and these data structures are usually affected by the nature of your data).
Given a data structure specification such as a purely functional map with known complexity bounds, one has to pick between several implementations. There is some folklore on how to pick the right one, for example Red-Black trees are considered to be generally faster, but AVL trees have better performance on work loads with many lookups.
Is there a systematic presentation (published paper) of this knowledge (as relates to sets/maps)? Ideally I would like to see statistical analysis performed on actual software. It might conclude, for example, that there are N typical kinds of map usage, and list the input probability distribution for each.
Are there systematic benchmarks that test map and set performance on different distributions of inputs?
Are there implementations that use adaptive algorithms to change representation depending on actual usage?
These are basically research topics, and the results are generally given in the form of conclusions, while the statistical data is hidden. One can have statistical analysis on their own data though.
For the benchmarks, better go through the implementation details.
The 3rd part of the question is a very subjective matter, and the actual intentions may never be known at the time of implementation. However, languages like perl do their best to implement highly optimized solutions to every operation.
Following might be of help:
Purely Functional Data Structures by Chris Okasaki
http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf
I keep in mind that hash would be first thing I should resort to if I want to write an application which requests high lookup speed, and any other data structure wouldn't guarantee that.
But I got confused when saw some many post saying different, such as suffix tree, trie, to name a few.
So I wonder is hash always the best thing for high speed lookup? What if I want both high lookup speed and less space cost?
Is there any material (books or papers) lecturing about the data structures or algorithms **on high speed lookup and space efficiency? Any of this kind is highly appreciated.
So I wonder is hash always the best thing for high speed lookup?
No. As stated in comments:
There is never such a thing Best data structure for [some generic issue]. Everything is case dependent. Tries and radix trees might be great for strings, since you need to read the string anyway. arrays allows simplicity and great cache efficiency - and are usually the best for small scale static information
I once answered a related question of cases where a tree might be better then a hash table: Hash Table v/s Trees
What if I want both high lookup speed and less space cost?
The two might be self-contradicting. Even for the simple example of a hash table of size X vs a hash table of size 2*X. The bigger hash table is less likely to encounter collisions, and thus is expected to be faster then the smaller one.
Is there any material (books or papers) lecturing about the data
structures or algorithms on high speed lookup and space efficiency?
Introduction to Algorithms provide a good walk through on the main data structure used. Any algorithm developed is trying to provide a good space and time efficiency, but like said, there is a trade off, and some algorithms might be better for specific cases then others.
Choosing the right algorithm/data structure/design for the specific problem is what engineering is about, isn't it?
I assume you are talking about strings here, and the answer is "no", hashes are not the fastest or most space efficient way to look up strings, tries are. Of course, writing a hashing algorithm is much, much easier than writing a trie.
One thing you won't find in wikipedia or books about tries is that if you naively implement them with one node per letter, you end up with large numbers of inefficient, one-child nodes. To make a trie that really burns up the CPU you have to implement nodes so that they can have a variable number of characters. This, of course, is even harder than writing a plain trie.
I have written trie implementations that handle over a billion entries and I can tell you that if done properly it is insanely fast, nothing else compares.
One other issue with tries is that you have to write a custom heap, because if you just use some kind of generic memory management it will be slow. So in addition to implementing the trie, you have to implement the heap that the trie runs on. Pretty freakin complicated, but if you do it, you get batshit crazy speed.
Only a good implementation of hash will give you good performance. And you cannot compare hash with Trie for all situations. Situations where Trie is applicable, is fast, but it can be costly in terms of memory, (again dependent on implementation).
But have you measured performance? Or it is unnecessary optimization you are looking for. Did the map fail you?
That might also depend on the actual number of elements.
In complexity theory a hash is not bad, but complexity theory is only good if the actual number of elements is bigger than some threshold.
I.e. if you have only 2 elements, there is a faster method than a hash ;-)
Hash tables are a good general purpose structure but they can fail spectacularly if the hash function doesn't suit the input data. Worst case lookup is O(n). They also waste some space as you mentioned. Other general-purpose structures like balanced binary search trees have worse average case but better worst case performance than a hash table. This is important for real-time applications. A trie is a more special-purpose structure tailored to string lookup.
Is there someplace where I can get a Big-O style analysis / comparison of traditional data structures such as linked lists, various trees, hashes, etc vs. cache aware data structures such as Judy trees and others?
Actually,
I would look here for analysis of Judy Trees.
As illustrated in this data, Judy's
smaller size does not give it an
enormous speed advantage over a
traditional "trade size for speed"
data structure. Judy has received
countless man-hours developing and
debugging 20,000 lines of code; I
spent an hour or three writing a
fairly standard 200-line hash table.
If your data is strictly sequential;
you should use a regular array. If
your data is often sequential, or
approximately sequential (e.g. an
arithmetic sequence stepping by 64),
Judy might be the best data structure
to use. If you need to keep space to a
minimum--you have a huge number of
associative arrays, or you're only
storing very small values, Judy is
probably a good idea. If you need an
sorted iterator, go with Judy.
Otherwise, a hash table may be just as
effective, possibly faster, and much
simpler.
BigO is about algorhitms comlexity doing certain task.
There are different tasks avaliable on each data structure. Most important one are:
Sort, Find(in sorted structure) and add element.
So what are you looking for is complexity of certain task for certain data structure.
For most data types optimal sorting algorhitm is O(nlog(n)) but keep in mind that some structures are still slower, for instance sorting linked list is slower than arrays athough both have nlog(n) complexity
Read The Art of Computer Programming books by Don Knuth. These are considered by many to be the best source of algorithm information around.
Did you look in: "Introduction to Algorithms"
(http://en.wikipedia.org/wiki/Introduction_to_Algorithms)