Best structures for random select with known weights?

Say I have a simulation that needs to pick between many discrete events with given weights, with the weights themselves following a known distribution.
I need some structure that supports updates, and efficiently picks random events.
It is trivial to get O(log n) insertion, deletion, and random-choice operations with a binary search tree. It should be possible to improve the random-choice operation, for example with interpolation search.
What are some theoretical results stronger than this, or known good implementations?
EDIT: I consider Niklas's reply to this comment about the O(log* n) algorithm (ftp.cs.brown.edu/pub/techreports/92/cs92-36.pdf) to be exactly what I was looking for.
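For reference, here is a minimal sketch of the O(log n) baseline the question describes, using a Fenwick (binary indexed) tree over the weights rather than an explicit search tree; the class and method names are mine and weights are assumed non-negative. The O(log* n) and expected-constant-time results in the linked report go well beyond this.

    import random

    class FenwickSampler:
        """Sketch: Fenwick tree over non-negative event weights.
        update() and sample() are both O(log n)."""

        def __init__(self, weights):
            self.n = len(weights)
            self.tree = [0.0] * (self.n + 1)
            self.weights = list(weights)
            for i, w in enumerate(weights, start=1):
                self._add(i, w)

        def _add(self, i, delta):
            while i <= self.n:
                self.tree[i] += delta
                i += i & (-i)

        def update(self, index, new_weight):
            # index is 0-based
            self._add(index + 1, new_weight - self.weights[index])
            self.weights[index] = new_weight

        def total(self):
            i, s = self.n, 0.0
            while i > 0:
                s += self.tree[i]
                i -= i & (-i)
            return s

        def sample(self):
            # Draw r in [0, total) and walk down the implicit tree
            # to find which event's weight interval contains r.
            r = random.random() * self.total()
            pos = 0
            bit = 1 << self.n.bit_length()
            while bit > 0:
                nxt = pos + bit
                if nxt <= self.n and self.tree[nxt] < r:
                    r -= self.tree[nxt]
                    pos = nxt
                bit >>= 1
            return pos  # 0-based index of the chosen event

Usage: s = FenwickSampler([0.5, 2.0, 1.5]); s.sample() returns 1 about half the time; s.update(0, 4.0) reweights event 0 in O(log n).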

Related

Sorting Algorithm that minimizes the maximum number of comparisons in which individual items are involved

I'm interested in finding a comparison sorting algorithm that minimizes the number of times each single element is compared with others during the run of the algorithm.
For a randomly sorted list, I'm interested in two distributions: the number of comparisons that are needed to sort a list (this is the traditional criterion) and the number of comparisons in which each single element of the list is involved.
Among the algorithms that have a good performance in terms of the number of comparisons, say achieving O(n log(n)) on average, I would like to find out the one for which, on average, the number of times a single element is compared with others is minimized.
I suppose that the theoretical minimum is O(log(n)), which is obtained by dividing the total number of comparisons above by n.
I'm also interested in the case where data are likely to be already ordered to some extent.
Is perhaps a simulation the best way to go about finding an answer?
(My previous question has been put on hold - this is now a very clear question; if you can't understand it then please explain why.)
Yes, you definitely should do simulations.
There you will implicitly set the size and pre-ordering constraints in a way that may allow more specific statements than the general question you raised.
There cannot, however, be a clear answer to such a question in general.
Big-O deals with asymptotic behaviour while your question seems to target smaller problem sizes. So Big-O could hint at the best candidates for sufficiently large input sets to a sort run. (But, e.g., if you are interested in size <= 5 the results may be completely different!)
To get a proper estimate of the comparison operations you would need to analyze each individual algorithm.
In the end, the result (for a given algorithm) will necessarily be specific to the dataset being sorted.
Also, "on average" is not well defined in your context. I'd assume you intend to refer to the number of comparisons on the participating objects for a given sort, and not an average over a (sufficiently large) set of sort runs.
Even within a single algorithm, the distribution of comparisons an individual object is involved in may show a large standard deviation in one case and be (nearly) equally distributed in another case.
As the complexity of a sorting algorithm is determined by the total number of comparisons (and the position changes they imply), I do not assume theoretical analysis will contribute much to an answer.
Maybe you can add some background on what would make an answer to your question "interesting" in a practical sense?
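One way to run the suggested simulation in Python: wrap each value in an object that counts the comparisons it takes part in, then sort and inspect the per-element distribution. Only the built-in sorted() (Timsort) is exercised below, as an example; lists of Counted objects can be fed to any comparison sort you want to study.

    import random
    from collections import defaultdict

    class Counted:
        """Wraps a value and counts how many comparisons each element takes part in."""
        counts = defaultdict(int)   # element id -> number of comparisons

        def __init__(self, value, uid):
            self.value = value
            self.uid = uid

        def __lt__(self, other):
            Counted.counts[self.uid] += 1
            Counted.counts[other.uid] += 1
            return self.value < other.value

    def per_element_comparisons(n, trials=100):
        """Average and maximum per-element comparison counts over random lists."""
        max_per_elem, total = 0, 0
        for _ in range(trials):
            Counted.counts.clear()
            data = [Counted(random.random(), i) for i in range(n)]
            sorted(data)                       # swap in your own sort here
            counts = [Counted.counts[i] for i in range(n)]
            max_per_elem = max(max_per_elem, max(counts))
            total += sum(counts)
        return total / (trials * n), max_per_elem

    if __name__ == "__main__":
        print(per_element_comparisons(1000))

The first number returned is the average number of comparisons per element (expected to grow like log n); the second is the worst case any single element experienced across all trials.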

Given a situation, how to decide on a data structure?

I'm preparing to attend technical interviews and have mostly faced questions which are situation based. Often the situation is a big dataset and I'm asked to decide which will be the most optimal data structure to use.
I'm familiar with most data structures, their implementation and performance. But I face a dilemma when given situations and have to be decisive about structures.
I'm looking for steps/an algorithm to follow in a given situation which can help me arrive at the optimal data structure within the time period of the interview.
It depends on what operations you need to support efficiently.
Let's start with the simplest example: you have a large list of elements and you have to find a given element. Let's consider various candidates.
You can use a sorted array to find an element in O(log N) time using binary search. What if you want to support insertion and deletion along with that? Inserting an element into a sorted array takes O(N) time in the worst case. (Think of adding an element at the beginning: you have to shift all the elements one place to the right.) This is where binary search trees (BSTs) come in. They can support insertion, deletion and searching for an element in O(log N) time.
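A minimal Python sketch of the sorted-array option, with the standard bisect module standing in for hand-written binary search:

    import bisect

    sorted_arr = [1, 3, 4, 7, 9, 12]

    # O(log N) search: find the position where 7 would go and check it is there.
    i = bisect.bisect_left(sorted_arr, 7)
    found = i < len(sorted_arr) and sorted_arr[i] == 7   # True

    # Insertion keeps the array sorted but shifts elements: O(N) worst case.
    bisect.insort(sorted_arr, 5)      # sorted_arr is now [1, 3, 4, 5, 7, 9, 12]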
Now suppose you also need to support two more operations, finding the minimum and the maximum. In the sorted array, that is just returning the first and the last element respectively, and hence the complexity is O(1). Assuming the BST is a balanced one like a red-black tree or an AVL tree, finding the min and max needs O(log N) time. Consider another situation where you need to return the kth order statistic. Again, the sorted array wins. As you can see, there is a tradeoff and it really depends on the problem you are given.
Let's take another example. You are given a graph of V vertices and E edges and you have to find the number of connected components in the graph. It can be done in O(V+E) time using Depth first search (assuming adjacency list representation). Consider another situation where edges are added incrementally and the number of connected components can be asked at any point of time in the process. In that situation, Disjoint Set Union data structure with rank and path compression heuristics can be used and it is extremely fast for this situation.
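A minimal Python sketch of that Disjoint Set Union structure, with path compression and union by rank; the class and method names are my own:

    class DSU:
        """Disjoint Set Union with path compression and union by rank."""

        def __init__(self, n):
            self.parent = list(range(n))
            self.rank = [0] * n
            self.components = n

        def find(self, x):
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path compression (halving)
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra == rb:
                return False
            if self.rank[ra] < self.rank[rb]:
                ra, rb = rb, ra
            self.parent[rb] = ra
            if self.rank[ra] == self.rank[rb]:
                self.rank[ra] += 1
            self.components -= 1
            return True

    # Edges arrive incrementally; the component count is available at any time.
    dsu = DSU(5)
    dsu.union(0, 1)
    dsu.union(3, 4)
    print(dsu.components)   # 3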
One more example: you need to support range updates and finding the sum of a subarray efficiently, and no new elements are inserted into the array. If you have an array of N elements and Q queries are given, then there are two choices. If the range sum queries come only after all of the update operations, which are Q' in number, then you can preprocess the array in O(N+Q') time and answer any query in O(1) time (store prefix sums). What if there is no such order enforced? You can use a segment tree with lazy propagation for that. It can be built in O(N log N) time and each query can be performed in O(log N) time, so you need O((N+Q) log N) time in total. Again, what if insertion and deletion are supported along with all these operations? You can use a data structure called a treap, which is a probabilistic data structure, and all these operations can be performed in O(log N) time (using an implicit treap).
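A minimal sketch of the prefix-sum choice for the case where all updates precede the queries (the segment-tree and treap variants are longer and omitted here):

    from itertools import accumulate

    arr = [2, 5, 1, 7, 3, 6]       # array after all updates have been applied

    # O(N) preprocessing: prefix[i] = sum of arr[0..i-1]
    prefix = [0] + list(accumulate(arr))

    def range_sum(l, r):
        """Sum of arr[l..r] inclusive, answered in O(1)."""
        return prefix[r + 1] - prefix[l]

    print(range_sum(1, 4))   # 5 + 1 + 7 + 3 = 16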
Note: constants are omitted when using Big-O notation. Some of these structures have large constants hidden in their complexities.
Start with common data structures. Can the problem be solved efficiently with arrays, hashtables, lists or trees (or a simple combination of them, e.g. an array of hashtables or similar)?
If there are multiple options, just iterate the runtimes for common operations. Typically one data structure is a clear winner in the scenario set up for the interview. If not, just tell the interviewer your findings, e.g. "A takes O(n^2) to build but then queries can be handled in O(1), whereas for B build and query time are both O(n). So for one-time usage, I'd use B, otherwise A". Space consumption might be relevant in some cases, too.
Highly specialized data structures (e.g. prefix trees aka "tries") are often just that: highly specialized for one particular case. The interviewer should usually be more interested in your ability to build useful stuff out of an existing general purpose library, as opposed to knowing all kinds of exotic data structures that may not have much real world usage. That said, extra knowledge never hurts; just be prepared to discuss pros and cons of what you mention (the interviewer may probe whether you are just "name dropping").

KD-Tree implementation

I'm trying to write my own KD-Tree implementation and eventually a kNN implementation, and I'm having a bit of difficulty understanding how the KD-Tree constructs the search tree.
On Wikipedia it says that it finds the median of the values and uses that as the root of the tree.
When there are many dimensions, however, how would you compute the median?
You don't find the median in several dimensions (in fact, there is no meaningful order for multidimensional numbers). At every level of the kd-tree, you focus on one dimension. You choose the median based on this dimension, ignoring the other components.
Note that you can use many criteria other than the median, depending on what you want to do. Likewise, selecting a good scheme for deciding the dimension for each node is an art, though virtually every scheme is correct.
It is not required to find the medians. From Wikipedia:
Note also that it is not required to select the median point. In that case, the result is simply that there is no guarantee that the tree will be balanced. A simple heuristic to avoid coding a complex linear-time median-finding algorithm, or using an O(n log n) sort of all n points, is to use sort to find the median of a fixed number of randomly selected points to serve as the splitting plane. In practice, this technique often results in nicely balanced trees.
KD-Tree from wikipedia
You can simply sort the points according to one dimension, then choose the median as the root, then recursively construct the subtrees (sorting by another dimension).
Here is an implementation:
https://github.com/tavaresdong/cs106l/blob/master/KDTree/src/KDTree.h
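A minimal Python sketch of that recursive construction, using dict nodes and a full sort at each level for brevity (the linked C++ implementation is organised differently):

    def build_kdtree(points, depth=0):
        """Recursively build a kd-tree: at each level split on one axis,
        using the median point along that axis as the node."""
        if not points:
            return None
        k = len(points[0])                 # dimensionality
        axis = depth % k                   # cycle through the dimensions
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return {
            "point": points[mid],
            "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1),
        }

    tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
    print(tree["point"])   # (7, 2): the median along x becomes the root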

What's the best way to compare the efficiency of splay trees?

I have implemented several splay tree algorithms.
What's the best way to compare them?
Is it a good start to compare execution time when adding random nodes?
I've also implemented a Binary Search Tree that keeps track of how often every node is visited. I wrote an optimize() method that creates an Optimal Binary Search Tree.
If we do not plan on modifying a search tree, and we know exactly how often each item will be accessed, we can construct an optimal binary search tree, which is a search tree where the average cost of looking up an item (the expected search cost) is minimized.
How can I involve this in the comparison of splay trees?
I like the empirical approach.
In this approach:
1. Create a bunch of random, typical data sets of various lengths.
2. Run each implementation and find out what the execution time is for each.
3. Use hypothesis testing methods to find out if one implementation is better than the other. Here, the null hypothesis (H0) is "the two implementations should take the same time to execute, on average".
4. Conclude from step 3 that one implementation is better than the other, with probability 1-p (where p is your p-value).
P.S. The Wilcoxon test is considered a good one, and is used a lot in the literature to compare two algorithms.
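A sketch of steps 2-4 in Python. SplayTreeA and SplayTreeB below are placeholders for your own implementations, assumed to expose an insert() method; scipy.stats.wilcoxon performs the paired, non-parametric test mentioned above.

    import random
    import timeit
    from scipy.stats import wilcoxon

    def benchmark(make_tree, workloads):
        """Time one implementation on each workload (insert every key)."""
        times = []
        for keys in workloads:
            def run():
                t = make_tree()
                for k in keys:
                    t.insert(k)
            times.append(timeit.timeit(run, number=1))
        return times

    # 50 random key sequences of varying lengths.
    workloads = [[random.randrange(10**6) for _ in range(n)]
                 for n in random.choices(range(10_000, 100_000), k=50)]

    # SplayTreeA / SplayTreeB are placeholders for the implementations under test.
    times_a = benchmark(SplayTreeA, workloads)
    times_b = benchmark(SplayTreeB, workloads)

    # Paired test; H0 = both implementations are equally fast on these workloads.
    stat, p_value = wilcoxon(times_a, times_b)
    print(p_value)

To bring the optimal BST into the comparison, you could add its lookup times on the same access sequences as a baseline: it gives a lower bound on average search cost when the access frequencies are known in advance.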

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k=2 if this makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it first directly, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
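As an illustration of the PCA-then-tree idea in Python (scikit-learn is my choice here, not something this answer prescribes; note the neighbours are exact only in the reduced space, so they are approximate with respect to the original 75 dimensions):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(16000, 75))        # stand-in for the 16,000 x 75 data

    # Reduce to a dimensionality a kd-tree can still cope with (15 is a guess).
    X_red = PCA(n_components=15).fit_transform(X)

    # k=2 nearest neighbours of every point; drop the first column (the point itself).
    nn = NearestNeighbors(n_neighbors=3, algorithm="kd_tree").fit(X_red)
    dist, idx = nn.kneighbors(X_red)
    neighbours = idx[:, 1:]                 # shape (16000, 2)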
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbour search (ANNS), where you are satisfied with finding a point that might not be the exact nearest neighbour, but rather a good approximation of it (for example the 4th NN to your query, while you are looking for the 1st NN).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You could read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the nearest neighbour problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, where that bucket (and usually its neighbouring ones) contains good NN candidates.
FALCONN is a good C++ implementation which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
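A toy, single-table sketch of the random-hyperplane LSH family for cosine similarity; production libraries such as FALCONN use many tables plus multi-probe querying to raise the hit probability:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_planes = 75, 16                      # dimensionality, hash length in bits
    planes = rng.normal(size=(n_planes, d))   # random hyperplanes

    def lsh_key(x):
        """Sign pattern of x against the random hyperplanes.
        Vectors with a small angle (high cosine similarity) tend to share the key."""
        return tuple((planes @ x > 0).astype(int))

    # Index: bucket key -> list of point ids
    X = rng.normal(size=(16000, d))
    buckets = {}
    for i, x in enumerate(X):
        buckets.setdefault(lsh_key(x), []).append(i)

    # Query: scan only the bucket the query falls into (in practice several
    # hash tables / neighbouring buckets are probed as well).
    q = X[0]
    candidates = buckets.get(lsh_key(q), [])  # includes the query point itself here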
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbours. That's n squared for the distance calculation and n log(n) for the sort, but you have to do the sort n times (once for every observation). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know if they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
A very common approach is to sort the array of distances that you have computed for each data point.
As sorting the entire array can be very expensive, you can use indirect selection methods, for example numpy.argpartition in the Python NumPy library, to pick out only the closest K values you are interested in. There is no need to sort the entire array.
The cost of @Grembo's answer above can be reduced significantly, as you only need the K nearest values and there is no need to sort all the distances from each point.
If you just need the K neighbours, this method works very well, reducing your computational cost and time complexity.
If you need the K neighbours sorted, sort the (small) output again afterwards.
See the documentation for argpartition.
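A sketch of that approach with NumPy, computing squared Euclidean distances in row blocks to bound memory and using argpartition to select the K smallest per row without a full sort:

    import numpy as np

    def knn_argpartition(X, k, block=2000):
        """Exact k nearest neighbours by brute force, using argpartition instead
        of a full sort, with the distance matrix processed in row blocks."""
        n = X.shape[0]
        sq = (X ** 2).sum(axis=1)
        out = np.empty((n, k), dtype=np.int64)
        for start in range(0, n, block):
            stop = min(start + block, n)
            # Squared Euclidean distances of this block against all points.
            d2 = sq[start:stop, None] + sq[None, :] - 2 * X[start:stop] @ X.T
            # Exclude each point itself.
            d2[np.arange(stop - start), np.arange(start, stop)] = np.inf
            # O(n) selection per row instead of O(n log n) sorting.
            out[start:stop] = np.argpartition(d2, k, axis=1)[:, :k]
        return out   # the k nearest ids per point, in arbitrary order

    X = np.random.default_rng(0).normal(size=(16000, 75))
    neighbours = knn_argpartition(X, k=2)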
