How can I efficiently calculate the negative binomial cumulative distribution function? - algorithm

This post is really helpful:
How can I efficiently calculate the binomial cumulative distribution function?
However, I need the negative binomial cumulative distribution function.
Is there a way to tweak the code to get the negative binomial cumulative distribution function?

You can compute the CDF by summing the terms of the PMF, taking advantage of the recurrence relationship the terms satisfy. The terms in the series are a little complicated, but the ratio of consecutive terms is simple.
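To make that concrete, here is a minimal sketch in Python, assuming the parameterization where X counts the number of failures before the r-th success (that choice of parameterization is my assumption, not something stated above):

def negative_binomial_cdf(K, r, p):
    # P(X <= K), where pmf(k) = C(k+r-1, k) * p**r * (1-p)**k
    # consecutive terms satisfy pmf(k+1) / pmf(k) = (k + r) * (1 - p) / (k + 1)
    term = p ** r                      # pmf(0)
    total = term
    for k in range(K):
        term *= (k + r) * (1.0 - p) / (k + 1)
        total += term
    return total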

Related

Fuzzy search over millions of strings with custom distance function

I have a large pool of short strings and a custom distance function on them (let's say Damerau–Levenshtein distance).
Q: What is the state-of-the-art solution for getting top N strings from the pool according to the custom distance?
I am looking for both a theoretical approach to this problem as well as coded implementation (Java, Python, etc).
The straightforward approach is to iterate over all strings, calculate the distance for each, and keep only the best N as you iterate.
If you need to do this task a lot, you should think about whether you can come up with an upper-bound / lower-bound estimate of the cost that can be calculated much faster than your real cost function. E.g. pre-calculate all n-grams (e.g. 3-grams) for your strings, or note that comparing the length difference already gives a lower bound for the distance. Then you can skip the distance calculation for all strings whose lower-bound distance is higher than the distance of your current n-th best match.
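As a rough sketch of that filtering idea (the pool, query and distance arguments are placeholders; plug in your own Damerau–Levenshtein implementation), the length-difference lower bound lets you skip most of the expensive calls:

import heapq

def top_n_matches(query, pool, distance, n):
    # keep the n best matches seen so far; skip the expensive distance call
    # whenever the cheap lower bound cannot beat the current n-th best
    best = []                                    # max-heap via negated distances
    for s in pool:
        lower_bound = abs(len(s) - len(query))   # edit distance >= length difference
        if len(best) == n and lower_bound >= -best[0][0]:
            continue
        d = distance(query, s)
        if len(best) < n:
            heapq.heappush(best, (-d, s))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, s))
    return sorted((-neg_d, s) for neg_d, s in best)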

How the average complexity is calculated

How is the average complexity of an algorithm calculated? The worst case is obvious, the best case too, but how is the average calculated?
Calculate the complexity for every possible input and take a weighted sum based on their probabilities. This is also called the expected runtime (similar to expectation in probability theory).
E[T] = P(X = I1)·T(I1) + P(X = I2)·T(I2) + P(X = I3)·T(I3) + ...
Average performance (time, space, etc.) complexity is found by considering all possible inputs of a given size and stating the asymptotic bound for the average of the respective measure across all those inputs.
For example, average "number of comparisons" complexity for a sort would be found by considering all N! permutations of input of size N and stating bounds on the average number of comparisons performed across all those inputs.
I.e. this is the sum of numbers of comparisons for all of the possible N! inputs divided by N!
Because the average performance across all possible inputs is equal to the expected value of the same performance measure, average performance is also called expected performance.
Quicksort presents an interesting non-trivial example of calculating the average run-time performance. As you can see the math can get quite complex, and so unfortunately I don't think there's a general equation for calculating average performance.
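To make the definition above concrete, here is a small brute-force sketch that averages the number of comparisons insertion sort performs over all N! permutations (only feasible for tiny N; the counting wrapper is hypothetical):

from itertools import permutations

def insertion_sort_comparisons(a):
    # run insertion sort and return the number of element comparisons made
    a, comparisons = list(a), 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            comparisons += 1
            if a[j - 1] <= a[j]:
                break
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return comparisons

def average_comparisons(n):
    # the average over all n! equally likely inputs, i.e. the expected value
    counts = [insertion_sort_comparisons(p) for p in permutations(range(n))]
    return sum(counts) / len(counts)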

Cumulative frequency table with creation in linear or better than linear complexity?

I am trying to solve an algorithmic problem, and to solve it within the time constraints I need to implement a cumulative frequency table whose creation takes linear or better-than-linear time. My inputs are integers only; hence, the keys of the frequency table are integers only. A simple implementation that I came up with follows (assume cumulative_freq_table is a hashmap in the following code):
read x
for key in range(x, N):
    if key in cumulative_freq_table:
        cumulative_freq_table[key] += 1
I haven't studied any algorithms-related course, but I guess its complexity is around O(N^2). Can this be done in better than O(N^2) time?
OFF-LINE APPROACH
If you are happy to use two passes then you can do this:
# first pass: count every value x read from the input
freq_table = [0] * N
for x in inputs:
    freq_table[x] += 1

# second pass: a running (prefix) sum gives the cumulative table
t = 0
cumulative_freq_table = [0] * N
for key in range(0, N):
    t += freq_table[key]
    cumulative_freq_table[key] = t
This will be linear.
ON-LINE APPROACH
The problem with the linear approach is that it requires all the data to be seen before you can access the cumulative frequency table.
There are alternative approaches that allow continual access to the cumulative frequency, but have higher complexity.
For example, have a look at Fenwick Trees for an approach that uses O(log(N)) operations for each element.
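For illustration, here is a minimal Fenwick tree sketch (a generic textbook version, not code from any particular answer) supporting point updates and cumulative-frequency queries in O(log N) each:

class FenwickTree:
    def __init__(self, n):
        self.tree = [0] * (n + 1)            # 1-based internal indexing

    def add(self, key, delta=1):             # record 'delta' occurrences of 'key'
        i = key + 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def cumulative(self, key):               # number of items with value <= key
        i, total = key + 1, 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total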

Sorting algorithms for data of known statistical distribution?

It just occurred to me, if you know something about the distribution (in the statistical sense) of the data to sort, the performance of a sorting algorithm might benefit if you take that information into account.
So my question is, are there any sorting algorithms that take into account that kind of information? How good are they?
An example to clarify: if you know the distribution of your data to be Gaussian, you could estimate the mean and variance on the fly as you process the data. This would give you an estimate of the final position of each number, which you could use to place it close to its final position.
I'm pretty surprised the answer isn't a wiki link to a thorough page discussing this issue. Isn't this a very common case (the Gaussian case, for example)?
I'm adding a bounty to this question, because I'm looking for definite answers with sources, not speculation. Something like "in the case of Gaussian-distributed data, algorithm XYZ is the fastest on average, as was proved by Smith et al. [1]". However, any additional information is welcome.
If the data you are sorting has a known distribution, I would use a bucket sort algorithm. You could add some extra logic to it so that you calculate the size and/or positions of the various buckets based upon properties of the distribution (e.g., for a Gaussian, you might have a bucket every sigma/k away from the mean, where sigma is the standard deviation of the distribution).
By having a known distribution and modifying the standard bucket sort algorithm in this way, you would probably get the Histogram Sort algorithm or something close to it. Of course, your algorithm would be computationally faster than the Histogram Sort algorithm because there would probably not be a need to do the first pass (described in the link), since you already know the distribution.
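As a rough sketch of that bucket-placement rule (the ±4σ coverage and the two overflow buckets at the ends are my own assumptions):

def gaussian_bucket_index(x, mu, sigma, k=2, coverage=4.0):
    # buckets of width sigma/k covering mu ± coverage*sigma, with the two
    # outermost buckets catching everything beyond that range
    width = sigma / k
    num_buckets = int(2 * coverage * k) + 2
    i = int((x - (mu - coverage * sigma)) // width) + 1
    return max(0, min(num_buckets - 1, i))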
Edit: given the new criteria of your question (though my previous answer concerning Histogram Sort links to the respectable NIST and contains performance information), here is a peer-reviewed paper from the International Conference on Parallel Processing:
Adaptive Data Partition for Sorting Using Probability Distribution
The authors claim this algorithm has better performance (up to 30% better) than the popular Quick-Sort Algorithm.
Sounds like you might want to read Self-Improving Algorithms: they achieve an eventual optimal expected running time for arbitrary input distributions.
We give such self-improving algorithms for two problems: (i) sorting a sequence of numbers and (ii) computing the Delaunay triangulation of a planar point set. Both algorithms achieve optimal expected limiting complexity. The algorithms begin with a training phase during which they collect information about the input distribution, followed by a stationary regime in which the algorithms settle to their optimized incarnations.
If you already know your input distribution is approximately Gaussian, then perhaps another approach would be more efficient in terms of space complexity, but in terms of expected running time this is a rather wonderful result.
Knowing the data source distribution, one can build a good hash function. Knowing the distribution well, the hash function may prove to be a perfect hash function, or close to perfect, for many input vectors.
Such a function would divide an input of size n into n bins, such that the smallest item maps to the first bin and the largest item maps to the last bin. When the hash is perfect, we achieve a sort just by inserting all the items into the bins.
Inserting all the items into a hash table and then extracting them in order will be O(n) when the hash is perfect (assuming the hash function calculation cost is O(1) and the underlying hash data structure's operations are O(1)).
I would use an array of Fibonacci heaps to implement the hash table.
For input vectors for which the hash function isn't perfect (but still close to perfect), it would still be much better than O(n log n). When it is perfect, it would be O(n). I'm not sure how to calculate the average complexity, but if forced to, I would bet on O(n log log n).
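A loose sketch of that idea, using the distribution's CDF as the "hash" and Python's binary heaps standing in for Fibonacci heaps (both substitutions are mine, not part of the answer):

import heapq

def distribution_hash_sort(values, cdf):
    # the distribution-based hash sends smaller values to lower-index bins,
    # so emptying the bins in index order yields the values in sorted order
    n = len(values)
    bins = [[] for _ in range(n)]
    for x in values:
        heapq.heappush(bins[min(n - 1, int(n * cdf(x)))], x)
    out = []
    for b in bins:
        while b:
            out.append(heapq.heappop(b))
    return out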
Computer sorting algorithms can be classified into two categories: comparison-based sorting and non-comparison-based sorting. For comparison-based sorting, the sorting time in its best-case performance is Ω(n log n), while in its worst-case performance the sorting time can rise up to O(n^2). In recent years, some improved algorithms have been proposed to speed up comparison-based sorting, such as advanced quick sort according to data distribution characteristics. However, the average sorting time for these algorithms is just Ω(n log2 n), and only in the best case can it reach O(n).
Different from comparison-based sorting, non-comparison-based sorting such as count sorting, bucket sorting and radix sorting depends mainly on key and address calculation. When the values of keys are finite, ranging from 1 to m, the computational complexity of non-comparison-based sorting is O(m+n). Particularly, when m = O(n), the sorting time can reach O(n). However, when m = n^2, n^3, ..., the upper bound of linear sorting time can not be obtained.
Among non-comparison-based sorting, bucket sorting distributes a group of records with similar keys into the appropriate “bucket”, then another sorting algorithm is applied to the records in each bucket. With bucket sorting, the partition of records into m buckets is less time consuming, while only a few records will be contained in each bucket so that the “cleanup sorting” algorithm can be applied very fast. Therefore, bucket sorting has the potential to asymptotically save sorting time compared with Ω(n log n) algorithms. Obviously, how to uniformly distribute all records into buckets plays a critical role in bucket sorting. Hence what you need is a method to construct a hash function according to data distribution, which is used to uniformly distribute n records into n buckets based on the key of each record. Hence, the sorting time of the proposed bucket sorting algorithm will reach O(n) under any circumstance.
Check out this paper: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5170434&tag=1
Bucket sort would give you a linear time sorting algorithm, as long as you can compute the CDF of each point in O(1) time.
The algorithm, which you can also look up elsewhere, is as follows:
a = [[] for _ in range(n)]                     # create an empty list for each bucket
for x in data:                                 # 'data' is the input list of values
    a[min(n - 1, int(n * cdf(x)))].append(x)   # O(1) time for each x (clamp when cdf(x) == 1.0)
data.clear()
for i in range(n):
    # this sorting step costs O(|a[i]|^2) time per bucket with insertion sort,
    # but most buckets are small and the cost is O(1) per bucket in expectation
    a[i].sort()                                # stands in for insertion_sort(a[i])
    data.extend(a[i])
The running time is O(n) in expectation because in expectation there are O(n) pairs (x, y) such that x and y fall in the same bucket, and the running time of insertion sort is precisely O(n + # pairs in the same bucket). The analysis is similar to that of FKS static perfect hashing.
EDIT: If you don't know the distribution, but you know what family it's from, you can just estimate the distribution in O(n), in the Gaussian case by computing the mean and variance, and then use the same algorithm (incidentally, computing the cdf in this case is nontrivial).
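A minimal sketch of that Gaussian case: estimate the parameters in O(n) and build a CDF (via the error function) that can be passed to the bucket sort above; the helper name is hypothetical:

import math
import statistics

def fitted_gaussian_cdf(values):
    # fit mean and standard deviation, then return the fitted Gaussian's CDF
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values) or 1.0     # guard against zero variance
    return lambda x: 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))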
You could use that information in quicksort to select the pivot value. I think it would improve the probability of the algorithm staying away from the O(N^2) worst-case complexity.
I think cycle sort falls into this category. You use it when you know the exact position that you want each element to end up at.
Cyclesort has some nice properties - for certain restricted types of data it can do a stable, in-place sort in linear time, while guaranteeing that each element will be moved at most once.
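A minimal sketch of that restricted case, where the input is a permutation of 0..n-1 so each value's final position is the value itself; it sorts in place with at most one write per position:

def cycle_sort_permutation(a):
    # 'a' must be a permutation of 0..len(a)-1; returns the number of writes
    writes = 0
    for start in range(len(a)):
        item = a[start]
        if item == start:
            continue
        pos = item                       # each value belongs at the index equal to it
        while pos != start:              # follow the cycle, writing each slot once
            a[pos], item = item, a[pos]
            writes += 1
            pos = item
        a[start] = item
        writes += 1
    return writes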
For distributions other than the uniform, the same bucket sort applies: compute the bucket index from that distribution's CDF, i.e. floor(n * cdf(x)) as in the code above, which maps the data onto the unit interval. The rest of the algorithm is identical to the block above.

How to generate a random number from specified discrete distribution?

Let's say we have a discrete distribution with a finite number of possible results. Is it possible to generate a random number from this distribution faster than in O(log n), where n is the number of possible results?
How to do it in O(log n):
- Make an array of cumulative probabilities (Array[i] = probability that the random number will be less than or equal to i).
- Generate a random number from the uniform distribution (let's denote it by k).
- Find the smallest i such that k < Array[i]. This can be done using binary search.
- i is our random number.
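Those steps look roughly like this in Python (precompute the cumulative array once if you intend to draw many samples):

import bisect
import random

def sample_discrete(probs):
    # build cumulative probabilities, draw a uniform k, then binary-search
    # for the smallest i with k < cumulative[i]
    cumulative = []
    total = 0.0
    for p in probs:
        total += p
        cumulative.append(total)
    k = random.random() * total
    return bisect.bisect_right(cumulative, k)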
Walker's alias method can draw a sample in constant worst-case time, using some auxiliary arrays of size n which need to be precomputed. This method is described in Chapter 3 of Devroye's book on sampling and is implemented in the R sample() function. You can get code from R's source code or this thread. A 1991 paper by Vose claims to reduce the initialization cost.
Note that your question isn't well-defined unless you specify the exact form of the input and how many random numbers you want to draw. For example, if the input is an array giving the probability of each result, then your algorithm is not O(log n) because it requires first computing the cumulative probabilities which takes O(n) time from the input array.
If you intend to draw many samples, then the cost of generating a single sample is not so important. Instead, what matters is the total cost to generate m results and the peak memory required. In this regard, the alias method is very good. If you want to generate all the samples at once, use the O(n+m) algorithm posted here and then shuffle the results.
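For reference, here is a minimal sketch of the alias method with Vose's O(n) initialization (a generic textbook version, not taken from R's source):

import random

def build_alias_table(probs):
    # O(n) setup: each column i keeps a threshold prob[i] and a partner alias[i]
    n = len(probs)
    total = sum(probs)
    scaled = [p * n / total for p in probs]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:              # leftover columns have probability 1
        prob[i] = 1.0
    return prob, alias

def draw(prob, alias):
    # O(1) per sample: pick a column uniformly, then accept it or take its alias
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]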

Resources