Is it okay to use mean average precision for recommender system evaluation of non-uniform number length recommendation?
Related
Suppose I have a list of ranges (in a form of lower bound and upper bound, inclusive) ranges = [(lb1, ub1), (lb2, ub2)...] and a positive number k. Is there some way how to sample k N-dimensional vectors (N is given by len(ranges)) from the N-dimensional interval given by ranges such that the samples cover the interval as evenly as possible?
I have no definiton of evenly, it's just intuitive (maybe that the distances between "neighboring" points are similar). I'm not looking for a precise algorithm (which is not possible without the definition) but rather for ideas of how to do that and that are nice in python/numpy.
I'm (probably) not looking for just random sampling which could very easily create unwanted clusters of samples, but the algorithm can definitely be stochastic.
If the points are independent, then there should be clusters. So, you want the points not to be independent. You want something like a low discrepancy sequence in N dimensions. One type of low discrepancy sequence in N dimensions is a Sobol sequence. These were designed for high dimensional numerical integration, and are suitable for many but not all purposes.
Say I have a sample of N positive real numbers and I want to find a "typical" value for these numbers. Of course "typical" is not very well defined but one could think of the following more concrete problem :
The numbers are distributed such that (roughly speaking) a fraction (1-epsilon) of them is drawn from a Gaussian with positive mean m > 0 and mean square deviation sigma << m and a small fraction epsilon of them is drawn from some other distribution, heavy tailed both for large and small numbers. I want to estimate the mean of the Gaussian within a few standard deviation.
A solution would be to compute the median but while it is O(N), constant factors are not so good for moderate N and moreover it requires quite a bit of coding. I am ready to give up precision on my estimate against code simplicity and/or small N performance (say N is 10 or 20 for instance, and I have at most one or two outliers).
Do you have any suggestion ?
(For instance, if my outliers where only coming from large values, I would compute the average of the log of my values and exponentiate it. Under some further assumptions this gives me, generally, a good estimate and I can compute it easily and with a sharp O(N)).
You could take the mean of the numbers excluding the min and max. The formula is (sum - min - max) / (N - 2), and the terms in the numerator can be computed simply with one pass (watch out for floating point issues though).
I think you should reconsider the median, either using quickselect or Blum-Floyd-Pratt-Rivest-Tarjan (as implented here by Coetzee). It's fast and robust.
If you need better speed you might consider picking a fixed number of random elements and taking their median. This is sublinear (O(1) or O(log n) depending on the model) and works well for large sets.
I am starting to learn CUDA and I think calculating long digits of pi would be a nice, introductory project.
I have already implemented the simple Monte Carlo method which is easily parallelize-able. I simply have each thread randomly generate points on the unit square, figure out how many lie within the unit circle, and tally up the results using a reduction operation.
But that is certainly not the fastest algorithm for calculating the constant. Before, when I did this exercise on a single threaded CPU, I used Machin-like formulae to do the calculation for far faster convergence. For those interested, this involves expressing pi as the sum of arctangents and using Taylor series to evaluate the expression.
An example of such a formula:
Unfortunately, I found that parallelizing this technique to thousands of GPU threads is not easy. The problem is that the majority of the operations are simply doing high precision math as opposed to doing floating point operations on long vectors of data.
So I'm wondering, what is the most efficient way to calculate arbitrarily long digits of pi on a GPU?
You should use the Bailey–Borwein–Plouffe formula
Why? First of all, you need an algorithm that can be broken down. So, the first thing that came to my mind is having a representation of pi as an infinite sum. Then, each processor just computes one term, and you sum them all in the end.
Then, it is preferable that each processor manipulates small-precision values, as opposed to very high precision ones. For example, if you want one billion decimals, and you use some of the expressions used here, like the Chudnovsky algorithm, each of your processor will need to manipulate a billion long number. That's simply not the appropriate method for a GPU.
So, all in all, the BBP formula will allow you to compute the digits of pi separately (the algorithm is very cool), and with "low precision" processors! Read the "BBP digit-extraction algorithm for π"
Advantages of the BBP algorithm for computing π
This algorithm computes π without requiring custom data types having thousands or even millions of digits. The method calculates the nth digit without calculating the first n − 1 digits, and can use small, efficient data types.
The algorithm is the fastest way to compute the nth digit (or a few digits in a neighborhood of the nth), but π-computing algorithms using large data types remain faster when the goal is to compute all the digits from 1 to n.
Lets say we have some discrete distribution with finite number of possible results, is it possible to generate a random number from this distribution faster than in O(logn), where n is number possible results?
How to make it in O(logn):
- Make an array with cumulative probability (Array[i] = Probability that random number will be less or equal to i)
- Generate random number from uniform distribution (lets denote it by k)
- Find the smallest i such that k < Array[i]. It can be done using binary search.
- i is our random number.
Walker's alias method can draw a sample in constant worst-case time, using some auxiliary arrays of size n which need to be precomputed. This method is described in Chapter 3 of Devroye's book on sampling and is implemented in the R sample() function. You can get code from R's source code or this thread. A 1991 paper by Vose claims to reduce the initialization cost.
Note that your question isn't well-defined unless you specify the exact form of the input and how many random numbers you want to draw. For example, if the input is an array giving the probability of each result, then your algorithm is not O(log n) because it requires first computing the cumulative probabilities which takes O(n) time from the input array.
If you intend to draw many samples then the cost of generating a single sample is not so important. Instead what matters is the total cost to generate m results, and the peak memory required. In this regard, the alias method very good. If you want to generate the samples all at once, use the O(n+m) algorithm posted here and then shuffle the results.
I have an ordered list of weighted items, weight of each is less-or-equal than N.
I need to convert it into a list of clusters.
Each cluster should span several consecutive items, and total weight of a cluster has to be less-or-equal than N.
Is there an algorithm which does it while minimizing the total number of clusters and keeping their weights as even as possible?
E.g. list [(a,5),(b,1),(c,2),(d,5)], N=6 should be converted into [([a],5),([b,c],3),([d],5)]
Since the dataset is ordered, one possible approach is to assign a "badness" score to each possible cluster and use a dynamic program reminiscent of Knuth's word wrapping ( http://en.wikipedia.org/wiki/Word_wrap ) to minimize the sum of the badness scores. The badness function will let you explore tradeoffs between minimizing the number of clusters (larger constant term) and balancing them (larger penalty for deviating from the average number of items).
Your problem is under-specified.
The issue is that you are trying to optimize two different properties of the resulting data, and these properties may be in opposition to one another. For a given set of data, it may be that the most even distribution has many clusters, and that the smallest number of clusters has a very uneven distribution.
For example, consider: [(a,1),(b,1),(c,1),(d,1),(e,1)], N=2
The most even distribution is [([a],1),([b],1),([c],1),([d],1),([e],1)]
But the smallest number of clusters is [([a,b],2),([c,d],2),([e],1)]
How is an algorithm supposed to know which of these (or which clustering in between them) you want? You need find some way to quantify the tradeoff that you are willing to accept between number of clusters and evenness of distribution.
You can create an example with an arbitrarily large discrepancy between the two possibilities by creating any set with 2k + 1 elements, and assigning them all the value N/2. This will lead to the smallest number of clusters being k+1 clusters (k of 2 elements and 1 of 1) with a weight difference of N/2 between the largest and smallest clusters. And then the most even distribution for this set will be 2k + 1 clusters of 1 element each, with no weight difference.
Edit: Also, "evenness" itself is not a well-defined idea. Are you looking to minimize the largest absolute difference in weights between clusters, or the mean difference in weights, or the median difference in weights, or the standard deviation in weights?