Pointwise vs. pairwise Learning-to-rank on DATA WITH BINARY RELEVANCE VALUES - ranking

I have two questions about the differences between pointwise and pairwise learning-to-rank algorithms on DATA WITH BINARY RELEVANCE VALUES (0s and 1s). Suppose the loss function for a pairwise algorithm counts the number of times an entry with label 0 gets ranked before an entry with label 1, and that for a pointwise algorithm measures the overall differences between the estimated relevance values and the actual relevance values.
So my questions are: 1) theoretically, will the two groups of algorithms perform significantly differently? 2) will a pairwise algorithm degrade to a pointwise algorithm in such a setting?
thanks!

In pointwise estimation, the errors across rows in your data (rows with items and users, where you want to rank items within each user/query) are assumed to be independent, rather like normally distributed errors. In pairwise estimation, the loss function often used is cross entropy: a relative measure of correctly classifying 1's as 1's and 0's as 0's within each informative pair (i.e. a pair in which one item is better than the other).
So chances are that the pairwise approach is likely to learn better than the pointwise one.
The only exception I can see is a business scenario in which users click items without evaluating/comparing them against one another, per se. This is highly unlikely, though.
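To make the contrast concrete, here is a minimal NumPy sketch of the two loss functions described in the question (the function names and toy data are purely illustrative):

```python
import numpy as np

def pointwise_loss(scores, labels):
    # Overall difference between estimated and actual relevance (squared error here).
    return float(np.sum((scores - labels) ** 2))

def pairwise_loss(scores, labels):
    # Number of (0, 1) pairs in which the 0-labelled item is ranked above the 1-labelled one.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return int(np.sum(neg[:, None] > pos[None, :]))

labels = np.array([1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
print(pointwise_loss(scores, labels), pairwise_loss(scores, labels))
```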

Related

"Covering" the space of all possible histogram shapes

There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram, computation result) pairs in my look-up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
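To make the lookup step concrete, here is a rough sketch of what I have in mind, relying on the fact that for 1-D histograms with equal total mass the Earth-Mover-Distance reduces to the L1 distance between cumulative sums (emd_1d, lookup and the table contents below are just placeholders):

```python
import numpy as np

def emd_1d(h1, h2):
    # For 1-D histograms with equal total mass, EMD equals the sum of
    # absolute differences of the cumulative distributions.
    return np.sum(np.abs(np.cumsum(h1) - np.cumsum(h2)))

def lookup(table, query_hist):
    # table: list of (precomputed_histogram, precomputed_result) pairs.
    # Return the stored result whose histogram is closest in EMD to the query.
    best = min(table, key=lambda entry: emd_1d(entry[0], query_hist))
    return best[1]

table = [(np.array([.1, .1, .8, 0.0]), "result A"),
         (np.array([.25, .25, .25, .25]), "result B")]
print(lookup(table, np.array([.1, .1, .75, .05])))   # -> "result A"
```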
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
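For instance, a sketch along those lines, assuming the stored result is numeric and using synthetic stand-ins for both the training histograms and the expensive computation (scikit-learn assumed; all names and numbers are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Stand-ins: random histograms on the 20-bin simplex, and a fake "expensive" result.
rng = np.random.default_rng(0)
histograms = rng.dirichlet(np.ones(20), size=5000)
results = histograms @ rng.normal(size=20)        # replace with the real computation

# Reduce each histogram to a few principal components, then predict the result.
model = make_pipeline(PCA(n_components=5), LinearRegression())
model.fit(histograms, results)
approx = model.predict(histograms[:3])            # cheap approximate answers
```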
Building on #jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
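A rough sketch of that approach, assuming scikit-learn and using random Dirichlet draws as a stand-in for whatever representative histograms you can actually generate:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for "representative" data: random points on the 20-bin simplex.
rng = np.random.default_rng(0)
histograms = rng.dirichlet(np.ones(20), size=50_000)

kmeans = KMeans(n_clusters=80, n_init=10, random_state=0).fit(histograms)

# Each centroid (renormalised to sum to 1) is a histogram whose expensive result
# you precompute off-line; at run time, map the input to its nearest exemplar.
exemplars = kmeans.cluster_centers_
exemplars = exemplars / exemplars.sum(axis=1, keepdims=True)
```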
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to choose your ranges for each parameter more carefully. 100**(1/3) isn't an integer, so the simple num_covers**(1/num_params) split wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still produce the same shape, it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parameters b/a and b/c.
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
edit: Since you said your histograms were described by arrays that were ~20 elements, I figured fitting parameters would be very fast.
edit2: On second thought, I think using a constant term in the model is pointless; all that matters is the shape.
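A rough sketch of the parameter sweep described above, with my own additions of clipping to keep bin values non-negative and renormalising each curve so it sums to 1:

```python
import itertools
import numpy as np

x = np.linspace(0, 1, 20)
param_values = np.linspace(-1, 1, 3)          # 3 choices per parameter -> 3**3 = 27 covers

covering = []
for a, b, c in itertools.product(param_values, repeat=3):
    y = a * x**3 + b * x**2 + c * x
    y = np.clip(y, 0, None)                   # keep bin values non-negative
    if y.sum() > 0:                           # skip the all-zero case
        covering.append(y / y.sum())          # renormalise so each histogram sums to 1
```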

Building the dataset for Random Forest training procedure

I should use the bagging technique (short for bootstrap aggregating) in order to train a random forest classifier. I read the description of this learning technique here, but I have not figured out how I should initially organize the dataset.
Currently I first load all the positive examples and immediately after them the negative ones. Moreover, there are fewer than half as many positive examples as negative ones, so by sampling from the dataset uniformly, the probability of obtaining a negative example is greater than that of obtaining a positive one.
How should I build the initial dataset?
Should I shuffle the initial dataset containing positive and negative examples?
Bagging depends on using bootstrap samples to train the different predictors and aggregating their results. See the above link for the full details, but in short: you need to sample from your data with replacement (i.e. if you have N elements numbered 1 through N, pick K random integers between 1 and N and take the elements at those K positions as a training set), usually creating samples of the same size as the original dataset (i.e. K = N).
One more thing you should probably bear in mind - random forests are more than just bootstrap aggregation over the original data - there is also a random subset of the features considered at each split of each individual tree.
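A minimal sketch of the bootstrap step, assuming NumPy; note that the order in which positives and negatives are stored doesn't matter, because each bootstrap sample draws indices uniformly at random:

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    # Draw N indices uniformly at random *with replacement* (K = N).
    n = len(y)
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]

# Toy data: positives loaded first, then negatives - the order is irrelevant here.
X = np.arange(10).reshape(-1, 1)
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
rng = np.random.default_rng(0)
samples = [bootstrap_sample(X, y, rng) for _ in range(5)]   # one sample per tree
```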

Clustering by date (by distance) in Ruby

I have a huge journal with actions done by users (like, for example, moderating contents).
I would like to find the 'mass' actions, meaning the actions that are too dense in time (the user probably performed them without thinking too much :) ).
That would translate to clustering the actions by date (in a linear space), and to marking the clusters that are too dense.
I am no expert in clustering algorithms and methods, but I think the k-means clustering would not do the trick, since I don't know the number of clusters.
Also, ideally, I would also like to 'fine tune' the algorithm.
What would you advise?
P.S. Here are some resources that I found (in Ruby):
hierclust - a simple hierarchical clustering library for spatial data
AI4R - library that implements some clustering algorithms
K-means would probably do a good job as long as you're interested in an a priori known number of clusters. Since you don't know it, you might consider reading about the LBG algorithm, which is based on k-means and is used in data compression for vector quantisation. It's basically iterative k-means that splits centroids after they converge and keeps splitting until you reach an acceptable number of clusters.
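A compact sketch of the LBG idea, using scikit-learn's KMeans for the Lloyd iterations and a mean-distortion threshold as the stopping rule (the threshold and perturbation size are illustrative choices, and the sketch is in Python rather than Ruby purely for brevity):

```python
import numpy as np
from sklearn.cluster import KMeans

def lbg(data, max_mean_distortion, eps=1e-3):
    # Start from a single centroid and keep splitting until the mean
    # squared distance to the nearest centroid is acceptable.
    centroids = data.mean(axis=0, keepdims=True)
    while True:
        km = KMeans(n_clusters=len(centroids), init=centroids, n_init=1).fit(data)
        centroids = km.cluster_centers_
        done = km.inertia_ / len(data) <= max_mean_distortion
        if done or 2 * len(centroids) > len(data):
            return centroids, km.labels_
        # Split every centroid into a slightly perturbed pair, then refit.
        jitter = eps * data.std(axis=0)
        centroids = np.vstack([centroids + jitter, centroids - jitter])

# e.g. action times (in hours) as a single feature
times = np.array([[8.0], [11.0], [15.0], [16.0], [17.0]])
centroids, labels = lbg(times, max_mean_distortion=1.0)
```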
On the other hand, since your data is one-dimensional, you could do something completely different.
Assume that you've got actions which took place at 5 points in time: (8, 11, 15, 16, 17). Let's plot a Gaussian for each of these actions with μ equal to the time and σ = 3.
Now let's see what the sum of these Gaussians looks like.
It shows a density of actions with a peak around 16.
Based on this observation, I propose the following simple algorithm.
Create a vector of zeroes for the time range of interest.
For each action calculate the Gaussian and add it to the vector.
Scan the vector looking for values which are greater than the maximum value in the vector multiplied by α.
Note that for each action only a small section of the vector needs updates because values of a Gaussian converge to zero very quickly.
You can tune the algorithm by adjusting values of
α ∈ [0,1], which indicates how significant a peak of activity has to be to be noted,
σ, which affects the distance of actions which are considered close to each other, and
the length of the time period represented by each vector element (minutes, seconds, etc.).
Notice that the algorithm is linear in the number of actions. Moreover, it shouldn't be difficult to parallelise: split your data across multiple processes, sum Gaussians in each, and then sum the resulting vectors.
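A minimal sketch of this algorithm - in Python/NumPy rather than Ruby, purely for brevity - using the toy timestamps from above:

```python
import numpy as np

def dense_periods(action_times, sigma=3.0, alpha=0.8, step=1.0):
    # Time axis covering the actions, in units of `step`.
    t = np.arange(min(action_times) - 3 * sigma, max(action_times) + 3 * sigma, step)
    density = np.zeros_like(t)
    for a in action_times:
        density += np.exp(-0.5 * ((t - a) / sigma) ** 2)   # add one Gaussian per action
    # Keep the time points whose density exceeds alpha * max density.
    return t[density > alpha * density.max()]

print(dense_periods([8, 11, 15, 16, 17]))   # the dense region, roughly t = 12..17, centred near 16
```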
Have a look at density-based clustering, e.g. DBSCAN and OPTICS.
This sounds like exactly what you want.
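For example, a minimal DBSCAN sketch on one-dimensional timestamps, assuming scikit-learn (eps and min_samples are the knobs to tune):

```python
import numpy as np
from sklearn.cluster import DBSCAN

times = np.array([8, 11, 15, 16, 17], dtype=float).reshape(-1, 1)
# eps: maximum gap between actions in the same cluster; min_samples: how many
# actions are needed before a group counts as a "mass" action.
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(times)
print(labels)   # [-1, -1, 0, 0, 0]: -1 marks noise, non-negative labels mark dense groups
```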

Efficiently re-computing area under ROC when one label changes

Say that you have a list of scores with binary labels (for simplicity, assume no ties), and that we've used the labels to compute the area under the associated receiver operating characteristic (ROC) curve. For a set of n scores, this calculation is straightforward to do in O(n log n) time -- you simply sort the list, then traverse it in order of decreasing score, keeping a running total of the number of positively labeled examples you've seen so far. Every time you see a negative label, you add the current count of positives to a running sum, and at the end you divide that sum by the product of the number of positives and the number of negatives.
Now, having done that calculation, say that someone comes along and flips exactly one label (from positive to negative or vice versa). The scores themselves do not change, so you don't need to re-sort. It's straightforward to calculate the new area under the curve (AUC) in O(n) time by re-traversing the sorted list. My question is, is it possible to compute the new AUC in something better than O(n)? I.e., do I have to re-traverse the entire sorted list to get the new AUC?
I think I can do the re-calculation in O(1) time by storing a count, at each position in the ranked list, of the number of positives and negatives above that position. But I am going to need to repeatedly calculate the AUC as more labels get flipped. And I think that if I rely on those stored values, then updating them for the next time will be O(n).
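For reference, here is a quick sketch of the baseline O(n log n) computation I described above (plain Python, illustrative names only):

```python
def auc(scores, labels):
    # Sort by score, highest first, then sweep, counting for each negative
    # how many positives have a strictly higher score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positives_seen = 0
    concordant = 0
    for i in order:
        if labels[i] == 1:
            positives_seen += 1
        else:
            concordant += positives_seen
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return concordant / (n_pos * n_neg)

print(auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))   # 0.75
```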
Yes, it is possible to update the AUC in O(log(n)) per flip. You need two sets of scores, one for positives and one for negatives, that provide the following operations:
Querying the number of items with higher (or lower) score than a given value (score of the label being flipped).
Inserting and removing the elements.
Knowing the number of positives above/below a given position lets you update the AUC efficiently, as you already mentioned. After that you have to remove the item from the set of positives/negatives and insert it into the set of negatives/positives, respectively.
Balanced search trees can do both operations in O(log(n)).
Furthermore, actual values of scores do not matter, only position is relevant. This leads to very simple and efficient implementation using binary indexed tree. See http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=binaryIndexedTrees for explanation.
Also, you don't really need to maintain two sets: since the number of negatives above a given position follows from the position and the number of positives above it, a single set is enough.
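A sketch of that approach using a small Fenwick (binary indexed) tree over ranks; the class and method names are my own, and ranks are assumed 0-based with rank 0 holding the highest score (no ties, as in the question):

```python
class Fenwick:
    """Binary indexed tree over 0-based positions."""
    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, pos, delta):
        i = pos + 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i

    def prefix(self, count):
        """Sum of the first `count` positions (0 .. count-1)."""
        total, i = 0, count
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total


class AucTracker:
    """Maintain the AUC under single-label flips in O(log n) per flip."""
    def __init__(self, labels):
        # labels[r] is the 0/1 label of the item at rank r (rank 0 = highest score).
        self.labels = list(labels)
        self.pos = sum(self.labels)
        self.neg = len(self.labels) - self.pos
        self.bit = Fenwick(len(self.labels))      # 1 at every rank holding a positive
        self.concordant = 0                       # positive-above-negative pairs
        seen_pos = 0
        for rank, label in enumerate(self.labels):
            if label == 1:
                self.bit.add(rank, 1)
                seen_pos += 1
            else:
                self.concordant += seen_pos

    def auc(self):
        return self.concordant / (self.pos * self.neg)

    def flip(self, rank):
        pos_above = self.bit.prefix(rank)
        neg_above = rank - pos_above
        if self.labels[rank] == 0:                    # negative -> positive
            self.concordant -= pos_above              # loses pairs with positives above it
            self.neg -= 1
            self.concordant += self.neg - neg_above   # gains pairs with negatives below it
            self.pos += 1
            self.bit.add(rank, 1)
            self.labels[rank] = 1
        else:                                         # positive -> negative
            self.concordant -= self.neg - neg_above   # loses pairs with negatives below it
            self.concordant += pos_above              # gains pairs with positives above it
            self.pos -= 1
            self.neg += 1
            self.bit.add(rank, -1)
            self.labels[rank] = 0
        return self.auc()


tracker = AucTracker([1, 0, 1, 0])     # labels in descending-score order
print(tracker.auc())                   # 0.75
print(tracker.flip(1))                 # flip the item at rank 1 -> 1.0
```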

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to the one solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course the obvious solution is to maintain a running hashset of the elements you encounter and count them at the end. This would yield an exact result, but would require me to carry a potentially large amount of state as I scan through the dataset (i.e. all unique elements encountered so far).
Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there would be a statistical approach to this that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state while I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating point number). It is assumed that these objects can be hashed (ie. you can put them in a HashSet if you want to). Typically they will be strings, or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
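A minimal sketch of that idea with a small hand-rolled Bloom filter (the bit-array size, number of hashes, and use of SHA-256 are arbitrary illustrative choices); false positives cause some genuinely new items to be skipped, which is why this is a lower bound:

```python
import hashlib

class TinyBloom:
    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(repr(item).encode()).digest()
        for k in range(self.n_hashes):
            yield int.from_bytes(digest[4 * k:4 * k + 4], "big") % self.n_bits

    def maybe_contains(self, item):
        return all((self.bits[p // 8] >> (p % 8)) & 1 for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

def lower_bound_unique(stream):
    bloom, count = TinyBloom(), 0
    for item in stream:
        if not bloom.maybe_contains(item):   # definitely not seen before
            count += 1
            bloom.add(item)
    return count

print(lower_bound_unique(str(i % 1000) for i in range(10_000)))   # close to 1000, never above it
```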
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
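A minimal sketch of Linear Counting as described, assuming SHA-256 as the single hash function and an arbitrary bitvector size:

```python
import hashlib
import math

def linear_count(stream, total_bits=1 << 20):
    # One bit per hash bucket; a single hash function is enough.
    bitmap = bytearray(total_bits // 8)
    for item in stream:
        h = int.from_bytes(hashlib.sha256(repr(item).encode()).digest()[:8], "big") % total_bits
        bitmap[h // 8] |= 1 << (h % 8)
    unset_bits = total_bits - sum(bin(b).count("1") for b in bitmap)
    # D = -total_bits * ln(unset_bits / total_bits)
    return -total_bits * math.log(unset_bits / total_bits)

print(linear_count(str(i % 1000) for i in range(100_000)))   # ≈ 1000
```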
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
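A minimal sketch of that hash-range trick, assuming a 32-bit hash derived from SHA-256 and keeping only items whose top `prefix_bits` bits are zero (the names are illustrative):

```python
import hashlib

def estimate_unique(stream, prefix_bits=2):
    # Keep only items whose 32-bit hash starts with `prefix_bits` zero bits,
    # i.e. roughly 1 / 2**prefix_bits of the distinct values, then scale back up.
    kept = set()
    for item in stream:
        h = int.from_bytes(hashlib.sha256(repr(item).encode()).digest()[:4], "big")
        if h >> (32 - prefix_bits) == 0:
            kept.add(item)
    return len(kept) * 2 ** prefix_bits

print(estimate_unique(str(i % 50_000) for i in range(200_000)))   # ≈ 50_000
```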
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.
