Precision-recall values of retrieved documents

I'm learning about precision and recall of documents and I'm having trouble understanding this particular question.
The table below shows the relevance of the top 6 results returned by two ranked retrieval search engines, denoted A and B. '+' indicates a relevant document and '-' indicates a non-relevant document.
Rank:      1   2   3   4   5   6
Engine A:  +   -   -   -   +   +
Engine B:  +   +   -   -   +   -
Assuming that the total number of relevant documents in the collection is 4, compute precision-recall values for the two engines for the top 1, 2, 3, 4, 5 and 6 results.
The solution given for search engine A was:
Top k:      1      2      3      4      5      6
Precision:  100%   50%    33.3%  25%    40%    50%
Recall:     25%    25%    25%    25%    50%    75%
The solution for B:
Top k:      1      2      3      4      5      6
Precision:  100%   100%   66.6%  50%    60%    50%
Recall:     25%    50%    50%    50%    75%    75%
I know how to calculate these for single documents, and that Precision = TP/(TP+FP) and Recall = TP/(TP+FN). I'm just not sure how some of the values above are calculated.

Instead of trying to memorize formulas, try to understand the concepts.
"Precision" is: What proportion of the results are correct? Hence, for both A and B, if you take the top result, it is correct. The precision is 100%.
"Recall" is: What proportion of the correct results are present? Hence, for both A and B, if you take the top result, you have one out of four correct values, so the recall is 25%.

Related

Top k precision

I have a database of documents in which I perform searches. For each and every search there are n positives. Now, if I evaluate the performance of the search by precision#k and recall#k, things work out just fine for the latter:
recall#k = true positives / positives = true positives / n
The amount of true positives is in the range [0, n] so recall#k is in range [0, 1] - perfect.
Things get weird concerning precision#k, however. If I calculate
precision#k = tp / (tp + fp) = tp / k
precision#k is in the range [0, n/k], which doesn't make much sense to me when n < k. Think of the edge case n = 1, for example. One cannot increase tp because there are no more than n positives, and one cannot decrease k either because, well, it's called precision#k, isn't it?
What am I getting wrong?
An example of what I'm talking about can be found in figure 8b of [1]. What you can see there is a precision-recall curve for the top 1..200 query results. Even though there are fewer than 200 positives in the database, the precision is quite high.
[1] https://www.computer.org/csdl/pds/api/csdl/proceedings/download-article/19skfc3ZfKo/pdf
Since precision#k is computed as #num_relevant/k, its maximum is 1 (which happens when all of the k top-ranked documents in your retrieved list are relevant).
Your argument is correct in the sense that if the number of relevant documents is less than k, then you are being unfairly penalized by the P#k metric: in that case, even with a perfect retrieval you don't score 1 on the metric.
A standard solution is therefore to take both into account and compute precision values not at arbitrary values of k but rather at recall points, i.e. at those positions in your ranked list where a relevant document is retrieved. You then divide the sum by the total number of relevant documents. For a single query this is called average precision (AP); averaged over a set of queries it is called mean average precision (MAP). An example of the computation follows.
Let's say that you retrieved 10 documents, out of which 2 are relevant, at ranks 2 and 5 (and there are 3 relevant docs in total, one of which is not retrieved).
You compute precision#k at the recall points (values of k = 2 and 5).
This gives:
1/2 (at position 2, one document is relevant out of 2) +
2/5 (at position 5, two documents are relevant out of 5),
and then you divide this sum by 3 (the total number of known relevant docs), giving (0.5 + 0.4)/3 = 0.3. The last step favours systems that achieve high recall, whereas the cut-off-point-based precisions favour systems that retrieve docs towards the top ranks.
Note that a system which both retrieves the relevant docs at better ranks and retrieves a higher number of relevant docs will score better than a system which fails on either or both counts.
Also note that you score a perfect 1 on this metric if you retrieve the 3 relevant docs at the top 3 ranks out of the 10 that you retrieved in total (check this), which addresses the concern that motivated your question.
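A minimal sketch of this computation (the names are mine): precision is accumulated only at the ranks where a relevant document appears, and the sum is divided by the total number of relevant documents.

// Sketch: average precision (AP) for a single ranked result list.
public class AveragePrecision {
    // relevantAtRank[i] is true if the document at rank i+1 is relevant.
    static double averagePrecision(boolean[] relevantAtRank, int totalRelevant) {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < relevantAtRank.length; i++) {
            if (relevantAtRank[i]) {
                hits++;
                sum += (double) hits / (i + 1);   // precision at this recall point
            }
        }
        return sum / totalRelevant;               // divide by ALL known relevant docs
    }

    public static void main(String[] args) {
        // The example above: 10 retrieved, relevant at ranks 2 and 5, 3 relevant in total.
        boolean[] run = new boolean[10];
        run[1] = true;   // rank 2
        run[4] = true;   // rank 5
        System.out.println(averagePrecision(run, 3));  // (1/2 + 2/5) / 3 = 0.3
    }
}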

Adding, Removing and First missing positive integer

Given an initially empty list, there are three types of queries: 1, 2, 3.
Query 1 x where x is a positive integer indicates adding the number x into the list.
Query 2 x indicates removing x from the list.
Query 3 indicates printing the smallest positive integer not present in the array.
Here x can be from 1 up to 10^9 and the number of queries up to 10^5. Because of the large range of x I can't keep a boolean array marking visited integers. How should I approach this?
There are too many unknowns about your data to give a definitive answer here. The approach differs a lot between at least these different cases:
Few values.
A lot of values but with large gaps.
A lot of values with only small gaps.
Almost all values.
It also depends on which of the mentioned operations you will perform most often.
It is less than 1 GB of data (a bit array over values up to 10^9 needs about 125 MB), so it is possible to keep it as a bit array in memory on most machines. But if the data set is sparse (case 1 and 2 above) you might want to consider sparse arrays instead, or for very sparse sets (case 1) perhaps a binary search tree or a min-heap. The heap is probably not a good idea if you are going to use operation 2 a lot.
For cases 1, 2 and 4 you might consider a range tree. The upside of this solution is that you can do operation 3 in logarithmic time just by going leftwards down the tree and looking at the first range.
It might also be possible to page your data structure out to disk if you are not going to do a lot of random insertions.
You might also consider speeding up the search with a Bloom filter, depending on which data structure you choose in the end.
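As one concrete illustration of the tree-based route (a sketch under my own assumptions, not the only option): since there are at most 10^5 add queries, the smallest missing positive can never exceed 10^5 + 1, so it is enough to keep an ordered set of the values in that range that are currently absent from the list.

import java.util.*;

// Sketch: track only the candidates in [1, LIMIT]; values above LIMIT can never
// be the answer because fewer than LIMIT numbers are ever added.
public class FirstMissingPositive {
    static final int LIMIT = 100_001;
    private final TreeSet<Integer> missing = new TreeSet<>();    // candidates not in the list
    private final Map<Integer, Integer> count = new HashMap<>(); // multiplicities of added values

    FirstMissingPositive() {
        for (int i = 1; i <= LIMIT; i++) missing.add(i);
    }

    void add(int x) {                       // query "1 x"
        count.merge(x, 1, Integer::sum);
        if (x <= LIMIT) missing.remove(x);
    }

    void remove(int x) {                    // query "2 x"
        Integer c = count.get(x);
        if (c == null) return;              // x was never added
        if (c == 1) { count.remove(x); if (x <= LIMIT) missing.add(x); }
        else count.put(x, c - 1);
    }

    int firstMissing() {                    // query "3"
        return missing.first();             // smallest absent positive, O(log n)
    }

    public static void main(String[] args) {
        FirstMissingPositive list = new FirstMissingPositive();
        list.add(1); list.add(2);
        System.out.println(list.firstMissing()); // 3
        list.remove(2);
        System.out.println(list.firstMissing()); // 2
    }
}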

Subset calculation of list of integers

I'm currently implementing an algorithm where one particular step requires me to calculate subsets in the following way.
Imagine I have sets (possibly millions of them) of integers, where each set could potentially contain around 1,000 elements:
Set1: [1, 3, 7]
Set2: [1, 5, 8, 10]
Set3: [1, 3, 11, 14, 15]
...,
Set1000000: [1, 7, 10, 19]
Imagine a particular input set:
InputSet: [1, 7]
I now want to quickly determine of which sets this InputSet is a subset. In this particular case, it should return Set1 and Set1000000.
Now, brute-forcing it takes too much time. I could also parallelise via Map/Reduce, but I'm looking for a more intelligent solution. Also, to a certain extent, it should be memory-efficient. I already optimised the calculation by making use of Bloom filters to quickly eliminate sets of which the input set could never be a subset.
Any smart technique I'm missing out on?
Thanks!
Well, it seems that the bottleneck is the number of sets, so instead of finding a matching set by iterating over all of them, you could improve performance by mapping from elements to all the sets containing them, and returning the sets that contain all the elements you searched for.
This is very similar to what is done in AND query when searching the inverted index in the field of information retrieval.
In your example, you will have:
1 -> [set1, set2, set3, ..., set1000000]
3 -> [set1, set3]
5 -> [set2]
7 -> [set1, set1000000]
8 -> [set2]
...
EDIT:
In an inverted index in IR, to save space we sometimes use d-gaps, meaning we store the offset between documents rather than the actual number. For example, [2,5,10] becomes [2,3,5]. Doing so and using delta encoding to represent the numbers tends to help a lot when it comes to space.
(Of course there is also a downside: you need to read the entire list in order to find out whether a specific set/document is in it, and cannot use binary search, but it is sometimes worth it, especially if it is the difference between fitting the index into RAM or not.)
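A small sketch of the postings-list idea (the element and set ids come from the example above; the class and method names are mine), answering a query by intersecting the lists of the query elements, i.e. the AND query mentioned above:

import java.util.*;

// Sketch: map each element to the ids of the sets containing it, then
// answer a query by intersecting the postings lists of the query elements.
public class SubsetIndex {
    private final Map<Integer, Set<Integer>> postings = new HashMap<>();

    void addSet(int setId, int[] elements) {
        for (int e : elements)
            postings.computeIfAbsent(e, k -> new HashSet<>()).add(setId);
    }

    // Returns the ids of all sets that contain every element of the input.
    Set<Integer> supersetsOf(int[] input) {
        Set<Integer> result = null;
        for (int e : input) {
            Set<Integer> list = postings.getOrDefault(e, Collections.emptySet());
            if (result == null) result = new HashSet<>(list);
            else result.retainAll(list);          // AND of the postings lists
            if (result.isEmpty()) break;          // early exit, nothing can match
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        SubsetIndex index = new SubsetIndex();
        index.addSet(1, new int[]{1, 3, 7});
        index.addSet(2, new int[]{1, 5, 8, 10});
        index.addSet(3, new int[]{1, 3, 11, 14, 15});
        index.addSet(1000000, new int[]{1, 7, 10, 19});
        System.out.println(index.supersetsOf(new int[]{1, 7}));  // {1, 1000000}
    }
}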
How about storing a list of the sets which contain each number?
1 -- 1, 2, 3, 1000000
3 -- 1, 3
5 -- 2
etc.
Extending amit's solution, instead of storing the actual numbers, you could just store intervals and their associated sets.
For example, using an interval size of 5:
(1-5): [1, 2, 3, 1000000]
(6-10): [1, 2, 1000000]
(11-15): [3]
(16-20): [1000000]
In the case of (1,7) you would consider intervals (1-5) and (6-10) (which can be determined simply from the interval size). Intersecting those gives [1, 2, 1000000]. A binary search inside each candidate set then shows that (1,7) is indeed contained in Set1 and Set1000000, but not in Set2.
Though you'll want to check the min and max values for each set to get a better idea of what the interval size should be. For example, 5 is probably a bad choice if the min and max values go from 1 to a million.
You should probably keep it so that a binary search can be used to check for values, so the interval size should be something like (max - min)/N, where 2N is the maximum number of values that will need to be binary searched in each set. For example, "does set 3 contain any values from 5 to 10?" is answered by finding the closest values to 5 (which is 3) and to 10 (which is 11); in this case, no, it does not. You would have to go through each set and do binary searches for the interval values that could be within the set. This means ensuring that you don't go searching for 100 when the set only goes up to 10.
You could also just store the range (min and max). However, the issue is that I suspect your numbers are going to be clustered, thus not providing much use. Although, as mentioned, it'll probably be useful for determining how to set up the intervals.
It'll still be troublesome to pick what interval size to use: too large and it'll take a long time to build the data structure (1000 * million * log(N)); too small and you'll start to run into space issues. The ideal size is probably one that ensures that the number of sets related to each range is approximately equal, while also ensuring that the total number of ranges isn't too high.
Edit:
One benefit is that you don't actually need to store all intervals, just the ones you need. Although, if you have too many unused intervals, it might be wise to adjust the interval size and split or merge the current intervals to ensure that the search stays fast. This is especially true if preprocessing time isn't a major issue.
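A rough sketch of this interval variant (the interval size of 5 matches the example; the class and method names are mine): each bucket records which sets have at least one element in that range, and the candidates that survive the bucket intersection are confirmed with binary searches inside the actual sets.

import java.util.*;

// Sketch: bucket-level inverted index plus a binary-search confirmation pass.
public class IntervalIndex {
    static final int INTERVAL = 5;
    private final Map<Integer, Set<Integer>> buckets = new HashMap<>(); // bucket -> set ids
    private final Map<Integer, int[]> sets = new HashMap<>();           // set id -> sorted elements

    void addSet(int setId, int[] elements) {
        int[] sorted = elements.clone();
        Arrays.sort(sorted);
        sets.put(setId, sorted);
        for (int e : sorted)
            buckets.computeIfAbsent((e - 1) / INTERVAL, k -> new HashSet<>()).add(setId);
    }

    Set<Integer> supersetsOf(int[] input) {
        Set<Integer> candidates = null;
        for (int e : input) {                     // coarse filter on buckets
            Set<Integer> ids = buckets.getOrDefault((e - 1) / INTERVAL, Collections.emptySet());
            if (candidates == null) candidates = new HashSet<>(ids);
            else candidates.retainAll(ids);
        }
        Set<Integer> result = new HashSet<>();
        if (candidates == null) return result;
        for (int id : candidates) {               // exact check with binary search
            boolean all = true;
            for (int e : input)
                if (Arrays.binarySearch(sets.get(id), e) < 0) { all = false; break; }
            if (all) result.add(id);
        }
        return result;
    }

    public static void main(String[] args) {
        IntervalIndex idx = new IntervalIndex();
        idx.addSet(1, new int[]{1, 3, 7});
        idx.addSet(2, new int[]{1, 5, 8, 10});
        idx.addSet(1000000, new int[]{1, 7, 10, 19});
        System.out.println(idx.supersetsOf(new int[]{1, 7}));  // {1, 1000000}
    }
}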
Start searching from the biggest number (7) of the input set and eliminate the sets that don't contain it (only Set1 and Set1000000 remain).
Then search for the other input elements (1) in the remaining sets.

Can I do better than binary search here?

I want to pick the top "range" of cards based upon a percentage. I have all my possible 2 card hands organized in an array in order of the strength of the hand, like so:
AA, KK, AKsuited, QQ, AKoff-suit ...
I had been picking the top 10% of hands by multiplying the length of the card array by the percentage which would give me the index of the last card in the array. Then I would just make a copy of the sub-array:
Arrays.copyOfRange(cardArray, 0, 16);
However, I realize now that this is incorrect because there are more possible combinations of, say, Ace King off-suit (12 combinations, i.e. an ace of one suit and a king of another suit) than there are of, say, a pair of aces (6 combinations).
When I pick the top 10% of hands, therefore, I want it to be based on the top 10% of hands in proportion to the total number of 2-card combinations: 52 choose 2 = 1326.
I thought I could have an array of integers where each index held the combined total of all the combinations up to that point (each index would correspond to a hand from the original array). So the first few indices of the array would be:
6, 12, 16, 22
because there are 6 combinations of AA, 6 combinations of KK, 4 combinations of AKsuited, 6 combinations of QQ.
Then I could do a binary search, which runs in O(log n) time. In other words, I could multiply the total number of combinations (1326) by the percentage, search for the first cumulative count that is greater than or equal to this number, and that would give me the index into the original array that I need.
I wonder if there is a way that I could do this in constant time instead?
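As a concrete sketch of what is described above (the cumulative counts are the ones from the question, truncated; the variable names are mine), the percentage is converted to a target combination count and binary-searched in the cumulative array:

import java.util.*;

// Sketch: find the hand index for a given percentage by binary-searching
// the cumulative combination counts (O(log n) per query).
public class CumulativeSearch {
    public static void main(String[] args) {
        int[] cumulative = {6, 12, 16, 22};            // AA, KK, AKs, QQ (truncated list)
        int totalCombos = 1326;                        // 52 choose 2
        double percent = 0.01;

        int target = (int) Math.round(totalCombos * percent);  // combinations to include
        int idx = Arrays.binarySearch(cumulative, target);
        if (idx < 0) idx = -idx - 1;                   // first index with cumulative >= target
        // Hands 0..idx of the original strength-ordered array form the chosen range.
        System.out.println("last hand index: " + idx); // 2, i.e. AA, KK, AKs
    }
}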
As Groo suggested, if precomputation and memory overhead permit, it would be more efficient to create 6 copies of AA, 6 copies of KK, etc., and store them in a sorted array. Then you could run your original algorithm on this properly weighted list.
This is best if the number of queries is large.
Otherwise, I don't think you can achieve constant time for each query. This is because the queries depend on the entire frequency distribution: you can't look at only a constant number of elements and determine whether it's the correct percentile.
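A minimal sketch of this weighted-expansion idea (the hand order and combination counts come from the question, truncated to four hands; everything else is my own naming): after the one-time expansion, picking the top X% is just an index computation and a sub-list.

import java.util.*;

// Sketch: pre-expand hands by their combination counts so that picking the
// top X% of the 1326 two-card combinations becomes a constant-time index.
public class HandRange {
    public static void main(String[] args) {
        String[] hands  = {"AA", "KK", "AKs", "QQ"};   // strength-ordered, truncated for brevity
        int[]    combos = {  6,    6,     4,    6 };   // 6 ways for a pair, 4 for a suited hand

        // One-time precomputation: repeat each hand once per combination.
        List<String> expanded = new ArrayList<>();
        for (int i = 0; i < hands.length; i++)
            for (int c = 0; c < combos[i]; c++) expanded.add(hands[i]);

        // A query: top 1% of all 1326 two-card combinations (52 choose 2).
        double percent = 0.01;
        int cutoff = Math.min((int) Math.round(1326 * percent), expanded.size());
        // The distinct hands in the first 'cutoff' slots form the chosen range.
        System.out.println(new LinkedHashSet<>(expanded.subList(0, cutoff)));  // [AA, KK, AKs]
    }
}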
We had a similar discussion here: Algorithm for picking thumbed-up items (basically what you want to do with your list of cards). As a comment to my answer, someone suggested a particular data structure: http://en.wikipedia.org/wiki/Fenwick_tree
Also, make sure your data structure will be able to provide efficient access to, say, the range between top 5% and 15% (not a coding-related tip though ;).

Grouping similar sets algorithm

I have a search engine. The search engine generates results when it is searched for a keyword. What I need is to find all other keywords which generate similar results.
For example, keyword k1 gives result set R1 = {1, 2, 3, 4, 5, ..., 40}, which contains up to 40 document ids. I need to get a list of all other keywords K1 which generate results similar to those that k1 generates.
The similarity S(R1, R2) between two result sets R1 and R2 is computed as follows:
S(R1, R2) = 2 * |R1 ∩ R2| / (|R1| + |R2|), i.e. twice the number of elements shared by R1 and R2, divided by the sum of the two set sizes. Example: R1 = {1,2,3} and R2 = {2,3,4,5} gives S(R1, R2) = 2*|{2,3}| / (|{1,2,3}| + |{2,3,4,5}|) = (2*2)/(3+4) = 4/7 ≈ 0.57.
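For reference, a minimal sketch of this similarity measure (it is the Dice coefficient; the class and method names are mine), assuming the result sets are available as sorted arrays of document ids:

import java.util.*;

// Sketch: S(R1, R2) = 2 * |R1 ∩ R2| / (|R1| + |R2|) for two sorted id arrays.
public class ResultSimilarity {
    static double similarity(int[] r1, int[] r2) {
        int i = 0, j = 0, common = 0;
        while (i < r1.length && j < r2.length) {      // merge-style intersection count
            if (r1[i] == r2[j]) { common++; i++; j++; }
            else if (r1[i] < r2[j]) i++;
            else j++;
        }
        return 2.0 * common / (r1.length + r2.length);
    }

    public static void main(String[] args) {
        System.out.println(similarity(new int[]{1, 2, 3}, new int[]{2, 3, 4, 5})); // ~0.571
    }
}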
There are more than 100,000 keywords and thus more than 100,000 result sets. So far I have only been able to solve this problem the hard way, in O(N^2), where each result set is compared to every other set. This takes a lot of time.
Is there someone with a better idea?
Some similar posts which do not completely solve the problem:
How to store sets, to find similar patterns fast?
efficient algorithm to compare similarity between sets of numbers?
One question: are the results in sorted order?
Something that came to mind: combine both sets, sort them, and find the duplicates. This can be done in O(n log n).
To keep the problem simple, suppose that all the keywords have 10 results and that k1 is the keyword to be compared against. Remove 9 results at random from the set of each keyword. Now compare the remaining result with k1's; the keywords with the same remaining result are the ones you want. If a keyword has 1 result in common with k1, there is only a 1% probability that it will remain. A keyword with 5 results in common with k1 will have a 25% probability of remaining. Maybe you think that 1% is still too high; then you can repeat the process above n times, and a keyword with 1 result in common will have a (1%)^n probability of remaining.
The time is O(N).
Is your similarity criterion fixed, or can we vary it a bit to achieve a faster search engine?
Alternative:
An alternative that came to my mind:
Given your result set R1, you could go through its documents and build a histogram over the other keywords that those documents are matched by. Then, if a given alternative keyword gets, say, at least |R1|/2 hits, you list it as "similar".
The big difference is that you do not consider documents that are not in R1 at all.
Exact?
If you need a solution that exactly matches your requirements, I believe it would suffice to compute the R2 sets only for those keywords that satisfy the "alternative" criterion above. I think (a mathematical proof is needed!) that if the "alternative" criterion is not satisfied, there is no chance that yours will be.
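A sketch of this alternative (the document-to-keywords map and all names here are my own assumptions): for every document in R1, count the other keywords that also retrieve it, and keep those reaching the |R1|/2 threshold; the exact similarity then only needs to be computed for these candidates.

import java.util.*;

// Sketch: find keywords whose result sets overlap R1 in at least |R1|/2 documents,
// looking only at the documents that are actually in R1.
public class SimilarKeywords {
    static Set<String> candidates(Set<Integer> r1,
                                  Map<Integer, List<String>> keywordsByDoc) {
        Map<String, Integer> hits = new HashMap<>();
        for (int doc : r1)
            for (String kw : keywordsByDoc.getOrDefault(doc, Collections.emptyList()))
                hits.merge(kw, 1, Integer::sum);        // histogram over other keywords

        Set<String> result = new HashSet<>();
        int threshold = r1.size() / 2;                   // the "|R1|/2 hits" criterion
        for (Map.Entry<String, Integer> e : hits.entrySet())
            if (e.getValue() >= threshold) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> keywordsByDoc = new HashMap<>();
        keywordsByDoc.put(1, Arrays.asList("k2", "k3"));
        keywordsByDoc.put(2, Arrays.asList("k2"));
        keywordsByDoc.put(3, Arrays.asList("k4"));
        Set<Integer> r1 = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        System.out.println(candidates(r1, keywordsByDoc)); // [k2]
    }
}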
