I have a database of documents in which I perform searches. For each and every search there are n positives. Now, if I evaluate the performance of the search by precision#k and recall#k, things work out just fine for the latter:
recall#k = true positives / positives = true positives / n
The number of true positives is in the range [0, n], so recall#k is in the range [0, 1] - perfect.
Things get weird concerning precision#k, however. If I calculate
precision#k = tp / (tp + fp) = tp / k
precision#k is in the range [0, n/k] (when n < k), which doesn't make much sense to me. Think of the edge case n=1, for example. One cannot increase the tp because there are no more than n positives, and one cannot decrease k either because, well, it's called precision#k, isn't it?
What am I getting wrong?
An example of what I'm talking about can be found in [1] figure 8b. What you can see there is a precision-recall curve for the top 1..200 query results. Even though there are fewer than 200 positives in the database, the precision is quite high.
[1] https://www.computer.org/csdl/pds/api/csdl/proceedings/download-article/19skfc3ZfKo/pdf
Since precision#k is computed as #num_relevant/k, its maximum is 1 (which happens if all of the k top-ranked documents in your retrieved list are relevant).
Your argument is correct in the sense that if #relevant_docs is less than k, then you're being wrongly penalized by the P#k metric, because in that case even with a perfect retrieval you don't score 1 on the metric.
A standard solution is thus to take both into account and compute precision values not at arbitrary values of k but rather at the recall points, i.e. at those positions in your ranked list where a relevant document is retrieved. You then divide the sum by the number of relevant documents. This measure is called average precision (AP); its mean over a set of queries is the mean average precision (MAP). An example of computing it follows.
Let's say that you retrieved 10 documents, out of which 2 are relevant, at ranks 2 and 5 (and there are 3 relevant docs in total, one of which is not retrieved).
You compute precision#k at the recall points (values of k = 2 and 5).
This gives:
1/2 (at position 2, one document is relevant out of 2) +
2/5 (at position 5, two documents are relevant out of 5)
and then you divide this number by 3 (total number of known rel docs). The last step favours systems that achieve high recall whereas the cut-off point based precisions favour systems that retrieve docs towards top ranks.
Note that a system A which retrieves the relevant docs at better ranks and retrieves a higher number of relevant docs will score better than a system which fails on either or both counts.
Also note that you'll score a perfect 1 on this metric if you retrieve the 3 rel docs at the top 3 ranks out of 10 that you retrieved in total (check this), which addresses your concern that motivated this question.
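For concreteness, a minimal Python sketch of this computation (a plain list of relevance judgements stands in for the ranked result list):

def average_precision(ranked_relevance, total_relevant):
    """ranked_relevance: list of booleans, True where the document at that rank
    is relevant; total_relevant: number of relevant documents in the collection.
    Sums precision#k at the recall points and divides by total_relevant."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / total_relevant

# The example above: relevant documents at ranks 2 and 5, 3 relevant in total.
ranking = [False, True, False, False, True, False, False, False, False, False]
print(average_precision(ranking, 3))   # (1/2 + 2/5) / 3 = 0.3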
Related
I have been given the following problem: there are n files with lengths z1, ..., zn and usages u1, ..., un, where the sum of u1, ..., un equals 1 and 0 < u_i < 1.
We want an ordering for which the expected time to fetch a file from storage is minimal. For example, if z1 = 12, z2 = 3, u1 = 0.9 and u2 = 0.1, and file 1 is stored first, the expected access time is 12*0.9 + 15*0.1.
My task: Prove that this (greedy) algorithm is optimal.
My Question: Is my answer to that question correct or what should I improve?
My answer:
Suppose the algorithm is not optimal; then there has to exist an order that is more efficient. Two factors have to be considered: the usage and the length. The more a file is used, the shorter its access time has to be, which means the files placed before it should be as short as possible. If the files were instead sorted by z_i/u_i in descending order, the files with high usage would be placed last. Since the access time of a file is the sum of all lengths before it times its usage, this means the frequently used files would be accessed slowly, which is a contradiction to efficiency. Now suppose the ratio z_i/u_i itself were inefficient. Dividing by u_i has the consequence that the higher the usage, the smaller the term, so more frequently used files are accessed earlier (note that 0 < u_i < 1). Deviating from that division would mean files with higher usage are no longer preferred, which would be a contradiction to efficiency. Likewise, because z_i is in the numerator, shorter files are preferred first; deviating from that would mean longer files could be preferred, and taking longer files first is also a contradiction to efficiency. Since every alternative sorting leads to a contradiction, the sorting by z_i/u_i is optimal and correct.
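For reference, a small Python sketch of the example's computation, assuming the greedy rule under discussion is "store the files in ascending order of z_i/u_i":

def expected_access_time(files):
    """files: list of (length, usage) pairs in storage order, usages summing to 1.
    Fetching a file means reading past every file stored before it plus itself."""
    prefix = 0.0
    total = 0.0
    for length, usage in files:
        prefix += length
        total += prefix * usage
    return total

files = [(12, 0.9), (3, 0.1)]
greedy_order = sorted(files, key=lambda f: f[0] / f[1])   # ascending z_i / u_i
print(expected_access_time(greedy_order))   # 12*0.9 + 15*0.1 = 12.3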
Given an empty list. There are three types of queries 1, 2, 3.
Query 1 x where x is a positive integer indicates adding the number x into the list.
Query 2 x indicates removing x from the list.
Query 3 indicates printing the smallest positive integer not present in the list.
Here x can be from 1 up to 10^9, and the number of queries is up to 10^5. Because of the large range of x, I can't keep a boolean array marking visited integers. How should I approach this?
There are too many unknowns about your data to give a definitive answer here. The approach differs a lot between at least these different cases:
Few values.
A lot of values but with large gaps.
A lot of values with only small gaps.
Almost all values.
It also depends on which of the mentioned operations you will do the most.
A bit array over the full range is less than 1 GB of data (10^9 bits is about 125 MB), so it is possible to keep it in memory on most machines. But if the data set is sparse (cases 1 and 2 above) you might want to consider sparse arrays instead, or for very sparse sets (case 1) perhaps a binary search tree or a min-heap. The heap is probably not a good idea if you are going to use operation 2 a lot.
For cases 1, 2 and 4 you might consider a range tree. The upside of this solution is that you can do operation 3 in logarithmic time just by going leftwards down the tree and looking at the first range.
It might also be possible to page out your data structure to disk if you are not going to do a lot of random insertions.
You might also consider speeding up the search with a Bloom filter, depending on what type of data structure you choose in the end.
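For the sparse cases, a minimal Python sketch of the sorted-structure idea (a plain sorted list stands in for a balanced tree, which is workable for ~10^5 queries; a real BST or order-statistics tree would make the updates logarithmic too):

import bisect

class SmallestMissing:
    def __init__(self):
        self.present = []                  # sorted list of distinct values

    def add(self, x):                      # query "1 x"
        i = bisect.bisect_left(self.present, x)
        if i == len(self.present) or self.present[i] != x:
            self.present.insert(i, x)

    def remove(self, x):                   # query "2 x"
        i = bisect.bisect_left(self.present, x)
        if i < len(self.present) and self.present[i] == x:
            self.present.pop(i)

    def smallest_missing(self):            # query "3"
        # present is sorted and distinct, so present[i] >= i + 1, and the
        # predicate present[i] > i + 1 is monotone: binary-search the first gap.
        lo, hi = 0, len(self.present)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.present[mid] > mid + 1:
                hi = mid
            else:
                lo = mid + 1
        return lo + 1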
I have the following task:
there are 10M documents
there are 100K unique labels
each document has 100 labels
For each label X I need to find the top 10 labels Y that co-occur with X in documents, ordered by the number of documents where X and Y are both present.
The task appears to be quite complex to solve:
although the result set is only 10 records for each of 100K labels
the straightforward algorithm of keeping counts for all combinations as you go is very demanding on memory: there are 0.5*10^10 total combinations of (X,Y), and this grows as n^2, where n is the number of labels
Is there some way to solve this without keeping all combinations in memory, or to break it into a parallel algorithm (similar to MapReduce)? What if I don't need it to be 100% accurate?
I think in the general case, you won't be able to avoid really awful runtimes - with 5050 pairs in each document, and 10M documents, all combinations seem possible.
However, in typical real-world data, you seldom have to deal with an "adversarial" input. One potential solution is to first count the occurrences of all 100K terms, sort them, and then for each term X, do the following:
If there are many docs with X (i.e., not less than 1% of the document count, or some other tweakable fraction), run queries against your index in the form X & Y, starting with the most popular terms and going down, keeping a heap of size 10 to track the most popular pairs. You know that docs(X & Y) <= min(docs(X), docs(Y)), so it is very likely you will be able to short-circuit this process early.
If there are few docs with X, it is far more prudent to simply scan through all of the documents with that term and aggregate the totals yourself.
For a well-behaved document set, where the 100K terms follow a logarithmic curve with respect to document counts, you will do far less than (100)^2 * 10M work, which the naive solution would require in all cases. Granted, for non-well-behaved document sets, you will wind up doing even more work, but that shouldn't happen in the real world.
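A rough Python sketch of the "popular term" branch (postings, doc_count and terms_by_freq are placeholders for whatever your index provides: the doc-id sets per label, their sizes, and the labels sorted by descending document count):

import heapq

def top10_pairs_for_term(x, postings, doc_count, terms_by_freq):
    heap = []                                 # min-heap of (co_count, label), size <= 10
    for y in terms_by_freq:
        if y == x:
            continue
        # docs(X & Y) <= min(docs(X), docs(Y)) <= docs(Y), and the Y's arrive in
        # decreasing docs(Y), so once docs(Y) can't beat the 10th place we can stop.
        if len(heap) == 10 and doc_count[y] <= heap[0][0]:
            break
        co = len(postings[x] & postings[y])   # the "X & Y" query against the index
        if len(heap) < 10:
            heapq.heappush(heap, (co, y))
        elif co > heap[0][0]:
            heapq.heapreplace(heap, (co, y))
    return sorted(heap, reverse=True)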
As to "not 100% accurate", that's far too vague a specification to work with. What kind of error is permissible? How much of it?
--- Comment Response (too large for comment) ---
a) Think about determining the maximum of 100 million elements. You only need to save the best one you have so far as you scan - the same principle applies to determining the top X of N items. Add incoming elements to a binary heap, and remove the weakest elements when the size of the heap exceeds X. At the end, you will have the top X.
b) Imagine you are determining the top 10 X&Y pairs, where X="Elephant". Suppose that, after scanning 1000 Y terms, you have a heap of size 10, where the minimum-scoring pair has count 300. Now suppose the 1001st term you check has doc count 299. Since only 299 docs have the Y term, at most 299 docs have X&Y as well, so it cannot possibly be better than any of the top 10 pairs you have so far; and since all the Y terms have been sorted by doc frequency, you now know that you don't have to check any more pairs! This is what the min bound guarantees you.
c) The choice you make for each X is purely an optimization decision. If you have many X's that only exist in a small number of documents, that is a good problem to have - it means less work per term.
d) If you can live with some non-zero probability of the top 10 being wrong (for each term), you can probably cut way down on run-time by using a sampling method instead of a full, rigorous scan of the index. The more prevalent a term X is in the doc index, the fewer documents you have to scan (proportionally) before you are likely to have the correct top 10 X&Y pairs based on the info you have gathered. Coming up with exact numbers in this regard requires some knowledge of the expected distribution of terms in the underlying index. In particular: how much do terms correlate? What does the number N(X)/MAXY(X) look like in general, where N(X) is the number of documents with the term X, and MAXY(X) is the number of documents with the pair X&Y, maximized over all terms Y != X?
I think even the worst case is not as bad as you might fear. If there are N docs, M distinct labels, but only K labels per document, then a complete histogram has a hard limit of K*K*N/2 distinct nonzero entries (about 5*10^10 with your numbers); in fact it will be far fewer.
BTW: I think the above point is implicit in torquestomp's answer, so unless you are particularly interested in hard limits, you should accept his answer.
I'm currently implementing an algorithm where one particular step requires me to calculate subsets in the following way.
Imagine I have sets (possibly millions of them) of integers, where each set could potentially contain around a thousand elements:
Set1: [1, 3, 7]
Set2: [1, 5, 8, 10]
Set3: [1, 3, 11, 14, 15]
...,
Set1000000: [1, 7, 10, 19]
Imagine a particular input set:
InputSet: [1, 7]
I now want to quickly determine of which sets this InputSet is a subset. In this particular case, it should return Set1 and Set1000000.
Now, brute-forcing it takes too much time. I could also parallelise via Map/Reduce, but I'm looking for a more intelligent solution. Also, to a certain extent, it should be memory-efficient. I already optimised the calculation by making use of Bloom filters to quickly eliminate sets of which the input set could never be a subset.
Any smart technique I'm missing out on?
Thanks!
Well - it seems that the bottleneck is the number of sets, so instead of finding a match by iterating over all of them, you could improve performance by mapping each element to all the sets containing it, and then returning the sets that contain all of the elements you searched for.
This is very similar to what is done for an AND query when searching an inverted index in the field of information retrieval.
In your example, you will have:
1 -> [set1, set2, set3, ..., set1000000]
3 -> [set1, set3]
5 -> [set2]
7 -> [set1, set1000000]
8 -> [set2]
...
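A minimal Python sketch of this element-to-sets mapping and the AND query (assuming each set is given as a list of integers and the input set is non-empty):

from collections import defaultdict

def build_index(sets_by_name):
    index = defaultdict(set)               # element -> names of sets containing it
    for name, elements in sets_by_name.items():
        for e in elements:
            index[e].add(name)
    return index

def supersets_of(input_set, index):
    # Intersect the posting lists, shortest first, bailing out early when empty.
    postings = sorted((index.get(e, set()) for e in input_set), key=len)
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
        if not result:
            break
    return result

# The example from the question:
sets_by_name = {
    "Set1": [1, 3, 7],
    "Set2": [1, 5, 8, 10],
    "Set3": [1, 3, 11, 14, 15],
    "Set1000000": [1, 7, 10, 19],
}
index = build_index(sets_by_name)
print(supersets_of([1, 7], index))         # {'Set1', 'Set1000000'}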
EDIT:
In inverted indexes in IR, to save space we sometimes use d-gaps - meaning we store the offset from the previous document rather than the actual number. For example, [2,5,10] becomes [2,3,5]. Doing so and using delta encoding to represent the numbers tends to help a lot when it comes to space.
(Of course there is also a downside: you need to read the entire list in order to find out whether a specific set/document is in it, and cannot use binary search, but it is sometimes worth it, especially if it is the difference between fitting the index into RAM or not.)
How about storing a list of the sets which contain each number?
1 -- 1, 2, 3, 1000000
3 -- 1, 3
5 -- 2
etc.
Extending amit's solution, instead of storing the actual numbers, you could just store intervals and their associated sets.
For example, using an interval size of 5:
(1-5): [1,2,3,1000000]
(6-10): [1,2,1000000]
(11-15): [3]
(16-20): [1000000]
In the case of (1,7) you should consider intervals (1-5) and (6-10) (which can be determined simply by knowing the size of the interval). Intersecting those lists gives you the candidates [1,2,1000000]. A binary search within each candidate set then shows that (1,7) is indeed contained in Set1 and Set1000000, but not in Set2.
Though you'll want to check the min and max values for each set to get a better idea of what the interval size should be. For example, 5 is probably a bad choice if the min and max values go from 1 to a million.
You should probably keep the sets sorted so that a binary search can be used to check for values; the interval size should then be something like (max - min)/N, where 2N is the maximum number of values that will need to be binary searched in each set. For example, "does Set3 contain any values from 5 to 10?" is answered by finding the closest values to 5 (which is 3) and to 10 (which is 11); in this case, no, it does not. You would have to go through each set and do binary searches for the interval values that could be within the set, making sure you don't go searching for 100 when the set only goes up to 10.
You could also just store the range (min and max) of each set. However, the issue is that I suspect your numbers are going to be clustered, thus not providing much use. Although, as mentioned, it'll probably be useful for determining how to set up the intervals.
It'll still be tricky to pick what interval size to use: too large and it'll take a long time to build the data structure (1000 * million * log(N)); too small and you'll start to run into space issues. The ideal interval size is probably one that makes the number of sets related to each interval approximately equal, while also ensuring that the total number of intervals isn't too high.
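A rough Python sketch of this interval idea (INTERVAL and the helper names are placeholders; buckets here are value // INTERVAL rather than the 1-based ranges above, which only shifts the boundaries):

import bisect
from collections import defaultdict

INTERVAL = 5                                # tunable bucket width, see discussion above

def build_interval_index(sets_by_name):
    index = defaultdict(set)                # bucket id -> names of sets touching it
    sorted_sets = {}                        # name -> sorted elements, for binary search
    for name, elements in sets_by_name.items():
        sorted_sets[name] = sorted(elements)
        for e in elements:
            index[e // INTERVAL].add(name)
    return index, sorted_sets

def contains(sorted_elements, x):
    i = bisect.bisect_left(sorted_elements, x)
    return i < len(sorted_elements) and sorted_elements[i] == x

def find_supersets(input_set, index, sorted_sets):
    # Candidates: sets that touch the bucket of every input element.
    candidates = None
    for e in input_set:
        bucket = index.get(e // INTERVAL, set())
        candidates = set(bucket) if candidates is None else candidates & bucket
    # Verify each candidate with one binary search per input element.
    return {name for name in (candidates or set())
            if all(contains(sorted_sets[name], e) for e in input_set)}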
Edit:
One benefit is that you don't actually need to store all intervals, just the ones you need. Although, if you have too many unused intervals, it might be wise to adjust the interval size and split the current intervals to ensure that the search stays fast. This is especially true if preprocessing time isn't a major issue.
Start searching from the biggest number (7) of the input set and
eliminate the other sets (Set1 and Set1000000 will be returned).
Then search for the other input elements (1) in the remaining sets.
I'm making some exercises on combinatorics algorithm and trying to figure out how to solve the question below:
Given a group of 25 bits, set (choose) 15 of them (combinations without repetition; order does NOT matter):
n!/(k!(n-k)!) = 25!/(15!(25-15)!) = 3,268,760
Now, for each of these possibilities, construct a matrix where I cross every unique 25-bit member against every other 25-bit member, where the relation between them must have at least 11 common set bits (only ones, not zeroes).
Let me try to illustrate by representing it as binary data, so the first member would be:
0000000000111111111111111 (10 zeros and 15 ones) or (15 bits set on 25 bits)
0000000001011111111111111 second member
0000000001101111111111111 third member
0000000001110111111111111 and so on....
...
1111111111111110000000000 up to here, the 3,268,760th member.
Now, crossing these values in a matrix, for 1 x 1 I have 15 bits in common. Since the result is >= 11, it is a "useful" result.
For 1 x 2 we have 14 bits in common, so that is also a valid result.
Doing that for all members, finally, crossing 1 x 3,268,760 results in 5 bits in common, so since that is < 11 it is not "useful".
What I need to find out (by math or algorithm) is the minimum number of members needed to cover all possibilities with at least 11 bits in common.
In other words, a group of N members such that every member of the whole 3,268,760-member universe has at least 11 bits in common with at least one of them.
Using a brute force algorithm I found out that it is possible to achieve this with 81 25-bit members. But I'm guessing that this number should be smaller (something near 12).
I was trying to use a brute force algorithm to test all possible combinations of 12 members out of the 3,268,760, but the number of possibilities is so huge that it would take more than a hundred years to compute (about 3.156x10^69 combinations).
I've googled about combinatorics, but there are so many fields that I don't know which one this problem fits into. So any directions on which field of combinatorics, or any algorithm for this issue, would be greatly appreciated.
PS: Just for reference, the "likeness" of two members is calculated using:
(Not(a xor b)) and a
After that, a small recursive loop counts the bits, giving the number of common bits.
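For reference, the same check written in Python; note that (Not(a xor b)) and a keeps exactly the bits set in both members, so it reduces to a and b:

def common_set_bits(a, b):
    # popcount of (~(a ^ b)) & a, i.e. of a & b: positions set in both members.
    return bin(a & b).count("1")

def useful(a, b, threshold=11):
    return common_set_bits(a, b) >= threshold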
EDIT: As promised (@btilly) in the comment below, here's the 'fractal' image of the relations: link to image
The color scale ranges from red (15bits match) to green (11bits match) to black for values smaller than 10bits.
This image is just a sample of the first 4096 groups.
tl;dr: you want to solve dominating set on a large, extremely symmetric graph. btilly is right that you should not expect an exact answer. If this were my problem, I would try local search starting with the greedy solution. Pick one set and try to get rid of it by changing the others. This requires data structures to keep track of which sets are covered exactly once.
EDIT: Okay, here's a better idea for a lower bound. For every k from 1 to the value of the optimal solution, there's a lower bound of [25 choose 15] * k / [maximum joint coverage of k sets]. Your bound of 12 (actually 10 by my reckoning, since you forgot some neighbors) corresponds to k = 1. Proof sketch: fix an arbitrary solution with m sets and consider the most coverage that can be obtained by k of the m. Build a fractional solution where all symmetries of the chosen k are averaged together and scaled so that each element is covered once. The cost of this solution is [25 choose 15] * k / [maximum joint coverage of those k sets], which is at least as large as the lower bound we're shooting for. It's still at least as small, however, as the original m-set solution, as the marginal returns of each set are decreasing.
Computing maximum coverage is in general hard, but there's a factor (e/(e-1))-approximation (≈ 1.58) algorithm: greedy, which it sounds as though you could implement quickly (note: you need to choose the set that covers the most uncovered other sets each time). By multiplying the greedy solution by e/(e-1), we obtain an upper bound on the maximum coverage of k elements, which suffices to power the lower bound described in the previous paragraph.
Warning: if this upper bound is larger than [25 choose 15], then k is too large!
This type of problem is extremely hard, you should not expect to be able to find the exact answer.
A greedy solution should produce a "fairly good" answer. But... how to be greedy?
The idea is to always choose the next element to be the one that matches as many currently unmatched possibilities as possible. Unfortunately, with over 3 million possible members that you have to try to match against millions of unmatched members (note: your best next guess might already match another member in your candidate set), even choosing that next element is probably not feasible.
So we'll have to be greedy about choosing the next element. We will choose each bit to maximize the sum of the probabilities of eventually matching all of the currently unmatched elements.
For that we will need a 2-dimensional lookup table P such that P(n, m) is the probability that two random members will turn out to have at least 11 bits in common, if m of the first n bits that are 1 in the first member are also 1 in the second. This table of 225 probabilities should be precomputed.
This table can easily be computed using the following rules:
P(15, m) is 0 if m < 11, 1 otherwise.
For n < 15:
P(n, m) = P(n+1, m+1) * (15-m) / (25-n) + P(n+1, m) * (10-n+m) / (25-n)
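A small Python sketch of precomputing this table, transcribing the recurrence directly (states with m < n - 10 cannot occur, because the second member has only 10 zero bits, so they are skipped):

def build_probability_table():
    P = [[0.0] * 16 for _ in range(16)]     # P[n][m] for 0 <= m <= n <= 15
    for m in range(16):
        P[15][m] = 1.0 if m >= 11 else 0.0
    for n in range(14, -1, -1):
        for m in range(max(0, n - 10), n + 1):
            P[n][m] = (P[n + 1][m + 1] * (15 - m) / (25 - n)
                       + P[n + 1][m] * (10 - n + m) / (25 - n))
    return P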
Now let's start with a few members that are "very far" from each other. My suggestion would be:
First 15 bits 1, rest 0.
First 10 bits 0, rest 1.
First 8 bits 1, last 7 1, rest 0.
Bits 1-4, 9-12, 16-23 are 1, rest 0.
Now starting with your universe of (25 choose 15) members, eliminate all of those that match one of the elements in your initial collection.
Next we go into the heart of the algorithm.
While there are unmatched members:
    Find the bit that appears in the most unmatched members (break ties randomly)
    Make that the first set bit of our candidate member for the group.
    While the candidate member has less than 15 set bits:
        Let p_best = 0, bit_best = 0
        For each unset bit:
            Let p = 0
            For each unmatched member:
                p += P(n, m) where m = number of bits in common between
                                       candidate member+this bit and the unmatched member
                               and n = bits in candidate member + 1
            If p_best < p:
                p_best = p
                bit_best = this unset bit
        Set bit_best as the next bit in our candidate member.
    Add the candidate member to our collection
    Remove all unmatched members that match this from unmatched members
The list of candidate members is our answer
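Purely for illustration, a rough Python transcription of this loop (the names are mine, and a full run over all C(25,15) members in pure Python will be slow, in line with the operation-count estimate below):

def popcount(x):
    return bin(x).count("1")

def greedy_cover(universe, P, seeds):
    # universe: list of 25-bit ints with exactly 15 set bits, e.g.
    #   [sum(1 << b for b in c) for c in itertools.combinations(range(25), 15)]
    # P: table from build_probability_table(); seeds: the four "far apart" members.
    def matches(a, b):
        return popcount(a & b) >= 11

    unmatched = [mem for mem in universe if not any(matches(mem, s) for s in seeds)]
    collection = list(seeds)
    while unmatched:
        # Bit appearing in the most unmatched members (ties broken arbitrarily).
        counts = [sum((mem >> b) & 1 for mem in unmatched) for b in range(25)]
        candidate = 1 << max(range(25), key=lambda b: counts[b])
        while popcount(candidate) < 15:
            n = popcount(candidate) + 1
            best_p, best_bit = -1.0, None
            for b in range(25):
                if candidate & (1 << b):
                    continue
                trial = candidate | (1 << b)
                p = sum(P[n][popcount(trial & mem)] for mem in unmatched)
                if p > best_p:
                    best_p, best_bit = p, b
            candidate |= 1 << best_bit
        collection.append(candidate)
        unmatched = [mem for mem in unmatched if not matches(mem, candidate)]
    return collection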
I have not written code, so I have no idea how good an answer this algorithm will produce. But assuming that it does no better than your current solution, for 77 candidate members (we cheated and started with 4) you have to make 271 passes through your unmatched candidates (25 to find the first bit, 24 to find the second, and so on down to 11 to find the 15th, plus one more to remove the matched members). That's 20867 passes. If you have an average of 1 million unmatched members, that's on the order of 20 billion operations.
This won't be quick. But it should be computationally feasible.