I have the following task:
there are 10M documents
there are 100K unique labels
each document has 100 labels
For each label X I need to find the top 10 labels Y that co-occur with X, ordered by the number of documents in which both X and Y are present.
The task appears to be quite complex to solve:
although the result set is only 10 records for each of 100K labels
the straightforward algorithm of keeping counts for all combinations as you go is very memory-hungry: there are on the order of 5*10^9 possible (X,Y) pairs, and that number grows as n^2, where n is the number of labels
Is there some way I solve this without keeping all combinations in memory or break into a parallel algorithm (similar to map reduce) to solve? What if I don't need it to be 100% accurate?
I think in the general case, you won't be able to avoid really awful runtimes - with C(100,2) = 4,950 pairs in each document and 10M documents, essentially every pair combination can occur.
However, in typical real-world data, you seldom have to deal with an "adversarial" input. One potential solution is to first count the occurrences of all 100K terms, sort them, and then for each term X, do the following:
If there are many docs with X (say, not less than 1% of the document count, or some other tweakable fraction), run queries against your index of the form X & Y, starting with the most popular terms and going down, keeping a heap of size 10 to track the most popular pairs. You know that (docs with X & Y) <= min(docs with X, docs with Y), so it is very likely you will be able to short-circuit this process early.
If there are few docs with X, it is far more prudent to simply scan through all of the documents with that term and aggregate the totals yourself.
For a well-behaved document set, where the 100K terms follow a logarithmic (long-tailed) curve with respect to document counts, you will do far less than the roughly 100^2/2 * 10M, or about 5*10^10, pair updates the naive solution requires in all cases. Granted, for badly behaved document sets you will wind up doing even more work, but that shouldn't happen in the real world.
As to "not 100% accurate", that's far too vague a specification to work with. What kind of error is permissible? How much of it?
--- Comment Response (too large for comment) ---
a) Think about determining the maximum of 100 million elements. You only need to keep the best one seen so far as you scan - the same principle applies to determining the top X of N items. Add incoming elements to a binary heap, and remove the weakest element whenever the size of the heap exceeds X. At the end, you will have the top X.
b) Imagine you are determining the top 10 X & Y pairs, where X = "Elephant". Suppose that, after scanning 1,000 Y terms, you have a heap of size 10 whose minimum-scoring pair has count 300. Now suppose the 1,001st term you check has a doc count of 299. Since only 299 docs contain that Y term, at most 299 docs contain X & Y as well, so it cannot possibly beat any of the top 10 pairs you have so far - and since the Y terms are sorted by doc frequency, you now know you don't have to check any more pairs at all. This is what the min bound guarantees you (a code sketch of this loop follows this list).
c) The choice you make for each X is purely an optimization decision. If you have many X's that only exist in a small number of documents, that is a good problem to have - it means less work per term.
d) If you can live with some non-zero probability of the top 10 being wrong (for each term), you can probably cut way down on run-time by using a sampling method instead of a full, rigorous scan of the index. The more prevalent a term X is in the doc index, the fewer documents you have to scan (proportionally) before you are likely to have the correct top 10 X & Y pairs based on the info you have gathered. Coming up with exact numbers here requires some knowledge of the expected distribution of terms in the underlying index. In particular: how strongly do terms correlate, and what does the ratio N(X)/MAXY(X) look like in general, where N(X) is the number of documents containing the term X and MAXY(X) is the number of documents containing the pair X & Y, maximized over all terms Y != X?
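A minimal sketch of the heap-plus-cutoff loop from (a) and (b), in Python. The names terms_by_freq, doc_count and docs_with_pair are assumptions standing in for whatever index you actually have: the terms sorted by document frequency, a term -> document-count map, and an index query returning the number of documents containing both terms.

import heapq

def top10_pairs(x, terms_by_freq, doc_count, docs_with_pair):
    heap = []  # min-heap of (pair_count, term), at most 10 entries
    for y in terms_by_freq:            # most frequent terms first
        if y == x:
            continue
        # Cutoff from (b): docs(X & Y) <= min(docs(X), docs(Y)), and the Y
        # terms are sorted by frequency, so once doc_count[y] cannot beat
        # the weakest pair in the heap, no later term can either.
        if len(heap) == 10 and doc_count[y] <= heap[0][0]:
            break
        c = docs_with_pair(x, y)
        if len(heap) < 10:
            heapq.heappush(heap, (c, y))
        elif c > heap[0][0]:
            heapq.heapreplace(heap, (c, y))
    return sorted(heap, reverse=True)  # highest co-occurrence counts first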
I think even the worst case is not as bad as you might fear. If there are N docs and M distinct labels but only K labels per document, then a complete histogram has at most K*K*N/2 distinct nonzero entries (5*10^10 with your numbers); in practice it will be far less.
BTW: I think the above point is implicit in torquestomp's answer, so unless you are particularly interested in hard limits, you should accept his answer.
Related
I have distributed 50 million IDs within a numeric space of size 10^30. The IDs are distributed randomly; no series or inverse function can be derived. For example, the minimum and the maximum are:
25083112306903763728975529743
29353757632236106718171971627
Two consecutive IDs are separated by a distance on the order of at least 10^19. For example:
28249462572807242052513352500
28249462537043093417625790615
This distribution is resistant to a brute-force attack, since finding the ID consecutive to a known one would take at least 10^19 probes (for a sense of timing: at 1,000 probes per second, that is 10^16 seconds...).
Are there other search algorithms over this space that could take less time and make my ID distribution less secure?
If your 50 million IDs really are randomly distributed in a space of 10^30, you can't do anything better than brute force.
This means you can only iterate over the 10^30 values in some order, and on average you have to test 10^30 / (5*10^7) = 2*10^22 of them before finding one.
Of course, there exists some algorithm that would find all of them on the first try, but it's extremely unlikely you would stumble on it without knowing the IDs first.
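As a back-of-the-envelope check, reusing the question's figure of 1,000 tests per second:

space = 10 ** 30
num_ids = 5 * 10 ** 7
expected_tests = space // num_ids      # 2 * 10**22 values tested on average
seconds = expected_tests // 1000       # about 2 * 10**19 seconds at 1,000 tests/s
years = seconds // (365 * 24 * 3600)   # on the order of 10**11 years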
Imagine you have N distinct people and M records of where these people have been.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e., person A and person B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurrences?
So far I've built an N-by-N table and looped through each record, incrementing table(A, B) every time person A co-occurs with person B in a record. Obviously this is an awful approach and takes O(N^2) to O(N^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then, for example in Python:
from collections import defaultdict
from itertools import combinations

count = defaultdict(int)
answer = []
for s in sets:                             # each s is one record (a set of person ids)
    for i, j in combinations(sorted(s), 2):
        count[(i, j)] += 1
        if count[(i, j)] == threshold:     # record each qualifying pair exactly once
            answer.append((i, j))
If you have M sets of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore, the same algorithm can readily be implemented in a distributed way using map-reduce. For the count you just emit a key of (i, j) and a value of 1, and in the reduce step you sum them up. Generating the list of intersecting sets is similar.
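A hedged sketch of that map-reduce formulation, written as plain Python map/reduce functions rather than against a particular framework; record stands for one input line (the set of people seen together) and threshold is the question's cut-off:

from itertools import combinations

def map_record(record):
    # Emit (pair, 1) for every pair of people in this record.
    for pair in combinations(sorted(record), 2):
        yield pair, 1

def reduce_pair(pair, ones):
    # ones is every 1 emitted for this pair across all records.
    total = sum(ones)
    if total >= threshold:
        yield pair, total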
The relevant concept for your case is Market Basket Analysis. In this context there are several algorithms; for example, the Apriori algorithm can be applied to your case, specialized to item sets of size 2.
Moreover, for finding association rules with a given support (which in your case corresponds to the threshold value), LSH and MinHash can be used as well.
You could use sampling to speed it up, e.g. only count each pair occurrence with 1/50 probability. That will give you roughly a 50x speed-up on the counting. Then double-check any pairs whose sampled count comes close enough to 1/50th of the threshold.
To double-check those pairs, you can either go through the whole list again, or you can do it more efficiently with some clever kind of reverse indexing built as you go. For example, encode each person's record indices into 64-bit integers; you can then use binary-search / merge-sort style techniques to decide which 64-bit integers to compare, and bit operations to compare them for matches. Other things to look up: reverse (inverted) indexing, binary indexed trees / Fenwick trees, and range trees.
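A rough sketch of the sampling idea; the 1/50 rate and the 80% acceptance band are tunable assumptions, and the exact re-count pass is the "go through the whole list again" option:

import random
from collections import defaultdict
from itertools import combinations

def sampled_candidates(records, threshold, rate=1 / 50, slack=0.8):
    approx = defaultdict(int)
    for rec in records:
        for pair in combinations(sorted(rec), 2):
            if random.random() < rate:        # count ~1 in 50 observations
                approx[pair] += 1
    cutoff = slack * threshold * rate         # "close enough" to threshold/50
    return [p for p, c in approx.items() if c >= cutoff]

def exact_counts(records, candidates):
    wanted, exact = set(candidates), defaultdict(int)
    for rec in records:
        for pair in combinations(sorted(rec), 2):
            if pair in wanted:
                exact[pair] += 1
    return exact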
Given an initially empty list, there are three types of queries: 1, 2, 3.
Query "1 x", where x is a positive integer, adds the number x to the list.
Query "2 x" removes x from the list.
Query "3" prints the smallest positive integer not present in the list.
Here x can range from 1 up to 10^9 and the number of queries is up to 10^5. Because of the large range of x, I can't keep a boolean array marking which integers are present. How should I approach this?
There are too many unknowns about your data to give a definitive answer here. The approach differs a lot between at least these different cases:
Few values.
A lot of values but with large gaps.
A lot of values with only small gaps.
Almost all values.
It also depends on which of the mentioned operations you will do the most.
A bit array over the full range of x is only about 125 MB (10^9 bits), so it is possible to keep it in memory on most machines. But if the data set is sparse (cases 1 and 2 above) you might want to consider sparse arrays instead, or for very sparse sets (case 1) perhaps a binary search tree or a min-heap. The heap is probably not a good idea if you are going to use operation 2 a lot.
For cases 1, 2 and 4 you might consider a range tree. The upside of this solution is that you can answer operation 3 in logarithmic time, just by going leftwards down the tree and looking at the first range.
It might also be possible to page your data structure out to disk if you are not going to do a lot of random insertions.
You might also consider speeding up the search with a Bloom filter, depending on what type of data structure you choose in the end.
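A minimal sketch of the run/interval flavour of that idea (the class and method names are only illustrative): present values are stored as disjoint, maximal runs of consecutive integers in sorted lists, so operation 3 only has to look at the first run. With at most 10^5 queries, the O(n) list insertions and deletions are not a concern.

import bisect

class PresenceSet:
    # Present values stored as disjoint maximal runs [start, end],
    # kept in two parallel sorted lists.
    def __init__(self):
        self.starts, self.ends = [], []

    def add(self, x):                               # operation 1
        i = bisect.bisect_right(self.starts, x) - 1
        if i >= 0 and self.ends[i] >= x:
            return                                  # already present
        merge_left = i >= 0 and self.ends[i] == x - 1
        merge_right = i + 1 < len(self.starts) and self.starts[i + 1] == x + 1
        if merge_left and merge_right:              # x bridges two runs
            self.ends[i] = self.ends.pop(i + 1)
            self.starts.pop(i + 1)
        elif merge_left:
            self.ends[i] = x
        elif merge_right:
            self.starts[i + 1] = x
        else:
            self.starts.insert(i + 1, x)
            self.ends.insert(i + 1, x)

    def remove(self, x):                            # operation 2
        i = bisect.bisect_right(self.starts, x) - 1
        if i < 0 or self.ends[i] < x:
            return                                  # not present
        s, e = self.starts.pop(i), self.ends.pop(i)
        if s < x:                                   # keep the part left of x
            self.starts.insert(i, s)
            self.ends.insert(i, x - 1)
            i += 1
        if e > x:                                   # keep the part right of x
            self.starts.insert(i, x + 1)
            self.ends.insert(i, e)

    def smallest_missing(self):                     # operation 3
        if not self.starts or self.starts[0] > 1:
            return 1
        return self.ends[0] + 1

For example, after add(1), add(2), add(4), smallest_missing() returns 3; after remove(2) it returns 2.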
Recently I have been working with combinations of words to make "phrases" in different languages and I have noticed a few things that I could do with some more expert input on.
Defining some constants for this,
Depth (the size of each combination) is on average 6-7
The length of the input set is ~160 unique words.
Memory - Generating and storing all combinations of 160 words at these depths wastes lots of space. I can abuse a database by writing them to disk, but then I take a performance hit from constantly waiting on IO. The other trick is to generate the combinations on the fly with a generator object.
Time - If I'm not wrong, n choose k gets big fast; the formula is factorial(n) / (factorial(depth) * factorial(n - depth)), which means the set of combinations gets huge quickly.
My question is thus.
Considering I have a function f(x) that takes a combination and applies a calculation that has a cost, e.g.
func f(x) {
    if query_mysql("text search query").value > 15 {
        return true
    }
    return false
}
How can I efficiently process and execute this function on a huge set of combinations?
Bonus question, can combinations be generated concurrently?
Update: I already know how to generate them conventionally; it's more a case of making it efficient.
One approach is to first work out how much parallelism you can get, based on the number of threads you have. Let the number of threads be T, and split the work as follows:
Sort the elements according to some total ordering.
Find the smallest number d such that Choose(n,d) >= T.
Generate all combinations of size exactly d (there are typically far fewer of these than of the full-depth combinations, and they are computable on one core).
Now spread the work across your T cores, each getting a set of 'prefixes' (each prefix c is a combination of size d); for each prefix, find all suffixes whose 'smallest' element is 'bigger' than max(c) according to the total ordering.
This approach also translates nicely to the map-reduce paradigm:
map(words): // one mapper
    sort(words) // by some total ordering function
    generate all combinations of depth `d` exactly // NOT k!
    for each combination c produced:
        idx <- index in words of max(c)
        emit(c, words[idx+1:end])

reduce(c1, words): // T reducers
    combinations <- generate all combinations of size k-d from words
    for each c2 in combinations:
        c <- concat(c1, c2)
        emit(c, f(c))
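The same split, sketched in Python with itertools.combinations; words, k, d and f are the names from the pseudocode above, while the two function names are just illustrative. Each (prefix, pool) unit can then be handed to a separate process, e.g. via multiprocessing.Pool.

from itertools import combinations

def work_units(words, d):
    # The "mapper": one (prefix, suffix pool) unit per size-d combination.
    words = sorted(words)                  # fix a total ordering
    for prefix in combinations(words, d):
        last = words.index(prefix[-1])     # position of max(prefix)
        yield prefix, words[last + 1:]     # only elements 'bigger' than max(prefix)

def process_unit(prefix, pool, k, f):
    # The "reducer": extend the prefix with every (k - d)-sized suffix.
    for suffix in combinations(pool, k - len(prefix)):
        combo = prefix + suffix
        yield combo, f(combo)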
Use one of the many known algorithms to generate combinations. Chase's Twiddle algorithm is one of the best known and perfectly suitable. It captures state in an array, so it can be restarted or seeded if wished.
See Algorithm to return all combinations of k elements from n for lots more.
You can progress through your list at your own pace, using minimal memory and no disk IO. Generating each combination will take a microscopic amount of time compared to the 1 sec or so of your computation.
This algorithm (and many others) is easily adapted for parallel execution if you have the necessary skills.
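For what it's worth, Python's itertools.combinations already behaves this way: it is not Chase's Twiddle, but it generates combinations lazily in constant memory, so you only pay for the ones you actually consume.

from itertools import combinations, islice

combos = combinations(range(160), 7)       # lazy: nothing is materialized up front
first_batch = list(islice(combos, 1000))   # pull only as many as you need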
I have two arrays, N and M. They are both arbitrarily sized, though N is usually smaller than M. I want to find out which elements of N also exist in M, in the fastest way possible.
To give you an example of one possible instance of the program, N is an array 12 units in size, and M is an array 1,000 units in size. I want to find which elements in N also exist in M. (There may not be any matches.) The more parallel the solution, the better.
I used to use a hash map for this, but it's not quite as efficient as I'd like it to be.
Typing this out, I just thought of running a binary search over M on sizeof(N) independent threads (using CUDA). I'll see how this works, though other suggestions are welcome.
1000 is a very small number. Also, keep in mind that parallelizing a search will only give you speedup as the number of cores you have increases. If you have more threads than cores, your application will start to slow down again due to context switching and aggregating information.
A simple solution for your problem is to use a hash join. Build a hash table from M, then look up the elements of N in it (or vice versa; since both your arrays are small it doesn't matter much).
Edit: in response to your comment, my answer doesn't change too much. You can still speed up linearly only until your number of threads equals your number of processors, and not past that.
If you want to implement a parallel hash join, this would not be difficult. Start by building X-1 hash tables, where X is the number of threads/processors you have. Use a second hash function which returns a value modulo X-1 to determine which hash table each element should be in.
When performing the search, your main thread can apply the auxiliary hash function to each element to determine which thread to hand it off to for searching.
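A sketch of that partitioned build-and-probe, using Python sets as the hash tables; the modulo partitioner plays the role of the "second hash function" described above, and the inline probing stands in for handing each lookup to the thread that owns the partition:

def build_partitions(M, num_partitions):
    # One hash table (here a set) per worker.
    tables = [set() for _ in range(num_partitions)]
    for x in M:
        tables[hash(x) % num_partitions].add(x)
    return tables

def probe(N, tables):
    num_partitions = len(tables)
    return [x for x in N if x in tables[hash(x) % num_partitions]]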
Just sort N. Then for each element of M, do a binary search for it over sorted N. Finding the M items in N is trivially parallel even if you do a linear search over an unsorted N of size 12.
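A sketch of this variant using the bisect module; the lookups over M are independent of each other, which is what makes the search trivially parallel:

import bisect

def elements_of_n_in_m(N, M):
    small = sorted(N)                      # N is the small array (e.g. 12 items)
    found = set()
    for x in M:                            # each iteration is independent
        i = bisect.bisect_left(small, x)
        if i < len(small) and small[i] == x:
            found.add(small[i])
    return found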