Problem statement
There are 5 buckets, each containing many different products. A product can appear multiple times within the same bucket and also across buckets. We want to select products based on two conditions:
The product's total count across all the buckets should be considerably high
The same product should have a considerably high count in each individual bucket
I want the list of products that satisfies both conditions.
I tried knapsack algorithms, providing random weights and profits, but that seems like the wrong approach.
In linear time you can find:
The total count of each product across all buckets
The count of each product in the bucket which has the least of it
If linear time is too slow, in sublinear time you could estimate the same by sampling the buckets.
In either case, you have the information (or an estimate thereof) that you need to make a decision.
After doing the above, you need to decide how you want to pick products -- essentially how you want to trade off the min vs total per product.
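As a rough sketch of that counting step (the function and threshold names below are placeholders; buckets is assumed to be a list of lists of product names, and the thresholds encode whatever "considerably high" means for your data):

from collections import Counter

def shortlist(buckets, min_total, min_per_bucket):
    # Count each product per bucket and across all buckets (one linear pass).
    per_bucket = [Counter(bucket) for bucket in buckets]
    total = Counter()
    for counts in per_bucket:
        total.update(counts)
    # Keep products whose overall count and whose smallest per-bucket count
    # both clear the chosen thresholds.
    chosen = []
    for product, overall in total.items():
        least = min(counts[product] for counts in per_bucket)
        if overall >= min_total and least >= min_per_bucket:
            chosen.append(product)
    return chosen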
I'm looking for an algorithm to sort a large number of items using the fewest comparisons. My specific case makes it unclear which of the obvious approaches is appropriate: the comparison function is slow and non-deterministic so it can make errors, because it's a human brain.
In other words, I want to sort arbitrary items on my computer into a list from "best" to "worst" by comparing them two at a time. They could be images, strings, songs, anything. My program would display two things for me to compare. The program doesn't know anything about what is being compared; its job is just to decide which pairs to compare. That gives the following criteria:
It's a comparison sort - The only time the user sees items is when comparing two of them.
It's an out-of-place sort - I don't want to move the actual files, so items can have placeholder values or metadata files
Comparisons are slow - at least compared to a computer. Data locality won't have an effect, but comparing obviously different items will be quick, while comparing similar items will be slow.
Comparison is subjective - comparison results could vary slightly at different times.
Items don't have a total order - the desired outcome is an order that is "good enough" at runtime, which will vary depending on context.
Items will rarely be almost sorted - in fact, the goal is to get random data to an almost-sorted state.
Sets usually will contain runs - If every song on an album is a banger, it might be faster (since comparisons between obviously different items are quick) to compare them to the next album rather than to each other. Imagine a set {10.0, 10.2, 10.9, 5.0, 4.2, 6.9} where integer comparisons are fast but float comparisons are very slow.
There are many different ways to approach this problem. In addition to sorting algorithms, it's similar to creating tournament brackets and to voting systems, and there are countless ways to define and solve the problem based on various criteria. For this question I'm only interested in treating it as a sorting problem where the user is comparing two items at a time and choosing a preference. So what approach makes sense for either of the two following versions of the question?
How to choose pairs to get the best result in O(n) or fewer operations? (for example compare random pairs of items with n/2 operations, then use n/2 operations to spot check or fine-tune)
How to create the best order with additional operations but no additional comparisons (e.g. similar items are sorted into buckets or losers are removed, anything that doesn't increase the number of comparisons)
The representation of comparison results can be anything that makes the solution convenient - it can be dictionary keys corresponding to the final order, a "score" based on number of comparisons, a database, etc.
Edit: The comments have helped clarify the question in that the goal is similar to something like bucket sort, samplesort or the partitioning phase of quicksort. So the question could be rephrased as how to choose good partitions based on comparisons, but I'm also interested in any other ways of using the comparison results that wouldn't be applicable in a standard in-place comparison sort like keeping a score for each item.
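For concreteness, here is a minimal sketch of the "score per item" bookkeeping mentioned above, under the crude assumption that the comparison budget is spent on randomly chosen pairs (rough_rank, budget and ask_user are hypothetical names; ask_user(a, b) stands in for the human and returns the preferred item, and items are assumed to be hashable placeholders such as filenames):

import random

def rough_rank(items, budget, ask_user):
    # Spend the comparison budget on random pairs, crediting each winner.
    score = {item: 0 for item in items}
    for _ in range(budget):
        a, b = random.sample(items, 2)
        score[ask_user(a, b)] += 1
    # Items that win more of their random match-ups float toward the top.
    return sorted(items, key=lambda item: score[item], reverse=True)

This is only one way to use the comparison results; the question is precisely whether a smarter choice of pairs (partitioning, spot-checking, and so on) does better within the same budget.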
Given N users with movie preferences, retrieve a list of the movies preferred by at least K users.
What's the most efficient [Run-time / Memory] algorithm to find that answer?
If N=K it's easy, since you could:
(1) Intersection = first user's preferences.
(2) For each of the remaining users:
(3) Intersection = intersect(Intersection, user_i)
(4) If the intersection is empty, there's no point in continuing.
Step (4) is the problematic part in any other case, since even when the running intersection becomes empty there is still 'potential': a movie that not every user so far likes may still end up preferred by K users.
I thought of creating a hash map to count, per movie preference, the number of users who share it, but it sounds pretty inefficient, especially if the set of movie preferences is huge.
Any ideas / hints? Thanks.
If you want to optimize for run time, your approach of building a histogram is a good one. Basically, run over all the data and build a map movie -> #users. Then a single iteration over the map gives you the list of movies liked by k+ users.
This is O(N+k) time and O(k) memory.
Note that this approach can be efficiently distributed using map-reduce.
map: (user, movie) -> (movie, 1)
reduce: (movie, list<int>) -> movie if sum(list) >= k else none
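A minimal single-machine sketch of the histogram approach (popular_movies is a hypothetical name; the input is assumed to be an iterable of (user, movie) pairs):

from collections import Counter

def popular_movies(preferences, k):
    # One pass builds the movie -> #entries histogram.
    votes = Counter(movie for _user, movie in preferences)
    # One pass over the histogram yields the movies liked by k+ users.
    return [movie for movie, count in votes.items() if count >= k]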
If you want to do it with minimal added memory, you can sort your data in place by movie name. Then iterate over the data, count how many times each movie repeats itself, and if it's k or more, yield it.
This is O(NlogN) run time, with minimal added memory.
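A sketch of that sort-then-scan variant (popular_movies_lowmem is a hypothetical name; entries is assumed to be a mutable list of (user, movie) pairs, and groupby stands in for the scan over the sorted data):

from itertools import groupby

def popular_movies_lowmem(entries, k):
    # Sort in place by movie name, then count each run of equal movies.
    entries.sort(key=lambda pair: pair[1])
    for movie, run in groupby(entries, key=lambda pair: pair[1]):
        if sum(1 for _ in run) >= k:
            yield movie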
In both solutions, N stands for the input size (number of entries), which is potentially O(n*k), but practically much less.
I'm trying to improve an ordinal ranking implementation we are currently using to determine a unique ranking amongst competitors over a set of tasks.
The problem is as follows. There are K tasks and N competitors. All of the tasks are equally important. For each of the tasks, the competitors perform the task and the time it took for them to complete the task is noted. For each task, points are given to each competitor based on the order of their completion times. The fastest competitor gets N points, next fastest gets N-1 etc, the last competitor gets 1 point. The points are accumulated for a final tally from which a ranking is established.
Note that in the event any two competitors complete in the same time, they are given equivalent points; in short, ordinal ranking.
My problem is as follows: even though the tasks are all equally important, they are not equally complex. Some tasks are more difficult than others. As a result, what we observe is that the time difference between the first-place and last-place competitors can sometimes be 2-3 orders of magnitude. The situation is compounded in scenarios where the top M% of competitors complete in a time that is 1-3 orders of magnitude less than the next best competitor.
I would like to somehow signify and make obvious these differences in the final ranking.
Is there such a ranking system that can accommodate such a requirement?
Since your system is based on completion time, you obviously need to rank the complexity of each task by expected time. So each task needs a weight that the points are multiplied by, so that first place would get weight*N, second place weight*(N-1), and so on. The final tally would then sum each competitor's weighted points and sort those totals to get the final placement.
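A small sketch of that weighted scheme (weighted_ranking, times and weights are hypothetical names: times[task][competitor] holds the recorded completion time and weights[task] the chosen task weight; tied competitors share points here, so adjust the tie convention if your current system differs):

def weighted_ranking(times, weights, competitors):
    # Accumulate weight * points per competitor over all tasks.
    totals = {c: 0.0 for c in competitors}
    n = len(competitors)
    for task, weight in weights.items():
        ordered = sorted(competitors, key=lambda c: times[task][c])
        points, prev_time = n, None
        for position, c in enumerate(ordered):
            if times[task][c] != prev_time:        # new, slower time
                points = n - position              # fastest gets n, then n-1, ...
                prev_time = times[task][c]
            totals[c] += weight * points           # equal times keep equal points
    # Highest weighted total wins the final placement.
    return sorted(competitors, key=lambda c: totals[c], reverse=True)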
On the other hand, you may keep your system as it is now. Let a part of the competition be the wisdom of differentiating which task is more complex than another.
I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).
I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once.
The former would be a linear time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is an exponential time problem (compare every record's values against every other record's value; for a million records, that's approx 5 x 10^11 calculations vs the 1,000,000 calculations for the former option).
I'm wondering if there is another approach than the "brute-force" method I mentioned. I was thinking of possibly generating a string to compare each record's value against, then grouping strings that had roughly equal similarity measures, and then running the brute-force method through these groups. I wouldn't achieve linear time, but it might help. Also, if I'm thinking through this properly, this could miss a potential fuzzy match between strings A and B because their similarity to string C (the generated check-string) is very different despite them being very similar to each other.
Any ideas?
P.S. I realize I may have used the wrong terms for time complexity - it is a concept that I have a basic grasp of, but not well enough so I could drop an algorithm into the proper category on the spot. If I used the terms wrong, I welcome corrections, but hopefully I got my point across at least.
Edit
Some commenters have asked, given fuzzy matches between records, what my strategy was to choose which ones to delete (i.e. given "foo", "boo", and "coo", which would be marked the duplicate and deleted). I should note that I am not looking for an automatic delete here. The idea is to flag potential duplicates in a 60+ million record database for human review and assessment purposes. It is okay if there are some false positives, as long as it is a roughly predictable / consistent amount. I just need to get a handle on how pervasive the duplicates are. But if the fuzzy matching pass-through takes a month to run, this isn't even an option in the first place.
Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams. Thus STACKOVERFLOW becomes the set {STACKO, TACKO, ACKOV, CKOVE... , RFLOW}. Then use a large hash-table or sort-merge to find colliding n-grams and check collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.
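A toy version of that n-gram bucketing step (ngrams and candidate_pairs are hypothetical names; records is assumed to be a dict of record id -> string, and every candidate pair would still be confirmed by the fuzzy matcher):

from collections import defaultdict
from itertools import combinations

def ngrams(text, n=5):
    text = text.upper()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def candidate_pairs(records, n=5):
    # Bucket record ids by each n-gram they contain; records sharing any
    # n-gram become candidate duplicates for the fuzzy matcher.
    buckets = defaultdict(set)
    for rec_id, text in records.items():
        for gram in ngrams(text, n):
            buckets[gram].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs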
A next level up in sophistication is to pick a random hash function - e.g. HMAC with an arbitrary key - and, of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is make n quite small and get higher precision by concatenating the results from more than one hash function for the same record, so you need to get a match in multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash".
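A rough sketch of that keyed-minimum idea (min_hash_signature is a hypothetical name; keyed blake2b from hashlib stands in for HMAC, and keys is assumed to be a list of arbitrary byte-string keys of at most 64 bytes, one per hash function):

import hashlib

def min_hash_signature(text, keys, n=5):
    # For each key, keep the n-gram whose keyed hash is smallest.
    text = text.upper()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:                      # text shorter than n
        return ()
    signature = []
    for hash_key in keys:
        smallest = min(grams,
                       key=lambda g: hashlib.blake2b(g.encode(), key=hash_key).digest())
        signature.append(smallest)
    return tuple(signature)

Records whose signatures agree on every component (or on enough of them) become candidate duplicates; the trade-offs between n, the number of keys, and false hits are exactly the ones described above.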
I think you may have mis-calculated the complexity of all the combinations. If comparing one string with all the other strings is linear, that means that, due to the small string lengths, each comparison is O(1). The process of comparing each string with every other string is not exponential but quadratic, which is not all that bad. In simpler terms, you are comparing nC2 = n(n-1)/2 pairs of strings, so it's just O(n^2).
I couldn't think of a way to sort them into an order, as you can't write an objective comparator; but even if you could, sorting would take O(n log n) for merge sort, and since you have so many records and would probably prefer to use no extra memory, you would use quicksort, which takes O(n^2) in the worst case, which is no improvement over the worst-case time of brute force.
You could use a Levenshtein transducer, which "accept[s] a query term and return[s] all terms in a dictionary that are within n spelling errors away from it". Here's a demo.
Pairwise comparison of all the records is O(N^2), not exponential. There are basically two ways to cut down on that complexity.
The first is blocking, where you only compare records that already have something in common that's easy to compute, like the first three letters or a common n-gram. This is basically the same idea as Locality-Sensitive Hashing. The dedupe python library implements a number of blocking techniques and the documentation gives a good overview of the general approach.
In the worst case, pairwise comparison with blocking is still O(N^2). In the best case it is O(N). Neither the best nor the worst case is really met in practice. Typically, blocking reduces the number of pairs to compare by over 99.9%.
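A minimal blocking sketch (blocked_duplicates and is_fuzzy_match are hypothetical names, and the blocking key here is just the first three characters, as mentioned above; the dedupe library does this far more carefully):

from collections import defaultdict
from itertools import combinations

def blocked_duplicates(records, is_fuzzy_match):
    # Group record ids by a cheap blocking key, then compare only within blocks.
    blocks = defaultdict(list)
    for rec_id, text in records.items():
        blocks[text[:3].upper()].append(rec_id)
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            if is_fuzzy_match(records[a], records[b]):
                yield a, b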
There are some interesting alternative paradigms for record linkage that are not based on pairwise comparisons. These have better worst-case complexity guarantees. See the work of Beka Steorts and Michael Wick.
I assume this is a one-time cleanup. I think the problem won't be having to do so many comparisons; it'll be deciding which comparisons are worth making. You mention names and addresses, so see this link for some of the comparison problems you'll have.
It's true you have to do almost 500 billion brute-force compares to compare a million records against themselves, but that's assuming you never skip any records previously declared a match (i.e., never taking the 'break' out of the j-loop in the code below).
My pokey E-machines T6532 2.2gHz manages to do 1.4m seeks and reads per second of 100-byte text file records, so 500 billion compares would take about 4 days. Instead of spending 4 days researching and coding up some fancy solution (only to find I still need another x days to actually do the run), and assuming my comparison routine can't compute and save the keys I'd be comparing, I'd just let it brute-force all those compares while I find something else to do:
for i in range(1, LASTREC):                 # records 1 .. LASTREC-1
    seektorec(i)                            # position the file at record i
    a = getrec(i)
    for j in range(i + 1, LASTREC + 1):
        b = getrec(j)
        if similarrecs(a, b):
            gotahit()
            break
Even if a given run only locates easy-to-define matches, hopefully it reduces the remaining unmatched records to a more reasonable smaller set for which further brute-force runs aren't so time-consuming.
But it seems unlikely that similarrecs() couldn't independently compute and save the portions of a and b being compared, in which case the much more efficient approach is:
with open("scratchfile", "w") as scratch:
    for i in range(1, LASTREC + 1):
        a = getrec(i)
        scratch.write(fuzzykey(a) + "\n")   # one comparison key per record
keys = sorted(open("scratchfile"))          # use an external sort for huge files
for i in range(len(keys) - 1):
    if keys[i] == keys[i + 1]:
        gotahit()
Most databases can do the above in one command line, if you're allowed to invoke your own custom code for computing each record's fuzzykey().
In any case, the hard part is going to be figuring out what makes two records a duplicate, per the link above.
Equivalence relations are particularly nice kinds of matching; they satisfy three properties:
reflexivity: for any value A, A ~ A
symmetry: if A ~ B, then necessarily B ~ A
transitivity: if A ~ B and B ~ C, then necessarily A ~ C
What makes these nice is that they allow you to partition your data into disjoint sets such that each pair of elements in any given set is related by ~. So, what you can do is apply the union-find algorithm to first partition all your data, then pick out a single representative element from each set in the partition; this completely de-duplicates the data (where "duplicate" means "related by ~"). Moreover, this solution is canonical in the sense that no matter which representatives you happen to pick from each partition, you get the same number of final values, and each of the final values are pairwise non-duplicate.
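For completeness, a small union-find sketch of that partition-and-pick-a-representative step (deduplicate and related are hypothetical names; related(a, b) plays the role of ~ and is assumed to actually be an equivalence relation, which, as noted next, fuzzy matching generally is not):

def deduplicate(items, related):
    # Union-find over item indices; related(a, b) is the ~ relation.
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if related(items[i], items[j]):
                parent[find(i)] = find(j)   # union the two sets

    # Keep one representative per disjoint set.
    representatives = {find(i): items[i] for i in range(len(items))}
    return list(representatives.values())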
Unfortunately, fuzzy matching is not an equivalence relation, since it is presumably not transitive (though it's probably reflexive and symmetric). The result is that there isn't a canonical way to partition the data; you might find that, no matter how you try to partition the data, some values in one set are equivalent to values from another set, or some values within a single set are not equivalent to each other.
So, what behavior do you want, exactly, in these situations?