How do I find the largest cluster in this simple dataset? - algorithm

I have data on users and their interests. Some users have more interests than others; the data looks like the example below.
How do I find the largest cluster of users with the most interests in common? Formally, I am trying to maximize (number of users in cluster * number of shared interests in cluster)
In the data below, the largest cluster is:
CORRECT ANSWER
Users: [1,2,3]
Interests: [2,3]
Cluster-value: 3 users x 2 shared interests = 6
DATA
User 1: {3,2}
User 2: {3,2,4}
User 3: {2,3,8}
User 4: {7}
User 5: {7}
User 6: {9}
How do I find the largest cluster of users with the most interests in common?
Here would be a hypothetical data generation process:
import random

# Generate 300 random (user, interest) tuples
def generate_data():
    data = []
    while len(data) < 300:
        data_pt = {"user": random.randint(1, 100), "interest": random.randint(1, 50)}
        if data_pt not in data:
            data.append(data_pt)
    return data

def largest_cluster(data):
    return None
UPDATE: As somebody pointed out, the data is too sparse. In the real case, there would be more users than interests, so I have updated the data generating process.

This looks to me like the kind of combinatorial optimization problem which would fall into the NP-Hard complexity class, which would of course mean that it's intractable to find an exact solution for instances with more than ~30 users.
Dynamic Programming would be the tool you'd want to employ if you were to find a usable algorithm for a problem with an exponential search space like this (here the solution space is all 2^n subsets of users), but I don't see DP helping us here because of the lack of overlapping sub-problems. That is, for DP to help, we have to be able to use and combine solutions to smaller sub-problems into an overall solution in polynomial time, and I don't see how we can do that for this problem.
Imagine you have a solution for a size-k problem, using a limited subset of the users {u1, u2, ..., uk}, and you want to use that solution to find the new solution when you add another user u(k+1). The issue is that the solution set in the incrementally larger instance might not overlap at all with the previous solution (it may be an entirely different group of users/interests), so we can't effectively combine solutions to subproblems to get the overall solution. And if, instead of using just the single optimal solution for the size-k problem to reason about the size-(k+1) problem, you stored all possible user combinations from the smaller instance along with their scores, you could of course quite easily take set intersections across these groups' interests with the new user's interests to find the new optimal solution. However, the problem with that approach is that the information you have to store doubles with each iteration, yielding an exponential-time algorithm no better than the brute-force solution. You run into similar problems if you try to base your DP on incrementally adding interests rather than users.
So if you know you only have a few users, you can use the brute-force approach: generate all user combinations, take the set intersection of each combination's interests, score it, and keep the maximum. The best way to approach larger instances would probably be with approximate solutions through search algorithms (unless there is a DP solution I don't see). You could iteratively add/subtract/swap users to improve the score and climb towards an optimum, or use a branch-and-bound algorithm which systematically explores all user combinations but stops exploring any user-subset branches with a null interest intersection (since adding additional users to that subset will still produce a null intersection). You might have a lot of user groups with null interest intersections, so this latter approach can be quite quick in practice by pruning off large parts of the search space, and if you ran it without a depth limit it would find the exact solution eventually.
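As a minimal sketch of that brute force (assuming the data has already been grouped into a user -> interests dict), it only takes a few lines with itertools:

from itertools import combinations

def brute_force_largest_cluster(user_interests):
    # Exhaustive baseline, only viable for small instances: try every subset of users,
    # intersect their interest sets, and keep the best (users * shared interests) score.
    users = list(user_interests)
    best = (set(), set(), 0)
    for k in range(1, len(users) + 1):
        for cluster in combinations(users, k):
            shared = set.intersection(*(user_interests[u] for u in cluster))
            score = len(cluster) * len(shared)
            if score > best[2]:
                best = (set(cluster), shared, score)
    return best

# e.g. brute_force_largest_cluster({1: {3, 2}, 2: {3, 2, 4}, 3: {2, 3, 8}, 4: {7}, 5: {7}, 6: {9}})
# returns ({1, 2, 3}, {2, 3}, 6), matching the expected answer in the question.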
Branch-and-bound would work something like this:
def get_largest_cluster(data):
    # data: a list of {"user": ..., "interest": ...} records, as produced by generate_data()
    user_interests = {}                       # user -> set of that user's interests
    for point in data:
        user_interests.setdefault(point["user"], set()).add(point["interest"])

    users = list(user_interests)              # fixed order of users to branch over
    best = {"cluster": set(), "interests": set(), "score": 0}

    def generate_cluster_scores(i, cluster, interests):
        score = len(cluster) * len(interests)
        if score > best["score"]:
            best["cluster"], best["interests"], best["score"] = cluster, interests, score
        if i == len(users):
            return
        user = users[i]
        new_interests = (user_interests[user] if not cluster
                         else interests & user_interests[user])
        # explore the rest of the subsets without the current user...
        generate_cluster_scores(i + 1, cluster, interests)
        # ...and with the current user, but only if the shared-interest set is non-empty (the bound)
        if new_interests:
            generate_cluster_scores(i + 1, cluster | {user}, new_interests)

    generate_cluster_scores(0, frozenset(), frozenset())
    return set(best["cluster"]), set(best["interests"]), best["score"]
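For reference, running the Python version above on the question's example data (converted into (user, interest) records) gives the expected result:

data = [{"user": user, "interest": interest}
        for user, interests in {1: {3, 2}, 2: {3, 2, 4}, 3: {2, 3, 8},
                                4: {7}, 5: {7}, 6: {9}}.items()
        for interest in interests]
print(get_largest_cluster(data))   # ({1, 2, 3}, {2, 3}, 6)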
You might be able to do a more sophisticated bounding (for example, if you can calculate that the current cluster score couldn't eclipse your current best score even if all the remaining users were added and the interest intersection stayed the same), but checking for an empty interest intersection is simple enough. This works fine for 100 users and 50 interests, up to around 800 data points. You could also make it more efficient by iterating over the smaller of |interests| and |users| (to generate fewer recursive calls/combinations) and just mirroring the logic for the case where interests is the smaller set. Also, you tend to get more interesting clusters with fewer users/interests.

Related

Resource Allocation Algorithm (Weights in Containers)

I am currently trying to work through this problem. But I cannot seem to find the solution for this problem.
So here is the premise: there are k containers. Each container has a capacity associated with it. You are to place weights in these containers. The weights can have random values, but the total weight in a container cannot exceed its capacity, or else the container will break. There could be a situation where a new weight does not fit in any of the containers; then you can rearrange the existing weights to accommodate the new weight.
Example:
Container 1: [10, 4], Capacity = 20
Container 2: [7, 6], Capacity = 20
Container 3: [10, 6], Capacity = 20
Now let's say we have to add a new weight with value 8.
One possible solution is to move the 6 from Container 2 to Container 1. And place the new weight in Container 2.
Container 1: [10, 4, 6], Capacity = 20
Container 2: [7, 8], Capacity = 20
Container 3: [10, 6], Capacity = 20
I would like to do this reallocation in as few moves as possible.
Let me know if this does not make sense. I am sure there is an algorithm out there but I just cannot seem to find it.
Thanks.
I thought the "Distribution of Cookies" problem would help, but that requires too many moves.
As I noted in the comments, the problem of finding if ANY solution exists is called Bin Packing and is NP-complete. Therefore any solution is either going to sometimes fail to find answers, or will be possibly exponentially slow.
The stated preference is for sometimes failing to find an answer. So I'll make reasonable decisions that result in that.
Note that this would take me a couple of days to implement. Take a shot yourself, but if you want you can email btilly#gmail.com and we can discuss a contract. (I already spent too long on it.)
Next, the request for shortest path means a breadth first search. So we'll take a breadth-first search through "the reasonableness of the path". Basically we'll try greedy first strategies, and then cut it off if it takes too long. So we may find the wrong answer (if greedy was wrong), or give up (if it takes too long). But we'll generally do reasonably well.
So what is a reasonable path? Well, a good greedy solution to bin packing is to always place the heaviest thing first, and place it in the fullest bin you can. That's great for placing a bunch of objects at once, but it won't help you directly with moving objects.
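As a minimal sketch of that greedy heuristic (the function and field names here are just illustrative, not part of the eventual search):

def greedy_pack(weights, capacities):
    # Place the heaviest remaining object into the fullest bin that can still hold it.
    bins = [{"capacity": c, "items": []} for c in capacities]
    unplaced = []
    for w in sorted(weights, reverse=True):          # heaviest first
        # bins that still have room for w, fullest (least free space) first
        candidates = sorted((b for b in bins if b["capacity"] - sum(b["items"]) >= w),
                            key=lambda b: b["capacity"] - sum(b["items"]))
        if candidates:
            candidates[0]["items"].append(w)
        else:
            unplaced.append(w)                       # no single bin fits it; moves are needed
    return bins, unplaced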
And therefore we'll prioritize moves that create large holes first. And so our rules for the first things to try become:
Always place the heaviest thing we have first.
If possible, place it where we leave the container as full as possible.
Try moving things to create large spaces before small ones.
Deduplicate early.
Figuring this out is going to involve a lot of, "Pick the closest to full bin where I fit," and, "Pick the smallest thing in this bin which lets me fit." And you'd like to do this while looking at a lot of, "We did, X, Y and Z..." and then looking at "...or maybe X, Y and W...".
Luckily I happen to have a perfect data structure for this. https://stackoverflow.com/a/75453554/585411 shows how to have a balanced binary tree, kept in sorted order, which is easy to clone and try something with while not touching the original tree. There I did it so you can iterate over the old tree. But you can also use it to create a clone and try something out that you may later abandon.
I didn't make that a multi-set (able to add elements multiple times) or add a next_biggest method. A multi-set is doable by adding a count to a node. Now contains can return a count (possibly 0) instead of a boolean. And next_biggest is fairly easy to add.
We need to add a hash function to this for deduplication purposes. We can define this recursively with:
node.hash = some_hash(some_hash(node.value) + some_hash(node.left.hash) + some_hash(node.right.hash))
(insert appropriate default hashes if node.left or node.right is None)
If we store this in the node at creation, then looking it up for deduplication is very fast.
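A sketch of such a node with the hash cached at creation (names are illustrative, not the API from the linked answer):

class Node:
    __slots__ = ("value", "left", "right", "hash")

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
        left_hash = left.hash if left is not None else 0     # default hash for a missing child
        right_hash = right.hash if right is not None else 0
        # matches the recursive definition above
        self.hash = hash(hash(value) + left_hash + right_hash)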
With this, if you have many bins each with many objects, you can have the objects stored in sorted order of size, and the bins stored sorted by free space, then by bin.hash. And now the idea is to add a new object to a bin as follows:
new_bin = old_bin.add(object)
new_bins = old_bins.remove(old_bin).add(new_bin)
And remove similarly with:
new_bin = old_bin.remove(object)
new_bins = old_bins.remove(old_bin).add(new_bin)
And with n objects across m bins this constructs each new state using only O(log(n) + log(m)) new data. And we can easily see if we've been here before.
And now we create partial solution objects consisting of:
prev_solution (the solution we came from, may be None)
current_state (our data for bins and objects in bins)
creation_id (ascending id for partial solutions)
last_move (object, from_bin, to_bin)
future_move_bins (list of bins in order of largest movable object)
future_bins_idx (which one we last looked at)
priority (what order to look at these in)
moves (how many moves we've actually used)
move_priority (at what priority we started emptying the from_bin)
Partial solutions should compare based on priority and then creation_id. They should hash based on (solution.state.hash, solution.last_move.move_to.hash, future_bins_idx).
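A sketch of that record as a Python dataclass (field names as above; the heap ordering and dedup key follow the rules just described):

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class PartialSolution:
    prev_solution: Optional["PartialSolution"]
    current_state: Any            # persistent bins/objects structure
    creation_id: int
    last_move: Any                # (object, from_bin, to_bin), or None
    future_move_bins: list
    future_bins_idx: int
    priority: float
    moves: int
    move_priority: float

    def __lt__(self, other):      # heap order: priority first, then creation_id
        return (self.priority, self.creation_id) < (other.priority, other.creation_id)

    def dedup_key(self):          # used for deduplication instead of Python's __hash__
        to_bin_hash = self.last_move[2].hash if self.last_move else 0
        return (self.current_state.hash, to_bin_hash, self.future_bins_idx)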
There will need to be a method called next_solutions. It will return the next group of future solutions to consider. (These may share most of their underlying data with the current solution, thanks to the persistent trees.)
The first partial solution will have prev_solution = None, creation_id = 1, last_move = None, and priority = moves = move_priority = 0. Its future_move_bins will be a list of bins sorted by biggest movable element, descending, and future_bins_idx will be 0.
When we create a new partial solution, we will have to:
clone the old solution into self
self.prev_solution = old solution
self.creation_id = next_creation_id
next_creation_id += 1
set self.last_move
remove the object from self.state.from_bin
add the object to self.state.to_bin
(fixing future_move_bins is left to the caller)
self.moves += 1
if the new from_bin matches the previous one:
    self.priority = max(self.moves, self.move_priority)
else:
    self.priority += 1
    self.move_priority = self.priority
OK, this is a lot of setup. We're ALMOST there. (Except for the key future_moves business.)
The next thing that we need is the idea of a priority queue, which in Python can be realized with heapq.
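For illustration, heapq maintains a min-heap, so the entry with the smallest (priority, creation_id) pops first:

import heapq

queue = []
heapq.heappush(queue, (2, 2, "later, less promising solution"))   # (priority, creation_id, payload)
heapq.heappush(queue, (0, 1, "initial solution"))
heapq.heappush(queue, (1, 3, "promising follow-up"))
print(heapq.heappop(queue))   # (0, 1, 'initial solution')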
And NOW here is the logic for the search:
best_solution_hash = {}
best_space_by_moves = {}
construct initial_solution
queue = []
add initial_solution.next_solutions() to queue
while len(queue) and not_time_to_stop():   # use this to avoid endless searches
    solution = heapq.heappop(queue)
    # ANSWER HERE?
    if can add target object to solution.state:
        walk prev_solution backwards to get the moves we want
        return reverse of the moves we found
    if solution.hash() not in best_solution_hash:
        # We have never seen this solution hash
        best_solution_hash[solution.hash()] = solution
    elif solution.moves < best_solution_hash[solution.hash()].moves:
        # This is a better way of finding a state we previously got to!
        # We want to redo that work with higher priority!
        solution.priority = min(solution.priority, best_solution_hash[solution.hash()].priority - 0.01)
        best_solution_hash[solution.hash()] = solution
    if best_solution_hash[solution.hash()] == solution:
        for next_solution in solution.next_solutions():
            # Is this solution particularly promising?
            if solution.moves not in best_space_by_moves or \
               best_space_by_moves[solution.moves] <= space left in solution.last_move.from_bin:
                # Promising, maybe a best solution? Let's prioritize it!
                best_space_by_moves[solution.moves] = space left in solution.last_move.from_bin
                solution.priority = solution.move_priority = solution.moves
            add next_solution to queue
return None   # because no solution was found
So the idea is that we take the best looking current solution, consider just a few related solutions, and add them back to the queue. Generally with a higher priority. So if something fairly greedy works, we'll try that fairly quickly. In time we'll get to unpromising moves. If one of those surprises us on the upside, we'll set its priority to moves (thereby making us focus on it), and explore that path more intensely.
So what does next_solutions do? Something like this:
def next_solutions(solution):
    if solution.last_move is None:
        if future_bins is not empty:
            yield result of moving the largest movable object in future_bins[0]
                  to the first bin it can go into (i.e. one with enough space)
    else:
        if can do this from solution:
            yield result of moving the largest movable object
                  in future_bins[bin_idx]
                  to the smallest bin it can go in
                  that is at least as big as last_move.to_bin
        if can move a smaller object from the same bin in prev_solution:
            yield that with priority solution.priority + 2
        if can move the same object to a later to_bin in prev_solution:
            yield that with priority solution.priority + 2
        if can move an object from the next bin_idx in prev_solution:
            yield result of moving that with priority solution.priority + 1
Note that moving small objects first, or moving objects to an emptier bin than needed, is possible but unlikely to be a good idea. So I penalized those moves more severely to have the priority queue focus on better ideas. This results in a branching factor of about 2.7.
So if an obvious greedy approach succeeds in less than 7 steps, the queue will likely get to size 1000 or so before you find it. And is likely to find it if you had a couple of suboptimal choices.
Even if a couple of unusual choices need to be made, you'll still get an answer quickly. You might not find the best, but you'll generally find pretty good ones.
Solutions of a dozen moves with a lot of data will require the queue to grow to around 100,000 items, and that should take on the order of 50-500 MB of memory. And that's probably where this approach maxes out.
This all may be faster (by a lot) if the bins are full enough that there aren't a lot of moves to make.

Matching data based on parameters and constraints

I've been looking into the k nearest neighbors algorithm as I might be developing an application that matches fighters (boxers) in the near future.
The reason for my question, is to figure out which would be the best approach/algorithm to use when matching fighters based on multiple parameters and constraints depending on the rule-set.
The relevant properties of each fighter are the following:
Age (fighters are assigned to an age group: 15, 17, 19, elite)
Weight
Number of fights
Now there are some rulesets for what can be allowed when matching fighters:
A maximum of 2 years between the fighters (unless it's the elite group)
A maximum of 3 kilos difference in weight
Now obviously the perfect match would be one where all the attendees get matched with another boxer that fits within the ruleset.
And the main priority is to match as many fighters with each other as possible.
Is K-nn the way to go or is there a better approach?
If so which?
This is too long for a comment.
For best results with K-nn, I would suggest principal components. These allow you to use many more dimensions and do a pretty good job of spreading the data through the space, to get a good neighborhood.
As for incorporating the existing rules, you have two choices. Probably the best way is to build them into your distance function. Alternatively, you can take a large neighborhood and build them into the combination function.
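A hedged sketch of that idea using scikit-learn (assumed available; the feature values and column order below are made up for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

# columns: age, weight (kg), number of fights -- toy data
fighters = np.array([[17, 60.0, 4],
                     [18, 62.5, 6],
                     [19, 61.0, 3],
                     [25, 70.0, 20]])

X = StandardScaler().fit_transform(fighters)   # put the features on a comparable scale
X = PCA(n_components=2).fit_transform(X)       # spread the data along principal components

nn = NearestNeighbors(n_neighbors=3).fit(X)
distances, indices = nn.kneighbors(X[:1])      # candidate opponents for fighter 0
# the hard rules (age gap <= 2, weight gap <= 3 kg) can then filter these candidates
print(indices)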
I would go with k-Nearest Neighbor search. Since your dataset is in a low dimensional space (i.e. 3), I would use CGAL, in order to perform the task.
Now, the only thing you have to do, is to create a distance function like this:
#include <cmath>
#include <limits>

float boxers_dist(const Boxer& a, const Boxer& b) {
    if (std::abs(a.year - b.year) > 2 || std::abs(a.weight - b.weight) > 3)
        return std::numeric_limits<float>::infinity();   // outside the ruleset: never match
    // think how you should use the 3 dimensions you have to compute the distance
    return std::abs(a.weight - b.weight);                // placeholder metric
}
And you are done...now go fight!

Sorting People into Groups based on Votes

I have a problem with finding an algorithm for sorting a dataset of people. I'll try to explain in as much detail as possible:
The story starts with a survey. A bunch of people, let's say 600, can choose between 20-25 projects. They make a #1 wish, #2 wish and #3 wish, where #1 is the project they most want to take part in and #3 is the "not perfect but still acceptable" choice.
These projects are limited in their number of participants. Every project can take around 30 people (based on the number of people and the number of projects).
The algorithm puts the people in the different projects and should find the best possible combination.
The problem is that you can't just put everyone with #1 wish X into that project and stuff all the others who also had #1 wish X into their #2 wish, because that would not be the happiest situation for everybody.
You can think of it like this: for everybody who gets their #1 wish you get 100 points, for everybody who gets their #2 wish 60 points, for their #3 wish 30 points, and for anyone who gets none of their wishes 0 points. You want to collect as many points as possible.
I hope you get my problem. This is for a school-project day.
Is there something out there that could help me? Do you have any idea? I would be thankful for every tip!
Kind regards
You can solve this optimally by formulating it as a min cost network flow problem.
Add a node for each person, and one for each project.
Set cost for a flow between a person and a project according to their preferences.
(As Networkx provides a min cost flow, but not a max cost flow, I have set the costs to be negative.)
For example, using Networkx and Python:
import networkx as nx

G = nx.DiGraph()
prefs = {'Tom':   ['Project1', 'Project2', 'Project3'],
         'Dick':  ['Project2', 'Project1', 'Project3'],
         'Harry': ['Project1', 'Project3', 'Project2']}
capacities = {'Project1': 2, 'Project2': 10, 'Project3': 4}

num_persons = len(prefs)
G.add_node('dest', demand=num_persons)
for person, projectlist in prefs.items():
    G.add_node(person, demand=-1)
    for i, project in enumerate(projectlist):
        if i == 0:
            cost = -100  # happy to assign first choices
        elif i == 1:
            cost = -60   # slightly unhappy to assign second choices
        else:
            cost = -30   # very unhappy to assign third choices
        G.add_edge(person, project, capacity=1, weight=cost)  # edge taken if person does this project

for project, c in capacities.items():
    G.add_edge(project, 'dest', capacity=c, weight=0)

flowdict = nx.min_cost_flow(G)
for person in prefs:
    for project, flow in flowdict[person].items():
        if flow:
            print(person, 'joins', project)
In this code Tom's number 1 choice is Project1, followed by Project2, then Project3.
The capacities dictionary specifies the upper limit on how many people can join each project.
My algorithm would be something like this:
mainloop
    wishlevel = 1
    loop
        distribute people into all projects according to their wishlevel wish
        loop through projects, counting population
            if population exceeds maximum
                distribute excess non-redistributed people into their wishlevel+1 projects that are under-populated
                tag distributed people as 'redistributed' to avoid moving them again
            endif
        endloop
        wishlevel = wishlevel + 1
    loop until wishlevel == 3
mainloop until no project exceeds max population
This should make several passes through the data set until everything is evened out. Note that restricting redistribution of already-redistributed people may result in an endless loop if one project fills up with such people as the algorithm progresses, so you might try it without that restriction.
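A rough Python sketch of this redistribution idea (function and variable names are made up, and the "under-populated" check is simplified to a fixed round limit):

from collections import defaultdict

def assign_by_preference(prefs, capacities, max_rounds=10):
    # prefs: {person: [wish1, wish2, wish3]}, capacities: {project: max participants}
    assignment = {p: wishes[0] for p, wishes in prefs.items()}   # start everyone at wish #1
    wish_index = {p: 0 for p in prefs}     # which wish each person currently holds
    moved = set()                          # people tagged as 'redistributed'

    for _ in range(max_rounds):
        members = defaultdict(list)
        for person, project in assignment.items():
            members[project].append(person)
        overfull = False
        for project, people in members.items():
            excess = len(people) - capacities.get(project, 0)
            if excess <= 0:
                continue
            overfull = True
            movable = [p for p in people if p not in moved and wish_index[p] < 2]
            for person in movable[:excess]:
                wish_index[person] += 1
                assignment[person] = prefs[person][wish_index[person]]
                moved.add(person)
        if not overfull:
            break
    return assignment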

System Design of Google Trends?

I am trying to figure out system design behind Google Trends (or any other such large scale trend feature like Twitter).
Challenges:
Need to process large amount of data to calculate trend.
Filtering support - by time, region, category etc.
Need a way to store for archiving/offline processing. Filtering support might require multi dimension storage.
This is what my assumption is (I have zero practical experience with MapReduce/NoSQL technologies):
Each search item from a user will maintain a set of attributes that will be stored and eventually processed,
as well as a list of searches by timestamp, region of search, category, etc.
Example:
Searching for Kurt Cobain term:
Kurt-> (Time stamp, Region of search origin, category ,etc.)
Cobain-> (Time stamp, Region of search origin, category ,etc.)
Question:
How do they efficiently calculate the frequency of a search term?
In other words, given a large data set, how do they find the top 10 most frequent items in a distributed, scalable manner?
Well... finding the top K terms is not really a big problem. One of the key ideas in this field has been "stream processing", i.e., performing the operation in a single pass over the data and sacrificing some accuracy to get a probabilistic answer. Thus, assume you get a stream of data like the following:
A B K A C A B B C D F G A B F H I B A C F I U X A C
What you want is the top K items. Naively, one would maintain a counter for each item, and at the end sort by the count of each item. This takes O(U) space and O(max(U*log(U), N)) time, where U is the number of unique items and N is the number of items in the list.
In case U is small, this is not really a big problem. But once you are in the domain of search logs with billions or trillions of unique searches, the space consumption starts to become a problem.
So, people came up with the idea of "count-sketches" (you can read up more here: count min sketch page on wikipedia). Here you maintain a hash table A of length n and create two hashes for each item:
h1(x) = 0 ... n-1 with uniform probability
h2(x) = 0/1 each with probability 0.5
You then do A[h1[x]] += h2[x]. The key observation is that since each value randomly hashes to +/-1, E[ A[h1[x]] * h2[x] ] = count(x), where E is the expected value of the expression, and count is the number of times x appeared in the stream.
Of course, the problem with this approach is that each estimate still has a large variance, but that can be dealt with by maintaining a large set of hash counters and taking the average or the minimum count from each set.
With this sketch data structure, you are able to get an approximate frequency of each item. Now, you simply maintain a list of 10 items with the largest frequency estimates till now, and at the end you will have your list.
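As a minimal, illustrative count-sketch in Python (not how Google actually implements it; the seeded-hash scheme below is just one way to realize h1 and h2):

import random

class CountSketch:
    # Each of `depth` rows has its own h1 (bucket) hash and h2 (+/-1 sign) hash.
    def __init__(self, width=1000, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [(rng.random(), rng.random()) for _ in range(depth)]
        self.tables = [[0] * width for _ in range(depth)]

    def _hashes(self, x, row):
        s1, s2 = self.seeds[row]
        bucket = hash((s1, x)) % self.width            # h1: uniform bucket in 0..n-1
        sign = 1 if hash((s2, x)) % 2 == 0 else -1     # h2: +/-1, each with probability 0.5
        return bucket, sign

    def add(self, x):
        for row in range(len(self.tables)):
            bucket, sign = self._hashes(x, row)
            self.tables[row][bucket] += sign

    def estimate(self, x):
        estimates = []
        for row in range(len(self.tables)):
            bucket, sign = self._hashes(x, row)
            estimates.append(self.tables[row][bucket] * sign)
        estimates.sort()
        return estimates[len(estimates) // 2]          # median across rows reduces the variance

# single pass over the example stream, then report the most frequent items
sketch = CountSketch()
stream = "A B K A C A B B C D F G A B F H I B A C F I U X A C".split()
for item in stream:
    sketch.add(item)
print(sorted(set(stream), key=sketch.estimate, reverse=True)[:10])
# (here we revisit the stream's alphabet for brevity; a real system would instead keep
#  a small heap of the current top candidates as the stream goes by)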
How exactly a particular private company does it is likely not publicly available, and how to evaluate the effectiveness of such a system is at the discretion of the designer (be it you or Google or whoever).
But many of the tools and much of the research are out there to get you started. Check out some of the Big Data tools, including many of the top-level Apache projects, like Storm, which allows for the processing of streaming data in real time.
Also check out some of the Big Data and Web Science conferences like KDD or WSDM, as well as papers put out by Google Research.
How to design such a system is challenging, with no single correct answer, but the tools and research are available to get you started.

Coming up with factors for a weighted algorithm?

I'm trying to come up with a weighted algorithm for an application. In the application, there is a limited amount of space available for different elements. Once all the space is occupied, the algorithm should choose the best element(s) to remove in order to make space for new elements.
There are different attributes which should affect this decision. For example:
T: Time since last accessed. (It's best to replace something that hasn't been accessed in a while.)
N: Number of times accessed. (It's best to replace something which hasn't been accessed many times.)
R: Number of elements which need to be removed in order to make space for the new element. (It's best to replace the least amount of elements. Ideally this should also take into consideration the T and N attributes of each element being replaced.)
I have 2 problems:
Figuring out how much weight to give each of these attributes.
Figuring out how to calculate the weight for an element.
(1) I realize that coming up with the weight for something like this is very subjective, but I was hoping that there's a standard method or something that can help me in deciding how much weight to give each attribute. For example, I was thinking that one method might be to come up with a set of two sample elements and then manually compare the two and decide which one should ultimately be chosen. Here's an example:
Element A: N = 5, T = 2 hours ago.
Element B: N = 4, T = 10 minutes ago.
In this example, I would probably want A to be the element that is chosen to be replaced since although it was accessed one more time, it hasn't been accessed in a lot of time compared with B. This method seems like it would take a lot of time, and would involve making a lot of tough, subjective decisions. Additionally, it may not be trivial to come up with the resulting weights at the end.
Another method I came up with was to just arbitrarily choose weights for the different attributes and then use the application for a while. If I notice anything obviously wrong with the algorithm, I could then go in and slightly modify the weights. This is basically a "guess and check" method.
Both of these methods don't seem that great and I'm hoping there's a better solution.
(2) Once I do figure out the weight, I'm not sure which way is best to calculate the weight. Should I just add everything? (In these examples, I'm assuming that whichever element has the highest replacementWeight should be the one that's going to be replaced.)
replacementWeight = .4*T - .1*N - 2*R
or multiply everything?
replacementWeight = (T) * (.5*N) * (.1*R)
What about not using constants for the weights? For example, sure "Time" (T) may be important, but once a specific amount of time has passed, it starts not making that much of a difference. Essentially I would lump it all in an "a lot of time has passed" bin. (e.g. even though 8 hours and 7 hours have an hour difference between the two, this difference might not be as significant as the difference between 1 minute and 5 minutes since these two are much more recent.) (Or another example: replacing (R) 1 or 2 elements is fine, but when I start needing to replace 5 or 6, that should be heavily weighted down... therefore it shouldn't be linear.)
replacementWeight = 1/T + sqrt(N) - R*R
Obviously (1) and (2) are closely related, which is why I'm hoping that there's a better way to come up with this sort of algorithm.
What you are describing is the classic problem of choosing a cache replacement policy. Which policy is best for you, depends on your data, but the following usually works well:
First, always store a new object in the cache, evicting the R worst one(s). There is no way to know a priori if an object should be stored or not. If the object is not useful, it will fall out of the cache again soon.
The popular squid cache implements the following cache replacement algorithms:
Least Recently Used (LRU):
replacementKey = -T
Least Frequently Used with Dynamic Aging (LFUDA):
replacementKey = N + C
Greedy-Dual-Size-Frequency (GDSF):
replacementKey = (N/R) + C
C refers to a cache age factor here. C is basically the replacementKey of the item that was evicted last (or zero).
NOTE: The replacementKey is calculated when an object is inserted or accessed, and stored alongside the object. The object with the smallest replacementKey is evicted.
LRU is simple and often good enough. The bigger your cache, the better it performs.
LFUDA and GDSF both are tradeoffs. LFUDA prefers to keep large objects even if they are less popular, under the assumption that one hit to a large object makes up for lots of hits to smaller objects. GDSF basically makes the opposite tradeoff, keeping many smaller objects over fewer large objects. From what you write, the latter might be a good fit.
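A hedged sketch of how a replacementKey-driven cache could look in Python (the class and field names are made up; only the key function changes between policies, and for LRU the last-access timestamp is used, which is equivalent to the -T formulation above):

class KeyedCache:
    def __init__(self, capacity, key_fn):
        self.capacity = capacity
        self.key_fn = key_fn       # (entry, cache_age) -> replacementKey
        self.cache_age = 0.0       # C: replacementKey of the last evicted entry
        self.entries = {}          # name -> {"hits": N, "size": ~R, "last_access": T, "key": ...}

    def access(self, name, size=1, now=0.0):
        entry = self.entries.setdefault(name, {"hits": 0, "size": size})
        entry["hits"] += 1
        entry["last_access"] = now
        entry["key"] = self.key_fn(entry, self.cache_age)   # computed on insert/access
        while len(self.entries) > self.capacity:
            victim = min(self.entries, key=lambda n: self.entries[n]["key"])
            self.cache_age = self.entries[victim]["key"]    # dynamic aging
            del self.entries[victim]

lru   = KeyedCache(100, key_fn=lambda e, c: e["last_access"])            # evict least recently used
lfuda = KeyedCache(100, key_fn=lambda e, c: e["hits"] + c)               # LFUDA
gdsf  = KeyedCache(100, key_fn=lambda e, c: e["hits"] / e["size"] + c)   # GDSF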
If none of these meet your needs, you can calculate optimal values for T, N and R (and compare different formulas for combining them) by minimizing regret, the difference in performance between your formula and the optimal algorithm, using, for example, Linear regression.
This is a completely subjective issue -- as you yourself point out. And a distinct possibility is that if your test cases consist of pairs (A, B) where you prefer A to B, you might find that you prefer A to B, B to C, but also C to A -- i.e. it's not an ordering.
If you are not careful, your function might not exist!
If you can define a scalar function of your input variables, with various parameters for coefficients and exponents, you might be able to estimate said parameters by using regression, but you will need an awful lot of data if you have many parameters.
This is the classical statistician's approach of first reviewing the data to IDENTIFY a model, and then using that model to ESTIMATE a particular realisation of the model. There are large books on this subject.
