I'm working on a subscription system, and I need to process the output data so that subscribers are registered correctly and automatically.
I work mainly with PHP/MySQL, but this is not a PHP/MySQL problem; it is a data-sorting problem.
My data has two sets:
"topics"=> [
    "Women rights"=> ["capacity"=>3],
    "Chelter"=> ["capacity"=>5],
    "Charity"=> ["capacity"=>7],
    "Training"=> ["capacity"=>17],
    "Child care"=> ["capacity"=>13],
    "Nursing"=> ["capacity"=>8],
    "Garbage collection"=> ["capacity"=>2],
    "Managing"=> ["capacity"=>1]
],
"applications"=> [
    "Alan"=>[
        ["topic"=>"Charity", "priority"=>1, "rankInTopic"=>1],
        ["topic"=>"Chelter", "priority"=>2, "rankInTopic"=>3],
        ["topic"=>"Garbage collection", "priority"=>3, "rankInTopic"=>1],
        ["topic"=>"Managing", "priority"=>4, "rankInTopic"=>12]
    ],
    "David"=>[
        ["topic"=>"Women rights", "priority"=>4, "rankInTopic"=>2],
        ["topic"=>"Chelter", "priority"=>3, "rankInTopic"=>2],
        ["topic"=>"Garbage collection", "priority"=>1, "rankInTopic"=>3],
        ["topic"=>"Managing", "priority"=>2, "rankInTopic"=>9],
        ["topic"=>"Nursing", "priority"=>5, "rankInTopic"=>3],
        ["topic"=>"Charity", "priority"=>6, "rankInTopic"=>3]
    ],
    "Sonia"=>[
        ["topic"=>"Chelter", "priority"=>2, "rankInTopic"=>1],
        ["topic"=>"Training", "priority"=>1, "rankInTopic"=>5]
    ],
    "Robert"=>[
        ["topic"=>"Garbage collection", "priority"=>6, "rankInTopic"=>2],
        ["topic"=>"Child care", "priority"=>3, "rankInTopic"=>2],
        ["topic"=>"Women rights", "priority"=>1, "rankInTopic"=>1],
        ["topic"=>"Managing", "priority"=>2, "rankInTopic"=>4],
        ["topic"=>"Nursing", "priority"=>5, "rankInTopic"=>1],
        ["topic"=>"Charity", "priority"=>4, "rankInTopic"=>5]
    ],
    "Diana"=>[
        ["topic"=>"Child care", "priority"=>1, "rankInTopic"=>1]
    ]
]
I would like to subscribe each applicant to one topic only.
Topics are ordered by the "rankInTopic" field, then by the "priority" field.
Thanks
The term for what you're trying to do is matching rather than ranking or sorting.
This problem can be expressed as a weighted bipartite matching. Sometimes this is also called the "assignment problem."
The data structure is an undirected bipartite graph with edge weights, often (though not necessarily) represented as a matrix of weights. One set of vertices will be "seats" in the topic areas, one vertex per unit of capacity. The other set will be applicants. Each applicant preference becomes an edge from the applicant to each seat of the respective topic. The weight of the edge is a single integer. To accommodate both priority and rank, you'll need the lexicographic order expressed in a single number. Something sufficiently big will work, e.g. rankInTopic * sum(all priorities) + priority.
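As a small illustration of folding the lexicographic order into one number, here is a minimal Python sketch. The function name edge_weight and the choice of base (sum of priorities plus one, slightly larger than the value suggested above, to be safe) are assumptions made for this example only.

```python
def edge_weight(rank_in_topic, priority, base):
    """Fold (rank, priority) into one integer; requires 0 <= priority < base."""
    return rank_in_topic * base + priority


# Alan's preferences from the question, as (rankInTopic, priority) pairs.
prefs = [(1, 1), (3, 2), (1, 3), (12, 4)]

# Any base larger than the biggest priority works; this one is an assumption.
base = sum(priority for _, priority in prefs) + 1

weights = [edge_weight(rank, priority, base) for rank, priority in prefs]

# Sorting by the single weight gives the same order as sorting by the pair.
by_weight = sorted(prefs, key=lambda rp: edge_weight(rp[0], rp[1], base))
print(weights, by_weight == sorted(prefs))
```

Because the base exceeds every priority, a lower rank always wins, and priority only breaks ties between equal ranks.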
Now a standard weighted bipartite matching algorithm will pick edges that pair each applicant with exactly one seat. The set of edges it picks will have the minimum possible sum. (Some implementations are wired to find the maximum. Then you'll want to negate the edge weights.)
The Hungarian Algorithm is the classical solution. If I recall correctly, a good implementation is O(n^3) in the number of vertices. You could look for a library or implement it yourself. It's not trivial, but not terrible either.
I actually used this algorithm to match students with desired courses in an academic setting. It worked well. I added some parameters to use weights more sophisticated than consecutive preferences. E.g. if I wanted to favor giving a maximum number of students their first preference by causing others to get less desirable assignments, I could try weights something like 1,5,6,7... for priorities 1,2,3,4... Or if I wanted to cause all students to get something in their top 3, I'd use something like 1,2,3,10,.... It was great to be able to explain the algorithm and say with perfect confidence that the assignment was optimal.
Edit
You found an implementation that requires a square matrix as input. This is less efficient than one that allows rectangles, but it still ought to be fine. It handles N things to be matched with N other things, i.e. the matching will be perfect.
Let the columns of the matrix be the seats in these topics. So the first three will be for "Women rights", the next 5 will be for "Chelter", etc. That makes 56 in all (3+5+7+17+13+8+2+1).
The rows will be people. There are only 5, so what to do with the other 51? Fill these in with very high equal weights. I think 1e6 will work. The algorithm will be sure to use the much lower weighted edges based on priority first. These will all connect to "real" people. The 51 "pseudo people" will be matched last with unused topic seats. Those matches will be meaningless and ignored.
For the weight in position [person P, seat in Topic T], use (priority * 10000 + rankInTopic), where priority and rank are what P has for T. If P has expressed no preference for T, then use a huge number meaning "infinity". Something like a million ought to work.
Pass this to the algorithm. The return value should be an array with 56 elements. The first five elements will show which seat each person has been matched with. From those you can determine which topic they're in. The rest will contain matches with arbitrary unused seats.
PHP is a poor-performing environment for this kind of problem (unless you're compiling to native code somehow, and even then). Also this implementation is going to waste memory and time computing matches that aren't needed. The Java implementation mentioned in the Wikipedia page accepts a rectangular array, so is more suitable for this task. It would probably take a couple of careful hours to transcribe the Java to PHP.
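To make the rectangular variant concrete, here is a hedged Python sketch (not the Java or PHP implementation mentioned above). It uses a textbook potentials-based Hungarian routine that accepts a matrix with at most as many rows as columns, so no "pseudo people" are needed. The names hungarian, assign and INFINITY are made up for this example; the weight formula is the priority * 10000 + rankInTopic one from this edit.

```python
TOPICS = {"Women rights": 3, "Chelter": 5, "Charity": 7, "Training": 17,
          "Child care": 13, "Nursing": 8, "Garbage collection": 2, "Managing": 1}

# applicant -> {topic: (priority, rankInTopic)}, copied from the question
APPLICATIONS = {
    "Alan": {"Charity": (1, 1), "Chelter": (2, 3),
             "Garbage collection": (3, 1), "Managing": (4, 12)},
    "David": {"Women rights": (4, 2), "Chelter": (3, 2),
              "Garbage collection": (1, 3), "Managing": (2, 9),
              "Nursing": (5, 3), "Charity": (6, 3)},
    "Sonia": {"Chelter": (2, 1), "Training": (1, 5)},
    "Robert": {"Garbage collection": (6, 2), "Child care": (3, 2),
               "Women rights": (1, 1), "Managing": (2, 4),
               "Nursing": (5, 1), "Charity": (4, 5)},
    "Diana": {"Child care": (1, 1)},
}
INFINITY = 10**6  # weight for "no preference expressed" (assumed large enough)


def hungarian(cost):
    """Min-cost assignment for an n x m matrix with n <= m.

    Returns a list where entry i is the column matched to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0] * (n + 1), [0] * (m + 1)      # row and column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)    # p[j] = row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:                           # grow an augmenting path
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                             # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match


def assign(topics, applications):
    """Expand topics into seats, build the weight matrix, return person -> topic."""
    seats = [topic for topic, capacity in topics.items() for _ in range(capacity)]
    people = list(applications)
    cost = [[applications[person][seat][0] * 10000 + applications[person][seat][1]
             if seat in applications[person] else INFINITY
             for seat in seats]
            for person in people]
    columns = hungarian(cost)
    return {person: seats[col]
            for person, row, col in zip(people, cost, columns)
            if row[col] < INFINITY}


result = assign(TOPICS, APPLICATIONS)
print(result)
```

On this data every applicant's first priority is a different topic, so each of them should simply get that topic; contention only shows up once several people want the same scarce seats.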
I have two sets of points plotted in a coordinate system. Each point in a set must be matched to at least one point at the other set, in a way that the sum of the length of the lines drawn by joining those points should be as low as possible. To make it clear, line drawing is just an abstraction, the actual output is just the pairs of points that must be matched.
I've seen this question about a similar problem, except that in my case there's no single-link restriction since the sets may have different sizes. Is there any kind of problem that describes this situation? More specifically, what algorithm could I use to solve this, assuming each set may have a maximum of 10 points?
Algorithm
You can model this as a network flow problem.
By having a source of 1 at each point in the first set, and a sink of 1 at each point in the second set, plus an extra node 'dest' for any left over capacity, any valid flow will always connect every point.
Make edges between the points with cost according to the distance between the points.
So far we have a network whose solution will be the lowest cost matching of set 1 to set 2 (i.e. each point will have a single link).
To allow multiple links you can simply make the following additions:
add 0 weight edges between each point in set2 and 'dest' (this allows points in set 2 to be multiply connected)
add 0 weight edges between 'dest' and each point in set1 (this allows points in set 1 to be multiply connected)
Example Python code using NetworkX
import networkx as nx
import random

G = nx.DiGraph()
set1 = ['A','B','C','D','E','F','G','H','I']
set2 = ['a','b','c']
# Assume len(set1) >= len(set2) (otherwise swap the sets)
assert len(set1) >= len(set2)
# 'dest' absorbs the surplus units of flow
G.add_node('dest', demand=len(set1)-len(set2))
for person in set1:
    G.add_node(person, demand=-1)  # each person supplies one unit
    G.add_edge('dest', person, weight=0)
    for project in set2:
        cost = random.randint(1,10)  # Assign appropriate costs here
        G.add_edge(person, project, weight=cost)  # Edge taken if person does this project
for project in set2:
    G.add_node(project, demand=1)  # each project absorbs one unit
    G.add_edge(project, 'dest', weight=0)
flowdict = nx.min_cost_flow(G)
for person in set1:
    for project, flow in flowdict[person].items():
        if flow:
            print(person, '->', project)
You can use a discrete optimization approach (Integer Programming).
We have two sets A, of size X, and B, of size Y. This means a maximum of X*Y links, each described by a boolean variable: L(i,j) = L(Y*i+j) is 1 if nodes A(i) and B(j) are linked, 0 if not. If X = Y = 10, we can write link L(7,3) as L73.
We can rewrite the problem like this:
Node A(i) has at least one link: X (say, ten) criteria with i from 0 to X-1, each of them comprised of Y components:
L(i,0)+L(i,1)+L(i,2)+...+L(i,Y-1) >= 1
Node B(j) has at least one link, and there are Y criteria made up of X components:
L(0,j)+L(1,j)+L(2,j)+...+L(X-1,j) >= 1
The minimal cost requirement becomes:
cost = SUM(C(i,j)*L(i,j)) = C(0,0)*L(0,0)+C(0,1)*L(0,1)+...+C(X-1,Y-1)*L(X-1,Y-1)
With these conventions, we can easily build the matrices for an ILP problem, that can be passed to our favorite ILP solving package or library (C, Java, Python, even PHP).
====
A self-contained "greedy" algorithm which is not guaranteed to find a minimum, but is reasonably quick and should give reasonable results unless you feed it a pathological data set, is:
- connect all points in the smaller set, each to its nearest point in the other set.
- connect all unconnected points remaining in the larger set, each to its nearest point in the smaller set, whether it's already connected or not.
As an optimization, you can then enumerate the points in the larger data set; if one of them (say A) is singly connected to a point in the first data set (say B) which is multiply connected, and is not its nearest neighbour C, you can switch the link from A-B to A-C. This takes care of one of the simplest problems that may arise from the "greediness" of the algorithm.
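The two greedy steps can be sketched in Python roughly as follows. This is only an illustration: the function name greedy_match and the sample coordinates are invented here, points are assumed to be (x, y) tuples, and the optional link-switching optimization is left out.

```python
import math


def greedy_match(small, large):
    """Return a set of (i, j) pairs linking small[i] to large[j].

    Greedy heuristic: not guaranteed to minimise the total length.
    """
    links = set()
    # Step 1: each point of the smaller set links to its nearest point
    # in the larger set.
    for i, point in enumerate(small):
        j = min(range(len(large)), key=lambda k: math.dist(point, large[k]))
        links.add((i, j))
    # Step 2: each still-unlinked point of the larger set links to its
    # nearest point in the smaller set, already connected or not.
    linked = {j for _, j in links}
    for j, point in enumerate(large):
        if j not in linked:
            i = min(range(len(small)), key=lambda k: math.dist(small[k], point))
            links.add((i, j))
    return links


small = [(0, 0), (10, 0)]
large = [(1, 0), (9, 0), (2, 0)]
links = greedy_match(small, large)
print(sorted(links))
```

With at most 10 points per set this runs instantly, so it can also serve as a quick baseline to compare against an exact flow or ILP solution.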
Suppose I have a graph with 2^N - 1 nodes, numbered 1 to 2^N - 1. Node i "depends on" node j if all the bits in the binary representation of j that are 1, are also 1 in the binary representation of i. So, for instance, if N=3, then node 7 depends on all other nodes. Node 6 depends on nodes 4 and 2.
The problem is eliminating nodes. I can eliminate a node if no other nodes depend on it. No nodes depend on 7; so I can eliminate 7. After eliminating 7, I can eliminate 6, 5, and 3, etc. What I'd like is to find an efficient algorithm for listing all the possible unique elimination paths. (that is, 7-6-5 is the same as 7-5-6, so we only need to list one of the two). I have a dumb algorithm already, but I think there must be a better way.
I have three related questions:
Does this problem have a general name?
What's the best way to solve it?
Is there a general formula for the number of unique elimination paths?
Edit: I should note that a node cannot depend on itself, by definition.
Edit2: Let S = {s_1, s_2, s_3,...,s_m} be the set of all m valid elimination paths. s_i and s_j are "equivalent" (for my purposes) iff the two eliminations s_i and s_j would lead to the same graph after elimination. I suppose to be clearer I could say that what I want is the set of all unique graphs resulting from valid elimination steps.
Edit3: Note that elimination paths may be different lengths. For N=2, the 5 valid elimination paths are (),(3),(3,2),(3,1),(3,2,1). For N=3, there are 19 unique paths.
Edit4: Re: my application - the application is in statistics. Given N factors, there are 2^N - 1 possible terms in a statistical model (see http://en.wikipedia.org/wiki/Analysis_of_variance#ANOVA_for_multiple_factors) that can contain the main effects (the factors alone) and various (2,3,... way) interactions between the factors. But an interaction can only be present in a model if all sub-interactions (or main effects) are present. For three factors a, b, and c, for example, the 3 way interaction a:b:c can only be present if all the constituent two-way interactions (a:b, a:c, b:c) are present (and likewise for the two-ways). Thus, the model a + b + c + a:b + a:b:c would not be allowed. I'm looking for a quick way to generate all valid models.
It seems easier to think about this in terms of sets: you are looking for families of subsets of {1, ..., N} such that for each set in the family, all of its subsets are also present. Each such family is determined by its inclusion-wise maximal sets, none of which may contain another. Families of sets in which no set contains another are called Sperner families (or antichains). So you are looking for Sperner families, plus all the subsets of the sets in the family. Known algorithms for enumerating Sperner families, or antichains in general, may be useful; without knowing what you actually want to do with them, it's hard to tell.
Thanks to @FalkHüffner's answer, I saw that what I wanted to do was equivalent to finding monotonic Boolean functions for N arguments. If you look at the figure on the Wikipedia page for Dedekind numbers (http://en.wikipedia.org/wiki/Dedekind_number), it expresses the problem graphically. There is an algorithm for generating monotonic Boolean functions (http://www.mathpages.com/home/kmath094.htm) and it is quite simple to construct.
For my purposes, I use the algorithm, then eliminate the first column and last row of the resulting binary arrays. Starting from the top row down, each row has a 1 in the ith column if one can eliminate the ith node.
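For small N the valid elimination sets can also be checked by brute force. The Python sketch below is not the mathpages algorithm referenced above; it simply tries every subset of nodes and keeps those that are closed under "depends on". The function name elimination_sets is made up for this example, and the approach is only practical for roughly N <= 4.

```python
def elimination_sets(n):
    """List every set of nodes that can be eliminated for n factors.

    A set is valid when, for each removed node j, every node i that
    depends on j (all 1-bits of j are also set in i) is removed too.
    """
    nodes = range(1, 2 ** n)
    results = []
    for mask in range(2 ** len(nodes)):          # every subset of the nodes
        removed = {i for i in nodes if mask >> (i - 1) & 1}
        if all(i in removed
               for j in removed
               for i in nodes
               if i & j == j):                   # i depends on j (or i == j)
            results.append(sorted(removed, reverse=True))
    return results


for n in (2, 3):
    print(n, len(elimination_sets(n)))
```

The counts can be compared with the 5 and 19 quoted in the question, which gives a quick sanity check for any faster generator.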
Thanks!
You can build a "heap", in which at depth X are all the nodes with X zeros in their binary representation.
Then, starting from the bottom layer, connect each item to a random parent at the layer above, until you get a single-component graph.
Note that this graph is a tree, i.e., each node except for the root has exactly one parent.
Then, traverse the tree (starting from the root) and count the total number of paths in it.
UPDATE:
The method above is bad, because you cannot just pick a random parent for a given item - you have a limited number of items from which you can pick a "legal" parent... But I'm leaving this method here for other people to give their opinion (perhaps it is not "that bad").
In any case, why don't you take your graph, extract a spanning tree (you can use Prim's algorithm or Kruskal's algorithm to find a minimum spanning tree), and then count the number of paths in it?
I have a graph-theoretic (which is also related to combinatorics) problem that is illustrated below, and wonder what is the best approach to design an algorithm to solve it.
Given 4 different graphs of 6 nodes (by different, I mean different structures, e.g. STAR, LINE, COMPLETE, etc), and 24 unique objects, design an algorithm to assign these objects to these 4 graphs 4 times, so that the number of repeating neighbors on the graphs over the 4 assignments is minimized. For example, if object A and B are neighbors on 1 of the 4 graphs in one assignment, then in the best case, A and B will not be neighbors again in the other 3 assignments.
Obviously, the degree to which such minimization can go is dependent on the specific graph structures given. But I am more interested in a general solution here so that given any 4 graph structures, such minimization is guaranteed as the result of the algorithm.
Any suggestion/idea of solving this problem is welcome, and some pseudo-code may well be sufficient to illustrate the design. Thank you.
Representation:
You have 24 elements; I will name them A to X (the first 24 letters).
Each element occupies one place in one of the 4 graphs. I number the 24 nodes of the 4 graphs from 1 to 24.
I identify the position of A by a 24-tuple (xa1, xa2, ..., xa24). For example, to assign A to node number 8, I write (xa1, xa2, ..., xa24) = (0,0,0,0,0,0,0,1,0,...,0), where the 1 is in position 8.
We can say that A = (xa1, ..., xa24).
e1, ..., e24 are the unit vectors (1,0,...,0) to (0,0,...,1).
Note about the operator '.' (dot product):
A.e1 = xa1
...
X.e24 = xx24
With this notation there are some constraints on A, ..., X:
each xij is in {0,1}
and
Sum(xai) = 1 ... Sum(xxi) = 1
Sum(xa1, xb1, ..., xx1) = 1 ... Sum(xa24, xb24, ..., xx24) = 1
since each element is assigned to exactly one node and each node holds exactly one element.
I define a graph by giving the neighbor relation of each node; let's say node 8 has neighbors node 7 and node 10.
To check that A and B are neighbors with A on node 8, for example, I need:
A.e8 = 1 and (B.e7 = 1 or B.e10 = 1), so I just need A.e8*(B.e7+B.e10) == 1
In the function isNeighborInGraphs(A,B) I test that for every node, and I get one or zero depending on whether A and B are neighbors.
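A minimal Python sketch of this neighbor test, using plain lists as the 24-component position vectors. The helper names (unit, dot, is_neighbor_in_graphs) and the tiny NEIGHBORS table are assumptions for illustration; a real run would list the adjacency of all 24 nodes.

```python
N_NODES = 24


def unit(k, n=N_NODES):
    """Unit vector e_k (1-based): the position vector of an element on node k."""
    return [1 if i == k else 0 for i in range(1, n + 1)]


def dot(a, b):
    """The '.' operator from the text."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical neighbor relation for a few nodes only (node -> its neighbors).
NEIGHBORS = {7: [8], 8: [7, 10], 10: [8]}


def is_neighbor_in_graphs(a, b, neighbors=NEIGHBORS):
    """Sum over nodes k of a.e_k * (b.e_m1 + b.e_m2 + ...): 1 if adjacent, else 0."""
    return sum(dot(a, unit(k)) * sum(dot(b, unit(m)) for m in adjacent)
               for k, adjacent in neighbors.items())


A = unit(8)
print(is_neighbor_in_graphs(A, unit(7)))   # B placed on node 7
print(is_neighbor_in_graphs(A, unit(9)))   # B placed on node 9
```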
Notations:
4 graphs of 6 nodes; the position of each element is defined by an integer from 1 to 24
(1 to 6 for the first graph, etc.).
e1, ..., e24 are the unit vectors (1,0,0,...,0) to (0,0,...,1).
Let A, B, ..., X be the N elements.
A=(0,0,...,1,...,0)=(xa1,xa2,...,xa24)
B=...
...
X=(0,0,...,1,...,0)
Graph descriptions:
isNeighborInGraphs(A,B)=A.e1*B.e2+...
// if 1 and 2 are neighbors in one graph, for example
State of the system:
L(A)=[B,B,C,E,G...] // list of neighbors of A (can repeat)
actualise(L(A)):
    for Element in [B, ..., X]
        if isNeighborInGraphs(A,Element)
            L(A).append(Element)
        endIf
    endFor
Objective functions
N(A)=len(L(A))+Sum(isNeighborInGraphs(A,i), i in L(A))
...
N(X)= ...
Description of the algorithm
1. Start with an initial position:
A=e1 ... X=e24
2. Update (actualise) L(A), L(B), ..., L(X).
3. Solve this with a solver (AMPL, for example, should work, I guess, since it's a nonlinear optimization problem):
Objective function:
min(Sum(N(Z), Z=A to X))
Constraints:
Sum(xai)=1 ... Sum(xxi)=1
Sum(xa1,xb1,...,xx1)=1 ...
Sum(xa24,xb24,...,xx24)=1
You get the best solution.
4. Repeat steps 2 and 3 three more times.
If all four graphs are K_6, then the best you can do is choose 4 set partitions of your 24 objects into 4 sets each of cardinality 6 so that the pairwise intersection of any two sets has cardinality at most 2. You can do this by choosing set partitions that are maximally far apart in the Hasse diagram of set partitions with partial order given by refinement. The general case is much harder, but perhaps you can still begin with this crude approximation of a solution and then be clever with which vertex is assigned which object in the four assignments.
Assuming you don't want to enumerate all combinations, compute the sum each time, and choose the lowest, you can formulate a minimization problem over your 24 variables, with constraints that depend on the shape of your path. Depending on those constraints, you can solve it with either a linear programming solver (i.e. a simplex algorithm engine) or a non-linear solver, which is much more expensive in running time. You can also use free software like LINGO/LINDO to quickly build a decision theory model and test its correctness (you need some decision theory background, though).
If this has anything to do with the real world, then it's unlikely that you absolutely must have a solution that is the true minimum. Close to the minimum should be good enough, right? If so, you could repeatedly randomly make the 4 assignments and check the results until you either run out of time or have a good-enough solution or appear to have stopped improving your best solution.
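A rough Python sketch of that random-restart idea follows. Everything specific in it is an assumption for illustration: the four example graphs (star, line, cycle, complete) laid out on slots 0..23, the scoring function repeated_pairs, and the fixed number of tries standing in for "until you run out of time".

```python
import random
from collections import Counter
from itertools import combinations

# Hypothetical example graphs on slots 0..23, six slots each.
STAR = [(0, k) for k in range(1, 6)]
LINE = [(k, k + 1) for k in range(6, 11)]
CYCLE = [(12 + k, 12 + (k + 1) % 6) for k in range(6)]
COMPLETE = list(combinations(range(18, 24), 2))
EDGES = STAR + LINE + CYCLE + COMPLETE


def repeated_pairs(edges, assignments):
    """Count how many times an already-seen neighbor pair shows up again."""
    seen = Counter()
    for placement in assignments:          # placement[slot] = object id
        for u, v in edges:
            seen[frozenset((placement[u], placement[v]))] += 1
    return sum(count - 1 for count in seen.values() if count > 1)


def random_search(edges, n_objects=24, rounds=4, tries=200, seed=0):
    """Draw random sets of assignments and keep the best-scoring one."""
    rng = random.Random(seed)
    best, best_score = None, None
    for _ in range(tries):
        candidate = [rng.sample(range(n_objects), n_objects) for _ in range(rounds)]
        score = repeated_pairs(edges, candidate)
        if best_score is None or score < best_score:
            best, best_score = candidate, score
    return best, best_score


best, score = random_search(EDGES)
print("repeated neighbor pairs in best found:", score)
```

A natural refinement, still in the spirit of "good enough", would be to improve the best candidate by swapping two objects at a time instead of redrawing everything.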
I am trying to enumerate a number of failure cases for a system I am working on to make writing test cases easier. Basically, I have a group of "points" which communicate with an arbitrary number of other points through data "paths". I want to come up with failure cases in the following three sets...
Set 1 - Break each path individually (trivial)
Set 2 - For each point P in the system, break paths so that P is completely cut off from the rest of the system (also trivial)
Set 3 - For each point P in the system, break paths so that the system is divided into two groups of points (A and B, excluding point P) so that the only way to get from group A to group B is through point P (i.e., I want to force all data traffic in the system through point P to ensure that it can keep up). If this is not possible for a particular point, then it should be skipped.
Set 3 is what I am having trouble with. In practice, the systems I am dealing with are small and simple enough that I could probably "brute force" a solution (generally I have about 12 points, with each point connected to 1-4 other points). However, I would be interested in finding a more general algorithm for this type of problem, if anyone has any suggestions or ideas about where to start.
Here's some pseudocode, substituting the common graph theory terms "nodes" for "points" and "edges" for "paths", assuming a path connects two points.
for each P in nodes:
    for each subset A in nodes - {P}:
        B = nodes - A - {P}
        for each node in A:
            for each edge out of node:
                if the other end is in B:
                    break edge
        run test
        replace edges if necessary
Unless I'm misunderstanding something, the problem seems relatively simple as long as you have a method of generating the subsets of nodes-{P}. This will test each partition [A,B] twice unless you put some other check in there.
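One possible way to generate the subsets, sketched in Python. The generator name cut_sets and the sample graph are invented for this example. Pinning one node into A is a simple extra check that yields each partition {A, B} only once. The sketch does not verify that P stays connected to both groups, which the "skip if not possible" rule in the question would still require.

```python
from itertools import combinations


def cut_sets(nodes, edges, p):
    """Yield (A, B, edges_to_break) for each split of nodes - {p} into two groups."""
    others = sorted(n for n in nodes if n != p)
    if len(others) < 2:
        return                               # nothing to separate
    first, rest = others[0], others[1:]
    # `first` always goes into A, so {A, B} and {B, A} are not both produced.
    for size in range(len(rest)):            # leaves at least one node for B
        for extra in combinations(rest, size):
            a = {first, *extra}
            b = set(others) - a
            to_break = [(u, v) for u, v in edges
                        if (u in a and v in b) or (u in b and v in a)]
            yield a, b, to_break


nodes = {1, 2, 3, 4}
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
cases = list(cut_sets(nodes, edges, p=1))
for a, b, to_break in cases:
    print(sorted(a), sorted(b), to_break)
```

For about 12 points this gives 2^10 - 1 = 1023 partitions per point P, which is well within brute-force range.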
There are general algorithms for 'coloring' (with or without a u depending on whether you want UK or US articles) networks. However this is overkill for the relatively simple problem you describe.
Simply divide the nodes between two sets a and b, then in pseudo-code:
foreach Node n in a.Nodes
    foreach Edge e in n.Edges
        if e.otherEnd in b then
            e.break()
            broken.add(e)
broken.get(rand(broken.size())).reinstate()
Either use rand to choose a broken link to reinstate, or systematically reinstate one at a time.
Repeat for b (or structure your edges such that a break in one direction affects the other).