I'm Student of software engineering,
Right now I am working for my final project, scheduling Business matchmaking on a trade day.
The idea is to bring a seller (developer) and a buyer (A person with financial means) together. The algorithm should be like "Speed Dating".
Let's say I have 15 tables and 10 sessions.
It means that each session 15 buyers will meet 15 sellers for 20 minutes.
My question is how do I make the matching?
Suppose each person has 8 attribute that characterize him.
• I thinking creating bipartite graph (group A – Sellers, group B - Buyers)
• Then link up between a seller and buyer based on similar attributes (Should consider what is level of error). dont want to bring together people who are not related
• Then on each session look for a maximum matching.
Constraints: it's not a real time, I'll close registration a few days before the event.
I'm currently "idea blocked" on how to do the linking step (base on a person attributes).
I would appreciate your help,
Even a dialogue on the matter would help me a lot!:)
Often given multi-dimensional data that describe data points, you define a similarity or "kernel" between points. This could be the e.g. dot product after you normalize by standard deviation in each dimension for example. Or it could be a Gaussian kernel e^((-d^2)/y) where d is the dot-product between points and y is a constant bandwidth parameter. Also e.g. if certain dimensions are categorical then you could the one-dimensional dot-product to be 1 if the categorical variables agree, otherwise 0. Then you can form the overall dot-product from the multi-dimensional data after normalizing each dimension by its standard deviation. The point is, once you form a similarity or kernel between points, then you can define a weighted bipartite graph where the weight of an edge is equal to the similarity/kernel between points, and your problem is to find a maximum weight matching. This is a well-known problem with solutions in the literature e.g. the Hungarian algorithm, see e.g. http://en.wikipedia.org/wiki/Matching_%28graph_theory%29#In_weighted_bipartite_graphs .
Related
Problem: I need to drop (n) employees from office to their homes(co-ordinates available). I have (x) 7-seater & (y) 4-seater cabs available.
I have to design an algorithm to drop all the employees to their homes while travelling minimum distance.
Also, the algorithm must tell me how many 7-seater or/and 4-seater vehicles I must choose so as to travel minimum distance.
eg. If I have 15 employees then the algorithm may tell me to use 1 (7-seater) cab & 2 (4-seater) cab & have the employees in each cab as following:
[(E2, E4, E6, E8), (E1, E3, E5, E7, E9, E10, E12), (E11, E13, E14, E15)]
Approach: I'm thinking of this as a Travelling Salesman Problem with multiple salesmen with an upper limit on number of cities each can travel. Also salesmen do not need to come back to the origin. Ant's colony problem came to my mind, but I can't really choose wisely which algorithm to choose
Requirement: I really need the ALGORITHM. Either TSP or Ant's colony, doesn't matter. I'll welcome opinions, but I really need the ALGORITHM.
This is a cost minimization problem, not a travelling salesman problem. It is related to TSP in the sense that TSP is a very specific cost minimization problem.
The solution consists of three steps:
Generate a list of employee drop-off points (nodes)
Create distinct paths that do not intersect, nor branch. These will be your routes and help prevent wasteful route overlaps. Use cost(path) = distance(furthest node and origin) + taxi_cost(nodes) + sum(distance between nodes) to compare paths and/or brute-force all potential networks. Networks are layouts of paths. DO NOT BRANCH THE PATHS!!
Total distance is a line of defense against waste ensuring that routes are not too long.
Sum of distances helps the algorithm converge on neighbourhoods where many employees live (when possible).
Because this variation of the coin problem allows imperfect solutions, it reduces to a variant of the Knapsack Problem. The utility of each taxi is capacity. If you also wish to choose the cheapest way to transport your employees, utility(taxi) = capacity/cost. From this our simplest solution is to be greedy; who cares about empty space? If you really care about filling up taxis perfectly (as opposed to cost efficiently), you'll need a much more complex solution. You only specify the least distance as your metric (with each additional taxi multiplying cost). I assume this is a proxy to say 'I don't want to pay too much'.
Therefore: taxi_cost(nodes) = math.floor(amount(nodes)/max(utility(taxis)+1). This equation selects the cheapest, roomiest taxi, and figures out how many of them are required to fully service the route.
Be sure to calculate the cost of each network you examine as sum(cost(path))
Once you've found the cheapest network to service, for each path in the chosen network:
make a list of employees travelling to the furthest node
fill the preferred taxi with those employees
repeat with the next furthest node until you have a full taxi, then add the filled taxi to the list. If you run out of employees, you've finished assigning taxis to the route. (The benefit of furthest-first selection is that you can ask employees in unfilled taxis to walk if that part of the route is within blocks of the office).
The algorithm above is not perfect, but it will have many desirable tendencies.
routes will be as short as possible and cover the greatest possible area (by not looping or branching)
routes will tend to service neighbourhoods, rather than trying to overlap responsibilities. This part of the algorithm isn't optimal, but is effective. This makes it really easy to remove service routes without needing to recalculate the transportation network.
the taxis chosen will be cost-efficient, helping to avoid paying more than necessary.
routes will use as few taxis as possible, taking into account the relative cost of upgrading to roomier ones with higher capacity
because the taxis travelling furthest will be full, it has less of an impact on your employee's ability to get to work if you decide to cancel service to emptier taxis.
Every step closer to perfection costs you many times more than the previous step, so diminished returns are acceptable if the solution provide desirable features. Although the algorithm makes some potentially sub-optimal tradeoffs, they come with huge value; your network of taxi routes becomes much easier to modify.
If you'd like to make an optimal solution, the Knapsack Problem, Coin Problem, and Change-making Problem help determine the cost of taxis and routes.
Spanning Trees are the most effective way to determine routes. Center the spanning tree at the office and calculate the cost of each branch as the maximum distance from the office. Try to keep each branch servicing areas with high density to make it easier to add and remove taxi routes.
Studying pathfinding can help you learn how to determine good cost functions so that you can numerically compare different potential paths. Remember that your network consists of a set of paths, but will require its own cost function so that you can compare different layouts.
I've written an in-depth guide to pathfinding for this answer. Pathfinding articles are few and just don't go into enough depth for a lot of problem spaces. A good cost function can get you a nearly perfect solution if you have multiple priorities. Unfortunately, good cost functions are domain specific so you will need to identify them yourself. Feel free to message me if you aren't sure how to make a path with certain traits and I'll help you figure out a good cost function.
It's a constraint satisfaction problem, not really a TSP. If you're looking for a project that may be able to help you, you could look into cspdb, which is what I wrote some time ago:
https://github.com/gtoonstra/cspdb
You'd be using a database in the backend that maintains the state and write a couple of scripts in it's own grammar that manipulates that state. A couple of examples are included to solve nqueens and classroom scheduling with multiple constraints.
From a list d destinations you can make the array of pairwise travel costs c. where c[a,b] is the travel cost from a to b...
Now you have a start point p. add to the array c2 values for p to each point in d.
So you now have the concept of groups.
You can look at this as a greedy algorithm. Given the list c2 you can take the cheapest option given your state.
Your state is the vector all you cab vectors (the costs of getting from where ever they are to where ever they could go next) .* the assignment vector where k == 0 for k in the. You find the minimum option given your state (considering adding another onerous to a cab of 4 costs the difference between the 4 person cab and the 7 person cab and adding a person to a zero person cab or adding a new cab also has a cost. Once all your people have been assigned to their cabs you have an answer.
The Idea of a greedy algorithm is most often characterized by the backpack problem but it can also be implemented for statistical methods such as feature selection.
Like #Aaron3468 this approach is not perfect and does not guarantee the best solution.
If you want the best possible solutions you can iterate through all the combinations but this becomes impractical quickly.
From my point of view your algorithm should solve 2 problems: the number of cars of each type and the shortest distance (how you number your employees depends on you or you should give more details). Sorry I'm a using a phone and I don't have have all features of the site.
For the number of cars you can use below algorithm. To solve issues related to distances, you should give more info about the paths and their lengths. A graph algorithm may then be combined with this to do the trick. Here 11=7+4.
Begin
Integer temp:= n/11
Integer rem:= n mod 11
If rem=0
x:=temp
y:=temp
Else if rem<=4
x:=temp
y:=temp+1
Else if rem<=7
x:=temp+1
y:=temp
Else
x:=temp+1
y:=temp+1
Endif
End
I'm not a trained statistician so I apologize for the incorrect usage of some words. I'm just trying to get some good results from the Weka Nearest Neighbor algorithms. I'll use some redundancy in my explanation as a means to try to get the concept across:
Is there a way to normalize a multi-dimensional space so that the distances between any two instances are always proportional to the effect on the dependent variable?
In other words I have a statistical data set and I want to use a "nearest neighbor" algorithm to find instances that are most similar to a specified test instance. Unfortunately my initial results are useless because two attributes that are very close in value weakly correlated to the dependent variable would incorrectly bias the distance calculation.
For example let's say you're trying to find the nearest-neighbor of a given car based on a database of cars: make, model, year, color, engine size, number of doors. We know intuitively that the make, model, and year have a bigger effect on price than the number of doors. So a car with identical color, door count, may not be the nearest neighbor to a car with different color/doors but same make/model/year. What algorithm(s) can be used to appropriately set the weights of each independent variable in the Nearest Neighbor distance calculation so that the distance will be statistically proportional (correlated, whatever) to the dependent variable?
Application: This can be used for a more accurate "show me products similar to this other product" on shopping websites. Back to the car example, this would have cars of same make and model bubbling up to the top, with year used as a tie-breaker, and then within cars of the same year, it might sort the ones with the same number of cylinders (4 or 6) ahead of the ones with the same number of doors (2 or 4). I'm looking for an algorithmic way to derive something similar to the weights that I know intuitively (make >> model >> year >> engine >> doors) and actually assign numerical values to them to be used in the nearest-neighbor search for similar cars.
A more specific example:
Data set:
Blue,Honda,6-cylinder
Green,Toyota,4-cylinder
Blue,BMW,4-cylinder
now find cars similar to:
Blue,Honda,4-cylinder
in this limited example, it would match the Green,Toyota,4-cylinder ahead of the Blue,Honda,6-cylinder because the two brands are statistically almost interchangeable and cylinder is a stronger determinant of price rather than color. BMW would match lower because that brand tends to double the price, i.e. placing the item a larger distance.
Final note: the prices are available during training of the algorithm, but not during calculation.
Possible you should look at Solr/Lucene for this aim. Solr provides a similarity search based field value frequency and it already has functionality MoreLikeThis for find similar items.
Maybe nearest neighbor is not a good algorithm for this case? As you want to classify discrete values it can become quite hard to define reasonable distances. I think an C4.5-like algorithm may better suit the application you describe. On each step the algorithm would optimize the information entropy, thus you will always select the feature that gives you the most information.
Found something in the IEEE website. The algorithm is called DKNDAW ("dynamic k-nearest-neighbor with distance and attribute weighted"). I couldn't locate the actual paper (probably needs a paid subscription). This looks very promising assuming that the attribute weights are computed by the algorithm itself.
Edit: to include concrete explanation of my problem (as correctly deduced by Billiska):
"Set A is the set of users. set B is the set of products. each user rates one or more products. the rating is 1 to 10. you want to deduce for each user, who is the other user that has the most similar taste to him."
"The other half is choosing how exactly do you want to rank similarity of A-elements." - this is also part of my problem. I feel that users who have rated similarly across the most products have the closed affinity, but at the same time I want to avoid user1 and user2 with many mediocre matches being matched ahead of user1 and user3 who have just a few very good matches (perhaps I need a non-linear score).
Disclaimer: I have never used a graph database.
I two sets of data A and B. A has a relationship with zero to many Bs. Each relationship has a fixed value.
e.g.
A1--5-->B10
A1--1-->B1000
So my initial thought "Yay, thats a graph, time to learn about graph databases!" but before I get too carried away.... the only reason for doing this so that I can answer the question....
For each A find the set of As that are most similar based on their weights, where I want to take in to consideration
the difference in weights (assuming 1 to 10) so that 10 and 10 is scored higher than 10 and 1; but then I have an issue with how to handle where is is no pairing (or do I - I am just not sure)
the number of vertices (ignoring weights) that two sets have in common. Intention is to rank two As with lots of vertices to the same Bs higher than two As that have just a single matching vertices.
What would the best approach be to doing this?
(Supplementary - as I realise this may count a second question): How would that approach change if the set of A was in the millions and B in the 100 thousands and I needed real-time answers?
Not a complete answer. I don't fully understand the technique either. but I know it's very relevant.
If you view the data as a matrix. e.g. have the rows correspond to set A, have the columns correspond to set B, and the entries are the weight.
Then it's a matrix with some missing values.
One technique used in recommender system (under the category of collaborative filtering) is low-rank approximation.
It's based on the assumption that the said user-product rating matrix usually have low-rank.
In a rough sense, the said matrix have low-rank if the rows of many users could be expressed as linear combination of other users' row.
I hope this would give a start for further reading.
Yes, you could see in low-rank approximation wiki page that the technique can be used to guess the missing entries (the missing rating). I know it's a different problem, but related.
I am required to solve a specific problem.
I'm given a representation of a social network.
Each node is a person, each edge is a connection between two persons. The graph is undirected (as you would expect).
Each person has a personal "affinity" for buying a product (to simplify things, let's say there's only one product involved in this whole problem).
In each "step" in time, each person, independently, chooses whether to buy the product or not.
There's probability invovled here. A few parameters are taken into account:
His personal affinity for the product,
The percentage of his friends that already bought the product
The gain for a person buying the product is 1 dollar.
The problem is to point out X persons (let's say, 5 persons) that will receive the product in step 0, and will maximize the total expected value of the gain after Y steps (let's say, 10 steps)
The network is very large. It's not possible to simulate all the options in a naive way.
What tool / library / algorithm should I be using?
Thank you.
P.S.
When investigating this matter in google and wikipedia, a few terms kept popping up:
Dynamic network analysis
Epidemic model
but it didn't help me to find an answer
Generally, people who have the most neighbours have the most influence when they buy something.
So my heuristic would be to order people first by the number of neighbours they have (in decreasing order), then by the number of neighbours that each of those neighbours has (in order from highest to lowest), and so on. You will need at most Y levels of neighbour counts, though fewer may suffice in practice. Then simply take the first X people on this list.
This is only a heuristic, because e.g. if a person has many neighbours but most or all of them are likely to have already bought the product through other connections, then it may give a higher expectation to select a different person having fewer neighbours, but whose neighbours are less likely to already own the product.
You do not need to construct the entire list and then sort it; you can construct the list and then insert each item into a heap, and then just extract the highest-scoring X people. This will be much faster if X is small.
If X and Y are as low as you suggest then this calculation will be pretty fast, so it would be worth doing repeated runs in which instead of starting with the first X people owning the product, for each run you randomly select the initial X owners according to a probability that depends on their position in the list (the further down the list, the lower the probability).
Check out the concept of submodularity, a pretty powerful mathematical concept. In particular, check out slide 19, where submodularity is used to answer the question "Given a social graph, who should get free cell phones?". If you have access, also read the corresponding paper. That should get you started.
Hi I am building a program wherein students are signing up for an exam which is conducted at several cities through out the country. While signing up students provide a list of three cities where they would like to give the exam in order of their preference. So a student may say his first preference for an exam centre is New York followed by Chicago followed by Boston.
Now keeping in mind that as the exam centres have limited capacity they cannot accomodate each students first choice .We would however try and provide as many students either their first or second choice of centres and as far as possible avoid students having to give the third choice centre to a student
Now any ideas of a sorting algorithm that would mke this process more efficent.The simple way to do this would be to first go through the list of first choice of students allot as many as possible then go through the list of second choices and allot. However this may lead to the students who are first in the list getting their first centre and the last students getting their third choice or worse none of their choices. Anything that could make this more efficient
Sounds like a variant of the classic stable marriages problem or the college admission problem. The Wikipedia lists a linear-time (in the number of preferences, O(n²) in the number of persons) algorithm for the former; the NRMP describes an efficient algorithm for the latter.
I suspect that if you randomly generate preferences of exam places for students (one Fisher–Yates shuffle per exam place) and then apply the stable marriages algorithm, you'll get a pretty fair and efficient solution.
This problem could be formulated as an instance of minimum cost flow. Let N be the number of students. Let each student be a source vertex with capacity 1. Let each exam center be a sink vertex with capacity, well, its capacity. Make an arc from each student to his first, second, and third choices. Set the cost of first choice arcs to 0; the cost of second choice arcs to 1; and the cost of third choice arcs to N + 1.
Find a minimum-cost flow that moves N units of flow. Assuming that your solver returns an integral solution (it should; flow LPs are totally unimodular), each student flows one unit to his assigned center. The costs minimize the number of third-choice assignments, breaking ties by the number of second-choice assignments.
There are a class of algorithms that address this allocating of limited resources called auctions. Basically in this case each student would get a certain amount of money (a number they can spend), then your software would make bids between those students. You might use a formula based on preferences.
An example would be for tutorial times. If you put down your preferences, then you would effectively bid more for those times and less for the times you don't want. So if you don't get your preferences you have more "money" to bid with for other tutorials.