I am developing an application where I have to deal with an Entity named 'Skill'. Now the thing is that a 'Skill A' can have a certain relevancy with a 'Skill B' (the relevancy is used for search purposes). Similarly 'Skill B' can also be relevant to 'Skill C'. We currently have the following data model to represent this scenario
Skill {SkillId, SkillName}
RelevantSkill {SkillId, RelevantSkillId, RelevanceLevel}
Now given the above scenario we have an implicit relation between 'Skill A' and 'Skill C'. What would the optimal data model for this scenario be? We'd also have to traverse this hierarchy when performing search.
What you're asking for seems to be basically a graph distance algorithm (slash data structure) computed from a set of pairwise distances. A reasonable (and nicely computable) metric is commute time.
It can be thought of thus: construct a graph where each node is a Skill, and each edge represents the relevancy of the nodes it connects to each other. Now imagine that you're starting at some node in the graph (some Skill) and randomly jumping to other nodes along defined edges. Let's say that the probability of jumping from Skill A to Skill B is proportional to the relevancy of those skills to each other (normalized by the relevancy of those to other skills ...). Now the commute time represents the average number of steps it takes to make it from Skill A to Skill C.
This has a very nice property that adding more paths between two nodes makes the commute time shorter: if Skill A and B, B and C, C and D, and D and A are related, then the commute time between A and C will get shorter yet. Moreover, commute time can be computed quite easily using an eigenvalue decomposition of your sparsely connected Skill graph (I think the reference I gave you shows this, but if not there are many available).
If you want to actually store the commute time between any pair of Skills you'll need a fully-connected graph, or an NxN matrix (N is the number of Skills). A far nicer variant, however, is to drop all connections weaker than some threshold and store the resulting sparsely connected graph as rows in a database.
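As a rough sketch (not necessarily the exact procedure the links describe), commute times can be computed from the pseudoinverse of the graph Laplacian; this assumes numpy and a symmetric relevance matrix W:

import numpy as np

# Sketch: commute times from a weighted adjacency matrix W, where W[i, j] is
# the (symmetric) relevance between skill i and skill j. Uses the standard
# identity commute(i, j) = vol(G) * (L+[i, i] + L+[j, j] - 2 * L+[i, j]),
# where L+ is the Moore-Penrose pseudoinverse of the graph Laplacian.
def commute_times(W):
    d = W.sum(axis=1)            # weighted degrees
    L = np.diag(d) - W           # graph Laplacian
    L_pinv = np.linalg.pinv(L)   # pseudoinverse of the Laplacian
    vol = d.sum()                # total volume of the graph
    diag = np.diag(L_pinv)
    return vol * (diag[:, None] + diag[None, :] - 2 * L_pinv)

# Toy example: three skills, A-B strongly related, B-C weakly related.
W = np.array([[0.0, 5.0, 0.0],
              [5.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(commute_times(W))   # small commute time for (A, B), larger for (A, C)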
Good luck, and I hope this helped!
Something left open by your explanation is how the relevance levels are combined in the case of the indirect ("implicit") relationships. E.g. if skill A is relevant to B with level 3 and skill B is relevant to skill C with level 5, what is the level (as a number) of the indirect relevance of skill A to skill C?
The proper data model depends on two things: how many skills you have, and how dense the relationship structure is (dense = lots of skills are relevant to each other). If the relationship structure is dense and you have few skills (< 1000), you may be best off representing the whole thing as a matrix.
But if you have many skills but a sparse relationship structure you can represent it as three tables:
Skill {SkillId, SkillName}
RelevantSkill {SkillId, RelevantSkillId, RelevanceLevel}
IndirectRelevance { SkillId, RelevantSkillId, RelevanceLevel}
The third table (IndirectRelevance) is calculated based on the two primary tables; whenever you change Skill or RelevantSkill tables, you need to update the IndirectRelevance table.
I think it is better to have three tables than two; this makes the implementation cleaner and more straightforward. RelevantSkill contains the explicitly stated relationships; IndirectRelevance contains all derived facts.
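A minimal sketch of how the IndirectRelevance rows could be derived from RelevantSkill. The combination rule is an assumption on my part (the question leaves it open): levels are normalized to [0, 1], multiplied along a path, the best path wins, and anything below a threshold is discarded.

import heapq

def indirect_relevance(direct, max_level=10, threshold=0.1):
    # direct: {(skill, other): level} for the explicitly stated relationships
    graph = {}
    for (a, b), level in direct.items():
        w = level / max_level
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))

    explicit = {frozenset(pair) for pair in direct}
    rows = {}
    for start in graph:
        best = {start: 1.0}
        heap = [(-1.0, start)]
        while heap:
            neg_w, node = heapq.heappop(heap)
            w = -neg_w
            if w < best.get(node, 0.0):
                continue
            for nxt, edge_w in graph[node]:
                cand = w * edge_w
                if cand >= threshold and cand > best.get(nxt, 0.0):
                    best[nxt] = cand
                    heapq.heappush(heap, (-cand, nxt))
        for other, w in best.items():
            if other != start and frozenset((start, other)) not in explicit:
                rows[(start, other)] = round(w * max_level, 2)
    return rows

# Skill A -> B with level 3 and B -> C with level 5 yields an implicit A -> C.
print(indirect_relevance({("A", "B"): 3, ("B", "C"): 5}))
# {('A', 'C'): 1.5, ('C', 'A'): 1.5}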
Your best bet is to:
augment RelevantSkill with an ImplicitRelevance boolean column:
RelevantSkill {SkillId, RelevantSkillId, RelevanceLevel, ImplicitRelevance}
insert (into the RelevantSkill table) rows corresponding to all implicit (indirect) relevance relationships (e.g. "Skill A" -> "Skill C") with their corresponding computed RelevanceLevels, if and only if the computed RelevanceLevel is above a set threshold. These rows should have ImplicitRelevance set to true:
skill_a_id, skill_b_id, computed_level, 'T'
If any changes are made to the explicit relevance levels (metrics), remove all rows with ImplicitRelevance=true and recompute (re-insert) them.
There are some factors to consider before you can choose the best option:
how many skills are there?
are relations sparse or dense (i.e. are skills related to a lot of other skills)?
how often do they change?
is there a relevancy threshold (minimal relevancy that is of interest to you)?
how is multi-path relevancy calculated?
The structure will obviously be like antti.huima proposes. The difference is how IndirectRelevance is implemented. If there are a lot of changes, a lot of relations, and the relations are dense, then the best approach might be a stored procedure (perhaps accessed through a view). If the relations are sparse and there is a threshold, the best option might be a materialized view or a table updated via triggers.
I've been trying to build a schedule generator for my school using topological sort, but am stuck dealing with classes that have prerequisites that can be taken concurrently. I was wondering if there was any clever way to modify topological sort to deal with these concurrent classes? For example, an intro to CS course can either be taken before a Data Structures course or at the same time as a Data Structures course. I'm trying to include the case where they are taken together.
You could create a dummy node combining the two courses together (assuming each course has at most a small number of concurrent courses, as you will likely need all combinations of them... This should work just fine if you have only one or two concurrent courses.)
The prerequisites of the combined node will be the combined prerequisites of both courses, and every course that has either of them as a prerequisite will get the dummy node as a prerequisite as well.
As postprocessing, once the topological sort has ended, you can clean up the redundancies and split dummy nodes back into the original courses.
That said, note that topological sort doesn't guarantee that the dummy node is actually used before the original nodes, even when that's possible. So there is no guarantee it will actually be used unless you tie-break in favor of dummy nodes when possible.
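A rough sketch of the dummy-node construction (course names, the dict-of-sets representation, and the choice to leave the redundancy cleanup to postprocessing are my own illustration, not a tested implementation):

def add_combined_nodes(prereqs, concurrent):
    # prereqs: course -> set of prerequisite courses
    combined = {course: set(reqs) for course, reqs in prereqs.items()}
    for a, b in concurrent:
        dummy = f"{a} + {b}"
        # the combined node inherits the prerequisites of both courses
        combined[dummy] = (prereqs.get(a, set()) | prereqs.get(b, set())) - {a, b}
        # courses that required a or b also get the dummy node as a prerequisite;
        # splitting it back and removing redundancies is left to postprocessing
        for course, reqs in combined.items():
            if course != dummy and (a in reqs or b in reqs):
                reqs.add(dummy)
    return combined

prereqs = {"data structures": {"intro to CS"}, "algorithms": {"data structures"}}
print(add_combined_nodes(prereqs, [("intro to CS", "data structures")]))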
Can't mathematically guarantee its correctness, but this slight modification should work.
Use the normal topological sorting with one difference. Assign all possible beginning nodes a value of 0. For each node that is queued, assign it a value of its parent node's value + 1. That way, all nodes with a given value would ideally be parallel and can be picked together.
Kahn's algorithm for topological sorting naturally produces a minimum-length schedule with concurrency (see the sketch after the steps below):
Make a dependency graph of all your courses
Select all courses with no dependencies. These can be taken concurrently.
Remove the selected courses from the graph.
If the graph is not empty, go back to (2)
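A minimal sketch of those steps (course names are made up, and the per-term course limit discussed below is ignored):

def schedule(prereqs):
    # prereqs: course -> set of prerequisite courses
    courses = set(prereqs) | {p for reqs in prereqs.values() for p in reqs}
    remaining = {c: set(prereqs.get(c, set())) for c in courses}
    terms = []
    while remaining:
        batch = {c for c, reqs in remaining.items() if not reqs}
        if not batch:
            raise ValueError("cycle in prerequisites")
        terms.append(batch)                 # these can be taken concurrently
        for c in batch:
            del remaining[c]
        for reqs in remaining.values():
            reqs -= batch                   # remove satisfied prerequisites
    return terms

print(schedule({"data structures": {"intro to CS"},
                "algorithms": {"data structures", "discrete math"}}))
# [{'intro to CS', 'discrete math'}, {'data structures'}, {'algorithms'}]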
Of course, students are limited in the number of courses they can take simultaneously, and the problem gets tricky when you also impose a limit on maximum concurrency. Deciding which courses to take first, when too many are available, is an NP-hard problem. There are some heuristics you can try, though, like deferring the jobs with the shortest dependent depth.
If you think about exactly what you want as output, it might become clearer. For instance, if your desired output is a potential list of which courses to take in which semester, then each vertex involved in the topological sort could be “course X on semester Y” rather than just “course X”. Then you'd get these edges, among many others:
intro to CS on semester 1 → data structures on semester 1
intro to CS on semester 1 → data structures on semester 2
This graph would of course be larger than one whose vertices are just courses: the number of vertices is now the number of courses times the maximum number of semesters in your education. But in a realistic setting, it seems to me that it wouldn't be too much to handle.
I'm looking for leads on algorithms to deduce the timeline/chronology of a series of novels. I've split the texts into days and created a database of relationships between them, e.g.: X is a month before Y, Y and Z are consecutive, date of Z is known, X is on a Tuesday, etc. There is uncertainty ('month' really only means roughly 30 days) and also contradictions. I can mark some relationships as more reliable than others to help resolve ambiguity and contradictions.
What kind of algorithms exist to deduce a best-fit chronology from this kind of data, assigning a highest-probability date to each day? At least time is 1-dimensional, but dealing with a complex relationship graph with inconsistencies seems non-trivial. I have a CS background so I can code something up, but some idea of the names of applicable algorithms would be helpful. I guess what I have is a graph with days as nodes and relationships as edges.
A simple, crude first approximation to your problem would be to store information like "A happened before B" in a directed graph with edges like "A -> B". Test the graph to see whether it is a Directed Acyclic Graph (DAG). If it is, the information is consistent in the sense that there is a consistent chronology of what happened before what else. You can get a sample linear chronology by printing a "topological sort" (topsort) of the DAG. If events C and D happened simultaneously or there is no information to say which came before the other, they might appear in the topsort as ABCD or ABDC. You can even get the topsort algorithm to print all possibilities (so both ABCD and ABDC) for further analysis using more detailed information.
If the graph you obtain is not a DAG, you can use an algorithm like Tarjan's algorithm to quickly identify "strongly connected components", which are areas of the graph which contain chronological contradictions in the form of cycles. You could then analyze them more closely to determine which less reliable edges might be removed to resolve contradictions. Another way to identify edges to remove to eliminate cycles is to search for "minimum feedback arc sets". That's NP-hard in general but if your strongly connected components are small the search could be feasible.
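As a minimal sketch of the first approximation (event names are made up; Python's graphlib gives you the DAG check and one topological order for free, while the strongly connected component analysis would need Tarjan's algorithm or a library such as networkx):

from graphlib import TopologicalSorter, CycleError

# "A happened before B" becomes an edge A -> B; a topological sort yields one
# consistent chronology, and a CycleError flags a contradiction to investigate.
before = [("X", "Y"), ("Y", "Z")]            # X before Y, Y before Z

graph = {}                                   # node -> set of predecessors
for a, b in before:
    graph.setdefault(b, set()).add(a)
    graph.setdefault(a, set())

try:
    print(list(TopologicalSorter(graph).static_order()))   # e.g. ['X', 'Y', 'Z']
except CycleError as e:
    print("chronological contradiction among:", e.args[1])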
Constraint programming is what you need. In propagation-based CP, you alternate between (a) making a decision at the current choice point in the search tree and (b) propagating the consequences of that decision as far as you can. Notionally you do this by maintaining a domain D of possible values for each problem variable x such that D(x) is the set of values for x which have not yet been ruled out along the current search path. In your problem, you might be able to reduce it to a large set of Boolean variables, x_ij, where x_ij is true iff event i precedes event j. Initially D(x) = {true, false} for all variables. A decision is simply reducing the domain of an undecided variable (for a Boolean variable this means reducing its domain to a single value, true or false, which is the same as an assignment). If at any point along a search path D(x) becomes empty for any x, you have reached a dead-end and have to backtrack.
If you're smart, you will try to learn from each failure and also retreat as far back up the search tree as required to avoid redundant search (this is called backjumping -- for example, if you identify that the dead-end you reached at level 7 was caused by the choice you made at level 3, there's no point in backtracking just to level 6 because no solution exists in this subtree given the choice you made at level 3!).
Now, given you have different degrees of confidence in your data, you actually have an optimisation problem. That is, you're not just looking for a solution that satisfies all the constraints that must be true, but one which also best satisfies the other "soft" constraints according to the degree of trust you have in them. What you need to do here is decide on an objective function assigning a score to a given set of satisfied/violated partial constraints. You then want to prune your search whenever you find the current search path cannot improve on the best previously found solution.
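This is not constraint propagation itself, but as a toy illustration of the objective-function idea: brute-force scoring of candidate orderings against weighted soft precedence constraints (the events and weights below are made up, and the contradiction is deliberate):

from itertools import permutations

events = ["A", "B", "C", "D"]
soft = {("A", "B"): 3.0, ("B", "C"): 2.0, ("C", "A"): 1.0}   # deliberately inconsistent

def score(order):
    # total weight of the soft "i before j" constraints satisfied by this order
    pos = {e: k for k, e in enumerate(order)}
    return sum(w for (i, j), w in soft.items() if pos[i] < pos[j])

best = max(permutations(events), key=score)
print(best, score(best))   # e.g. ('A', 'B', 'C', 'D') 5.0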
If you do decide to go for the Boolean approach, you could profitably look into SAT solvers, which tear through these kinds of problems. But the first place I'd look is MiniZinc, a CP language which maps onto a whole variety of state-of-the-art constraint solvers.
Best of luck!
Edit: to include a concrete explanation of my problem (as correctly deduced by Billiska):
"Set A is the set of users. set B is the set of products. each user rates one or more products. the rating is 1 to 10. you want to deduce for each user, who is the other user that has the most similar taste to him."
"The other half is choosing how exactly do you want to rank similarity of A-elements." - this is also part of my problem. I feel that users who have rated similarly across the most products have the closed affinity, but at the same time I want to avoid user1 and user2 with many mediocre matches being matched ahead of user1 and user3 who have just a few very good matches (perhaps I need a non-linear score).
Disclaimer: I have never used a graph database.
I have two sets of data, A and B. A has a relationship with zero to many Bs. Each relationship has a fixed value.
e.g.
A1--5-->B10
A1--1-->B1000
So my initial thought was "Yay, that's a graph, time to learn about graph databases!" but before I get too carried away... the only reason for doing this is so that I can answer the question...
For each A, find the set of As that are most similar based on their weights, where I want to take into consideration:
the difference in weights (assuming 1 to 10), so that 10 and 10 is scored higher than 10 and 1; but then I have an issue with how to handle the case where there is no pairing (or do I? I am just not sure)
the number of vertices (ignoring weights) that two sets have in common. The intention is to rank two As connected to lots of the same Bs higher than two As that have just a single matching vertex.
What would the best approach be to doing this?
(Supplementary - as I realise this may count as a second question): How would that approach change if set A numbered in the millions and B in the hundreds of thousands, and I needed real-time answers?
Not a complete answer - I don't fully understand the technique either - but I know it's very relevant.
If you view the data as a matrix, e.g. with rows corresponding to set A, columns corresponding to set B, and the entries being the weights, then it's a matrix with some missing values.
One technique used in recommender system (under the category of collaborative filtering) is low-rank approximation.
It's based on the assumption that the user-product rating matrix usually has low rank.
In a rough sense, the matrix has low rank if many users' rows can be expressed as linear combinations of other users' rows.
I hope this would give a start for further reading.
Yes, you can see on the low-rank approximation wiki page that the technique can be used to guess the missing entries (the missing ratings). I know it's a different problem, but it's related.
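As a rough illustration of the idea (the ratings below are made up, missing entries are filled with row means before a truncated SVD; a real recommender system would use something more robust):

import numpy as np

R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, 5.0, 1.0,   np.nan],
              [1.0, np.nan, 5.0, 4.0]])

# fill missing entries with each user's mean rating, then take a rank-k SVD
filled = np.where(np.isnan(R), np.nanmean(R, axis=1, keepdims=True), R)
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
k = 2                                        # assumed rank of the rating matrix
approx = U[:, :k] * s[:k] @ Vt[:k, :]        # rank-k reconstruction
print(approx[np.isnan(R)])                   # guesses for the missing ratings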
I am trying to implement a basic genetic algorithm in MATLAB. I have some questions regarding the cross-over operation. I was reading materials on it and I found that two parents are always selected for the cross-over operation.
What happens if I happen to have an odd number of parents?
Suppose I have parent A, parent B & parent C, and I cross parent A with B and again parent B with C to produce offspring; even then I get 4 offspring. What is the criterion for rejecting one of them, as my population pool should always stay the same size? Should I just reject the offspring with the lowest fitness value?
Can an arithmetic operation between parents, say an OR or AND operation, be deemed a good crossover operation? I found some sites listing them as crossover operations, but I am not sure.
How can I do crossover between multiple parents ?
"Crossover" isn't so much a well-defined operator as the generic idea of taking aspects of parents and using them to produce offspring similar to each parent in some ways. As such, there's no real right answer to the question of how one should do crossover.
In practice, you should do whatever makes sense for your problem domain and encoding. With things like two parent recombination of binary encoded individuals, there are some obvious choices -- things like n-point and uniform crossover, for instance. For real-valued encodings, there are things like SBX that aren't really sensible if viewed from a strict biological perspective. Rather, they are simply engineered to have some predetermined properties. Similarly, permutation encodings offer numerous well-known operators (Order crossover, Cycle crossover, Edge-assembly crossover, etc.) that, again, are the result of analysis of what features in parents make sense to make heritable for particular problem domains.
You're free to do the same thing. If you have three parents (with some discrete encoding like binary), you could do something like the following:
child = new chromosome(L)
for i = 1 to L
    switch (rand(3))        // rand(3) = uniformly random integer in {0, 1, 2}
        case 0:
            child[i] = parentA[i]
        case 1:
            child[i] = parentB[i]
        case 2:
            child[i] = parentC[i]
Whether that is a good operator or not will depend on several factors (problem domain, the interpretation of the encoding, etc.), but it's a perfectly legal way of producing offspring. You could also invent your own more complex method, e.g., taking a weighted average of each allele value over multiple parents, doing boolean operations like AND and OR, etc. You can also build a more "structured" operator if you like, in which different parents have specific roles. The basic Differential Evolution algorithm selects three parents, a, b, and c, and computes an update like a + F(b - c) (with a scale factor F), roughly corresponding to an offspring.
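For instance, here are rough sketches (in Python rather than MATLAB, purely for illustration) of the three-parent uniform crossover from the pseudocode above and a DE-style arithmetic combination:

import random

# Three-parent uniform crossover for a discrete encoding: each gene is copied
# from one of the three parents, chosen uniformly at random.
def three_parent_crossover(a, b, c):
    return [random.choice(genes) for genes in zip(a, b, c)]

# Differential-evolution-style combination for a real-valued encoding:
# offspring = a + F * (b - c), with scale factor F.
def de_combine(a, b, c, F=0.8):
    return [ai + F * (bi - ci) for ai, bi, ci in zip(a, b, c)]

print(three_parent_crossover([0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]))
print(de_combine([1.0, 2.0], [1.5, 2.5], [0.5, 1.0]))   # [1.8, 3.2]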
Consider reading the following academic articles:
DEB, Kalyanmoy et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, v. 6, n. 2, p. 182-197, 2002.
DEB, Kalyanmoy; AGRAWAL, Ram Bhushan. Simulated binary crossover for continuous search space. Complex systems, v. 9, n. 2, p. 115-148, 1995.
For SBX, the crossover and child-mutation method mentioned by @deong, see the answer simulated-binary-crossover-sbx-crossover-operator-example.
A genetic algorithm does not have a single, definite form; many variants have been proposed. But generally, the following steps apply in all of them:
Generate an initial population at random (or by any other method)
Cross parents to produce children
Mutate
Evaluate the children and parents
Generate the new population based only on the children, or on children and parents (different approaches exist)
Return to step 2
NSGA-II, from the Deb et al. paper cited above, is one of the most widely used and well-known genetic algorithms; the article includes a flow diagram of exactly these steps.
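As a bare-bones illustration of the generic loop above (the operators, parameters, and toy objective are placeholders, not NSGA-II):

import random

def genetic_algorithm(fitness, length=10, pop_size=20, generations=50):
    # random initial population of bit strings
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            a, b = random.sample(pop, 2)           # select two parents
            cut = random.randrange(1, length)      # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:              # mutation
                i = random.randrange(length)
                child[i] ^= 1
            children.append(child)
        # next generation from children and parents ("mu + lambda" style)
        pop = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

print(genetic_algorithm(sum))   # toy objective: maximize the number of ones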
I’m working on a program for the English language school I work for. I’m not being paid; it’s just a hobby to improve/automate my workflow.
It’s a residential school, and one aspect I’m looking at automating is the way we allocate rooms to students. Although I don’t want a full-blown solution, I was hoping someone could point me in the right direction, by suggesting how you might approach this or algorithms to look at, etc.
Basically, at the school we have a whole bunch of different rooms, ranging from singles to dormitories for 8 people. We get lots of different nationalities from all over the world, and we always try to make sure each room has a mix of nationalities. Where there is more than one nationality we try to balance them. Age is also important: we always put students of a similar age together, while still trying to mix nationalities, and it’s unusual for us to have students sharing with more than two years between them.
I suppose, more generically speaking, I am interested in how to sort a given set of students based on two parameters to an optimal result, with a few rules attached.
I hope I’ve explained clearly what I am trying to achieve… in a way it sounds really simple, but I’ve been trying to think how to do it in a simple way, i.e. by sorting by nationality and then by age, and it just doesn’t cut it, and I know there must be a better way of approaching this. When I do it “by hand” in an Excel sheet it feels quite intuitive.
Thank you to anyone who offers help / advice.
This is an interesting question, but it's not easy to answer. It's somehow connected with subdivision and bin packing, or the cutting-stock problem. You may want to look at topological sort too. You could also look at Drools, a business-rules platform that lets you define such rules.
First of all you might find this interesting: Stable Room-mates Problem (wikipedia). Unfortunately it does not answer your question.
Try a genetic algorithm.
There are three main criteria for using a genetic algorithm:
ability to represent a solution as a mutable array. We can have an array of integers such that a[i] is the room for the ith student.
mutation of the state should produce predictable results. In our case this is true. Mutating the array will predictably shuffle students between the rooms.
easy to write a fast fitness function. It shouldn't be too hard to write an O(n) fitness function here.
This is an interesting problem. I'll try writing some code with this approach and we'll see what happens.
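A rough sketch of what such a fitness function might look like (the weights, the two-year age rule as a soft penalty, and the data layout are my assumptions, not part of the question):

from collections import Counter

def fitness(assignment, students, room_sizes):
    # assignment[i] = room of student i; students: list of (age, nationality);
    # room_sizes: room -> capacity
    rooms = {}
    for student, room in zip(students, assignment):
        rooms.setdefault(room, []).append(student)

    score = 0.0
    for room, members in rooms.items():
        if len(members) > room_sizes[room]:
            score -= 100 * (len(members) - room_sizes[room])   # overfull room: big penalty
        ages = [age for age, _ in members]
        if max(ages) - min(ages) > 2:
            score -= 50                                        # age spread too wide
        nationalities = Counter(nat for _, nat in members)
        score += len(nationalities)                            # reward a mix of nationalities
        score -= max(nationalities.values()) - 1               # penalize imbalance
    return score

students = [(18, "ES"), (19, "FR"), (18, "ES"), (20, "JP")]
room_sizes = {0: 2, 1: 2}
print(fitness([0, 0, 1, 1], students, room_sizes))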
How about this: think of a room as something that repels students of a nationality it already has, and attracts students of an age close to what it already has. The closer the age to the room's average age, the more it attracts them, and the more students of nationality X are in the room, the more it repels students of nationality X.
Then, for every new student to be added, you would iterate through each room and see which one attracts them most. I guess if the room is empty you can set all forces to 0. Also, you would have a couple of constants multiplying each of the two "forces" so you can calibrate how important it is to have similar ages against how important it is to have different nationalities.
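A minimal sketch of that scoring idea (the weights, the exact attraction formula, and the data layout are arbitrary assumptions):

W_AGE, W_NAT = 1.0, 2.0      # calibration constants for the two "forces"

def room_score(student_age, student_nationality, room):
    # room: list of (age, nationality) already placed; empty room scores 0
    if not room:
        return 0.0
    avg_age = sum(age for age, _ in room) / len(room)
    same_nat = sum(1 for _, nat in room if nat == student_nationality)
    attraction = W_AGE / (1.0 + abs(student_age - avg_age))
    repulsion = W_NAT * same_nat
    return attraction - repulsion

# Place the new student in whichever room scores highest.
rooms = [[(18, "ES"), (19, "FR")], [(18, "ES"), (18, "ES")]]
best = max(range(len(rooms)), key=lambda r: room_score(18, "ES", rooms[r]))
print(best)   # 0: similar average age and fewer Spanish students already there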
I'd analyze each student and create a 'personality' vector based on his/her age & nationality. Then I'd sort the vectors, and maybe scramble the results a bit after sorting to encourage diversity.
The general theme of "assign x to y with respect to constraints while optimizing some quantity" falls within operations research or more specifically http://en.wikipedia.org/wiki/Mathematical_optimization. The usual approach is to formally specify the problem and use a generic optimization solver such as one of those listed in http://en.wikipedia.org/wiki/List_of_optimization_software.
Give it a try, the formal specification languages for using the existing solvers are rather easy to learn and you might get an optimal solution without having to debug a complicated algorithm.
Formulation as a General Optimization Problem
It will be useful to formalize the constraints and parameters. Let us assume that for 1 <= i <= 8, we have n_i rooms available of size i. Now let us impose the hard constraint that in a particular room S, for every two students a, b in S, we have:
|Grade(a) - Grade(b)| <= 2 (1)
Now we are interested in optimizing the "diversity" function which intuitively represents the idea that we want rooms to be as mixed as possible. So we can represent this goal as:
max over all arrangements {{ Sum over all rooms S of DiversityScore(S) }}
where we have DiversityScore(S) = # of Different Nationalities in the Room
Formulation as a Graph Problem
This is the most general setting, but clearly "max over all arrangements" is not computationally feasible. Now let us pose this as a graph problem with the hard grade constraints. Make every student a vertex in a graph G, and connect two vertices if the students satisfy constraint (1). A clique in this graph then represents a group of students that can all be placed in the same room. Now proceed greedily: choose a clique of size 4 (say, for a 4-person room) with the largest Diversity Score, place those students in a room, and continue until all rooms are filled. This clique-search method can also incorporate gender constraints, which is useful; note, however, that clique finding is an NP-hard problem.
Now, before trying to come up with something faster, let us think about how to weaken the hard constraint (1). We can massage our graph formulation by including edge weights. If the hard constraint is satisfied, give the edge from i to j weight 1; if two students i and j differ in age by more than 2, give the edge weight 1 / (age difference)^2 or something similar. Then the score of a clique should be the product of the clique's edge weights with some diversity score. However, it becomes clear that the problem is now on a complete graph, which is just the general optimization we hoped to avoid, so we need to impose some hard restrictions to reduce the connectivity of our graph.
A Basic Sorting Approximation Algorithm
Sort all students by their age, so we have a sorted array where all students in a[i] have the same age, and all students in a[i] are older than all students in a[j] for all j < i.
Now consider each pair i, j (of which there are O(n^2)) where we also have |Age[i] - Age[j]| <= 2. Find the largest group of students with different nationalities and place them in a room together. We successively iterate over the O(n^2) index pairs that satisfy the hard constraint and take any students with differing nationalities (which we can find by preprocessing and hashing on the index pairs). Doing this carefully (e.g. looking at index pairs that are spread far apart before close ones) improves the running time further. It feels like it should be polynomial time, but I think there are certain subtleties to address before saying so.
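A very rough sketch of that sorting-based grouping (the room size of 4, the greedy order, and the tie-breaking are my own assumptions, not part of the answer):

def assign_rooms(students, room_size=4):
    # students: list of (age, nationality)
    pool = sorted(students)                      # sorted by age
    rooms = []
    while pool:
        anchor_age = pool[0][0]
        # candidates within two years of the youngest unassigned student
        window = [s for s in pool if s[0] - anchor_age <= 2]
        room = []
        while window and len(room) < room_size:
            nats_in_room = [nat for _, nat in room]
            # pick the candidate whose nationality is rarest in the room so far
            pick = min(window, key=lambda s: nats_in_room.count(s[1]))
            room.append(pick)
            window.remove(pick)
            pool.remove(pick)
        rooms.append(room)
    return rooms

students = [(18, "ES"), (18, "ES"), (19, "FR"), (19, "JP"), (20, "ES"), (25, "BR")]
print(assign_rooms(students))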