Partial validation of a complex series of conditions - algorithm

I am currently working on a scheduling application and have run into a small snag. Because the field is heavily regulated for safety reasons, the software is required to check a number of interdependent conditions to ensure a trip is possible. Instead of a nice tree of conditions - which would be simple to implement - I have this hideous directed graph. Now for the most part this wouldn't be difficult at all, except for the fact that I may not know all the information required in advance, but I still need to perform as much of the validation as possible.
I could implement this as a rat's nest of if/else statements, but that would be a maintainability nightmare since the regulations change on a fairly regular basis. Since there are no cycles in the graph, I'm thinking that some form of breadth-first approach is probably optimal. Am I on the right track? Are there alternative techniques for performing this kind of task?

The solution depends completely on what the directed acyclic graph (DAG) you speak of actually represents. Are the nodes AND and OR nodes, or are they conditional branches?
If they are conditional branches you don't need any breadth-first search, because there is nothing to search for; you just take the branches according to the evaluated conditions. Yes, it could easily be implemented as GOTO spaghetti. Another option is to create a database of the nodes (or a data structure) and have a "virtual machine" that walks the nodes one by one. This makes it more maintainable, but also slower.
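For illustration, a minimal sketch of that "virtual machine" idea in Python; the node names and conditions are invented, not from the question:

```python
# The rule graph lives in a plain data structure; a tiny interpreter walks it
# node by node. Changing regulations means editing the table, not the code.
# All node names and conditions below are illustrative.

RULES = {
    "start":         {"cond": lambda trip: trip["crew_rested"], "yes": "check_weather", "no": "reject"},
    "check_weather": {"cond": lambda trip: trip["weather_ok"],  "yes": "accept",        "no": "reject"},
    "accept":        {"result": True},
    "reject":        {"result": False},
}

def run(rules, trip, node="start"):
    while "result" not in rules[node]:
        branch = "yes" if rules[node]["cond"](trip) else "no"
        node = rules[node][branch]
    return rules[node]["result"]

print(run(RULES, {"crew_rested": True, "weather_ok": False}))  # False
```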
If it is an AND/OR/NOT graph then you need to evaluate the truth values of the nodes starting from the leaves. This is not a breadth-first search but a kind of reverse breadth-first approach: you calculate the values of the leaves first, then work backwards through the internal nodes, and eventually you get an evaluation of the root node (true/false).
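A sketch of that evaluation, written as memoized recursion from the root (which visits nodes in the same reverse order) and extended with None for "unknown" so it also covers the asker's partial-information case; the example graph is invented:

```python
# Three-valued evaluation of an AND/OR/NOT DAG: True, False, or None (unknown).
# Known values can still decide a node early: False AND x = False, True OR x = True.

def evaluate(node, graph, leaves, memo=None):
    """graph: name -> (op, [children]); leaves: name -> True/False/None."""
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    if node in leaves:
        return leaves[node]
    op, children = graph[node]
    vals = [evaluate(c, graph, leaves, memo) for c in children]
    if op == "AND":
        result = False if False in vals else (None if None in vals else True)
    elif op == "OR":
        result = True if True in vals else (None if None in vals else False)
    else:  # "NOT"
        result = None if vals[0] is None else not vals[0]
    memo[node] = result
    return result

graph = {"root": ("AND", ["a_or_b", "c"]), "a_or_b": ("OR", ["a", "b"])}
print(evaluate("root", graph, {"a": True, "b": None, "c": None}))   # None: undecided
print(evaluate("root", graph, {"a": True, "b": None, "c": False}))  # False: decided early
```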

Sounds like you are trying to solve a common compilation problem called [constant folding](http://en.wikipedia.org/wiki/Constant_folding).
The good news is that it applies to DAGs (directed acyclic graphs) and not just to trees. Basically the idea is to compute what you can from partial expressions. Rules as simple as True AND X = X, or True OR X = True, help prune large parts of the graph. (The trivial implementation I know of is more a matter of depth-first traversal and backtracking than breadth-first, but either way it's not much code.)
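A sketch of what that folding might look like on boolean expressions; the representation (nested tuples for operators, strings for unknown variables) is my own choice for illustration:

```python
# Fold the known parts of an expression and return a pruned residual
# expression for the parts that are still unknown.

def fold(expr, known):
    if isinstance(expr, bool):
        return expr
    if isinstance(expr, str):                        # a variable
        return known.get(expr, expr)
    op, *args = expr
    args = [fold(a, known) for a in args]
    if op == "AND":
        if False in args:
            return False                             # False AND x = False
        args = [a for a in args if a is not True]    # True AND x = x
        return True if not args else args[0] if len(args) == 1 else ("AND", *args)
    if op == "OR":
        if True in args:
            return True                              # True OR x = True
        args = [a for a in args if a is not False]   # False OR x = x
        return False if not args else args[0] if len(args) == 1 else ("OR", *args)

expr = ("AND", "has_license", ("OR", "rested", "short_trip"))
print(fold(expr, {"rested": True}))  # 'has_license' -- the OR subtree folded away
```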
I still wonder why you got a graph and not an expression tree. If you can compute A from B or B from A, you usually do not have both A and B as inputs at the same time. Then it should be possible to express the problem as a set of expression trees depending on the available inputs. But that's a raw guess. Can you give more details (an example) of why you have this graph?

Related

Machine learning method which is able to integrate prior knowledge in a decision tree

Do any of you know of a machine learning method, or combination of methods, that makes it possible to integrate prior knowledge into the building process of a decision tree?
With "prior knowledge" I mean the information if a feature in a particular node is really responsible for the resulting classification or not. Imagine we only have a short period of time where our features are measured and in this period of time we have a correlation between features. If we now would measure the same features again, we probably would not get a correlation between those features, because it was just a coincidence that they are correlated. Unfortunately it is not possible to measure again.
The problem that arises is that the feature chosen by the algorithm to perform a split is not the feature that actually drives the split in the real world. In other words, the algorithm picks the strongly correlated feature while the other feature is the one that should be chosen. That's why I want to set rules / causalities / constraints for the tree learning process.
"a particular feature in an already learned tree" - the typical decision tree has one feature per node, and therefore each feature can appear in many different nodes. Similarly, each leaf has one classification, but each classification may appear in multiple leafs. (And with a binary classifier, any non-trivial tree must have repeated classifications).
This means that you can enumerate all leafs and sort them by classification to get uniform subsets of leaves. For each such subset, you can analyze all paths from the root of the tree to see which features occurred. But this will be a large set.
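A sketch of that enumeration, assuming a simple hand-rolled node representation (real libraries store trees differently):

```python
# Collect every root-to-leaf path, grouped by the leaf's classification.
from collections import defaultdict

class Node:
    def __init__(self, feature=None, children=None, label=None):
        self.feature, self.children, self.label = feature, children or [], label

def paths(node, path=()):
    if node.label is not None:                   # leaf
        yield node.label, path
        return
    for child in node.children:
        yield from paths(child, path + (node.feature,))

def paths_by_class(root):
    grouped = defaultdict(list)
    for label, path in paths(root):
        grouped[label].append(path)
    return grouped

tree = Node("f1", [Node(label="A"), Node("f2", [Node(label="B"), Node(label="A")])])
print(dict(paths_by_class(tree)))  # {'A': [('f1',), ('f1', 'f2')], 'B': [('f1', 'f2')]}
```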
"But in my case there are some features which are strongly correlated ... The feature which is choosen by the algorithms to perform a split is not the feature which actually leads to the split in the real world."
It's been said that every model is wrong, but some models are useful. If the features are indeed strongly correlated, choosing this "wrong" feature doesn't really affect the model.
You can of course just modify the split algorithm in tree building. Trivially, "if the remaining classes are A and B, use split S, else determine the split using algorithm C4.5" is a valid splitting algorithm that hardcodes pre-existing knowledge about two specific classes without being restricted to just that case.
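A sketch of such a split chooser with one hardcoded rule; the class names, `feature_S`, and the plain information-gain fallback (standing in for C4.5) are all illustrative:

```python
# Choose a split feature: hardcoded prior knowledge for the {A, B} case,
# information gain otherwise. Rows are dicts of feature -> value.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    remainder = 0.0
    for value, count in Counter(r[feature] for r in rows).items():
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        remainder += count / len(rows) * entropy(subset)
    return entropy(labels) - remainder

def choose_split(rows, labels, features):
    if set(labels) == {"A", "B"}:      # prior knowledge: we know the causal feature here
        return "feature_S"
    return max(features, key=lambda f: information_gain(rows, labels, f))
```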
But note that it might just be easier to introduce a combined class A+B in the decision tree, and then decide between A and B in postprocessing.

Decision Tree clarification

I just want to ask/clarify if decision trees are essentially binary trees where each node is a boolean, and it continues down until a desired result is reached?
Not necessarily. Some nodes may share children, which is not the case in binary trees. However, the essence of a decision tree is what you mentioned.
It's a tree where, based on the probability of an outcome, you move down the graph until you reach an outcome.
See Wikipedia's page on decision trees for more info.
As mentioned by Ares, not all decision trees are binary (they can be "n-ary"), although most implementations I have seen are binary trees.
For instance, if you have a color variable (i.e. categorical) that can take three values: red, blue, or green, you might want to split three ways directly at a node instead of splitting in two and then in two again (or more).
The choice between binary and "n-ary" will usually depend on your data. I suspect that most people use binary trees anyway because they are relatively easier to implement and more flexible.
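For example, a minimal n-ary node for a categorical feature might look like this (an illustrative sketch, not a library API):

```python
# One child per category value, instead of a cascade of binary
# "is it red? is it blue?" tests.

class CategoricalNode:
    def __init__(self, feature, children):
        self.feature = feature     # e.g. "color"
        self.children = children   # e.g. {"red": ..., "blue": ..., "green": ...}

    def decide(self, sample):
        child = self.children[sample[self.feature]]
        return child.decide(sample) if isinstance(child, CategoricalNode) else child

node = CategoricalNode("color", {"red": "stop", "blue": "go", "green": "go"})
print(node.decide({"color": "red"}))  # stop
```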
Then, as you said, the tree is developed until the desired outcome is reached. Decision trees suffer from major drawbacks such as overfitting, and there exist many ways to tackle this issue (pruning, boosting, etc.), but that is beyond the scope of this question/answer.
I recommend having a look at this great visualization that explains decision trees well: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
I'll be happy to give more details about decision trees.

Algorithms for Deducing a Timeline / Chronology

I'm looking for leads on algorithms to deduce the timeline/chronology of a series of novels. I've split the texts into days and created a database of relationships between them, e.g.: X is a month before Y, Y and Z are consecutive, date of Z is known, X is on a Tuesday, etc. There is uncertainty ('month' really only means roughly 30 days) and also contradictions. I can mark some relationships as more reliable than others to help resolve ambiguity and contradictions.
What kind of algorithms exist to deduce a best-fit chronology from this kind of data, assigning a highest-probability date to each day? At least time is one-dimensional, but dealing with a complex relationship graph with inconsistencies seems non-trivial. I have a CS background so I can code something up, but some idea of the names of applicable algorithms would be helpful. I guess what I have is a graph with days as nodes and relationships as edges.
A simple, crude first approximation to your problem would be to store information like "A happened before B" in a directed graph with edges like "A -> B". Test the graph to see whether it is a Directed Acyclic Graph (DAG). If it is, the information is consistent in the sense that there is a consistent chronology of what happened before what else. You can get a sample linear chronology by printing a "topological sort" (topsort) of the DAG. If events C and D happened simultaneously or there is no information to say which came before the other, they might appear in the topsort as ABCD or ABDC. You can even get the topsort algorithm to print all possibilities (so both ABCD and ABDC) for further analysis using more detailed information.
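A sketch of that first approximation, assuming the networkx library is available (`pip install networkx`); the edges are invented:

```python
# Build the "happened before" digraph, check it is a DAG, and enumerate
# every chronology consistent with the data.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("B", "D")])  # A before B, B before C and D

if nx.is_directed_acyclic_graph(G):
    print(list(nx.topological_sort(G)))       # one consistent chronology, e.g. A B C D
    print(list(nx.all_topological_sorts(G)))  # all of them: A B C D and A B D C
```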
If the graph you obtain is not a DAG, you can use an algorithm like Tarjan's algorithm to quickly identify "strongly connected components", which are areas of the graph which contain chronological contradictions in the form of cycles. You could then analyze them more closely to determine which less reliable edges might be removed to resolve contradictions. Another way to identify edges to remove to eliminate cycles is to search for "minimum feedback arc sets". That's NP-hard in general but if your strongly connected components are small the search could be feasible.
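Again as a sketch with invented edges, the contradictory clusters fall out of networkx's built-in SCC routine (a Tarjan-style algorithm):

```python
# Every strongly connected component with more than one node contains a cycle,
# i.e. a chronological contradiction to inspect by hand.
import networkx as nx

G = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])  # A->B->C->A contradicts

for component in nx.strongly_connected_components(G):
    if len(component) > 1:
        print("contradictory cluster:", component)  # {'A', 'B', 'C'}
```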
Constraint programming is what you need. In propagation-based CP, you alternate between (a) making a decision at the current choice point in the search tree and (b) propagating the consequences of that decision as far as you can. Notionally you do this by maintaining a domain D of possible values for each problem variable x such that D(x) is the set of values for x which have not yet been ruled out along the current search path. In your problem, you might be able to reduce it to a large set of Boolean variables, x_ij, where x_ij is true iff event i precedes event j. Initially D(x) = {true, false} for all variables. A decision is simply reducing the domain of an undecided variable (for a Boolean variable this means reducing its domain to a single value, true or false, which is the same as an assignment). If at any point along a search path D(x) becomes empty for any x, you have reached a dead-end and have to backtrack.
If you're smart, you will try to learn from each failure and also retreat as far back up the search tree as required to avoid redundant search (this is called backjumping -- for example, if you identify that the dead-end you reached at level 7 was caused by the choice you made at level 3, there's no point in backtracking just to level 6 because no solution exists in this subtree given the choice you made at level 3!).
Now, given you have different degrees of confidence in your data, you actually have an optimisation problem. That is, you're not just looking for a solution that satisfies all the constraints that must be true, but one which also best satisfies the other "soft" constraints according to the degree of trust you have in them. What you need to do here is decide on an objective function assigning a score to a given set of satisfied/violated partial constraints. You then want to prune your search whenever you find the current search path cannot improve on the best previously found solution.
If you do decide to go for the Boolean approach, you could profitably look into SAT solvers, which tear through these kinds of problems. But the first place I'd look is at MiniZinc, a CP language which maps onto a whole variety of state-of-the-art constraint solvers.
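As one concrete illustration of the whole approach (my choice of solver, not the answerer's; MiniZinc would express the same thing), the soft-constraint optimisation maps naturally onto OR-Tools CP-SAT:

```python
# Integer "day number" variables, hard constraints for trusted facts,
# and a reified Boolean for the fuzzy "a month before" relation.
from ortools.sat.python import cp_model

model = cp_model.CpModel()
day = {e: model.NewIntVar(0, 10000, e) for e in ("X", "Y", "Z")}

model.Add(day["Z"] == 400)            # date of Z is known (hard constraint)
model.Add(day["Y"] + 1 == day["Z"])   # Y and Z are consecutive (hard constraint)

month = model.NewBoolVar("x_month_before_y")  # soft: "month" means roughly 30 days
model.AddLinearConstraint(day["Y"] - day["X"], 25, 35).OnlyEnforceIf(month)
model.Maximize(3 * month)             # weight 3 = our trust in this relation

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print({e: solver.Value(v) for e, v in day.items()})  # e.g. X around 369, Y=399, Z=400
```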
Best of luck!

How do I balance a BK-Tree and is it necessary?

I am looking into using an Edit Distance algorithm to implement a fuzzy search in a name database.
I've found a data structure that will supposedly help speed this up through a divide and conquer approach - Burkhard-Keller Trees. The problem is that I can't find very much information on this particular type of tree.
If I populate my BK-tree with arbitrary nodes, how likely am I to have a balance problem?
If it is possibly or likely for me to have a balance problem with BK-Trees, is there any way to balance such a tree after it has been constructed?
What would the algorithm look like to properly balance a BK-tree?
My thinking so far:
It seems that child nodes are keyed by distance, so I can't simply rotate a given node in the tree without recalculating the entire subtree under it. However, if I can find an optimal new root node this might be precisely what I should do. I'm not sure how I'd go about finding an optimal new root node though.
I'm also going to try a few methods to see if I can get a fairly balanced tree by starting with an empty tree, and inserting pre-distributed data.
Start with an alphabetically sorted list, then queue from the middle. (I'm not sure this is a great idea because alphabetizing is not the same as sorting on edit distance).
Completely shuffled data. (This relies heavily on luck to pick a "not so terrible" root by chance. It might fail badly, and is almost surely sub-optimal.)
Start with an arbitrary word in the list and sort the rest of the items by their edit distance from that item. Then queue from the middle. (I feel this is going to be expensive, and still do poorly as it won't calculate metric space connectivity between all words - just each word and a single reference word).
Build an initial tree with any method, flatten it (basically like a pre-order traversal), and queue from the middle for a new tree. (This is also going to be expensive, and I think it may still do poorly as it won't calculate metric space connectivity between all words ahead of time, and will simply get a different and still uneven distribution).
Order by name frequency, insert the most popular first, and ditch the concept of a balanced tree. (This might make the most sense, as my data is not evenly distributed and I won't have pure random words coming in).
FYI, I am not currently worrying about the name-synonym problem (Bill vs William). I'll handle that separately, and I think completely different strategies would apply.
There is a Lisp example in this article: http://cliki.net/bk-tree. Regarding balancing: the data structure and the method seem complicated enough as they are, and the author doesn't say anything about unbalanced trees. If you do run into balance problems in practice, maybe this structure isn't the right fit for your data?
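For reference, since the linked example is in Lisp, here is a minimal BK-tree sketch in Python, assuming Levenshtein distance as the metric; the sample words are invented:

```python
# BK-tree: children are keyed by their distance to the parent word, and the
# triangle inequality lets search skip whole subtrees.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, word):
        self.word, self.children = word, {}

    def insert(self, word):
        d = levenshtein(word, self.word)
        if d in self.children:
            self.children[d].insert(word)
        else:
            self.children[d] = BKTree(word)

    def search(self, word, tol):
        d = levenshtein(word, self.word)
        hits = [self.word] if d <= tol else []
        for dist, child in self.children.items():
            if d - tol <= dist <= d + tol:   # triangle-inequality prune
                hits += child.search(word, tol)
        return hits

tree = BKTree("bill")
for w in ("will", "bell", "bilbo", "susan"):
    tree.insert(w)
print(tree.search("bull", 1))  # ['bill', 'bell']
```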

Performance of an A* search implemented in Clojure

I have implemented an A* search algorithm for finding a shortest path between two states.
The algorithm uses one hash-map to store the best known distances for visited states, and another hash-map to store the child-parent relationships needed to reconstruct the shortest path.
Here is the code. Implementation of the algorithm is generic (states only need to be "hashable" and "comparable") but in this particular case states are pairs (vectors) of ints [x y] and they represent one cell in a given heightmap (cost for jumping to neighboring cell depends on the difference in heights).
The question is whether it's possible to improve performance, and how. Maybe by using features from Clojure 1.2 or future versions, by changing the logic of the algorithm implementation (e.g. using a different way to store the path), or by changing the state representation in this particular case?
A Java implementation runs in an instant for this map, while the Clojure implementation takes about 40 seconds. Of course, there are some natural and obvious reasons for this: dynamic typing, persistent data structures, unnecessary (un)boxing of primitive types...
Using transients didn't make much difference.
Using priority-map instead of sorted-set
I first used a sorted-set for storing open nodes (the search frontier); switching to a priority-map improved performance: it now takes 15-20 seconds for this map (it took 40s before).
This blog post was very helpful, and "my" new implementation is pretty much the same.
The new a*-search can be found here.
I don't know Clojure, but I can give you some general advice about improving the performance of Vanilla A*.
Consider implementing IDA*, which is a variant of A* that uses less memory, if it's suitable for your domain.
Try a different heuristic. A good heuristic can have a significant impact on the number of node expansions required.
Use a cache, often called a "transposition table" in search algorithms. Since search graphs are usually directed acyclic graphs and not true trees, you can end up searching the same state more than once; a cache that remembers previously searched nodes reduces node expansions.
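To make the cache point concrete, a minimal A* sketch in Python (the question's code is Clojure, but the idea is language-neutral); here `best_g` plays the role of the transposition table:

```python
# A* with a best-known-cost table: states reached again at a worse cost are
# never re-expanded, so the DAG-shaped search space isn't explored repeatedly.
import heapq
import itertools

def a_star(start, goal, neighbors, heuristic):
    """neighbors(s) -> iterable of (next_state, step_cost); states must be hashable."""
    tie = itertools.count()  # tie-breaker so the heap never compares states
    open_heap = [(heuristic(start), next(tie), 0, start, None)]
    best_g, parent = {start: 0}, {}
    while open_heap:
        _, _, g, state, prev = heapq.heappop(open_heap)
        if g > best_g.get(state, float("inf")):
            continue                 # stale entry: a cheaper copy was already expanded
        parent[state] = prev
        if state == goal:
            path = [state]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        for nxt, cost in neighbors(state):
            ng = g + cost
            if ng < best_g.get(nxt, float("inf")):   # the "transposition table" check
                best_g[nxt] = ng
                heapq.heappush(open_heap, (ng + heuristic(nxt), next(tie), ng, nxt, state))
    return None
```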
Dr. Jonathan Schaeffer has some slides on this subject:
http://webdocs.cs.ualberta.ca/~jonathan/Courses/657/Notes/10.Single-agentSearch.pdf
http://webdocs.cs.ualberta.ca/~jonathan/Courses/657/Notes/11.Evaluations.pdf
