I have a problem where I really need to be able to use finite automata as the keys to an associative container. Each key should actually represent an equivalence class of automata, so that when I search, I will find an equivalent automaton (if such a key exists), even if that automaton isn't structurally identical.
An obvious last-resort approach is of course to use linear search with an equivalence test for each key checked. I'm hoping it's possible to do a lot better than this.
I've been thinking in terms of trying to impose an arbitrary but consistent ordering, and deriving an ordered comparison algorithm. First principles involve the sets of strings that the automata represent. Evaluate the set of possible first tokens for each automaton, and apply an ordering based on those two sets. If necessary, continue to the sets of possible second tokens, third tokens etc. The obvious problem with doing this naively is that there's an infinite number of token-sets to check before you can prove equivalence.
I've been considering a few vague ideas - minimising the input automata first and using some kind of closure algorithm, or converting back to a regular grammar, some ideas involving spanning trees. I've come to the conclusion that I need to abandon the set-of-tokens lexical ordering, but the most significant conclusion I've reached so far is that this isn't trivial, and I'm probably better off reading up on someone else's solution.
I've downloaded a paper from CiteSeerX - Total Ordering on Subgroups and Cosets - but my abstract algebra isn't even good enough to know if this is relevant yet.
It also occurred to me that there might be some way to derive a hash from an automaton, but I haven't given this much thought yet.
Can anyone suggest a good paper to read? - or at least let me know if the one I've downloaded is a red herring or not?
I believe that you can obtain a canonical form from minimized automata. For any two equivalent automata, their minimized forms are isomorphic (I believe this follows from the Myhill-Nerode theorem). This isomorphism respects edge labels and of course node classes (start, accepting, non-accepting). This makes it easier than unlabeled graph isomorphism.
I think that if you build a spanning tree of the minimized automaton starting from the start state and ordering output edges by their labels, then you'll get a canonical form for the automaton which can then be hashed.
Edit: Non-tree edges should be taken into account too, but they can also be ordered canonically by their labels.
Here is a thesis from 1992 where they produce canonical minimized automata: Minimization of Nondeterministic Finite Automata
Once you have the canonical form, you can easily hash it, for example by performing a depth-first enumeration of the states and transitions, and hashing a string obtained by encoding state numbers (counted in the order of their first appearance) for states, and transitions as tuples
<from_state, symbol, to_state, is_accepting_final_state>
This should solve the problem.
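A minimal sketch of that idea, under my own assumptions (the DFA is represented as a dict mapping (state, symbol) to state, and the automaton is already minimized; all names are illustrative). Outgoing edges are always followed in sorted label order, so two isomorphic automata produce the same encoding, which can then be hashed:

```python
# Hypothetical sketch: canonical encoding of a (minimized) DFA.
# States are renumbered in order of first appearance during a traversal
# that always follows outgoing edges in sorted label order, so two
# isomorphic automata yield identical encodings.

def canonical_encoding(start, transitions, accepting):
    """transitions: dict mapping (state, symbol) -> state."""
    out = {}
    for (s, sym), t in transitions.items():
        out.setdefault(s, []).append((sym, t))
    for s in out:
        out[s].sort()                      # canonical edge order by label

    numbering = {start: 0}                 # state -> canonical number
    parts = [(-1, '', 0, start in accepting)]  # record the start state itself
    stack = [start]
    while stack:
        s = stack.pop()
        for sym, t in out.get(s, []):
            if t not in numbering:         # first appearance: assign number
                numbering[t] = len(numbering)
                stack.append(t)
            parts.append((numbering[s], sym, numbering[t], t in accepting))
    return tuple(sorted(parts))            # hashable canonical form
```

The returned tuple can be fed straight to `hash()` or used directly as a dictionary key.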
When a problem seems insurmountable, the solution is often to publicly announce how difficult you think the problem is. Then, you will immediately realise that the problem is trivial and that you've just made yourself look an idiot - and that's basically where I am now ;-)
As suggested in the question, to lexically order the two automata, I need to consider two things. The two sets of possible first tokens, and the two sets of possible everything-else tails. The tails can be represented as finite automata, and can be derived from the original automata.
So the comparison algorithm is recursive - compare the head, if different you have your result, if the same then recursively compare the tail.
The problem is the infinite sequence needed to prove equivalence for regular grammars in general. If, during a comparison, a pair of automata recur, equivalent to a pair that you checked previously, you have proven equivalence and you can stop checking. It is in the nature of finite automata that this must happen in a finite number of steps.
The problem is that I still have a problem in the same form. To spot my termination criteria, I need to compare my pair of current automata with all the past automata pairs that occurred during the comparison so far. That's what has been giving me a headache.
It also turns out that that paper is relevant, but probably only takes me this far. Regular languages form a monoid under the concatenation operator (though not a group, since there are no inverses), and the left coset is related to the head:tail things I've been considering.
The reason I'm an idiot is because I've been imposing a far too strict termination condition, and I should have known it, because it's not that unusual an issue WRT automata algorithms.
I don't need to stop at the first recurrence of an automata pair. I can continue until I find a more easily detected recurrence - one that has some structural equivalence as well as logical equivalence. So long as my derive-a-tail-automaton algorithm is sane (and especially if I minimise and do other cleanups at each step) I will not generate an infinite sequence of equivalent-but-different-looking automata pairs during the comparison. The only sources of variation in structure are the original two automata and the tail automaton algorithm, both of which are finite.
The point is that it doesn't matter that much if I compare too many lexical terms - I will still get the correct result, and while I will terminate a little later, I will still terminate in finite time.
This should mean that I can use an unreliable recurrence detection (allowing some false negatives) using a hash or ordered comparison that is sensitive to the structure of the automata. That's a simpler problem than the structure-insensitive comparison, and I think it's the key that I need.
Of course there's still the issue of performance. A linear search using a standard equivalence algorithm might be a faster approach, based on the issues involved here. Certainly I would expect this comparison to be a less efficient equivalence test than existing algorithms, as it is doing more work - lexical ordering of the non-equivalent cases. The real issue is the overall efficiency of a key-based search, and that is likely to need some headache-inducing analysis. I'm hoping that the fact that non-equivalent automata will tend to compare quickly (detecting a difference in the first few steps, like traditional string comparisons) will make this a practical approach.
Also, if I reach a point where I suspect equivalence, I could use a standard equivalence algorithm to check. If that check fails, I just continue comparing for the ordering where I left off, without needing to check for the tail language recurring - I know that I will find a difference in a finite number of steps.
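A heavily hedged sketch of the comparison loop described above, under the assumption that both automata are already minimized DFAs (all names are mine; whether this yields a genuinely transitive total order deserves more checking than I've done here). States are explored in synchronized pairs, and a pair seen before counts as a recurrence, which terminates that branch:

```python
# Sketch: ordered comparison of two minimized DFAs, returning -1, 0 or 1.
# A previously seen state pair is a recurrence: that branch is "equal so far".
# For minimized DFAs, a result of 0 means the automata are isomorphic,
# hence equivalent.

def compare_dfas(s1, trans1, acc1, s2, trans2, acc2, seen=None):
    if seen is None:
        seen = set()
    if (s1, s2) in seen:
        return 0                      # recurrence detected: stop this branch
    seen.add((s1, s2))
    # order first by acceptance of the current states...
    a1, a2 = s1 in acc1, s2 in acc2
    if a1 != a2:
        return -1 if a1 < a2 else 1
    # ...then by the set of possible next tokens...
    out1 = sorted(sym for (s, sym) in trans1 if s == s1)
    out2 = sorted(sym for (s, sym) in trans2 if s == s2)
    if out1 != out2:
        return -1 if out1 < out2 else 1
    # ...then recursively by each tail automaton, in label order.
    for sym in out1:
        c = compare_dfas(trans1[(s1, sym)], trans1, acc1,
                         trans2[(s2, sym)], trans2, acc2, seen)
        if c:
            return c
    return 0
```

Because both automata are finite, the set of state pairs is finite, so the recursion always terminates.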
If all you can do is == or !=, then I think you have to check every set member before adding another one. This is slow. (Edit: I guess you already know this, given the title of your question, even though you go on about comparison functions to directly compare two finite automata.)
I tried to do that with phylogenetic trees, and it quickly runs into performance problems. If you want to build large sets without duplicates, you need a way to transform to a canonical form. Then you can check a hash, or insert into a binary tree with the string representation as a key.
Another researcher who did come up with a way to transform a tree to a canonical rep used Patricia trees to store unique trees for duplicate-checking.
Related
Backtracking search is a well-known problem-solving technique that recursively tries all possible combinations of variable assignments in search of a valid solution. The general algorithm is abstracted into a concise higher-order function: https://en.wikipedia.org/wiki/Backtracking
Some problems require partial backtracking, that is, they have a mixture of don't-know non-determinism (you have a choice to make that matters: if you get it wrong you have to backtrack) and don't-care non-determinism (you have a choice to make that doesn't matter for correctness, only perhaps for how long it takes to find the solution; you don't have to backtrack).
Consider for example the Boolean satisfiability problem, which can be solved with the DPLL algorithm. If you try to represent that with the general backtracking algorithm, the result will not only enumerate all 2^N variable assignments (which is sadly necessary in the general case), but also all N! orders of trying the variables (completely unnecessary and hopelessly inefficient).
Is there a general algorithm for partial backtracking? A concise higher-order function that takes function parameters for both don't-know and don't-care choices?
If I understand you correctly, you’re asking about symmetry-breaking in tree search. In the specific example you gave, all permutations of the list of variable assignments are equivalent.
Symmetries are going to be domain-specific. So is the more-general technique of pruning the search tree, by short-circuiting and backtracking eagerly. There are a few symmetry-breaking techniques I’ve used that generalize.
One is to search the problem space in a canonical order. If the branch that sets variable 10 only tries variables 11, 12 and up, not variables 9, 8 or 7, it won’t search any permutation of the same solution. It will only test solutions that are unique up to permutation. (In the specific case of SAT-solving, this might rule out an optimal search order—although you could re-order the variables arbitrarily.)
Another is to make a test that only one distinct solution of any equivalence class will pass, ideally one that can be checked near the top of the search tree. The classic example of this is, in the 8-queens problem, checking whether the queen on the row you look at first is on the left or the right side of the chessboard. Any solution where she’s on the right is a mirror-image of one other solution where she’s on the left, so you can cut the search space in half. (You can actually do better than this with that problem.) If you only need to test for satisfiability, you can get by with a filter that merely guarantees that, if any solution exists, at least one solution will pass.
If you have enough memory, you might also store a set of branches that have already been searched, and then check whether a branch that you are considering whether to search is equivalent to one already in the set. This would be more practical for a search space with a huge number of symmetries than one with a huge number of solutions unique up to symmetry.
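The first technique (searching in a canonical order) can be sketched on a toy problem. Here partial solutions are only ever extended with indices larger than the last one chosen, so no permutation of the same subset is ever searched twice (the problem and names are my own illustration, not from the question):

```python
# Canonical-order backtracking: enumerate the k-subsets of {0, ..., n-1}.
# Each partial solution is only extended with indices larger than its last
# element, so permutations of the same subset are never revisited.

def k_subsets(n, k):
    results = []

    def extend(partial, next_index):
        if len(partial) == k:
            results.append(tuple(partial))
            return
        for i in range(next_index, n):   # canonical order: ascending indices
            partial.append(i)
            extend(partial, i + 1)       # only larger indices from here on
            partial.pop()                # backtrack

    extend([], 0)
    return results
```

Every result comes out sorted, which is exactly the symmetry-breaking: one representative per equivalence class of orderings.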
Does any of you know a machine learning method or combination of methods which makes it possible to integrate prior knowledge in the building process of a decision tree?
By "prior knowledge" I mean the information whether the feature in a particular node is really responsible for the resulting classification or not. Imagine we only have a short period of time in which our features are measured, and in this period two features happen to be correlated. If we measured the same features again, we would probably not get a correlation between those features, because it was just a coincidence that they were correlated. Unfortunately it is not possible to measure again.
The problem which arises from that is: the feature chosen by the algorithm to perform a split is not the feature which actually leads to the split in the real world. In other words, the spuriously correlated feature is chosen by the algorithm, while the other feature is the one which should be chosen. That's why I want to set rules / causalities / constraints for the tree learning process.
"a particular feature in an already learned tree" - the typical decision tree has one feature per node, and therefore each feature can appear in many different nodes. Similarly, each leaf has one classification, but each classification may appear in multiple leaves. (And with a binary classifier, any non-trivial tree must have repeated classifications.)
This means that you can enumerate all leaves and sort them by classification to get uniform subsets of leaves. For each such subset, you can analyze all paths from the root of the tree to see which features occurred. But this will be a large set.
"But in my case there are some features which are strongly correlated ... The feature which is chosen by the algorithm to perform a split is not the feature which actually leads to the split in the real world."
It's been said that every model is wrong, but some models are useful. If the features are indeed strongly correlated, choosing this "wrong" feature doesn't really affect the model.
You can of course just modify the split algorithm in tree building. Trivially, "if the remaining classes are A and B, use split S, else determine the split using algorithm C4.5" is a valid splitting algorithm that hardcodes pre-existing knowledge about two specific classes without being restricted to just that case.
But note that it might just be easier to introduce a combined class A+B in the decision tree, and then decide between A and B in postprocessing.
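To illustrate the "modify the split algorithm" route, here is a minimal, self-contained sketch (my own toy code, not any library's API): a Gini-based split chooser that simply refuses to split on blacklisted features, which is one crude way to encode the prior knowledge that a feature's correlation is spurious:

```python
# Minimal sketch of injecting prior knowledge into split selection:
# a Gini-impurity split chooser with a feature blacklist. All names
# here are illustrative, not from any real library.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y, forbidden=frozenset()):
    """X: list of feature vectors, y: labels. Returns (feature, threshold)."""
    best = (None, None, float('inf'))
    for f in range(len(X[0])):
        if f in forbidden:               # the prior-knowledge constraint
            continue
        for threshold in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= threshold]
            right = [y[i] for i, row in enumerate(X) if row[f] > threshold]
            score = (len(left) * gini(left)
                     + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, threshold, score)
    return best[0], best[1]
```

Wrapping this chooser in a recursive tree builder gives a decision tree that can never select the forbidden feature, at the cost of a possibly worse fit on the training data.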
I'm looking for algorithms or data structures specifically for dealing with ambiguities.
In my particular current field of interest I'm looking into ambiguous parses of natural languages, but I assume there must be many fields in computing where ambiguity plays a part.
I can find a lot out there on trying to avoid ambiguity but very little on how to embrace ambiguity and analyse ambiguous data.
Say a parser generates these alternative token streams or interpretations:
A B1 C
A B2 C
A B3 B4 C
It can be seen that some parts of the stream are shared between interpretations (A and C), while other parts branch into alternative interpretations and often meet back with the main stream.
Of course there may be many more interpretations, nesting of alternatives, and interpretations which have no main stream.
This is obviously some kind of graph with nodes. I don't know if it has an established name.
Are there extant algorithms or data structures I can study that are intended to deal with just this kind of ambiguous graph?
Ambiguity and sharing in Natural Language Parsing
Ambiguity and sharing in general
Given the generality of your question, I am trying to match that
generality.
The concept of ambiguity arises as soon as you consider a mapping or function f: A -> B which is not injective.
An injective function (also called one-to-one function) is one such
that when a≠a' then f(a) ≠ f(a'). Given a function f, you are often
interested in reversing it: given an element b of the codomain B of f,
you want to know what element a of the domain A is such that f(a)=b.
Note that there may be none if the function is not surjective
(i.e. onto).
When the function is not injective, there may be several values a in A
such that f(a)=b. In other words, if you use values in B to actually
represent values in A through the mapping f, you have an ambiguous
representation b that may not determine the value a uniquely.
From this you realize that the concept of ambiguity is so general that
it is unlikely that there is a unified body of knowledge about it,
even when limiting this to computer science and programming.
However, if you wish to consider reversing a function creating such
ambiguities, for example to compute the set f'(b)={a∈A | f(a)=b}, or
the best element(s) in that set according to some optimality criterion,
there are indeed some techniques that may help you in situations where
the problem can be decomposed into subproblems that often re-occur
with the same arguments. Then, if you memorize the result(s) for the
various combinations of arguments encountered, you never compute twice
the same thing (the subproblem is said to be memoized). Note that
ambiguity may exist for subproblems too, so that there may be several
answers for some subproblem instances, or optimal answers among
several others.
This amounts to sharing a single copy of a subproblem between all the
situations that require solving it with this set of parameters. The
whole technique is called dynamic programming, and the difficulty is
often to find the right decomposition into subproblems. Dynamic
programming is primarily a way to share the repeated subcomputation for a
solution, so as to reduce complexity. However, if each subcomputation
produces a fragment of a structure that is reused recursively in
larger structures to find an answer which is a structured object (a
graph for example), then sharing a subcomputation step may result in
also sharing a corresponding substructure in all the places where it is
needed. When many answers are to be found (because of ambiguity for
example), these answers can share subparts.
Rather than finding all the answers, dynamic programming can be used
to find those satisfying some optimality criterion. This requires that
an optimal solution of a problem uses optimal solutions of
subproblems.
The case of linguistic processing
Things can be more specific in the case of linguistics and language
processing. For that purpose, you have to identify the domains you are
dealing with, and the kind of functions you use with these domains.
The purpose of language is to exchange information, concepts, ideas
that reside in our brains, with the very approximate assumption that
our brains use the same functions to represent these ideas
linguistically. I must also simplify things considerably (sorry about
it) because this is not exactly the place for a full theory of
language, which would be disputed anyway. And I cannot even consider
all types of syntactic theories.
So linguistic exchange of information, of an idea, from a person P to a person Q
goes as follow:
idea in P ---f--> syntactic tree ---g--> lexical sequence ---h--> sound sequence
|
s
|
V
idea in Q <--f'-- syntactic tree <--g'-- lexical sequence <--h'-- sound sequence
The first line is about sentence generation taking place in person P,
and the second line is about sentence analysis taking place in person
Q. The function s stands for speech transmission, and should be the
identity function. The functions f', g' and h' are supposed to be the
inverse of the functions f,g, and h that compute the successive
representations down to the spoken representation of the idea. But
each of these functions may be non-injective (and usually is), so that
ambiguities are introduced at each level, making it difficult for Q to
invert them to retrieve the original meaning from the sound sequence
it receives (I am deliberately using the word sound to avoid getting
into details). The same diagram holds, with some variations in details,
for written communication.
We ignore f and f' since they are concerned with semantics, which may
be less formalized, and for which I do not have competence. Syntax
trees are often defined by grammatical formalisms (here I am skipping
over important refinements such as feature structures, but they can be
taken into account).
Both the function g and the function h are usually not injective, and
thus are sources of ambiguity. There are actually other sources of
ambiguity due to all kinds of errors inherent to the speech chain, but
we will ignore them for simplicity, as they do not much change the
nature of problems. The presence of errors, due to sentence generation
or transmission, or to language specification mismatch between the
speaker and the listener, is an extra source of ambiguity since the
listener attempts to correct potential errors without knowing what
they may have been or whether they exist at all.
We assume that the listener does not make mistakes, and that he
attempts to best "decode" the sentence according to his own linguistic
standards and knowledge, including knowledge of error sources and
statistics.
Lexical ambiguity
Given a sound sequence, the listening system has to inverse the effect
of the lexical generation function g with a function g'. A first
problem is that several different words may give the same sound
sequence, which is a first source of ambiguity. The second problem is
that the listening system actually receives the sequence corresponding
to a string of words, and there may be no indication of where words
begin or end. So there may be different ways of cutting the sound
sequence into subsequences corresponding to recognizable words.
This problem may be worsened when noise creates more confusion
between words.
An example is the following holorime verses taken from the web, that
are pronounced more or less similarly:
Ms Stephen, without a first-rate stakeholder sum or deal,
Must, even with outer fur straight, stay colder - some ordeal.
The analysis of the sound sequence can be performed by a finite state
non-deterministic automaton, interpreted in a dynamic programming
mode, which produces a directed acyclic graph where the nodes
correspond to the word separations and the edges to recognized words.
Any complete path through the graph, from its first node to its last,
corresponds to a possible way of analyzing the sound sequence as a
sequence of words.
The above example gives the (fairly simple) word lattice (oriented
left to right):
the-*-fun
/ \
Ms -*-- Stephen \ without --*-- a first -*- ...
/ \ / \ /
* * *
\ / \ / \
must --*-- even with -*- outer fur -*- ...
So that the sound sequence could also correspond to the following word
sequences (among several others):
Ms Stephen, with outer first-rate ...
Must, even with outer first-rate ...
This makes the lexical analysis ambiguous.
Probabilities may be used to choose a best sequence. But it is also
possible to keep the ambiguity of all possible readings and use it
as is in the next stage of sentence analysis.
Note that the word lattice may be seen as a finite state automaton
that generates or recognizes all the possible lexical readings of the
word sequence.
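A word lattice of this kind is easy to represent directly. Here is a tiny sketch (the edge data is an invented miniature loosely based on the example above, not the real lattice): nodes are positions in the sound sequence, edges are recognized words, and every path from the first node to the last is one lexical reading:

```python
# A word lattice as a DAG: nodes are positions in the sound sequence,
# edges are recognized words. The data below is an invented miniature.

lattice = {
    0: [('Ms', 1), ('must', 2)],
    1: [('Stephen', 3)],
    2: [('even', 3)],
    3: [('without', 5), ('with', 4)],
    4: [('outer', 5)],
    5: [],                        # final node
}

def readings(node, final):
    """Enumerate every lexical reading: all paths from node to final."""
    if node == final:
        return [[]]
    result = []
    for word, nxt in lattice[node]:
        for tail in readings(nxt, final):
            result.append([word] + tail)
    return result
```

In practice one would not enumerate readings explicitly (there can be exponentially many); the point of the lattice is precisely that later stages can work on the shared structure instead.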
Syntactic ambiguity
Syntactic structure is often based on a context-free grammar
skeleton. The problem of ambiguity of context-free languages is well
known and analyzed. A number of general CF parsers have been devised
to parse ambiguous sentences, and produce a structure (which varies
somewhat) from which all parses can be extracted. Such structures have
come to be known as parse forests, or shared parse forest.
What is known is that the structure can be at worst cubic in the length of
the analyzed sentence, on the condition that the language grammar is
binarized, i.e. with no more than 2 non-terminals in each rule
right-hand-side (or more simply, no more than 2 symbols in each rule
right-hand-side).
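The cubic chart underlying these parsers can be made concrete with a CYK-style sketch. The toy grammar below is my own invention and deliberately ambiguous (S -> S S over a single terminal), so the chart cells count several trees per span, mirroring how a shared parse forest packs alternatives:

```python
# CYK-style chart sketch: count parse trees under a binarized CF grammar.
# The toy grammar S -> S S | a is ambiguous on purpose; the counts per
# span are the Catalan numbers.

unary = {'a': ['S']}              # terminal rules:  S -> a
binary = [('S', 'S', 'S')]        # binary rules:    S -> S S

def count_parses(words, start='S'):
    n = len(words)
    # chart[i][j][A] = number of parse trees for words[i:j] rooted at A
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for a in unary.get(w, []):
            chart[i][i + 1][a] = chart[i][i + 1].get(a, 0) + 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for a, b, c in binary:
                    lc = chart[i][k].get(b, 0)
                    rc = chart[k][j].get(c, 0)
                    if lc and rc:
                        chart[i][j][a] = chart[i][j].get(a, 0) + lc * rc
    return chart[0][n].get(start, 0)
```

Storing back-pointers instead of counts in the same chart yields exactly the shared parse forest discussed below.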
Actually, all these general CF parsing algorithms are more or less
sophisticated variations around a simple concept: the intersection of
the language L(A) of a finite state automaton A and the language L(G)
of a CF grammar G. Construction of such an intersection goes back to
the early papers on context-free languages (Bar-Hillel, Perles and
Shamir 1961), and was intended as proof of a closure property. It took
some thirty years before it was recognized, in a 1995 paper, as a very
general parsing algorithm.
This classical cross-product construction yields a CF grammar for the
intersection of the two languages L(A) and L(G). If you consider a
sentence w to be parsed, represented as a sequence of lexical elements,
it can also be viewed as a finite state automaton W that generates only
the sentence w. For example:
this is a finite state automaton
=> (1)------(2)----(3---(4)--------(5)-------(6)-----------((7))
is a finite state automaton W accepting only the sentence
w="this is a finite state automaton". So L(W)={w}.
If the grammar of the language is G, then the intersection
construction gives a grammar G_w for the language
L(G_w)=L(W)∩L(G).
If the sentence w is not in L(G), then L(G_w) is empty, and the
sentence is not recognized. Else L(G_w)={w}. Furthermore, it is then
easily proved that the grammar G_w generates the sentence w with
exactly the same parse-trees (hence the same ambiguity) as the grammar
G, up to a simple renaming of the non-terminals.
The grammar G_w is the (shared) parse forest for w, and the set of
parse trees of w is precisely the set of derivations with this
grammar. So this gives a very simple view organizing the concepts, and
explaining the structure of shared parse forests and general CF parsers.
But there is more to it, because it shows how to generalize to
different grammars and to different structures to be parsed.
Constructive closure of intersection with regular sets by
cross-product constructions is common to a lot of grammatical
formalisms that extend the power of CF grammars somewhat into the
context-sensitive realm. This includes tree-adjoining grammars, and
linear context-free rewriting systems. Hence this is a guideline on
how to build for these more powerful formalisms general parsers that
can handle ambiguity and produce shared parse-forests, which are simply
specialized grammars of the same type.
The other generalization is that, when there is lexical ambiguity so
that lexical analysis produces many candidate sentences represented
with sharing by a word lattice, this word lattice can be read as a
finite state automaton recognizing all these sentences. Then, the same
intersection construction will eliminate all sentences that are not in
the language (not grammatical), and produce a CF grammar that is a
shared parse forest for all possible parses of all admissible
(grammatical) sentences from the word lattice.
As requested in the question, all possible ambiguous readings are
preserved as long as compatible with available linguistic or utterance
information.
The handling of noise and ill-formed sentences is usually modelled also
with finite state devices, and can thus be addressed by the same
techniques.
There are actually many other issues to be considered. For example,
there are many ways of building the shared forest, with more or less
sharing. The techniques used to precompile pushdown automata to be
used for general context-free parsing may have an effect on the quality
of the sharing. Being too smart is not always very smart.
See also other answers I made on SE on this topic:
https://cs.stackexchange.com/questions/27937/how-do-i-reconstruct-the-forest-of-syntax-trees-from-the-earley-vector/27952#27952
https://cstheory.stackexchange.com/questions/7374/recovering-a-parse-forest-from-an-earley-parser/18006#18006
I'm experimenting with PFGs -- Parse-Forest Grammars built using Marpa::R2 ASF.
The approach is to represent ambiguous parses as a grammar, design a criterion to prune unneeded rules, apply it and then remove unproductive and unaccessible symbols from the PFG thus arriving at a parse tree.
This test case is an illustration: it parses arithmetic expressions with highly ambiguous grammar, then prunes the PFG rules based on associativity and precedence, cleans up the grammar, converts it to abstract syntax tree (Problem 3.10 from the cited source — Grune and Jacobs).
I'd call this data structure a lattice, see for instance Lexicalized Parsing (PDF).
Are there any algorithms/tools to detect an a priori unknown pattern in input sequence of discrete symbols?
For example, for string "01001000100001" it is something like ("0"^i"1"),
and for "01001100011100001111" it is like ("0"^i"1"^i)
I've found some approaches, but they apply only when the set of patterns to detect in a sequence is known a priori. I've also found the Sequitur algorithm for hierarchical structure detection in data, but it does not work when the sequence grows like an "arithmetic progression", as in my examples.
So I'll be very thankful for any information about methods/algorithms/tools/scientific papers.
I believe that as someone pointed out, the general case is not solvable. Douglas Hofstadter spent a lot of time studying this problem, and describes some approaches (some automated, some manual), see the first chapter of:
http://www.amazon.com/Fluid-Concepts-And-Creative-Analogies/dp/0465024750
I believe his general approach was to use an AI search algorithm (depth-first or breadth-first search combined with some good heuristics). The algorithm generates possible sequences using different operators (such as "repeat the previous digit i times", or "i/2 times") and follows branches in the search tree where the operations specified by the nodes along that branch have correctly predicted the next digit(s), until it can predict the sequence far enough ahead that you are satisfied it has the correct answer. It then outputs the sequence of operations that constitutes the pattern (although these operations need to be designed by the user as building blocks for the pattern generation).
Also, genetic programming may be able to solve this problem.
So I've had at least two professors mention that backtracking makes an algorithm non-deterministic without giving too much explanation into why that is. I think I understand how this happens, but I have trouble putting it into words. Could somebody give me a concise explanation of the reason for this?
It's not so much the case that backtracking makes an algorithm non-deterministic.
Rather, you usually need backtracking to process a non-deterministic algorithm, since (by the definition of non-deterministic) you don't know which path to take at a particular time in your processing, but instead you must try several.
I'll just quote wikipedia:
A nondeterministic programming language is a language which can specify, at certain points in the program (called "choice points"), various alternatives for program flow. Unlike an if-then statement, the method of choice between these alternatives is not directly specified by the programmer; the program must decide at runtime between the alternatives, via some general method applied to all choice points. A programmer specifies a limited number of alternatives, but the program must later choose between them. ("Choose" is, in fact, a typical name for the nondeterministic operator.) A hierarchy of choice points may be formed, with higher-level choices leading to branches that contain lower-level choices within them.
One method of choice is embodied in backtracking systems, in which some alternatives may "fail", causing the program to backtrack and try other alternatives. If all alternatives fail at a particular choice point, then an entire branch fails, and the program will backtrack further, to an older choice point. One complication is that, because any choice is tentative and may be remade, the system must be able to restore old program states by undoing side-effects caused by partially executing a branch that eventually failed.
Out of the Nondeterministic Programming article.
Consider an algorithm for coloring a map of the world. No color may be used on two adjacent countries. The algorithm arbitrarily starts at a country and colors it an arbitrary color. So it moves along, coloring countries, changing the color on each step until, "uh oh", two adjacent countries have the same color. Well, now we have to backtrack and make a new color choice. Now we aren't making a choice as a nondeterministic algorithm would - that's not possible for our deterministic computers. Instead, we are simulating the nondeterministic algorithm with backtracking. A nondeterministic algorithm would have made the right choice for every country.
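A small sketch of that simulation (the map is reduced to an adjacency list; the data in the test is a toy example of mine, not a real map). Each color tried in a fixed order stands in for one nondeterministic "guess", and the undo step restores the old program state when a branch fails:

```python
# Deterministic simulation of the nondeterministic coloring algorithm:
# try each color in a fixed order and backtrack on conflict.

def color_map(adjacency, colors):
    countries = sorted(adjacency)
    assignment = {}

    def backtrack(i):
        if i == len(countries):
            return True                      # every country colored
        country = countries[i]
        for color in colors:                 # a nondeterministic machine
            if all(assignment.get(nb) != color   # would "guess" right here
                   for nb in adjacency[country]):
                assignment[country] = color
                if backtrack(i + 1):
                    return True
                del assignment[country]      # undo: restore old state
        return False                         # all alternatives failed

    return assignment if backtrack(0) else None
```

When no coloring exists with the given palette, every branch fails and the function reports failure instead of an assignment.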
The running time of backtracking on a deterministic computer can be exponential, or even factorial, i.e. in O(n!), in the worst case.
Where a non-deterministic computer could instantly guess correctly in each step, a deterministic computer has to try all possible combinations of choices.
Since it is impossible to build a non-deterministic computer, what your professor probably meant is the following:
A provably hard problem in the complexity class NP (all problems that a non-deterministic computer can solve efficiently by always guessing correctly) cannot be solved more efficiently on real computers than by backtracking.
The above statement is true, if the complexity classes P (all problems that a deterministic computer can solve efficiently) and NP are not the same. This is the famous P vs. NP problem. The Clay Mathematics Institute has offered a $1 Million prize for its solution, but the problem has resisted proof for many years. However, most researchers believe that P is not equal to NP.
A simple way to sum it up would be: Most interesting problems a non-deterministic computer could solve efficiently by always guessing correctly, are so hard that a deterministic computer would probably have to try all possible combinations of choices, i.e. use backtracking.
Thought experiment:
1) Hidden from view there is some distribution of electric charges; you feel the force they exert and can measure the potential field they create. Tell me exactly the positions of all the charges.
2) Take some charges and arrange them. Tell me exactly the potential field they create.
Only the second question has a unique answer. This is the non-uniqueness of vector fields. This situation may be analogous to the non-deterministic algorithms you are considering. Also consider, in mathematics, limits that do not exist because the value depends on the direction from which you approach a discontinuity.
I wrote a maze runner that uses backtracking (of course), which I'll use as an example.
You walk through the maze. When you reach a junction, you flip a coin to decide which route to follow. If you chose a route that leads to a dead end, trace back to the junction and take another route. If you have tried them all, return to the previous junction.
This algorithm is non-deterministic, not because of the backtracking, but because of the coin flipping.
Now change the algorithm: when you reach a junction, always try the leftmost route you haven't tried yet first. If that leads to a dead end, return to the junction and again try the leftmost route you haven't tried yet.
This algorithm is deterministic. There's no chance involved; it's predictable: you'll always follow the same route in the same maze.
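The deterministic "always try the leftmost untried route" rule can be sketched as follows. The maze model here is an assumption for illustration: a dict mapping each cell to its neighboring cells, listed left to right, and cells already on the path count as "tried" (the chalk mark).

```python
def solve(maze, start, goal, path=None):
    """Deterministic maze runner: leftmost untried route first,
    backtracking (via the failed recursive call) at dead ends."""
    if path is None:
        path = [start]
    if start == goal:
        return path
    for nxt in maze[start]:                # leftmost untried route first
        if nxt not in path:                # don't revisit marked cells
            result = solve(maze, nxt, goal, path + [nxt])
            if result:                     # this route reached the goal
                return result
    return None                            # dead end: the caller backtracks

maze = {"entry": ["a", "b"], "a": [], "b": ["exit"], "exit": []}
print(solve(maze, "entry", "exit"))        # → ['entry', 'b', 'exit']
```

Run it twice on the same maze and you get the same route both times; replace the `for nxt in maze[start]` order with a coin flip and the route (though not the eventual success) becomes non-deterministic.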
If you allow backtracking, you allow infinite looping in your program, which makes it non-deterministic in the sense that the actual path taken may always include one more loop.
Non-Deterministic Turing Machines (NDTMs) can take multiple branches in a single step. DTMs, on the other hand, must follow a trial-and-error process.
You can think of DTMs as regular computers. In contrast, quantum computers are loosely akin to NDTMs and can solve certain such problems much more easily (e.g. see their application in breaking cryptography). So backtracking would actually be a linear process for them.
I like the maze analogy. Let's think of the maze, for simplicity, as a binary tree in which there is only one path out.
Now you want to try a depth first search to find the correct way out of the maze.
A non-deterministic computer would, at every branching point, duplicate/clone itself and run the further calculations in parallel. It is as if the person in the maze duplicated/cloned himself (like in the movie The Prestige) at each branching point, sending one copy of himself into the left subbranch of the tree and the other copy into the right subbranch.
The computers/persons who end up at a dead end die (terminate without an answer).
Only one computer will survive (terminate with an answer): the one that gets out of the maze.
The difference between backtracking and non-determinism is the following.
In the case of backtracking there is only one computer alive at any given moment. It does the traditional maze-solving trick: it marks its path with chalk, and when it reaches a dead end it simply backtracks to a branching point whose subbranches it has not yet completely explored, just like in a depth-first search.
In contrast:
A non-deterministic computer can clone itself at every branching point and check for the way out by running parallel searches in the subbranches.
So the backtracking algorithm simulates/emulates the cloning ability of the non-deterministic computer on a sequential/non-parallel/deterministic computer.
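The cloning picture above can be sketched in a few lines. The binary tree, node names, and the notion of an "exit" node are illustrative assumptions; the depth-first search visits the subbranches one at a time, sequentially emulating what the clones would explore in parallel.

```python
def find_exit(tree, node, target, path=()):
    """Depth-first search: the sequential stand-in for cloning at each branch."""
    path = path + (node,)
    if node == target:
        return list(path)                   # this "clone" survives
    for child in tree.get(node, ()):        # clone into each subbranch, in turn
        result = find_exit(tree, child, target, path)
        if result:
            return result
    return None                             # this clone hit a dead end and "dies"

# A small binary maze-tree with exactly one exit node.
tree = {"root": ["L", "R"], "L": ["LL", "LR"], "R": ["RL", "RR"]}
print(find_exit(tree, "root", "LR"))        # → ['root', 'L', 'LR']
```

Each recursive call plays the role of one clone; the `return None` at the bottom is the clone dying, and the loop moving on to the next child is the single real computer backtracking to the branching point.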