Are there any algorithms/tools to detect an a priori unknown pattern in an input sequence of discrete symbols?
For example, for the string "01001000100001" the pattern is something like ("0"^i "1"),
and for "01001100011100001111" it is like ("0"^i "1"^i).
I've found some approaches, but they apply when the set of patterns to detect in a sequence is known a priori. I've also found the Sequitur algorithm for detecting hierarchical structure in data, but it does not work when the sequence grows like an arithmetic progression, as in my examples.
So I'd be very thankful for any information about a relevant method/algorithm/tool/scientific paper.
I believe that, as someone pointed out, the general case is not solvable. Douglas Hofstadter spent a lot of time studying this problem and describes some approaches (some automated, some manual); see the first chapter of:
http://www.amazon.com/Fluid-Concepts-And-Creative-Analogies/dp/0465024750
I believe his general approach was to use an AI search algorithm (depth-first or breadth-first search combined with some good heuristics). The algorithm would generate possible sequences using different operators (such as "repeat the previous digit i times, or i/2 times") and follow branches in the search tree where the operations specified by the nodes along that branch had correctly predicted the next digit(s), until it could predict the sequence far enough ahead that you are satisfied it has the correct answer. It would then output the sequence of operations that constitute the pattern (although these operations need to be designed by the user as building blocks for pattern generation).
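To give a flavour of that kind of search, here is a minimal sketch (my own toy illustration, not Hofstadter's actual method): it simply brute-forces a tiny, hand-built space of block-generating hypotheses against the two example strings from the question. A real system would compose such operators and explore the resulting space with breadth-first or depth-first search plus heuristics.

```python
def explains(sequence, block, max_blocks=50):
    """True if `sequence` is a prefix of block(1) + block(2) + ... ."""
    generated, i = "", 1
    while len(generated) < len(sequence) and i <= max_blocks:
        generated += block(i)
        i += 1
    return generated[:len(sequence)] == sequence

# A tiny, hand-built hypothesis space of block generators ("operators").
HYPOTHESES = {
    '"0"^i "1"':     lambda i: "0" * i + "1",
    '"0"^i "1"^i':   lambda i: "0" * i + "1" * i,
    '"01" repeated': lambda i: "01",
    '"0" repeated':  lambda i: "0",
}

def find_pattern(sequence):
    """Return the names of all hypotheses that reproduce the sequence."""
    return [name for name, block in HYPOTHESES.items() if explains(sequence, block)]

print(find_pattern("01001000100001"))        # -> ['"0"^i "1"']
print(find_pattern("01001100011100001111"))  # -> ['"0"^i "1"^i']
```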
Also, genetic programming may be able to solve this problem.
I'm reading The Algorithm Design Manual by Steven S. Skiena, and trying to understand the solution to the problem in the war story "What's Past is Prolog".
The problem is also well described here.
Basically, the problem is: given an ordered list of strings, construct a trie of minimum size (with string characters as the nodes), under the constraint that the order of the strings must be preserved, while the character positions may be reordered.
Maybe this is not an appropriate question for Stack Overflow, but I'm still wondering if anyone could give me a hint on the solution, especially what this recurrence means by its arguments:
the recurrence for the Dynamic Programming algorithm
You can think about it this way:
Let's assume that we fix the index of the first character. All strings get split into r bins based on the value of the character in this position (bins are essentially subtrees).
We can work with each bin independently. It won't change the order across different bins because two strings in different bins are different in the first character.
Thus, we can solve the problem for each bin independently. After that, we need exactly one edge to connect the root to each bin (that is, subtree). That's where the formula
C[i_k, j_k] + 1 comes from.
As we want to minimize the total number of edges and we're free to pick the first position, we just try all possible options among m positions.
Note: this algorithm is correct under the assumption that we can reorder the rest of the characters in each subtree independently. If that's not the case, the dynamic programming solution is incorrect.
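To make the recurrence concrete, here is a small memoized sketch (my own illustration, not the book's code; it assumes all strings have the same length). The memo state here also carries the set of still-unused positions; under the assumption above, that component could be dropped, leaving exactly the two-index table C[i, j].

```python
from functools import lru_cache

def min_trie_edges(strings):
    """Minimum number of edges in a trie over `strings` (all the same length),
    where the left-to-right order of the strings must be preserved but the
    character positions may be examined in any order.  Each bin in the
    recurrence C[i, j] = min over positions p of sum_k (C[i_k, j_k] + 1) is a
    run of consecutive strings that agree at position p; the +1 is the edge
    connecting the subtree root to that bin."""
    n = len(strings)
    m = len(strings[0]) if n else 0

    @lru_cache(maxsize=None)
    def cost(i, j, remaining):
        # Edges needed below this node for the block strings[i..j], where
        # `remaining` is the tuple of positions not yet used above this node.
        if not remaining:
            return 0
        if i == j:
            return len(remaining)        # one edge per leftover character
        best = float("inf")
        for p in remaining:
            # Split the block into runs of consecutive strings agreeing at p.
            bins, start = [], i
            for k in range(i + 1, j + 1):
                if strings[k][p] != strings[k - 1][p]:
                    bins.append((start, k - 1))
                    start = k
            bins.append((start, j))
            # Order is preserved only if each character value forms one run.
            chars = [strings[a][p] for a, _ in bins]
            if len(set(chars)) != len(chars):
                continue
            rest = tuple(q for q in remaining if q != p)
            best = min(best, sum(cost(a, b, rest) + 1 for a, b in bins))
        return best                      # inf if no position preserves the order

    return cost(0, n - 1, tuple(range(m)))

print(min_trie_edges(["abc", "adc", "bbd"]))   # -> 7 (vs. 8 for the fixed left-to-right order)
```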
You might have to read this twice so that my idea becomes clear. Please be patient.
I'm looking for existing work on exhaustive search for algorithms that solve given problems. Exhaustive search is also known as brute-force search, or simply brute force.
Other exhaustive search algorithms search for a solution to a given problem. Usually, a solution to such a problem is some data which fulfills certain requirements.
Exhaustive search example:
You want the solution to a KNAPSACK problem: the set of objects to pack into the bag such that no other combination of objects fits into the bag and sums up to a bigger value than your chosen combination.
You can solve this by going over all possible combinations (exhaustively) and searching for the one which fits into the bag and is the most valuable of those combinations.
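For illustration, a minimal brute-force knapsack along those lines (a self-contained sketch; it runs in O(2^n), so it is only practical for small inputs):

```python
from itertools import combinations

def knapsack_bruteforce(items, capacity):
    """Exhaustive 0/1 knapsack: try every subset of items and keep the most
    valuable one that fits.  `items` is a list of (weight, value) pairs."""
    best_value, best_subset = 0, ()
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            weight = sum(w for w, _ in subset)
            value = sum(v for _, v in subset)
            if weight <= capacity and value > best_value:
                best_value, best_subset = value, subset
    return best_value, best_subset

# Example: three objects, bag capacity 5.
print(knapsack_bruteforce([(2, 3), (3, 4), (4, 6)], 5))   # -> (7, ((2, 3), (3, 4)))
```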
What I'm looking for is just a special case of exhaustive search: Exhaustive search which searches for an algorithm as a solution. So in the end, I'm looking for an algorithm which searches for an algorithm which solves some given problem.
You might say: go google it. Well, yes, I've done that. The difficulty I'm facing is that googling for "algorithm which searches for another algorithm" returns much the same results as "another search algorithm". Obviously, this has too many unwanted results, so I'm stuck.
How can I find existing work related to exhaustive search for algorithms?
More specifically: has any software been written for this? Can you point me to any links, algorithm names, or better keywords related to this topic?
Update:
The purpose for which I'm looking for such an algorithm search is solving problems for which no good heuristics are known, e.g. proving algorithms, or trying to find other solution algorithms for problems which might or might not be NP-complete (thus proving that the problem is not NP-complete if a faster algorithm can be found; without any human interaction).
You seem to be looking for "program synthesis", which can work in some limited instances, provided you can correctly and formally specify what your algorithm is supposed to do (without giving an implementation).
Synthesis is an effective approach to build gate-level circuits, but applying synthesis to software is so far still more of a research avenue than practical.
Still, here are a couple of references on the subject:
Program sketching by Armando Solar-Lezama (some of the most advanced work in the area, in my opinion; it comes with a tool).
Check out the Microsoft Research page on the topic; they think it's a hot topic: http://research.microsoft.com/en-us/um/people/sumitg/pubs/synthesis.html
Some other similar stuff I've seen:
Model Checking-Based Genetic Programming with an Application to Mutual Exclusion (Katz & Peled, TACAS '08); they have a more recent version on arXiv: http://arxiv.org/abs/1402.6785
Essentially the search space is explored (exhaustively) using a model-checker.
I wouldn't think it would be possible (at least without some restriction on the class of algorithms) and, in any event, the search space would be so large that it would make ordinary brute-force tame by comparison. You could, for example, enumerate algorithms for a specific model of computation (e.g. Turing machines) but then the Halting problem rears its head -- how can you tell if it solves your problem? If you have a family of algorithms which depend on discrete parameters then you could of course brute force the choice of the parameters.
There is an extensive literature on Genetic Programming (see https://en.wikipedia.org/wiki/Genetic_programming ). That might give you something to go on. In particular, the tree data structures that are often used in this context (essentially expression trees, or more generally abstract syntax trees) could admit a brute-force enumeration.
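As a toy illustration of that idea (entirely my own sketch, not from the GP literature), the following brute-forces small arithmetic expression trees and returns the first one consistent with a set of input/output examples:

```python
from itertools import product

def enumerate_expressions(depth, variables=("x",), constants=(0, 1), ops=("+", "*", "-")):
    """Brute-force enumeration of small arithmetic expression trees,
    yielded as strings that can be evaluated with eval()."""
    if depth == 0:
        for leaf in list(variables) + [str(c) for c in constants]:
            yield leaf
        return
    # Leaves are also valid at any depth.
    yield from enumerate_expressions(0, variables, constants, ops)
    for op in ops:
        for left, right in product(enumerate_expressions(depth - 1, variables, constants, ops),
                                   enumerate_expressions(depth - 1, variables, constants, ops)):
            yield f"({left} {op} {right})"

def synthesize(examples, max_depth=2):
    """Return the first enumerated expression consistent with all
    (input, output) examples, or None if none is found."""
    for expr in enumerate_expressions(max_depth):
        if all(eval(expr, {"x": x}) == y for x, y in examples):
            return expr
    return None

# Recover f(x) = x*x + 1 from four input/output pairs.
print(synthesize([(0, 1), (1, 2), (2, 5), (3, 10)]))   # -> '(1 + (x * x))'
```

Even this toy version shows why the search space explodes: the number of candidate trees grows exponentially with depth and with the size of the operator set, which is exactly the problem the GP and synthesis literature tries to tame.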
Take a look at Neural Turing Machines by Alex Graves, Greg Wayne and Ivo Danihelka from Google DeepMind.
Abstract:
We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
This paper, entitled "Systematic search for lambda expressions", performs exhaustive enumeration of the space of small type-safe functional programs expressed in the lambda calculus.
I used Genetic Programming for evolving/generating heuristics for the Travelling Salesman Problem. The evolved heuristic is better than the Nearest Neighbor on the tested problems (random graphs and other graphs taken from TSPLIB). If you want the source code, you can download it from here: http://www.tcreate.org/evolve_heuristics.html
I have been learning about genetic algorithms for two months. I know about the process of initial population creation, selection, crossover, mutation, etc., but I could not understand how we are able to get better results in each generation and how this differs from a random search for the best solution. Below I use one example to explain my problem.
Let's take the example of the travelling salesman problem. Say we have several cities X1, X2, ..., X18 and we have to find the shortest route through them. So when we do the crossover after selecting the fittest individuals, how do we know that the crossover will produce a better chromosome? The same applies to mutation.
I feel like it just takes one arrangement of cities, calculates the distance to travel them, and stores the distance along with the arrangement. Then it chooses another arrangement/combination; if it is better than the previous arrangement, it saves the current arrangement/combination and distance, otherwise it discards the current arrangement. By doing that alone, we would also get some solution.
I just want to know at what point a genetic algorithm differs from random selection. In a genetic algorithm, is there any criterion that prevents us from selecting an arrangement/combination of cities which we have already evaluated?
I am not sure if my question is clear, but I am happy to explain it further. Please let me know if it is not clear.
A random algorithm starts with a completely blank sheet every time. A new random solution is generated each iteration, with no memory of what happened before during the previous iterations.
A genetic algorithm has a history, so it does not start with a blank sheet, except at the very beginning. Each generation, the best members of the solution population are selected, mutated in some way, and advanced to the next generation. The least good members of the population are dropped.
Genetic algorithms build on previous success, so they are able to advance faster than random algorithms. A classic example of a very simple genetic algorithm, is the Weasel program. It finds its target far more quickly than random chance because each generation it starts with a partial solution, and over time those initial partial solutions are closer to the required solution.
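Here is a minimal version of the Weasel program (my own toy sketch, not Dawkins' original code) that makes the contrast with random search concrete: each generation starts from the best string found so far instead of starting from scratch.

```python
import random
import string

TARGET = "METHINKS IT IS LIKE A WEASEL"
ALPHABET = string.ascii_uppercase + " "

def fitness(candidate):
    """Number of characters that match the target in the right position."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(parent, rate=0.05):
    """Copy the parent, changing each character with probability `rate`."""
    return "".join(c if random.random() > rate else random.choice(ALPHABET)
                   for c in parent)

def weasel(population_size=100):
    parent = "".join(random.choice(ALPHABET) for _ in TARGET)
    generation = 0
    while parent != TARGET:
        children = [mutate(parent) for _ in range(population_size)]
        # Keep the parent in the pool (simple elitism) so fitness never decreases.
        parent = max(children + [parent], key=fitness)
        generation += 1
    return generation

print("Reached the target in", weasel(), "generations")
```

A pure random search over 27^28 possible strings would essentially never hit the target, while this converges in a few hundred generations, because partial solutions are carried forward.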
I think there are two things you are asking about: a mathematical proof that GAs work, and an empirical one that would allay your concerns.
Although I am not aware of a general proof, I am quite sure that at least a good sketch of a proof was given by John Holland in his book Adaptation in Natural and Artificial Systems for optimization problems using binary coding. There is something called Holland's schema theorem. But you know, it's a heuristic, so technically it does not have to hold. It basically says that short schemata in the genotype that raise the average fitness appear exponentially more often in successive generations. Crossover then combines them together. I think the proof was given only for binary coding and has received some criticism as well.
Regarding your concerns: of course you have no guarantee that a crossover will produce a better result, just as two intelligent or beautiful parents might have ugly, stupid children. The premise of a GA is that this is less likely to happen. (As I understand it) the proof for binary coding hinges on the theorem saying that good partial patterns will start emerging and, given that the genotype is long enough, such patterns residing in different specimens have a chance to be combined into one specimen, improving its fitness in general.
I think it is fairly easy to understand in terms of the TSP. Crossover helps to accumulate good sub-paths in one specimen. Of course it all depends on the choice of the crossover method.
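For example, a simplified order crossover (OX) for permutation-encoded tours, sketched here purely for illustration (not taken from any particular GA library): it copies a contiguous sub-path from one parent and fills the remaining cities in the order they appear in the other parent, so good sub-paths from both parents can survive in the child.

```python
import random

def order_crossover(parent_a, parent_b):
    """Simplified order crossover for permutation-encoded TSP tours."""
    n = len(parent_a)
    lo, hi = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[lo:hi] = parent_a[lo:hi]                      # sub-path inherited from parent A
    fill = [city for city in parent_b if city not in child]
    for i in range(n):                                   # remaining cities in parent B's order
        if child[i] is None:
            child[i] = fill.pop(0)
    return child

print(order_crossover(["X1", "X2", "X3", "X4", "X5", "X6"],
                      ["X6", "X4", "X2", "X1", "X5", "X3"]))
```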
Also, a GA's path towards the solution is not purely random. It moves in a certain direction, with stochastic mechanisms to escape traps (local optima). You can lose the best solutions if you allow it. It works because it moves towards the current best solutions, but you have a population of specimens and they effectively share knowledge. They are all similar, but as long as you preserve diversity, new and better partial patterns can be introduced into the whole population and get incorporated into the best solutions. This is why diversity in the population is regarded as very important.
As a final note, please remember that GAs are a very broad topic and you can modify the basic scheme in nearly any way you want. You can introduce elitism, tabus, niches, etc. There is no one-and-only approach/implementation.
What is the best algorithm for finding the closest word?
A dictionary of possible words is given, and the first characters of the input word can be wrong.
One option is BK-trees - see my blog post about them here. Another, faster but more complex option is Levenshtein Automata, which I've also written about, here.
There are tools such as Hunspell (an open-source spell-checker widely used, including in OpenOffice) which have approached the problem from multiple perspectives. One widely used criterion for deciding how close two words are is the Levenshtein distance, which is also used in Hunspell.
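For reference, the textbook dynamic-programming form of that distance, plus a naive linear scan over the dictionary (a sketch; real spell-checkers use smarter indexes such as the BK-trees or Levenshtein automata mentioned above):

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, and
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest_word(word, dictionary):
    """Linear scan: return the dictionary word with the smallest edit distance."""
    return min(dictionary, key=lambda w: levenshtein(word, w))

print(closest_word("korrect", ["correct", "collect", "connect"]))  # -> "correct"
```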
You could use BLAST
and modify it to exploit the fact that words in a dictionary are discrete units, which makes the matching process more specific than for a long DNA string.
BLAST already has built into it the notion of edit distances.
Alternatively, you could use suffix trees (Dan Gusfield has an excellent book on basic string matching algorithms) and build the idea of edit distances in.
I have a problem where I really need to be able to use finite automata as the keys to an associative container. Each key should actually represent an equivalence class of automata, so that when I search, I will find an equivalent automaton (if such a key exists), even if that automaton isn't structurally identical.
An obvious last-resort approach is of course to use linear search with an equivalence test for each key checked. I'm hoping it's possible to do a lot better than this.
I've been thinking in terms of trying to impose an arbitrary but consistent ordering, and deriving an ordered comparison algorithm. First principles involve the sets of strings that the automata represent. Evaluate the set of possible first tokens for each automaton, and apply an ordering based on those two sets. If necessary, continue to the sets of possible second tokens, third tokens etc. The obvious problem with doing this naively is that there's an infinite number of token-sets to check before you can prove equivalence.
I've been considering a few vague ideas - minimising the input automata first and using some kind of closure algorithm, converting back to a regular grammar, or some ideas involving spanning trees. I've come to the conclusion that I need to abandon the set-of-tokens lexical ordering, but the most significant conclusion I've reached so far is that this isn't trivial, and I'm probably better off reading up on someone else's solution.
I've downloaded a paper from CiteSeerX - Total Ordering on Subgroups and Cosets - but my abstract algebra isn't even good enough to know if this is relevant yet.
It also occurred to me that there might be some way to derive a hash from an automaton, but I haven't given this much thought yet.
Can anyone suggest a good paper to read? - or at least let me know if the one I've downloaded is a red herring or not?
I believe that you can obtain a canonical form from minimized automata. For any two equivalent automata, their minimized forms are isomorphic (I believe this follows from the Myhill-Nerode theorem). This isomorphism respects edge labels and of course node classes (start, accepting, non-accepting). This makes it easier than unlabeled graph isomorphism.
I think that if you build a spanning tree of the minimized automaton starting from the start state and ordering output edges by their labels, then you'll get a canonical form for the automaton which can then be hashed.
Edit: Non-tree edges should be taken into account too, but they can also be ordered canonically by their labels.
Here is a thesis from 1992 where they produce canonical minimized automata: Minimization of Nondeterministic Finite Automata.
Once you have the canonical form, you can easily hash it, for example by performing a depth-first enumeration of the states and transitions, and hashing a string obtained by encoding state numbers (counted in the order of their first appearance) and transitions as tuples
<from_state, symbol, to_state, is_accepting_final_state>
This should solve the problem.
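A minimal sketch of that idea, assuming a hypothetical DFA representation made up here for illustration (a start state, a set of accepting states, and a (state, symbol) -> state transition map). It renumbers states in the order they are first reached when outgoing edges are followed in sorted label order (a breadth-first variant of the traversal described above; a depth-first one works just as well), and it records accepting states separately rather than inside each transition tuple. Isomorphic minimized DFAs then produce identical, hashable canonical forms.

```python
def canonical_form(dfa):
    """Canonical, hashable form of a minimized DFA given as
    (start, accepting_set, transitions) with transitions[(state, symbol)] = state."""
    start, accepting, transitions = dfa
    numbering = {start: 0}          # state -> canonical number
    order = [start]                 # states in order of first appearance
    edges = []
    i = 0
    while i < len(order):           # label-ordered traversal from the start state
        state = order[i]
        i += 1
        out = sorted((sym, dst) for (src, sym), dst in transitions.items()
                     if src == state)
        for sym, dst in out:
            if dst not in numbering:
                numbering[dst] = len(order)
                order.append(dst)
            edges.append((numbering[state], sym, numbering[dst]))
    accepting_canonical = tuple(sorted(numbering[s] for s in accepting if s in numbering))
    return (tuple(edges), accepting_canonical)

# Two isomorphic (but differently named) minimized DFAs for "strings ending in 1":
dfa_a = ("p", {"q"}, {("p", "0"): "p", ("p", "1"): "q",
                      ("q", "0"): "p", ("q", "1"): "q"})
dfa_b = ("x", {"y"}, {("x", "0"): "x", ("x", "1"): "y",
                      ("y", "0"): "x", ("y", "1"): "y"})
print(canonical_form(dfa_a) == canonical_form(dfa_b))   # True
print(hash(canonical_form(dfa_a)))                      # usable as a dict/set key
```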
When a problem seems insurmountable, the solution is often to publicly announce how difficult you think the problem is. Then, you will immediately realise that the problem is trivial and that you've just made yourself look an idiot - and that's basically where I am now ;-)
As suggested in the question, to lexically order the two automata, I need to consider two things. The two sets of possible first tokens, and the two sets of possible everything-else tails. The tails can be represented as finite automata, and can be derived from the original automata.
So the comparison algorithm is recursive - compare the head, if different you have your result, if the same then recursively compare the tail.
The problem is the infinite sequence needed to prove equivalence for regular grammars in general. If, during a comparison, a pair of automata recur, equivalent to a pair that you checked previously, you have proven equivalence and you can stop checking. It is in the nature of finite automata that this must happen in a finite number of steps.
The problem is that I still have a problem in the same form. To spot my termination criteria, I need to compare my pair of current automata with all the past automata pairs that occurred during the comparison so far. That's what has been giving me a headache.
It also turns out that that paper is relevant, but probably only takes me this far. Regular languages can form a group using the concatenation operator, and the left coset is related to the head:tail things I've been considering.
The reason I'm an idiot is because I've been imposing a far too strict termination condition, and I should have known it, because it's not that unusual an issue WRT automata algorithms.
I don't need to stop at the first recurrence of an automata pair. I can continue until I find a more easily detected recurrence - one that has some structural equivalence as well as logical equivalence. So long as my derive-a-tail-automaton algorithm is sane (and especially if I minimise and do other cleanups at each step) I will not generate an infinite sequence of equivalent-but-different-looking automata pairs during the comparison. The only sources of variation in structure are the original two automata and the tail automaton algorithm, both of which are finite.
The point is that it doesn't matter that much if I compare too many lexical terms - I will still get the correct result, and while I will terminate a little later, I will still terminate in finite time.
This should mean that I can use an unreliable recurrence detection (allowing some false negatives) using a hash or ordered comparison that is sensitive to the structure of the automata. That's a simpler problem than the structure-insensitive comparison, and I think it's the key that I need.
Of course there's still the issue of performance. A linear search using a standard equivalence algorithm might be a faster approach, based on the issues involved here. Certainly I would expect this comparison to be a less efficient equivalence test than existing algorithms, as it is doing more work - lexical ordering of the non-equivalent cases. The real issue is the overall efficiency of a key-based search, and that is likely to need some headache-inducing analysis. I'm hoping that the fact that non-equivalent automata will tend to compare quickly (detecting a difference in the first few steps, like traditional string comparisons) will make this a practical approach.
Also, if I reach a point where I suspect equivalence, I could use a standard equivalence algorithm to check. If that check fails, I just continue comparing for the ordering where I left off, without needing to check for the tail language recurring - I know that I will find a difference in a finite number of steps.
If all you can do is == or !=, then I think you have to check every set member before adding another one. This is slow. (Edit: I guess you already know this, given the title of your question, even though you go on about comparison functions to directly compare two finite automata.)
I tried to do that with phylogenetic trees, and it quickly ran into performance problems. If you want to build large sets without duplicates, you need a way to transform to a canonical form. Then you can check a hash, or insert into a binary tree with the string representation as a key.
Another researcher who came up with a way to transform a tree to a canonical representation used Patricia trees to store unique trees for duplicate checking.
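As a tiny illustration of the canonical-form idea for trees (a sketch for rooted, unordered trees; not that researcher's actual Patricia-tree method): children are canonicalized recursively and sorted, so equivalent trees map to the same string, which can then be hashed or stored in a set.

```python
def canonical(tree):
    """Canonical string for a rooted, unordered tree given as (label, [children]).
    Children are canonicalized recursively and sorted, so trees that differ only
    in child order produce identical strings."""
    label, children = tree
    return "(" + label + "".join(sorted(canonical(c) for c in children)) + ")"

t1 = ("a", [("b", []), ("c", [("d", [])])])
t2 = ("a", [("c", [("d", [])]), ("b", [])])   # same tree, children swapped
print(canonical(t1) == canonical(t2))          # True
```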