Related
Recently I am thinking about an algorithm constructed by myself. I call it Replacment Compiling.
It works as follows:
Define a language as well as its operators' precedence, such as
(1) store <value> as <id>, replace with: var <id> = <value>, precedence: 1
(2) add <num> to <num>, replace with: <num> + <num>, precedence: 2
Accept a line of input, such as store add 1 to 2 as a;
Tokenize it: <kw,store><kw,add><num,1><kw,to><2><kw,as><id,a><EOF>;
Then scan through all the tokens until reach the end-of-file, find the operation with highest precedence, and "pack" the operation:
<kw,store>(<kw,add><num,1><kw,to><2>)<kw,as><id,a><EOF>
Replace the "sub-statement", the expression in parenthesis, with the defined replacement:
<kw,store>(1 + 2)<kw,as><id,a><EOF>
Repeat until no more statements left:
(<kw,store>(1 + 2)<kw,as><id,a>)<EOF>
(var a = (1 + 2))
Then evaluate the code with the built-in function, eval().
eval("var a = (1 + 2)")
Then my question is: would this algorithm work, and what are the limitations? Is this algorithm works better on simple languages?
This won't work as-is, because there's no way of deciding the precedence of operations and keywords, but you have essentially defined parsing (and thrown in an interpretation step at the end). This looks pretty close to operator-precedence parsing, but I could be wrong in the details of your vision. The real keys to what makes a parsing algorithm are the direction/precedence it reads the code, whether the decisions are made top-down (figure out what kind of statement and apply the rules) or bottom-up (assemble small pieces into larger components until the types of statements are apparent), and whether the grammar is encoded as code or data for a generic parser. (I'm probably overlooking something, but this should give you a starting point to make sense out of further reading.)
More typically, code is generally parsed using an LR technique (LL if it's top-down) that's driven from a state machine with look-ahead and next-step information, but you'll also find the occasional recursive descent. Since they're all doing very similar things (only implemented differently), your rough algorithm could probably be refined to look a lot like any of them.
For most people learning about parsing, recursive-descent is the way to go, since everything is in the code instead of building what amounts to an interpreter for the state machine definition. But most parser generators build an LL or LR compiler.
And I'm obviously over-simplifying the field, since you can see at the bottom of the Wikipedia pages that there's a smattering of related systems that partly revolve around the kind of grammar you have available. But for most languages, those are the big-three algorithms.
What you've defined is a rewriting system: https://en.wikipedia.org/wiki/Rewriting
You can make a compiler like that, but it's hard work and runs slowly, and if you do a really good job of optimizing it then you'll get conventional table-driven parser. It would be better in the end to learn about those first and just start there.
If you really don't want to use a parser generating tool, then the easiest way to write a parser for a simple language by hand is usually recursive descent: https://en.wikipedia.org/wiki/Recursive_descent_parser
Its well known in theoretical computer science that the "Hello world tester" program is an undecidable problem.(Here is a link what i mean by hello world tester
My question is since given a program as input we can't say what the program will do,can we solve the reverse problem:
Given set of input and output,is there an algorithm for writing a program that writes a program to achieve a one to one mapping between the given input and output.
I know about metaprogramming but my question is more of theoretical interest. Something which can apply for a generic case.
With these kind of musings one has to be very careful. A lot of confusion arises from not clearly distinguishing about a program x for which proposition P(x) holds or any program x for which proposition P(x) hold. As long as the set of programs for which P(x) holds is finite there always is a proof, of their correctness (although this proof may not be known).
At this point you also have to distinguish between programs, which are and can be known and programs which can only be shown to exist by full enumeration of all posibilities. Let's make an example:
Take 10 Programs, which take no input and may or may not terminate and produce "hello World". Then there is a program which decides exactly which of these programs are correct, and which aren't. Lets call these programs (x_1,...,x_10). Then take the programs (X_0,...,X_{2^10}) where X_i output true for program x_j if the j-th bit in the binary representation of i is set. One of these programs has to be the one which decides correctly for all ten x_i, there just might not be any way to ever figure out which one of these 100 X_js is the correct one (a meta-problem at this point).
This goes to show that considering finite sets of programs and input/output pairs one can always resolve to full enumeration and all halting-problem type of paradoxies instantly disappear. In your case the set of generated programs for each input is of size one and the set of input/output pairs is of finite size (because you have to supply it to the meta-program). Hence full enumeration solves your problem very simple and you can also easily proof both the correctness of the corrected program as well as the correctness of the meta-program.
Note: Since the set of generated programs is infinite, this is one of the few cases where you can proof P(x) for a infinite set of programs (actually you can proof P(x,input,output) for this set). This shows that the set being infinite is only a necessary, not a sufficient condition for this type of paradoxes to appear.
Your question is ambiguously phrased.
How would you specify "what a program should do"?
Any precise, complete, and machine-readable specification of a program's functionality is already a program.
Thus, the answer to your question is, a compiler.
Now, you're asking how to find a function based on a sample of its input and output.
That is a question about statistical analysis that I cannot answer.
Sounds like you want to generate a state machine that learns by being given an input sequence and then updates itself to produce the appropriate output sequence. Assuming your output sequences are always the same for the same input sequence it should be simple enough to write. If your output is not deterministic, such as changing the output depending on the time of day, then you cannot automatically generate a state machine.
Depends on what you mean by "one to one mapping". (And also, I suppose, "input" and "output".)
My guess is that you're asking whether, given an example of inputs and outputs for a given arbitrary program, can one devise an algorithm to write an equivalent program? If so, the answer is no. Eg, you could have a program with the inputs/outputs of 1/1, 2/2, 3/3, 4/4, and yet, if the next input value was 5, the output would be 3782. There's no way to know, from a given set of results, what the next result might be.
The question is underspecified since you did not say how the input and output are presented. For finite lists, the answer is "yes", as in this Python code:
def f(input,output):
print "def g():"
print " x = {" + ",".join(repr(x) + ":" + repr(y) for x,y in zip(input,output)) + "}"
print " print x[raw_input()]"
>>> f(['2','3','4'],['a','b','x'])
def g():
x = {'2':'a','3':'b','4':'x'}
print x[raw_input()]
>>> def g():
... x = {'2':'a','3':'b','4':'x'}
... print x[raw_input()]
...
>>> g()
3
b
for infinite sets how are you going to present them? If you show only a small sample of input this does not specify the whole algorithm. Guessing the best fit is undecidable. If you have a "magic blackbox" then there are continuum many mappings but only a countable number of programs, so it's impossible.
I think I agree with SLaks, but taking things from a different angle, what does a compiler do?
(EDIT: I see SLaks edited his original answer, which was essentially 'you're describing the identity function').
It takes a program in one source language that describes the intended behaviour of a program, and "writes" another program in a target language that exhibits that behaviour.
We could also think of this in terms of things like process refinement --- given an abstract specification, we can construct a refinement mapping to some "more concrete" (read: less non-deterministic, usually) implementation.
But based on your question, it's really very difficult to tell which of these you meant, if any.
Apologies for the non-descriptive question; if you can think of a better one, I'm all ears.
I'm writing some Perl to implement an algorithm and the code I have smells fishy. Since I don't have a CS background, I don't have a lot of knowledge of standard algorithms in my back pocket, but this seems like something that it might be.
Let me describe what I'm doing by way of metaphor:
You have a conveyor belt of oranges. The oranges pass you one by one. You also have an unlimited supply of flat-packed boxes.
For each orange, check it. If it is rotten, dispose of it
If it is good, put it in a box. If you don't have a box, grab a new one and construct it.
If the box has 10 oranges in it, close it up and put it on a pallet. Do not construct a new one.
Repeat until you have no more oranges
If you have a constructed box with some oranges in it, close it up and put it on a pallet
So, we have an algorithm for processing items in a list, if they meet some criteria, they should be added to a structure which, when it meets some other criteria, should be 'closed out'. Also, once the list has been processed, if there's an 'open' structure, it should be 'closed out' as well.
Naively, I assume that the algorithm consists of a loop acting over the list, with a conditional to see if the list element belongs in the structure and a conditional to see if the structure needs to be 'closed'.
Outside the loop, there would be one more conditional to close any outstanding structures.
So, here are my questions:
Is this a description of a well-known algorithm? If so, does it have a name?
Is there an effective way to coalesce the 'closing out the box' activity into a single place, as opposed to once inside the loop and once outside of the loop?
I tagged this as 'Perl' because Perlish approaches are of interest, but I'd be interested to hear of any other languages that have neat solutions to this.
It's a nice fit with a functional approach - you're iterating over a stream of Oranges, testing, grouping and operating on them. In Scala, it would be something like:
val oranges:Stream[Oranges] = ... // generate a stream of Oranges
oranges.filter(_.isNotRotten).grouped(10).foreach{ o => {(new Box).fillBox(o)}}
(grouped does the right thing with the partial box at the end)
There's probably Perl equivalents.
Is there an effective way to coalesce the 'closing out the box' activity into a single place,
as opposed to once inside the loop and once outside of the loop?
Yes. Simply add "... or there are no more oranges" to the "does the structure need to be closed" function. The easiest way of doing this is a do/while construct (technically speaking it's NOT a loop in Perl, though it looks like one):
my $current_container;
my $more_objects;
do {
my $object = get_next_object(); # Easiest implementation returns undef if no more
$more_objects = more_objects($object) # Easiest to implement as "defined $object"
if (!$more_objects || can_not_pack_more($current_container) {
close_container($current_container);
$current_container = open_container() if $more_objects;
}
pack($object, $current_container) if $more_objects;
} while ($more_objects);
IMHO, this doesn't really win you anything if the close_container() is encapsulated into a method - there's no major technical or code quality cost to calling it both inside and outside the loop. Actually, I'm strongly of the opinion that a convoluted workaround like I presented above is WORSE code quality wise than a straightforward:
my $current_container;
while (my $more_objects = more_objects(my $object = get_next_object())) {
if (can_not_pack_more($current_container)) { # false on undef
close_container($current_container);
}
$current_container = open_container_if_closed($current_container); # if defined
pack($object, $current_container);
}
close_container($current_container);
It seems like a bit over-complicated for the problem you are describing, but it sounds theoretically close to Petri Nets. check Petri Nets on wikipedia
A perl implementation can be found here
I hope this will help you,
Jerome Wagner
I don't think there's a name for this algorithm. For a straight-forward implementation you'll need two tests: one to detect a full box while in the processing loop and one after the loop to detect a partially full box. The "closing the box" logic can be made into a subroutine to avoid duplicating it. A functional approach could provide a way around that:
use List::MoreUtils qw(part natatime);
my ($good, $bad) = part { $_->is_rotten() } #oranges;
$_->dispose() foreach #$bad;
my $it = natatime 10, #$good;
while (my #batch = $it->()) {
my $box = Box->new();
$box->add(#batch);
$box->close();
$box->stack();
}
When looking at algorithms, the mainstream CS ones tend to handle very complex situations, or employ very complex approaches (look up NP-Complete for example). Moreover, the algorithms tend to focus on optimization. How can this system be more efficient? How can I use less steps in this production schedule? What is the most amount of foos that I can fit in my bar? etc.
An example of a complex approach in my opinion is quick sort because the recursion is pretty genius. I know it is standard, but I really like it. If you like algorithms, then check out the Simplex Algorithm - it has been very influential.
An example of a complex situation would be if you had oranges that go in, get sorted into 5 orange piles, then went to 5 different places to be peeled, then all came back with another path of oranges to total 10 orange piles, then each orange was individually sliced, and boxed in groups of exactly 2 pounds.
Back to your example. Your example is a simplified version of a flow network. Instead of having so many side paths and options, there is only one path with a capacity of one orange at a time. Of the flow network algorithms, the Ford-Fulkerson algorithm is probably the most influential.
So, you can probably fit one of these algorithms into the example posed, but it would be through a simplification process. Basically there is not enough complexity here to need any optimization. And there is no risk of running at an inefficient time complexity so there is no need to be running the "perfect approach".
The approach you detailed is going to be fine here, and the accepted answer above does a good job suggesting an actual functional solution to the defined problem. I just wanted to add my 2 cents with regards to algorithms.
I was thinking earlier today about an idea for a small game and stumbled upon how to implement it. The idea is that the player can make a series of moves that cause a little effect, but if done in a specific sequence would cause a greater effect. So far so good, this I know how to do. Obviously, I had to make it be more complicated (because we love to make it more complicated), so I thought that there could be more than one possible path for the sequence that would both cause greater effects, albeit different ones. Also, part of some sequences could be the beggining of other sequences, or even whole sequences could be contained by other bigger sequences. Now I don't know for sure the best way to implement this. I had some ideas, though.
1) I could implement a circular n-linked list. But since the list of moves never end, I fear it might cause a stack overflow ™. The idea is that every node would have n children and upon receiving a command, it might lead you to one of his children or, if no children was available to such command, lead you back to the beggining. Upon arrival on any children, a couple of functions would be executed causing the small and big effect. This might, though, lead to a lot of duplicated nodes on the tree to cope up with all the possible sequences ending on that specific move with different effects, which might be a pain to maintain but I am not sure. I never tried something this complex on code, only theoretically. Does this algorithm exist and have a name? Is it a good idea?
2) I could implement a state machine. Then instead of wandering around a linked list, I'd have some giant nested switch that would call functions and update the machine state accordingly. Seems simpler to implement, but... well... doesn't seem fun... nor ellegant. Giant switchs always seem ugly to me, but would this work better?
3) Suggestions? I am good, but I am far inexperienced. The good thing of the coding field is that no matter how weird your problem is, someone solved it in the past, but you must know where to look. Someone might have a better idea than those I had, and I really wanted to hear suggestions.
I'm not absolutely completely sure that I understand exactly what you're saying, but as an analagous situation, say someone's inputting an endless stream of numbers on the keyboard. '117' is a magic sequence, '468' is another one, '411799' is another (which contains the first one).
So if the user enters:
55468411799
you want to fire 'magic events' at the *s:
55468*4117*99*
or something like that, right? If that's analagous to the problem you're talking about, then what about something like (Java-like pseudocode):
MagicSequence fireworks = new MagicSequence(new FireworksAction(), 1, 1, 7);
MagicSequence playMusic = new MagicSequence(new MusicAction(), 4, 6, 8);
MagicSequence fixUserADrink = new MagicSequence(new ManhattanAction(), 4, 1, 1, 7, 9, 9);
Collection<MagicSequence> sequences = ... all of the above ...;
while (true) {
int num = readNumberFromUser();
for (MagicSequence seq : sequences) {
seq.handleNumber(num);
}
}
while MagicSequence has something like:
Action action = ... populated from constructor ...;
int[] sequence = ... populated from constructor ...;
int position = 0;
public void handleNumber(int num) {
if (num == sequence[position]) {
// They've entered the next number in the sequence
position++;
if (position == sequence.length) {
// They've got it all!
action.fire();
position = 0; // Or disable this Sequence from accepting more numbers if it's a once-off
}
} else {
position = 0; // missed a number, start again!
}
}
You might want to implement a state machine anyway, but you don't have to hardcode state transitions.
Try to make a graph of states, where link between state A to state B will mean A can lead to B.
Then you can traverse graph at runtime to find where player goes.
Edit: You can define graph node as:
-state-id
-list of links to other states,
where every link defines:
-state-id
-precondition, a list of states what must be visited before going to this state
What you're describing sounds very similar to the technology tree in a game live Civilization.
I don't know how the Civ authors built theirs, but I'd be inclined to use a multigraph to represent possible 'moves' - there will be some you can start at with no 'experience', and once you're in them, there will be multiple paths through to the end.
Draw-out what potential options you can have at each stage of the game, and then draw lines going from some options to others.
That should give you a start on implementation, as graphs are [relatively] easy concepts to implement and utilize.
Sounds like a neural network. You could create one and train it to recognize the patterns that cause the various effects you are looking for.
What you're describing sounds somewhat similar to a dependency graph or a word graph. You might look into those.
#Cowan, #Javier: Nice idea, mind if I add to it?
Let the MagicSequence objects listen to the incoming stream of user input, that is notify them of the input (broadcast) and let each of them add the input to there internal input stream. This stream is cleared when the input is not the expected next input in the pattern that would have the MagicSequence fire its action. As soon as the pattern is completed, fire the action and clear the internal input stream.
Optimize this by only feeding input to the MagicSequences that are waiting for it. This could be done two ways:
You have an object that lets all MagicSequences connect with events that correspond with numbers in their patterns. MagicSequence(1,1,7) would add itself to got1 and got7, for example:
UserInput.got1 += MagicSequnece[i].SendMeInput;
You could optimize this such that after each input MagicSequences deregister from invalid events and register with valid ones.
create a small state machine for each effect that you'd want. at each user action, 'broadcast' it to all state machines. most of then won't care, but some will advance, or maybe go backwards. when one of them reaches it's goal, produce the desired effect.
to keep the code neat, don't hardcode the state machines, instead build a simple data structure that encodes the state graph: each state is a node with a list of interesting events, each one points to the next state's node. Each machine's state is simply a reference to the appropriate state node.
edit: It seems Cowan's advice is equivalent to this, but he optimises his state machines to express only simple sequences. seems enough for your specific problem, but more complex conditions could need a more general solution.
I am working on a project that requires the parsing of log files. I am looking for a fast algorithm that would take groups messages like this:
The temperature at P1 is 35F.
The temperature at P1 is 40F.
The temperature at P3 is 35F.
Logger stopped.
Logger started.
The temperature at P1 is 40F.
and puts out something in the form of a printf():
"The temperature at P%d is %dF.", Int1, Int2"
{(1,35), (1, 40), (3, 35), (1,40)}
The algorithm needs to be generic enough to recognize almost any data load in message groups.
I tried searching for this kind of technology, but I don't even know the correct terms to search for.
I think you might be overlooking and missed fscanf() and sscanf(). Which are the opposite of fprintf() and sprintf().
Overview:
A naïve!! algorithm keeps track of the frequency of words in a per-column manner, where one can assume that each line can be separated into columns with a delimiter.
Example input:
The dog jumped over the moon
The cat jumped over the moon
The moon jumped over the moon
The car jumped over the moon
Frequencies:
Column 1: {The: 4}
Column 2: {car: 1, cat: 1, dog: 1, moon: 1}
Column 3: {jumped: 4}
Column 4: {over: 4}
Column 5: {the: 4}
Column 6: {moon: 4}
We could partition these frequency lists further by grouping based on the total number of fields, but in this simple and convenient example, we are only working with a fixed number of fields (6).
The next step is to iterate through lines which generated these frequency lists, so let's take the first example.
The: meets some hand-wavy criteria and the algorithm decides it must be static.
dog: doesn't appear to be static based on the rest of the frequency list, and thus it must be dynamic as opposed to static text. We loop through a few pre-defined regular expressions and come up with /[a-z]+/i.
over: same deal as #1; it's static, so leave as is.
the: same deal as #1; it's static, so leave as is.
moon: same deal as #1; it's static, so leave as is.
Thus, just from going over the first line we can put together the following regular expression:
/The ([a-z]+?) jumps over the moon/
Considerations:
Obviously one can choose to scan part or the whole document for the first pass, as long as one is confident the frequency lists will be a sufficient sampling of the entire data.
False positives may creep into the results, and it will be up to the filtering algorithm (hand-waving) to provide the best threshold between static and dynamic fields, or some human post-processing.
The overall idea is probably a good one, but the actual implementation will definitely weigh in on the speed and efficiency of this algorithm.
Thanks for all the great suggestions.
Chris, is right. I am looking for a generic solution for normalizing any kind of text. The solution of the problem boils down to dynmamically finding patterns in two or more similar strings.
Almost like predicting the next element in a set, based on the previous two:
1: Everest is 30000 feet high
2: K2 is 28000 feet high
=> What is the pattern?
=> Answer:
[name] is [number] feet high
Now the text file can have millions of lines and thousands of patterns. I would like to parse the files very, very fast, find the patterns and collect the data sets that are associated with each pattern.
I thought about creating some high level semantic hashes to represent the patterns in the message strings.
I would use a tokenizer and give each of the tokens types a specific "weight".
Then I would group the hashes and rate their similarity. Once the grouping is done I would collect the data sets.
I was hoping, that I didn't have to reinvent the wheel and could reuse something that is already out there.
Klaus
It depends on what you are trying to do, if your goal is to quickly generate sprintf() input, this works. If you are trying to parse data, maybe regular expressions would do too..
You're not going to find a tool that can simply take arbitrary input, guess what data you want from it, and produce the output you want. That sounds like strong AI to me.
Producing something like this, even just to recognize numbers, gets really hairy. For example is "123.456" one number or two? How about this "123,456"? Is "35F" a decimal number and an 'F' or is it the hex value 0x35F? You're going to have to build something that will parse in the way you need. You can do this with regular expressions, or you can do it with sscanf, or you can do it some other way, but you're going to have to write something custom.
However, with basic regular expressions, you can do this yourself. It won't be magic, but it's not that much work. Something like this will parse the lines you're interested in and consolidate them (Perl):
my #vals = ();
while (defined(my $line = <>))
{
if ($line =~ /The temperature at P(\d*) is (\d*)F./)
{
push(#vals, "($1,$2)");
}
}
print "The temperature at P%d is %dF. {";
for (my $i = 0; $i < #vals; $i++)
{
print $vals[$i];
if ($i < #vals - 1)
{
print ",";
}
}
print "}\n";
The output from this isL
The temperature at P%d is %dF. {(1,35),(1,40),(3,35),(1,40)}
You could do something similar for each type of line you need to parse. You could even read these regular expressions from a file, instead of custom coding each one.
I don't know of any specific tool to do that. What I did when I had a similar problem to solve was trying to guess regular expressions to match lines.
I then processed the files and displayed only the unmatched lines. If a line is unmatched, it means that the pattern is wrong and should be tweaked or another pattern should be added.
After around an hour of work, I succeeded in finding the ~20 patterns to match 10000+ lines.
In your case, you can first "guess" that one pattern is "The temperature at P[1-3] is [0-9]{2}F.". If you reprocess the file removing any matched line, it leaves "only":
Logger stopped.
Logger started.
Which you can then match with "Logger (.+).".
You can then refine the patterns and find new ones to match your whole log.
#John: I think that the question relates to an algorithm that actually recognises patterns in log files and automatically "guesses" appropriate format strings and data for it. The *scanf family can't do that on its own, it can only be of help once the patterns have been recognised in the first place.
#Derek Park: Well, even a strong AI couldn't be sure it had the right answer.
Perhaps some compression-like mechanism could be used:
Find large, frequent substrings
Find large, frequent substring patterns. (i.e. [pattern:1] [junk] [pattern:2])
Another item to consider might be to group lines by edit-distance. Grouping similar lines should split the problem into one-pattern-per-group chunks.
Actually, if you manage to write this, let the whole world know, I think a lot of us would like this tool!
#Anders
Well, even a strong AI couldn't be sure it had the right answer.
I was thinking that sufficiently strong AI could usually figure out the right answer from the context. e.g. Strong AI could recognize that "35F" in this context is a temperature and not a hex number. There are definitely cases where even strong AI would be unable to answer. Those are the same cases where a human would be unable to answer, though (assuming very strong AI).
Of course, it doesn't really matter, since we don't have strong AI. :)
http://www.logparser.com forwards to an IIS forum which seems fairly active. This is the official site for Gabriele Giuseppini's "Log Parser Toolkit". While I have never actually used this tool, I did pick up a cheap copy of the book from Amazon Marketplace - today a copy is as low as $16. Nothing beats a dead-tree-interface for just flipping through pages.
Glancing at this forum, I had not previously heard about the "New GUI tool for MS Log Parser, Log Parser Lizard" at http://www.lizardl.com/.
The key issue of course is the complexity of your GRAMMAR. To use any kind of log-parser as the term is commonly used, you need to know exactly what you're scanning for, you can write a BNF for it. Many years ago I took a course based on Aho-and-Ullman's "Dragon Book", and the thoroughly understood LALR technology can give you optimal speed, provided of course that you have that CFG.
On the other hand it does seem you're possibly reaching for something AI-like, which is a different order of complexity entirely.