Colossal memory usage/stack problems with ANTLR lexer/parser

I'm porting a grammar over from flex/bison, and mostly seem to have everything up and running (in particular, my token stream seems fine, and my parser grammar compiles and runs), but I seem to be running into runaway stack/memory usage even with very small to moderately sized inputs to my grammar. What is the preferred construct for chaining together an unbounded sequence of the same nonterminal? In my Bison grammar I had production rules of the form:
statements: statement | statement statements
words: | word words
In ANTLR, if I maintain the same rule setup, this seems to perform admirably on small inputs (on the order of 4 kB), but leads to a stack overflow on larger inputs (on the order of 100 kB). In both cases the automatically produced parse tree is also rather ungainly.
I experimented with changing these production rules to an explicitly additive (rather than recursive) form:
statements: statement+
words: word*
However, this seems to have led to an absolutely horrific blowup in memory usage (upwards of 1 GB) on even very small inputs, and the parser had still not managed to return a parse tree after 20 minutes of running.
Any pointers would be appreciated.

Your rewritten rules are the optimal ANTLR 4 form of the two rules you described (best performance and minimum memory usage). Here is some general feedback regarding the issues you describe.
I developed some very advanced diagnostic code for numerous potential performance problems. Much of this code is included in TestPerformance, but it is geared towards expert users and requires a rather deep understanding of ANTLR 4's new ALL(*) algorithm to interpret the results.
Terence and I are interested in turning the above into a tool that users can make use of. I may be able to help (run and interpret the test) if you provide a complete grammar and example inputs, so that I can use that grammar and input pair to evaluate the usability of a future tool that automates the analysis.
Make sure you are using the two-stage parsing strategy from the book. In many cases, this will vastly improve parsing performance for correct inputs (incorrect inputs would not be any faster).
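In case it's useful, here is a minimal sketch of that two-stage strategy as it looks with the ANTLR 4 Python runtime; MyLexer, MyParser, and startRule are placeholders for your generated classes and entry rule:

from antlr4 import CommonTokenStream, InputStream
from antlr4.atn.PredictionMode import PredictionMode
from antlr4.error.ErrorStrategy import BailErrorStrategy, DefaultErrorStrategy
from antlr4.error.Errors import ParseCancellationException

def two_stage_parse(text):
    # MyLexer / MyParser / startRule: your generated classes and entry rule.
    tokens = CommonTokenStream(MyLexer(InputStream(text)))
    parser = MyParser(tokens)
    # Stage 1: fast SLL prediction, bailing out at the first error.
    parser._interp.predictionMode = PredictionMode.SLL
    parser._errHandler = BailErrorStrategy()
    try:
        return parser.startRule()
    except ParseCancellationException:
        # Stage 2: full LL prediction, only for inputs that failed stage 1.
        tokens.seek(0)
        parser.reset()
        parser._interp.predictionMode = PredictionMode.LL
        parser._errHandler = DefaultErrorStrategy()
        return parser.startRule()

The first pass is cheap and succeeds for almost all correct input; only inputs that fail it pay for the full LL pass.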
We don't like to use more memory than necessary, but you should be aware that we are working under a very different definition of "excessive" - e.g. we run our testing applications with -Xmx4g to -Xmx12g, depending on the test.

Okay, so I've gotten it working, in the following manner. My YACC grammar had the following constructions:
lines: line | line lines;
words: | word words;
However, this did not make the recursive parsing happy, so I rewrote it as:
lines: line+;
words: word*;
Which is in line with @280Z28's feedback (and my original guess). This hung the parser, which is why I posted the question in the first place, but the debugging procedure outlined in my comments on @280Z28's answer showed that in fact it was only the lines parsing causing the problem (words was fine). On a whim, I tried the following rewrite:
lines : stmt (EOL stmt)+ EOL*;
(where line had originally been defined as:
line : stmt (EOL | EOF);
)
This seems to be working quite well, even for large inputs. However it is entirely unclear to me WHY this is the Right Thing To Do(tm), or why it makes a difference compared to the revision which prompted this question. Any feedback on this matter would still be appreciated.

Related

What makes a DCG predicate expensive?

I'm building a Definite Clause Grammar to parse 20,000 pieces of semi-natural text. As the size of my database of predicates grows (now up to 1,200 rules), parsing a string can take quite a long time -- particularly for strings that are not currently interpretable by the DCG due to syntax I haven't yet encoded. The current worst case is 3 minutes for a string containing 30 words. I'm trying to figure out how I can optimize this, or whether I should just start researching cloud computing.
I'm using SWI-Prolog, which provides a "profile" goal that reports some statistics. I was surprised to find that the simplest rules in my database are taking up the majority of the execution time. My corpus contains strings that represent numbers, and I want to capture these in a scalar/3 predicate. These rules are hogging ~50-60% of total execution time.
At the outset, I had 70 lines in my scalars.pl, representing the numeric and natural language representations of the numbers in my corpus. Like so:
scalar(scalar(3)) --> ["three"].
scalar(scalar(3)) --> ["3"].
scalar(scalar(4)) --> ["four"].
scalar(scalar(4)) --> ["4"].
...and so on.
Thinking that the length of the file was the problem, I put in a new rule that would automatically parse any numeric representation:
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.
Thanks to that, I've gone from 70 rules to 31, which helped a bit -- but it wasn't a huge savings. Is there anything more that can be done? My feeling is maybe not, because what could be simpler than a single atom in a list?
These scalars are called in a lot of places throughout the grammar, and I assume that's the root of the issue. Though they're simple rules, they're everywhere, and unavoidably so. A highly general grammar just won't work for my application, and I wouldn't be surprised if I end up with 3,000 rules or more.
I've never built a DCG this large, so I'm not sure how much I can expect in terms of performance. Happy to take any kind of advice on this one: is there some other way of encoding these rules? Should I accept that some parses will take a long time, and figure out how to run parses in parallel?
Thank you in advance!
EDIT: I was asked to provide a reproducible example, but to do that I'd have to link SO to the entire project, since this is an issue of scale. Here's a toy version of what I'm doing for the sake of completeness. Just imagine there were large files describing hundreds of nouns, hundreds of verbs, and hundreds of syntactic structures.
sent(sent(VP, NP)) --> vp(VP), np(NP).
vp(vp(V)) --> v(V).
np(np(Qty, Noun)) --> qty(Qty), n(Noun).
scalar(scalar(3)) --> ["three"].
scalar(scalar(X)) --> [Y], { atom_number(Y, X) }.
qty(qty(Scalar)) --> scalar(Scalar).
v(v(eat)) --> ["eat"].
n(n(pie)) --> ["pie"].
One aspect of your program that you might investigate is to make sure individual predicates succeed quickly and fail quickly. This is particularly useful to check for predicates that have many clauses.
For instance, when scalar(X) is evaluated on a token that is not a scalar, the program will have to try all 31 clauses (by your last count) before it can determine that scalar//1 fails. If the structure of your program is such that scalar(X) is checked against every token, this could be very expensive.
Further, if scalar(X) does find a matching token but a subsequent goal fails, your program will retry scalar(X) until all of the scalar//1 clauses have been attempted.
The judicious use of cut (!) or if-then-else (C1 -> G1 ; C2 -> G2 ; G3) can provide a tremendous performance improvement.
Or you can structure your predicates so that they rely on indexing to select the appropriate clause. E.g.:
scalar(scalar(N)) --> [Token], {scalar1(Token, scalar(N))}.
scalar1("3", scalar(3)) :- !.
scalar1(Y, scalar(X)) :- atom_number(Y, X).
This uses both cut and clause indexing (if the compiler provides it) via the scalar1/2 helper predicate.
EDIT: You should read R. A. O'Keefe's The Craft of Prolog. It is an excellent guide to the practical aspects of Prolog.
Here's how I've tackled performance and optimization problems as a novice Prologer.
1.) Introduce timeouts in your application. I'm calling Prolog via the subprocess module in Python 3.6, which allows you to set a timeout. As I've worked with my code base more, I've developed a pretty good sense of how long a successful parse might take, and can assume anything taking longer is not going to work.
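In rough sketch form (solve.pl, the main goal, and the 30-second limit are placeholders for your own setup):

import subprocess

sentence = "three pies"  # whatever input your grammar consumes

try:
    result = subprocess.run(
        ["swipl", "-g", "main", "-t", "halt", "solve.pl"],
        input=sentence, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=30)  # give up on parses that run too long
    print(result.stdout)
except subprocess.TimeoutExpired:
    print("parse timed out; treating the string as uninterpretable")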
2.) Make use of the graphical profiler that's packaged with the SWI-Prolog IDE. This gives a lot more insight, since you can bounce around the call tree. I found it particularly helpful to sort predicates by the execution time of their children. Before, I was thinking about it like pollution in a river: "Man, there's a lot of junk floating in here," I thought, not considering that upstream some factories were contributing a lot of that junk.
As for how to optimize a DCG without hurting the semantics & expressivity of one's grammar, I think that will have to be a question for another Stack Overflow. And as for my initial question, that's still an open one -- predicates that seem simple (to me) take quite a while.

Competitive Programming : Generating test cases and validating the program correctness

I have been doing sport programming for a while and am still improving day by day. But one thing I have always wondered is that it would be really nice if I could automate the test case generation process and the cross-validation of my program. It would definitely be a brute-force approach, as some test cases would be algorithm-specific.
A Google search gives me a nice link on Quora: How do programming contest problem setters make test cases? It also turns up the popular testlib used by problem setters.
But isn't this a chicken-and-egg problem?
Assume I generated 1 million input test cases -- but what would I check them against? How will I generate the outputs? I am, after all, still in the process of validating the program... If my script could generate the correct outputs as well, then what would be the point of writing the program in the first place? I could submit the script itself. And it's not possible to write 1 million outputs for generated test cases manually. Can anyone please clear up this confusion?
I hope I have stated the problem clearly.
It's common to generate the answer with a slow but obviously correct solution (like an exhaustive search). It can't be used as your main solution since it's too slow for large test cases, but you can use it to check the output of your fast (but possibly incorrect) program.
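A typical setup is a random stress test, sketched here in Python; ./fast and ./slow stand in for your two compiled solutions, and gen() is problem-specific:

import random
import subprocess

def gen():
    # Keep cases small so the exhaustive solution finishes quickly.
    n = random.randint(1, 8)
    xs = [random.randint(-10, 10) for _ in range(n)]
    return "%d\n%s\n" % (n, " ".join(map(str, xs)))

def run(exe, data):
    return subprocess.run([exe], input=data, stdout=subprocess.PIPE,
                          universal_newlines=True).stdout

for _ in range(1000):
    data = gen()
    if run("./fast", data) != run("./slow", data):
        print("mismatch on input:")
        print(data)
        break
else:
    print("all tests passed")

Because the cases are tiny, the slow solution stays fast enough, and any mismatch hands you a small failing input to debug by hand.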
Well, the thing is that it is not as broad as you think. Test generation in competitive programming is guided by the algorithm of the problem and its correctness proof.
So when you think there are millions of test cases, if you analyze the different situations the program can be in, you are likely to cover all of them. Maybe in a certain algorithm you sometimes process the even-index elements of an array and sometimes the odd-index elements. What do you do then? Divide it into two cases, even and odd, consider the smallest case for each, and you are basically visiting all the control-flow paths of the program.
In competitive programming, since we first determine the algorithm, then decide on proper input sizes, and only then on all the test cases and validation, it is often easy to think of the corner cases: a test case with 1,000,000 elements, or an input of 0 or 1, and so on.
Another point is that most of the time we write a brute-force solution much slower than the original one. Then we just generate random medium-sized test cases, run them against the slow program, and check the results with our checker solution, etc.
Correctness is also guided by mathematical proof (heuristics, induction, the pigeonhole principle, number theory, etc.). That way we are sure about the correctness of the solution.
I faced the same issue earlier this year and saw some of my colleagues also figuring out a way to deal with it. There were times when I just couldn't think of any more test cases, and that's when I decided to make a test case generator tool of my own. It's free and open source, so anyone can use it.
You can easily generate a lot of test cases using this tool and validate the results against the output of the correct but slower approach (in terms of time and space complexity). You can either run the two programs in parallel and compare the outputs, or write a simple script to compare the output of both programs (the slower but correct one and the better but unverified one).
I believe good coders won't need it anyway, but for mid-level (Div. 2, Div. 3) coders and newbies it can prove a lot helpful.
You can access it on GitHub: Test Case generator.
Both the Python source code and the .exe file are present there, with instructions.
If you want to make some changes of your own, you can work with the Python file.
If you just want it to generate some test cases directly, use the .exe file (inside the zip).
It'll help a lot if you're beginning your competitive programming journey.
Also, any suggestions or improvements are always welcome. You can also contribute to this project by adding features you think it's lacking, or by adding new test case formats yourself or requesting them.

What exactly is the token count in functions/methods used for?

I've been using some tools to measure code quality and CCN (Cyclomatic Complexity Number), and some of those tools provide a count of tokens in functions. What does that count say about my function or method? What is it used for?
Cyclomatic Complexity Number is a metric that indicates the complexity of a function, procedure, or program. The best (thorough yet intuitive) explanation I have found is provided here.
I think the token count refers to the conditional-statement tokens that are taken into account when computing the cyclomatic complexity.
[later edit]
A high CCN means complex code that:
is (much) harder to read and understand
is hard to maintain
makes unit tests harder to maintain, since decent code coverage is reached with more difficulty
might lead to more bugs
CCN can be reduced using various techniques. Some examples can be seen here or here.
In the context of CCN tools, a token is any distinct operator or operand.
How this is implemented depends on the tool. Since the page on Lizard doesn't go into details, you will have to examine the source code (it's not many lines):
https://github.com/terryyin/lizard/tree/master/lizard_languages
If you search the source for 'token', you will see how the tool parses the code. In most cases it is looking for code blocks, expressions, annotations, and access to methods/objects.
For example, according to java.py, Java is only parsed for '{', '#', and '.'
I'm not sure why it isn't looking for expressions...?
The OP hasn't said which tool they're using, but for Lizard this has been asked as an issue, so this might help someone.
A token is a word, an operator, etc.
For example, if (abc % 3 != 0) has 8 tokens: ['if', '(', 'abc', '%', '3', '!=', '0', ')'].
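If you want to see such a count in practice, Python's standard tokenize module can reproduce it for Python code (which token types to skip is my choice here; other tools may count differently):

import io
import tokenize

def count_tokens(source):
    # Skip structural tokens that most counters ignore.
    ignored = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
               tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type not in ignored)

# Counts if, abc, %, 3, !=, 0, :, pass -> 8
print(count_tokens("if abc % 3 != 0:\n    pass\n"))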
Another source gives a similar description:
One program can have a maximum of 8192 tokens. Each token is a word (e.g. variable name) or operator. Pairs of brackets and strings count as 1 token. Commas, periods, LOCALs, semi-colons, ENDs, and comments are not counted.
Now the next question is: does the number of tokens matter the way CCN does? With the disclaimer that I am not an expert in code quality: it depends on the language. For example, in compiled languages you might want to break a complex line into multiple lines, which increases the number of tokens but significantly enhances the readability of the code. You should go for it; modern compilers are smart enough to optimize it.
However, this might not hold in interpreted languages. Again, you should look into the specific language you are using to check whether there is any optimization behind the scenes. That said, some languages such as Python provide syntax specifically to reduce the number of tokens, which is great as long as it was designed into the language.
TL;DR: This factor doesn't matter as much as code readability. Double-check your code if it is high, but don't mess up the code just to lower it.

Code generation by genetic algorithms

Evolutionary programming seems to be a great way to solve many optimization problems. The idea is very simple and the implementation poses no problems.
I was wondering whether there is any way to evolutionarily create a program in Ruby/Python (or any other language)?
The idea is simple:
Create a population of programs
Perform genetic operations (roulette-wheel selection or any other selection), create new programs by inheritance from the best programs, etc.
Repeat step 2 until a program that satisfies our condition is found
But there are still a few problems:
How will chromosomes be represented? For example, should one cell of a chromosome be one line of code?
How will chromosomes be generated? If they are lines of code, how do we generate them to ensure they are syntactically correct, etc.?
Example of a program that could be generated:
Create a script that takes N numbers as input and returns their mean as output.
If there have been any attempts to create such algorithms, I'd be glad to see any links/sources.
If you are sure you want to do this, you want genetic programming, rather than a genetic algorithm. GP allows you to evolve tree-structured programs. What you would do is give it a bunch of primitive operations (while($register), read($register), increment($register), decrement($register), divide($result $numerator $denominator), print, progn2 (GP-speak for "execute two commands sequentially")).
You could produce something like this:
progn2(
    progn2(
        read($1)
        while($1
            progn2(
                while($1
                    progn2( #add the input to the total
                        increment($2)
                        decrement($1)
                    )
                )
                progn2( #increment number of values entered, read again
                    increment($3)
                    read($1)
                )
            )
        )
    )
    progn2( #calculate result
        divide($1 $2 $3)
        print($1)
    )
)
You would use, as your fitness function, how close the program is to the real solution. And therein lies the catch: you have to calculate that traditionally anyway*. You then need something that translates the evolved tree into code in (your language of choice). Note that, as you've got a potential infinite loop in there, you'll have to cut off execution after a while (there's no way around the halting problem), and it probably won't work. Shucks. Note also that my provided code will attempt to divide by zero.
*There are ways around this, but generally not terribly far around it.
It can be done, but works very badly for most kinds of applications.
Genetic algorithms only work when the fitness function is continuous, i.e. you can determine which candidates in your current population are closer to the solution than others, because only then will you get improvements from one generation to the next. I learned this the hard way when I had a genetic algorithm with one strongly weighted, non-continuous component in my fitness function. It dominated all the others, and because it was non-continuous there was no gradual advancement towards greater fitness: candidates that were almost correct in that aspect were not considered fitter than ones that were completely incorrect.
Unfortunately, program correctness is utterly non-continuous. Is a program that stops with error X on line A better than one that stops with error Y on line B? Your program could be one character away from being correct, and still abort with an error, while one that returns a constant hardcoded result can at least pass one test.
And that's not even touching on the matter of the code itself being non-continuous under modifications...
Well, this is very possible, and @Jivlain correctly points out in his (nice) answer that genetic programming is what you are looking for (and not simple genetic algorithms).
Genetic programming is a field that has not reached a broad audience yet, partially because of some of the complications @MichaelBorgwardt indicates in his answer. But those are mere complications; it is far from true that this is impossible to do. Research on the topic has been going on for more than 20 years.
John Koza is one of the leading researchers on this (have a look at his 1992 work), and he demonstrated as early as 1996 how genetic programming can in some cases outperform naive GAs on some classic computational problems (such as evolving programs for Cellular Automata synchronization).
Here's a good Genetic Programming tutorial from Koza and Poli dated 2003.
For a recent reference you might wanna have a look at A field guide to genetic programming (2008).
Since this question was asked, the field of genetic programming has advanced a bit, and there have been some additional attempts to evolve code in configurations other than the tree structures of traditional genetic programming. Here are just a few of them:
PushGP - designed with the goal of evolving modular functions like human coders use, programs in this system store all variables and code on different stacks (one for each variable type). Programs are written by pushing and popping commands and data off of the stacks.
FINCH - a system that evolves Java byte-code. This has been used to great effect to evolve game-playing agents.
Various algorithms have started evolving C++ code, often with a step in which compiler errors are corrected. This has had mixed, but not altogether unpromising results. Here's an example.
Avida - a system in which agents evolve programs (mostly boolean logic tasks) using a very simple assembly code. Based off of the older (and less versatile) Tierra.
The language isn't an issue. Regardless of the language, you have to define some higher-level of mutation, otherwise it will take forever to learn.
For example, since any Ruby program can be defined in terms of a text string, you could just randomly generate text strings and optimize that. It would be better to generate only legal Ruby programs. However, that would also take forever.
If you were trying to build a sorting program and you had high level operations like "swap", "move", etc. then you would have a much higher chance of success.
In theory, a bunch of monkeys banging on a typewriter for an infinite amount of time will output all the works of Shakespeare. In practice, it isn't a practical way to write literature. Just because genetic algorithms can solve optimization problems doesn't mean that it's easy or even necessarily a good way to do it.
The biggest selling point of genetic algorithms, as you say, is that they are dirt simple. They don't have the best performance or mathematical grounding, but even if you have no idea how to solve your problem, as long as you can define it as an optimization problem, you will be able to turn it into a GA.
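To make that concrete, here is a bare-bones GA on the classic toy problem of evolving a string toward a known target; every parameter here is an arbitrary choice, not a recommendation:

import random
import string

TARGET = "genetic"
CHARS = string.ascii_lowercase

def fitness(candidate):
    # Count positions that already match the target.
    return sum(a == b for a, b in zip(candidate, TARGET))

def mutate(candidate, rate=0.1):
    return "".join(random.choice(CHARS) if random.random() < rate else c
                   for c in candidate)

population = ["".join(random.choice(CHARS) for _ in TARGET)
              for _ in range(100)]
for generation in range(10000):
    population.sort(key=fitness, reverse=True)
    if population[0] == TARGET:
        break
    # Keep the 20 fittest and refill the population with their mutants.
    parents = population[:20]
    population = parents + [mutate(random.choice(parents))
                            for _ in range(80)]
print(generation, population[0])

Note that this only converges because the fitness is gradual (per-character matches); a plain right/wrong fitness would give the GA nothing to climb.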
Programs aren't really suited to GAs, precisely because code isn't good chromosome material. I have seen someone do something similar with (simpler) machine code instead of Python (although it was more of an ecosystem simulation than a GA per se), and you might have better luck if you encode your programs using automata / Lisp or something like that.
On the other hand, given how alluring GAs are and how basically everyone who looks at them asks this same question, I'm pretty sure there are already people who have tried this somewhere - I just have no idea whether any of them succeeded.
Good luck with that.
Sure, you could write a "mutation" program that reads a program and randomly adds, deletes, or changes some number of characters. Then you could compile the result and see if the output is better than the original program. (However we define and measure "better".) Of course 99.9% of the time the result would be compile errors: syntax errors, undefined variables, etc. And surely most of the rest would be wildly incorrect.
Try some very simple problem. Say, start with a program that reads in two numbers, adds them together, and outputs the sum. Let's say that the goal is a program that reads in three numbers and calculates the sum. Just how long and complex such a program would be of course depends on the language. Let's say we have some very high level language that lets us read or write a number with just one line of code. Then the starting program is just 4 lines:
read x
read y
total=x+y
write total
The simplest program to meet the desired goal would be something like
read x
read y
read z
total=x+y+z
write total
So through a random mutation, we have to add "read z" and "+z", a total of 9 characters including the space and the new-line. Let's make it easy on our mutation program and say it always inserts exactly 9 random characters, that they're guaranteed to be in the right places, and that it chooses from a character set of just 26 letters plus 10 digits plus 14 special characters = 50 characters. What are the odds that it will pick the correct 9 characters? 1 in 50^9 = 1 in 2.0e15. (Okay, the program would work if instead of "read z" and "+z" it inserted "read w" and "+w", but then I'm making it easy by assuming it magically inserts exactly the right number of characters and always inserts them in the right places. So I think this estimate is still generous.)
1 in 2.0e15 is a pretty small probability. Even if the program runs a thousand times a second, and you can test the output that quickly, the chance is still just 1 in 2.0e12 per second, or 1 in 5.4e8 per hour, 1 in 2.3e7 per day. Keep it running for a year and the chance of success is still only 1 in 62,000.
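For the record, the arithmetic behind those figures:

odds = 50 ** 9                  # 1.95e15, "1 in 2.0e15"
per_second = odds / 1000        # at a thousand tries per second
per_hour = per_second / 3600    # ~5.4e8
per_day = per_hour / 24         # ~2.3e7
per_year = per_day / 365        # ~62,000
print("%.1e %.1e %.1e %.0f" % (odds, per_hour, per_day, per_year))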
Even a moderately competent programmer should be able to make such a change in, what, ten minutes?
Note that changes must come in at least "packets" that are correct. That is, if a mutation generates "reax z", that's only one character away from "read z", but it would still produce compile errors, and so would fail.
Likewise adding "read z" but changing the calculation to "total=x+y+w" is not going to work. Depending on the language, you'll either get errors for the undefined variable or at best it will have some default value, like zero, and give incorrect results.
You could, I suppose, theorize incremental solutions. Maybe one mutation adds the new read statement, then a future mutation updates the calculation. But without the calculation, the additional read is worthless. How will the program be evaluated to determine that the additional read is "a step in the right direction"? The only way I see to do that is to have an intelligent being read the code after each mutation and see if the change is making progress toward the desired goal. And if you have an intelligent designer who can do that, that must mean that he knows what the desired goal is and how to achieve it. At which point, it would be far more efficient to just make the desired change rather than waiting for it to happen randomly.
And this is an exceedingly trivial program in a very easy language. Most programs are, what, hundreds or thousands of lines, all of which must work together. The odds against any random process writing a working program are astronomical.
There might be ways to do something that resembles this in some very specialized application, where you are not really making random mutations, but rather making incremental modifications to the parameters of a solution. Like, we have a formula with some constants whose values we don't know. We know what the correct results are for some small set of inputs. So we make random changes to the constants, and if the result is closer to the right answer, change from there, if not, go back to the previous value. But even at that, I think it would rarely be productive to make random changes. It would likely be more helpful to try changing the constants according to a strict formula, like start with changing by 1000's, then 100's then 10's, etc.
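That last idea is easy to sketch as coordinate descent with a shrinking step size; the formula y = a*x + b and the sample points below are stand-ins for whatever formula and known input/output pairs you actually have:

def error(consts, samples):
    # Sum of squared errors of the candidate constants over known points.
    a, b = consts
    return sum((a * x + b - y) ** 2 for x, y in samples)

def tune(samples, steps=(1000, 100, 10, 1, 0.1)):
    consts = [0.0, 0.0]
    for step in steps:  # change by 1000's, then 100's, then 10's, ...
        improved = True
        while improved:
            improved = False
            for i in range(len(consts)):
                for delta in (step, -step):
                    trial = list(consts)
                    trial[i] += delta
                    if error(trial, samples) < error(consts, samples):
                        consts = trial
                        improved = True
    return consts

# Recover y = 3*x + 40 from three known input/output pairs.
print(tune([(0, 40), (1, 43), (2, 46)]))  # -> [3.0, 40.0]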
I just want to give you a suggestion. I don't know how successful you'd be, but perhaps you could try to evolve a Core War bot with genetic programming. Your fitness function is easy: just let the bots compete in a game. You could start with well-known bots and perhaps a few random ones, then wait and see what happens.

Pseudocode interpreter?

Like lots of you guys on SO, I often write in several languages. And when it comes to planning things out (or even answering some SO questions), I actually think and write in some unspecified hybrid language. Although I was taught to do this using flow diagrams or UML-like diagrams, in retrospect, I find "my" pseudocode language has components of C, Python, Java, bash, Matlab, Perl, and Basic. I seem to unconsciously select the idiom best suited to expressing the concept/algorithm.
Common idioms might include Java-like braces for scope, Pythonic list comprehensions or indentation, C++-like inheritance, C#-style lambdas, and Matlab-like slices and matrix operations.
I noticed that it's actually quite easy for people to recognise exactly what I'm trying to do, and quite easy for them to intelligently translate it into other languages. Of course, that step involves considering the corner cases and the moments where each language behaves idiosyncratically.
But in reality, most of these languages share a subset of keywords and library functions which generally behave identically - maths functions, type names, while/for/if, etc. Clearly I'd have to exclude many 'odd' languages like Lisp and APL derivatives, but...
So my questions are,
Does code already exist that recognises the programming language of a text file? (Surely this must be a less complicated task than Eclipse's syntax trees or Google Translate's language-guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?
Is it theoretically possible to create a single interpreter or compiler that recognises what language idiom you're using at any moment and (maybe "intelligently") executes or translates it to a runnable form, and flags the corner cases where my syntax is ambiguous with regard to behaviour? Immediate difficulties I see include: knowing when to switch between indentation-dependent and brace-dependent modes, recognising funny operators (like *pointer vs *kwargs), and knowing when to use list- vs array-like representations.
Is there any language or interpreter in existence that can manage this kind of flexible interpreting?
Have I missed an obvious obstacle to this being possible?
edit
Thanks all for your answers and ideas. I am planning to write a constraint-based heuristic translator that could, potentially, "solve" code for the intended meaning and translate it into real Python code. It will notice keywords from many common languages, and will use syntactic clues to disambiguate the human's intentions - like spacing, brackets, optional helper words like let or then, the context of how variables were previously used, etc., plus knowledge of common conventions (like capitalised names, i for iteration, and some simplistic, limited understanding of the naming of variables/methods, e.g. those containing the words get, asynchronous, count, last, previous, my, etc.). In real pseudocode, variable naming is as informative as the operations themselves!
Using these clues, it will make assumptions about the implementation of each operation (such as 0/1-based indexing, when exceptions should be caught or ignored, which variables ought to be const/global/local, where to start and end execution, which bits should be in separate threads, and noticing when numerical units match or need converting). Each assumption will have a given certainty - and the program will list its assumptions on each statement as it coaxes what you write into something executable!
For each assumption, you can 'clarify' your code if you don't like the initial interpretation. The libraries issue is very interesting. My translator, like some IDEs, will read all the definitions available from all modules, use some statistics about which classes/methods are used most frequently and in what contexts, and just guess! (adding a note to the program to say why it guessed as such...). I guess it should attempt to execute everything, and warn you about what it doesn't like. It should allow anything, but let you know what the several alternative interpretations are if you're being ambiguous.
It will certainly be some time before it can manage unusual examples like @Albin Sunnanbo's ImportantCustomer example. But I'll let you know how I get on!
I think that is quite useless for everything but toy examples and strict mathematical algorithms. For everything else, the language is not just the language. There are lots of standard libraries and whole environments around the languages. I think I write almost as many lines of library calls as I write "actual code".
In C# you have the .NET Framework, in C++ you have the STL, in Java you have the standard Java libraries, etc.
The differences between those libraries are too big to be just syntactic nuances.
<subjective>
There have been attempts at unifying the language constructs of different languages into a "unified syntax". That was called 4GL and never really took off.
</subjective>
As a side note, I have seen a code example about a page long that was valid as C#, Java, and JavaScript code. That can serve as an example of where it is impossible to determine the actual language used.
Edit:
Besides, the whole purpose of pseudocode is that it does not need to compile in any way. The reason you write pseudocode is to create a "sketch", however sloppy you like.
foreach c in ImportantCustomers{== OrderValue >=$1M}
SendMailInviteToSpecialEvent(c)
Now tell me what language it is and write an interpreter for that.
To detect what programming language is used: Detecting programming language from a snippet
I think it should be possible. The approach in (1) could be leveraged to do this, I think. I would try to do it iteratively: detect the syntax used in the first line/clause of code, "compile" it to an intermediate form based on that detection (along with any important syntax, e.g. begin/end wrappers), then the next line/clause, etc. Basically, write a parser that attempts to recognize each "chunk". Ambiguity could be flagged by the same algorithm.
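As a toy illustration of the detection step (the signature sets here are tiny and purely illustrative):

SIGNATURES = {
    "python": {"def", "elif", "None", "import", "lambda", "self"},
    "java":   {"public", "static", "void", "extends", "final", "new"},
    "c":      {"printf", "malloc", "struct", "sizeof", "#include"},
}

def guess_language(chunk):
    # Score each language by how many of its signature tokens appear;
    # report a tie (or no hits at all) as ambiguous.
    words = set(chunk.replace("(", " ").replace(")", " ").split())
    scores = {lang: len(words & sig) for lang, sig in SIGNATURES.items()}
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else "ambiguous"

print(guess_language("def median(seq): return sorted(seq)[1]"))  # python
print(guess_language("x = y + z"))                               # ambiguous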
I doubt that this has been done... it seems like the cognitive load of learning to write, e.g., Python-compatible pseudocode would be much lighter than trying to debug the cases where your interpreter fails.
a. I think the biggest problem is that most pseudocode is invalid in any language. For example, I might completely skip object initialization in a block of pseudocode because for a human reader it is almost always straightforward to infer. But for your case it might be completely invalid in the language syntax of choice, and it might be impossible to automatically determine e.g. the class of the object (it might not even exist). Etc.
b. I think the best you can hope for is an interpreter that "works" (subject to 4a) for your pseudocode only, no-one else's.
Note that I don't think that 4a,4b are necessarily obstacles to it being possible. I just think it won't be useful for any practical purpose.
Recognizing what language a program is in is really not that big a deal. Recognizing the language of a snippet is more difficult, and recognizing snippets that aren't clearly delimited (what do you do if four lines are Python and the next one is C or Java?) is going to be really difficult.
Assuming you got the lines assigned to the right language, doing any sort of compilation would require specialized compilers for all languages that would cooperate. This is a tremendous job in itself.
Moreover, when you write pseudo-code you aren't worrying about the syntax. (If you are, you're doing it wrong.) You'll wind up with code that simply can't be compiled because it's incomplete or even contradictory.
And, assuming you overcame all these obstacles, how certain would you be that the pseudo-code was being interpreted the way you were thinking?
What you would have would be a new computer language, that you would have to write correct programs in. It would be a sprawling and ambiguous language, very difficult to work with properly. It would require great care in its use. It would be almost exactly what you don't want in pseudo-code. The value of pseudo-code is that you can quickly sketch out your algorithms, without worrying about the details. That would be completely lost.
If you want an easy-to-write language, learn one. Python is a good choice. Use pseudo-code for sketching out how processing is supposed to occur, not as a compilable language.
An interesting approach would be a "type-as-you-go" pseudocode interpreter. That is, you would set the language to be used up front, and then it would attempt to convert the pseudo code to real code, in real time, as you typed. An interactive facility could be used to clarify ambiguous stuff and allow corrections. Part of the mechanism could be a library of code which the converter tried to match. Over time, it could learn and adapt its translation based on the habits of a particular user.
People who program all the time will probably prefer to just use the language in most cases. However, I could see the above being a great boon to learners, "non-programmer programmers" such as scientists, and for use in brainstorming sessions with programmers of various languages and skill levels.
-Neil
Programs interpreting human input need to be given the option of saying "I don't know." The language PL/I is a famous example of a system designed to find a reasonable interpretation of anything resembling a computer program that could cause havoc when it guessed wrong: see http://horningtales.blogspot.com/2006/10/my-first-pli-program.html
Note that in the later language C++, when it resolves possible ambiguities it limits the scope of the type coercions it tries, and that it will flag an error if there is not a unique best interpretation.
I have a feeling that the answer to 2. is NO. All I need to prove it false is a code snippet that can be interpreted in more than one way by a competent programmer.
Does code already exist that recognises the programming language of a text file?
Yes, the Unix file command.
(Surely this must be a less complicated task than Eclipse's syntax trees or Google Translate's language-guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?
As far as I can tell, SO has a one-size-fits-all syntax highlighter that tries to combine the keywords and comment syntax of every major language. Sometimes it gets it wrong:
def median(seq):
    """Returns the median of a list."""
    seq_sorted = sorted(seq)
    if len(seq) & 1:
        # For an odd-length list, return the middle item
        return seq_sorted[len(seq) // 2]
    else:
        # For an even-length list, return the mean of the 2 middle items
        return (seq_sorted[len(seq) // 2 - 1] + seq_sorted[len(seq) // 2]) / 2
Note that SO's highlighter assumes that // starts a C++-style comment, but in Python it's the integer division operator.
This is going to be a major problem if you try to combine multiple languages into one. What do you do if the same token has different meanings in different languages? Similar situations are:
Is ^ exponentiation like in BASIC, or bitwise XOR like in C?
Is || logical OR like in C, or string concatenation like in SQL?
What is 1 + "2"? Is the number converted to a string (giving "12"), or is the string converted to a number (giving 3)?
Is there any language or interpreter in existence that can manage this kind of flexible interpreting?
On another forum, I heard a story of a compiler (IIRC, for FORTRAN) that would compile any program regardless of syntax errors. If you had the line
= Y + Z
The compiler would recognize that a variable was missing and automatically convert the statement to X = Y + Z, regardless of whether you had an X in your program or not.
This programmer had a convention of starting comment blocks with a line of hyphens, like this:
C ----------------------------------------
But one day, they forgot the leading C, and the compiler choked trying to add dozens of variables between what it thought were subtraction operators.
"Flexible parsing" is not always a good thing.
To create a "pseudocode interpreter," it might be necessary to design a programming language that allows user-defined extensions to its syntax. There already are several programming languages with this feature, such as Coq, Seed7, Agda, and Lever. A particularly interesting example is the Inform programming language, since its syntax is essentially "structured English."
The Coq programming language allows "syntax extensions", so the language can be extended to parse new operators:
Notation "A /\ B" := (and A B).
Similarly, the Seed7 programming language can be extended to parse "pseudocode" using "structured syntax definitions." The while loop in Seed7 is defined in this way:
syntax expr: .while.().do.().end.while is -> 25;
Alternatively, it might be possible to "train" a statistical machine translation system to translate pseudocode into a real programming language, though this would require a large corpus of parallel texts.

Resources