Is there a prolog language grammar, or something close to it that is generally used as a reference? I am using SWI-prolog, so one for that flavor would be nice to have, otherwise a general prolog language grammar/specification works as well.
Since 1995, there is an ISO/IEC standard for Prolog: ISO/IEC 13211-1:1995. It also contains a grammar defining Prolog's syntax which consists of two levels:
Token level: (6.4 Tokens, 6.5 Processor character set)
These are defined by regular expressions and using the longest input match/eager consumer rule/greedy match/maximal munch like many languages of that era. In the words of the standard (6.4):
A token shall not be followed by characters such that concatenating the characters of the token with these characters forms a valid token as specified by the above syntax.
NOTES
1 This is the eager consumer rule: 123.e defines the tokens 123 . e. A layout text is sometimes necessary to separate two tokens.
This way of defining tokens is typical for programming languages originating in the 1970s.
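To make the eager consumer rule concrete, here is a minimal sketch in Python. The token classes are a drastically simplified assumption, not the grammar of 6.4, and Python's `re` alternation is ordered rather than longest-match, so the float alternative is simply listed before the integer one to approximate maximal munch.

```python
import re

# Simplified, hypothetical token classes; the real ISO grammar (6.4) has many more.
TOKEN_RE = re.compile(r"""
    (?P<float>\d+\.\d+(?:[eE][+-]?\d+)?)   # a float needs digits after the dot
  | (?P<integer>\d+)
  | (?P<name>[a-z][A-Za-z0-9_]*)
  | (?P<variable>[A-Z_][A-Za-z0-9_]*)
  | (?P<punct>[()\[\]{},|.])
  | (?P<layout>\s+)
""", re.VERBOSE)

def tokenize(text):
    """Split text into (kind, lexeme) pairs, always consuming as much as possible."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "layout":          # layout only separates tokens
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenize("123.e"))
print(tokenize("1.5e3"))
```

With these rules `123.e` comes out as the three tokens `123`, `.` and `e`, because `123.` cannot be extended to a float, while `1.5e3` is consumed as a single float token.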
The token level is of particular importance to Prolog's syntax, because term, or read term is first defined as a sequence of tokens:
term (* 6.4 *)
= { token (* 6.4 *) } ;
read term (* 6.4 *)
= term (* 6.4 *) , end (* 6.4 *) ;
Many tokens contain an optional layout text sequence in the beginning. But never at the end. Also note that to determine the end (that is the finishing period), a lookahead to the next character is required. In a tokenizer written in Prolog, this would be realized with peek_char/1.
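The one-character lookahead for the finishing period can be sketched as follows (in Python rather than Prolog, to keep the examples in one language). The function name is hypothetical, treating end of input as a terminator is an assumption, and quoted atoms, strings and comments are ignored entirely.

```python
def is_end_char(text, i):
    """True if text[i] is a '.' that ends a read-term (simplified sketch)."""
    if text[i] != ".":
        return False
    # Peek at the next character without consuming it, like peek_char/1.
    nxt = text[i + 1] if i + 1 < len(text) else ""
    # Assumption: layout, a % comment, or end of input may follow the end char.
    return nxt == "" or nxt.isspace() or nxt == "%"

print(is_end_char("foo. bar", 3))   # the period is followed by layout
print(is_end_char("1.5", 1))        # the period is followed by a digit
```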
Only after identifying a term on this level, the actual grammar comes into play. See the 8.14.1.1 Description of read_term/3. Of course, an implementation might do it differently, as long as it behaves "as if".
Syntax level: (6.2 Prolog text and data, 6.3 Terms)
These definitions rely on the full context-free grammar formalism plus a few context sensitive constraints.
Conformance
As to conformance of implementations, see this table. SWI always differed in many idiosyncratic ways: both on the token level and on the syntax level. Even operator syntax is (for certain cases) incompatible with other systems and the standard. That is, certain terms are read differently. Since SWI7, SWI now differs even for canonical syntax. Try writeq('.'(1,[])). This should yield [1], but SWI7 produces some error.
For conforming implementations see sicstus-prolog (version 4.3) and gnu-prolog.
For SWI-Prolog in particular, things are a bit "complicated". It has never strictly conformed to ISO, and the current development version (SWI-Prolog 7 and beyond) has departed even further from ISO compliance. The development version is at the moment the only "actively" maintained version, meaning that soon you might expect bugs not to be removed from SWI-Prolog 6.
As for reference, you will have to read the manual and hope to figure out what is right and what is not. The information is all there, even if it is not super neatly organized.
This is where you might start:
http://www.swi-prolog.org/pldoc/man?section=syntax
The books recommended:
http://www.swi-prolog.org/pldoc/man?section=intro
are actually something that you can't completely circumvent, unfortunately (I would be glad if someone proves me wrong). Get at least one of the three listed there. Sterling & Shapiro, 1986 is for example a good starting point. The online tutorial at http://www.learnprolognow.org/ is also quite good.
Something else: in "The Craft of Prolog" by Richard O'Keefe you can find the full implementation of a Prolog tokenizer, written in Prolog (10.7, pp 337-354). I don't know if this would serve your purpose.
And some advice: make the effort to install the current development version if you are going to use SWI-Prolog. It is fairly easy on Linux (no idea how it goes on MacOS in practice, but I doubt it is more complicated).
At least, there is the ISO standard (see its creator page).
Related
I am looking at the Java-written Prolog system, Prova.
https://prova.ws/
But its implementation is not clear to me: is it a Prolog compiler or a Prolog interpreter? I read the manual, but did not find an answer.
There are some rumors that Prova is based on Mandarax. The newest
version seems to be heading in the same direction as SWI-Prolog 7,
i.e. it supports dicts and a dot notation. See also here:
http://prova.ws/confluence/display/REWRITEDEV/Prova+maps+for+defining+slotted+terms
The original Mandarax seems to have been an interpreter, and
in the user manual of Prova we find one sentence in which it
declares itself a Prolog interpreter, but no hint of compilation.
But there seems to be a newer version of Mandarax (1.1.0) which was
some kind of compiler, but maybe Prova was already branched out
before the compiler arrived, and it's still an interpreter.
So although it declares itself a Prolog interpreter, it is most
likely not an ISO Prolog system, since for example op/3 is missing.
I guess it uses a tokenizer with some hard wired operators and a
parser with some hard wired operator expressions. (*)
It might nevertheless offer some goodies, but judging from the
documentation and binary size, they might not be many. Which
is possibly compensated by the ability to directly embed Java
calls by the dot notation:
http://prova.ws/confluence/display/REWRITEDEV/Calling+Java+from+Prova+rulebases
Bye
(*)
The Prova syntax goes so far as to require the end-user
to write fail() instead of fail. This syntax variant is also
found in the new SWI-Prolog 7, although not with the same drastic
effect on the end-user, who would no longer be allowed to use
atoms as goals.
OK, I know that this is a very general question and that some papers have been written on the subject, but I have a feeling that these publications cover very basic material, and I'm looking for something more advanced which would improve style and efficiency. This is what I have on paper:
"Research Report AI-1989-08 Efficient Prolog: A Practical Guide" by Michael A. Covington, 1989
"Efficient Prolog Programming" by Timo Knuutila, 1992
"Coding guidelines for Prolog" by Covington, Bagnara, O'Keefe, Wielemaker, Price, 2011
Sample subjects covered in these: tail recursion and difference lists, proper use of indexing, proper use of cuts, avoiding asserts and retracts, avoiding CONSing, code formatting guidelines (indentation, if-then-elses etc.), naming conventions, code documenting, arguments order, testing.
What would you add here from your own personal experience with Prolog? Are there any special style guidelines applicable only to CLP programming? Do you know of some common efficiency problems and know how to deal with them?
UPDATE:
Some interesting (but still too basic and too general for me) points are made here: Prolog programming guidelines of Lifeware Team
Just to highlight the whole problem I would like to quote "Coding guidelines for Prolog" (Covington et al.):
As far as we know, a coherent and reasonably complete set of coding guidelines for Prolog has never been published. Moreover, when we look at the corpus of published Prolog programs, we do not see a de facto standard emerging. The most important reason behind this apparent omission is that the small Prolog community, due to the lack of a comprehensive language standard, is further fragmented into sub-communities centered around individual Prolog systems, none of which has a dominant position.
For designing clean interfaces in Prolog, I recommend reading the Prolog standard, see iso-prolog.
In particular, look at the specific format in which built-in predicates are codified: it includes a particular style of documentation but also the way errors are signaled. See 8.1 The format of built-in predicate definitions of ISO/IEC 13211-1:1995. You find definitions in that style online in Cor.2 and the Prolog prologue.
A very good example of a library that follows the ISO error signaling conventions to the letter (and yet is not standardized) is the implementation of library(clpfd) in SICStus and SWI. While both implementations are fundamentally different in their approach, they use the error conventions to their best advantage.
Back to ISO. This is ISO's format for built-in predicates:
x.y.z Name/Arity
In the beginning, there may be a short optional informal remark.
x.y.z.1 Description
A declarative description is given which starts very often with the most general goal using descriptive variable names such that they can be referred to later on. Should the predicate's meaning be not declarative at all, it is either stated "is true" or some otherwise unnecessary operationalizing word like "unifies", "assembles" is used. Let me give an example:
8.5.4 copy_term/2
8.5.4.1 Description
copy_term(Term_1, Term_2) is true iff Term_2 unifies with a term T which is a renamed copy (7.1.6.2) of Term_1.
So this unifies is a big red warning sign: Don't ever think this predicate is a relation, it can only be understood procedurally. And even more so it (implicitly) states that the definition is steadfast in the second argument.
Another example: sort/2. Is this now a relation or not?
8.4.3 sort/2
8.4.3.1 Description
sort(List, Sorted) is true iff Sorted unifies with the sorted list of List (7.1.6.5).
So, again, no relation. Surprised? Look at 8.4.3.4 Examples:
8.4.3.4 Examples
...
sort([X, 1], [1, 1]).
Succeeds, unifying X with 1.
sort([1, 1], [1, 1]).
Fails.
If necessary, a separate procedural description is added, starting with "Procedurally,". It again does not cover any errors at all. This is one of the big advantages of the standard descriptions: Errors are all separated from "doing", which helps a programmer (= user of the built-in) catching errors more systematically. To be fair, it slightly increases the burden of the implementor who wants to optimize by hand and on a case-to-case basis. Such optimized code is often prone to subtle errors anyway.
x.y.z.2 Template and modes
Here, a comprehensive, one or two line specification of the arguments' modes and types is given. The notation is very similar to other notations which find their origin in the 1978 DECsystem-10 mode declarations.
8.5.2.2 Template and modes
arg(+integer, +compound_term, ?term)
There is, however, a big difference between ISO's approach and Covington et al.'s guideline which is of informal nature only and states how a programmer should use a predicate. ISO's approach describes how the built-in will behave - in particular which errors should be expected. (There are 4 errors following from above plus one extra error that cannot be seen from above spec, see below).
x.y.z.3 Errors
All error conditions are given, each in its own subclause numbered alphabetically. The codex in 7.12 Errors:
When more than one error condition is satisfied, the error that is reported by the Prolog processor is implementation dependent.
That means, that each error condition must state all preconditions where it applies. All of them. The error conditions are not read like an if-then-elsif-then...
It also means that the codifier has to put extra effort for finding good error conditions. This is all to the advantage of the actual user-programmer but certainly a bit of a pain for the codifier and implementor.
Many error conditions directly follow from the spec given in x.y.z.2 according to the NOTES in 8.1.3 Errors and according to 7.12.2 Error classification (summary). For the built-in predicate arg/3, errors a, b, c, d follow from the spec. Only error e does not follow.
8.5.2.3 Errors
a) N is a variable — instantiation_error.
b) Term is a variable — instantiation_error.
c) N is neither a variable nor an integer—type_error(integer, N).
d) Term is neither a variable nor a compound term— type_error(compound, Term).
e) N is an integer less than zero— domain_error(not_less_than_zero, N).
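Because the error conditions are independent of one another, several can hold for the same goal. A minimal sketch in Python that checks each condition of 8.5.2.3 on its own is shown below. `Var`, `Compound` and `arg_errors` are hypothetical stand-ins invented for this illustration, not part of any Prolog system.

```python
from dataclasses import dataclass

class Var:
    """Hypothetical stand-in for an unbound Prolog variable."""

@dataclass
class Compound:
    """Hypothetical stand-in for a compound term."""
    functor: str
    args: tuple

def arg_errors(n, term):
    """Return every error condition that holds for arg(N, Term, _)."""
    errors = []
    if isinstance(n, Var):                                             # a)
        errors.append(("instantiation_error",))
    if isinstance(term, Var):                                          # b)
        errors.append(("instantiation_error",))
    if not isinstance(n, Var) and not isinstance(n, int):              # c)
        errors.append(("type_error", "integer", n))
    if not isinstance(term, Var) and not isinstance(term, Compound):   # d)
        errors.append(("type_error", "compound", term))
    if isinstance(n, int) and n < 0:                                   # e)
        errors.append(("domain_error", "not_less_than_zero", n))
    return errors

# Two conditions hold at once here; which one a processor reports
# is implementation dependent according to 7.12.
print(arg_errors(-1, "foo"))
```

Note that no check relies on an earlier one having failed: each condition spells out all of its own preconditions, which is exactly why the list cannot be read as an if-then-elsif chain.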
x.y.z.4 Examples
(Optional).
x.y.z.5 Bootstrapped built-in predicates
(Optional).
Defines other predicates that are so similar, they can be "bootstrapped".
I'm writing a Prolog interpreter as an exercise and wondering what I should be aiming for. Unfortunately there are many versions of Prolog to choose from and they are documented to various degrees. I quickly found this question from someone who was apparently expecting far too much from the internet by wanting a detailed html specification of Prolog. The answer to that was that you can get the ISO standard for $30, but that's rather impractical. Users are never going to pay $30 just to read about Prolog when they can get a Prolog interpreter for even less money, so if you pay the money and conform to the standard few people will ever recognize your effort. Therefore it doesn't surprise me at all that the ISO standard isn't universally respected.
Starting from the assumption that the ISO standard is a joke, what is the real version of Prolog that an interpreter should be aiming for? I don't mean that every little Prolog interpreter should fully implement every feature, but when constructing a Prolog interpreter there's no end to the little decisions that must be made. How should someone discover the consensus of the Prolog community about what Prolog should be?
If you are writing a Prolog system as an exercise, don't expect that you get too much done. After all, it is quite an effort.
To start with, aim for the core of the ISO standard, that is 13211-1:1995 including Cor.1:2007 and Cor.2:2012. That core is pretty much supported by many systems like: IF, SWI, YAP, B, GNU, SICStus, Jekejeke, Minerva. So while this core just covers the very basics, it will still be a lot of work for you.
Then, you can consider what further direction you want to go. From a standard's viewpoint these are implementation specific extensions. Systems pretty much differ in the way they offer extensions, so there is no clear way to choose. The most popular systems are SICStus (commercial) and SWI (open source). An open source system with better conformance than SWI is GNU.
You are putting a lot of quite debatable implications into your question, so let me try to sort some out:
Price of standards. ISO standards do cost something - these documents have a certain legal status - depending on your country and legislation. Freely available web documents can serve as evidence only. See for example the C standard which you get for the same prices: One official high price (USD 285) and a reduced one by INCITS (USD 30). The difference is only the cover sheet. At least, you can get the Prolog standard for a significantly reduced price.
Relevance. There is just one standard. And systems conform quite closely. Where they differ, they differ rather randomly. As an example, look at this detailed comparison of syntax which covers both reading and writing terms. Typically, such differences are reported by users who get hit by one or another difference. These differences are nowhere formally defined.
I don't agree with your assumption about ISO Prolog, indeed I would suggest to try to implement a small subset of ISO Prolog (i.e. be 'sure' to properly implement findall/setof).
A principal problem with the ISO standard is the module directive. So either choose an implementation on which to model modules, or skip them altogether.
Even some 'undiscussed' built-ins will be difficult to implement, depending on the language you use (C, Haskell, Lisp, SQL, Javascript, C++...) and the choices you make about the degree of translation. Most implementations out there are not interpreters, but bytecode compilers with various degrees of runtime support. The most used choice for the bytecode level is Warren's Abstract Machine (WAM, as you surely know).
When I wrote my Prolog interpreter, many years ago, I designed and implemented an object oriented database model, using algorithm ABC instead of the WAM, and I designed and implemented the Variables handling with ingenuity... but I left out setof/bagof, for instance...
I think SWI-Prolog pretty much drives the Prolog standard nowadays, so take a look at their documentation... Other than that, what you are asking has been debated over and over again during the past years in the Prolog Standardization meetings. Some argue that tabling should be made into the standard, others claim the same for automatic indexing, and so on. So, in my humble opinion, the best you can do is "mimic" what SWI does for most stuff, and you'll almost certainly be in the standard.
I'm working on a slot-machine mini-game application. The rules for what constitutes a winning prize are rather complex (n of a kind, n of any kind, specific sequences), and to make matters even more complicated, this code should work for a slot-machine with (n >= 3) reels.
So, after some thought, I believe defining a context-free language is the most efficient and extensible way to go. This way I could define the grammar in an XML file.
So my question is, given a string of symbols S, how do I go about testing if S is in a given Context-Free Language? Would I simply exhaust rules until I'm out of valid rules/symbols, or is there a known algorithm that could help. Thanks.
Also, a language like this seems non-regular, am I correct? I've never been good at proofs, so I've avoided trying.
Any comments on my approach would be appreciated as well.
Thanks.
"...given a string of symbols S, how do I go about testing if S is in
a given Context-Free Language?"
If a string w is in L(G), the process of finding a sequence of production rules of G by which w is derived is called parsing. So, you have to create a parse tree to search for some derivation. To do this you perform an exhaustive Breadth-First-Search. There is a serious issue that arises: The searching process may never terminate. To prevent endless searches you have to transform the grammar into what is known as a normal form.
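As an alternative to exhaustive search, a membership test that always terminates is the CYK algorithm, which works on grammars in Chomsky normal form. The sketch below is a swapped-in technique, not the search described above, and the example grammar for a^n b^n is an assumption chosen only for illustration.

```python
def cyk(word, start, unary, binary):
    """Decide whether `word` is derivable from `start`.

    unary:  dict terminal -> set of nonterminals A with a rule A -> terminal
    binary: dict (B, C)   -> set of nonterminals A with a rule A -> B C
    """
    n = len(word)
    if n == 0:
        return False  # the empty string needs separate treatment
    # table[i][l - 1] holds the nonterminals deriving word[i:i + l]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, sym in enumerate(word):
        table[i][0] = set(unary.get(sym, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for b in table[i][split - 1]:
                    for c in table[i + split][length - split - 1]:
                        table[i][length - 1] |= binary.get((b, c), set())
    return start in table[0][n - 1]

# Hypothetical grammar in Chomsky normal form for { a^n b^n | n >= 1 }:
#   S -> A B | A T,   T -> S B,   A -> a,   B -> b
unary = {"a": {"A"}, "b": {"B"}}
binary = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}

print(cyk("aabb", "S", unary, binary))
print(cyk("aab", "S", unary, binary))
```

The running time is cubic in the length of the input, which is irrelevant for a handful of reels but matters for longer strings.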
"Also, a language like this seems non-regular, am I correct?"
Not necessarily. Every regular language is context-free (because it can be described by a CFG), but not every context-free language is regular.
General context-free grammars are hard to parse.
However, there are methods to parse grammars in subsets of the context-free grammars.
For example: SLR and LL grammars are often used by compilers to parse programming languages, which are also context-free languages. To use these, your grammar must be in one of these "families" (remember: there is an infinite number of grammars for each context-free language).
Some practical tools you might want to use that are generally used for compilers are JavaCC in java and bison in C++.
(Bison generates LALR(1) parsers by default, and JavaCC generates LL(k) parsers.)
P.S.
For a specific slot machine with n slots and k symbols, the language is definitely regular, since there are at most k^n "words" in it, and every finite language is regular. Things obviously get complicated if you are looking for a grammar for all slot machines.
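For a fixed machine, the finiteness argument can be checked directly by enumerating every outcome. The symbols, the reel count and the winning rule in this Python sketch are hypothetical.

```python
from itertools import product

symbols = ["cherry", "banana", "pear"]   # k = 3 (hypothetical symbols)
reels = 3                                # n = 3

# Every possible outcome of the machine: k ** n tuples.
all_outcomes = list(product(symbols, repeat=reels))

def is_win(outcome):
    """Hypothetical rule: all reels show the same symbol."""
    return len(set(outcome)) == 1

# The "language" of winning outcomes is a finite set, so membership
# is a plain set lookup; no grammar is required.
winning = {o for o in all_outcomes if is_win(o)}

print(len(all_outcomes), len(winning))
```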
Your best bet is to actually code this with a proper programming language. A CFG is overkill, because it can be extremely hard to code some, as you say, "rather complex" rules. For example, grammars are poorly suited to talking about the number of things.
For example, how would you code "the number of cherries is > the number of any other object" in such a language? How would the person you're giving the program to do so? CFGs cannot easily express such concepts, and regular expressions cannot sanely do so by any stretch.
The answer is that grammars are not right for this task, unless the slot machine is trying to make English sentences.
You also have to consider what happens when TWO or more "prize sequences" match! Assuming you want to give out the highest prize, you need an ordered list of recognizers. This is not to say you can't code your recognizers with (for example) regular expressions in addition to arbitrary functions. I'm just saying that general CFG parsing is overkill, because what CFGs get you over regular languages (i.e. regular expressions) is the ability to consider parse trees of arbitrary depth (like nested parentheses of level N or more), which is probably not what you care about.
This is not to say that you don't, for example, want to allow regular expressions. You can make that job easy by using a parser generator to recognize regexes involving cherries bananas and pears, see http://en.wikipedia.org/wiki/Comparison_of_parser_generators, which you can then embed, though you might want to simply roll your own recursive descent parser (assuming again you don't care about CFGs, especially if your tokens are bounded length).
For example, here is how I might implement it in pseudocode (ideally you'd use a statically typechecked language with good list manipulation, which I can't think of off the top of my head):
rules = []
function Rule(name, code) {
this.name = name
this.code = code
rules.push(this) # adds them in order
}
##########################
Rule("All the same", regex(^(.)\1*$))
Rule("No two-in-a-row", function(list, counts) {
not regex((.)\1).match(list)
})
Rule("More cherries than anything else", function(list, counts) {
counts[cherries]>counts[x] for all x in counts
or
sorted(counts.items())[0]==cherries
or
counts.greatest()==cherries
})
for token in [cherry, banana, ...]:
Rule("At least 50% "+token, function(list, counts){
counts[token] >= list.length/2
})
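A runnable version of the same idea might look like the following Python sketch. The symbol names, the one-letter encoding and the ordering of the rules are assumptions made for illustration. Rules are tried in list order, so the most valuable prize should come first.

```python
import re
from collections import Counter

# Hypothetical one-letter encoding so that regexes can run over the reels.
SYMBOLS = {"cherry": "C", "banana": "B", "pear": "P"}

def encode(reels):
    return "".join(SYMBOLS[s] for s in reels)

def all_the_same(reels, counts):
    return re.fullmatch(r"(.)\1*", encode(reels)) is not None

def no_two_in_a_row(reels, counts):
    return re.search(r"(.)\1", encode(reels)) is None

def more_cherries_than_anything_else(reels, counts):
    others = [c for s, c in counts.items() if s != "cherry"]
    return counts["cherry"] > max(others, default=0)

def at_least_half(symbol):
    def rule(reels, counts):
        return counts[symbol] >= len(reels) / 2
    return rule

# Ordered list of (name, recognizer); the first match wins.
RULES = [
    ("All the same", all_the_same),
    ("No two-in-a-row", no_two_in_a_row),
    ("More cherries than anything else", more_cherries_than_anything_else),
] + [(f"At least 50% {s}", at_least_half(s)) for s in SYMBOLS]

def first_match(reels):
    counts = Counter(reels)
    for name, rule in RULES:
        if rule(reels, counts):
            return name
    return None

print(first_match(["cherry", "cherry", "banana"]))
```

Because each recognizer is an ordinary function, counting rules and regex rules sit side by side, and the code works for any number of reels.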
Like lots of you guys on SO, I often write in several languages. And when it comes to planning stuff, (or even answering some SO questions), I actually think and write in some unspecified hybrid language. Although I used to be taught to do this using flow diagrams or UML-like diagrams, in retrospect, I find "my" pseudocode language has components of C, Python, Java, bash, Matlab, perl, Basic. I seem to unconsciously select the idiom best suited to expressing the concept/algorithm.
Common idioms might include Java-like braces for scope, pythonic list comprehensions or indentation, C++like inheritance, C#-style lambdas, matlab-like slices and matrix operations.
I noticed that it's actually quite easy for people to recognise exactly what I'm trying to do, and quite easy for people to intelligently translate into other languages. Of course, that step involves considering the corner cases, and the moments where each language behaves idiosyncratically.
But in reality, most of these languages share a subset of keywords and library functions which generally behave identically - maths functions, type names, while/for/if etc. Clearly I'd have to exclude many 'odd' languages like lisp, APL derivatives, but...
So my questions are,
1. Does code already exist that recognises the programming language of a text file? (Surely this must be a less complicated task than eclipse's syntax trees or than google translate's language guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?
2. Is it theoretically possible to create a single interpreter or compiler that recognises what language idiom you're using at any moment and (maybe "intelligently") executes or translates to a runnable form. And flags the corner cases where my syntax is ambiguous with regards to behaviour. Immediate difficulties I see include: knowing when to switch between indentation-dependent and brace-dependent modes, recognising funny operators (like *pointer vs *kwargs) and knowing when to use list vs array-like representations.
3. Is there any language or interpreter in existence, that can manage this kind of flexible interpreting?
4. Have I missed an obvious obstacle to this being possible?
edit
Thanks all for your answers and ideas. I am planning to write a constraint-based heuristic translator that could, potentially, "solve" code for the intended meaning and translate into real python code. It will notice keywords from many common languages, and will use syntactic clues to disambiguate the human's intentions - like spacing, brackets, optional helper words like let or then, context of how variables are previously used etc, plus knowledge of common conventions (like capital names, i for iteration, and some simplistic limited understanding of naming of variables/methods e.g containing the word get, asynchronous, count, last, previous, my etc). In real pseudocode, variable naming is as informative as the operations themselves!
Using these clues it will create assumptions as to the implementation of each operation (like 0/1 based indexing, when should exceptions be caught or ignored, what variables ought to be const/global/local, where to start and end execution, and what bits should be in separate threads, notice when numerical units match / need converting). Each assumption will have a given certainty - and the program will list the assumptions on each statement, as it coaxes what you write into something executable!
For each assumption, you can 'clarify' your code if you don't like the initial interpretation. The libraries issue is very interesting. My translator, like some IDE's, will read all definitions available from all modules, use some statistics about which classes/methods are used most frequently and in what contexts, and just guess! (adding a note to the program to say why it guessed as such...) I guess it should attempt to execute everything, and warn you about what it doesn't like. It should allow anything, but let you know what the several alternative interpretations are, if you're being ambiguous.
It will certainly be some time before it can manage such unusual examples as @Albin Sunnanbo's ImportantCustomer example. But I'll let you know how I get on!
I think that is quite useless for everything but toy examples and strict mathematical algorithms. For everything else the language is not just the language. There are lots of standard libraries and whole environments around the languages. I think I write almost as many lines of library calls as I write "actual code".
In C# you have .NET Framework, in C++ you have STL, in Java you have some Java libraries, etc.
The difference between those libraries are too big to be just syntactic nuances.
<subjective>
There have been attempts at unifying the language constructs of different languages into a "unified syntax". Those were called 4GL languages and never really took off.
</subjective>
As a side note I have seen a code example about a page long that was valid as C#, Java and JavaScript code. That can serve as an example of where it is impossible to determine the actual language used.
Edit:
Besides, the whole purpose of pseudocode is that it does not need to compile in any way. The reason you write pseudocode is to create a "sketch", however sloppy you like.
foreach c in ImportantCustomers{== OrderValue >=$1M}
SendMailInviteToSpecialEvent(c)
Now tell me what language it is and write an interpreter for that.
1. To detect what programming language is used: Detecting programming language from a snippet
2. I think it should be possible. The approach in 1. could be leveraged to do this, I think. I would try to do it iteratively: detect the syntax used in the first line/clause of code, "compile" it to intermediate form based on that detection, along with any important syntax (e.g. begin/end wrappers). Then the next line/clause etc. Basically write a parser that attempts to recognize each "chunk". Ambiguity could be flagged by the same algorithm.
3. I doubt that this has been done ... seems like the cognitive load of learning to write e.g. python-compatible pseudocode would be much easier than trying to debug the cases where your interpreter fails.
4a. I think the biggest problem is that most pseudocode is invalid in any language. For example, I might completely skip object initialization in a block of pseudocode because for a human reader it is almost always straightforward to infer. But for your case it might be completely invalid in the language syntax of choice, and it might be impossible to automatically determine e.g. the class of the object (it might not even exist). Etc.
4b. I think the best you can hope for is an interpreter that "works" (subject to 4a) for your pseudocode only, no-one else's.
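The chunk-by-chunk idea can be sketched as a scoring function that either commits to one language for a line or flags the line as ambiguous. The cue lists in this Python sketch are hypothetical and deliberately tiny; a usable detector would need far more evidence than three regexes per language.

```python
import re

# Hypothetical cue lists, for illustration only.
CUES = {
    "python": [r"^\s*def \w+\(.*\):\s*$",
               r"^\s*(if|for|while) .*:\s*$",
               r"\belif\b"],
    "c-like": [r";\s*$",
               r"\{\s*$",
               r"^\s*(if|for|while)\s*\(.*\)"],
}

def guess_chunk(line):
    """Return the single best-scoring language for one line, or None if ambiguous."""
    scores = {lang: sum(bool(re.search(p, line)) for p in pats)
              for lang, pats in CUES.items()}
    best = max(scores.values())
    winners = [lang for lang, s in scores.items() if s == best]
    if best == 0 or len(winners) > 1:
        return None   # flag the chunk instead of guessing
    return winners[0]

for line in ["def f(x):", "for (i = 0; i < n; i++) {", "if (x > 0):", "x = y + 1"]:
    print(repr(line), "->", guess_chunk(line))
```

The third and fourth lines show the problem: `if (x > 0):` carries cues for both families, and `x = y + 1` carries none, so both have to be flagged rather than guessed.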
Note that I don't think that 4a,4b are necessarily obstacles to it being possible. I just think it won't be useful for any practical purpose.
Recognizing what language a program is in is really not that big a deal. Recognizing the language of a snippet is more difficult, and recognizing snippets that aren't clearly delimited (what do you do if four lines are Python and the next one is C or Java?) is going to be really difficult.
Assuming you got the lines assigned to the right language, doing any sort of compilation would require specialized compilers for all languages that would cooperate. This is a tremendous job in itself.
Moreover, when you write pseudo-code you aren't worrying about the syntax. (If you are, you're doing it wrong.) You'll wind up with code that simply can't be compiled because it's incomplete or even contradictory.
And, assuming you overcame all these obstacles, how certain would you be that the pseudo-code was being interpreted the way you were thinking?
What you would have would be a new computer language, that you would have to write correct programs in. It would be a sprawling and ambiguous language, very difficult to work with properly. It would require great care in its use. It would be almost exactly what you don't want in pseudo-code. The value of pseudo-code is that you can quickly sketch out your algorithms, without worrying about the details. That would be completely lost.
If you want an easy-to-write language, learn one. Python is a good choice. Use pseudo-code for sketching out how processing is supposed to occur, not as a compilable language.
An interesting approach would be a "type-as-you-go" pseudocode interpreter. That is, you would set the language to be used up front, and then it would attempt to convert the pseudo code to real code, in real time, as you typed. An interactive facility could be used to clarify ambiguous stuff and allow corrections. Part of the mechanism could be a library of code which the converter tried to match. Over time, it could learn and adapt its translation based on the habits of a particular user.
People who program all the time will probably prefer to just use the language in most cases. However, I could see the above being a great boon to learners, "non-programmer programmers" such as scientists, and for use in brainstorming sessions with programmers of various languages and skill levels.
-Neil
Programs interpreting human input need to be given the option of saying "I don't know." The language PL/I is a famous example of a system designed to find a reasonable interpretation of anything resembling a computer program that could cause havoc when it guessed wrong: see http://horningtales.blogspot.com/2006/10/my-first-pli-program.html
Note that in the later language C++, when it resolves possible ambiguities it limits the scope of the type coercions it tries, and that it will flag an error if there is not a unique best interpretation.
I have a feeling that the answer to 2. is NO. All I need to prove it false is a code snippet that can be interpreted in more than one way by a competent programmer.
Does code already exist that
recognises the programming language
of a text file?
Yes, the Unix file command.
(Surely this must be a less
complicated task than eclipse's syntax
trees or than google translate's
language guessing feature, right?) In
fact, does the SO syntax highlighter
do anything like this?
As far as I can tell, SO has a one-size-fits-all syntax highlighter that tries to combine the keywords and comment syntax of every major language. Sometimes it gets it wrong:
def median(seq):
    """Returns the median of a list."""
    seq_sorted = sorted(seq)
    if len(seq) & 1:
        # For an odd-length list, return the middle item
        return seq_sorted[len(seq) // 2]
    else:
        # For an even-length list, return the mean of the 2 middle items
        return (seq_sorted[len(seq) // 2 - 1] + seq_sorted[len(seq) // 2]) / 2
Note that SO's highlighter assumes that // starts a C++-style comment, but in Python it's the integer division operator.
This is going to be a major problem if you try to combine multiple languages into one. What do you do if the same token has different meanings in different languages? Similar situations are:
Is ^ exponentiation like in BASIC, or bitwise XOR like in C?
Is || logical OR like in C, or string concatenation like in SQL?
What is 1 + "2"? Is the number converted to a string (giving "12"), or is the string converted to a number (giving 3)?
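For comparison, here is how one concrete language, Python 3, settles some of these questions. The snippet shows only Python's answers; other languages decide differently, which is the whole problem. (`||` is not valid Python at all; logical OR is spelled `or`.)

```python
floor_div = 7 // 2      # // is floor division here, not a comment
xor = 2 ^ 3             # ^ is bitwise XOR, as in C
power = 2 ** 3          # exponentiation is spelled ** instead

try:
    mixed = 1 + "2"
except TypeError:
    mixed = None        # Python refuses to coerce either operand

print(floor_div, xor, power, mixed)
```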
Is there any language or interpreter
in existence, that can manage this
kind of flexible interpreting?
On another forum, I heard a story of a compiler (IIRC, for FORTRAN) that would compile any program regardless of syntax errors. If you had the line
= Y + Z
The compiler would recognize that a variable was missing and automatically convert the statement to X = Y + Z, regardless of whether you had an X in your program or not.
This programmer had a convention of starting comment blocks with a line of hyphens, like this:
C ----------------------------------------
But one day, they forgot the leading C, and the compiler choked trying to add dozens of variables between what it thought were subtraction operators.
"Flexible parsing" is not always a good thing.
To create a "pseudocode interpreter," it might be necessary to design a programming language that allows user-defined extensions to its syntax. There already are several programming languages with this feature, such as Coq, Seed7, Agda, and Lever. A particularly interesting example is the Inform programming language, since its syntax is essentially "structured English."
The Coq programming language allows "syntax extensions", so the language can be extended to parse new operators:
Notation "A /\ B" := (and A B).
Similarly, the Seed7 programming language can be extended to parse "pseudocode" using "structured syntax definitions." The while loop in Seed7 is defined in this way:
syntax expr: .while.().do.().end.while is -> 25;
Alternatively, it might be possible to "train" a statistical machine translation system to translate pseudocode into a real programming language, though this would require a large corpus of parallel texts.