Testing membership in context-free language

Testing membership in context-free language - algorithm

I'm working on a slot-machine mini-game application. The rules for what constitutes a winning prize are rather complex (n of a kind, n of any kind, specific sequences), and to make matters even more complicated, this code should work for a slot-machine with (n >= 3) reels.
So, after some thought, I believe defining a context-free language is the most efficient and extensible way to go. This way I could define the grammar in an XML file.
So my question is, given a string of symbols S, how do I go about testing if S is in a given Context-Free Language? Would I simply exhaust rules until I'm out of valid rules/symbols, or is there a known algorithm that could help. Thanks.
Also, a language like this seems non-regular, am I correct? I've never been good at proofs, so I've avoided trying.
Any comments on my approach would be appreciated as well.
Thanks.

"...given a string of symbols S, how do I go about testing if S is in
a given Context-Free Language?"
If a string w is in L(G), the process of finding a sequence of production rules of G by which w is derived is called parsing. So, you have to build a parse tree by searching for some derivation. To do this you can perform an exhaustive breadth-first search over derivations. A serious issue arises: the search may never terminate when w is not in L(G). To prevent endless searches you have to transform the grammar into what is known as a normal form (for example, Chomsky normal form).
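Once a grammar is in Chomsky normal form, one well-known membership test that always terminates is the CYK algorithm (a table-filling method, not the breadth-first search described above). The following is only a minimal sketch: the function name `cyk_accepts`, the dictionary encoding of the rules, and the toy grammar are assumptions made for illustration.

```python
from itertools import product

def cyk_accepts(word, unary, binary, start="S"):
    """CYK membership test for a grammar in Chomsky normal form.

    unary:  dict terminal -> set of nonterminals A with a rule A -> terminal
    binary: dict (B, C)   -> set of nonterminals A with a rule A -> B C
    """
    n = len(word)
    if n == 0:
        return False  # this sketch ignores an optional S -> empty rule
    # table[i][j] holds the nonterminals deriving the substring
    # of length j + 1 that starts at position i.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(word):
        table[i][0] = set(unary.get(symbol, ()))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for b, c in product(left, right):
                    table[i][length - 1] |= binary.get((b, c), set())
    return start in table[0][n - 1]

# Hypothetical toy grammar for { a^n b^n | n >= 1 } in CNF:
#   S -> A B | A T,  T -> S B,  A -> a,  B -> b
UNARY = {"a": {"A"}, "b": {"B"}}
BINARY = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}
```

Calling `cyk_accepts("aabb", UNARY, BINARY)` fills a triangular table in O(n^3) steps for a string of length n, so it cannot loop forever the way a naive search can.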
"Also, a language like this seems non-regular, am I correct?"
Not necessarily. Every regular language is context-free (because it can be described by a CFG), but not every context-free language is regular.

General context-free grammars are hard to evaluate.
However, there are methods to parse grammars in subsets of the context-free grammars.
For example: SLR and LL grammars are often used by compilers to parse programming languages, which are also context-free languages. To use these, your grammar must be in one of these "families" (remember - there are an infinite number of grammars for each context-free language).
Some practical tools you might want to use that are generally used for compilers are JavaCC in Java and Bison in C++.
(If I remember correctly, Bison is an LALR parser generator, a close relative of SLR, and JavaCC is an LL parser generator, but I could be wrong)
P.S.
For a specific slot machine, with n slots and k symbols - the language is definitely regular, since there are at most k^n "words" in it, and every finite language is regular. Things obviously get complicated if you are looking for a grammar for all slot machines.
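To make the "every finite language is regular" point concrete, here is a small sketch that enumerates all k^n outcomes and turns a finite set of winning words into a single regular expression. The symbol alphabet `CBP`, the reel count, and the "all reels equal" winning condition are hypothetical choices for illustration only.

```python
import re
from itertools import product

SYMBOLS = "CBP"   # hypothetical symbols: cherry, banana, pear
REELS = 3

# Every possible outcome: k**n words for k symbols and n reels.
all_words = ["".join(w) for w in product(SYMBOLS, repeat=REELS)]

# Hypothetical winning condition: all reels show the same symbol.
winning = [w for w in all_words if len(set(w)) == 1]

# A finite language is regular: join its words into one alternation.
pattern = re.compile("|".join(map(re.escape, winning)))

def is_winner(word):
    """Return True if word is exactly one of the winning outcomes."""
    return pattern.fullmatch(word) is not None
```

For realistic machines the alternation can get large, but it stays finite, which is all that regularity requires.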

Your best bet is to actually code this with a proper programming language. A CFG is overkill, because it can be extremely hard to code some, as you say, "rather complex" rules. For example, grammars are poorly suited to talking about the number of things.
For example, how would you code "the number of cherries is > the number of any other object" in such a language? How would the person you're giving the program to do so? CFGs cannot easily express such concepts, and regular expressions cannot sanely do so by any stretch.
The answer is that grammars are not right for this task, unless the slot machine is trying to make English sentences.
You also have to consider what happens when TWO or more "prize sequences" match! Assuming you want to give out the highest prize, you need an ordered list of recognizers. This is not to say you can't code your recognizers with (for example) regular expressions in addition to arbitrary functions. I'm just saying that general CFG parsing is overkill, because what CFGs get you over regular languages (i.e. regular expressions) is the ability to consider parse trees of arbitrary depth (like nested parentheses of level N or more), which is probably not what you care about.
This is not to say that you don't, for example, want to allow regular expressions. You can make that job easy by using a parser generator to recognize regexes involving cherries, bananas, and pears (see http://en.wikipedia.org/wiki/Comparison_of_parser_generators), which you can then embed, though you might want to simply roll your own recursive descent parser (assuming again you don't care about CFGs, especially if your tokens are bounded length).
For example, here is how I might implement it, sketched in Python (ideally you'd use a statically typechecked language with good list manipulation, which I can't think of off the top of my head):

    import re
    from collections import Counter

    rules = []

    def rule(name, check):
        # check(reels, counts) -> bool; reels is a string, one character per reel
        # (assumed encoding: "C" = cherry, "B" = banana, "P" = pear)
        rules.append((name, check))  # adds them in order

    ##########################
    rule("All the same", lambda reels, counts: re.fullmatch(r"(.)\1*", reels) is not None)
    rule("No two-in-a-row", lambda reels, counts: re.search(r"(.)\1", reels) is None)
    rule("More cherries than anything else", lambda reels, counts:
         all(counts["C"] > n for sym, n in counts.items() if sym != "C"))
    for token in "CBP":  # cherry, banana, pear, ...
        # token=token freezes the current symbol for each lambda
        rule("At least 50% " + token, lambda reels, counts, token=token:
             counts[token] >= len(reels) / 2)

    # usage: counts = Counter(reels); [name for name, check in rules if check(reels, counts)]
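The point about several prize sequences matching at once can be handled by keeping the recognizers in priority order and stopping at the first match. A minimal sketch, assuming reels are given as a list of symbol names; the rule names, payouts, and the helper `best_prize` are all hypothetical:

```python
from collections import Counter

# Hypothetical (name, payout, predicate) triples, highest prize first.
RULES = [
    ("All the same", 100, lambda reels, counts: len(counts) == 1),
    ("More cherries than anything else", 20,
     lambda reels, counts: all(counts["cherry"] > n
                               for sym, n in counts.items() if sym != "cherry")),
    ("No two-in-a-row", 5,
     lambda reels, counts: all(a != b for a, b in zip(reels, reels[1:]))),
]

def best_prize(reels):
    """Return (name, payout) of the first matching rule, or None."""
    counts = Counter(reels)
    for name, payout, matches in RULES:
        if matches(reels, counts):
            return name, payout
    return None
```

Because the list is scanned top to bottom, `best_prize(["cherry", "pear", "cherry"])` reports the cherry rule even though "No two-in-a-row" also matches.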

Related

Examples of practical context sensitive programming structures

So, I am implementing a context-sensitive syntactic analyzer. It's kind of an experimental thing, and one of the things I need is usable and practical syntactic constructs to test it on.
For example, the following example isn't possible to parse using a standard CFG (context-free grammar). Basically it allows you to declare multiple variables of unrelated data types and simultaneously initialize them.
int bool string number flag str = 1 true "Hello";
If I omit a few details, it can be formally described like this:
L = {a^n b^n c^n | n >= 1}
So, I would appreciate as many similar examples as you can think of; however, they really should be practical. Something that actual programmers would appreciate.
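For testing purposes, membership in { a^n b^n c^n | n >= 1 } is easy to check directly even though no context-free grammar generates it. A minimal reference checker, with the function name `in_anbncn` chosen only for illustration:

```python
import re

def in_anbncn(s):
    """Return True if s is in { a^n b^n c^n | n >= 1 }."""
    # The regex checks the shape (a's, then b's, then c's);
    # the length comparison adds the counting a CFG cannot do.
    m = re.fullmatch(r"(a+)(b+)(c+)", s)
    return m is not None and len(m.group(1)) == len(m.group(2)) == len(m.group(3))
```

A checker like this can serve as an oracle when comparing the output of an experimental context-sensitive parser.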

Just about all binary formats have some context-sensitivity, one of the simplest examples being a number of elements followed by an undelimited array of that length. (Technically, this could be parsed by a CFG if the possible array lengths are a finite set, but only with billions and billions of production rules.) Pascal and other languages traditionally represented strings this way. Another context-sensitive grammar that programmers often use is two-dimensional source-code layout, which right now gets translated into an intermediate CFG during preprocessing. Other examples are references to another part of the document, such as looking up a label, and Turing-complete macro languages. I'm not sure exactly what kind of language your parser is supposed to recognize.
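The "count followed by an undelimited array" case looks like this in practice. This is a sketch of a made-up format (one count byte followed by little-endian 16-bit items), not any real file format; the function name is hypothetical.

```python
import struct

def parse_length_prefixed(data):
    """Parse a 1-byte count followed by that many little-endian uint16 items."""
    if not data:
        raise ValueError("missing length byte")
    count = data[0]
    body = data[1:]
    # The count read earlier decides how much input is valid here:
    # this dependency is the context-sensitive part.
    if len(body) != 2 * count:
        raise ValueError("payload length does not match the count")
    return list(struct.unpack("<%dH" % count, body))

# Example input: count = 3, then the items 10, 20, 30.
encoded = bytes([3]) + struct.pack("<3H", 10, 20, 30)
```

The parser has to remember a value it read and use it to decide how to read the rest, which is exactly what a plain CFG cannot express for unbounded counts.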

How to define the grammar for TeX/LaTeX and Makefile?

Both are technologies expressed via languages full of macros, but in more technical terms, what kind of grammar do they have, and how can their properties be described?
I'm not interested in a graphical representation; by properties I mean a descriptive phrase about this subject, so please don't just go for a BNF/EBNF-oriented response full of arcs and graphs.
I assume that both are context-free grammars, but this is a big family of grammars. Is there a way to describe these two more precisely?
Thanks.

TeX can change the meaning of characters at run time, so it's not context free.

Is my language Context-Free?
I believe that every useful language ends up being Turing-complete, reflexive, etc.
Fortunately that is not the end of the story.
Most of the parser generation tools (yacc, ANTLR, etc.) process up to context-free grammars (CFG).
So we divide the language processing problem into 3 steps:
1. Build an over-generating CFG; this is the "syntactical" part that constitutes a solid base where we add the other components,
2. Add "semantic" constraints (with some extra syntactic and semantic constraints)
3. Main semantics (static semantics, pragmatics, attributive semantics, etc.)
Writing a context-free grammar is a very standard way of speaking about all the languages!
It is a very clear and didactic notation for languages!! (and sometimes is not telling all the truth).
When we say that a language "is not context-free, is Turing-complete, ..." you can translate it to "you can count on lots of extra semantic work" :)
How can I speak about it?
Many choices available. I like to do a subset of the following:
Write a clear, semantics-oriented CFG
For each symbol (T or NT): add/define a set of semantic attributes
For each production rule: add syntactic/semantic constraint predicates
For each production rule: add a set of equations to define the values of the attributes
For each production rule: add an English explanation, examples, etc.

Big-O for a compiler [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
Does anyone have insight into the typical big-O complexity of a compiler?
I know it must be >= n (where n is the number of lines in the program), because it needs to scan each line at least once.
I believe it must also be >= n log n for a procedural language, because the program can introduce O(n) variables, functions, procedures, types, etc., and when these are referenced within the program it will take O(log n) to look up each reference.
Beyond that my very informal understanding of compiler architecture has reached its limits and I am not sure if forward declarations, recursion, functional languages, and/or other tricks will increase the algorithmic complexity of the compiler.
So, in summary:
For a 'typical' procedural language (C, Pascal, C#, etc.), is there a limiting big-O for an efficiently designed compiler (as a function of the number of lines)?
For a 'typical' functional language (Lisp, Haskell, etc.), is there a limiting big-O for an efficiently designed compiler (as a function of the number of lines)?

This question is unanswerable in its current form. The complexity of a compiler certainly wouldn't be measured in lines of code or characters in the source file. This would describe the complexity of the parser or lexer, but no other part of the compiler will ever even touch that file.
After parsing, everything will be in terms of various ASTs representing the source file in a more structured manner. A compiler will have a lot of intermediate languages, each with its own AST. The complexity of various phases would be in terms of the size of the AST, which doesn't correlate at all to the character count or even to the previous AST necessarily.
Consider this: we can parse most languages in time linear in the number of characters and generate some AST. Simple operations such as type checking are generally O(n) for a tree with n leaves. But then we'll translate this AST into a form with potentially double, triple, or even exponentially more nodes than the original tree. Now we again run single-pass optimizations on our tree, but this might be O(2^n) relative to the original AST and lord knows what relative to the character count!
I think you're going to find it quite impossible to even find what n should be for some complexity f(n) for a compiler.
As a nail in the coffin, compiling some languages is undecidable, including Java, C#, and Scala (it turns out that nominal subtyping + variance leads to undecidable typechecking). Of course, C++'s templating system is Turing-complete, which makes deciding whether compilation finishes equivalent to the halting problem (undecidable). Haskell + some extensions is undecidable. And many others that I can't think of off the top of my head. There is no worst-case complexity for these languages' compilers.

Reaching back to what I can remember from my compilers class... some of the details here may be a bit off, but the general gist should be pretty much correct.
Most compilers actually have multiple phases that they go through, so it'd be useful to narrow down the question somewhat. For example, the code is usually run through a tokenizer that pretty much just creates objects to represent the smallest possible units of text. var x = 1; would be split into tokens for the var keyword, a name, an assignment operator, and a literal number, followed by a statement finalizer (';'). Braces, parentheses, etc. each have their own token type.
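As a rough illustration of that tokenizing step, here is a minimal regex-based tokenizer for the `var x = 1;` example. The token names and the tiny token set are assumptions for this sketch and do not correspond to any particular compiler.

```python
import re

# Order matters: the keyword pattern must be tried before the general name.
TOKEN_SPEC = [
    ("KEYWORD", r"\bvar\b"),
    ("NAME",    r"[A-Za-z_]\w*"),
    ("NUMBER",  r"\d+"),
    ("ASSIGN",  r"="),
    ("END",     r";"),
    ("SKIP",    r"\s+"),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(source):
    """Split source into (token_type, text) pairs in one left-to-right pass."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError("unexpected character %r at %d" % (source[pos], pos))
        if m.lastgroup != "SKIP":      # whitespace is matched but not emitted
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Each character is consumed once and never revisited, which is where the "roughly O(n)" intuition for tokenizing comes from.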
The tokenizing phase is roughly O(n), though this can be complicated in languages where keywords can be contextual. For example, in C#, words like from and yield can be keywords, but they could also be used as variables, depending on what's around them. So depending on how much of that sort of thing you have going on in the language, and depending on the specific code that's being compiled, just this first phase could conceivably have O(n²) complexity. (Though that would be highly uncommon in practice.)
After tokenizing, then there's the parsing phase, where you try to match up opening/closing brackets (or the equivalent indentations in some languages), statement finalizers, and so forth, and try to make sense of the tokens. This is where you need to determine whether a given name represents a particular method, type, or variable. A wise use of data structures to track what names have been declared within various scopes can make this task pretty much O(n) in most cases, but again there are exceptions.
In one video I saw, Eric Lippert said that correct C# code can be compiled in the time between a user's keystrokes. But if you want to provide meaningful error and warning messages, then the compiler has to do a great deal more work.
After parsing, there can be a number of extra phases including optimizations, conversion to an intermediate format (like byte code), conversion to binary code, just-in-time compilation (and extra optimizations that can be applied at that point), etc. All of these can be relatively fast (probably O(n) most of the time), but it's such a complex topic that it's hard to answer the question even for a single language, and practically impossible to answer it for a genre of languages.

As far as I know:
It depends on the type of parser the compiler uses in its parsing step.
The main types of parsers are LL and LR, and they have different complexities.

Context-free grammars versus context-sensitive grammars?

Can someone explain to me why grammars of this kind [context-free grammars and context-sensitive grammars] accept a string?
What I know is
A context-free grammar is a formal grammar in which every production (rewrite) rule is of the form V → w
where V is a single nonterminal symbol and w is a string of terminals and/or nonterminals; w can be empty.
A context-sensitive grammar is a formal grammar in which the left-hand sides and right-hand sides of any production (rewrite) rules may be surrounded by a context of terminal and nonterminal symbols.
But how can I explain why these grammars accept a string?

An important detail here is that grammars do not accept strings; they generate strings. Grammars are descriptions of languages that provide a means for generating all possible strings contained in the language. In order to tell if a particular string is contained in the language, you would use a recognizer, some sort of automaton that processes a given string and says "yes" or "no."
A context-free grammar (CFG) is a grammar where (as you noted) each production has the form A → w, where A is a nonterminal and w is a string of terminals and nonterminals. Informally, a CFG is a grammar where any nonterminal can be expanded out to any of its productions at any point. The language of a grammar is the set of strings of terminals that can be derived from the start symbol.
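To illustrate the "grammars generate strings" view, the sketch below enumerates the short strings of a toy grammar by expanding the leftmost nonterminal breadth-first. The grammar, the single-character symbol encoding, and the function name `generate` are assumptions made for this example.

```python
from collections import deque

# Hypothetical toy grammar: S -> a S b | a b   (dictionary keys are nonterminals)
GRAMMAR = {"S": ["aSb", "ab"]}

def generate(grammar, start="S", max_len=6):
    """Return every terminal string of length <= max_len derivable from start."""
    results = set()
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal, if any.
        idx = next((i for i, ch in enumerate(form) if ch in grammar), None)
        if idx is None:
            results.add(form)          # only terminals left: a generated string
            continue
        for rhs in grammar[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            # Pruning assumes no empty right-hand sides, so forms never shrink.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(results, key=lambda s: (len(s), s))
```

With the toy grammar, `generate(GRAMMAR)` yields the strings of the form a^n b^n up to length 6.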
A context-sensitive grammar (CSG) is a grammar where each production has the form wAx → wyx, where A is a nonterminal, w and x are strings of terminals and nonterminals, and y is a nonempty string of terminals and nonterminals. In other words, the productions give rules saying "if you see A in a given context, you may replace A by the string y." It's unfortunate that these grammars are called "context-sensitive grammars" because it means that "context-free" and "context-sensitive" are not opposites, and it means that there are certain classes of grammars that arguably take a lot of contextual information into account but aren't formally considered to be context-sensitive.
To determine whether a string is contained in a CFG or a CSG, there are many approaches. First, you could build a recognizer for the given grammar. For CFGs, the pushdown automaton (PDA) is a type of automaton that accepts precisely the context-free languages, and there is a simple construction for turning any CFG into a PDA. For the context-sensitive grammars, the automaton you would use is called a linear bounded automaton (LBA).
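As a small illustration of the pushdown idea, the recognizer below uses an explicit stack to accept the balanced-bracket language, a standard context-free but non-regular example. It is a hand-written deterministic sketch, not the general CFG-to-PDA construction, and the names `PAIRS` and `balanced` are hypothetical.

```python
# Opening bracket -> the closing bracket that must eventually match it.
PAIRS = {"(": ")", "[": "]"}

def balanced(s):
    """Accept strings of properly nested () and [] using a stack."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(PAIRS[ch])    # push the closer we now expect
        elif not stack or stack.pop() != ch:
            return False               # unexpected symbol or wrong closer
    return not stack                   # accept only if nothing is left open
```

The stack is what a finite automaton lacks: it lets the recognizer remember arbitrarily deep nesting.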
However, these above approaches, if treated naively, are not very efficient. To determine whether a string is contained in the language of a CFG, there are far more efficient algorithms. For example, many grammars can have LL(k) or LR(k) parsers built for them, which allows you to (in linear time) decide whether a string is contained in the grammar. All grammars can be parsed using the Earley parser, which in O(n^3) can determine whether a string of length n is contained in the grammar (interestingly, it can parse any unambiguous CFG in O(n^2), and with lookaheads can parse any LR(k) grammar in O(n) time!). If you were purely interested in the question "is string x contained in the language generated by grammar G?", then one of these approaches would be excellent. If you wanted to know how the string x was generated (by finding a parse tree), you can adapt these approaches to also provide this information. However, parsing CSGs is, in general, PSPACE-complete, so there are no known parsing algorithms for them that run in worst-case polynomial time. There are some algorithms that in practice tend to run quickly, though. The authors of Parsing Techniques: A Practical Guide (see below) have put together a fantastic page containing all sorts of parsing algorithms, including one that parses context-sensitive languages.
If you're interested in learning more about parsing, consider checking out the excellent book "Parsing Techniques: A Practical Guide, Second Edition" by Grune and Jacobs, which discusses all sorts of parsing algorithms for determining whether a string is contained in a grammar and, if so, how it is generated by the parsing algorithm.

As was said before, a grammar doesn't accept a string; it is simply a way to generate the words of the language you are analyzing. In formal language theory, the grammar is the generative device, whereas an automaton does what you are describing: it recognizes specific strings.
In particular, you need a linear bounded automaton to recognize Type 1 languages (the context-sensitive languages in the Chomsky hierarchy).
A grammar for a specific language only lets you specify the properties shared by all the strings that belong to the CS language.
I hope that my explanation was clear.

One easy way to show that a grammar generates a string is to show a derivation: the sequence of production rules that produces that string.

Pseudocode interpreter?

Like lots of you guys on SO, I often write in several languages. And when it comes to planning stuff, (or even answering some SO questions), I actually think and write in some unspecified hybrid language. Although I used to be taught to do this using flow diagrams or UML-like diagrams, in retrospect, I find "my" pseudocode language has components of C, Python, Java, bash, Matlab, perl, Basic. I seem to unconsciously select the idiom best suited to expressing the concept/algorithm.
Common idioms might include Java-like braces for scope, pythonic list comprehensions or indentation, C++-like inheritance, C#-style lambdas, matlab-like slices and matrix operations.
I noticed that it's actually quite easy for people to recognise exactly what I'm trying to do, and quite easy for people to intelligently translate into other languages. Of course, that step involves considering the corner cases, and the moments where each language behaves idiosyncratically.
But in reality, most of these languages share a subset of keywords and library functions which generally behave identically - maths functions, type names, while/for/if etc. Clearly I'd have to exclude many 'odd' languages like lisp, APL derivatives, but...
So my questions are,
1. Does code already exist that recognises the programming language of a text file? (Surely this must be a less complicated task than eclipse's syntax trees or than google translate's language guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?
2. Is it theoretically possible to create a single interpreter or compiler that recognises what language idiom you're using at any moment and (maybe "intelligently") executes or translates to a runnable form. And flags the corner cases where my syntax is ambiguous with regards to behaviour. Immediate difficulties I see include: knowing when to switch between indentation-dependent and brace-dependent modes, recognising funny operators (like *pointer vs *kwargs) and knowing when to use list vs array-like representations.
3. Is there any language or interpreter in existence, that can manage this kind of flexible interpreting?
4. Have I missed an obvious obstacle to this being possible?
edit
Thanks all for your answers and ideas. I am planning to write a constraint-based heuristic translator that could, potentially, "solve" code for the intended meaning and translate into real python code. It will notice keywords from many common languages, and will use syntactic clues to disambiguate the human's intentions - like spacing, brackets, optional helper words like let or then, context of how variables are previously used etc, plus knowledge of common conventions (like capital names, i for iteration, and some simplistic limited understanding of naming of variables/methods e.g containing the word get, asynchronous, count, last, previous, my etc). In real pseudocode, variable naming is as informative as the operations themselves!
Using these clues it will create assumptions as to the implementation of each operation (like 0/1 based indexing, when should exceptions be caught or ignored, what variables ought to be const/global/local, where to start and end execution, and what bits should be in separate threads, notice when numerical units match / need converting). Each assumption will have a given certainty - and the program will list the assumptions on each statement, as it coaxes what you write into something executable!
For each assumption, you can 'clarify' your code if you don't like the initial interpretation. The libraries issue is very interesting. My translator, like some IDE's, will read all definitions available from all modules, use some statistics about which classes/methods are used most frequently and in what contexts, and just guess! (adding a note to the program to say why it guessed as such...) I guess it should attempt to execute everything, and warn you about what it doesn't like. It should allow anything, but let you know what the several alternative interpretations are, if you're being ambiguous.
It will certainly be some time before it can manage such unusual examples as @Albin Sunnanbo's ImportantCustomer example. But I'll let you know how I get on!

I think that is quite useless for everything but toy examples and strict mathematical algorithms. For everything else the language is not just the language. There are lots of standard libraries and whole environments around the languages. I think I write almost as many lines of library calls as I write "actual code".
In C# you have .NET Framework, in C++ you have STL, in Java you have some Java libraries, etc.
The differences between those libraries are too big to be just syntactic nuances.
<subjective>
There have been attempts at unifying language constructs of different languages into a "unified syntax". That was called a 4GL language and never really took off.
</subjective>
As a side note, I have seen a code example about a page long that was valid as C#, Java, and JavaScript code. That can serve as an example of where it is impossible to determine the actual language used.
Edit:
Besides, the whole purpose of pseudocode is that it does not need to compile in any way. The reason you write pseudocode is to create a "sketch", however sloppy you like.
foreach c in ImportantCustomers{== OrderValue >=$1M}
SendMailInviteToSpecialEvent(c)
Now tell me what language it is and write an interpreter for that.

1. To detect what programming language is used: Detecting programming language from a snippet
2. I think it should be possible. The approach in 1. could be leveraged to do this, I think. I would try to do it iteratively: detect the syntax used in the first line/clause of code, "compile" it to intermediate form based on that detection, along with any important syntax (e.g. begin/end wrappers). Then the next line/clause etc. Basically write a parser that attempts to recognize each "chunk". Ambiguity could be flagged by the same algorithm.
3. I doubt that this has been done ... seems like the cognitive load of learning to write e.g. python-compatible pseudocode would be much easier than trying to debug the cases where your interpreter fails.
4. a. I think the biggest problem is that most pseudocode is invalid in any language. For example, I might completely skip object initialization in a block of pseudocode because for a human reader it is almost always straightforward to infer. But for your case it might be completely invalid in the language syntax of choice, and it might be impossible to automatically determine e.g. the class of the object (it might not even exist). Etc.
   b. I think the best you can hope for is an interpreter that "works" (subject to 4a) for your pseudocode only, no-one else's.
Note that I don't think that 4a,4b are necessarily obstacles to it being possible. I just think it won't be useful for any practical purpose.

Recognizing what language a program is in is really not that big a deal. Recognizing the language of a snippet is more difficult, and recognizing snippets that aren't clearly delimited (what do you do if four lines are Python and the next one is C or Java?) is going to be really difficult.
Assuming you got the lines assigned to the right language, doing any sort of compilation would require specialized compilers for all languages that would cooperate. This is a tremendous job in itself.
Moreover, when you write pseudo-code you aren't worrying about the syntax. (If you are, you're doing it wrong.) You'll wind up with code that simply can't be compiled because it's incomplete or even contradictory.
And, assuming you overcame all these obstacles, how certain would you be that the pseudo-code was being interpreted the way you were thinking?
What you would have would be a new computer language, that you would have to write correct programs in. It would be a sprawling and ambiguous language, very difficult to work with properly. It would require great care in its use. It would be almost exactly what you don't want in pseudo-code. The value of pseudo-code is that you can quickly sketch out your algorithms, without worrying about the details. That would be completely lost.
If you want an easy-to-write language, learn one. Python is a good choice. Use pseudo-code for sketching out how processing is supposed to occur, not as a compilable language.

An interesting approach would be a "type-as-you-go" pseudocode interpreter. That is, you would set the language to be used up front, and then it would attempt to convert the pseudo code to real code, in real time, as you typed. An interactive facility could be used to clarify ambiguous stuff and allow corrections. Part of the mechanism could be a library of code which the converter tried to match. Over time, it could learn and adapt its translation based on the habits of a particular user.
People who program all the time will probably prefer to just use the language in most cases. However, I could see the above being a great boon to learners, "non-programmer programmers" such as scientists, and for use in brainstorming sessions with programmers of various languages and skill levels.
-Neil

Programs interpreting human input need to be given the option of saying "I don't know." The language PL/I is a famous example of a system designed to find a reasonable interpretation of anything resembling a computer program, which could cause havoc when it guessed wrong: see http://horningtales.blogspot.com/2006/10/my-first-pli-program.html
Note that in the later language C++, when it resolves possible ambiguities it limits the scope of the type coercions it tries, and that it will flag an error if there is not a unique best interpretation.

I have a feeling that the answer to 2. is NO. All I need to prove it false is a code snippet that can be interpreted in more than one way by a competent programmer.

"Does code already exist that recognises the programming language of a text file?"
Yes, the Unix file command.
"(Surely this must be a less complicated task than eclipse's syntax trees or than google translate's language guessing feature, right?) In fact, does the SO syntax highlighter do anything like this?"
As far as I can tell, SO has a one-size-fits-all syntax highlighter that tries to combine the keywords and comment syntax of every major language. Sometimes it gets it wrong:
    def median(seq):
        """Returns the median of a list."""
        seq_sorted = sorted(seq)
        if len(seq) & 1:
            # For an odd-length list, return the middle item
            return seq_sorted[len(seq) // 2]
        else:
            # For an even-length list, return the mean of the 2 middle items
            return (seq_sorted[len(seq) // 2 - 1] + seq_sorted[len(seq) // 2]) / 2
Note that SO's highlighter assumes that // starts a C++-style comment, but in Python it's the integer division operator.
This is going to be a major problem if you try to combine multiple languages into one. What do you do if the same token has different meanings in different languages? Similar situations are:
Is ^ exponentiation like in BASIC, or bitwise XOR like in C?
Is || logical OR like in C, or string concatenation like in SQL?
What is 1 + "2"? Is the number converted to a string (giving "12"), or is the string converted to a number (giving 3)?
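For comparison, here is how one concrete language, Python 3, settles those questions. A hybrid interpreter would have to pick one answer for each of them; the variable names below are only for illustration.

```python
# How Python 3 itself answers the questions above.
xor_result = 2 ^ 3          # ^ is bitwise XOR here, so this is 1 (2 ** 3 would be 8)
floor_result = 7 // 2       # // is floor division, not a comment: 3

try:
    mixed = 1 + "2"         # Python refuses to guess and raises TypeError
except TypeError:
    mixed = None

# || is not a Python operator at all; logical OR is spelled "or".
either = False or True
```

Python's refusal to add a number to a string is a design choice in the "say I don't know" direction, rather than silently picking one coercion.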
"Is there any language or interpreter in existence, that can manage this kind of flexible interpreting?"
On another forum, I heard a story of a compiler (IIRC, for FORTRAN) that would compile any program regardless of syntax errors. If you had the line
= Y + Z
The compiler would recognize that a variable was missing and automatically convert the statement to X = Y + Z, regardless of whether you had an X in your program or not.
This programmer had a convention of starting comment blocks with a line of hyphens, like this:
C ----------------------------------------
But one day, they forgot the leading C, and the compiler choked trying to add dozens of variables between what it thought were subtraction operators.
"Flexible parsing" is not always a good thing.

To create a "pseudocode interpreter," it might be necessary to design a programming language that allows user-defined extensions to its syntax. There already are several programming languages with this feature, such as Coq, Seed7, Agda, and Lever. A particularly interesting example is the Inform programming language, since its syntax is essentially "structured English."
The Coq programming language allows "syntax extensions", so the language can be extended to parse new operators:
Notation "A /\ B" := (and A B).
Similarly, the Seed7 programming language can be extended to parse "pseudocode" using "structured syntax definitions." The while loop in Seed7 is defined in this way:
syntax expr: .while.().do.().end.while is -> 25;
Alternatively, it might be possible to "train" a statistical machine translation system to translate pseudocode into a real programming language, though this would require a large corpus of parallel texts.
