How to represent vertical alignment of code syntax using BNF, EBNF, etc.?

How do you express (in BNF, EBNF, etc.) that two or more letters are placed in the same vertical alignment?
For example, in Python 2.x we have what we call indentation:
def hello():
    print "hello,"
    print "world"
hello()
Note that the letter p (second line) is placed in the same vertical alignment as the letter p (third line).
Further example (in markdown):
MyHeader
========
topic
-----
Note that M and the first = are placed in the same vertical alignment (likewise r and the last =, t and the first -, and c and the last -).
My question is: how can this vertical alignment of letters be represented using BNF, EBNF, or a similar notation?
Further note:
The point of this question is to find a way to represent vertical alignment in code, not merely to learn how to write a BNF or EBNF grammar for Python or Markdown.

You can parse an indentation-sensitive language (like Python or Haskell) by using a little hack, which is well-described in the Python language reference's chapter on lexical analysis. As described, the lexical analyzer turns leading whitespace into INDENT and DEDENT tokens [Note 1], which are then used in the Python grammar in a straightforward fashion. Here's a small excerpt:
suite ::= stmt_list NEWLINE | NEWLINE INDENT statement+ DEDENT
statement ::= stmt_list NEWLINE | compound_stmt
stmt_list ::= simple_stmt (";" simple_stmt)* [";"]
while_stmt ::= "while" expression ":" suite ["else" ":" suite]
So if you are prepared to describe (or reference) the lexical analysis algorithm, the BNF is simple.
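To make that lexical step concrete, here is a minimal sketch of the indentation-stack algorithm the reference describes (my own illustration, not the actual CPython tokenizer; it ignores tabs, comments, and continuation lines):

def indent_tokens(lines):
    # Sketch of the INDENT/DEDENT algorithm from the Python reference.
    stack = [0]                                  # currently open indentation levels
    for line in lines:
        if not line.strip():
            continue                             # blank lines produce no tokens
        width = len(line) - len(line.lstrip(' '))
        if width > stack[-1]:
            stack.append(width)
            yield ('INDENT',)
        else:
            while width < stack[-1]:
                stack.pop()
                yield ('DEDENT',)
            if width != stack[-1]:
                raise IndentationError('unindent does not match any outer level')
        yield ('LINE', line.strip())
    while stack[-1] > 0:                         # close any still-open blocks at end of input
        stack.pop()
        yield ('DEDENT',)

src = ['def hello():', '    print "hello,"', '    print "world"', 'hello()']
print(list(indent_tokens(src)))

Running it on the question's example yields a DEDENT token just before the final hello() line, which is exactly the token the suite production above consumes.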
However, you cannot actually write that algorithm as a context-free grammar, because it is not context-free. (I'll leave out the proof, but it's similar to the proof that aⁿbⁿcⁿ is not context-free, which you can find in most elementary formal-language textbooks and all over the internet.)
ISO standard EBNF (a free PDF is available) provides a way of including "extensions which a user may require": a Special-sequence, which is any text not containing a ? surrounded on both sides by a ?. So you could abuse the notation by including [Note 2]:
DEDENT = ? See section 2.1.8 of https://docs.python.org/3.3/reference/ ? ;
Or you could insert a full description of the algorithm. Of course, neither of those techniques will allow a parser generator to produce an accurate lexical analyzer, but it would be a reasonable way of communicating intent to a human reader.
It's worth noting that EBNF itself uses a special sequence to define one of its productions:
(* see 4.7 *) syntactic exception
  = ? a syntactic-factor that could be replaced
      by a syntactic-factor containing no
      meta-identifiers
    ? ;
Notes
1. The lexical analyzer also converts some physical newline characters into NEWLINE tokens, while making other newline characters vanish.
2. EBNF normally uses the syntax = rather than ::= for a production, and insists that they be terminated with ;. Comments are enclosed between (* and *).

Related

How can I write a sentence in BNF?

What is the syntax for writing a sentence in BNF? The structure or correctness of it doesn't matter. I only care whether it has 1 or more words, which may or may not be separated by spaces, and which may or may not contain symbols or numbers.
What is the syntax for writing a sentence in BNF?
Backus-Naur form (BNF) isn't a data format like JSON or PDF; it's a description of the grammar that defines a given format. You wouldn't write a sentence in BNF, but you might use BNF to describe what a sentence is. That description would probably enumerate the characters acceptable in a sentence, tell you that words are made up of non-whitespace characters, and that sentences are sequences of words separated by spaces and ending with a sentence-terminating punctuation mark. Of course, there are many rules about what makes a valid sentence in a natural language like English, so if you were going down that route you'd probably also need to create entities for things like subject-phrase, verb-phrase, object-phrase, etc.
A complete English grammar expressed in BNF would surely be exceedingly complex. BNF is better suited to grammars of computer languages and formats. You might have a BNF description of JSON, for example, or of allowable syntax in C or Java.
I only care whether it has 1 or more words, which may or may not be separated by spaces, and which may or may not contain symbols or numbers.
The rules for BNF are available on the Wikipedia page among other places. Again, you wouldn't write a specific sentence with BNF, but you'd say what's allowable for a sentence. So you might have rules like:
<char> ::= "A" | "B" | "C" | "D" ...
<word> ::= <char> | <char> <word>
<terminator> ::= "." | "!" | "?"
<phrase> ::= <word> | <word> " " <phrase>
<sentence> ::= <phrase> <terminator>
Each rule is built up from other rules, or sometimes built recursively... e.g. the rule for phrase says that a phrase is a word or a word plus a space plus a phrase, so you could have any number of space-separated words.
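As an illustration (my own sketch, not part of the answer), the rules above translate almost line for line into a Python checker; the character set is abridged and the function names are made up:

import string

CHARS = set(string.ascii_letters + string.digits)   # the <char> rule, abridged
TERMINATORS = {'.', '!', '?'}                        # the <terminator> rule

def is_word(s):
    # <word>: one or more <char>s
    return len(s) > 0 and all(c in CHARS for c in s)

def is_phrase(s):
    # <phrase> ::= <word> | <word> " " <phrase>
    if is_word(s):
        return True
    head, sep, rest = s.partition(' ')
    return sep == ' ' and is_word(head) and is_phrase(rest)

def is_sentence(s):
    # <sentence> ::= <phrase> <terminator>
    return len(s) > 1 and s[-1] in TERMINATORS and is_phrase(s[:-1])

print(is_sentence('Hello world!'))    # True
print(is_sentence('Hello  world!'))   # False: a double space is not in the grammar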

Determine whether a word is from a language with the help of a stack

What is the algorithm for determining, with the help of a stack, whether a word belongs to a specific language?
I know that I can push the word onto the stack symbol by symbol and record any needed information about the symbols along the way, but that would be no different from simply iterating over the word.
If the language is defined by a context-free grammar, membership of a specific word can be determined efficiently by the so-called CYK-Algorithm.
The language given in the example above can be represented by the following context-free grammar where epsilon denotes the empty string.
S -> epsilon | aSb | ab
Update
For the CYK algorithm to be applicable, the grammar needs to be in Chomsky normal form; for the grammar above, this can be done as follows.
S -> epsilon | AT | AB
T -> SB
A -> a
B -> b
In this formulation, A and B are artificial nonterminal symbols for the terminal symbols a and b; T is an artificial variable introduced because each right-hand side may contain at most two nonterminal symbols.
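For illustration (my own sketch, not from the answer), a direct implementation of the CYK table for the normal-form grammar above could look like this in Python; the rule encoding is an assumption made for the example:

def cyk(word):
    # CYK membership test for S -> epsilon | AT | AB, T -> SB, A -> a, B -> b
    if word == '':
        return True                               # S -> epsilon, handled separately
    binary = {('A', 'T'): 'S', ('A', 'B'): 'S', ('S', 'B'): 'T'}
    unary = {'a': 'A', 'b': 'B'}
    n = len(word)
    # table[i][l] holds the nonterminals that derive word[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, c in enumerate(word):
        if c in unary:
            table[i][1].add(unary[c])
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for x in table[i][split]:
                    for y in table[i + split][length - split]:
                        if (x, y) in binary:
                            table[i][length].add(binary[(x, y)])
    return 'S' in table[0][n]

print(cyk('aabb'))   # True
print(cyk('abab'))   # False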
Maybe this helps for a start
LanguageIdentifier
Rosette Language Identifier
Other than that, you could count the frequency of the characters composing the word and compare it to frequency tables of different languages (this probably won't work for a single word, but for a bunch of sentences it should).

Regex in Ruby on Rails for adding "\n\n" (2 newlines) when "\n" is found [duplicate]

I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?
The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.
If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn not only how to write and debug your own patterns but also how to understand patterns written by others.
Start simple
Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.
Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)
Order from the menu
Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.
The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].
Think of character classes as menus: pick just one.
Helpful shortcuts
Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).
The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
Once is not enough
From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are
* (zero or more times)
+ (one or more times)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
Putting some of these blocks together, the pattern [Nn]*ick matches all of
ick
Nick
nick
Nnick
nNick
nnick
(and so on)
The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
A few other useful examples:
[0-9]+ (and its equivalent \d+) matches any non-negative integer
\d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
Grouping
A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.
To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
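A small sketch in Python's re module (my addition, not the answer's) shows both effects, the scope of the quantifier and the captured text:

import re

# The parentheses make 'abc' repeat as a unit; the group captures its last repetition.
m = re.search(r'0(abc)+0', '0abcabc0')
print(m.group(0))   # '0abcabc0'
print(m.group(1))   # 'abc' (a repeated group keeps only its last match)

# Without grouping, the + applies only to the 'c' immediately before it.
print(bool(re.search(r'0abc+0', '0abcabc0')))   # False
print(bool(re.search(r'0abc+0', '0abccc0')))    # True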
Alternation
Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).
For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
Escaping
Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
Greediness
Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.
For example, say the input is
"Hello," she said, "How are you?"
You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.
To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question, works. It matches a literal left parenthesis, followed by one or more characters (as few as possible), terminated by a literal right parenthesis.
If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
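Here is that behaviour in Python's re module (my own sketch, not part of the original answer):

import re

text = '"Hello," she said, "How are you?"'
print(re.search(r'".+"', text).group())    # greedy: matches from the first " to the last "
print(re.search(r'".+?"', text).group())   # non-greedy: matches only "Hello,"

print(re.findall(r'\((.+?)\)', '(123) (456)'))   # ['123', '456']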
(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)
Anchors
Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.
Say you want to match comments of the form
-- This is a comment --
you'd write ^--\s+(.+)\s+--$.
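A quick sketch of the bookend technique in Python's re module (my example, not the answer's):

import re

comment = '-- This is a comment --'
m = re.match(r'^--\s+(.+)\s+--$', comment)
print(m.group(1))   # 'This is a comment'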
Build your own
Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.
Tools for writing and debugging regexes:
RegExr (for JavaScript)
Perl: YAPE: Regex Explain
Regex Coach (engine backed by CL-PPCRE)
RegexPal (for JavaScript)
Regular Expressions Online Tester
Regex Buddy
Regex 101 (for PCRE, JavaScript, Python, Golang, Java 8)
I Hate Regex
Visual RegExp
Expresso (for .NET)
Rubular (for Ruby)
Regular Expression Library (Predefined Regexes for common scenarios)
Txt2RE
Regex Tester (for JavaScript)
Regex Storm (for .NET)
Debuggex (visual regex tester and helper)
Books
Mastering Regular Expressions (2nd and 3rd editions)
Regular Expressions Cheat Sheet
Regex Cookbook
Teach Yourself Regular Expressions
Free resources
RegexOne - Learn with simple, interactive exercises.
Regular Expressions - Everything you should know (PDF Series)
Regex Syntax Summary
How Regexes Work
JavaScript Regular Expressions
Footnote
†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.

Atom escaping rules in Prolog

I need to export to a file a Prolog program expressed using an arbitrary term representation in Java. The idea is that a Prolog interpreter should be able to consult the generated file afterwards.
My question is about the correct way to write in the file Java Strings representing atom terms.
For example, if the string has a space in the middle, it should be surrounded by single quotes in the file:
hello world becomes 'hello world'
And the exporter should take into consideration characters that should be escaped:
' becomes '\''
Could someone point me to the place where these rules are specified? And can I assume that these rules are respected by major Prolog implementations? (That is, would a Prolog program generated following these rules be correctly parsed by most Prolog interpreters?)
The precise place for this is the standard, ISO/IEC 13211-1:1995, quoted_token (* 6.4.2 *). See this answer for how to get it for USD 30.
The precise syntax is quite complex due to a lot of extras like continuation lines and the like. If you are only writing atoms that should be read by Prolog, things are a bit easier. Also in that situation, you could always quote, which makes writing again a bit simpler.
Some things to be aware of:
Only simple spaces may occur as layout in a quoted atom. All other layout characters need to be escaped, like \t and \n (the escape letters are a, b, r, f, t, n, v). Many systems also accept other layout characters, but they differ from one another in very tiny details.
Backslash and quote must be escaped.
Characters outside the printable ASCII range depend on the processor character set (PCS) supported by a system. In a conforming system, the accompanying documentation should define how the additional characters (extended characters) are classified. Documentation quality varies over a wide range.
In any case, also test your interface with GNU Prolog from 1.4.1 upwards. To date, no differences are known between GNU 1.4.1+ and the standard as far as syntax is concerned.
Here are some 240+ syntax-related test cases. Please report any oversight!
A practical hint: if you call writeq in your Prolog system with the data you care about, you'll get quotes around it when required.
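As a rough illustration of those rules (a sketch in Python rather than the asker's Java, and deliberately conservative: it always quotes, as suggested above, and only handles the escapes discussed in this answer, not extended characters):

def quote_atom(name):
    # Always quote; escape backslash, single quote and the layout characters mentioned above.
    escapes = {'\\': '\\\\', "'": "\\'", '\n': '\\n', '\t': '\\t',
               '\r': '\\r', '\f': '\\f', '\v': '\\v', '\b': '\\b', '\a': '\\a'}
    body = ''.join(escapes.get(c, c) for c in name)
    return "'" + body + "'"

print(quote_atom('hello world'))   # 'hello world'
print(quote_atom("it's"))          # 'it\'s'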

Automata with kleene star

I'm learning about automata. Can you please help me understand how automata with Kleene closure work? Let's say I have the letters a, b, c and I need to find text that ends with a Kleene star - like ab*bac - how will it work?
The question seems to be more about how an automaton would handle Kleene closure than what Kleene closure means.
With a simple regular expression, e.g., abc, it's pretty straightforward to design an automaton to recognize it. Each state essentially tells you where you are in the expression so far. State 0 means it's seen nothing yet. State 1 means it's seen a. State 2 means it's seen ab. Etc.
The difficulty with Kleene closure is that a pattern like ab*bc introduces ambiguity. Once the automaton has seen the a and is then faced with a b, it doesn't know whether that b is part of the b* or the literal b that follows it, and it won't know until it reads more symbols--maybe many more.
The simplistic answer is that the automaton simply has a state that literally means it doesn't know yet which path was taken.
In simple cases, you can build this automaton directly. In general cases, you usually build something called a non-deterministic finite automaton. You can either simulate the NDFA, or--if performance is critical--you can apply an algorithm that converts the NDFA to a deterministic one. The algorithm essentially generates all the ambiguous states for you.
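To make the simulation idea concrete (my own sketch, not from the answer), here is an NFA for the answer's example ab*bc, simulated in Python by tracking the set of states it could be in; the state numbering is an assumption for illustration:

def nfa_accepts(word):
    # NFA for ab*bc, simulated by tracking the set of possible states.
    # States: 0 = start, 1 = after 'a' (inside b*), 2 = after the literal 'b', 3 = accept.
    delta = {
        (0, 'a'): {1},
        (1, 'b'): {1, 2},   # the ambiguity: this 'b' may belong to b* or be the literal b
        (2, 'c'): {3},
    }
    states = {0}
    for symbol in word:
        states = set().union(*(delta.get((s, symbol), set()) for s in states))
        if not states:
            return False
    return 3 in states

for w in ['abc', 'abbbbc', 'ac']:
    print(w, nfa_accepts(w))   # abc True, abbbbc True, ac False

The subset construction mentioned above would turn each of these state sets (such as {1, 2}) into a single state of a deterministic automaton.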
The Kleene star ('*') means you can have as many occurrences of the preceding character (or group) as you want (0 or more).
a* will match any number of a's.
(ab)* will match any number of the string "ab"
If you are trying to match an actual asterisk in an expression, the way you would write it depends entirely on the syntax of the regex you are working with. In the general case, the backslash \ is used as an escape character:
\* will match an asterisk.
For recognizing a pattern at the end, use concatenation:
(a U b)*c* will match any string that contains 0 or more 'c's at the end, preceded by any number of a's or b's.
For matching text that ends with a Kleene star, again, you can have 0 or more occurrences of the string:
ab(c)* - Possible matches: ab, abc, abcc, abccc, etc.
a(bc)* - Possible matches: a, abc, abcbc, abcbcbc, etc.
Your expression ab*bac in English would read something like:
a followed by 0 or more b followed by bac
Strings that would match the regular expression if used for search:
abac
abbbbbbbbbbac
abbac
Strings that would not match:
abaca //added extra literal
bac //missing leading a
As stated in the previous answer, actually searching for a * would require an escape character, which is implementation-specific and would require knowledge of your language/library of choice.

Resources