From the examples of AndExpr and OrExpr in the Booleans section of the XPath spec, it's clear that lowercase words (and, or) are valid. However, this doc does not explicitly mention whether upper-case (AND, OR), or mixed-case variants of these keywords are valid.
Are non-lowercase variants of these keywords valid?
Yes, boolean expression operators are case-sensitive.
No, AND and OR are not legal boolean operator keywords in XPath.
The BNF you cite clearly shows lower case and and or and mentions nothing about case insensitivity. Keywords in most languages are case-sensitive.
Any compliant XPath processor will emit an error upon encountering AND or OR where and or or are expected.
Related
I have these two rules from a jflex code:
Bool = true
Ident = [:letter:][:letterdigit:]*
if I try for example to analyse the word "trueStat", it gets recognnized as an Ident expression and not Bool.
How can I avoid this type of ambiguity in Jflex?
In almost all languages, a keyword is only recognised as such if it is a complete word. Otherwise, you would end up banning identifiers like format, downtime and endurance (which would instead start with the keywords for, do and end, respectively). That's quite confusing for programmers, although it's not unheard-of. Lexical scanner generators, like Flex and JFlex generally try to make the common case easy; thus, the snippet you provide, which recognises trueStat as an identifier. But if you really want to recognise it as a keyword followed by an identifier, you can accomplish that by adding trailing context to all your keywords:
Bool = true/[:letterdigit:]*
Ident = [:letter:][:letterdigit:]*
With that pair of patterns, true will match the Bool rule, even if it occurs as trueStat. The pattern matches true and any alphanumeric string immediately following it, and then rewinds the input cursor so that the token matched is just true.
Note that like Lex and Flex, JFlex accepts the longest match at the current input position; if more than one rule accepts this match, the action corresponding to the first such rule is executed. (See the manual section "How the Input is Matched" for a slightly longer explanation of the matching algorithm.) Trailing context is considered part of the match for the purposes of this rule (but, as noted above, is then removed from the match).
The consequence of this rule is that you should always place more specific patterns before the general patterns they might override, whether or not the specific pattern uses trailing context. So the Bool rule must precede the Ident rule.
R7RS-small says that all identifiers must be terminated by a delimiter, but at the same time it defines pretty elaborate rules for what can be in an identifier. So, which one is it?
Is an identifier supposed to start with an initial character and then continue until a delimiter, or does it start with an initial character and continue following the syntax defined in 7.1.1.
Here are a couple of obvious cases. Are these valid identifiers?
a#a
b,b
c'c
d[d]
If they are not supposed to be valid, what is the purpose of saying that an identifier must be terminated by a delimiter?
|..ident..| are delimiters for symbols in R7RS, to allow any character that you cannot insert in an old style symbol (| is the delimiter).
However, in R6RS the "official" grammar was incorrect, as it did not allow to define symbols such that 1+, which led all implementations define their own rules to overcome this illness of the official grammar.
Unless you need to read the source code of a given implementation and see how it defines the symbols, you should not care too much about these rules and use classical symbols.
In the section 7.1.1 you find the backus-naur form that defines the lexical structure of R7RS identifiers but I doubt the implementations follow it.
I quote from here
As with identifiers, different implementations of Scheme use slightly
different rules, but it is always the case that a sequence of
characters that contains no special characters and begins with a
character that cannot begin a number is taken to be a symbol
In other words, an implementation will use a function like read-atom and after that it will classify an atom by backtracking with read-number and if number? fails it will be a symbol.
I came across this: /sera/ === coursera. What does /sera/ mean? Please tell me. I do not understand the meaning of the expression above.
It's a regular expression. The more formal version of same is this:
coursera.match(/sera/)
Or:
/sera/.match(coursera)
These are both functionally similar. Either a string matches a regular expression, or a regular expression can be tested for matches against a string.
The long explanation of your original code is: Are the characters sera can be found in the variable coursera?
If you do this:
"coursera".match(/sera/)
# => #<MatchData "sera">
You get a MatchData result which means it matched. For more complicated expressions you can capture parts of the string using arbitrary patterns and so on. The general rule here is regular expressions in Ruby look like /.../ or vaguely like %r[...] in form.
You may also see the =~ operator used which is something Ruby inherited from Perl. It also means match.
Using Ruby I'd like to take a Regexp object (or a String representing a valid regex; your choice) and tokenize it so that I may manipulate certain parts.
Specifically, I'd like to take a regex/string like this:
regex = /var (\w+) = '([^']+)';/
parts = ["foo","bar"]
and create a replacement string that replaces each capture with a literal from the array:
"var foo = 'bar';"
A naïve regex-based approach to parsing the regex, such as:
i = -1
result = regex.source.gsub(/\([^)]+\)/){ parts[i+=1] }
…would fail for things like nested capture groups, or non-capturing groups, or a regex that had a parenthesis inside a character class. Hence my desire to properly break the regex into semantically-valid pieces.
Is there an existing Regex parser available for Ruby? Is there a (horror of horrors) known regex that cleanly matches regexes? Is there a gem I've not found?
The motivation for this question is a desire to find a clean and simple answer to this question.
I have a JavaScript project on GitHub called: Dynamic (?:Regex Highlighting)++ with Javascript! you may want to look at. It parses PCRE compatible regular expressions written in both free-spacing and non-free-spacing modes. Since the regexes are written in the less-feature-rich JavaScript syntax, these regexes could be easily converted to Ruby.
Note that regular expressions may contain arbitrarily nested parentheses structures and JavaScript has no recursive regex features, so the code must parse the tree of nested parens from the-inside-out. Its a bit tricky but works quite well. Be sure to try it out on the highlighter demo page, where you can input and dynamically highlight any regex. The JavaScript regular expressions used to parse regular expressions are documented here.
This question was asked to me in an interview question:
Write a code to generate the parse tree like compilers do internally for any given expression. For example:
a+(b+c*(e/f)+d)*g
Start by defining the language. No one can implement a parser or a compiler to a language that isn't very well defined. You give an example: 'a+(b+c*(e/f)+d)*g', which should trigger the following questions:
Is the language a single expression, or there may be multiple statements (separated by ';' maybe?
what are the 'a', 'b', ... 'g' tokens? Is it variable? What is the syntax of variables? Is it a C-like variable, or is it a single alphanumeric character as your example may imply.
There are 3 binary expression in your example. Is that all? Does the language also support '-'. Does your language support logical, and bit-wise operators?
Does the language support number literals? integer only? double? Does the language support string literals? Do you quote string literals?
Syntax for comments?
Which operator has precedence? Does '*' operator has precedence over '+' as in the example? Does operands evaluated right to left or left to right?
Any Pre-processing?
Once you are equipped with a good definition of the language syntax, start with implementing a tokenizer. A tokenizer gets a stream of characters and generates a list of tokens. In the example above, each character is a token, but in var*12 (var power 12) there are 3 tokens: 'var', '*' and '12'. If regular expression is permitted, it is possible you can do this part of the parsing with regular expressions.
Next, have a function that identify each token by type: is it an operator, is it a variable, a number literal, string literal, etc. Package all in a method called NextToken that returns a token and its type.
Finally, start parsing. In your sample above the root of the parsing tree will be a node with the '+' operator (which has precedence over the ''). The left child is a variable token 'a' and the right child is a tree with a root element the '' token. Work recursively.
Simple way out is to convert your expression into postfix notation (abcef/*++) & then refer to the answer to this question (http://stackoverflow.com/questions/423898/postfix-notation-to-expression-tree) for converting the postfix expression to a Tree.
This is what the interviewer expected :)
Whenever you intend to write a parser, the main question to ask is if you want to do it manually, or to use a parser generator framework.
In this case, I would say that it's a good exercise to write it all yourself.
Start with a good representation for the tree itself. This will be be output of your algorithm. For example, this could be a collection of objects, where one object kind could represent a "label" like a, b, and c in your example. Others could represent numbers. You could then defined a representation of operators, for example + is a binary operator, which would have two subobjects, representing the left and right subexpression.
Next step is the actual parser, I would suggest a classical recursive decent parser. One text describing this, and provides a standard pseudo-code implementation is this text by Theodore Norvell
I'd start with a simple grammar, something like those used by ANTLR and JavaCC.