I am learning how to write a compiler with ANTLR. While looking around for background, I came across the lexer rules below.
What I want to ask is: what does each symbol do?
Specifically, what do the '?', '+' and '*' in these expressions do?
FLOAT
: [0-9]+ '.' {_input.LA(1) != '.'}?
| [0-9]* '.' [0-9]+
;
INT
: [0-9]+
;
Do any of you know where to start learning these expressions?
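These symbols are regular-expression quantifiers; see this tutorial about regular expressions in Python (the syntax is very similar across languages).
* means match the previous thing zero or more times; for example, a* will match the empty string, a, aa, ...
+ means match the previous thing one or more times; for example, a+ will match a, aa, ...
? means match the previous thing zero or one times; for example, a? will match the empty string or a.
Note that the ? directly after {...} in the FLOAT rule is different: in ANTLR, {...}? is a semantic predicate, and that alternative only matches when the condition inside the braces is true.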
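The three quantifiers can be checked quickly with Python's re module. This is only a sketch: the helper name fully_matches and the sample strings are made up for illustration, and ANTLR lexer rules are not Python regexes, although these quantifiers behave the same way in both.

```python
import re

# Each entry: pattern -> (strings that should match fully, strings that should not)
cases = {
    r"a*": (["", "a", "aaa"], ["b"]),
    r"a+": (["a", "aaa"], [""]),
    r"a?": (["", "a"], ["aa"]),
}

def fully_matches(pattern, text):
    """Return True when the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

for pattern, (accepted, rejected) in cases.items():
    assert all(fully_matches(pattern, s) for s in accepted)
    assert not any(fully_matches(pattern, s) for s in rejected)

# The INT rule from the question, [0-9]+, written as a Python regex.
print(fully_matches(r"[0-9]+", "2024"))
print(fully_matches(r"[0-9]+", ""))
```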
I'm trying to find an appropriate expressions to match C++ integer suffix, which is, following cppreference:
integer-suffix, if provided, may contain one or both of the following (if both are provided, they may appear in any order):
unsigned-suffix (the character u or the character U)
long-suffix (the character l or the character L) or the long-long-suffix (the character sequence ll or the character sequence LL) (since C++11)
As of now, the best pattern I was able to write is
/u?(ll|l)?u?/i
But this will match uu which isn't allowed per the standard… Is there a better regex?
edit
In the lexer I'm currently working on, we parse integers as follows (C rules, C++ rules are similar):
rule /\d+[lu]*/i, Num::Integer
rule /0[0-7]+[lu]*/i, Num::Oct
rule /\d+[lu]*/i, Num::Integer
As one can see, the matching of the suffix is matching a lot more than what is defined in the standard. My goal is to rewrite this as:
isuffix = /u?(ll|l)?u?/i
rule /\d+#{isuffix}/i, Num::Integer
rule /0[0-7]+#{isuffix}/i, Num::Oct
rule /\d+#{isuffix}/i, Num::Integer
Pure Ruby... U knoL
%w(u ul ull l lu ll llu).include? suffix.downcase
But if you insist:
/u?ll?|l?l?u/i
The first part handles the u before the ls and requires an l.
The second part handles the u after the ls and requires the u.
If you want to include an empty suffix as a possibility, you can add optional matching for these characters as well.
Note that this expects that the lexer will fail if there are some leftovers from the suffix.
See it in action
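As a sanity check, the regex above can be compared with the explicit list of suffixes. The sketch below uses Python's re as a stand-in for the Ruby regex (this small pattern uses no engine-specific syntax); the names VALID, SUFFIX and is_valid_suffix are illustrative only, and the match is anchored with fullmatch because the pattern itself has no anchors.

```python
import re
from itertools import product

# The seven suffixes allowed by the quoted rules (lower-cased).
VALID = {"u", "ul", "ull", "l", "lu", "ll", "llu"}
SUFFIX = re.compile(r"u?ll?|l?l?u", re.IGNORECASE)

def is_valid_suffix(text):
    """True when the whole of text is an integer suffix."""
    return SUFFIX.fullmatch(text) is not None

# Every string of length 1 to 4 built from the letters u and l.
candidates = ["".join(p) for n in range(1, 5) for p in product("ul", repeat=n)]
accepted = {s for s in candidates if is_valid_suffix(s)}
print(sorted(accepted))
```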
Updated answer
If you're looking for a suffix so that /\d('?\d)*#{suffix}/ matches decimal integers, you can use:
suffix = /(ul?l?|ll?u?)?\b/i
Here is a Rubular example. It matches 1 in l1 and 11 in c++11 though, because there's no lookbehind before \d.
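A rough Python equivalent of the combined pattern shows both the intended matches and the false positives mentioned above. The names DECIMAL and first_integer and the sample inputs are assumptions made for this sketch, and Python's re stands in for Ruby's engine.

```python
import re

SUFFIX = r"(ul?l?|ll?u?)?\b"
# Digits with optional ' separators, followed by an optional suffix.
DECIMAL = re.compile(r"\d('?\d)*" + SUFFIX, re.IGNORECASE)

def first_integer(text):
    """Return the first decimal integer literal found in text, or None."""
    m = DECIMAL.search(text)
    return m.group(0) if m else None

print(first_integer("x = 1'000'000ull;"))
print(first_integer("42uu"))   # invalid suffix: the \b cannot be satisfied
print(first_integer("c++11"))  # false positive noted above
```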
Old answers
This will find a non-empty suffix anywhere in the string:
/(?<![a-z])(ul?l?|ll?u?)\b/i
It means:
u, ul, ull or
l, ll, lu or llu
followed by a word boundary and preceded by anything but another letter.
Other answers without boundaries match "uu" for example.
Here is a Rubular example.
If your string is just the suffix and you want to check that it is correct:
/^(ul?l?|ll?u?)?$/i
Here is another example.
Force failing using negative lookahead.
For example:
/(?!u(ll|l)?u)u?(ll|l)?u?/i
or
/(?!ul*u)u?l{0,2}u?/i
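To see what the lookahead buys, both variants can be checked exhaustively over short strings of u and l. This is a sketch using Python's re in place of Ruby; VALID, PATTERNS and accepted_by are names invented here, and the empty string is included because both patterns allow an empty suffix.

```python
import re
from itertools import product

VALID = {"", "u", "ul", "ull", "l", "lu", "ll", "llu"}
PATTERNS = [
    r"(?!u(ll|l)?u)u?(ll|l)?u?",
    r"(?!ul*u)u?l{0,2}u?",
]

# Every string of length 0 to 4 built from the letters u and l.
candidates = ["".join(p) for n in range(0, 5) for p in product("ul", repeat=n)]

def accepted_by(pattern):
    """Set of candidate strings that pattern matches in full."""
    rx = re.compile(pattern, re.IGNORECASE)
    return {s for s in candidates if rx.fullmatch(s)}

for pattern in PATTERNS:
    print(pattern, sorted(accepted_by(pattern)))

# Without the lookahead, strings with two u's slip through.
print(sorted(accepted_by(r"u?(ll|l)?u?") - VALID))
```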
My 2 cents, for whatever it's worth: sometimes it just pays to be explicit and not try to be too fancy. I think this is one of those times. Here's my regex:
/(?<=\d)(u|ul|ull|l|lu|ll|llu)(?=([^ul]|$))/i
Well the idea was simple...
I'm trying to censor letters in a word with word.gsub(/[^#{guesses}]/i, '-'), where word and guesses are strings.
When guesses is "", I get this error RegexpError: empty char-class: /[^]/i. I could sort such cases with an if/else statement, but can I add something to the regex to make it work in one line?
Since you are only matching (or not matching) letters, you can add a non-letter character to your regex, e.g. # or %:
word.gsub(/[^%#{guesses}]/i, '-')
See IDEONE demo
If #{guesses} is empty, the regex will still be valid, and since % does not appear in a word, there is no risk of failing to censor a guessed percent sign.
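The same trick carries over to other engines. Here is a minimal Python sketch (Python's re also rejects an empty negated class such as [^]); the function name censor and the sample words are assumptions for illustration, not part of the original code.

```python
import re

def censor(word, guesses):
    """Replace every character of word that is not in guesses with '-'."""
    # The leading % keeps the class non-empty; re.escape guards characters
    # such as ] or - that are special inside a character class.
    pattern = "[^%" + re.escape(guesses) + "]"
    return re.sub(pattern, "-", word, flags=re.IGNORECASE)

print(censor("Hangman", ""))
print(censor("Hangman", "an"))
```

Escaping the guesses is an addition to the one-liner above: it only matters if guesses can contain non-letter characters.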
You have two options. One is to skip the substitution entirely when guesses is empty, that is:
unless guesses.empty?
  word.gsub(/[^#{Regexp.escape(guesses)}]/i, '-')
end
Although that's not your intention, it's really the safest plan here and is the most clear in terms of code.
Or you could use the tr function instead, though only for non-empty strings, so this could be substituted inside the unless block:
word.tr('^' + guesses.downcase + guesses.upcase, '-')
Generally tr performs better than gsub if used frequently. It also doesn't require any special escaping.
Edit: Added a note about tr not working on empty strings.
Since tr treats ^ as a special case on empty strings, you can use an embedded ternary, but that ends up confusing what's going on considerably:
word.tr(guesses.empty? ? '' : ('^' + guesses.downcase + guesses.upcase), '-')
This may look somewhat similar to tadman's answer.
You should probably keep the string that represents what you want to hide, instead of what you want to show. Let's say this is remains. Then it would be as easy as:
word.tr(remains.upcase + remains.downcase, "-")
I'm trying to execute this code :
import re
pattern = r"(\w+)\*([\w\s]+)*/$"
re_compiled = re.compile(pattern)
results = re_compiled.search('COPRO*HORIZON 2000 HOR')
print(results.groups())
But Python does not respond. The process takes 100% of the CPU and does not stop. I've tried this both on Python 2.7.1 and Python 3.2 with identical results.
Your regex runs into catastrophic backtracking because you have nested quantifiers (([...]+)*). Since your regex requires the string to end in / (which fails on your example), the regex engine tries every possible way of dividing the string between the inner + and the outer * in the vain hope of finding a matching combination. That's where it gets stuck.
To illustrate, let's assume "A*BCD" as the input to your regex and see what happens:
(\w+) matches A. Good.
\* matches *. Yay.
[\w\s]+ matches BCD. OK.
/ fails to match (no characters left to match). OK, let's back up one character.
/ fails to match D. Hum. Let's back up some more.
[\w\s]+ matches BC, and the repeated [\w\s]+ matches D.
/ fails to match. Back up.
/ fails to match D. Back up some more.
[\w\s]+ matches B, and the repeated [\w\s]+ matches CD.
/ fails to match. Back up again.
/ fails to match D. Back up some more, again.
How about [\w\s]+ matches B, repeated [\w\s]+ matches C, repeated [\w\s]+ matches D? No? Let's try something else.
[\w\s]+ matches BC. Let's stop here and see what happens.
Darn, / still doesn't match D.
[\w\s]+ matches B.
Still no luck. / doesn't match C.
Hey, the whole group is optional (...)*.
Nope, / still doesn't match B.
OK, I give up.
Now that was a string of just three letters. Yours had about 30, trying all permutations of which would keep your computer busy until the end of days.
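The growth can be made concrete by counting how many ways the nested ([\w\s]+)* can carve a run of n characters into consecutive non-empty pieces. This is a simplified model of the walkthrough above, not a trace of the real engine, and count_splits is a name invented for this sketch.

```python
from functools import lru_cache

def count_splits(n):
    """Number of ways to cut n characters into one or more non-empty runs."""
    @lru_cache(maxsize=None)
    def ways(remaining):
        if remaining == 0:
            return 1  # nothing left: exactly one way to stop
        # Choose the length of the first run, then split the rest.
        return sum(ways(remaining - first) for first in range(1, remaining + 1))
    return ways(n)

for n in (3, 10, 20, 30):
    print(n, count_splits(n))
```

The count doubles with every extra character (2^(n-1) splits for n characters), so three letters give 4 splits while 30 give over 500 million, each of which has to fail before the engine gives up.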
I suppose what you're trying to do is to get the strings before/after *, in which case, use
pattern = r"(\w+)\*([\w\s]+)$"
Try re2 or any other regular expression engine based on automata theory. The engine in the current Python re module is a backtracking engine (for now; this may change in the future). Automata-based engines have some restrictions, though: they won't let you use backreferences, for example. Check the re2 syntax page to find out whether it satisfies your needs.
Interestingly, Perl runs it very quickly
-> perl -e 'print "Match\n" if "COPRO*HORIZON 2000 HOR" =~ m|(\w+)\*([\w\s]+)*/$|'
-> perl -e 'print "Match\n" if "COPRO*HORIZON 2000 HOR/" =~ m|(\w+)\*([\w\s]+)*/$|'
Match
Looks like it might be something in your pattern. I'm not sure what you are trying to do with the last '*' in your expression. The following code seems to work for me:
import re
pattern = r"(\w+)\*([\w\s]+)$"
re_compiled = re.compile(pattern)
results = re_compiled.search('COPRO*HORIZON 2000 HOR')
print(results.groups())
Today I came across the following regular expression and wanted to know what Ruby would do with it:
> "#a" =~ /^[\W].*+$/
=> 0
> "1a" =~ /^[\W].*+$/
=> nil
In this instance, Ruby seems to be ignoring the + character. If that is incorrect, I'm not sure what it is doing with it. I'm guessing it's not being interpreted as a quantifier, since the * is not escaped and is being used as a quantifier. In Perl/Ruby regexes, sometimes when a character (e.g., -) is used in a context in which it cannot be interpreted as a special character, it is treated as a literal. But if that was happening in this case, I would expect the first match to fail, since there is no + in the lvalue string.
Is this a subtly correct use of the + character? Is the above behavior a bug? Am I missing something obvious?
Well, you can certainly use a + after a *. You can read a bit about it on this site. The + after the * is called a possessive quantifier.
What it does? It prevents * from backtracking.
Ordinarily, when you use something like .*c to match abcde, the .* will first match the whole string (abcde). Since the regex cannot then match c after the .*, the engine goes back one character at a time to check whether there is a match (this is backtracking).
Once it has backtracked to c, you will get the match abc from abcde.
Now, imagine that the engine has to backtrack a few hundred characters. If you have nested groups and multiple * (or + or the {m,n} form), you can quickly end up with thousands or millions of backtracking steps; this is called catastrophic backtracking.
This is where possessive quantifiers come in handy. They prevent any backtracking into whatever the quantifier has already matched. With the regex mentioned above, abcde will not be matched by .*+c: once .*+ has consumed the whole string, it cannot backtrack, and since there's no c left at the end of the string, the match fails.
So, another possible use of possessive quantifiers is that they can improve the performance of some regexes, provided the engine can support it.
For your regex /^[\W].*+$/, I don't think that there's any improvement (maybe a tiny little improvement) that the possessive quantifier provides though. And last, it might easily be rewritten as /^\W.*+$/.
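The greedy versus possessive difference described above can be reproduced in Python. Python's re only gained the .*+ syntax in version 3.11, so this sketch emulates it with the well-known lookahead-plus-backreference idiom (?=(.*))\1, which works on older versions too; the emulation is a swapped-in technique, not what Ruby does internally.

```python
import re

text = "abcde"

# Greedy .* gives characters back until c can match.
greedy = re.search(r".*c", text)

# Emulated possessive .*+ : the lookahead captures the longest run once,
# and \1 must then consume exactly that run with no way to shrink it.
possessive = re.search(r"(?=(.*))\1c", text)

print(greedy.group(0))
print(possessive)

# The regex from the question, /^[\W].*+$/, emulated the same way.
question = re.compile(r"^\W(?=(.*))\1$")
print(question.search("#a") is not None)
print(question.search("1a") is not None)
```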
I have this ruby expression as below
(a|bc)(d?|e)*
When I use Rubular to test strings against this expression, there are some strings where I don't understand why they don't fully match.
For example, with the string "ade" it matches "ad" but not the "e". Can anyone help?
The second part of the regular expression you entered (d?|e)* is the problem. Putting the ? on the d says, match d 0 or 1 times. When you run through the string ade, the regex matches a, then d, then d 0 times... If you instead changed it to (a|bc)(d|e)*, it would match ade, and seem to have the semantics that you're looking for.
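The root cause can be isolated with a smaller pattern. The sketch below uses Python's re rather than Ruby, so it only illustrates the ordering of alternatives (which both engines share), plus the behaviour of the suggested fix; the sample strings are made up.

```python
import re

# d? can succeed without consuming anything, so the first alternative
# always wins and e is never tried.
print(repr(re.match(r"d?|e", "e").group(0)))

# Without the ?, the first alternative fails on "e" and the second is used.
print(repr(re.match(r"d|e", "e").group(0)))

# The suggested fix applied to the full expression.
fixed = re.compile(r"(a|bc)(d|e)*")
for s in ["ade", "bcdde", "a"]:
    print(s, "->", fixed.match(s).group(0))
```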
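d? is not a non-greedy match; it is greedy, but it can also match the empty string. Because the alternation tries d? first and d? always succeeds (matching nothing if there is no d), the e alternative is never reached, and the * loop stops as soon as an iteration matches the empty string.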
I don't know why you put a question mark there. Just use
(a|bc)(d|e)*
and it will be fine.