I'm trying to write a data validation rule that ensures Shipping Street is one line (doesn't contain line breaks)
I've tried things like
CONTAINS( ShippingStreet , BR() ), and
CONTAINS( ShippingStreet , "\n" ),
but I can't get the rule to trigger.
Any help?
This will do it:
REGEX(ShippingStreet,'.*\\n.*')
There are 2 things to learn from this question about SFDC REGEX parsing:
(1) As per Java SE 6 Pattern syntax, you need to double-escape the newline character (\n), along with various other special characters, when it is used in a string that gets compiled to a regular expression; that is, use '\\n'.
(2) The Salesforce Regular Expression parser matches the entire phrase by default. To match on just part of the phrase, you have to surround your pattern with .*
Examples:
1. REGEX('Marc Benioff','Marc Benioff') -> TRUE
2. REGEX('Marc Benioff is a CEO','Marc Benioff') -> FALSE
3. REGEX('Marc Benioff','.*Marc Benioff.*') -> TRUE
4. REGEX('Marc Benioff is a CEO','.*Marc Benioff.*') -> TRUE
For more info, read the 'Tips' section of the SFDC REGEX Help docs.
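As a rough analogy only (this is Python's re module, not the Salesforce formula engine), whole-string matching behaves like re.fullmatch. The helper name sf_regex is hypothetical; the sketch simply replays the four examples above.

```python
import re

def sf_regex(text, pattern):
    """Python analogy: True only if the pattern matches the ENTIRE text."""
    return re.fullmatch(pattern, text) is not None

examples = [
    ("Marc Benioff", "Marc Benioff"),
    ("Marc Benioff is a CEO", "Marc Benioff"),
    ("Marc Benioff", ".*Marc Benioff.*"),
    ("Marc Benioff is a CEO", ".*Marc Benioff.*"),
]

# One boolean per example, in the same order as the list above.
results = [sf_regex(text, pattern) for text, pattern in examples]
print(results)
```

Only the second example fails, because the bare pattern does not cover the trailing " is a CEO".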
The following in a validation rule should mean that only single-line ShippingStreet values are accepted: by default . does not match a line break, so .* matches the whole value only when it contains none, and the rule fires otherwise.
NOT(REGEX(ShippingStreet, '.*'))
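The same idea can be sketched in Python, again only as an analogy: a whole-string match of .* fails as soon as the value contains a line break. The function name is_single_line is hypothetical, and Python's . excludes only \n by default, whereas Java's also excludes other line terminators such as \r.

```python
import re

def is_single_line(street):
    """True when '.*' can cover the whole value, i.e. it contains no '\\n'."""
    return re.fullmatch(r".*", street) is not None

print(is_single_line("1 Market St"))
print(is_single_line("1 Market St\nSuite 300"))
```

Unlike a pattern such as .*\n.*, this check also catches values with more than one line break, since no part of the pattern is allowed to cross a newline.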
This is a dummy example, my actual language is more complicated:
grammar wordasnumber;
WS: [ \t\n] -> skip;
AS: [Aa] [Ss];
ID: [A-Za-z]+;
NUMBER: [0-9]+;
wordAsNumber: (ID AS NUMBER)* EOF;
In this language, these two strings are legal:
seven as 7 eight as 8
seven as 7eight as8
Which is exactly what I told it to do, but not what I want. Because ID and AS are both strings of letters, white space is required between them, and I would like that second phrase to be a syntax error. I could add some other rule to try and match these mashed up things ...
fragment LETTER: [A-Za-z];
fragment DIGIT: [0-9];
BAD_THING: ( LETTER+ DIGIT (LETTER|DIGIT)* ) | ( DIGIT+ LETTER (LETTER|DIGIT)* );
ID: LETTER+;
NUMBER: DIGIT+;
... to make the lexer return a different token for these smashed-up things, but this feels like a weird band-aid that I only found the need for by accident, and maybe there are more such cases that I would find if I stared at my lexer very carefully.
Is there a better way to do this? My actual grammar is much larger, so, for example, making WS NOT be skipped and placing it explicitly between the tokens where it is required is a non-starter.
There was an older question on this list, which I could not find, that I think is the same question. In that case, someone who was parsing whitespace-separated numbers was surprised that 1.2.3 was parsed as 1.2 and .3 and not as a syntax error.
Add another rule for the wrong input, but don't use that in your parser. It will then cause a syntax error when matched:
INVALID: (ID | NUMBER)+;
This additional rule changes the parse tree output for the input in the question: the mashed-up parts are now lexed as INVALID tokens, which the parser rule does not accept.
This trick works because ANTLR4's lexing approach tries to match the longest input in one go, and the INVALID rule matches more than ID and NUMBER alone. But you have to place it after these 2 rules, to make use of another lexing rule: "If two lexer rules would match the same input, pick the first one." This way, you get the correct tokens for single appearances of ID and NUMBER.
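To see the two lexing rules interact, here is a minimal Python sketch of a longest-match lexer with a first-rule tie-break. It is not ANTLR, only an imitation of the behavior described above; the names RULES and tokenize are hypothetical, and the INVALID pattern [A-Za-z0-9]+ is used because it accepts the same strings as (ID | NUMBER)+.

```python
import re

# Rule order matters: on equal-length matches the earlier rule wins.
RULES = [
    ("WS", re.compile(r"[ \t\n]")),            # skipped, like "-> skip"
    ("AS", re.compile(r"[Aa][Ss]")),
    ("ID", re.compile(r"[A-Za-z]+")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("INVALID", re.compile(r"[A-Za-z0-9]+")),  # same strings as (ID | NUMBER)+
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so ties are resolved in favor of the earlier rule.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

good = tokenize("seven as 7 eight as 8")
bad = tokenize("seven as 7eight as8")
print(good)
print(bad)
```

In the second input, 7eight and as8 come out as INVALID tokens because INVALID is the only rule that can consume the whole run, while seven and the first as still become ID and AS through the tie-break.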
I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to TITLE: 'x' ; it still fails, this time with an error message saying "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH will match more than I am looking to match, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser, and the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every string that is matched by TITLE is also matched by FILEPATH, and FILEPATH is defined before TITLE, so each token that you expect to be a TITLE comes out as a FILEPATH.
Some hints for that:
keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser-driven lexer, you have to change to another parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems).
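The effect of rule order can be imitated in a few lines of Python. This is only a sketch of the longest-match, first-rule-wins behavior described above, not ANTLR itself; the helper token_types and the variable names are hypothetical, and the character classes are transcribed from the grammar in the question.

```python
import re

FILEPATH = ("FILEPATH", re.compile(r"[A-Za-z0-9:\\/ \-_.]+"))
NEWLINE = ("NEWLINE", re.compile(r"\r?\n"))
TITLE = ("TITLE", re.compile(r"[A-Za-z ]+"))

def token_types(text, rules):
    """Return the token type chosen at each step: longest match, earliest rule on ties."""
    types, pos = [], 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, rx in rules:
            m = rx.match(text, pos)
            if m and m.end() > best_end:  # ties keep the earlier rule
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        types.append(best_name)
        pos = best_end
    return types

source = "c:\\test.txt\nx"  # the two-line input from the question
original_order = token_types(source, [FILEPATH, NEWLINE, TITLE])
title_first = token_types(source, [TITLE, NEWLINE, FILEPATH])
print(original_order)
print(title_first)
```

With the original order the final x is typed as FILEPATH, which is why the parser reports a mismatch; with TITLE listed first the tie on x goes to TITLE, while the path is still a FILEPATH because that match is longer.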
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason for me was that I had placed the new keyword after my VARNAME lexer rule, so the lexer assigned it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.
In my Elasticsearch setup, I would like to create tokens separated by either " " or "-" and greater than 3 chars.
I believe the pattern tokenizer can work, but I am not able to create the regular expression.
Please help me with the regular expression.
You should be able to use the following regex in the pattern field of your pattern tokenizer:
([^\s-]{3,})
The \s means any whitespace character.
The - means the literal dash character.
Putting the two of them between [^ and ] means match any character that isn't one of those in the list (in this case, anything that is not whitespace and not a dash).
The {3,} means the previous match has to occur 3 times or more.
The parentheses around the entire statement mean you want to capture what is inside, so the pattern tokenizer can pull its tokens from the matching group of the regex. Note that the tokenizer's group parameter defaults to -1, which splits on the pattern instead; set group to 1 to extract the captured text as tokens.
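To check which tokens the pattern yields before wiring it into an analyzer, the same regex can be tried in Python. This is only a quick sketch with Python's re module, whose syntax agrees with Java's for this particular pattern; the sample sentence and the function name extract_tokens are made up for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"([^\s-]{3,})")

def extract_tokens(text):
    """Return every run of 3+ characters that contains no whitespace and no dash."""
    return TOKEN_PATTERN.findall(text)

tokens = extract_tokens("well-known big data set-up")
print(tokens)
```

The trailing "up" is dropped because it is shorter than three characters.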
You can play with this regex here and see how it will split your string:
https://regex101.com/r/2e9p34/1
On a side note, there may be better ways to do this that handle edge cases you aren't thinking of, but I decided to answer your question exactly as you asked it. I highly recommend exploring all of the options Elasticsearch provides for its analyzers to see which one best fits your use case.
Hope this helps!
I am trying to pull a whole MySQL statement from a database SQL file:
INSERT INTO `helppages`
(`HelpPageID`, `ShowHelpItem`, `HelpRank`, `HelpCategory`, `HelpTitle`, `HelpDescription`, `HelpLink`, `HelpText`, `CMSHelpBar`, `CMSHelpBarAdditional`)
VALUES (... characters (Too many to post here, but the expression below grabs all) ...
);
The current, though I have been through many variations, expression I am using is:
preg_match("#INSERT INTO `$SearchingTableName` ([!%&'-/:<=>#^`\;\s\d\w\"\#\$\(\)\*\+\,\.\?\[\]\{\}\(\)\\\|©]*?)\)\;\r\n#s", $uploadedfile, $matches);
which gets all the information but I can't get it to stop at the end ");\r\n"
also $SearchingTableName = helppages.
Edit
Sorry, the current expression uses a lookahead:
preg_match("#INSERT INTO `$SearchingTableName` ([!%&'-/:<=>#^`\;\s\d\w\"\#\$\(\)\*\+\,\.\?\[\]\{\}\(\)\\\|©]*)(?!\)\;\r\n)#s", $uploadedfile, $matches);
Also I checked with MSword using );^p and there is only one instance at the end of the Insert
You can't match this kind of string by playing with character classes alone. You need to describe the structure of the string.
For this simple particular case you can use this pattern:
$pattern = <<<EOD
~
# definitions
(?(DEFINE)
(?<elt> [^"',)]+ | '(?>[^\\\\']+|\\\\.)*' | "(?>[^\\\\"]+|\\\\.)*" )  # backslashes are doubled because this is a heredoc
(?<list> \( \g<elt>? (?: \s* , \s* \g<elt> )* \) )
)
# main pattern
INSERT \s+ (?:INTO \s+)? `$SearchingTableName` \s* \g<list>? \s* VALUES \s*
\g<list> \s* (?: , \s* \g<list> \s* )* ;
~xs
EOD;
if (preg_match_all($pattern, $uploadedfile, $m))
print_r($m[0]);
But keep in mind that parsing a programming language is not an easy task and is full of traps (depending on the syntax), even with the capabilities of the PHP regex engine. (It is possible, however.)
regex features used here:
delimiters and modifiers:
The pattern delimiter used here is ~ instead of the classical /. There is no literal ~ in the pattern, so that is fine.
The pattern uses two modifiers, s and x:
By default, the . can't match the newline character \n. The s modifier (s for singleline mode) changes this behavior: when it is used, the . can match all characters, including the newline character. (Note that you can get the default behavior back with \N, which doesn't match the newline character whatever the mode.)
x switches on the extended mode. In this mode, whitespace inside the pattern is ignored. This mode also allows inline comments that begin with a sharp character #. It is very useful for making long patterns readable with spaces, indentation and comments.
using named captures
When you have a long pattern and when you need to reuse several times the same subpatterns, you have the possibility to reuse subpatterns that are written inside capture groups.
A quick example:
You want to match several items separated by commas and composed with 4 digits and 4 letters like this: 1234abcd,5678efgh,9012ijkl,3456mnop.
The pattern to do that is obviously ^\d{4}[a-z]{4}(?:,\d{4}[a-z]{4})+$
But if I don't want to write \d{4}[a-z]{4} two times, I can put it in a capture group and use an alias for the subpattern in the capture group, like this: ^(\d{4}[a-z]{4})(?:,(?1))+$.
Here the (?1) is an alias for the subpattern inside the capture group 1 (not the content matched by the subpattern as a backreference \1 does, but the subpattern itself) that is \d{4}[a-z]{4}.
PCRE, the regex engine used by PHP, also supports the syntax \g<1> as an alternative to (?1).
But if you have a lot of capture groups in the pattern, it is not always handy to remember what's the number of the capture group you need. This is the reason why you have the possibility to name capturing groups. Example: ^(?<diglet>\d{4}[a-z]{4})(?:,\g<diglet>)+$
The other advantage of named patterns, besides making the whole pattern more readable, is that they add a semantic dimension to the pattern, in the same way you can by adding an id attribute to an HTML tag.
definition section
Instead of defining the named subpattern directly in the main pattern like in the previous example, you can use a definition section to hold all the subpatterns that will be used in the main pattern. Note that everything inside this section is there only for definition purposes and doesn't match anything. It's like a zero-width assertion.
The syntax of this section is: (?(DEFINE)(?<diglet>\d{4}[a-z]{4})) (you can put several named subpatterns inside). The preceding pattern becomes: (?(DEFINE)(?<diglet>\d{4}[a-z]{4}))^\g<diglet>(?:,\g<diglet>)+$
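Python's built-in re module has neither subpattern calls such as (?1) or \g<name> nor a (?(DEFINE)...) section, so the following is only a rough equivalent of the idea, not the PCRE feature itself: the reusable piece is kept in a string variable and spliced into the main pattern. The names DIGLET and ITEMS are chosen here for illustration.

```python
import re

# The reusable subpattern: 4 digits followed by 4 lowercase letters.
DIGLET = r"\d{4}[a-z]{4}"

# Main pattern built by substitution; compare ^\g<diglet>(?:,\g<diglet>)+$ in PCRE.
ITEMS = re.compile(rf"^{DIGLET}(?:,{DIGLET})+$")

ok = ITEMS.match("1234abcd,5678efgh,9012ijkl,3456mnop") is not None
single = ITEMS.match("1234abcd") is not None
broken = ITEMS.match("1234abcd,5678") is not None
print(ok, single, broken)
```

As in the original pattern, a single item without a comma does not match, because the repeated group uses + rather than *.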
the pattern itself:
The first part of the pattern enclosed between (?(DEFINE) and ) consists of subpatterns definitions that will be used later in the main pattern.
The elt subpattern describes an item (a column name or a value):
[^"',)]+ # anything that is not a quote, a comma or a closing parenthesis:
# in the present context this will match numbers and column names
| # OR
'(?>[^\\']+|\\.)*' # string between single quotes (designed to deal with escaped quotes)
|
"(?>[^\\"]+|\\.)*" # same for double quotes
The list subpattern describes the full list of elements, separated by commas, between parentheses. Note that this subpattern uses a reference to the elt subpattern.
The main pattern needs only to reuse the subpattern list.
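For readers who want to experiment outside PHP, the structure of the pattern can be approximated in Python. This is a sketch under several assumptions, not a port of the answer: Python's re has no DEFINE section, so elt and list become string variables; the atomic groups are replaced by plain groups; and the names ELT, LIST, insert_pattern and the sample SQL are invented for the example.

```python
import re

# One element: a bare value or column name, or a quoted string with backslash escapes.
ELT = r"""(?: [^"',)]+ | '(?:[^\\']|\\.)*' | "(?:[^\\"]|\\.)*" )"""
# A parenthesised, comma-separated list of elements.
LIST = rf"\( \s* {ELT}? (?: \s* , \s* {ELT} )* \s* \)"

def insert_pattern(table):
    """Build a verbose-mode pattern for one INSERT statement on the given table."""
    return re.compile(
        rf"INSERT \s+ (?:INTO \s+)? `{re.escape(table)}` \s* (?:{LIST})? \s* "
        rf"VALUES \s* {LIST} \s* (?: , \s* {LIST} \s* )* ;",
        re.VERBOSE | re.DOTALL,
    )

sql = (
    "INSERT INTO `helppages`\n"
    "(`HelpPageID`, `HelpTitle`)\n"
    "VALUES (1, 'It\\'s a title; see (this), ok');\r\n"
    "INSERT INTO `other` VALUES (2);\r\n"
)
match = insert_pattern("helppages").search(sql)
statement = match.group(0) if match else None
print(statement)
```

Because quoted strings are consumed as a unit, the semicolon, parentheses and escaped quote inside the title do not end the match early; it stops at the ; that follows the closing parenthesis of the value list.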
I have strings that look like this:
Executive Producer (3)
Producer (0)
1st Assistant Camera (12)
I'd like to use a regex to match the first part of the string and to remove the " (num)" part (the space preceding the parentheses and the parenthesis/digit in the parentheses). After using the regex I'd want to have my vars equal to: "Executive Producer", "Producer", "1st Assistant Camera"
If you know any resources for learning regexes that would be great too.
You just have to select all the characters except the final parenthesis and their numeric content:
(.+) \(\d+\)
The first pair of parentheses captures the content (here, any characters, as expressed by the dot). The second pair is escaped with backslashes, so it matches literal parentheses instead of forming a capture group; the \d+ between them matches the number.
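A quick way to try the pattern is the following Python sketch; the regex is the one given above, while the function name strip_count and the choice to return unmatched labels unchanged are assumptions made for the example.

```python
import re

CREDIT = re.compile(r"(.+) \(\d+\)")

def strip_count(label):
    """Return the text before a trailing ' (number)', or the label unchanged if absent."""
    m = CREDIT.fullmatch(label)
    return m.group(1) if m else label

labels = ["Executive Producer (3)", "Producer (0)", "1st Assistant Camera (12)"]
names = [strip_count(label) for label in labels]
print(names)
```

Group 1 holds everything before the final space and parenthesised number, which is the part the question wants to keep.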
One of my favorite regex site: http://www.regular-expressions.info/
Maybe s/([\s\w]+\w)\s*\(\d+\)/\1/?
I don't know Ruby, so you'd have to translate it to its own regexp syntax.