Recognizing extended characters using JavaCC - ASCII

I'm creating a grammar using JavaCC and have run across a small problem. I'm trying to allow any valid character within the extended ASCII set to be recognized by the resulting compiler. After looking at some of the JavaCC examples (primarily the one showing the JavaCC grammar itself), I set up the following token to recognize my characters:
< CHARACTER:
    (   (~["'"," ","\\","\n","\r"])
      | ( "\\"
          (   ["n","t","b","r","f","\\","'","\""]
            | ["0"-"7"] ( ["0"-"7"] )?
            | ["0"-"3"] ["0"-"7"] ["0"-"7"]
          )
        )
    )
>
If I'm understanding this correctly, it should match the octal representations of all of the ASCII characters, from \0 to \377 (which covers all 256 characters in the extended ASCII set). This performs as expected for all keyboard characters (a-z, 0-9, ?, ., / etc.) and even for most special characters (©, ¬, ®).
However, whenever I attempt to parse the trademark symbol (™), my parser throws an End of File exception, indicating that it is unable to recognize the symbol. Is there some obvious way I can enhance my definition of a character so that the trademark symbol is accepted?

I had a similar issue recognizing special symbols in a text file (either CP1252- or ISO-8859-1-encoded) which was read into a String before parsing. My solution was adding the UNICODE_INPUT option to the grammar header:
options {
    UNICODE_INPUT = true;
}
Worked like a breeze.
More information on JavaCC options: http://javacc.java.net/doc/javaccgrm.html

It turns out that what I wanted my grammar to do was to accept all valid Unicode characters, not just ASCII characters; the ™ symbol is part of the Unicode specification but not of any extended ASCII character set. Changing my token for a valid character as outlined below solved my problem (a valid Unicode escape here having the form U+00FF):
< CHARACTER:
    (   (~["'"," ","\\","\n","\r"])
      | ( "\\"
          (   ["n","t","b","r","f","\\","'","\""]
            | ["u","U"] ["+"]
              ["0"-"9","a"-"f","A"-"F"] ["0"-"9","a"-"f","A"-"F"]
              ["0"-"9","a"-"f","A"-"F"] ["0"-"9","a"-"f","A"-"F"]
          )
        )
    )
>
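With this token the trademark symbol can be written as the escape \u+2122. A quick check that its code point really falls outside the 0-255 extended ASCII range (a Ruby one-liner, used here only to inspect the code point):
$ ruby -e 'printf("U+%04X = %d\n", "™".ord, "™".ord)'
U+2122 = 8482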

Related

Difference in Unicode behaviour between `\w` vs `[[:alpha:]]` in Ruby

(For this question, ignore the number and underscore matching of \w, which is irrelevant to the discussion here.)
According to the Oniguruma docs, both the shorthand character classes like \w and POSIX classes like [:alpha:] have similar behaviour with regard to Unicode: they have simple ASCII behaviour for the "Not Unicode Case" (I assume that means the string's encoding is not a Unicode one), and a different behaviour that uses Unicode properties for the "Unicode Case".
From that documentation, it sounds as if, in a case where one of those uses Unicode properties, the other will also use them. However, in practice they seem to differ: the POSIX classes use Unicode properties automatically, whereas the \w type classes have to be explicitly marked with ?u to use Unicode property based matching:
$ ruby -e 'print("~café.".encoding)'
UTF-8
$ ruby -e 'print(/[[:alpha:]]+/.match("~café."))'
café
$ ruby -e 'print(/\w+/.match("~café."))'
caf
$ ruby -e 'print(/(?u)\w+/.match("~café."))'
café
$ ruby -v
ruby 2.3.6p384
Is this a bug, or is my interpretation of the docs wrong? (And what exactly does ?u do, could someone link to where it is documented?)
Since version 2.0, Ruby uses Onigmo, an Oniguruma fork that supports more of the features introduced in Perl 5.10.
If you compare the doc you linked (Oniguruma) with Onigmo's doc you can see a difference between the \w descriptions:
Oniguruma:
\w  word character
    Not Unicode: alphanumeric, "_" and multibyte char.
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)

Onigmo:
\w  word character
    Not Unicode: alphanumeric and "_".
    Unicode: General_Category -- (Letter|Mark|Number|Connector_Punctuation)
    It depends on ONIG_OPTION_ASCII_RANGE option that non-ASCII char includes or not.
As you can see, the "and multibyte char." part, which doesn't make sense (at least to me) and was probably a typo, is gone. Either way, the documentation is very unclear.
The u modifier switches the shorthand character classes from "Not Unicode" (default) to "Unicode".
That's why you get only "caf" without it and "café" with it when you try to match using the character class \w.
On the other hand, the POSIX class [[:alpha:]] seems to be extended to Unicode characters by default, since it matches "café" without the u modifier. The beginning of an explanation can be found in the doc:
It depends on ONIG_OPTION_ASCII_RANGE option and
ONIG_OPTION_POSIX_BRACKET_ALL_RANGE option that POSIX brackets
match non-ASCII char or not.
But you can force it back to ASCII using the (?a) modifier, e.g.:
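Reusing the test string from the question (assuming Onigmo's (?a) behaves as its doc describes, this is what you should see):
$ ruby -e 'print(/(?a)[[:alpha:]]+/.match("~café."))'
caf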

":"-Character in a windows filesystem (formerly: ":"-Character in Delphi TFileStream)

I tried to put a colon in the file name string passed to a TFileStream.
Is it true that one can't use a colon in a TFileStream file name in Delphi?
And if one can, then how?
EDIT: Thanks for all the downvotes. The question deserves them. In retrospect, I have asked a stupid question...
On Windows, which I presume is your platform, the colon is a reserved character and so not allowed in a filename. This is documented here:
File and Directory Names
Naming Conventions
The following fundamental rules enable applications to create and process valid names for files and directories, regardless of the file system:
...
Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
...
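If you need to accept arbitrary user input as a file name, one option is to replace the reserved characters before opening the stream. A minimal sketch in Ruby (the helper name and the underscore replacement are my own choices, not part of any Windows or Delphi API):
# Replace the Windows-reserved filename characters listed above.
WINDOWS_RESERVED = /[<>:"\/\\|?*]/

def sanitize_filename(name)
  name.gsub(WINDOWS_RESERVED, '_')
end

puts sanitize_filename('report 12:30.txt')  # => report 12_30.txt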

Xtext: Controlling when whitespace is allowed

I have a custom scripting language for which I am attempting to use Xtext for syntax checking. It boils down to single-line commands in the format
COMMAND:PARAMETERS
For the most part, Xtext is working great. The only problem I have run into so far is how to handle wanted (or unwanted) whitespace. A line cannot begin with a space, and there cannot be a space following the colon. At the same time, I need to allow whitespace in the parameters, as a parameter could be a string of text or something similar.
I have used a datatype rule to allow whitespace in the parameter:
UNQUOTED_STRING:
    (ID | INT | WS | '.')+
;
This works, but has the side effect of allowing spaces throughout the line.
Does anyone know a way to limit where white spaces are allowed?
Thanks in advance for any advice!
You can disallow whitespace globally for your grammar by using an empty set of hidden tokens, e.g.
grammar org.xyz.MyDsl with org.eclipse.xtext.common.Terminals hidden()
Then you can enable it for specific rules, e.g.
XParameter hidden(WS):
    'x' '=' value=ID
;
Note that this would allow line breaks as well. If you don't want that, you can either pass a custom terminal rule or override the default WS rule.
Here is a more complete example (not perfect):
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals hidden()

generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"

Model:
    (commands+=Command '\r'? '\n')+
;

Command:
    SampleCommand
;

SampleCommand:
    command='get' ':' parameter=Parameter
;

Parameter:
    '{' x=XParameter '}'
;

XParameter hidden(WS):
    'x' '=' value=ID
;
This will parse commands such as:
get:{x=TEST}
get:{ x = TEST}
But will reject (the first line begins with a space, the second has a space after the colon):
 get:{x=TEST}
get: {x=TEST}
Hope that gives you an idea. You can also do this the other way around by limiting the whitespace only for certain rules, e.g.
CommandList hidden():
    (commands+=Command '\r'? '\n')+
;
If that works better for your grammar.

Turbo Grep - find special characters in UTF-8 file

I am running Windows 7 and (have to) use Turbo Grep (Borland something) to search in a file.
I have 2 versions of this file, one encoded in UTF-8 and one in ANSI.
If I run the following grep on the ANSI file, I get the expected results, but I get no results with the same statement on the UTF-8 file:
grep -ni "[äöü]" myfile.txt
[-n for line numbers, -i for ignoring cases]
The Turbo Grep Version is :
Turbo GREP 5.6 Copyright (c) 1992-2010 Embarcadero Technologies, Inc.
Syntax: GREP [-rlcnvidzewoqhu] searchstring file[s] or #filelist
GREP ? for help
Help for this command lists:
Options are one or more option characters preceded by "-", and optionally
followed by "+" (turn option on), or "-" (turn it off). The default is "+".
-r+ Regular expression search -l- File names only
-c- match Count only -n- Line numbers
-v- Non-matching lines only -i- Ignore case
-d- Search subdirectories -z- Verbose
-e Next argument is searchstring -w- Word search
-o- UNIX output format Default set: [0-9A-Z_]
-q- Quiet: supress normal output
-h- Supress display of filename
-u xxx Create a copy of grep named 'xxx' with current options set as default
A regular expression is one or more occurrences of: One or more characters
optionally enclosed in quotes. The following symbols are treated specially:
^ start of line $ end of line
. any character \ quote next character
* match zero or more + match one or more
[aeiou0-9] match a, e, i, o, u, and 0 thru 9 ;
[^aeiou0-9] match anything but a, e, i, o, u, and 0 thru 9
Is there a problem with the encoding of these characters in UTF-8? Might there be a problem with Turbo Grep and UTF-8?
Thanks in advance
Yes, there is a difference: Windows 7 natively uses UTF-16 little-endian, not UTF-8; UTF-8 is used by Unix, Linux and Plan 9, to cite a few OSes.
Jon Skeet explains:
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default code page for my system" which is obtained via Encoding.Default, and is often Windows-1252
UTF-8: Variable length encoding, 1-4 bytes covers every current character. ASCII values are encoded as ASCII.
If you use only ASCII, both files contain identical bytes. But special characters such as ä, ö and ü are a single byte in an ANSI code page and two bytes in UTF-8, and Turbo GREP compares raw bytes, so the pattern [äöü] only matches the single-byte ANSI forms. On Windows, search the ANSI-encoded file (or convert the UTF-8 file to ANSI first).
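You can see the byte-level difference directly (a Ruby one-liner here just to dump the bytes; any hex viewer would show the same):
$ ruby -e 'p "ä".encode("Windows-1252").bytes; p "ä".bytes'
[228]
[195, 164]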

How can I get the char for a given UTF-8 code in Ruby 2.1

I was wondering if there is a way to get the character for a given UTF-8 code?
E.g.:
1103 => "я" (Russian letter)
Using Array#pack with the U directive (UTF-8 character):
[1103].pack('U')
# => "я"
Another approach is the "\u{hex}" string escape, e.g. "\u{44f}". This syntax accepts only hexadecimal numbers, not decimal (1103 decimal is 0x44F). The U+044F notation is the most commonly used one for referencing Unicode code points.
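To summarize the options for going from the code point 1103 to the character (Integer#chr with an explicit encoding also works):
[1103].pack('U')           # => "я"  (Array#pack, U = UTF-8 character)
1103.chr(Encoding::UTF_8)  # => "я"  (Integer#chr with an explicit encoding)
"\u{44f}"                  # => "я"  (string escape; 1103 == 0x44F)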
