Special character sets in ANTLR C++ target - antlr3

I am using ANTLR version 3.5 to parse our grammar in C++.
The input file contains extended characters like "°", and ANTLR is not able to match these characters with a lexer rule.
Can anyone suggest how to match extended characters in ANTLR?

What Unicode code point is that? Keep in mind that ANTLR (including ANTLR 4) can only handle the Unicode BMP (Basic Multilingual Plane). You cannot parse input that contains characters beyond that.
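Since "°" is U+00B0, well inside the BMP, the usual culprit is the input stream: if the file is UTF-8 but is opened as raw 8-bit input, the lexer sees the two bytes 0xC2 0xB0 instead of one character. A minimal sketch, assuming the ANTLR 3 C runtime (which C++ usage typically builds on) and a hypothetical combined grammar named MyGrammar:

#include <antlr3.h>

// Open the file as UTF-8 so a multi-byte sequence such as "°"
// reaches the lexer as the single code point U+00B0.
pANTLR3_INPUT_STREAM input =
    antlr3FileStreamNew((pANTLR3_UINT8)"input.txt", ANTLR3_ENC_UTF8);
pMyGrammarLexer lexer = MyGrammarLexerNew(input);  // generated constructor

In the grammar itself such a character can then be matched by its code point, e.g. a lexer rule like DEGREE : '\u00b0' ;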

How does nano represent special ASCII characters compared to Sublime?

I'm writing up a game in P5.js which draws emojis to a canvas.
I was originally using the Sublime Text editor to copy-paste special ASCII characters straight into the code, which works fine, but I now only have access to nano, which doesn't seem to accept this approach.
Nano converts what I have already done into some different characters. Presumably, this is nano's way of interpreting those ASCII characters.
I am using these characters because phones and browsers now automatically convert these ASCII characters into emojis.
Example: the heart emoji is converted from the special ASCII character ♥ in Sublime to âö¥ in nano, automatically, when you open the file.
I am wondering if there is a reference sheet somewhere where I can find other conversions for emojis I would like to use.
Just forget ASCII. HTML uses Unicode characters. JavaScript uses Unicode's UTF-16 encoding. Your files might use Unicode's UTF-8 encoding.
ASCII does not have the character ♥.
Special characters in JavaScript include quote, double quote, backslash, and similar. If you wish or need to, you can escape UTF-16 code units using the "\uABCD" notation. Special characters in HTML are <, >, & and similar. If you wish or need to, you can use named or numeric character entity references like &amp; or &#x1F6B2;
♥ is not special; it's just a character with no particular purpose, just like tens of thousands of others.
Converting typed characters into other characters is an input function, typically performed by the OS or other input software. So, that's generally outside the scope of HTML and JavaScript.
A text file has an encoding. Some programs help you when opening a file by guessing; you then have to correct them when they guess wrong.
It's generally easiest if all files are UTF-8. Sometimes a BOM helps, sometimes not. The fundamental rule of character encodings is to read using the same encoding that was used to write.
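To make that rule concrete, here is a small C++ sketch (illustrative, not from the original answer) showing why ♥ turns into three junk characters: its three UTF-8 bytes, reinterpreted one at a time under a single-byte code page, become three unrelated glyphs:

#include <cstdio>

int main() {
    // "♥" is U+2665; encoded as UTF-8 it is the three bytes E2 99 A5.
    // An editor reading those bytes under a single-byte code page
    // displays three unrelated characters (the "âö¥" effect).
    const char heart[] = "\xE2\x99\xA5";
    for (const char *p = heart; *p; ++p)
        printf("%02X ", (unsigned char)*p);  // prints: E2 99 A5
    printf("\n");
    return 0;
}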
The list of Unicode characters is here. There are several other good sites for searching and coding, including http://www.fileformat.info/.

Windows encoding clarification

I would like to translate a game; the game loads its strings from a text file.
The destination language uses non-ASCII characters, so I naïvely saved my file in UTF-8, but it does not work: letters with diacritics are not shown correctly.
Looking more closely at the configuration file where the text filename is stored, I found a CHARSET option that can take any of these values:
ANSI_CHARSET DEFAULT_CHARSET SYMBOL_CHARSET MAC_CHARSET SHIFTJIS_CHARSET HANGEUL_CHARSET JOHAB_CHARSET GB2312_CHARSET CHINESEBIG5_CHARSET GREEK_CHARSET TURKISH_CHARSET VIETNAMESE_CHARSET HEBREW_CHARSET ARABIC_CHARSET BALTIC_CHARSET RUSSIAN_CHARSET THAI_CHARSET EASTEUROPE_CHARSET OEM_CHARSET
As far as I understood, these are fairly standard values in the Windows APIs, where "charset" and "character encoding" are synonymous.
So my question is: is there a correspondence between these names and standard names like UTF-8 or ISO-8859-2? If so, what is it?
Try using EASTEUROPE_CHARSET
ISO 8859-2 is mostly equivalent to Windows-1250. According to this MSDN article, the 1250 code page is accessed using EASTEUROPE_CHARSET.
Note that you will need to save your text file in the 1250 code page, as ISO 8859-2 is not exactly equivalent. From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged (unlike Windows-1252, which keeps all printable characters from ISO-8859-1 in the same place). Most of the rearrangements seem to have been done to keep characters shared with Windows-1252 in the same place as in Windows-1252 but three of the characters moved (Ą,Ľ,ź) cannot be explained this way.
The names are symbolic identifiers for Windows code pages, which are character encodings (= charsets) defined or adopted by Microsoft. Many of them are registered at IANA with the prefix windows-. For example, EASTEUROPE_CHARSET stands for code page 1250, which has been registered as windows-1250 and is often called Windows Latin 2.
UTF-8 is something different. You need special routines to read and write UTF-8 encoded data. UTF-8 or UTF-16 is generally the only sensible choice for character encoding when you want to be truly global (support different languages and writing systems). For a single specific language, some of the code pages might be more practical in some cases.
You can get the standard encoding names (as registered by IANA) from the table in the Remarks section of this MSDN page.
Just find the Character set row and read the Code page number; the standard name is windows-[code page number].
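If you need this mapping at run time rather than from the table, the Win32 GDI helper TranslateCharsetInfo can translate a *_CHARSET value into its ANSI code page. A minimal sketch (the output comment reflects the expected mapping):

#include <windows.h>
#include <cstdio>

// Link against gdi32 (TranslateCharsetInfo lives there).
int main() {
    CHARSETINFO csi;
    // With TCI_SRCCHARSET, the charset value itself is passed in place
    // of a pointer; this odd cast is the documented calling convention.
    if (TranslateCharsetInfo((DWORD *)(UINT_PTR)EASTEUROPE_CHARSET,
                             &csi, TCI_SRCCHARSET)) {
        // Expected: code page 1250, i.e. the IANA name windows-1250.
        printf("EASTEUROPE_CHARSET -> code page %u\n", csi.ciACP);
    }
    return 0;
}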

Things to take into account when internationalizing a web app to handle Chinese

I have an MVC3 web app with i18n in four Latin-script languages... but I would like to add Chinese in the future.
I'm working with standard resource file.
Any tips?
EDIT: Anything about reading direction? Numbers? Fonts?
I would start with these observations:
Chinese text is not whitespace-delimited, meaning that a search engine (if you need one) cannot rely on punctuation and whitespace alone to find word boundaries (to a first approximation, each character is a word); also, you might have mixed Latin and Chinese words
make sure to use UTF-8 for all your HTML documents (.resx files are UTF-8 by default)
make sure that your database collation supports Chinese - or use a separate database with an appropriate collation
make sure you don't reverse strings or do other unusual text operations that might break with multi-byte characters
make sure you don't call ToLower and ToUpper on user-input text, because this can break with other alphabets (or rather scripts) - aka the Turkey Test
To test for all of the above and other possible issues, a good way is pseudolocalization.
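As an illustration of pseudolocalization, here is a toy C++ sketch (the mappings and bracket padding are invented for the example) that swaps ASCII letters for accented look-alikes, so strings that bypass your resource files or break under non-ASCII input become obvious during testing:

#include <iostream>
#include <string>

// Replace a few ASCII letters with accented UTF-8 look-alikes and pad
// the string, so untranslated or encoding-broken text stands out.
std::string Pseudolocalize(const std::string &s) {
    std::string out = "[!! ";
    for (char c : s) {
        switch (c) {
            case 'a': out += "\xC3\xA1"; break;  // á
            case 'e': out += "\xC3\xA9"; break;  // é
            case 'o': out += "\xC3\xB6"; break;  // ö
            default:  out += c;
        }
    }
    return out + " !!]";
}

int main() {
    std::cout << Pseudolocalize("Hello world") << "\n";  // [!! Héllö wörld !!]
    return 0;
}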

Extended character code pages

I was trying to validate characters (the extended ones), and I see that various PCs have the extended characters in different places. I mean, we do not see the same ASCII code number for a certain character (for non-Latin ones).
Now my issue is: what do I have to do so that my program always uses a certain ASCII code table when it starts?
For extended characters, of course.
This issue generally relates only to reading and writing text files (since .NET strings are UTF-16). In that case, just use Encoding.GetEncoding(codePage) to choose the appropriate encoding, and use it whenever you access text files. All standard built-in text/file utility operations will take an encoding, for example:
Encoding encoding = Encoding.GetEncoding(737); // illustrative: use the code page your files actually use
string contents = File.ReadAllText("foo.txt", encoding);

Converting Multibyte characters to UTF-8

My application has to write data to an XML file which will be read by an SWF file. The SWF expects the data in the XML to be in UTF-8 encoding. I have to convert some multibyte characters in my app (Simplified Chinese, Japanese, Korean, etc.) to UTF-8.
Are there any API calls which could allow me to do this? I would prefer not to use any 3rd-party DLLs. I need to do it both on Windows and on Mac, and would prefer system APIs if available.
Thanks
jbsp72
UTF-8 is a multibyte encoding (well, a variable byte-length encoding, to be precise). Stating that you need to convert from a multibyte encoding is not enough: you need to specify which multibyte encoding your source is in.
"I have to convert some Multibyte characters in my app(Chinese simplified, Japanese, Korean etc..) to UTF-8."
If your original string is in a multibyte encoding (Chinese/Arabic/Thai/etc.) and you need to convert it to another multibyte encoding (UTF-8), one way is to convert to wide characters (UTF-16) first, then convert back to multibyte:
multibyte (Chinese/Arabic/Thai/etc.) -> wide char (UTF-16) -> multibyte (UTF-8)
If your original string is already in Unicode (UTF-16), you can skip the first conversion in the above illustration.
You can find the code page identifiers on MSDN.
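On Windows, that two-step pipeline maps directly onto the system calls MultiByteToWideChar and WideCharToMultiByte. A minimal sketch (the code page values in the comment are examples; error handling omitted):

#include <windows.h>
#include <string>

// Convert text from a given code page (e.g. 932 Shift-JIS, 936 GB2312,
// 949 Korean) to UTF-8 by going through UTF-16, as described above.
std::string ToUtf8(const std::string &src, UINT srcCodePage) {
    // Step 1: source code page -> UTF-16.
    int wlen = MultiByteToWideChar(srcCodePage, 0, src.data(),
                                   (int)src.size(), NULL, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(srcCodePage, 0, src.data(), (int)src.size(),
                        &wide[0], wlen);

    // Step 2: UTF-16 -> UTF-8.
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                                   NULL, 0, NULL, NULL);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wlen,
                        &utf8[0], ulen, NULL, NULL);
    return utf8;
}

On the Mac, CoreFoundation offers the equivalent without third-party code, e.g. CFStringCreateWithBytes with the source encoding followed by CFStringGetBytes with kCFStringEncodingUTF8.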
Google Chrome has some string conversion implementations for Windows, Linux, and Mac in the Chromium source. The files are under src/base:
+ sys_string_conversions.h
+ sys_string_conversions_linux.cc
+ sys_string_conversions_win.cc
+ sys_string_conversions_mac.mm
The code is under a BSD license, so you can use it in commercial projects.
