In ASCII, symbol 158 is ₧.
What does this symbol stand for, and what was it used for?
ASCII is a 7-bit encoding that only defines codes 0–127, so there is no ASCII character 158. However, the symbol "₧" stands for the Spanish peseta, and it is part of the old code page 437, where it does have code 158. Nowadays it's part of Unicode as U+20A7 PESETA SIGN.
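For instance, here is a minimal C++ sketch that prints the modern Unicode form of the character. It assumes a console font that can display the glyph and a compiler that stores the literal as UTF-8 (e.g. /utf-8 with MSVC; the default with GCC and Clang):

    #include <iostream>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    int main() {
    #ifdef _WIN32
        // Switch the console from its legacy OEM code page (e.g. 437) to UTF-8.
        SetConsoleOutputCP(CP_UTF8);
    #endif
        std::cout << "\u20A7 (U+20A7 PESETA SIGN)\n";  // printed as UTF-8 bytes
        return 0;
    }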
The Lazarus Wiki states:
"Lazarus (actually its LazUtils package) takes advantage of that API and changes it to UTF-8 (CP_UTF8). It means also Windows users now use UTF-8 strings in the RTL."
In our cross-platform and cross-compiler code, we'd like to detect this specific situation. The GetACP() Windows API function still returns 1252, and so does the GetDefaultTextEncoding() function in Lazarus. But the text (specifically, the filename returned by the FindFirst() function) is a UTF-8-encoded string, and the code page of that string variable is 65001 (CP_UTF8).
So how do we figure out that the RTL operates with UTF-8 strings by default? I've spent several hours trying to figure this out from the Lazarus source code, but I am probably missing something.
I understand that in many scenarios we need to inspect the code page of each specific string, but I am interested in a way to find out the default RTL code page, which is UTF-8 in Lazarus, yet the Windows-defined one in FPC on Windows without Lazarus.
It turns out that there is no single code page variable or function. The results of filesystem API calls are converted to the code page defined in the DefaultRTLFileSystemCodePage variable. The only problem is that this variable is present in the source code and is supposed to be in the System unit, but the compiler doesn't see it.
I'm teaching myself Pascal and thought mixing Pascal with Japanese sounded like a really good idea the other day, but it appears Pascal only accepts Japanese characters some of the time, and I don't really know why it accepts them at all. Is there something I need to include to allow writing in Japanese with Free Pascal?
You don't mention which Pascal, and you don't describe what "accepts" means: as identifiers or string literals in source code, or as text on standard input?
Delphi 2009 and higher support Unicode identifiers in their UTF-8 sources.
Free Pascal hasn't implemented this yet. It does allow UTF-8 source encoding though, and thus Unicode literals.
I've developed a little console C++ game that uses ASCII graphics, using cout for the moment. But because I want to make things work better, I have to use PDCurses. The thing is, curses functions like printw() or mvprintw() don't use the regular ASCII codes, and for this game I really need to use the smiley characters, hearts, spades and so on.
Is there a way to make curses work with the regular ASCII codes?
You shouldn't think of characters like the smiley face as "regular ASCII codes", because they really aren't ASCII at all. (ASCII only covers characters 32-127, plus a handful of control codes under 32.) They're a special case, and the only reason you're able to see them in (I assume?) your Windows CMD shell is that it's maintaining backwards compatibility with IBM Code Page 437 (or similar) from ancient DOS systems. Meanwhile, outside of the DOS box, Windows uses a completely different mapping, Windows-1252 (a modified version of ISO-8859-1), or similar, for its 8-bit, so-called "ANSI" character set. But both of these types of character sets are obsolete, compared to Unicode. Confused yet? :)
With curses, your best bet is to use pure ASCII, plus the defined ACS_* macros, wherever possible. That will be portable. But it won't get you a smiley face. With PDCurses, there are a couple of ways to get that smiley face: If you can safely assume that your console is using an appropriate code page, then you can pass the A_ALTCHARSET attribute, or'ed with the character, to addch(); or you can use addrawch(); or you can call raw_output(TRUE) before printing the character. (Those are all roughly equivalent.) Alternatively, you can use the "wide" build of PDCurses, figure out the Unicode equivalents of the CP437 characters, and print those, instead. (That approach is also portable, although it's questionable whether the characters will be present on non-PCs.)
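For illustration, here is a minimal PDCurses sketch of the A_ALTCHARSET route described above, assuming the console really is using CP437 or a similar code page (where the smiley face is character 1 and the heart is character 3):

    #include <curses.h>

    int main() {
        initscr();
        // OR'ing A_ALTCHARSET into the chtype makes PDCurses emit the raw
        // code-page character instead of treating 1 and 3 as control codes.
        addch(1 | A_ALTCHARSET);   // CP437 smiley face
        addch(3 | A_ALTCHARSET);   // CP437 heart
        refresh();
        getch();                   // wait for a key before exiting
        endwin();
        return 0;
    }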
May a symbol in the ELF symbol table use UTF-8 characters, or is it restricted to ASCII?
Note: this is not a problem I am trying to solve; it is more something I am wondering about.
ELF string tables use NUL-terminated strings, so you could possibly store UTF-8 encoded symbol names inside them.
That said, the tools that use such symbols would need to be Unicode-aware to work correctly (see the sketch after this list). For example:
Whether your programming language tool chain correctly classifies a specified Unicode 'character' as a letter, a numeral or punctuation.
Whether scripts that are written right-to-left (or top-to-bottom) can be used.
Whether symbols written in complex scripts (Arabic, Thai, etc) are rendered correctly by your system.
Whether characters from different scripts can be mixed when creating a symbol.
Whether sorting works as expected, for those tools that have to produce sorted outputs.
... etc.
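As a quick, non-authoritative experiment: with GCC you can force an arbitrary byte string into the symbol table through its asm-label extension, then inspect the result with nm or readelf -s. Whether the assembler, linker and debugger accept the non-ASCII bytes is exactly the kind of tool-support question listed above. The variable and symbol names here are made up for illustration:

    // Sketch: give a variable a UTF-8 symbol name via GCC's asm-label extension.
    // The quoted string is handed to the assembler verbatim, so the ELF string
    // table ends up holding raw UTF-8 bytes -- if the toolchain tolerates them.
    int precio asm("precio_en_\u20A7") = 42;   // symbol name contains U+20A7 (peseta sign)

    int main() { return precio; }

Compiling this with g++ and running nm on the resulting object file shows whether the name survived intact.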
I have an inquiry about the "Character set" option in Visual Studio. The Character Set options are:
Not Set
Use Unicode Character Set
Use Multi-Byte Character Set
I want to know: what is the difference between these three Character Set options?
Also, will the option I choose affect support for languages other than English (like RTL languages)?
It is a compatibility setting, intended for legacy code that was written for old versions of Windows that were not Unicode-enabled: the versions in the Windows 9x family, of which Windows ME was the last and widely ignored one. With "Not Set" or "Use Multi-Byte Character Set" selected, all Windows API functions that take a string as an argument are redirected to a little compatibility helper function that translates char* strings to wchar_t* strings, the API's native string type.
Such code critically depends on the default system code page setting. The code page maps 8-bit characters to Unicode, which selects the font glyph. Your program will only produce correct text when the machine that runs your code has the correct code page; characters whose value is >= 128 will be rendered wrong if the code page doesn't match.
Always select "Use Unicode Character Set" for modern code. Especially when you want to support languages with a right-to-left layout and you don't have an Arabic or Hebrew code page selected on your dev machine. Use std::wstring or wchar_t[] in your code. Getting actual RTL layout requires turning on the WS_EX_RTLREADING style flag in the CreateWindowEx() call.
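A minimal sketch of what that looks like in code, assuming the project is built with "Use Unicode Character Set" (so UNICODE and _UNICODE are defined and the wWinMain entry point is available); the class name and window text are made-up examples:

    #include <windows.h>

    LRESULT CALLBACK WndProc(HWND h, UINT m, WPARAM w, LPARAM l) {
        if (m == WM_DESTROY) { PostQuitMessage(0); return 0; }
        return DefWindowProcW(h, m, w, l);
    }

    int WINAPI wWinMain(HINSTANCE hInst, HINSTANCE, PWSTR, int nCmdShow) {
        WNDCLASSW wc = {};
        wc.lpfnWndProc   = WndProc;
        wc.hInstance     = hInst;
        wc.lpszClassName = L"RtlDemoClass";        // hypothetical class name
        RegisterClassW(&wc);

        // wchar_t text plus WS_EX_RTLREADING for right-to-left reading order,
        // as described above. The title is Hebrew "shalom" for demonstration.
        HWND hwnd = CreateWindowExW(WS_EX_RTLREADING, L"RtlDemoClass",
                                    L"\u05E9\u05DC\u05D5\u05DD",
                                    WS_OVERLAPPEDWINDOW, CW_USEDEFAULT, CW_USEDEFAULT,
                                    400, 200, nullptr, nullptr, hInst, nullptr);
        ShowWindow(hwnd, nCmdShow);

        MSG msg;
        while (GetMessageW(&msg, nullptr, 0, 0)) {
            TranslateMessage(&msg);
            DispatchMessageW(&msg);
        }
        return 0;
    }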
Hans has already answered the question, but I found these settings to have curious names. (What exactly is not being set, and why do the other two options sound so similar?) Regarding that:
"Unicode" here is Microsoft-speak for UCS-2 encoding in particular. This is the recommended and non-codepage-dependent described by Hans. There is a corresponding C++ #define flag called _UNICODE.
"Multi-Byte Character Set" (aka MBCS) here the official Microsoft phrase for describing their former international text-encoding scheme. As Hans described, there are different MBCS codepages describing different languages. The encodings are "multi-byte" in that some or all characters may be represented by multiple bytes. (Some codepages use a variable-length encoding akin to UTF-8.) Your typical codepage will still represent all the ASCII characters as one-byte each. There is a corresponding C++ #define flag called _MBCS
"Not set" apparently refers to compiling with_UNICODE nor _MBCS being #defined. In this case Windows works with a strict one-byte per character encoding. (Once again there are several different codepages available in this case.)
The question "Difference between MBCS and UTF-8 on Windows" goes into these issues in a lot more detail.