Can ELF symbols be represented in UTF-8?

May a symbol in the ELF symbol table use UTF-8 characters, or is it restricted to ASCII?
Note: this is not a problem I am trying to solve; it is just something I am wondering about.

ELF string tables hold NUL-terminated byte strings, so you could store UTF-8 encoded symbol names in them; UTF-8 never produces a zero byte except to encode NUL itself, so the names survive intact.
That said, the tools that use such symbols would need to be Unicode-aware to work correctly. For example:
Whether your programming language tool chain correctly classifies a specified Unicode 'character' as a letter, a numeral or punctuation.
Whether scripts that are written right-to-left (or top-to-bottom) can be used.
Whether symbols written in complex scripts (Arabic, Thai, etc) are rendered correctly by your system.
Whether characters from different scripts can be mixed when creating a symbol.
Whether sorting works as expected, for those tools that have to produce sorted outputs.
... etc.
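To make the storage side of this concrete, here is a minimal, hedged sketch. It assumes a compiler that accepts raw UTF-8 identifiers (recent Clang, GCC 10 and later); with C linkage the symbol name is simply the identifier's UTF-8 bytes, stored verbatim in the ELF string table.

    // Hedged sketch: assumes a compiler that accepts raw UTF-8 identifiers
    // (recent Clang, GCC 10+). With C linkage the symbol name is just the
    // identifier's UTF-8 bytes, stored verbatim in the ELF string table.
    // Inspect it with:  nm ./a.out | grep zähler   (or readelf -s ./a.out)
    extern "C" int zähler = 42;   // "counter", in German

    int main() {
        return zähler;
    }

Whether nm, your debugger, or your editor then classify, render, and sort that name sensibly is exactly what the list above is about.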

Related

Can mvprintw() and other curses functions work with the usual ASCII codes?

I've developed a little console C++ game that uses ASCII graphics, using cout for the moment. But because I want to make things work better, I have to use PDCurses. The thing is, curses functions like printw() or mvprintw() don't use the regular ASCII codes, and for this game I really need the smiley characters, hearts, spades and so on.
Is there a way to make curses work with the regular ASCII codes?
You shouldn't think of characters like the smiley face as "regular ASCII codes", because they really aren't ASCII at all. (ASCII only covers characters 32-127, plus a handful of control codes under 32.) They're a special case, and the only reason you're able to see them in (I assume?) your Windows CMD shell is that it's maintaining backwards compatibility with IBM Code Page 437 (or similar) from ancient DOS systems. Meanwhile, outside of the DOS box, Windows uses a completely different mapping, Windows-1252 (a modified version of ISO-8859-1), or similar, for its 8-bit, so-called "ANSI" character set. But both of these types of character sets are obsolete, compared to Unicode. Confused yet? :)
With curses, your best bet is to use pure ASCII, plus the defined ACS_* macros, wherever possible. That will be portable. But it won't get you a smiley face. With PDCurses, there are a couple of ways to get that smiley face: If you can safely assume that your console is using an appropriate code page, then you can pass the A_ALTCHARSET attribute, or'ed with the character, to addch(); or you can use addrawch(); or you can call raw_output(TRUE) before printing the character. (Those are all roughly equivalent.) Alternatively, you can use the "wide" build of PDCurses, figure out the Unicode equivalents of the CP437 characters, and print those, instead. (That approach is also portable, although it's questionable whether the characters will be present on non-PCs.)
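A hedged sketch of the wide-build approach described above (assuming PDCurses compiled with wide-character support; the CP437-to-Unicode mappings in the comments are the standard ones):

    /* Sketch, assuming a wide (PDC_WIDE) build of PDCurses: print the
       Unicode equivalents of a few CP437 glyphs instead of the raw codes. */
    #include <curses.h>

    int main(void)
    {
        initscr();
        addch((chtype)0x263A);   /* U+263A WHITE SMILING FACE (CP437 0x01) */
        addch((chtype)0x2665);   /* U+2665 BLACK HEART SUIT   (CP437 0x03) */
        addch((chtype)0x2660);   /* U+2660 BLACK SPADE SUIT   (CP437 0x06) */
        /* Narrow-build alternatives on a CP437 console:
           addrawch(1);  or  addch(1 | A_ALTCHARSET);  */
        refresh();
        getch();
        endwin();
        return 0;
    }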

Unicode in Bourne Shell source code

Is it safe to use UTF-8, and not just the 7-bit ASCII subset, in modern Bourne Shell interpreters, be it in comments (e.g., using box-drawing characters), or by passing arguments to a function or program? I'm considering whether filesystems can safely handle Unicode in path names outside of the scope of this question.
I know at least to not put a BOM in my shell scripts… ever, as that would break the kernel's shebang line parsing.
The thing about UTF-8 is that any old code that's just passing string data along and uses the C string convention of terminating strings with a null byte works fine. That generally characterizes how the shell handles command names and arguments.
Even if the shell does some string processing with special meanings for ASCII characters, UTF-8 still mostly works fine, because ASCII characters encode exactly the same way in UTF-8. So, for example, the shell will still be able to recognize all its keywords and syntax characters like []{}()<>/.?;'"$&* etc. That characterizes how string literals and other syntax bits of a script should be handled, for example.
You should be able to use UTF-8 in comments, string literals, command names, and command arguments. (Of course, the system will have to support UTF-8 file names to have UTF-8 commands, and the commands will have to handle UTF-8 command-line arguments.)
You may not be able to use UTF-8 in function names or variable names, since the shell may be looking for strings of ASCII characters there. That said, if your locale is UTF-8, an interpreter that uses the locale-based character classification functions internally might accept UTF-8 identifiers as well, but it's probably not portable.
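The byte-level property this relies on is easy to check outside the shell. A small, hedged C++ sketch (assuming a UTF-8 execution character set, the default for GCC and Clang) that just dumps the bytes:

    // In UTF-8, ASCII characters keep their one-byte values (< 0x80) and every
    // byte of a multi-byte sequence is >= 0x80, so a byte-oriented scanner can
    // never mistake part of a non-ASCII character for '$', '|', quotes, etc.
    #include <cstdio>

    int main() {
        const char text[] = "a$\u00e9|";              // 'é' (U+00E9) between ASCII characters
        for (const char *p = text; *p; ++p)
            std::printf("%02X ", (unsigned char)*p);  // prints: 61 24 C3 A9 7C
        std::printf("\n");
        return 0;
    }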
It really depends on what you are trying to do... In general, plain vanilla Bourne-derived shells cannot handle Unicode characters inside the scripts themselves, which means your script text must be purely 8-bit ASCII(+) if you care about portability. At the same time, pipes are completely encoding-neutral, so you can have things like a | b where a outputs UTF-8 and b receives it. So, assuming find is capable of handling UTF-8 paths and your tool for processing them can work with UTF-8 strings, you should be OK.
Multi-byte support was added to the Bourne Shell in 1989, and given that Unicode was introduced in 1992, you cannot expect UTF-8 support from a shell that is older than Unicode. SunOS added Unicode support once it became available.
So any Bourne Shell that was derived from the SVr4 Bourne Shell and compiled and linked to a modern library environment should support UTF-8 in scripts.
If you would like to verify that, you can get a portable version of the OpenSolaris Bourne Shell from the schily-tools: http://sourceforge.net/projects/schilytools/files/
osh is the original Bourne Shell made portable only.
sh is the Bourne Shell with modern enhancements.

About the "Character set" option in Visual Studio

I have an inquiry about the "Character set" option in Visual Studio. The Character Set options are:
Not Set
Use Unicode Character Set
Use Multi-Byte Character Set
I want to know: what is the difference between these three Character Set options?
Also, will the option I choose affect support for languages other than English (like RTL languages)?
It is a compatibility setting, intended for legacy code that was written for old versions of Windows that were not Unicode-enabled: the versions in the Windows 9x family, of which Windows ME was the last and widely ignored one. With "Not Set" or "Use Multi-Byte Character Set" selected, all Windows API functions that take a string as an argument are redefined to a little compatibility helper function that translates char* strings to wchar_t* strings, the API's native string type.
Such code critically depends on the default system code page setting. The code page maps 8-bit characters to Unicode which selects the font glyph. Your program will only produce correct text when the machine that runs your code has the correct code page. Characters whose value >= 128 will get rendered wrong if the code page doesn't match.
Always select "Use Unicode Character Set" for modern code. Especially when you want to support languages with a right-to-left layout and you don't have an Arabic or Hebrew code page selected on your dev machine. Use std::wstring or wchar_t[] in your code. Getting actual RTL layout requires turning on the WS_EX_RTLREADING style flag in the CreateWindowEx() call.
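A small, hedged sketch of what the setting actually toggles (Windows-only; the generic-text macros come from <tchar.h>):

    // Hedged sketch: the project setting only controls whether _UNICODE/UNICODE,
    // _MBCS, or neither are #defined; that decides what TCHAR, _T() and the
    // un-suffixed API names expand to.
    #include <windows.h>
    #include <tchar.h>

    int main()
    {
        // "Use Unicode Character Set": TCHAR = wchar_t, _T("x") = L"x",
        //                              MessageBox -> MessageBoxW
        // "Multi-Byte" / "Not Set":    TCHAR = char,    _T("x") = "x",
        //                              MessageBox -> MessageBoxA (code-page dependent)
        const TCHAR *text = _T("Hello");
        MessageBox(nullptr, text, _T("Character set demo"), MB_OK);
        return 0;
    }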
Hans has already answered the question, but I found these settings to have curious names. (What exactly is not being set, and why do the other two options sound so similar?) Regarding that:
"Unicode" here is Microsoft-speak for the UTF-16 encoding (historically UCS-2) in particular. This is the recommended, non-codepage-dependent setting described by Hans. There is a corresponding C++ #define flag called _UNICODE.
"Multi-Byte Character Set" (aka MBCS) is Microsoft's official phrase here for their former international text-encoding scheme. As Hans described, there are different MBCS code pages for different languages. The encodings are "multi-byte" in that some or all characters may be represented by multiple bytes. (Some code pages use a variable-length encoding akin to UTF-8.) Your typical code page will still represent all the ASCII characters as one byte each. There is a corresponding C++ #define flag called _MBCS.
"Not Set" apparently refers to compiling with neither _UNICODE nor _MBCS being #defined. In this case Windows works with a strict one-byte-per-character encoding. (Once again, there are several different code pages available in this case.)
The question "Difference between MBCS and UTF-8 on Windows" goes into these issues in a lot more detail.

How does Windows wchar_t handle Unicode characters outside the basic multilingual plane?

I've looked at a number of other posts here and elsewhere (see below), but I still don't have a clear answer to this question: how does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?
That is:
many programmers seem to feel that UTF-16 is harmful because it is a variable-length code.
wchar_t is 16 bits wide on Windows, but 32 bits wide on Unix/macOS
The Windows APIs use wide-characters, not Unicode.
So what does Windows do when you want to encode something like 𠂊 (U+2008A), a Han character outside the BMP?
The implementation of wchar_t under the Windows stdlib is UTF-16-oblivious: it knows only about 16-bit code units.
So you can put a UTF-16 surrogate sequence in a string, and you can choose to treat that as a single character using higher level processing. The string implementation won't do anything to help you, nor to hinder you; it will let you include any sequence of code units in your string, even ones that would be invalid when interpreted as UTF-16.
Many of the higher-level features of Windows do support characters made out of UTF-16 surrogates, which is why you can call a file 𐐀.txt and see it both render correctly and edit correctly (taking a single keypress, not two, to move past the character) in programs like Explorer that support complex text layout (typically using Windows's Uniscribe library).
But there are still places where you can see the UTF-16-obliviousness shining through, such as the fact you can create a file called 𐐀.txt in the same folder as 𐐨.txt, where case-insensitivity would otherwise disallow it, or the fact that you can create [U+DC01][U+D801].txt programmatically.
This is how pedants can have a nice long and basically meaningless argument about whether Windows “supports” UTF-16 strings or only UCS-2.
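A quick, hedged illustration of the code-unit behaviour described above (the values in the comments assume Windows's 16-bit wchar_t):

    // On Windows, wchar_t is 16 bits, so U+2008A is stored as the surrogate
    // pair 0xD840 0xDC8A and the wide-string functions count code units.
    #include <cstdio>
    #include <cwchar>

    int main()
    {
        const wchar_t *s = L"\U0002008A";                  // the Han character U+2008A
        std::printf("code units: %zu\n", std::wcslen(s));  // 2 on Windows, 1 where wchar_t is 32-bit
        std::printf("s[0] = 0x%04X\n", (unsigned)s[0]);    // 0xD840 (high surrogate) on Windows
        return 0;
    }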
Windows used to use UCS-2 but adopted UTF-16 with Windows 2000. Windows wchar_t APIs now produce and consume UTF-16.
Not all third party programs handle this correctly and so may be buggy with data outside the BMP.
Also, note that UTF-16, being a variable-length encoding, does not conform to the C or C++ requirements for an encoding used with wchar_t. This causes some problems: for example, standard functions that take a single wchar_t, such as wctomb, can't handle characters beyond the BMP on Windows, and Windows defines some additional functions that use a wider type in order to handle single characters outside the BMP. I forget which function it was, but I ran into a Windows function that returned int instead of wchar_t (and it wasn't one where EOF was a possible result).

What are some example use cases for symbol literals in Scala?

The use of symbol literals is not immediately clear from what I've read up on Scala. Would anyone care to share some real world uses?
Is there a particular Java idiom being covered by symbol literals? What languages have similar constructs? I'm coming from a Python background and not sure there's anything analogous in that language.
What would motivate me to use 'HelloWorld vs "HelloWorld"?
Thanks
In Java terms, symbols are interned strings. This means, for example, that reference equality comparison (eq in Scala and == in Java) gives the same result as normal equality comparison (== in Scala and equals in Java): 'abcd eq 'abcd will return true, while "abcd" eq "abcd" might not, depending on the JVM's whims (well, it should for literals, but not for strings created dynamically in general).
Other languages which use symbols are Lisp (which uses 'abcd like Scala), Ruby (:abcd), Erlang and Prolog (abcd; they are called atoms instead of symbols).
I would use a symbol when I don't care about the structure of a string and use it purely as a name for something. For example, if I have a database table representing CDs, which includes a column named "price", I don't care that the second character in "price" is "r", or about concatenating column names; so a database library in Scala could reasonably use symbols for table and column names.
If you have plain strings representing, say, method names in code that perhaps get passed around, you're not quite conveying things appropriately. This is sort of the data/code boundary issue; it's not always easy to draw the line, but if we were to say that in that example those method names are more code than they are data, then we want something to clearly identify that.
A symbol literal comes into play where it clearly differentiates plain old string data from a construct being used in the code. It's really there for when you want to indicate that this isn't just some string data, but is in fact, in some way, part of the code. The idea is that things like your IDE would highlight it differently and, given the tooling, you could refactor on those rather than doing text search/replace.
This link discusses it fairly well.
Note: Symbols will be deprecated and then removed in Scala 3 (dotty).
Reference: http://dotty.epfl.ch/docs/reference/dropped-features/symlits.html
Because of this, I personally recommend not using Symbols anymore (at least in new scala code). As the dotty documentation states:
Symbol literals are no longer supported
it is recommended to use a plain string literal [...] instead
Python maintains an internal global table of "interned strings" with the names of all variables, functions, modules, etc. With this table, the interpreter can make faster searches and optimizations. You can force this process with the intern function (sys.intern in Python 3).
Also, Java and Scala automatically use interned strings for faster searches. With Scala, you can use the intern method to force the interning of a string, but this doesn't work with all strings. Symbols benefit from being guaranteed to be interned, so a single reference-equality check is sufficient to establish equality or inequality.
