How many valid utf8 characters are there? - utf-8

I know that this is a little vague, so for context, think of it as "a character you could tweet," or something like that. My question is how many valid unicode characters are there that a browser or a service that supports utf8 could resolve, in such a way that a utf8 browser could copy and paste it around without any issues.
I guess what I don't want is the full character space, because I know a lot of it is reserved for command characters or reserved characters that wouldn't be shown (unless I'm super wrong!).

UTF-8 isn't the important factor, since all of the standard Unicode encodings (UTF-8, UTF-16, UTF-32) encode the same character space, just in different ways.
From your explanation I see you don't just want the 1,112,064 valid Unicode code points?
Unicode 6.0 and ISO/IEC 10646:2010 define 109,449 characters, but a handful of those are what you're calling "control characters". Which ones do or don't fall into that category depends on how you're counting. Copying and pasting may result in some characters being treated as identical to one another, or ignored altogether, depending on the OS and the programs doing the copying and pasting.
However because Unicode is forward compatible, some systems will correctly preserve characters which haven't yet been assigned. After all, just because you're running Windows XP and you copy and paste a document with characters that weren't standardised until 2009 doesn't mean you expect them to vanish. There could be a million or so extra possible characters by this way of thinking, although their visual appearance may be indistinguishable in some places.
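The 1,112,064 figure above can be reproduced with a little arithmetic: it is the full code space (U+0000 to U+10FFFF) minus the 2,048 surrogate code points, which no UTF encoding may represent on their own. The sketch below shows that calculation and, as a rough illustration only, tallies code points by general category using Python's bundled `unicodedata` tables. The function names are made up for this example, and the "assigned" total depends on which Unicode version your Python ships with, so it will not match the Unicode 6.0 figure quoted above.

```python
import unicodedata
from collections import Counter

MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF


def scalar_value_count() -> int:
    """Code points that UTF-8/16/32 can encode: everything except surrogates."""
    total = MAX_CODE_POINT + 1                          # 1,114,112
    surrogates = SURROGATE_LAST - SURROGATE_FIRST + 1   # 2,048
    return total - surrogates


def category_counts() -> Counter:
    """Tally every code point by its Unicode general category (Lu, Cc, Cn, ...)."""
    return Counter(
        unicodedata.category(chr(cp)) for cp in range(MAX_CODE_POINT + 1)
    )


counts = category_counts()

# Cn = unassigned (and noncharacters), Cs = surrogates, Co = private use.
# What is left is one possible definition of "assigned characters";
# subtract Cc (control) and Cf (format) as well if you only want visible ones.
assigned = sum(n for cat, n in counts.items() if cat not in ("Cn", "Cs", "Co"))

print("encodable code points:", scalar_value_count())
print("control characters (Cc):", counts["Cc"])
print("assigned in this Python's Unicode version:", assigned)
```

Which categories you subtract is exactly the "depends on how you're counting" part of the answer.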

Related

How to create filename with characters that are not part of UTF-8 on Windows?

[Edit/Disclaimer]: Comments pointed out that I have to clarify the encoding the user uses. Will update accordingly
I have a customer from China who recently reported an issue with their filenames on Windows. The software works with most Chinese characters, but it seems he has found one file that fails.
Unfortunately, they are not able to send me the filename, as neither zipping nor transmitting the file through other mediums seems to preserve it.
What is the easiest way (e.g. through Python) to generate a filename on Windows that is covered by the NTFS file system encoding but not UTF8?
Unicode strings are encoded as a series of bytes. An encoding is the set of rules an operating system uses to turn those bytes back into the characters you see.
Given that Windows uses (a variation of) Unicode, and you say you have a character that's not in Unicode, there is simply no way to represent that character.
Imagine if Unicode only contained the digits 0-9, and you asked someone how to encode the letter A. There's no answer to this, because only 0-9 are defined.
You could make up a new Unicode code point for your character, but then operating systems won't know what to do with it unless you also make your own font files.
I somehow doubt that that's what you want to do though, but it's an option. Could your customer rename the file before sending it to you?
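To illustrate the point that UTF-16 and UTF-8 cover the same characters, here is a minimal Python sketch. It takes a made-up filename containing a rarer Chinese character from outside the Basic Multilingual Plane (U+20000) and checks that it survives a trip through both encodings. The helper name and the sample filename are assumptions for this example only; they are not taken from the customer's case.

```python
def round_trips(name: str) -> bool:
    """Return True if `name` decodes back to itself from both UTF-16 and UTF-8."""
    as_utf16 = name.encode("utf-16-le")  # the 16-bit form Windows uses internally
    as_utf8 = name.encode("utf-8")
    return as_utf16.decode("utf-16-le") == as_utf8.decode("utf-8") == name


# Hypothetical filename: U+20000 is a CJK ideograph outside the BMP,
# followed by two common BMP characters.
sample_name = "\U00020000\u4e2d\u6587.txt"

print(round_trips(sample_name))
print(len("\U00020000".encode("utf-16-le")), "bytes in UTF-16 (a surrogate pair)")
print(len("\U00020000".encode("utf-8")), "bytes in UTF-8")
```

If a tool mangles such a name, the likelier culprit is software that assumes a legacy code page or handles only BMP characters, rather than a character that UTF-8 cannot express.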

Ensuring consistency when encoding to UTF8 from extended ASCII

Maybe this is a non-issue but I look to the collected wisdom of SO to help me find out.
We're trying to ensure encodings are consistent across platforms. The way to go is clearly UTF8. However, some platforms unfortunately use extended ASCII (typically some form of Windows codepage). We're concerned that when encoding something with, say, an umlaut from a Windows codepage to UTF8, there are multiple possible choices within UTF8 for the character.
On a different platform (Linux, Mac OS), how do we ensure that the UTF8 character chosen there is consistent?
As I said, maybe this is a non-issue. Maybe there is some standard mapping I'm unaware of. We haven't seen any problems but a colleague just raised the concern so I'm on the hunt for information.
Thank you all in advance.
As long as you properly convert the original text to Unicode first and then use UTF-8 to store/transfer the data, there should be no problems.
The Unicode Consortium has compiled a set of mapping tables. Nominally informational, they constitute a de facto standard. Moreover, many of the mappings there reflect formal standards, as it has become normal to define any new character encoding in terms of Unicode, i.e. by specifying the Unicode number (and/or Unicode name) of each character.
Once a character has been mapped to Unicode (i.e., to a Unicode code point, or Unicode number), its encoding in each Unicode encoding, such as UTF-8, has been defined unambiguously.
So the issue is how you ensure that the conversion routines you use work according to those tables. Using ICU can be regarded as safe in this respect.
P.S. There is no extended ASCII. There are various character encodings, some of which coincide with ASCII in the range from 0 to 0x7F, some don’t.
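A short Python sketch of the two-step conversion described above, assuming the source bytes are Windows-1252 (Python's `cp1252` codec). The function name is made up for this example. The optional NFC normalization step is an addition of mine, not something the answers above prescribe: it guards against the one real source of "multiple choices", which is text that already contains a decomposed sequence such as `u` followed by a combining diaeresis.

```python
import unicodedata


def cp1252_to_utf8(raw: bytes) -> bytes:
    """Decode Windows-1252 bytes to Unicode, normalize to NFC, encode as UTF-8."""
    text = raw.decode("cp1252")                # byte 0xFC -> U+00FC, per the mapping table
    text = unicodedata.normalize("NFC", text)  # no-op for precomposed characters
    return text.encode("utf-8")                # each code point has exactly one UTF-8 form


precomposed = cp1252_to_utf8(b"\xfc")          # "ü" as a single code point
decomposed = "u\u0308".encode("utf-8")         # "u" + COMBINING DIAERESIS, un-normalized

print(precomposed)   # b'\xc3\xbc'
print(decomposed)    # b'u\xcc\x88'
```

The code page mapping always yields the precomposed U+00FC, so the decomposed form only appears if some other source (for example, file names from Mac OS X) introduces it.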

Right single apostrophe vs. apostrophe?

Right single quotation mark (U+2019)
vs.
Apostrophe (U+0027)
What is the difference between these two characters?
I ran into this issue where I use CAtlString to load a string from a resource file, and on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations. The U+2019 character appears in strings in my resource file that I copied from Word, and U+0027 appears in strings that I hand coded. Why does LoadString (sometimes) choke on this?
What is the difference between these two characters?
Arguable!
Going by the names, one would imagine that the curly ‹’› is only for use as a quotation mark, and that the straight ‹'› is only for use as a real apostrophe, an indicator of omitted letters.
However traditional typesetting practice in English is always to use a curly ‹’› to render an apostrophe. Personally—and I may be alone here—I don't like this. It can make for more ambiguous reading:
“He said, ‘It’s fish ’n’ chips’...”
with the apostrophes being straight it's (marginally) clearer where the quotation ends:
“He said, ‘It's fish 'n' chips’...”
and the apostrophe being ‘straight’ makes more sense to me because its purpose of indicating omitted letters has no inherent directionality, whereas quotation marks are clearly asymmetrical in purpose.
In traditional ASCII, of course, there are no smart quotes, so the apostrophe is always used for both...
on some Windows installations, the LoadString fails when trying to load a string that contains U+2019, but it works on some other Windows installations.
Here you are meeting the horror of the ‘ANSI’ code page. This is a default character encoding that is different across different Windows install locales. So on a machine in the Western region, you get different results when you read a resource to when you read it on a Japanese Windows.
It is highly unfortunate that Windows has varying default code pages instead of using a single global encoding like UTF-8, but it's too late to fix now. If you compile your whole application as a Unicode app (so you'll be using LoadStringW rather than LoadStringA) then you can cope with non-ASCII characters like the smart quotes much better.
If you can't move to a Unicode application you're a bit stuck. You won't be able to handle non-ASCII characters like the smart quotes globally, so stick with ASCII characters like the straight apostrophe ‹'› alone.
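The code page dependence can be seen without touching the Windows API. The Python sketch below is only an illustration (it is not what LoadStringA does internally): it tries to encode both characters in a few encodings and reports the bytes, or None where the character has no representation. The helper name is hypothetical.

```python
from typing import Optional


def encode_or_none(text: str, encoding: str) -> Optional[bytes]:
    """Return the encoded bytes, or None if `encoding` cannot represent `text`."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


STRAIGHT = "\u0027"  # APOSTROPHE
CURLY = "\u2019"     # RIGHT SINGLE QUOTATION MARK

for encoding in ("ascii", "latin-1", "cp1252", "cp932", "utf-8"):
    print(
        encoding,
        encode_or_none(STRAIGHT, encoding),
        encode_or_none(CURLY, encoding),
    )
```

The straight apostrophe comes out as the same single byte everywhere, whereas the curly quote is either missing or encoded differently depending on the encoding, which is why sticking to ASCII is the safe fallback for a non-Unicode build.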
The U+2019 character appears in strings in my resource file that I copied from Word
Yes, Word has an annoying AutoCorrect feature that replaces all apostrophes you type with smart quotes. This is especially undesirable when you are dealing with code, where ‹’› will break the program; but it's also wrong even for plain old English, as it's not possible to correctly guess the desired direction of the quote. (It'll get one of the apostrophes in “fish 'n' chips” the wrong way round, for example.)
I suggest turning off the automatic-replace-with-smart-quotes feature. If you want the smart quotes, it's better to type them deliberately. Unfortunately they are inconvenient to type on most keyboard layouts, often requiring obscure Alt+numpad sequences. Personally I use this one to drop them onto Alt+[] keys.
Historically, single-quote and double-quote come in pairs, left (open) and right (close).
For many years the character sets of computers were limited, having a single form of each.
Now, with the advent of Unicode, the full forms are available, but support for them is still limited. Programming languages still use the simple forms, and the full forms can still cause problems.

Windows API: ANSI and Wide-Character Strings -- Is it UTF8 or ASCII? UTF-16 or UCS-2 LE?

I'm not quite pro with encodings, but here's what I think I know (though it may be wrong):
ASCII is a 7-bit, fixed-length encoding, with the characters you can find in ASCII charts.
UTF8 is an 8-bit, variable-length encoding. All characters can be written in UTF8.
UCS-2 LE/BE are fixed-length, 16-bit encodings that support most common characters.
UTF-16 is a 16-bit, variable-length encoding. All characters can be written in UTF16.
Are those above all correct?
Now, for the questions:
Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
(I had more questions, but this is enough... I forgot some of them anyway...)
These are a lot of questions, so any links to explanations about how all these connect (aside from reading the Unicode standard, which won't help with the Windows API anyway) would also be greatly appreciated.
Thank you!
Are those above all correct?
Yes, if you don't assume the existence of characters not encoded in Unicode (for most practical applications, this assumption is fine).
Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
They take byte strings (i.e., strings whose code unit is a byte, which is always an octet on Windows) encoded in the current "ANSI"/MBCS/legacy encoding. "ANSI" is the historical term for these encodings, but it is not correct. For Western Windows systems, this encoding is usually Windows-1252.
Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
Since Windows 2000, most of them support UTF-16. The name "wide" and the rest of the Microsoft terminology (e.g., "Unicode" meaning "UTF-16" or "UCS") were chosen before the modern Unicode standard unified the terminology.
In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
Every other encoding that WideCharToMultiByte supports is a "multi-byte encoding" in this context, including Windows-1251 and UTF-8.
Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
LPWSTR is a pointer to wchar_t which is always a 16-bit unsigned integer on Windows. Which characters can be displayed is unrelated to the encoding as long as that encoding can encode all Unicode characters. Windows is generally able to display non-BMP characters, but not everywhere (e.g., the console cannot).
Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
Don't really know, but I don't think they differ too much. I suppose you just try to convert some non-BMP character to UTF-8 and look whether the result is correct.
Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
File paths are indeed opaque arrays of UTF-16 characters, meaning that Windows doesn't perform any kind of translation when storing or reading file names (like Linux and unlike Mac OS X). But Windows still has its weird mostly-undefined case insensitive behavior which causes much trouble because file names that are treated equivalent aren't necessarily equal. That breaks many invariants; for example, on Linux without interference from other threads, if you successfully create two files A and a in some directory, you'll end up with two distinct files, while on Windows you get only one file (and in general, an unpredictable number of files).
What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
ANSI is the American standardization organization. Using this word when referring to encodings is a misnomer, but a frequent one, so you should be aware of it. I prefer the term legacy 8-bit encoding, because I think that's essentially what it is: a non-Unicode encoding that is kept only for compatibility with legacy (Windows 9x) applications. On Western systems, this is usually Windows-1252, which is a proper superset of ASCII.
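As a rough illustration of the UTF-16 versus UCS-2 distinction and the "4-byte characters" question above, this Python sketch splits a string into its 16-bit code units. A BMP character takes one unit; a non-BMP character takes two (a surrogate pair), which is exactly what UCS-2 cannot express. The helper name is made up for this example, and Python's codecs stand in for the Windows conversion functions.

```python
def utf16_code_units(text: str) -> list:
    """Return the 16-bit UTF-16 code units of `text` as integers."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


bmp_char = "A"               # U+0041, inside the BMP
non_bmp_char = "\U0001F600"  # U+1F600, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])      # one code unit
print([hex(u) for u in utf16_code_units(non_bmp_char)])  # high + low surrogate
print(len(non_bmp_char.encode("utf-8")), "bytes in UTF-8")

# Windows-1252 agrees with ASCII for every byte in the 0x00-0x7F range.
ascii_bytes = bytes(range(0x80))
print(ascii_bytes.decode("cp1252") == ascii_bytes.decode("ascii"))
```

So a wchar_t buffer on Windows can hold a non-BMP character; it simply occupies two array elements instead of one.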
Wide strings used to be UCS-2. From Windows 2000, wide strings are UTF-16. Good to know if you need to maintain some old legacy system.
*A functions use the active ANSI codepage.
*W functions use UTF-16.
Multi-byte refers to whatever is passed in the CodePage parameter. It is most commonly either the active ANSI codepage or UTF-8.
LPWSTR is a UTF-16 string which may or may not be null-terminated (see MSDN)
I don't know anything about wcstombs, I always use WideCharToMultiByte.
File paths are in UTF-16. In fact all text is UTF-16 internally in Windows.
For ANSI encoding you will need to read up on that in some detail. You could do worse than to start with Wikipedia and follow the links from there.
I hope that helps and that if I've got anything wrong, anyone who knows more please do edit this to correct any errors!
First of all you'll find plenty of information in this SO topic.
ASCII is a charset, not encoding. Now, there's a number of 8-bit charsets, one of them being set as default in the system (you can change it in Regional Settings). *A functions accept 8-bit characters in that charset. UTF-8 is not a charset, but encoding of Unicode charset. *W functions, as I understand, use UTF-16 rather than UCS-2.

ASCII in Windows XP and Ubuntu Linux

I've made a program in MSVC++ which outputs memory contents (in ASCII). The ASCII I see in the Windows console seems to match what I see in various ASCII tables (smiley, diamond, club, right arrow etc). This program needs to compile under Linux (which it does), but the ASCII output looks completely different. A few symbols are the same but the rest are very different. Is there any way to change how the terminal displays ASCII codes?
EDIT: The program executes correctly, it's just the ASCII that is being displayed differently.
ASCII defines character codes from 0x00 through 0x7f. Everything else (0x80-0xff) is not part of the ASCII standard and depends on what the operating system defines as the characters to display. However, the characters you mention (smiley, diamond, club, etc) are the representations of the ASCII "control characters" that don't normally have a visual representation. Windows lets you print such characters and see the glyphs it has defined for them, but your Linux is probably interpreting the control characters as formatting control codes (which they are) instead of printing corresponding glyphs.
What you are seeing is the "extended" character set that IBM initially included when PCs were first unleashed upon the world. Yes, we are going back to the age of mighty dinosaurs, so bear with me. These characters live above $7F and the interpretation of their symbols on the screen can even be influenced by the font chosen. Most linux distros are now using UTF-8 (or something close) and as such, the fonts installed may have completely different symbols, or even missing glyphs. In cases where you are comparing "ASCII" representations (which is a misnomer, as it's not really true ASCII) of the same data, it may or may not exactly match, as you must have the same "glyph" renderings in both display fonts to correctly see similar representations. Try getting both your Windows and Linux installs to use the same font if possible, and then see if there is a change.
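One way to get the same output on both systems is to stop relying on the terminal's interpretation of raw bytes and map each byte to a Unicode character yourself. The sketch below is in Python rather than C++ purely for brevity, and the function name and the choice of `.` as a placeholder are assumptions for this example. It prints printable ASCII as-is, decodes bytes 0x80-0xFF through the code page 437 table, and replaces control bytes (0x00-0x1F and 0x7F) with a dot, as most hex dump tools do.

```python
def render_bytes(data: bytes, placeholder: str = ".") -> str:
    """Render raw memory bytes as text that looks the same on any UTF-8 terminal."""
    out = []
    for b in data:
        if 0x20 <= b <= 0x7E:
            out.append(chr(b))                      # printable ASCII
        elif b >= 0x80:
            out.append(bytes([b]).decode("cp437"))  # IBM PC "extended" glyphs
        else:
            out.append(placeholder)                 # control codes: never sent raw
    return "".join(out)


print(render_bytes(b"A\x01\x80\xb0\xe0"))
```

Python's `cp437` codec maps 0x00-0x1F to the control characters themselves, not to the smiley and card-suit glyphs, so reproducing those would need a separate lookup table.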
If your browser supports Unicode (and you have the correct fonts installed), you will see them below.
You can copy and paste them into an editor with Unicode support (such as Notepad) and save the file as UTF-16BE.
If you then open the file in a hex editor, you will see the Unicode code point of each visible glyph.
For example, the first ASCII character, NUL, is shown here with the visible glyph U+2639.
In C, C++ or Java you can write it as \u2639.
It's not a null character, just its visual representation.
http://en.wikipedia.org/wiki/Code_page_437
☹☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼ !"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~⌂ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜ¢£¥₧ƒáíóúñѪº¿⌐¬½¼¡«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀αßΓπΣσµτΦΘΩδ∞φε∩≡±≥≤⌠⌡÷≈°∙·√ⁿ²■⓿

Resources