Why can't UTF-8 handle Chinese characters correctly in Windows 10's CMD?

I have already set UTF-8 as the overall encoding:
When I set the code page to 936 (the one for GBK, which covers Chinese),
Chinese characters are displayed properly.
When I change the code page to 65001 (UTF-8, which covers all characters),
Chinese characters are displayed improperly.
My question
How do I make the Windows cmd handle all characters properly?
I don't want to use GBK, which can only handle Chinese; that would mean switching to yet another encoding whenever I handle other languages (e.g. Japanese, Korean). So I want to change everything to UTF-8 and get out of encoding hell.
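A program can also switch the code page for itself. Below is a minimal Go sketch (Go being the language used elsewhere on this page) that calls the documented Win32 API SetConsoleOutputCP, the programmatic equivalent of running chcp 65001, before printing non-ASCII text. Note that the code page alone is not enough: the console font must be a TrueType font that actually contains the CJK glyphs, which the legacy raster fonts do not.

// Windows-only sketch: switch the console to code page 65001 (UTF-8)
// before printing, the programmatic equivalent of `chcp 65001`.
package main

import (
	"fmt"
	"syscall"
)

func main() {
	kernel32 := syscall.NewLazyDLL("kernel32.dll")
	setConsoleOutputCP := kernel32.NewProc("SetConsoleOutputCP")
	setConsoleOutputCP.Call(uintptr(65001)) // 65001 = CP_UTF8

	// Go string literals are UTF-8, so this renders correctly only once
	// the console interprets output as UTF-8 and the console font has
	// the required CJK glyphs.
	fmt.Println("你好, こんにちは, 안녕하세요")
}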

Related

Encoding Special Characters for ISO-8859-1 API

I'm writing a Go package for communicating with a 3rd-party vendor's API. Their documentation states roughly this:
Our API uses the ISO-8859-1 encoding. If you fail to use ISO-8859-1 for encoding special characters, this will result in unexpected errors or malformed strings.
I've been doing research on the subject of charsets and encodings, trying to figure out how to "encode special characters" in ISO-8859-1, but based on what I've found this seems to be a red herring.
From a StackOverflow answer:
UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way.
ISO-8859-1 is a single-byte encoding where each possible byte value maps to a specific character. It's certainly within my power to have my HTTP POST body encoded in this way, but it cannot represent any characters beyond the 256 defined in the spec.
I gather that, to encode a special character (such as the Euro symbol, which is not among those 256) in ISO-8859-1, it would first need to be escaped in some way.
Is there some kind of standard ISO-8859-1 escaping? Would it suffice to URL-encode any special characters and then encode my POST body in ISO-8859-1?
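There is no standard byte-level "ISO-8859-1 escaping"; escaping (URL-encoding, HTML entities) is a separate layer the vendor would have to specify. What you can control is that the body bytes themselves are Latin-1, and that you fail loudly on characters the charset cannot represent. A minimal Go sketch using the golang.org/x/text/encoding/charmap package (the sample strings are my own, not from the vendor's API):

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	enc := charmap.ISO8859_1.NewEncoder()

	// Everything here is within Latin-1, so encoding succeeds.
	body, err := enc.String("Bjørn Ångström")
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", body) // ø -> f8, Å -> c5, ö -> f6

	// The Euro sign has no ISO-8859-1 byte, so the encoder errors out
	// instead of silently mangling the string.
	_, err = enc.String("price: 5€")
	fmt.Println(err)
}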

Difference between UTF-8 and en_AU.UTF-8

I have some text in UTF-8 and it still displays wrong in my text editor (the editor's encoding is set to UTF-8). I know that, for instance, ISO-8859-2 is a one-byte encoding compatible with ASCII whose upper 128 values are specific to a territory, so people from that territory can still use a one-byte encoding to show characters outside ASCII and don't need a multibyte encoding like UTF-8. What purpose does the en_AU part of en_AU.UTF-8 serve? Could it somehow be the reason why I still see my text garbled even though it is UTF-8, i.e. that some values are mapped to different characters when en_AU is used? As I understand UTF-8 that's not possible, but it's the last thing left that could explain why the text is garbled.
Output from the locale command on Linux:
LANG=en_US.UTF-8
LANGUAGE=en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=cs_CZ.UTF-8
LC_TIME=cs_CZ.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=cs_CZ.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=cs_CZ.UTF-8
LC_NAME=cs_CZ.UTF-8
LC_ADDRESS=cs_CZ.UTF-8
LC_TELEPHONE=cs_CZ.UTF-8
LC_MEASUREMENT=cs_CZ.UTF-8
LC_IDENTIFICATION=cs_CZ.UTF-8
LC_ALL=
In UNIX systems, locales are files on disk, and each is stored in a specific encoding. So you may have the same locale in different encodings, e.g. en_AU.iso88591 and en_AU.UTF-8. The suffix is not some variation of UTF-8; rather, it names the UTF-8 variant of this specific locale file. If your locales use the UTF-8 variants, then anything that goes through the locale system will output UTF-8-encoded values.
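In other words, the part before the dot selects conventions (collation, date, number and currency formats, as in the LC_* values above), and only the suffix after the dot names the byte encoding. A small Go sketch of that split; the parsing is purely illustrative, not a system API:

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	locale := os.Getenv("LC_ALL")
	if locale == "" {
		locale = os.Getenv("LANG") // e.g. "en_AU.UTF-8"
	}

	// A POSIX locale name is language_TERRITORY.CODESET; only the
	// CODESET part determines how characters are encoded as bytes.
	name, codeset, hasCodeset := strings.Cut(locale, ".")
	fmt.Println("conventions from:", name) // e.g. "en_AU"
	if hasCodeset {
		fmt.Println("byte encoding:", codeset) // e.g. "UTF-8"
	}
}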

Strange characters in text saved with Brackets

When I close the file and reopen it, I get strange characters, especially where words have accents.
Here is an example:
Este exto es la creación de mí probia autoría
(Spanish, roughly: "This text is the creation of my own authorship")
What you see is the byte representation of your string, which is UTF-8. UTF-8 is a multibyte encoding, meaning that some characters (e.g. those with accents) are saved as several bytes; misread one byte at a time, those sequences usually start with Ã.
Your application probably doesn't understand that the string is UTF-8 and prints it as a raw byte sequence. You should use a text editor that can display your UTF-8 text correctly.
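To make the failure mode concrete, here is a short Go sketch (a deliberate reproduction using the golang.org/x/text/encoding/charmap package) that takes correct UTF-8 bytes and misreads them as ISO-8859-1, one byte per character, producing exactly this kind of Ã garbage:

package main

import (
	"fmt"

	"golang.org/x/text/encoding/charmap"
)

func main() {
	utf8Bytes := []byte("creación") // ó is the two bytes 0xC3 0xB3 in UTF-8

	// Interpret those bytes as ISO-8859-1, one byte = one character,
	// which is what an editor set to a single-byte encoding does.
	garbled, err := charmap.ISO8859_1.NewDecoder().Bytes(utf8Bytes)
	if err != nil {
		panic(err) // ISO-8859-1 decoding cannot actually fail
	}
	fmt.Println(string(garbled)) // prints "creaciÃ³n"
}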

German Umlaut displayed wrong despite correct Charset

I am encountering a weird problem regarding the encoding of my files.
I have a site which is multilingual; users can set the language via a dropdown on the site itself, the default being German.
When a user logs in, some settings are applied depending on the language (charset, codepage and LCID). I also want to point out that all my files are ANSI-encoded.
Recently, I had to make some changes.
So I fired up Visual Studio 2010, edited the files in question and uploaded them to my server using FileZilla.
And now, all of a sudden, the German umlauts (Ää, Öö, Üü, ß) are displayed incorrectly (something like ä), but only in the files I opened with VS2010.
I checked the charset on the site itself, and also by displaying it with Response.CharSet: it was ISO-8859-1, which is correct.
So I tried some conversions with Notepad++, but without success.
I know that setting the charset to UTF-8 would solve this problem, but a) the charset is set from a database value and b) it kind of messes things up in other languages.
You are displaying a UTF-8 encoded file in an ISO-8859-1 view. Usually you want to see just one character, so why do you see two instead? Because in UTF-8 the German letter ä is a 2-byte sequence (0xC3 0xA4). If this is displayed not as UTF-8 but as ISO-8859-1, where one byte is one character, you get exactly what you described: the start byte 0xC3 shows up as the single ISO-8859-1 character Ã, and the following byte 0xA4 as the single ISO-8859-1 character ¤. To decode this 2-byte sequence as UTF-8, you extract the payload bits of the start byte and the continuation byte like this:
Start byte:        11000011
Continuation byte: 10100100
The 110 prefix of the start byte is stripped off, leaving the payload bits 00011.
The 10 prefix of the continuation byte is stripped off, leaving the payload bits 100100.
Chained together this becomes 11100100, which is decimal 228: the code point of the German letter ä (U+00E4).
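A quick Go sketch that double-checks this arithmetic against the standard library's UTF-8 decoder (the byte values are the ones from the walkthrough above):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	seq := []byte{0xC3, 0xA4} // "ä" encoded as UTF-8

	hi := rune(seq[0] & 0b0001_1111) // payload bits of the start byte: 00011
	lo := rune(seq[1] & 0b0011_1111) // payload bits of the continuation: 100100
	manual := hi<<6 | lo             // 11100100 in binary, 228 in decimal

	r, _ := utf8.DecodeRune(seq)
	fmt.Printf("manual: %d (%c), stdlib: %d (%c)\n", manual, manual, r, r)
	// Both print 228, i.e. U+00E4 'ä'.
}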
I recommend leaving the encoding as it is: UTF-8. It is just your viewer/editor that needs to display UTF-8 encoded files as UTF-8 rather than as ISO-8859-1. In other words, configure the viewer's/editor's encoding according to the encoding of the file's content (which in your case is UTF-8, NOT ISO-8859-1).
To convert your files or check them for a certain encoding, you can use MadEdit. It has a built-in hex editor which draws a rectangle around each UTF-8 sequence, displaying just one character on the right side (the decoded code point). That makes it easy to identify single-byte characters and 2/3/4-byte sequences within UTF-8 encoded files, and it also marks the 3-byte UTF-8 BOM (if any).
Encoding problems have several possible failure points:
Check the template file encoding.
Check the response encoding.
Check the database encoding.
Check that all three are consistent with what you want to output.
Also note that Notepad++ has both an "Encode in ..." and a "Convert to ..." option: the first reinterprets the file's existing bytes in the specified encoding, while the second transcodes the file and writes it back in the selected encoding (changing the file's bytes).

Different querystring URL-encoding based on codepage (ASP classic)

We are currently converting our webapp from ISO-8859-1 to UTF-8, and everything works great except for reading GET/POST variables sent from other sites (signup forms).
Some of the sites that post to our site use ISO-8859-1 encoding and some use UTF-8.
The problem is that special characters get URL-encoded differently depending on the site's charset.
For example:
ø = %F8 in ISO-8859-1
ø = %C3%B8 in UTF-8
I can't decode %F8 correctly when my charset is UTF-8; I only get the Unicode REPLACEMENT CHARACTER (U+FFFD).
Any tips on how to fix this would be much appreciated :)
Torbjørn
You can specify the encoding explicitly using <form accept-charset="UTF-8">.
If you don't want to do that, the browser has to guess which encoding you want. For that it usually takes the encoding of the page containing the form. So if you serve your HTML files as UTF-8, your forms will be sent back as UTF-8, too.
I'd suggest doing a pre-analysis of the inputs before converting them. Essentially, scan for the ISO-8859-1 codes for Æ, Ø and Å (upper and lower case). If you find any, do a search/replace on the entire request, swapping the ISO-8859-1 codes for the corresponding UTF-8 byte sequences.
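Generalizing that suggestion: rather than scanning for specific characters, you can check whether the raw bytes are valid UTF-8 at all, and transcode from ISO-8859-1 only when they are not. Classic ASP is hard to demonstrate self-contained, so here is the idea as a Go sketch; utf8.Valid is a heuristic, but real ISO-8859-1 text is almost never coincidentally valid UTF-8:

package main

import (
	"fmt"
	"unicode/utf8"

	"golang.org/x/text/encoding/charmap"
)

// normalize returns raw as a UTF-8 string, transcoding from ISO-8859-1
// when the bytes are not already valid UTF-8.
func normalize(raw []byte) (string, error) {
	if utf8.Valid(raw) {
		return string(raw), nil
	}
	return charmap.ISO8859_1.NewDecoder().String(string(raw))
}

func main() {
	fromLatin1 := []byte{0xF8}     // %F8 percent-decoded: one Latin-1 byte
	fromUTF8 := []byte{0xC3, 0xB8} // %C3%B8 percent-decoded: UTF-8 bytes

	for _, raw := range [][]byte{fromLatin1, fromUTF8} {
		s, _ := normalize(raw)
		fmt.Println(s) // both print "ø"
	}
}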
