Difference between UTF-8 and en_AU.UTF-8

I have some text in UTF-8, and it still displays incorrectly in my text editor (the editor is set to UTF-8 encoding). I know that, for instance, ISO 8859-2 is a single-byte encoding compatible with ASCII whose upper 128 values are specific to a territory, so people from that territory can use a single-byte encoding to show characters that are not part of ASCII and don't need a multibyte encoding like UTF-8. What is the purpose of the en_AU part of en_AU.UTF-8? Could it be the reason why my text still looks garbled even though it is in UTF-8? I mean, could some values be mapped to different characters when en_AU is used? As I understand UTF-8, that is not possible, but it is the last thing I can think of that could explain the garbled text.
Output of the locale command on Linux:
LANG=en_US.UTF-8
LANGUAGE=en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=cs_CZ.UTF-8
LC_TIME=cs_CZ.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=cs_CZ.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=cs_CZ.UTF-8
LC_NAME=cs_CZ.UTF-8
LC_ADDRESS=cs_CZ.UTF-8
LC_TELEPHONE=cs_CZ.UTF-8
LC_MEASUREMENT=cs_CZ.UTF-8
LC_IDENTIFICATION=cs_CZ.UTF-8
LC_ALL=

In UNIX systems, locales are files on disk, and each one is stored in a specific encoding. So you may have the same locale in different encodings, e.g. en_AU.ISO-8859-1 and en_AU.UTF-8. This is not some variation of UTF-8; rather, it's a variation of this specific locale file. The en_AU part selects language and territory conventions (such as the message language and date or number formats); it does not change how UTF-8 bytes map to characters. If your locales are using the UTF-8 variant, then anything that uses the locale system will output UTF-8 encoded values.
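As a rough illustration of the naming scheme, the sketch below splits a locale name into its parts. `split_locale` is a hypothetical helper written for this example, not a system API, and it only assumes the common `language_TERRITORY.codeset` form. It also checks that UTF-8 bytes decode to the same text no matter which language or territory prefix the locale name carries.

```python
def split_locale(name):
    """Split 'language_TERRITORY.codeset' into its three parts.

    Hypothetical helper for illustration; missing parts come back as None.
    """
    base, _, codeset = name.partition(".")
    language, _, territory = base.partition("_")
    return language, territory or None, codeset or None


# The territory differs, the codeset (and so the byte encoding) is the same.
print(split_locale("en_AU.UTF-8"))  # ('en', 'AU', 'UTF-8')
print(split_locale("cs_CZ.UTF-8"))  # ('cs', 'CZ', 'UTF-8')

# UTF-8 maps bytes to characters the same way under either locale name.
text = "Příliš žluťoučký kůň"
assert text.encode("utf-8").decode("utf-8") == text
```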

Related

Why UTF-8 cannot handle Chinese character correctly in Windows10's CMD?

I have already set UTF-8 as the overall encoding:
When I set the code page to 936 (GBK, which is for Chinese),
Chinese characters are displayed properly.
When I change the code page to 65001 (UTF-8, which covers all characters),
Chinese characters are displayed improperly.
My question
How can I make Windows cmd handle all characters properly?
I don't want to use GBK, which can only handle Chinese, because that means I have to switch to another encoding whenever I handle other languages (e.g. Japanese, Korean). So I want to change everything to UTF-8 and get out of encoding hell.

German Umlaut displayed wrong despite correct Charset

I am encountering a weird problem regarding the encoding of my files.
I have a multilingual site; users can set the language via a dropdown on the site itself, the default being German.
When the user logs in, some settings are set depending on the language (charset, codepage and LCID). I also want to point out that all my files are ANSI-encoded.
Recently, I had to make some changes.
So I fired up Visual Studio 2010, edited the files in question and uploaded them to my server using FileZilla.
And now, all of a sudden, the German umlauts (Ää, Öö, Üü, ß) are displayed incorrectly (something like ä), but only in the files I opened with VS2010.
I checked the charset on the site itself, and also by displaying Response.CharSet, and it was ISO-8859-1, which is correct.
So I tried some conversions with Notepad++, but without success.
I know that setting the charset to UTF-8 would solve this problem, but a) the charset is set from a database value and b) it kind of messes things up in other languages.
You are displaying a UTF-8 encoded file with an ISO-8859-1 view. You expect to see just one character, so why do you see two? Because in UTF-8 the German small letter 'a' with two dots (ä) is a 2-byte sequence (0xC3 and 0xA4). If this is displayed not as UTF-8 but as ISO-8859-1, where one byte is one character, you get what you described: the start byte 0xC3 as a single ISO-8859-1 character and the following byte 0xA4 as another single ISO-8859-1 character. In UTF-8 this 2-byte sequence must be decoded by extracting the payload bits of the start byte and the following byte like this:
Start byte: 11000011
Following:  10100100
The prefix 110 of the start byte is stripped off, leaving 00011 (that is, 11 once the leading zeros are dropped).
The prefix 10 of the following byte is stripped off, leaving 100100.
Chained together this becomes 11100100, which is decimal 228, the Unicode code point of ä (U+00E4).
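The bit manipulation above can be reproduced in a few lines. This is only a sketch for the 2-byte case from this example; the variable names are made up for illustration, and a general decoder would also have to handle 1-, 3- and 4-byte sequences.

```python
lead, cont = 0xC3, 0xA4  # the two bytes from the example above

# Strip the 110 prefix from the start byte and the 10 prefix from the following byte.
high = lead & 0b00011111  # payload 00011
low = cont & 0b00111111   # payload 100100

# Chain the payload bits together: 5 bits from the start byte, 6 from the following one.
code_point = (high << 6) | low
print(bin(code_point), code_point)  # 0b11100100 228

# Python's own UTF-8 decoder agrees with the manual result.
assert chr(code_point) == bytes([lead, cont]).decode("utf-8")

# Read one byte per character (ISO-8859-1), the same bytes become two characters.
assert len(bytes([lead, cont]).decode("iso-8859-1")) == 2
```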
I recommend leaving the file's encoding as it is, UTF-8. It is the viewer/editor that should display UTF-8 encoded files as UTF-8 and not as ISO-8859-1. In other words, configure the viewer's/editor's encoding to match the encoding of the file's content (which in your case is UTF-8 and NOT ISO-8859-1).
To convert your files or check them for a certain encoding, you can use MadEdit. MadEdit has a built-in hex editor that draws a rectangle around UTF-8 sequences and displays just one character on the right side (the encoded code point). This makes it easy to identify single-byte characters and 2/3/4-byte sequences within UTF-8 encoded files. It also draws a rectangle around the 3-byte UTF-8 BOM (if any).
Encoding problems have several failure points:
Check template file encoding
Check response encoding
Check database encoding
Check that they are consistent with what you want to output.
Also note that Notepad++ has an "Encode as..." and a "Convert to..." option.
The first reads the file as the specified encoding; the second reads the file and writes it back in the selected encoding (changing the file).
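The difference between the two options can be sketched outside the editor. The snippet below does not reproduce Notepad++ itself; it only illustrates the distinction between choosing how to read existing bytes and writing new bytes, using ä as the sample character.

```python
raw = "ä".encode("utf-8")  # bytes on disk: C3 A4

# "Read as": pick an interpretation of the bytes; the bytes themselves do not change.
as_latin1 = raw.decode("iso-8859-1")  # two characters of mojibake
as_utf8 = raw.decode("utf-8")         # the intended single character

# "Convert to": decode with the correct encoding, then write new bytes.
converted = as_utf8.encode("iso-8859-1")  # bytes on disk become: E4

print(raw.hex(), len(as_latin1), len(as_utf8), converted.hex())  # c3a4 2 1 e4
```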

Find out character encoding of straße

I'm struggling with the encoding of content from an external interface. In the MySQL database the collation is latin1_swedish_ci, and the collation of the field is also latin1_swedish_ci. The PHP script is encoded in UTF-8 and the output in the browser is UTF-8. Everything works fine except the content of this database. The database connection should be UTF-8 (Typo3 4.7) and the content is
straße
but it should be straße.
mb_detect_encoding($data['street'],'UTF-8') says it is UTF-8. If I use utf8_decode() I get
stra�?e
If I use utf8_encode() I get
straße
My assumption was that UTF-8 encoded data is stored in ISO-8859-1, but if that were the case it shouldn't cause such problems here. How do I find out what the real encoding is?
PS: I cannot change the encoding of the source!
My solution for my initial problem:
I had to switch the database connection from UTF-8 to ISO-8859-1 with this line of code
$res = $GLOBALS['TYPO3_DB']->sql_query("SET NAMES latin1");
The character ß 'LATIN SMALL LETTER SHARP S' (U+00DF) is encoded in UTF-8 as the bytes 0xC3 and 0x9F, as per the linked site:
UTF-8 (hex) 0xC3 0x9F (c39f)
If we look at the ISO-8859-1 codepage layout, then those bytes represent the character Ã and a byte with no printable character defined in the ISO-8859-1 codepage layout. This is thus not it. Another common character encoding which has some overlap with ISO-8859-1 is Windows CP1252 (also known as ANSI; it is used by default when saving a text file in Notepad, which can be overridden by using Save As instead). If we look at the CP1252 codepage layout, then those bytes represent the characters Ã and Ÿ, which confirms what you're initially retrieving.
So, it's most likely CP1252 encoded.
What you see as “ß” is really the windows-1252 (also known as CP1252) interpretation of the two bytes 0xC3 and 0x9F that constitute the UTF-8 encoding of “ß”. But this seems to mean that the data is actually UTF-8 encoded and just gets misinterpreted as windows-1252 encoded. So I think it should be simply processed as UTF-8, with due precautions.
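The diagnosis in these answers can be checked with a short experiment. This is a sketch in Python rather than PHP, and it assumes that PHP's utf8_encode() behaves as an ISO-8859-1 to UTF-8 conversion; `cp1252` is Python's name for windows-1252.

```python
original = "straße"
utf8_bytes = original.encode("utf-8")  # ß becomes C3 9F

# Misreading the UTF-8 bytes as windows-1252 produces the text from the question.
misread = utf8_bytes.decode("cp1252")
assert misread == "straÃŸe"

# ISO-8859-1 maps 0x9F to a C1 control code rather than a printable character.
assert utf8_bytes.decode("iso-8859-1") == "stra\u00c3\u009fe"

# Encoding the already-UTF-8 bytes again (what utf8_encode() did) doubles the damage.
double = utf8_bytes.decode("iso-8859-1").encode("utf-8")
assert double.decode("cp1252") == "straÃƒÂŸe"

# Undo the misreading: back to the raw bytes, then decode them as UTF-8.
repaired = misread.encode("cp1252").decode("utf-8")
assert repaired == original
```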
I recommend that you verify which charset is being used by your SQL connection. It is NOT necessarily the same as the charset that you define for your database.
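From PHP:
// Open a connection to a MySQL server (legacy mysql_* extension, removed in PHP 7;
// the mysqli equivalents are mysqli_character_set_name() and mysqli_set_charset())
$connection = mysql_connect($server, $username, $password);
// Charset the connection is using before any change
$charset = mysql_client_encoding($connection);
// Switch the connection to UTF-8; returns true on success
$flagChange = mysql_set_charset('utf8', $connection);
echo "The character set is: $charset<br>mysql_set_charset result: $flagChange<br>";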
Inside phpMyAdmin:
open the database information_schema
open the table schemata
check your MySQL default collation
You may or may not be able to change these parameters, depending on user privileges.
As shown above, I solved my conflicting character set problems in MySQL by adding the following line to my connection.php file (which I include at the beginning of every page that uses DB access):
$flagChange = mysql_set_charset('utf8', $connection);

ANSI to UTF-8 conversion

I would like to know whether:
all characters encoded in ANSI (1252) can be converted to UTF-8 without any problem;
not all characters encoded in UTF-8 can be converted to ANSI (1252) without any problem (example: Ǣ cannot be converted to ANSI encoding).
Could you confirm that this is correct?
Thanks!
Yes, all characters representable in Windows-1252 have Unicode equivalents, and can therefore be converted to UTF-8. See this Wikipedia article for a table showing the mapping to Unicode code points.
And since Windows-1252 is an 8-bit character set (at most 256 characters), while UTF-8 can encode every Unicode code point (over a million), there are obviously plenty of characters representable in UTF-8 and not representable in Windows-1252.
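Both directions can be checked with a small experiment. The sketch below relies on Python's `cp1252` codec as a stand-in for Windows-1252; that codec leaves a few byte values unassigned, so those are skipped.

```python
# Every byte that Python's cp1252 codec defines decodes to a Unicode character,
# and every such character survives a round trip through UTF-8.
for value in range(256):
    try:
        char = bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # a few byte values are unassigned in this codec
    assert char.encode("utf-8").decode("utf-8") == char

# The reverse direction fails for characters outside Windows-1252.
try:
    "Ǣ".encode("cp1252")
    convertible = True
except UnicodeEncodeError:
    convertible = False
print(convertible)  # False
```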
Note that the name "ANSI" for the Windows-1252 encoding is strictly incorrect. When it was first proposed, it was intended to be an ANSI standard, but that never happened. Unfortunately, the name stuck. (Microsoft-related documentation also commonly refers to UTF-16 as "Unicode", another misnomer; UTF-16 is one representation of Unicode, but there are others.)

Different querystring urlencoding based on codepage. ASP classic

We are currently converting our web app from ISO-8859-1 to UTF-8. Everything works great except reading GET/POST variables sent from other sites (signup forms).
Some of the sites that post to our site use ISO-8859-1 encoding and some use UTF-8.
The problem is that special characters get URL-encoded differently depending on the site's charset.
For example:
ø = %F8 in ISO-8859-1
ø = %C3%B8 in UTF-8
I can't get %F8 decoded correctly when I use the UTF-8 charset. I only get a Unicode 'REPLACEMENT CHARACTER' (U+FFFD).
Any tips on how to fix this would be much appreciated :)
Torbjørn
You can specify the encoding explicitly using <form accept-charset="UTF-8">.
If you don't want to do that, the browser has to guess the encoding you want. For that it usually uses the encoding of the page that contains the form. So if you serve the HTML files as UTF-8, your forms will be sent back as UTF-8, too.
I'd suggest you do a pre-analysis of the inputs before converting them. Essentially, scan for the ISO-8859-1 codes for Æ, Ø and Å (upper and lower case). If you find any, do a search/replace over the entire request, swapping the ISO-8859-1 character codes for the UTF-8 ones.
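A variation on that pre-analysis, sketched in Python rather than classic ASP: instead of scanning for specific character codes, try to decode the raw bytes as strict UTF-8 and fall back to ISO-8859-1 when that fails. Note that this is a different technique from the search/replace described above, and `decode_param` is a hypothetical helper name. It is a heuristic: a few ISO-8859-1 byte sequences also happen to be valid UTF-8, so it can guess wrong in rare cases.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# The same character percent-encodes differently per charset.
assert quote("ø", encoding="iso-8859-1") == "%F8"
assert quote("ø", encoding="utf-8") == "%C3%B8"

# Decoding %F8 as UTF-8 yields the replacement character from the question.
assert unquote("%F8", encoding="utf-8") == "\ufffd"


def decode_param(value):
    """Decode a percent-encoded value as UTF-8, falling back to ISO-8859-1."""
    raw = unquote_to_bytes(value)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# Both encodings of the same name decode to the same text.
assert decode_param("Torbj%F8rn") == decode_param("Torbj%C3%B8rn") == "Torbjørn"
```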
