Special French characters in HTML - utf-8

French characters in HTML with the utf-8 charset still display incorrectly. I have a small sample page at ShopAndBind.com/Sample.asp with META HTTP-EQUIV='Content-Type' CONTENT='text/html;charset=utf-8' that still does not display Véhicules Terrestres à Moteur correctly, whether the text is in the source or loaded from a MySQL database. It displays fine everywhere else. I'm using Visual InterDev 6.0 from Visual Studio 2008 for development. Notepad and KEdit both show it correctly. The hex values in the file are 'E9' and 'E0' for é and à respectively.

The page http://shopandbind.com/Sample.asp is served with HTTP headers that do not specify character encoding, the data does not start with BOM, but it contains a meta tag that specifies UTF-8 as the character encoding. However, the data contains bytes that are invalid in UTF-8. This explains the failure.
The data is in fact in ISO-8859-1 (or compatible) encoding, as you can see by manually selecting that encoding (often under the name “Western European”) in the View → Encoding menu of your browser. Bytes E0 and E9 denote à and é in ISO-8859-1, but definitely not in UTF-8.
Thus, the minimal fix is to replace UTF-8 with ISO-8859-1 in the meta tag. A better fix might be to make the process that produces the HTML file generate UTF-8 encoded data.
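As a rough illustration of the diagnosis above, here is a minimal Python sketch (Python is used only for illustration, since the page itself is ASP). The byte string is an assumption reconstructed from the hex values mentioned in the question, not the actual file content.

```python
# Bytes as described in the question: 0xE9 for é and 0xE0 for à (ISO-8859-1).
raw = b"V\xe9hicules Terrestres \xe0 Moteur"

# Strict UTF-8 decoding fails, because 0xE9 and 0xE0 are lead bytes that
# must be followed by continuation bytes (0x80..0xBF), and here they are not.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Decoding as ISO-8859-1 gives the intended text ...
text = raw.decode("iso-8859-1")

# ... which can then be re-encoded as real UTF-8 (the "better fix").
utf8_bytes = text.encode("utf-8")

print(valid_utf8)   # False
print(text)         # Véhicules Terrestres à Moteur
print(utf8_bytes)   # b'V\xc3\xa9hicules Terrestres \xc3\xa0 Moteur'
```

After re-encoding, each accented letter takes two bytes, which is what a page declared as UTF-8 is expected to contain.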

Related

Character encoding of Microsoft Word DOC and DOCX files?

I'm not too familiar with the encoding that Microsoft Word uses. If someone were to save a .doc or .docx file from Word, what is the standard encoding that is used?
I'm guessing it's not UTF-8, as the resulting text (pasted into a UTF-8 encoded text file) does not honour certain punctuation (e.g. quotes).
For example, an opening Word 'smart quote' when pasted in a UTF-8 text file, results in an ì symbol. If Word does indeed encode in UTF-8, then how does Word attempt to render the actual UTF-8 character?
Edit
After doing a little digging, I can see that a Microsoft Word .docx file is actually a compressed format. Unzipping it results in a number of .xml files to be unpacked.
However, the inability for a UTF-8 encoded text file to honour these 'smart' quotes is still perplexing. Any enlightening information would be helpful.
These days a docx file is really a bunch of compressed XML files. One of these files is document.xml, which starts with the following line (i.e. an XML prolog):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
As you can see, it is UTF-8 encoded.
EDIT
UTF-8 supports the full set of Unicode characters. Just for the sake of completeness, that does not mean that all Unicode characters can actually be used in an XML file. Even a CDATA block has its limitations. But having said all that, storing an ` or an ì isn't a problem.
And more importantly, the file format does not really have anything to do with the copy-paste behavior of the application itself.
Nevertheless, here's how Word would store an ` and an ì symbol.
CORRECTION
A bit confusing, but I just realized that by "smart quote" you probably mean the mechanism Word uses to produce curly quotes. In my previous answer I thought you meant "backticks", which is a different thing. Sorry for the confusion.
Well, anyway, here are the Unicode code points for these smart quotes:
Let's put them in a simple UTF-8 encoded text file.
The result is not that spectacular:
U+2018 is encoded in UTF-8 as E2 80 98
U+2019 is encoded in UTF-8 as E2 80 99
U+201C is encoded in UTF-8 as E2 80 9C
U+201D is encoded in UTF-8 as E2 80 9D
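The byte sequences listed above can be reproduced with a few lines of Python. This is only a sketch for checking the values; the dictionary name and layout are my own.

```python
# The four "smart quote" code points and their UTF-8 byte sequences.
quotes = {
    "U+2018": "\u2018",  # left single quotation mark
    "U+2019": "\u2019",  # right single quotation mark
    "U+201C": "\u201c",  # left double quotation mark
    "U+201D": "\u201d",  # right double quotation mark
}

# Encode each character as UTF-8 and format the bytes as uppercase hex.
encoded = {
    name: " ".join(f"{byte:02X}" for byte in char.encode("utf-8"))
    for name, char in quotes.items()
}

for name, hex_bytes in encoded.items():
    print(f"{name} is encoded in UTF-8 as {hex_bytes}")
```

Each of these characters takes three bytes in UTF-8, so a program that reads the file with a single-byte encoding shows three unrelated characters per quote.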
So, I went one step further and put them in a Word file.
I entered a line with regular quotes, and one with smart quotes.
" this is a test "
“ this is another test ”
And then I saved the thing and looked at how it was stored in Word's XML structure. And it is actually stored exactly as expected.

German Umlaut displayed wrong despite correct Charset

I am encountering a weird problem regarding the encoding of my files.
I have a multilingual site; users can set the language via a dropdown on the site itself, the default being German.
When the user logs in, some settings are set depending on the language (charset, codepage and LCID). At this point I also want to point out that all my files are ANSI-encoded.
Recently, I had to make some changes.
So I fired up Visual Studio 2010, edited the files in question and uploaded them to my server using FileZilla.
And now, all of a sudden, the German umlauts (Ää, Öö, Üü, ß) are displayed incorrectly (something like Ã¤), but only in the files I opened with VS2010.
I checked the charset on the site itself and also displayed it with Response.CharSet, and it was ISO-8859-1, which is correct.
So I tried some conversions with Notepad++, but without success.
I know that setting the charset to UTF-8 would solve this problem, but a) the charset is set from a database value and b) it kind of messes things up in other languages.
You are viewing a UTF-8 encoded file as ISO-8859-1. Why do you see two characters where you expect one? Because in UTF-8 the German small letter 'a with two dots' (ä) is a 2-byte sequence (0xC3 and 0xA4). If those bytes are displayed as ISO-8859-1, where one byte is one character, you get exactly what you describe: the start byte 0xC3 shown as one ISO-8859-1 character and the following byte 0xA4 shown as another. In UTF-8 this 2-byte sequence must be decoded by extracting the payload bits of the start byte and the following byte, like this:
Startbyte: 11000011
Following: 10100100
The marker bits 110 of the start byte are stripped off, leaving 00011.
The marker bits 10 of the following byte are stripped off, leaving 100100.
Chained together this becomes 00011100100, i.e. 11100100, which is decimal 228 (U+00E4), the Unicode code point of the German 'a with two dots' (ä).
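The bit manipulation described above can be checked with a small Python sketch. The variable names are illustrative; the masks simply drop the 110 and 10 marker bits.

```python
# The two bytes of the UTF-8 sequence for the German letter ä.
lead, cont = 0xC3, 0xA4          # 11000011, 10100100

# A 2-byte sequence starts with 110xxxxx and continues with 10xxxxxx.
assert lead >> 5 == 0b110
assert cont >> 6 == 0b10

# Strip the marker bits and chain the payload bits together.
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)

decoded = bytes([lead, cont]).decode("utf-8")        # one character
misread = bytes([lead, cont]).decode("iso-8859-1")   # two characters

print(code_point, bin(code_point))  # 228 0b11100100
print(decoded)                      # ä
print(misread)                      # Ã¤
```

The last line shows the symptom from the question: the same two bytes, read one byte per character, produce two wrong characters instead of one correct one.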
I recommend leaving the encoding as it is, UTF-8. It is the encoding setting of your viewer/editor that should change, so that UTF-8 encoded files are displayed as UTF-8 and not as ISO-8859-1. In other words, configure the viewer's/editor's encoding to match the encoding of the file's content (which in your case is UTF-8 and NOT ISO-8859-1).
To convert your files or check them for a certain encoding, just use MadEdit. MadEdit has a built-in hex editor which draws a rectangle around UTF-8 sequences and displays just one character on the right side (the encoded code point). That makes it easy to identify single-byte characters and 2/3/4-byte sequences within UTF-8 encoded files. It also draws a rectangle around the 3-byte UTF-8 BOM (if any).
Encoding problems have several failure points:
Check template file encoding
Check response encoding
Check database encoding
Check that they are consistent with what you want to output.
Also note that Notepad++ has an "Encode as..." and a "Convert to..." option.
The first one re-reads the file using the specified encoding; the second one reads the file and writes it back in the selected encoding (changing the file).
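The difference between re-interpreting bytes and converting them can be sketched in Python as follows. This is only an analogy for the two editor options, not a description of how Notepad++ is implemented; the sample text is made up.

```python
# Bytes "on disk": the umlauts saved as UTF-8.
data = "Ää Öö Üü ß".encode("utf-8")

# Re-interpreting: the bytes stay untouched, only the codec used to read
# them changes. Reading UTF-8 bytes as ISO-8859-1 yields mojibake.
misread = data.decode("iso-8859-1")

# Converting: decode with the encoding the bytes really have, then write
# them out in the target encoding. The bytes themselves change.
converted = data.decode("utf-8").encode("iso-8859-1")

print(misread)                     # Ã\x84Ã¤ ... (two characters per umlaut)
print(len(data), len(converted))   # 17 10
```

Re-interpreting is harmless as long as nothing is saved; converting rewrites the file, so it only helps if the source encoding was identified correctly first.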

Find out character encoding of straße

I'm struggling with the encoding of the content of an external interface. In the MySQL database the collation is latin1_swedish_ci. The collation of the field is also latin1_swedish_ci. The PHP script is encoded in UTF-8 and the output in the browser is UTF-8. Everything is working fine except the content of this database. The database connection should be UTF-8 (TYPO3 4.7) and the content is
straÃŸe
but it should be straße.
mb_detect_encoding($data['street'],'UTF-8') says it is UTF-8. If I use utf8_decode() I get
stra�?e
If I use utf8_encode() I get
straße
My assumption was that UTF-8 encoded data is stored as ISO-8859-1, but if that were the case it shouldn't cause such problems here. How do I find out what the real encoding is?
PS: I cannot change the encoding of the source!
My solution for my initial problem:
I had to switch the database connection from UTF-8 to ISO-8859-1 with this line of code:
$res = $GLOBALS['TYPO3_DB']->sql_query("SET NAMES latin1");
The character ß 'LATIN SMALL LETTER SHARP S' (U+00DF) is encoded in UTF-8 as the bytes 0xC3 and 0x9F, as per the linked site:
UTF-8 (hex) 0xC3 0x9F (c39f)
If we look at the ISO-8859-1 codepage layout, those bytes represent the character Ã and a character not defined in the ISO-8859-1 codepage layout. So this is not it. Another common character encoding which overlaps with ISO-8859-1 is Windows CP1252 (also known as ANSI, used by default when saving a text file in Notepad, which can be overridden by using Save As instead). If we look at the CP1252 codepage layout, those bytes represent the characters Ã and Ÿ, which confirms what you're initially retrieving.
So, it's most likely CP1252 encoded.
What you see as “ÃŸ” is really the windows-1252 (also known as CP1252) interpretation of the two bytes 0xC3 and 0x9F that constitute the UTF-8 encoding of “ß”. This seems to mean that the data is actually UTF-8 encoded and just gets misinterpreted as windows-1252. So I think it should simply be processed as UTF-8, with due precautions.
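The reasoning about the two bytes 0xC3 and 0x9F can be reproduced with a short Python sketch. It only illustrates the byte-level argument; the sample word is taken from the question, and the repair step is an assumption that the text was garbled exactly once.

```python
# UTF-8 encoding of ß (U+00DF).
utf8_bytes = "\u00df".encode("utf-8")          # b'\xc3\x9f'

# The same two bytes interpreted as windows-1252: Ã followed by Ÿ.
as_cp1252 = utf8_bytes.decode("cp1252")

# Interpreted as ISO-8859-1: Ã followed by the C1 control U+009F,
# which has no printable glyph.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# Reversing a single round of this misinterpretation: turn the characters
# back into the bytes they came from, then decode those bytes as UTF-8.
garbled = "stra\u00c3\u0178e"                  # displays as straÃŸe
repaired = garbled.encode("cp1252").decode("utf-8")

print(utf8_bytes)   # b'\xc3\x9f'
print(as_cp1252)    # ÃŸ
print(repaired)     # straße
```

If the repair step raises an error on real data, the text was probably not garbled in this particular way, and the connection charset is the next thing to check.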
I recommend that you verify which charset is being used by your SQL connection. It is NOT necessarily the same as the charset that you define for your database.
FROM PHP
// Opens a connection to a MySQL server
$connection = mysql_connect($server, $username, $password);
// Character set the connection is using (read before it is changed below)
$charset = mysql_client_encoding($connection);
// Switch the connection to utf8; returns true on success
$flagChange = mysql_set_charset('utf8', $connection);
echo "The character set is: $charset<br>mysql_set_charset result: $flagChange<br>";
INSIDE PHPMYADMIN
open database information_schema
open table schemata
check out your mysql default collation
You may or may not be able to change these parameters, depending on user privileges.
As shown above, I solved my conflicting character set problems in MySQL by adding the following line to my connection.php file (which I call at the beginning of every page that uses DB access):
$flagChange = mysql_set_charset('utf8', $connection);

â‚¬ is coming instead of regular Euro sign in ISO-8859-1 in Magento

â‚¬ is displayed instead of the Euro sign in ISO-8859-1.
I am using this character set for my French, Spanish, German and Italian stores.
Please tell me how to fix this Euro sign problem, or any other way to display the special characters of the languages listed above.
There is no euro sign character in ISO 8859-1; it was introduced in ISO 8859-15 and it is present in UTF-8. However, it seems you just need to use the &euro; HTML entity.
Magento uses UTF-8 everywhere: templates, database, translation files. If you send a Content-Type header for ISO-8859-1, all data is still UTF-8 encoded but will be displayed incorrectly (that's what you see: a UTF-8 euro sign interpreted as ISO-8859-1).
There is no reason to prefer ISO-8859-1 over UTF-8. If you add your own files or data that are in ISO-8859-1, convert them first.
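The euro sign case can be illustrated with a short Python sketch. It uses cp1252 for the misreading on the assumption that the browser treats a page labelled ISO-8859-1 as windows-1252, which is what produces the three visible characters.

```python
euro = "\u20ac"                              # the euro sign

# UTF-8 stores the euro sign as three bytes.
utf8_bytes = euro.encode("utf-8")            # b'\xe2\x82\xac'

# Those three bytes shown through a single-byte Western codepage.
mojibake = utf8_bytes.decode("cp1252")       # three characters

# ISO-8859-1 has no euro sign at all, so encoding it fails.
try:
    euro.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False

# ISO-8859-15 added it at position 0xA4.
latin9_bytes = euro.encode("iso-8859-15")

print(mojibake)       # â‚¬
print(in_latin1)      # False
print(latin9_bytes)   # b'\xa4'
```

This is why switching only the declared charset does not help: the stored data stays UTF-8, so the declared charset has to be UTF-8 as well.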
I did it like this:
<?php echo mb_convert_encoding($this->__('Careers'), "UTF-8", "HTML-ENTITIES"); ?>
and kept the default UTF-8 charset.

Different querystring urlencoding based on codepage. ASP classic

We are currently converting our web app from ISO-8859-1 to UTF-8. Everything works great except for reading GET/POST variables sent from other sites (signup forms).
Some of the sites that post to our site use ISO-8859-1 encoding and some use UTF-8.
The problem is that special characters get URL-encoded differently depending on the site charset.
For example:
ø = %F8 in ISO-8859-1
ø = %C3%B8 in UTF-8
I can't get %F8 decoded correctly when I use the UTF-8 charset. I only get a Unicode 'REPLACEMENT CHARACTER' (U+FFFD).
Any tips on how to fix this would be much appreciated:)
Torbjørn
You can specify the encoding explicitly using <form accept-charset="UTF-8">.
If you don't want to do that, the browser has to guess the encoding you want. For that it usually takes the encoding of the page in which the form is. So if you serve the HTML files as UTF-8 your forms will be sent back as UTF-8, too.
I'd suggest you did a preanalysis of the inputs before converting them. Essentially, scan for the iso-8859-1 codes for Æ, Ø and Å (upper and lower case). If you find any, do a search/replace for the entire request, where you swap the iso-char codes to the UTF-8 charcodes.
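One way to sketch the "analyse the input before converting" idea is shown below in Python rather than classic ASP. Note that it swaps in a slightly different technique from the search/replace suggested above: it tries strict UTF-8 first and falls back to ISO-8859-1. The function name decode_query_value is hypothetical, and the value is assumed to be a single percent-encoded string (no '+' handling).

```python
from urllib.parse import unquote, unquote_to_bytes


def decode_query_value(value: str) -> str:
    """Percent-decode a value that may be UTF-8 or ISO-8859-1 encoded."""
    raw = unquote_to_bytes(value)
    try:
        # Valid UTF-8 is taken at face value.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the posting site used ISO-8859-1.
        return raw.decode("iso-8859-1")


print(decode_query_value("Torbj%F8rn"))      # Torbjørn
print(decode_query_value("Torbj%C3%B8rn"))   # Torbjørn
print(repr(unquote("%F8")))                  # '\ufffd'
```

The last line reproduces the symptom from the question: decoding %F8 as UTF-8 yields the replacement character. The fallback is a heuristic; an ISO-8859-1 byte sequence that happens to be valid UTF-8 would be decoded as UTF-8, although that is rare for ordinary text.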

Resources