We are currently converting our webapp from ISO-8859-1 to UTF-8. Everything works great except for reading GET/POST variables submitted from other sites (signup forms).
Some of the sites that post to our site use ISO-8859-1 encoding and some use UTF-8.
The problem is that special characters get URL-encoded differently depending on the site's charset.
For example:
ø = %F8 in ISO-8859-1
ø = %C3%B8 in UTF-8
I can't get %F8 decoded correctly when I use the UTF-8 charset. I only get the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD).
Any tips on how to fix this would be much appreciated:)
Torbjørn
You can specify the encoding explicitly using <form accept-charset="UTF-8">.
If you don't want to do that, the browser has to guess the encoding you want. For that it usually takes the encoding of the page in which the form is. So if you serve the HTML files as UTF-8 your forms will be sent back as UTF-8, too.
I'd suggest doing a pre-analysis of the inputs before converting them. Essentially, scan for the ISO-8859-1 codes for Æ, Ø and Å (upper and lower case). If you find any, do a search/replace over the entire request, swapping the ISO-8859-1 character codes for the corresponding UTF-8 character codes.
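A more general variant of that pre-analysis is to decode the raw bytes as strict UTF-8 first and fall back to ISO-8859-1 only when that fails. The following is a minimal Python sketch of the idea, not the poster's actual code; the function name `decode_form_value` is hypothetical. It is a heuristic: a Latin-1 byte sequence that happens to be valid UTF-8 would be misread, although that is rare in real text.

```python
from urllib.parse import unquote, unquote_to_bytes


def decode_form_value(raw: str) -> str:
    """Decode a percent-encoded form value: try UTF-8, fall back to ISO-8859-1."""
    # Form encoding uses '+' for a space; a literal plus sign arrives as %2B.
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        # Strict decoding raises on byte sequences that are not valid UTF-8,
        # such as a lone 0xF8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value maps to a character in ISO-8859-1, so this cannot fail.
        return data.decode("iso-8859-1")


# For comparison: unquote() defaults to UTF-8 with errors="replace",
# which is what produces the replacement character from the question.
lenient = unquote("%F8")
```

Both `decode_form_value("%F8")` and `decode_form_value("%C3%B8")` yield "ø", while `lenient` holds U+FFFD.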
Related
French characters in HTML with utf-8 charset still display incorrectly. I have a small sample page at ShopAndBind.com/Sample.asp with META HTTP-EQUIV='Content-Type' CONTENT='text/html;charset=utf-8' that still does not display Véhicules Terrestres à Moteur correctly, whether the text is in the source or loaded from a MySQL database. It displays fine everywhere else. I'm using Visual InterDev 6.0 from Visual Studio 2008 for development. NotePad and Kedit display it correctly. The hex in the file is 'E9' and 'E0' respectively for é and à.
The page http://shopandbind.com/Sample.asp is served with HTTP headers that do not specify character encoding, the data does not start with BOM, but it contains a meta tag that specifies UTF-8 as the character encoding. However, the data contains bytes that are invalid in UTF-8. This explains the failure.
The data is in fact in ISO-8859-1 (or compatible) encoding, as you can see by manually selecting that encoding (often under the name “Western European”) in the View → Encoding menu of your browser. Bytes E0 and E9 denote à and é in ISO-8859-1, but definitely not in UTF-8.
Thus, the minimal fix is to replace UTF-8 by ISO-8859-1 in the meta tag. A better fix might be to make the process that produces the HTML file to generate UTF-8 encoded data.
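The diagnosis can be reproduced with a few lines of Python. This is only an illustration: the byte string below is typed in by hand from the hex values quoted in the question, not fetched from the page.

```python
# The phrase as stored in the file: 0xE9 for é and 0xE0 for à.
raw = b"V\xe9hicules Terrestres \xe0 Moteur"

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # 0xE9 announces a 3-byte sequence, but no continuation bytes follow it.
    valid_utf8 = False

# Decoding with the encoding the file really uses recovers the text...
text = raw.decode("iso-8859-1")

# ...and re-encoding it gives what a genuinely UTF-8 page would contain.
utf8_bytes = text.encode("utf-8")
```

Here `valid_utf8` ends up `False`, and `utf8_bytes` holds the two-byte sequences C3 A9 and C3 A0 in place of E9 and E0.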
I am encountering a weird problem regarding the encoding of my files.
I have a site which is multilingual; users can set the language via a dropdown on the site itself, the default value being German.
When the user logs in, some settings are being set depending on the language (charset, codepage and LCID). At this point I also want to point out, that all my files are ANSI-encoded.
Recently, I had to make some changes.
So I fire up Visual Studio 2010, edit the files in question and upload them to my server using Filezilla.
And now, all of a sudden, the German umlauts (Ää, Öö, Üü, ß) are being displayed incorrectly (something like Ã¤) - but only in the files I opened with VS2010.
I checked the charset on the site itself and also displaying it with Response.CharSet and it was ISO-8859-1, which is correct.
So I tried some converting with notepad++, but no success.
I know that setting the charset to UTF-8 would solve this problem, but a) the charset is set from a database-value and b) it kind of messes things up in other languages.
You are viewing a UTF-8 encoded file as ISO-8859-1. Why do you see two characters where you expect one? In UTF-8, the German small letter 'a' with two dots (ä) is a 2-byte sequence (0xC3 and 0xA4). If those bytes are displayed as ISO-8859-1, where one byte is always one character, you get exactly what you describe: the start byte 0xC3 appears as one ISO-8859-1 character and the following byte 0xA4 as another. To decode the 2-byte sequence as UTF-8, extract the payload bits of the start byte and the following byte like this:
Startbyte: 11000011
Following: 10100100
The 110 prefix of the start byte is stripped off, which leaves 00011 (leading zeros dropped: 11).
The 10 prefix of the following byte is stripped off, which leaves 100100.
Chained together this becomes 11100100, which is decimal 228, the Unicode code point of the German 'a with two dots' (ä).
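The same bit manipulation, written out as a small Python sketch (the variable names are only illustrative):

```python
start, following = 0xC3, 0xA4        # 11000011, 10100100

payload_hi = start & 0b00011111      # drop the 110 prefix -> 00011
payload_lo = following & 0b00111111  # drop the 10 prefix  -> 100100

# Shift the high bits past the six low bits and combine them.
code_point = (payload_hi << 6) | payload_lo

# What an ISO-8859-1 viewer shows for the same two bytes.
mojibake = bytes([start, following]).decode("iso-8859-1")
```

`code_point` is 228 (0xE4), so `chr(code_point)` is "ä", while `mojibake` is the two-character string "Ã¤".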
I recommend leaving the encoding as it is, UTF-8. It is the viewer/editor that should display UTF-8 encoded files as UTF-8 and not as ISO-8859-1. In other words, configure the viewer's/editor's encoding to match the encoding of the file's content (which in your case is UTF-8 and NOT ISO-8859-1).
To convert your files or check them for a certain encoding, just use madedit. madedit has a built-in hex-editor which wraps a rectangle around utf-8 sequences, displaying just one character on the right side (the encoded codepoint). It's easy to identify single-byte characters and/or 2/3/4-byte sequences within utf-8 encoded files. It also wraps a rectangle around the 3-byte utf-8 BOM (if any).
Encoding problems have several failure points:
Check template file encoding
Check response encoding
Check database encoding
Check that they are coherent to what you want to output.
Also note that Notepad++ has an "Encode as..." and a "Convert to..." option.
The first one only re-reads the file's bytes as the specified encoding; the second one reads the file and writes it back in the selected encoding (changing the file).
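The difference between the two Notepad++ operations can be sketched in Python. This is an analogy under my own naming (`reinterpret` and `convert` are hypothetical helpers), not a description of how Notepad++ is implemented.

```python
def reinterpret(data: bytes, encoding: str) -> str:
    """Same bytes, read under a different assumption; nothing on disk changes."""
    return data.decode(encoding)


def convert(data: bytes, source: str, target: str) -> bytes:
    """Decode with the encoding the file really uses, then re-encode: the bytes change."""
    return data.decode(source).encode(target)


utf8_bytes = "Käse".encode("utf-8")  # b"K\xc3\xa4se"

wrong_view = reinterpret(utf8_bytes, "iso-8859-1")          # garbled display
latin1_bytes = convert(utf8_bytes, "utf-8", "iso-8859-1")   # real conversion
```

Reinterpreting with the wrong encoding produces the garbled umlaut, whereas converting produces a file that an ISO-8859-1 page can serve correctly.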
â‚¬ is displayed instead of Euro sign in ISO-8859-1
I am using this character set for my French, Spanish, German and Italian stores.
Please tell me how to fix this euro sign problem or any other solution to display special characters of above listed languages.
There is no euro sign character in ISO 8859-1; it was introduced in ISO 8859-15 and it is present in UTF-8. However, it seems you just need to use the &euro; HTML entity.
Magento uses UTF-8 everywhere: Templates, database, translation files. If you send a content-type header for ISO-8859-1, all data is still UTF-8 encoded but will be displayed incorrect (that's what you see, a UTF-8 euro sign, interpreted as ISO-8859-1).
There is no reason to prefer ISO-8859-1 over UTF-8. If you add own files or data which is in ISO-8859-1, convert them first.
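A short Python check illustrates both points. It decodes with `cp1252` because browsers treat a page labelled ISO-8859-1 as windows-1252; the variable names are only illustrative.

```python
euro_utf8 = "€".encode("utf-8")              # three bytes: E2 82 AC

# UTF-8 data displayed by a browser told the page is ISO-8859-1.
seen_as_cp1252 = euro_utf8.decode("cp1252")

# ISO-8859-1 itself has no euro sign at all.
try:
    "€".encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False

# ISO-8859-15 added it at position 0xA4.
euro_latin9 = "€".encode("iso-8859-15")
```

`seen_as_cp1252` is the three-character string "â‚¬" that shows up in the store.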
I did it like this:
<?php echo mb_convert_encoding($this->__('Careers'), "UTF-8", "HTML-ENTITIES"); ?>
and keep charset default UTF-8.
I'm having trouble with a mobile addon: elements added by scripting are shown in a different charset from the page. E.g. I can read "cuadrúpedo" on the page, but the same word in my plugin shows as "cuadr¡pedo".
I tried adding the next line at the beginning of my addon, but it didn't work:
document.getElementsByTagName("html")[0].setAttribute("lang", "es");
Then I wrote a "converter function" which replaces the special characters with Unicode escapes, like the next line, but it didn't work either.
str.replace( /ú/g, "/xfa־" );
What can I do?
Probably it's a matter of text encoding.
Make sure the file that contains the literal "cuadrúpedo" is saved as utf-8, not ansi.
Keep in mind that a few key files must be ansi encoded. These are install.rdf, chrome.manifest and bootstrap.js. In this case use unicode escapes, "cuadr\u00fapedo".
When the JavaScript file is loaded (in Gecko 1.8 and later) from a chrome:// URL, a Byte Order Mark is used to determine the character encoding of the script. Otherwise, the character encoding will be the same as the one used by the XUL file. For scripts served over HTTP, one solution is to declare the character encoding as part of the Content-Type header, for example:
Content-Type: application/javascript; charset=UTF-8
For cross-version compatibility you must limit yourself to ASCII. However, you can use Unicode escapes; a multilingual string written using them would be:
var text = "Ein sch\u00F6nes Beispiel eines mehrsprachigen Textes: \u65E5\u672C\u8A9E";
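Escapes like these do not have to be written by hand. The following is a minimal Python sketch of a generator for them; the helper name `to_js_escapes` is hypothetical. It leaves ASCII untouched, so quotes and backslashes are not escaped and the result is meant to be pasted into an existing string literal.

```python
def to_js_escapes(text: str) -> str:
    """Replace every non-ASCII character with JavaScript \\uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)  # plain ASCII is kept as-is
            continue
        # JavaScript strings are sequences of UTF-16 code units, so characters
        # outside the Basic Multilingual Plane become a surrogate pair.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


escaped = to_js_escapes("cuadrúpedo")
```

`escaped` contains the ASCII-only text `cuadr\u00FApedo`, which is safe to keep in an ANSI-encoded file.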
JavaScript and Navigator support for UTF-8/Unicode means you can use non-Latin, international, and localized characters, plus special technical symbols in JavaScript programs. Unicode provides a standard way to encode multilingual text: since the UTF-8 encoding of Unicode is compatible with ASCII, programs can use ASCII characters. To receive non-ASCII character input, the client needs to send the input as Unicode.
There is a webpage for text escaping and unescaping in Javascript:
http://0xcc.net/jsescape/
Sources:
https://developer.mozilla.org/en-US/docs/International_characters_in_XUL_JavaScript
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Values,_variables,_and_literals#Unicode
The ¬ character (0xAC in ISO-8859-1) works for normal text if I ensure that ISO-8859-1 is always used as the encoding throughout. However, when using it in attributes it is escaped to: %C2%AC. I understand that it needs to be escaped for urls, but not why it escapes it in the same way as it would for UTF-8, rather than just %AC as I'd expect it to for ISO-8859-1.
Since the escapes are in the output html file the only conclusion is that the xslt processor is the cause.
Example:
input.xml
stylesheet.xslt
makefile
Which for me generates:
output.html
Output was generated using xsltproc, compiled against libxml 20707, libxslt 10126 and libexslt 815. This was on #! Linux (amd64). I have also tried: xmlstarlet tr (also uses libxml), xalan and google chrome (by adding an <?xml-stylesheet ... >, see input_ss.xml tag) with the same result.
Opera doesn't escape it at all, and it allows ¬ to be used literally in the url and attribute.
Is this standard behaviour for xslt or is this a bug in the way the attributes are escaped? And either way, is there a solution other than replacing %C2%AC with %AC bearing in mind it is almost certainly the same for other characters that are valid ISO-8859-1 and invalid in UTF-8.
There are 3 different text-based technologies in use here, XML, HTML and URIs.
All of these have escape mechanisms - that is to say, ways to use text to indicate other text that it is impossible or difficult to indicate in a given context.
The not-sign character ¬ (U+00AC) could be escaped in the first two as &#xAC; or &#172;, perhaps with some leading zeros, in both XML and HTML (&not; would also work in HTML). This escape would be used no matter what encoding the XML or HTML was in, because it relates to the character ¬, not to its set of octets in a given character encoding - indeed, we would generally only use it in the case where there was no such set of octets in the encoding being used.
In this case, this is unnecessary, since the output is in a character encoding in which there is no need to escape it, and so in the source you can see The ¬ character unescaped.
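That character references name the character rather than its bytes can be checked with Python's standard `html` module. This is only an illustration of the point; the variable names are arbitrary.

```python
import html

# Hexadecimal, decimal and named references all resolve to the same character,
# whatever encoding the surrounding document is stored in.
as_hex = html.unescape("&#xAC;")
as_dec = html.unescape("&#172;")
as_named = html.unescape("&not;")

# A reference is only needed when the output encoding cannot hold the character,
# for example plain ASCII.
escaped_for_ascii = "¬".encode("ascii", "xmlcharrefreplace")

# ISO-8859-1 can represent ¬ directly, so no escape is required there.
direct_latin1 = "¬".encode("iso-8859-1")
```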
This HTML includes the text of a URI. The encoding of the HTML has nothing to do with this, because the encoding is how we get the text of the HTML from one machine to another, but when the HTML is being parsed to read this URI we're past that point and are dealing with some text at the level of text - that is to say, it doesn't have an encoding any more.
Now, URIs have their own escape mechanisms. This must be used in the case of ¬, as it is not a character allowed in URIs (as opposed to IRIs). Sadly, unlike the escapes in XML and HTML, these escapes are based on octets in a given encoding rather than the code-point of the character itself.
It's easy to see this as a mistake now, but URIs were specified in 1994 and that formalised work going back to 1989/1990 while Unicode 1.0 was released in 1991 and didn't have the ground-breaking 2.0 until 1996, so hindsight has considerably more benefits than URI's inventors. (HTML had the same problem many years ago, but the format of its encodings made it much easier to fix this without as many backwards-compatibility issues).
So, what encoding should we use for those octets? The original specs left this undefined, but really the only possible choice is UTF-8. It's the only encoding that gives those escapes commonly used for characters special to URIs their escapes in the range 0x20 - 0x7F while also covering all of the UCS.
There's also no way to indicate another choice could be more appropriate. Remember, we're working at the level of text, so your use of ISO-8859-1 is completely irrelevant. Even if we kept track of the encoding while parsing the HTML, the URI is going to be made use of in a way that is nothing to do with the document, so we still couldn't use it. In all, if we have to make use of an octet-based encoding, and we have to keep characters in the ASCII range matching the octets they'd have in ASCII, the only possible basis for the encoding is UTF-8.
For that reason, the escape in any URI for ¬ must always be %C2%AC.
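Python's `urllib.parse.quote` follows the same convention by default, which makes the two candidate escapes easy to compare. This is a small illustration, not part of the XSLT toolchain discussed in the question.

```python
from urllib.parse import quote

# Default behaviour: percent-escape the UTF-8 octets of the character.
utf8_escape = quote("¬")

# The legacy form the question expected has to be requested explicitly.
latin1_escape = quote("¬", encoding="iso-8859-1")
```

`utf8_escape` is `%C2%AC` and `latin1_escape` is `%AC`.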
There can be some legacy systems that expect URIs to use other encodings, but the solution is to fix the bit that's broken, not the bit that works, so if something expects ¬ to be %AC then catch it close to that by converting %C2%AC close to its use (and if it outputs %AC itself then of course you'll need to fix it to %C2%AC before it hits the outside world).
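Such a boundary conversion might look like the following Python sketch. Both helper names are hypothetical, and the functions assume they receive a single percent-encoded component (for example one query parameter value), not a whole URL. A character outside ISO-8859-1 makes the first function raise `UnicodeEncodeError`, which is probably the right outcome for a system that cannot represent it.

```python
from urllib.parse import quote, unquote_to_bytes


def utf8_escapes_to_latin1(component: str) -> str:
    """Re-escape a UTF-8 percent-encoded component for a legacy ISO-8859-1 consumer."""
    text = unquote_to_bytes(component).decode("utf-8")
    return quote(text, safe="", encoding="iso-8859-1")


def latin1_escapes_to_utf8(component: str) -> str:
    """Fix up output from the legacy system before it reaches the outside world."""
    text = unquote_to_bytes(component).decode("iso-8859-1")
    return quote(text, safe="")  # quote() escapes UTF-8 octets by default


to_legacy = utf8_escapes_to_latin1("a%C2%ACb")
from_legacy = latin1_escapes_to_utf8("a%ACb")
```

Keeping both helpers right next to the legacy system means everything else can assume UTF-8 escapes throughout.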
The XSLT spec says that when serializing URI-valued attributes, all non-ASCII characters are escaped using the %HH-escaping of the UTF-8 octets that represent the character. Although %HH-escaping of other encodings has been used in the past, it is no longer used today. This is quite independent of the encoding of the document itself.