We need our web application to handle additional characters, and so need to move from ISO-8859-1 to UTF-8. So my question is: is UTF-8 backwards compatible with ISO-8859-1?
I have made the following changes and can now handle all the characters, but I want to make sure there are no edge cases I'm missing.
Changed Content-Type:
from "text/html; charset=ISO-8859-1"
to "text/html; charset=UTF-8"
Tomcat Connector URIEncoding from ISO-8859-1 to UTF-8
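For reference, this is roughly how the URIEncoding change looks in Tomcat's conf/server.xml (a sketch; the port, protocol, and other attributes should match your existing Connector, not these example values):

```xml
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
```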
Thanks
is UTF-8 backwards compatible with ISO-8859-1?
Unicode is a superset of the code points contained in ISO-8859-1, so every ISO-8859-1 "character" can be represented in UTF-8, but how the characters map to byte values differs. The two encodings produce identical bytes for the ASCII range (0x00–0x7F); the upper half of ISO-8859-1 (0x80–0xFF) becomes two bytes per character in UTF-8, so the overlap is not 100%.
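The byte-level difference is easy to see in Java (a minimal sketch; the sample strings are arbitrary):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingOverlap {
    public static void main(String[] args) {
        String ascii = "naive";    // ASCII only: identical bytes in both encodings
        String accented = "naïve"; // ï (U+00EF) encodes differently

        System.out.println(Arrays.toString(ascii.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(Arrays.toString(ascii.getBytes(StandardCharsets.UTF_8)));

        // ï is one byte (0xEF) in ISO-8859-1 but two bytes (0xC3 0xAF) in UTF-8
        System.out.println(Arrays.toString(accented.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(Arrays.toString(accented.getBytes(StandardCharsets.UTF_8)));
    }
}
```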
In terms of serving content or processing form submissions, you are unlikely to have many issues.
It may, however, mean a breaking change for URL handling. For example, for a parameter value naïve there would be two incompatible forms:
http://example.com/foo?p=na%EFve
http://example.com/foo?p=na%C3%AFve
This is only likely to be an issue if there are external applications relying on the old form.
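You can reproduce both forms with java.net.URLEncoder, which is effectively what the two URIEncoding settings produce (a small sketch; the Charset overload requires Java 10+):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class UrlForms {
    public static void main(String[] args) {
        String p = "naïve";
        // Old form, as produced under URIEncoding="ISO-8859-1"
        System.out.println(URLEncoder.encode(p, StandardCharsets.ISO_8859_1)); // na%EFve
        // New form, as produced under URIEncoding="UTF-8"
        System.out.println(URLEncoder.encode(p, StandardCharsets.UTF_8));      // na%C3%AFve
    }
}
```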
Related
I am using Spring Boot version 2.0.5 and Liquid template version 0.7.8.
My problem is that when I use German text in the template file and then send mail, a few of the German characters are converted into question marks.
So what is the solution for this?
Somewhere along the path from the text-file template, through processing, to the sent email, the character encoding is being mangled: the German characters, encoded in one scheme, are being decoded in another and rendered as the wrong glyphs in the email.
The first thing to check is the encoding of the template file. Then investigate how the email is rendered. For example, if it is an HTML email, see whether there is a character-encoding declaration in the header that names a different encoding, e.g.:
<head><meta charset="utf-8" /></head>
If this differs from the encoding of the file, e.g. ISO-8859-1, then the first thing I would try is to resave the template in UTF-8. You should be able to do that within most IDEs or advanced text editors such as Notepad++.
(As the glyphs are question marks it may be that the template is UTF-8 or UTF-16 and the HTML is in a more limited charset.)
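Resaving can also be done programmatically. A sketch, assuming the template is currently ISO-8859-1 (the file path here is hypothetical):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResaveTemplate {
    // Re-encode a file from ISO-8859-1 to UTF-8 in place:
    // decode the raw bytes with the old charset, write them back with the new one.
    static void toUtf8(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        toUtf8(Paths.get("mail-template.liquid")); // hypothetical template path
    }
}
```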
If that doesn't work then you may need to look at your code and pay attention to how the raw bytes from the template are converted to Strings. For example:
String template = new String(bytesFromFile);
Would use the platform default Charset, which might not match the file. The safe way to convert the bytes to a String is to specify the character set explicitly; using StandardCharsets.UTF_8 also avoids the checked UnsupportedEncodingException thrown by the string-name form:
String template = new String(bytesFromFile, StandardCharsets.UTF_8);
<?xml version="1.0" encoding="utf-8"?>
As per W3C standards we have to use UTF-8 encoding. Why can't we use UTF-16 or any other encoding?
What's the difference between UTF-8 encoding and the rest of the encoding formats?
XHTML doesn't require UTF-8 encoding. As explained in this section of the specification, any character encoding can be given, but the default is UTF-8 or UTF-16.
According to W3Schools, there are lots of character encodings to help the browser understand the content:
UTF-8 - Character encoding for Unicode
ISO-8859-1 - Character encoding for the Latin alphabet.
There are several ways to specify which character encoding is used in the document. First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:[1]
Content-Type: text/html; charset=ISO-8859-4
This method gives the HTTP server a convenient way to alter a document's encoding according to content negotiation; certain HTTP server software can do it, for example Apache with the module mod_charset_lite.[2]
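As a concrete example, Apache httpd can append the charset parameter to the Content-Type of text/plain and text/html responses with a single core directive (a sketch for httpd.conf; mod_charset_lite goes further and can transcode the response body itself):

```apache
# Adds "charset=UTF-8" to the Content-Type header of text/plain and text/html responses
AddDefaultCharset UTF-8
```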
Why is there no UTF-8 entry in Firefox's encoding menu?
Is it wrong for Firefox to write "Unicode" on the encoding line? Should "UTF-8" be displayed there instead?
What is the reason?
This option is UTF-8, yes. It used to say “Unicode (UTF-8)” which was clearer.
It seems that when the encoding menu was tidied (bug 805374, I think) the encoding labels were made ‘friendlier’ by replacing the technical encoding name with a more general description, or removing it when it's the only selectable option.
It makes sense that other UTF encodings are not included: as non-ASCII-compatible encodings they can't easily be mistaken and switched between; UTF-8 is the only Unicode-family encoding that fits here. But the result of calling UTF-8 just “Unicode” is unfortunate in that Microsoft have always (misleadingly) used the term “Unicode” to mean UTF-16LE.
The reasoning (as I understand it) for not labelling it UTF-8 might be that the general label leaves room for whichever UTF encoding is needed, such as UTF-16 or UTF-8.
Firefox uses Unicode, and to do so it uses charset=utf-8.
You need to understand that Firefox will use the encoding specified in a meta tag if the server does not send encoding via the HTTP response headers.
It's specified like this (or, in HTML5, with the shorter form <meta charset="utf-8">):
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
My task is to send emails which most recipients will read as HTML. A MIME text/plain alternative will be included for those who cannot read the HTML or choose not to.
The HTML is in English and has characters from Latin-1 Supplement and General Punctuation, so US-ASCII or ISO-8859-1 would not retain all of them. I could mitigate by substituting characters before encoding.
My question is which charset to use for the text/plain part? US-ASCII, ISO-8859-1 or UTF-8. Related questions are what text based email clients are still being used, and do they support these charsets?
I've had no answers about how well text based email clients read charsets, so I had a look at how common email senders encode their alternative text.
Both GMail and Outlook (2007) choose the smallest charset that can represent the content. In other words they use US-ASCII if the text is simple, ISO-8859-* if European characters are present or UTF-8 for a large span of characters.
Outlook was a bit buggy in one of my tests: I put in some General Punctuation, and Outlook encoded it using WINDOWS-1252 but tagged it as ISO-8859-1.
The answer to the question in pseudo code is
for charset in [us-ascii, iso-8859-1, utf-8]:
    if encode(text, charset):
        break
The list of charsets is appropriate for the input I am expecting.
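The same loop can be written in Java using CharsetEncoder.canEncode to test whether a charset can represent every character of the text (a minimal sketch):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SmallestCharset {
    // Return the first charset in the preference list that can represent all of text.
    static Charset pick(String text) {
        List<Charset> candidates = List.of(
                StandardCharsets.US_ASCII,
                StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8);
        for (Charset cs : candidates) {
            if (cs.newEncoder().canEncode(text)) {
                return cs;
            }
        }
        return StandardCharsets.UTF_8; // UTF-8 can encode any text, so we never get here
    }

    public static void main(String[] args) {
        System.out.println(pick("plain text")); // US-ASCII
        System.out.println(pick("naïve"));      // ISO-8859-1
        System.out.println(pick("€1.99"));      // UTF-8 (€ is not in ISO-8859-1)
    }
}
```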
Is there a way to identify whether the browser encoding is set to/supports "UTF-8" from Javascript?
I want to send "UTF-8" or "English" letters based on browser setting transparently (i.e. without asking the User)
Edit: Sorry, I was not very clear in the question. In a browser, the encoding is normally specified as Auto-Detect, Western (Windows-1252/ISO-8859-1), or Unicode (UTF-8). If the user has set the default to Western then the characters I send are not readable. In this situation I want to inform the user to set the encoding to either "Auto-Detect" or "UTF-8".
First off, UTF-8 is an encoding of the Unicode character set. English is a language. I assume you mean 'ASCII' (a character set and its encoding) instead of English.
Second, ASCII and UTF-8 overlap; any ASCII character is sent as exactly the same bits when sent as UTF-8. I'm pretty sure all modern browsers support UTF-8, and those that don't will probably just treat it as latin1 or cp1252 (both of which overlap ASCII) so it'll still work.
In other words, I wouldn't worry about it.
Just make sure to properly mark your documents as UTF-8, either in the HTTP headers or the meta tags.
I assume the length of the output (that you read back after outputting it) can tell you what happened (or, without JavaScript, use the Accept-Charset HTTP header, and assume the UTF-8 encoding is supported when Unicode is accepted).
But you'd better worry about sending the correct UTF-8 headers and so on, and about fallback scenarios for accessibility, rather than about current browsers' UTF-8 capabilities.