Why do we use UTF-8 encoding in XHTML?

<?xml version="1.0" encoding="utf-8"?>
As per W3C standards we have to use UTF-8 encoding. Why can't we use UTF-16 or any of the other available encodings?
What's the difference between UTF-8 and the rest of the encoding formats?

XHTML doesn't require UTF-8 encoding. As explained in this section of the specification, any character encoding can be given -- but the default is UTF-8 or UTF-16.

According to W3Schools, there are many character encodings that tell the browser how to interpret a document, for example:
UTF-8 - Character encoding for Unicode
ISO-8859-1 - Character encoding for the Latin alphabet.
There are several ways to specify which character encoding is used in the document. First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:[1]
Content-Type: text/html; charset=ISO-8859-4
This method gives the HTTP server a convenient way to alter the document's encoding according to content negotiation; certain HTTP server software can do it, for example Apache with the module mod_charset_lite.[2]
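As a minimal sketch of that first method (Python's built-in http.server; the page content, port, and choice of ISO-8859-4 are only illustrative), a server could declare the charset like this:

from http.server import BaseHTTPRequestHandler, HTTPServer

# The body bytes must actually be encoded in whatever charset the header declares;
# plain ASCII content is safe in ISO-8859-4 as well as in UTF-8.
PAGE = "<p>Hello, world</p>".encode("iso-8859-4")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # The charset parameter of the Content-Type header tells the browser
        # how to decode the bytes that follow.
        self.send_header("Content-Type", "text/html; charset=ISO-8859-4")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Handler).serve_forever()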

Why is there no UTF-8 encoding in Firefox?

Is it wrong for Firefox to show "Unicode" in the encoding line? Should "UTF-8" or a specific Unicode encoding be displayed there instead?
What is the reason?
This option is UTF-8, yes. It used to say “Unicode (UTF-8)”, which was clearer.
It seems when the encoding menu was tidied (bug 805374 I think) the encoding labels were made ‘friendlier’ by replacing the technical encoding name with a more general description, or removing it when it's the only selectable option.
It makes sense that other UTF encodings are not included: as non-ASCII-compatible encodings they can't easily be mistaken and switched between; UTF-8 is the only Unicode-family encoding that fits here. But the result of calling UTF-8 just “Unicode” is unfortunate in that Microsoft have always (misleadingly) used the term “Unicode” to mean UTF-16LE.
The reasoning (as per my understanding) for not labelling it "utf-8" might be that it leaves the user free to set whichever UTF encoding they need, such as UTF-16 or UTF-8.
Firefox uses Unicode, and to do so it uses charset=utf-8.
You need to understand that Firefox will use the encoding specified in a meta tag if the server does not send an encoding via the HTTP response headers.
It is specified like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
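As a rough sketch of that precedence (plain Python with urllib and a deliberately naive regex; the final "utf-8" default is my assumption, real browsers run a fuller detection algorithm), a client could determine the charset like this:

import re
from urllib.request import urlopen

def detect_charset(url):
    with urlopen(url) as resp:
        header_charset = resp.headers.get_content_charset()  # charset from the Content-Type header
        body = resp.read()
    if header_charset:                                        # the HTTP header wins, as described above
        return header_charset
    match = re.search(rb'charset=["\']?([A-Za-z0-9_.:-]+)', body)  # naive meta-tag fallback
    return match.group(1).decode("ascii") if match else "utf-8"    # assumed default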

Changing encoding and charset to UTF-8

We need our web application to handle additional characters, and so need to move from ISO-8859-1 to UTF-8. So my question is: is UTF-8 backwards compatible with ISO-8859-1?
I have made the following changes, and can now handle all characters, but want to make sure there are no edge cases I'm missing.
Changed Content-Type:
from "text/html; charset=ISO-8859-1"
to "text/html; charset=UTF-8"
Tomcat Connector URIEncoding from ISO-8859-1 to UTF-8
Thanks
is UTF-8 backwards compatible with ISO-8859-1?
Unicode is a superset of the code points contained in ISO-8859-1, so all the "characters" can be represented in UTF-8, but how they map to byte values is different. There is overlap between the encoded values (the ASCII range is encoded identically), but it is not 100%.
In terms of serving content or processing form submissions you are unlikely to have many issues.
It may mean a breaking change for URL handling. For example, for a parameter value of naïve there would be two incompatible forms:
http://example.com/foo?p=na%EFve
http://example.com/foo?p=na%C3%AFve
This is only likely to be an issue if there are external applications relying on the old form.
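A quick Python illustration of why the two forms differ: the same string maps to different bytes, and therefore different percent-escapes, under the two encodings.

from urllib.parse import quote

value = "naïve"
print(value.encode("iso-8859-1"))            # b'na\xefve'
print(value.encode("utf-8"))                 # b'na\xc3\xafve'
print(quote(value, encoding="iso-8859-1"))   # na%EFve   (the old form)
print(quote(value, encoding="utf-8"))        # na%C3%AFve (the new form)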

Which charset to use for multipart/alternative text/plain?

My task is to send emails which most recipients will read as HTML. A MIME text/plain alternative will be included for those who cannot read the HTML or choose not to.
The HTML is in English and has characters from Latin-1 Supplement and General Punctuation, so US-ASCII or ISO-8859-1 would not retain all of them. I could mitigate by substituting characters before encoding.
My question is which charset to use for the text/plain part: US-ASCII, ISO-8859-1, or UTF-8? Related questions are which text-based email clients are still being used, and whether they support these charsets.
I've had no answers about how well text-based email clients read charsets, so I had a look at how common email senders encode their alternative text.
Both GMail and Outlook (2007) choose the smallest charset that can represent the content. In other words, they use US-ASCII if the text is simple, ISO-8859-* if European characters are present, or UTF-8 for a larger span of characters.
Outlook was a bit buggy on one of my tests. I put in some General Punctuation. Outlook encoded it using WINDOWS-1252 but tagged it as ISO-8859-1.
The answer to the question, in pseudocode (written here as working Python), is
for charset in ("us-ascii", "iso-8859-1", "utf-8"):
    try:
        text.encode(charset)   # raises UnicodeEncodeError if a character does not fit
        break                  # the first charset that can represent the text wins
    except UnicodeEncodeError:
        continue
The list of charsets is appropriate for the input I am expecting.

Smarty - Shift-jis encoding without using PHP

I want to show Shift-JIS characters, but only when displaying: store in UTF-8 and show in Shift-JIS. What is the solution for doing that in Smarty?
You cannot mix different charsets/encodings in the output to the browser, so you can send either UTF-8 OR Shift-JIS.
You can use UTF-8 internally and, in an output filter, convert the complete output from UTF-8 to Shift-JIS (using mb_convert_encoding).
Smarty is not (really) equipped to deal with charsets other than ASCII supersets (like Latin1, UTF-8) internally.
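This isn't Smarty or PHP, but as a rough Python sketch of the conversion step such an output filter would perform (the sample string is made up): keep the content in Unicode/UTF-8 internally and re-encode to Shift-JIS only when producing the final output.

rendered = "こんにちは、世界"                # template output, held internally as Unicode (stored as UTF-8)
sjis_bytes = rendered.encode("shift_jis")    # the conversion an output filter would apply before sending
print(sjis_bytes)
# The response must then declare charset=Shift_JIS (HTTP header or meta tag)
# so the browser decodes these bytes correctly.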

How to verify browser support UTF-8 characters properly?

Is there a way to identify whether the browser encoding is set to/supports "UTF-8" from Javascript?
I want to send "UTF-8" or "English" letters based on the browser setting, transparently (i.e. without asking the user).
Edit: Sorry, I was not very clear in the question. In a browser the encoding is normally specified as Auto-Detect, Western (Windows/ISO-8859-1), or Unicode (UTF-8). If the user has set the default to Western, then the characters I send are not readable. In this situation I want to inform the user to set the encoding to either "Auto-Detect" or "UTF-8".
First off, UTF-8 is an encoding of the Unicode character set. English is a language. I assume you mean 'ASCII' (a character set and its encoding) instead of English.
Second, ASCII and UTF-8 overlap; any ASCII character is sent as exactly the same bits when sent as UTF-8. I'm pretty sure all modern browsers support UTF-8, and those that don't will probably just treat it as latin1 or cp1252 (both of which overlap ASCII) so it'll still work.
In other words, I wouldn't worry about it.
Just make sure to properly mark your documents as UTF-8, either in the HTTP headers or the meta tags.
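The overlap claim above is easy to verify; a quick Python check (nothing browser-specific, just the byte values):

s = "plain ASCII text"
# ASCII text produces identical bytes under ASCII, ISO-8859-1, cp1252 and UTF-8.
assert s.encode("ascii") == s.encode("iso-8859-1") == s.encode("cp1252") == s.encode("utf-8")
# Only non-ASCII characters reveal the difference between the encodings:
print("é".encode("utf-8"))    # b'\xc3\xa9'
print("é".encode("cp1252"))   # b'\xe9'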
I assume the length of the output (that you read back after outputting it) can tell you what happened (or, without JavaScript, use the Accept-Charset HTTP header, and assume the UTF-8 encoding is supported when Unicode is accepted).
But you'd better worry about sending the correct UTF-8 headers et cetera, and fallback scenarios for accessibility, rather than worrying about the current browsers' UTF-8 capabilities.
