Which charset to use for multipart/alternative text/plain? - mime

My task is to send emails which most recipients will read as HTML. A MIME text/plain alternative will be included for those who cannot read the HTML or choose not to.
The HTML is in English and has characters from Latin-1 Supplement and General Punctuation, so US-ASCII or ISO-8859-1 would not retain all of them. I could mitigate by substituting characters before encoding.
My question is: which charset should I use for the text/plain part? US-ASCII, ISO-8859-1 or UTF-8? Related questions are which text-based email clients are still in use, and do they support these charsets?

I've had no answers about how well text-based email clients read charsets, so I had a look at how common email senders encode their alternative text.
Both Gmail and Outlook (2007) choose the smallest charset that can represent the content. In other words, they use US-ASCII if the text is simple, ISO-8859-* if European characters are present, or UTF-8 for a wider range of characters.
Outlook was a bit buggy in one of my tests: I put in some General Punctuation, and Outlook encoded it using WINDOWS-1252 but tagged it as ISO-8859-1.
The answer to the question, in pseudocode, is:
for charset in us-ascii, iso-8859-1, utf-8:
    if encode(text, charset) succeeds:
        break
The list of charsets is appropriate for the input I am expecting.
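A minimal Python sketch of the same idea, using only the standard library (the helper name pick_charset and the sample strings are just illustrative):

from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def pick_charset(text, candidates=("us-ascii", "iso-8859-1", "utf-8")):
    # Return the first charset in the list that can encode the whole text.
    for charset in candidates:
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    return "utf-8"  # UTF-8 can encode anything, so it is the safe fallback

plain = "Café prices rose 3–4% last year."  # Latin-1 Supplement + General Punctuation
msg = MIMEMultipart("alternative")
msg.attach(MIMEText(plain, "plain", pick_charset(plain)))       # text/plain alternative
msg.attach(MIMEText("<p>" + plain + "</p>", "html", "utf-8"))   # text/html part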

Related

Why do we use UTF-8 encoding in XHTML?

<?xml version="1.0" encoding="utf-8"?>
As per W3C standards we have to use UTF-8 encoding. Why can we not use UTF-16 or any of the other encodings?
What's the difference between UTF-8 encoding and the rest of the encoding formats?
XHTML doesn't require UTF-8 encoding. As explained in this section of the specification, any character encoding can be given -- but the default is UTF-8 or UTF-16.
According to W3Schools, there are many character encodings that help the browser interpret the document, for example:
UTF-8 - Character encoding for Unicode
ISO-8859-1 - Character encoding for the Latin alphabet.
There are several ways to specify which character encoding is used in the document. First, the web server can include the character encoding or "charset" in the Hypertext Transfer Protocol (HTTP) Content-Type header, which would typically look like this:[1]
Content-Type: text/html; charset=ISO-8859-4
This method gives the HTTP server a convenient way to alter a document's encoding according to content negotiation; certain HTTP server software can do it, for example Apache with the module mod_charset_lite.[2]
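As a small illustration (Python standard library; the URL is a placeholder), the charset declared in that header can be read back on the client side like this:

from urllib.request import urlopen

with urlopen("http://example.com/") as resp:
    declared = resp.headers.get_content_charset()  # e.g. "iso-8859-4", or None if the server did not say
    body = resp.read().decode(declared or "utf-8", errors="replace")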

Windows encoding clarification

I would like to translate a game; this game loads its strings from a text file.
The destination language uses non-ASCII characters, so I naïvely saved my file in UTF-8, but it does not work: letters with diacritics are not shown correctly.
Looking more closely at the configuration file where the strings filename is stored, I found a CHARSET option that can take any of these values:
ANSI_CHARSET DEFAULT_CHARSET SYMBOL_CHARSET MAC_CHARSET SHIFTJIS_CHARSET HANGEUL_CHARSET JOHAB_CHARSET GB2312_CHARSET CHINESEBIG5_CHARSET GREEK_CHARSET TURKISH_CHARSET VIETNAMESE_CHARSET HEBREW_CHARSET ARABIC_CHARSET BALTIC_CHARSET RUSSIAN_CHARSET THAI_CHARSET EASTEUROPE_CHARSET OEM_CHARSET
As far as I understand, these are fairly standard values in the Windows APIs, where charset and character encoding are treated as synonymous.
So my question is: is there a correspondence between these names and standard names like UTF-8 or ISO-8859-2? If so, what is it?
Try using EASTEUROPE_CHARSET
ISO 8859-2 is mostly equivalent to Windows-1250. According to this MSDN article, the 1250 code page is accessed using EASTEUROPE_CHARSET.
Note that you will need to save your text file in the 1250 code page as ISO 8859-2 is not exactly equivalent. From Wikipedia:
Windows-1250 is similar to ISO-8859-2 and has all the printable characters it has and more. However a few of them are rearranged (unlike Windows-1252, which keeps all printable characters from ISO-8859-1 in the same place). Most of the rearrangements seem to have been done to keep characters shared with Windows-1252 in the same place as in Windows-1252 but three of the characters moved (Ą,Ľ,ź) cannot be explained this way.
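If it helps, the one-off conversion of the strings file from UTF-8 to code page 1250 can be scripted; here is a rough Python sketch (the file names are placeholders, and characters that do not exist in code page 1250 are replaced):

with open("strings_utf8.txt", encoding="utf-8") as src:
    text = src.read()
with open("strings_cp1250.txt", "w", encoding="cp1250", errors="replace") as dst:
    dst.write(text)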
The names are symbolic identifiers for Windows code pages, which are character encodings (= charsets) defined or adopted by Microsoft. Many of them are registered at IANA with the prefix windows-. For example, EASTEUROPE_CHARSET stands for code page 1250, which has been registered as windows-1250 and is often called Windows Latin 2.
UTF-8 is something different. You need special routines to read and write UTF-8 encoded data. UTF-8 or UTF-16 is generally the only sensible choice for character encoding when you want to be truly global (support different languages and writing systems). For a single specific language, some of the code pages might be more practical in some cases.
You can get the standard encoding names (as registered by IANA) using the table under the remarks section of this MSDN page.
Just find the Character set row and read the Code page number; the standard name is windows-[code page number].
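As a rough aide-memoire, a few of those correspondences look like this (each entry gives the Windows code page and its IANA name; double-check against the MSDN table before relying on them):

CHARSET_TO_ENCODING = {
    "ANSI_CHARSET":       (1252, "windows-1252"),
    "EASTEUROPE_CHARSET": (1250, "windows-1250"),
    "RUSSIAN_CHARSET":    (1251, "windows-1251"),
    "GREEK_CHARSET":      (1253, "windows-1253"),
    "TURKISH_CHARSET":    (1254, "windows-1254"),
    "BALTIC_CHARSET":     (1257, "windows-1257"),
}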

Changing encoding and charset to UTF-8

We need our web application to handle additional characters, and so need to move from ISO-8859-1 to UTF-8. So my question is: is UTF-8 backwards compatible with ISO-8859-1?
I have made the following changes and can now handle all characters, but want to make sure there are no edge cases I'm missing.
Changed Content-Type:
from "text/html; charset=ISO-8859-1"
to "text/html; charset=UTF-8"
Tomcat Connector URIEncoding from ISO-8859-1 to UTF-8
is UTF-8 backwards compatible with ISO-8859-1?
Unicode is a superset of the code points contained in ISO-8859-1, so all the "characters" can be represented in UTF-8, but how they map to byte values is different. There is overlap between the encoded values (the ASCII range), but it is not 100%.
In terms of serving content or processing forms submissions you are unlikely to have many issues.
It may mean a breaking change for URL handling. For example, for a parameter value naïve there would be two incompatible forms:
http://example.com/foo?p=na%EFve
http://example.com/foo?p=na%C3%AFve
This is only likely to be an issue if there are external applications relying on the old form.
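The two forms are easy to reproduce with Python's urllib, for example (the values are the ones from the URLs above):

from urllib.parse import quote

quote("naïve", encoding="iso-8859-1")  # 'na%EFve'    (old ISO-8859-1 form)
quote("naïve", encoding="utf-8")       # 'na%C3%AFve' (new UTF-8 form)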

MIME subject decoding, when RFCs are not respected

The Subject MIME field is in ASCII. Every character outside the ASCII table has to be Q-encoded or Base64-encoded. The Content-Type field in the header also has nothing to do with the way the subject is encoded. Am I correct?
However (and unfortunately), some clients (Microsoft Outlook 6, for example) insert a string encoded in some arbitrary charset (Big5, for example) directly in the header, without declaring via Q or Base64 encoding that the string is in Big5. How can I handle these wrongly encoded emails? Is there a standard way to parse them?
My goal is to have the broadest compatibility possible, even by using third-party paid programs; how can I do that?
Subject header encoding has nothing to do with the Content-Type header. There is no "perfect" way to handle the Subject. I've implemented this with a hack that checks whether all characters of the text fit in Big5; if not, it tries the next encoding in order:
Big5, UTF-8, Latin-1, Q/Base64 and finally ASCII
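For what it's worth, a rough Python sketch of that fallback approach, built on the standard email.header module (the function name and the charset order are only the ones suggested above, not a canonical recipe):

from email.header import decode_header

def decode_subject(raw_subject, fallbacks=("big5", "utf-8", "latin-1", "ascii")):
    parts = []
    for value, declared in decode_header(raw_subject):
        if isinstance(value, str):
            parts.append(value)            # plain ASCII part, already a str
            continue
        for charset in ((declared,) if declared else ()) + fallbacks:
            try:
                parts.append(value.decode(charset))
                break
            except (UnicodeDecodeError, LookupError):
                continue
        else:
            parts.append(value.decode("latin-1", errors="replace"))  # last resort
    return "".join(parts)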

How to verify that a browser supports UTF-8 characters properly?

Is there a way to identify whether the browser encoding is set to/supports "UTF-8" from Javascript?
I want to send "UTF-8" or "English" letters based on the browser setting, transparently (i.e. without asking the user).
Edit: Sorry, I was not very clear in the question. In a browser the encoding is normally specified as Auto-Detect, Western (Windows/ISO-8859-1) or Unicode (UTF-8). If the user has set the default to Western then the characters I send are not readable. In this situation I want to inform the user to set the encoding to either "Auto-Detect" or "UTF-8".
First off, UTF-8 is an encoding of the Unicode character set. English is a language. I assume you mean 'ASCII' (a character set and its encoding) instead of English.
Second, ASCII and UTF-8 overlap; any ASCII character is sent as exactly the same bits when sent as UTF-8. I'm pretty sure all modern browsers support UTF-8, and those that don't will probably just treat it as latin1 or cp1252 (both of which overlap ASCII) so it'll still work.
In other words, I wouldn't worry about it.
Just make sure to properly mark your documents as UTF-8, either in the HTTP headers or the meta tags.
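For example, either of the usual markings (HTTP header or meta tag, values shown are the typical ones) would do:

Content-Type: text/html; charset=UTF-8
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />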
I assume the length of the output (that you read back after outputting it) can tell you what happened (or, without JavaScript, use the Accept-Charset HTTP header, and assume the UTF-8 encoding is supported when Unicode is accepted).
But you'd better worry about sending the correct UTF-8 headers et cetera, and fallback scenarios for accessibility, rather than worrying about the current browsers' UTF-8 capabilities.
