MIME subject decoding when RFCs are not respected

The MIME Subject field is ASCII. Every character outside the ASCII range has to be Q-encoded or Base64-encoded (an RFC 2047 encoded-word). The Content-Type field in the header also has nothing to do with the way the Subject is encoded. Am I correct?
However (and unfortunately) some clients (Microsoft Outlook 6, for example) insert a string encoded in some arbitrary charset (BIG5, for example) into the header, without using Q/Base64 encoding to declare that the string is in BIG5. How can I handle these wrongly encoded emails? Is there a standard way to parse them?
My goal is the widest compatibility possible, even if it means using third-party paid programs; how can I achieve that? (Sorry for my poor English.)

Subject header encoding has nothing to do with the Content-Type header. There is no "perfect" way to handle the Subject. I've implemented this with a hack that checks whether all characters of the text fit in Big5; if not, it tries the next encoding in order:
Big5, UTF-8, Latin-1, Q/Base64 and finally ASCII.
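
A minimal sketch of that fallback in Python, under the assumptions above (the function names are mine). decode_header handles the properly declared Q/Base64 encoded-words; the sniffing loop handles raw, undeclared bytes. Note that Latin-1 maps every possible byte value, so it acts as the catch-all and any codec listed after it would never be reached.

from email.header import decode_header

# Codecs to try in order; latin-1 decodes anything, so it ends the list.
CANDIDATES = ("big5", "utf-8", "latin-1")

def sniff_decode(raw, declared=None):
    for codec in filter(None, (declared, *CANDIDATES)):
        try:
            return raw.decode(codec)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("latin-1")  # unreachable; kept for clarity

def decode_subject(raw_header: bytes) -> str:
    parts = []
    # decode_header splits RFC 2047 encoded-words from plain fragments
    for payload, declared in decode_header(raw_header.decode("latin-1")):
        if isinstance(payload, str):
            payload = payload.encode("latin-1")  # recover the raw bytes
        parts.append(sniff_decode(payload, declared))
    return "".join(parts)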

Related

Which charset to use for multipart/alternative text/plain?

My task is to send emails which most recipients will read as HTML. A MIME text/plain alternative will be included for those who cannot read the HTML or choose not to.
The HTML is in English and has characters from Latin-1 Supplement and General Punctuation, so US-ASCII or ISO-8859-1 would not retain all of them. I could mitigate by substituting characters before encoding.
My question is which charset to use for the text/plain part: US-ASCII, ISO-8859-1 or UTF-8? Related questions are which text-based email clients are still in use, and whether they support these charsets.
I've had no answers about how well text based email clients read charsets, so I had a look at how common email senders encode their alternative text.
Both GMail and Outlook (2007) choose the smallest charset that can represent the content. In other words they use US-ASCII if the text is simple, ISO-8859-* if European characters are present or UTF-8 for a large span of characters.
Outlook was a bit buggy on one of my tests. I put in some General Punctuation. Outlook encoded it using WINDOWS-1252 but tagged it as ISO-8859-1.
The answer to the question, in pseudocode, is:
for charset in us-ascii, iso-8859-1, utf-8:
    if encode(text, charset):
        break
The list of charsets is appropriate for the input I am expecting.
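
A runnable Python version of that pseudocode, as a sketch (the function name and sample strings are mine):

def pick_charset(text: str) -> str:
    # Try the smallest charset first, as GMail and Outlook appear to do.
    for charset in ("us-ascii", "iso-8859-1", "utf-8"):
        try:
            text.encode(charset)
            return charset
        except UnicodeEncodeError:
            continue
    return "utf-8"  # unreachable: UTF-8 can encode any string

print(pick_charset("plain text"))        # us-ascii
print(pick_charset("café"))              # iso-8859-1
print(pick_charset("dash \u2013 here"))  # utf-8 (U+2013 is General Punctuation)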

How do you think Google is handling this encoding issue?

I recently came across an encoding issue specific to how Firefox encodes URLs entered directly into the address bar. It basically looks like the default Firefox character encoding for URLs is NOT UTF-8, which is what most browsers use. Additionally, it looks like Firefox tries to make some intelligent decisions about which character encoding to use, based on the content of the URL.
For example, if you enter a URL directly into the address bar (I'm using Firefox 3.5.5) with a 'q' parameter, you will get the following results:
For the given query string parameter, this is how it is actually encoded in the HTTP request:
1) ...q=Književni --> q=Knji%9Eevni (This appears to be iso-8859-1 encoded)
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded ... which is odd, because notice that the first part of the value is the same as 1, which was iso-8859-1 encoded).
So, this really shouldn't be a big deal, right? Well, for me, not totally, but sort of. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in our search box, the 'q' parameter (like in our example, the parameter that holds the query string value) is submitted on the request and is UTF-8 encoded and all is well and good.
However, the URL that then appears in the address bar contains the decoded form of that URL, so the q parameter looks like "q=Književni". Now, as I mentioned before, if a user then presses the ENTER key to submit what is in the address bar, the "q=Književni" parameter is now encoded to iso-8859-1 and gets sent to our server as "q=Knji%9Eevni". The problem with this is that we are always expecting a UTF-8 encoded URL ... so when we receive this parameter our application does not know how to interpret it, and it can cause some strange results.
As I mentioned before, this appears to ONLY be a Firefox issue, and it would be rare that a user would actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google actually handles this quite nicely. Typing in the following URL using either of the differently encoded forms of the query string parameter will return nice results in Google:
http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni
So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?
It looks like Firefox is using Latin-1 unless some characters can't be represented in that encoding, in which case it uses UTF-8.
If that is indeed the case, the way to get around this at the other end is to assume everything you receive is UTF-8, and validate it as UTF-8. If it fails validation as UTF-8 then assume it is latin-1 (iso-8859-1).
Due to the way UTF-8 is structured, it is highly unlikely that something that is not actually UTF-8 will pass when validated as UTF-8.
Still, the possibility exists and I don't think Firefox's behaviour is a good idea, though no doubt they have done it as a compromise - like for compatibility with servers that wouldn't know UTF-8 if they stepped in it.
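
A sketch of that receive-side fallback in Python (the function name is mine). The fallback branch uses Windows-1252 rather than strict ISO-8859-1, since bytes like 0x9E (the ž in the examples above) are only printable characters there:

from urllib.parse import unquote_to_bytes

def decode_query_value(raw: str) -> str:
    data = unquote_to_bytes(raw)  # undo the %XX escaping first
    try:
        return data.decode("utf-8")         # validate as UTF-8
    except UnicodeDecodeError:
        return data.decode("windows-1252")  # legacy fallback

print(decode_query_value("Knji%C5%BEevni"))  # UTF-8 form   -> Književni
print(decode_query_value("Knji%9Eevni"))     # legacy form  -> Književni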
There are several parts in a url. The domain name is encoded according to the IDN (International Domain Names) rules (http://en.wikipedia.org/wiki/Internationalized_domain_name).
The part that you care about comes (usually) from a form, and the encoding of the source page determines the encoding (before the % escaping). The form element in HTML can also take an accept-charset attribute, which overrides the page setting.
So it is not the fault of Firefox; the encoding of the referring page/form is the determining factor. And that is the standard behavior.

Enhancing an ASCII protocol with multilingual fields

I am enhancing a piece of software that implements a simple ASCII based protocol.
The protocol is simple... here is an example of roughly what the messages look like (not the real thing; I can't show you the actual protocol):
AUTH 1 1 200<CR><LF>
To which we get a response looking similar to
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"<CR><LF>
The name "Photo Black" comes from a database sqlite database. I need to enhance it to support foreign languages. So I've been thinking that the field "Photo Black" needs to be "optionally" encoded as a UTF-8 string between the quotes. I'm wondering if there is a standard for this so that the client application can interpret the string in the quotes and straight away recognize it as either UTF-8 or plain ASCII. I'm not willing to rewrite the protocol, that would be too much work. Just slip in some kind of encoding for clients to recognize some Spanish or Swedish names.
I don't want the field to be always interpreted as UTF-8 either; long story there. You know how in C++ I can write 0xFF and the compiler knows it's a hexadecimal literal... is there an equivalent marker for UTF-8? Sorry if I'm jumping the gun, but I'm not that familiar with UTF-8 encoding or internationalization in general.
Do you have control over both the server and the client? If not, you can't change the protocol, so you won't be able to do it. When you say you're "not willing to rewrite the protocol": you're going to have to do so at least to some extent. Whatever you do, you will be changing the protocol.
I'm not sure why you wouldn't want to always interpret the data as UTF-8 either - if it's currently only ASCII, then it would be completely backward compatible to always interpret it as UTF-8, as all ASCII is encoded the same way in UTF-8. Perhaps if you could give more information, we could provide more help.
You could introduce a prefix for UTF-8-encoded strings, e.g. U:
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U"Photo UTF-8 stuff here Black"<CR><LF>
Would that help?
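
A sketch of what parsing that convention could look like, in Python (the field regex and quoting rules are assumptions; the real protocol may differ):

import re

# Matches an optionally U-prefixed, double-quoted field, e.g. NAME U"..."
FIELD = re.compile(r'(U?)"([^"]*)"')

def read_name(line: bytes) -> str:
    text = line.decode("latin-1")  # byte-transparent, loses nothing
    prefix, value = FIELD.search(text).groups()
    if prefix == "U":
        # Round-trip through latin-1 recovers the raw bytes, then decode UTF-8
        return value.encode("latin-1").decode("utf-8")
    return value  # legacy plain-ASCII field

line = b'230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U"Foto Bl\xc3\xa4ck"\r\n'
print(read_name(line))  # Foto Bläck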
Do you actually have an 8-bit data path? If something is going to mangle the top bit of every byte, then you'll need to consider options like Punycode instead of UTF-8.
Read up on the concept of ASCII-Compatible Encoding, or ACE. IDN is an example. So is/was UTF-7.
Here's the master speaking.
You really can't code-switch in and out of UTF-8. For a nightmare, look up ISO-2022, which attempted to support that sort of thing. Also keep in mind that UTF-8 includes ASCII, but not Latin-1.
Why don't you want the field to be "always interpreted as UTF-8"? You don't say.
If you do have the client interpret the protocol as UTF-8 encoded text, all of the existing output will still work correctly, since UTF-8 is a proper superset of ASCII.
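
A one-line demonstration of that backward compatibility, using the example message from the question:

# ASCII bytes are already valid UTF-8 and decode to identical text, so a
# client switched to UTF-8 still reads every existing message unchanged.
msg = b'230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"\r\n'
assert msg.decode("utf-8") == msg.decode("ascii")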

How can I use Unicode with the "mailto" protocol?

I want to launch default e-mail client application via ShellExecute function.
I.e. I write something like this:
ShellExecute(0, 'mailto:example@example.com?subject=example&body=example', ...);
How can I encode non-US characters in subject and body?
I can't use default ANSI code page, because characters can be anything: chinese characters, cyrillic or something else.
P.S. Notes:
I'm using ShellExecuteW function.
Leaving subject and body "as is" will not work (tested with Windows Live Mail client on Win7 and Outlook Express on WinXP).
Encoding subject as URLEncode(UTF8Encode(Subject)) will work for Windows Live Mail, but won't work for Outlook Express.
URLEncode(UTF8Encode(Body)) will not work for both clients.
example@example.com?subject=example&body=%e5%85%ad
The short answer is no. Characters must be percent-encoded as defined by RFC 3986 and its predecessors. RFC 2368 defines the structure of the mailto URI.
#include "windows.h"
int main() {
ShellExecute(0, TEXT("open"),
TEXT("mailto:example#example.com?subject=example&body=%e5%85%ad"),
TEXT(""), NULL, SW_SHOWNORMAL);
return 0;
}
The body in this case is the CJK character U+516D (六) encoded as UTF-8 (E5 85 AD). This works correctly with Mozilla Thunderbird (you may need to install additional fonts if it does not).
The rest is up to how your user-agent (mail client) interprets the URI. RFC 3986 mandates UTF-8, but prior specifications did not. A user-agent may fail to interpret the data correctly if it pre-dates RFC 3986, has not been updated or is maintaining backwards compatibility with prior implementations.
Note: URLEncode functions generally mean the HTML application/x-www-form-urlencoded encoding. This will probably cause space characters to be replaced by plus characters.
Note 2: I'm not current on the state of IRI support in the Windows shell, but it's probably worth looking into. However, some characters in the query part will still need to be percent-encoded.
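
As a sketch of producing such a URI from arbitrary Unicode text in Python (the helper name is mine; urllib.parse.quote percent-encodes the UTF-8 bytes by default):

from urllib.parse import quote

def mailto_uri(addr: str, subject: str, body: str) -> str:
    # quote() encodes the text as UTF-8, then %XX-escapes each byte
    return "mailto:{}?subject={}&body={}".format(
        quote(addr, safe="@"), quote(subject), quote(body))

print(mailto_uri("example@example.com", "example", "\u516d"))
# mailto:example@example.com?subject=example&body=%E5%85%AD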
The interpretation of the command line is up to the launched program. Depending on the nature of the installed e-mail client, you may or may not get your Unicode support (in one shape or form or another). So there's no single recipe. Some clients may use the ANSI command line (because why not?), some may respect URL-encoded characters, etc.
Your best bet is to detect 3-4 popular mailers by reading the registry and customize your command line accordingly. Very inelegant, and incomplete by design, but nothing else you can do.

How to verify that a browser supports UTF-8 characters properly?

Is there a way to identify from JavaScript whether the browser's encoding is set to (or supports) UTF-8?
I want to send "UTF-8" or "English" letters based on the browser setting, transparently (i.e. without asking the user).
Edit: Sorry, I was not very clear in the question. In a browser the encoding is normally specified as Auto-Detect, Western (Windows/ISO-8859-1), or Unicode (UTF-8). If the user has set the default to Western then the characters I send are not readable. In this situation I want to inform the user to set the encoding to either "Auto-Detect" or "UTF-8".
First off, UTF-8 is an encoding of the Unicode character set. English is a language. I assume you mean 'ASCII' (a character set and its encoding) instead of English.
Second, ASCII and UTF-8 overlap; any ASCII character is sent as exactly the same bits when sent as UTF-8. I'm pretty sure all modern browsers support UTF-8, and those that don't will probably just treat it as latin1 or cp1252 (both of which overlap ASCII) so it'll still work.
In other words, I wouldn't worry about it.
Just make sure to properly mark your documents as UTF-8, either in the HTTP headers or the meta tags.
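
A minimal sketch of both declarations using Python's standard-library HTTP server (the handler is illustrative; any server or framework can set the same header):

from http.server import BaseHTTPRequestHandler, HTTPServer

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Declare the charset twice: in the HTTP header and in a meta tag.
        body = '<meta charset="utf-8"><p>Knji\u017eevni \u516d</p>'.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("localhost", 8000), Utf8Handler).serve_forever()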
I assume the length of the output (that you read back after outputting it) can tell you what happened. (Or, without JavaScript, use the Accept-Charset HTTP header and assume the UTF-8 encoding is supported when Unicode is accepted.)
But you'd better worry about sending the correct UTF-8 headers et cetera, and fallback scenarios for accessibility, rather than worrying about the current browsers' UTF-8 capabilities.