How can I use unicode in "mailto" protocol? - winapi

I want to launch default e-mail client application via ShellExecute function.
I.e. I write something like this:
ShellExecute(0, 'mailto:example@example.com?subject=example&body=example', ...);
How can I encode non-US characters in subject and body?
I can't use default ANSI code page, because characters can be anything: chinese characters, cyrillic or something else.
P.S. Notes:
I'm using ShellExecuteW function.
Leaving subject and body "as is" will not work (tested with Windows Live Mail client on Win7 and Outlook Express on WinXP).
Encoding subject as URLEncode(UTF8Encode(Subject)) will work for Windows Live Mail, but won't work for Outlook Express.
URLEncode(UTF8Encode(Body)) will not work for both clients.

The short answer is no. Characters must be percent-encoded as defined by RFC 3986 and its predecessors; RFC 2368 defines the structure of the mailto URI. The target URI looks like this:
example@example.com?subject=example&body=%e5%85%ad
#include "windows.h"
int main() {
ShellExecute(0, TEXT("open"),
TEXT("mailto:example#example.com?subject=example&body=%e5%85%ad"),
TEXT(""), NULL, SW_SHOWNORMAL);
return 0;
}
The body in this case is the CJK character U+516D (六) encoded as UTF-8 (E5 85 AD). This works correctly with Mozilla Thunderbird (you may need to install additional fonts if it does not).
The rest is up to how your user-agent (mail client) interprets the URI. RFC 3986 mandates UTF-8, but prior specifications did not. A user-agent may fail to interpret the data correctly if it pre-dates RFC 3986, has not been updated or is maintaining backwards compatibility with prior implementations.
Note: URLEncode functions generally mean the HTML application/x-www-form-urlencoded encoding. This will probably cause space characters to be replaced by plus characters.
Note 2: I'm not current on the state of IRI support in the Windows shell, but it's probably worth looking into. However, some characters in the query part will still need to be percent-encoded.
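If the subject and body are only known at runtime, the same approach can be generalized: convert the wide string to UTF-8 and percent-encode every byte outside the RFC 3986 unreserved set. Below is a minimal sketch of that idea; the helper name Utf8PercentEncode is illustrative, not a Windows API.

#include <windows.h>
#include <cctype>
#include <string>

// Percent-encode the UTF-8 bytes of a wide string. RFC 3986 unreserved
// characters pass through unchanged; everything else becomes %XX.
// Utf8PercentEncode is an illustrative helper, not part of any API.
static std::wstring Utf8PercentEncode(const std::wstring& text) {
    int len = WideCharToMultiByte(CP_UTF8, 0, text.c_str(), -1, NULL, 0, NULL, NULL);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, text.c_str(), -1, &utf8[0], len, NULL, NULL);

    static const wchar_t hex[] = L"0123456789ABCDEF";
    std::wstring out;
    for (unsigned char c : utf8) {
        if (c == '\0') break;                               // terminator added by -1 above
        if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            out += static_cast<wchar_t>(c);                 // unreserved: keep as-is
        } else {
            out += L'%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

int main() {
    std::wstring uri = L"mailto:example@example.com?subject="
                     + Utf8PercentEncode(L"example")
                     + L"&body=" + Utf8PercentEncode(L"\u516D");   // U+516D -> %E5%85%AD
    ShellExecuteW(0, L"open", uri.c_str(), NULL, NULL, SW_SHOWNORMAL);
    return 0;
}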

The interpretation of the command line is up to the launched program. Depending on the installed e-mail client, you may or may not get Unicode support, in one shape or form or another, so there's no single recipe. Some clients may read the command line as ANSI (because why not?), some may respect URL-encoded characters, and so on.
Your best bet is to detect the 3-4 popular mailers by reading the registry and customize your command line accordingly. Very inelegant, and incomplete by design, but there's nothing else you can do.
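For reference, a sketch of that kind of detection, assuming the classic registration point for protocol handlers (HKEY_CLASSES_ROOT\mailto\shell\open\command); newer Windows versions may also record a per-user choice elsewhere, and the executable names matched below are only examples:

#include <windows.h>
#include <stdio.h>

int main() {
    // Read the command line registered for the mailto: protocol handler.
    // RegGetValueW needs Vista or later; on XP use RegOpenKeyEx/RegQueryValueEx.
    wchar_t command[1024];
    DWORD size = sizeof(command);
    LSTATUS rc = RegGetValueW(HKEY_CLASSES_ROOT, L"mailto\\shell\\open\\command",
                              NULL,                // default (unnamed) value
                              RRF_RT_REG_SZ,       // REG_EXPAND_SZ is expanded automatically
                              NULL, command, &size);
    if (rc != ERROR_SUCCESS) {
        wprintf(L"no mailto handler registered (error %ld)\n", (long)rc);
        return 1;
    }
    // Crude detection: look for well-known executable names in the command line.
    if (wcsstr(command, L"OUTLOOK.EXE"))          wprintf(L"Outlook detected\n");
    else if (wcsstr(command, L"wlmail.exe"))      wprintf(L"Windows Live Mail detected\n");
    else if (wcsstr(command, L"thunderbird.exe")) wprintf(L"Thunderbird detected\n");
    else                                          wprintf(L"unknown client: %ls\n", command);
    return 0;
}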

Related

Issues with using UTF-8 with PHPMailer

I'm using PHPMailer 5 to send plain text emails from forms. It looks like some users are pasting content from Word into the textarea fields and the resulting email comes out with lots of non-readable characters (e.g. “).
I've tried adding $mail->CharSet = 'UTF-8'; and that seems to fix the tests I've done (e.g. bullet lists are now coming through properly).
$mail = new PHPMailer;
$mail->CharSet = 'UTF-8';
$mail->ContentType = 'text/plain';
$mail->IsHTML(false);
Are there any security issues or other issues that could come up from setting the character set to UTF-8?
You're doing it right. PHPMailer defaults (as does PHP's internal mail function) to the ISO-8859-1 character set because that can be used in the absence of the mbstring PHP extension, which is not available by default - and if you don't have that extension, UTF-8 support won't work.
Once you switch to using UTF-8, your entire toolchain must also use UTF-8: your editors, your database, your database connection. You also need to be wary of functions like strlen and substr, which are not UTF-8-safe because they work in bytes, not chars (which may be more than 1 byte long). Whenever one of those things gets it wrong, you'll see the kind of corruption you have.
It's a good exercise to stick in some difficult strings to test with (though see my answer about that) to make sure they come through unscathed.
Unfortunately, MS Word is one of the best examples of how to do UTF-8 badly; it often riddles the text with unnecessary unusual characters, extra control chars, etc., so I would advise doing some heavy filtering on your inputs - editors like CKEditor have built-in filters to help deal with Word's issues. That doesn't have anything to do with PHPMailer; it's just a common problem with dealing with input that has been touched by Word.
The only thing you're doing wrong is using PHPMailer 5.x; current version is 6.x.

Character set conversion problem - debug invalid characters - reverse engineer earlier conversions

Character conversion problem.
I have a few strings which are incorrectly encoded or decoded.
The strings came in an ASCII format CSV file.
The current strings I have are:
N‚met
Tet‹
I know that the:
"‚" character (0x82) should originally be "é" (e with acute accent)
"‹" character (0x8B) should originally be "ő" (o with double acute accent)
How can I debug and reverse engineer, what conversions happened with the original characters to get the current characters?
I suppose that multiple decode/encode steps happened, but I was not able to reproduce the original characters.
I'm posting an expanded version of my comment as an answer:
Your viewer uses CP1252 (English and Western Europe, also called ANSI on Windows), CP1250 (Eastern Europe), or another similar code page. Most characters are encoded the same way in these code pages, with just a few language-specific differences, and your example does not include any character that differs between the two, so I cannot say precisely which one.
Those code pages are used on Microsoft Windows and are based on (though not 100% compatible with) Latin-1, so it is common to see text interpreted with such an encoding. macOS and Linux are now heavily UTF-8 based; Windows uses Unicode internally, but as UTF-16.
The old encoding is probably CP437, the standard code page in DOS, which was therefore frequently used for CSV files as well. Other common legacy encodings are CP850 (Western Europe) and CP852 (Central Europe).
As for the other questions you put in the comments: if you are asking for tools, that belongs on Super User. Some editors let you specify the encoding, a browser opening a local file also lets you choose the encoding (and may let you copy the text out as Unicode), and other tools sometimes have hidden import options. If you want to do it programmatically, ask a new question on this site and specify the language. Python is well suited for such conversions (most scripting languages were created to handle text): it has many encodings built in, and you just specify them when reading and writing the files. R can also be told the input encoding.
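As a concrete way to test that hypothesis on Windows, here is a small sketch (assuming the viewer used CP1252 and the original was one of the DOS code pages listed above): map the visible characters back to bytes with the viewer's code page, then decode those bytes with each candidate code page and see which result reads correctly.

#include <windows.h>
#include <stdio.h>

int main() {
    // The strings as they appear in the CP1252 viewer: "N‚met" and "Tet‹".
    const wchar_t* garbled[] = { L"N\u201Amet", L"Tet\u2039" };

    for (const wchar_t* g : garbled) {
        // Step 1: back to raw bytes, assuming the viewer interpreted them as CP1252.
        char bytes[64];
        if (WideCharToMultiByte(1252, 0, g, -1, bytes, sizeof(bytes), NULL, NULL) == 0)
            continue;

        // Step 2: decode those bytes with each candidate legacy DOS code page.
        const UINT candidates[] = { 437, 850, 852 };
        for (UINT cp : candidates) {
            wchar_t recovered[64];
            if (MultiByteToWideChar(cp, 0, bytes, -1, recovered, 64) > 0)
                wprintf(L"CP%u: %ls\n", cp, recovered);   // CP852 yields "Német" / "Tető"
        }
        wprintf(L"\n");
    }
    return 0;
}

With these particular characters, 0x82 is "é" in all three candidates, but only CP852 maps 0x8B to "ő", so the output makes the right candidate easy to spot.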
I wrote my own utility that helped me to diagnose and fix many thorny encoding issues. It is available as part of an open-source library. The utility converts any String to a Unicode sequence and vice versa. All you have to do is:
String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Hello world");
and it will return the String "\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u006f\u0072\u006c\u0064".
The same works for any String in any language, including special characters. Here is the link to the article Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison, which explains the library and where to get it (available both on Maven Central and GitHub). In the article, search for the paragraph "String Unicode converter". So when you read your String, convert it and see what comes up. This way you will see which symbols are there, and whether the information is correct and merely distorted by a wrong encoding, or whether the information itself has been lost. You can easily find tables on the internet that map any symbol to its Unicode code point.
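If you are not working in Java, the same diagnostic takes only a few lines in most languages; here is a rough C++ equivalent (not that library's API) that prints each UTF-16 code unit of a wide string as a \uXXXX escape so you can see exactly which code points you actually have:

#include <cstdio>
#include <string>

// Dump each code unit as \uXXXX so mis-decoded characters become visible.
// (On Windows a wchar_t is a UTF-16 code unit; non-BMP characters appear as surrogate pairs.)
void dumpUnicodeSequence(const std::wstring& s) {
    for (wchar_t ch : s)
        std::printf("\\u%04x", static_cast<unsigned>(ch) & 0xFFFF);
    std::printf("\n");
}

int main() {
    dumpUnicodeSequence(L"N\u201Amet");   // prints \u004e\u201a\u006d\u0065\u0074
    return 0;
}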

Making Indy return utf-8 strings instead of ansi strings in FPC/Lazarus?

Are there any compiler directives or preprocessor commands that need to be set in a particular way to make Indy return Utf-8 strings rather than truncating them into Ansi strings? The project I'm working on has all kinds of Delphi-mode flags all over it if that matters.
If I directly set the subject line to a UTF-8 string (like below) it displays correctly on the GUI, so utf-8 support is set up correctly and I'm using an appropriate font and all of that good stuff. Subject is declared as Utf8String for clarity in this code.
MailItem.Subject := 'îņŢëŕŃïóЙǟŁ ŜũƥĵεϿד'; //Displays correctly
However, if I pull the same subject line from the header, using Indy to decode it, every international character is replaced with exactly one question mark - one per intended character. It looks like it's converting UTF-8 to ANSI, which is not what I want.
MailItem.Subject := IdCoderHeader.DecodeHeader('=?utf-8?B?w67FhsWiw6vFlcWDw6/Ds9CZx5/FgSDFnMWpxqXEtc61z7/Xkw==?='); //Displays '?????????? ???????'
So what things could be going wrong and/or how can I fix it?
I am using the latest version of Indy10 from Indy's website, and Lazarus 1.0 on Windows, so I don't think this is a "my software needs updating" bug, I think it's probably some sort of configuration issue.
No, there are no compiler flags you can set for that.
Indy uses AnsiString in Delphi prior to D2009, and in FreePascal prior to 3.0. Indy uses UnicodeString in Delphi 2009+ and FreePascal 3.0+. There is no option to change that.
However, in non-Unicode versions of Delphi and FreePascal, there are some places where you can instruct Indy to interpret AnsiString input values as UTF-8 and to return UTF-8-encoded AnsiString output. DecodeHeader() is not one of those places, though.

Enhancing an ASCII protocol with multilingual fields

I am enhancing a piece of software that implements a simple ASCII based protocol.
The protocol is simple... here is an example of roughly what the messages look like (not the real messages - I can't show you the actual protocol):
AUTH 1 1 200<CR><LF>
To which we get a response looking similar to
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"<CR><LF>
The name "Photo Black" comes from a database sqlite database. I need to enhance it to support foreign languages. So I've been thinking that the field "Photo Black" needs to be "optionally" encoded as a UTF-8 string between the quotes. I'm wondering if there is a standard for this so that the client application can interpret the string in the quotes and straight away recognize it as either UTF-8 or plain ASCII. I'm not willing to rewrite the protocol, that would be too much work. Just slip in some kind of encoding for clients to recognize some Spanish or Swedish names.
I don't want the field to be always interpreted as UTF-8 either, long story there. You know how in C++ I can type 0xFF and the compiler knows that this is a hex string... is there an equivalent for UTF-8? Sorry I may be jumping the gun but I'm not that familiar with UTF-8 encoding and internationalization in general.
Do you have control over both the server and the client? If not, you can't change the protocol, so you won't be able to do it. When you say you're "not willing to rewrite the protocol" - you're going to have to do so at least to some extent. Whatever you do, you will be changing the protocol.
I'm not sure why you wouldn't want to always interpret the data as UTF-8 either - if it's currently only ASCII, then it would be completely backward compatible to always interpret it as UTF-8, as all ASCII is encoded the same way in UTF-8. Perhaps if you could give more information, we could provide more help.
You could introduce a prefix for UTF-8-encoded strings, e.g. U:
230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U"Photo UTF-8 stuff here Black"<CR><LF>
would that help?
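For what it's worth, here is a sketch of how the receiving side could honour such a marker (the U prefix and the NAME field come from the example above; the struct, function, and sample name below are purely illustrative): if the quoted value is preceded by U, treat the bytes between the quotes as UTF-8, otherwise as plain ASCII.

#include <iostream>
#include <string>

// A quoted NAME value, optionally carrying the proposed U prefix:
//   NAME "Photo Black"     -> plain ASCII
//   NAME U"..."            -> bytes between the quotes are UTF-8
struct NameField {
    std::string bytes;   // raw bytes between the quotes
    bool isUtf8;
};

NameField parseNameField(const std::string& line) {
    NameField f{ "", false };
    std::size_t pos = line.find("NAME ");
    if (pos == std::string::npos) return f;
    pos += 5;
    if (pos < line.size() && line[pos] == 'U') {   // optional UTF-8 marker
        f.isUtf8 = true;
        ++pos;
    }
    std::size_t open = line.find('"', pos);
    if (open == std::string::npos) return f;
    std::size_t close = line.find('"', open + 1);
    if (close == std::string::npos) return f;
    f.bytes = line.substr(open + 1, close - open - 1);
    return f;
}

int main() {
    NameField f = parseNameField(
        "230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U\"Caf\xC3\xA9 Black\"\r\n");
    std::cout << (f.isUtf8 ? "UTF-8: " : "ASCII: ") << f.bytes << "\n";
    return 0;
}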
Do you actually have an 8-bit data path? If something is going to mangle the top bit of every byte, then you'll need to consider options like Punycode instead of UTF-8.
Read up on the concept of an ASCII-Compatible Encoding, or ACE. IDNA is an example. So is/was UTF-7.
Here's the master speaking.
You really can't code-switch in and out of UTF-8. For a nightmare, look up ISO-2022, which attempted to support that sort of thing. Also keep in mind that UTF-8 includes ASCII, but not Latin-1.
Why don't you want the field to be "always interpreted as UTF-8"? You don't say.
If you do have the client interpret the protocol as UTF-8 encoded text, all of the existing output will still work correctly, since UTF-8 is a proper superset of ASCII.

ANSI or OEM Codepage when using MME and DirectMusic?

I noticed that when reading MIDI port names from MME, the names are multi-byte strings encoded using the ANSI Codepage, which my app uses by default. When receiving those names from the DirectMusic driver, the names are wide-character strings encoded with the OEM Codepage. See this article by Raymond Chen for a quick refresher on Codepages.
On my German system, this means that when using the current codepage, which turns out to be the ANSI one, I get "Audiogerät" from MME, and "Audiogeröt" from DirectMusic, the latter being wrong. This gets fixed when I treat that last name as OEM-encoded instead.
So how do I know with which codepage to decode those names? Why does the name coming from DirectMusic get encoded differently? Does it come from the USB driver? The COM framework? DirectMusic? How can I know for sure which codepage to use when reading the names of my MIDI ports?
For info:
I use the MultiByteToWideChar() and WideCharToMultiByte() functions to perform the conversions, with CP_ACP and CP_OEMCP as argument for the codepage to use.
I use midiInGetDeviceCaps() to get MIDI port information from the MME subsystem...
... and convert MIDIINCAPS.szPname using the CP_ACP (ANSI) codepage.
I use IDirectMusic8::EnumPort() to get port information from DirectMusic...
... and convert DMUS_PORTCAPS.wszDescription using the CP_OEMCP codepage (both conversions are sketched after this list).
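A minimal sketch of those two conversions (device/port enumeration and error handling trimmed; the DirectMusic description string is assumed to have been fetched already, and the German name is just the example from above):

#include <windows.h>
#include <mmsystem.h>      // midiInGetDevCapsA, MIDIINCAPSA
#include <stdio.h>
#pragma comment(lib, "winmm.lib")

int main() {
    // MME: szPname arrives as an ANSI (CP_ACP) multi-byte string.
    MIDIINCAPSA caps;
    if (midiInGetDevCapsA(0, &caps, sizeof(caps)) == MMSYSERR_NOERROR) {
        wchar_t wide[MAXPNAMELEN];
        MultiByteToWideChar(CP_ACP, 0, caps.szPname, -1, wide, MAXPNAMELEN);
        printf("MME port 0: %ls\n", wide);
    }

    // DirectMusic: wszDescription is already wide. The observation above is that
    // converting it for an ANSI app only looks right when CP_OEMCP is used.
    // 'description' stands in for DMUS_PORTCAPS.wszDescription obtained via EnumPort.
    const wchar_t* description = L"Audioger\u00E4t";
    char ansi[256], oem[256];
    WideCharToMultiByte(CP_ACP,   0, description, -1, ansi, sizeof(ansi), NULL, NULL);
    WideCharToMultiByte(CP_OEMCP, 0, description, -1, oem,  sizeof(oem),  NULL, NULL);
    printf("as ANSI bytes: %s\nas OEM bytes:  %s\n", ansi, oem);
    return 0;
}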
I don't know for sure why the DirectMusic framework would use one set of codepages and MME another, but the solution on your end is probably to build an abstraction layer and then make specific implementations for each API. That way, the higher levels of your software don't need to concern themselves with details like this.
That said, the endpoint names definitely come from the OS. USB MIDI devices specify only endpoint types (i.e. input or output, and their number), but the OS is free to interpret them as it sees fit, which is why they are localized.
There is no specific API call (as far as I know) to find out which codepage the framework will deliver its strings in. However, DirectMusic does seem to use wide-character strings carrying OEM-codepage data as a general convention, though I could not find this clearly stated in any of the MSDN docs. In the MSDN DirectMusic documentation about MIDI port capability structures, the description field is clearly defined as a WCHAR array, and the Game Audio Programming book seems to indicate that this is an API-wide convention. While it's dangerous to assume that OEM is the default encoding for these chars, I can't find anything that says otherwise (and googling for "DirectMusic codepage" now lists this page as the top hit).
Edit: Check out this stackoverflow question on determining the current OS codepage. It is possible that the DirectMusic API sets the codepage in this manner.
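The active code pages themselves are at least easy to query; a tiny sketch using the documented GetACP/GetOEMCP calls:

#include <windows.h>
#include <stdio.h>

int main() {
    // e.g. 1252 (ANSI) and 850 (OEM) on many Western European systems.
    printf("ANSI code page: %u\n", GetACP());
    printf("OEM  code page: %u\n", GetOEMCP());
    return 0;
}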
There isn't really an automatic way to tell what codepage is used for these types of data. See here: How can I detect the encoding/codepage of a text file

Resources