German letters in From name not shown with Java Mail - utf-8

I'm sending mail with Java Mail. I use the following to set the sender info:
msg.setFrom(new InternetAddress("test@example.com", "Schaltfläche"));
Problem: When I send this message to my GMail, the sender is shown as Schaltfl?che.
In the source it's:
From: "=?ANSI_X3.4-1968?Q?Schaltfl=3Fche?=" <test@example.com>
Which looks... OK? At least it appears an effort has been made to encode the ä.
So, what am I doing wrong? I could blame GMail, but that's a stretch, and testers are also seeing the error in other clients.
(Related but unrelated: The same name appears fine in the message body)

Through more searching, I found out two things:
ANSI_X3.4-1968 is apparently the canonical name for ASCII, which of course cannot encode ä. Also, =3F decodes as ? (don't know why it needs encoding in the first place).
There is a constructor InternetAddress(mail, name, charset)
So, I'm now creating the InternetAddress with UTF-8, which fixes the problem.
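The effect can be reproduced outside Java Mail. The sketch below uses Python purely for illustration, and its function names are hypothetical. It shows that an encoder restricted to ASCII has to substitute ? for ä, while a UTF-8 encoded-word round-trips the name intact. As for =3F: in the RFC 2047 "Q" encoding the ? character delimits the parts of an encoded-word, so a literal ? in the text has to be escaped.

```python
from email.header import Header, decode_header

NAME = "Schaltfläche"


def ascii_with_replacement(text: str) -> str:
    """Mimic an encoder limited to ASCII: unencodable characters become '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def utf8_encoded_word(text: str) -> str:
    """Build an RFC 2047 encoded-word that declares UTF-8 as its charset."""
    return Header(text, charset="utf-8").encode()


def decode_encoded_word(value: str) -> str:
    """Decode an RFC 2047 header value back into a plain string."""
    return "".join(
        chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
        for chunk, charset in decode_header(value)
    )


if __name__ == "__main__":
    print(ascii_with_replacement(NAME))   # the ä is lost at this point
    encoded = utf8_encoded_word(NAME)
    print(encoded)                        # ASCII-only header value
    print(decode_encoded_word(encoded))   # original name recovered
```

Once the ä has been replaced with ? on the sending side, no mail client can recover it, which is why the charset has to be fixed where the address is created.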

Good to see that specifying the charset for the InternetAddress object fixed it for you.
Another solution (especially if you cannot change the code) is to run the JVM with a defined default encoding via the corresponding VM argument:
-Dfile.encoding=utf-8

Issues with using UTF-8 with PHPMailer

I'm using PHPMailer 5 to send plain text emails from forms. It looks like some users are pasting content from Word into the textarea fields, and the resulting email comes out with lots of unreadable characters (e.g. “).
I've tried adding $mail->CharSet = 'UTF-8'; and that seems to fix the tests I've done (e.g. bullet lists are now coming through properly).
$mail = new PHPMailer;
$mail->CharSet = 'UTF-8';
$mail->ContentType = 'text/plain';
$mail->IsHTML(false);
Are there any security issues or other issues that could come up from setting the character set to UTF-8?
You're doing it right. PHPMailer defaults (as does PHP's internal mail function) to the ISO-8859-1 character set because that can be used in the absence of the mbstring PHP extension which is not available by default - and if you don't have that extension, UTF-8 support won't work. Once you switch to using UTF-8, your entire toolchain must also use UTF-8 - your editors, your database, your database connection. You also need to be wary of functions like strlen and substr, which are not UTF-8-safe because they work in bytes, not chars (which may be more than 1 byte long). Whenever one of those things gets it wrong, you'll see the kind of corruption you have. It's a good exercise to stick in some difficult strings to test with (though see my answer about that) to make sure it comes through unscathed.
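Both points above, byte counts versus character counts and the corruption that appears when one link in the chain assumes the wrong charset, can be shown in a few lines. This is an illustrative Python sketch rather than PHP, with hypothetical function names. In PHP terms, strlen reports the byte count while mb_strlen reports the character count.

```python
def byte_length(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def char_length(text: str) -> int:
    """Number of Unicode code points in the text."""
    return len(text)


def misread_as_cp1252(text: str) -> str:
    """Simulate a component that reads UTF-8 bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


if __name__ == "__main__":
    quote_mark = "\u201c"  # left curly quote, typical of text pasted from Word
    bullet = "\u2022"      # bullet character
    print(char_length(quote_mark), byte_length(quote_mark))
    print(misread_as_cp1252(quote_mark))
    print(misread_as_cp1252(bullet))
```

A single curly quote is one character but three bytes, and reading those three bytes with a single-byte charset yields three unrelated characters.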
Unfortunately, MS Word is one of the best examples of how to do UTF-8 badly; it often riddles the text with unnecessary unusual characters, extra control chars, etc., so I would advise doing some heavy filtering on your inputs. Editors like CKEditor have built-in filters to help deal with Word's issues. That doesn't have anything to do with PHPMailer; it's just a common problem when dealing with input that has been touched by Word.
The only thing you're doing wrong is using PHPMailer 5.x; current version is 6.x.

Java Playframework Internationalization doesn't work

I used the instructions from here:
http://www.playframework.org/documentation/1.2.1/i18n
and created files for different languages.
I call the play.i18n.Lang.change method to change the language file,
but it still takes the captions from the English file ("messages" without a suffix).
Any ideas why?
It is hard to know from your description exactly what the problem may be, so I have outlined how you should do a multi-lingual app.
There are a number of steps you must follow to get internationalisation to work. Firstly, you must change your application.conf file to declare your supported languages.
So, if you are supporting English and French, you would do
application.langs=en,fr
You must then create the language file for your French translation called messages.fr. The English language would just stay in the standard messages file. In this new file, add your name value pairs for the key and message.
The way Play processes the messages is to look first in the locale-specific message file (so for English it would be messages.en, which does not exist, and for French it would be messages.fr). If the message cannot be found in the locale-specific message file, it will look in the global messages file. So your global messages file acts as the catch-all.
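The lookup order described above can be sketched as follows. This is Python for illustration only, not Play's actual implementation. The function name, message keys and dictionaries are hypothetical stand-ins for the messages and messages.fr files.

```python
from typing import Dict, Optional

# Hypothetical contents of the global "messages" file.
GLOBAL_MESSAGES: Dict[str, str] = {
    "greeting": "Hello",
    "farewell": "Goodbye",
}

# Hypothetical contents of locale-specific files such as "messages.fr".
LOCALIZED_MESSAGES: Dict[str, Dict[str, str]] = {
    "fr": {"greeting": "Bonjour"},
}


def lookup_message(key: str, lang: str) -> Optional[str]:
    """Look in the locale-specific messages first, then fall back to global."""
    per_lang = LOCALIZED_MESSAGES.get(lang, {})
    if key in per_lang:
        return per_lang[key]
    # Either no file exists for this language or the key is missing from it.
    return GLOBAL_MESSAGES.get(key)


if __name__ == "__main__":
    print(lookup_message("greeting", "fr"))  # found in the French messages
    print(lookup_message("farewell", "fr"))  # falls back to the global file
    print(lookup_message("greeting", "en"))  # no messages.en, global is used
```

If every caption comes from the global file, then under this model either the active language was never switched or the key is missing from the locale-specific file.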
Then, in your code, set the language for your particular user, using
Lang.change("fr"); // change language to French
Remember that this stores the chosen language for that user in a PLAY_LANG cookie, so check that this cookie is being created for the user.
Final note: make sure that your files are UTF-8 encoded. It causes problems if they are not.
In my specific case I had
play.http.session.domain
set to something other than localhost while testing.

ruby string encoding

So, I'm trying to do some screen scraping off of a certain site using nokogiri, but the site owners failed to specify the proper encoding of the page in a <meta> tag. The upshot of this is that I'm trying to deal with strings that think they're utf-8, but really aren't.
(If you care, here are the files I was using to test this:
main file: http://dpaste.de/nif5/
ann.html: http://dpaste.de/YsLM/
ann2.html: http://dpaste.de/Lofi/
ann3.html: http://dpaste.de/R21j/
a-p.html: http://dpaste.de/O9dy/
output: http://dpaste.de/WdXc/
)
After doing a lot of searching around (this SO question was particularly useful), I found that calling encode('iso-8859-1', 'utf-8') on that test string "works", in that I get a proper © symbol. The issue now is that there are other characters in some other strings I want that really do not work at being converted to latin encoding (Shōta, for instance, turns into Sh�\x8Dta).
Now, I'm probably going to bother the appropriate webmasters and try and get them to fix their damn encodings, but in the meantime, I'd like to be able to use the bytes that I've got. I'm fairly certain that there is a way, but I just can't for the life of me figure out what it is.
Those pages appear to be correctly encoded as UTF-8. That's how my browser sees them, and when I view their source and tell the editor to decode them as UTF-8, they look fine. The only problem I see is that some copyright symbols seem to have been corrupted before (or as) they were added to the content. The o-macron and other non-ASCII letters come through just fine.
I don't know if you're aware of this, but the proper way to notify clients of a page's encoding is through a header. Pages may include that information in <meta> tags, but that's neither required nor expected; browsers typically ignore such tags if the header is present.
Since your pages are XHTML, they could also embed the encoding information in the XML declaration, but again, they're not required to. It also means you could have Nokogiri treat them as XML instead of HTML, in which case I would expect it to use UTF-8 by default. But I'm not familiar with Nokogiri, so I can't be sure. And anyway, the header is still the final authority.
So, the issue is that ANN only specifies the encoding via headers, and Nokogiri doesn't receive the headers from the open() function. As a result, Nokogiri guesses that the page is latin-encoded and produces strings from which we really can't recover the original characters.
You can specify the encoding to Nokogiri as the 3rd parameter to Nokogiri::HTML(), which solves the issue I was initially trying to solve. So, I'll accept this answer, even though the more specific question I asked (how to get those non-latin characters out of a latin string) is unanswerable.
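The underlying idea, independent of Nokogiri, is to decode the raw bytes using the charset named in the HTTP Content-Type header instead of letting the parser guess. A minimal Python sketch follows. The decode_body helper and the sample header value are hypothetical, and the UTF-8 default is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw: bytes, content_type: str) -> str:
    """Decode a response body using the charset the server declared."""
    return raw.decode(charset_from_content_type(content_type))


if __name__ == "__main__":
    body = "Shōta ©".encode("utf-8")         # bytes as sent by the server
    header = "text/html; charset=UTF-8"      # hypothetical header value
    print(decode_body(body, header))         # decoded with the right charset
    print(body.decode("latin-1"))            # what a wrong guess produces
```

Decoding with the declared charset avoids the need to repair strings afterwards, which mirrors passing the encoding explicitly to the parser.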

How do you think Google is handling this encoding issue?

I recently came across an encoding issue specific to how Firefox encodes URLs entered directly into the address bar. It looks like, unlike most browsers, Firefox does NOT default to UTF-8 when encoding URLs. Additionally, it seems to make some intelligent decisions about which character encoding to use, based on the content of the URL.
For example, if you enter a URL directly into the address bar (I'm using Firefox 3.5.5) with a 'q' parameter, you will get the following results:
For the given query string parameter, this is how it's actually encoded in the http request:
1) ...q=Književni --> q=Knji%9Eevni (This appears to be iso-8859-1 encoded)
2) ...q=漢字 --> q=%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded)
3) ...q=Književni漢字 --> q=Knji%C5%BEevni%E6%BC%A2%E5%AD%97 (This appears to be UTF-8 encoded ... which is odd, because notice that the first part of the value is the same as 1, which was iso-8859-1 encoded).
So, this really shouldn't be a big deal, right? Well, for me, not totally, but sort of. In the application I'm working on, we have a search box in our global navigation. When a user submits a search term in our search box, the 'q' parameter (like in our example, the parameter that holds the query string value) is submitted on the request and is UTF-8 encoded and all is well and good.
However, the URL that then appears in the address bar contains the decoded form of that URL, so the q parameter looks like "q=Književni". Now, as I mentioned before, if a user then presses the ENTER key to submit what is in the address bar, the "q=Književni" parameter is now encoded to iso-8859-1 and gets sent to our server as "q=Knji%9Eevni". The problem with this is that we are always expecting a UTF-8 encoded URL, so when we receive this parameter our application does not know how to interpret it, and it can cause some strange results.
As I mentioned before, this appears to ONLY be a Firefox issue, and it would be rare that a user would actually run into this scenario, so it is not too concerning for us. However, I happened to notice that Google actually handles this quite nicely. Typing in the following URL using either of the differently encoded forms of the query string parameter will return nice results in Google:
http://www.google.com/search?q=Knji%C5%BEevni
http://www.google.com/search?q=Knji%9Eevni
So my question really is, how do you think they handle this scenario? Additionally, does anyone else see the same strange Firefox behavior?
It looks like Firefox uses latin-1 unless some character can't be represented in that encoding, in which case it uses UTF-8.
If that is indeed the case, the way to get around this at the other end is to assume everything you receive is UTF-8, and validate it as UTF-8. If it fails validation as UTF-8 then assume it is latin-1 (iso-8859-1).
Due to the way UTF-8 is structured, it is highly unlikely that something that is not actually UTF-8 will pass when validated as UTF-8.
Still, the possibility exists and I don't think Firefox's behaviour is a good idea, though no doubt they have done it as a compromise - like for compatibility with servers that wouldn't know UTF-8 if they stepped in it.
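The validate-then-fall-back approach could look roughly like this in Python. The function name is hypothetical. One assumption to flag: the fallback here is Windows-1252 rather than strict ISO-8859-1, because the byte 0x9E seen in the question maps to ž in Windows-1252, whereas in ISO-8859-1 it is a control character.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(value: str, fallback: str = "cp1252") -> str:
    """Percent-decode a query value, trying UTF-8 first, then a legacy charset.

    Note: this does not translate '+' into a space; handle that separately
    if the value comes from a form submission.
    """
    raw = unquote_to_bytes(value)
    try:
        # Strict decoding doubles as validation: invalid UTF-8 raises here.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is a legacy
        # single-byte encoding; undefined bytes are replaced, not fatal.
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    print(decode_query_value("Knji%C5%BEevni"))        # UTF-8 form
    print(decode_query_value("Knji%9Eevni"))           # legacy form
    print(decode_query_value("%E6%BC%A2%E5%AD%97"))    # UTF-8 form
```

With this fallback, both forms of the search term from the question decode to the same string.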
There are several parts in a url. The domain name is encoded according to the IDN (International Domain Names) rules (http://en.wikipedia.org/wiki/Internationalized_domain_name).
The part that you care about comes (usually) from a form, and the encoding of the source page determines the encoding used (before the % escaping). The form element in HTML can also take an accept-charset attribute, which overrides the page setting.
So it is not the fault of Firefox, the encoding of the referrer page/form is the determining factor. And that is the standard behavior.

URI encoding in Yahoo mail compose link

I have a link-generating web app. I'd like to make it easy for users to email the links they create to others using Gmail, Yahoo Mail, etc. Yahoo Mail has a particular quirk that I need a workaround for.
If you have a Yahoo mail account, please follow this link:
http://compose.mail.yahoo.com/?body=http%3A%2F%2Flocalhost%3A8000%2Fpath%23anchor
Notice that yahoo redirects to a specific mail server (e.g. http://us.mc431.mail.yahoo.com/mc/compose). As it does, it decodes the hex codes. One of them, %23, is a hash symbol which is not legal in a query string parameter value. All info after %23 is lost.
All my links are broken, and just using another character is not an option.
Calling us.mc431.yahoo.com directly works for me, but probably not for all users, depending on their location.
I've tried setting html=true|false, putting the URL in a html tag. Nothing works. Anyone got a reliable workaround for this particular quirk?
Note: any server-based workaround is a non-starter for me. This has to be a link that's just between Yahoo and the end-user.
Thanks
Here is how I do it:
run window.escape on these chars: & ' " # > < \
run encodeURIComponent on the full string
It works for most of my cases. Newline (\n) is still an issue, but I replace \n with a space in my case and it works fine.
I have been dealing with the same problem the last couple of hours and I found a workaround!
If you double-encode the anchor, it will be interpreted correctly by Yahoo. That means changing %23 to %2523 (the percent sign itself is encoded as %25).
So your URI will be:
http://compose.mail.yahoo.com/?body=http%3A%2F%2Flocalhost%3A8000%2Fpath%2523anchor
The same workaround can be used for the ampersand. If you only encode it as %26, Yahoo will convert it to "&", which discards the rest of the message. Same procedure as above: change %26 to %2526.
I still haven't found a solution to the newline-problem though (%0D and %0A).
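The double-encoding workaround can be sketched as below. Python is used for illustration only and the helper names are hypothetical. The compose URL is the one from the question, and only # and & receive the extra encoding pass, as described above.

```python
from urllib.parse import quote

COMPOSE_URL = "http://compose.mail.yahoo.com/"


def yahoo_body_param(text: str) -> str:
    """Percent-encode text, then encode '#' and '&' a second time."""
    encoded = quote(text, safe="")  # normal encoding: '#' becomes %23
    # Second pass: encode the '%' of %23 and %26 as %25, so that one
    # round of decoding during the redirect still leaves them escaped.
    return encoded.replace("%23", "%2523").replace("%26", "%2526")


def compose_link(body: str) -> str:
    """Build a compose link whose body survives the redirect."""
    return f"{COMPOSE_URL}?body={yahoo_body_param(body)}"


if __name__ == "__main__":
    print(compose_link("http://localhost:8000/path#anchor"))
```

For the URL from the question, this produces the same link as the one shown above, ending in path%2523anchor.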
For the newline, add it as <br> and double-encode that as well; it is then interpreted correctly as a new line in the new message.
I think you're at the mercy of what Yahoo's server does when it issues the HTTP redirect. It seems like it should preserve the URL escaping on the redirect, but it doesn't. Without knowledge of their underlying application, it's hard to say why. Perhaps it's just an unintended side effect (or bug), or perhaps some of the JavaScript features on that page require them to do some finagling with the hash.
