UTF-8 quoted-printable, multiline subject for Thunderbird?
Let's say I want to compose an email header with UTF-8, quoted-printable encoded subject, which is "test — UNIX-утилита для проверки типа файла и сравнения значений". I can confirm the bytes of the characters using:
$ echo "UNIX-утилита ..." | perl utfinfo.pl
Got 16 uchars
Char: 'U' u: 85 [0x0055] b: 85 [0x55] n: LATIN CAPITAL LETTER U [Basic Latin]
Char: 'N' u: 78 [0x004E] b: 78 [0x4E] n: LATIN CAPITAL LETTER N [Basic Latin]
Char: 'I' u: 73 [0x0049] b: 73 [0x49] n: LATIN CAPITAL LETTER I [Basic Latin]
Char: 'X' u: 88 [0x0058] b: 88 [0x58] n: LATIN CAPITAL LETTER X [Basic Latin]
Char: '-' u: 45 [0x002D] b: 45 [0x2D] n: HYPHEN-MINUS [Basic Latin]
Char: 'у' u: 1091 [0x0443] b: 209,131 [0xD1,0x83] n: CYRILLIC SMALL LETTER U [Cyrillic]
Char: 'т' u: 1090 [0x0442] b: 209,130 [0xD1,0x82] n: CYRILLIC SMALL LETTER TE [Cyrillic]
Char: 'и' u: 1080 [0x0438] b: 208,184 [0xD0,0xB8] n: CYRILLIC SMALL LETTER I [Cyrillic]
...
So, I'm trying to get the UTF-8, quoted printable representation of this. For instance, using Python's quopri:
$ python -c 'import quopri; a="test — UNIX-утилита для проверки типа файла и сравнения значений"; print(quopri.encodestring(a));'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
... or PHP's quoted_printable_encode, which gives the exact same output:
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; echo quoted_printable_encode($a)."\n";'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
So, to test, I make a text file called test.eml, and try to simply wrap this output in the =?UTF-8?Q? ... ?= tags for the Subject: line, making sure that line endings are CRLF \r\n:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... but if I open this in Thunderbird, the subject is displayed corrupted.
I've read somewhere that folding of long header fields is covered by RFC 822, section "LONG HEADER FIELDS", and that, basically, each line break should be followed by whitespace. So I indent the continuation lines by one space:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
 =D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
 =D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
 =D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... and I get a slightly different subject in Thunderbird, but it is still corrupt.
Now, if I delete =\r\n from the first three continuation lines, so the subject is all in one line:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... then Thunderbird actually displays the subject line correctly.
... but then my header is in conflict with the recommendation from RFC 2822 - 2.1.1. Line Length Limits which says "Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."; specifically the line limit of 78 characters.
So, how can I obtain the proper multi-line quoted-printable representation of a UTF-8 Subject header string, so that I can use it in an .eml file folded at 78 characters and have Thunderbird read it correctly?
When I ask python to create an email with that subject, here's what it does:
$ python
Python 2.7.9 (default, Mar 1 2015, 18:22:53)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.message import Message
>>> from email.header import Header
>>> msg = Message()
>>> import quopri
>>> h = Header(quopri.decodestring('test =E2=80=94 UNIX-'
...     '=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F '
...     '=D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8'
...     '=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 '
...     '=D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F '
...     '=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?='), 'UTF-8')
>>> msg['Subject'] = h
>>> print msg.as_string()
Subject: =?utf-8?b?dGVzdCDigJQgVU5JWC3Rg9GC0LjQu9C40YLQsCDQtNC70Y8g0L/RgNC+0LI=?=
 =?utf-8?b?0LXRgNC60Lgg0YLQuNC/0LAg0YTQsNC50LvQsCDQuCDRgdGA0LDQstC90LU=?=
 =?utf-8?b?0L3QuNGPINC30L3QsNGH0LXQvdC40Lk/?=
>>>
So it uses base64 encoding instead of quoted-printable, but the output strongly suggests the answer: each line must open and close its own =?...?= encoded-word.
Indeed:
>>> import email
>>> s = '''Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0?=
...  =?UTF-8?Q?=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0?=
...  =?UTF-8?Q?=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
...  =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0?=
...  =?UTF-8?Q?=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1?=
...  =?UTF-8?Q?=87=D0=B5=D0=BD=D0=B8=D0=B9?=
...
... Hello.
... '''
>>> e = email.message_from_string(s.replace('\n', '\r\n'))
>>> email.header.decode_header(e['Subject'])
[('test \xe2\x80\x94 UNIX-\xd1\x83\xd1\x82\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x82\xd0\xb0 \xd0\xb4\xd0\xbb\xd1\x8f \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb5\xd1\x80\xd0\xba\xd0\xb8 \xd1\x82\xd0\xb8\xd0\xbf\xd0\xb0 \xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb\xd0\xb0 \xd0\xb8 \xd1\x81\xd1\x80\xd0\xb0\xd0\xb2\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xb7\xd0\xbd\xd0\xb0\xd1\x87\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb9', 'utf-8')]
>>> decoded = email.header.decode_header(e['Subject'])
>>> print decoded[0][0].decode(decoded[0][1])
test — UNIX-утилита для проверки типа файла и сравнения значений
EDIT: However, even with the above pasted into the .eml file, Thunderbird fails again,
... but this time it gets some of the characters right. The breakage occurs where a line is broken in the middle of a character: for the sequence 0xD1, 0x83 of the character у, if =D1?= ends one line and =?UTF-8?Q?=83 starts the next, Thunderbird cannot decode it. (RFC 2047 in fact requires each encoded-word to represent an integral number of characters.) After manually moving the breaks onto character boundaries, this snippet can be obtained:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8?=
 =?UTF-8?Q?=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
 =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0?=
 =?UTF-8?Q?=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0?=
 =?UTF-8?Q?=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... which opens fine as an .eml message in Thunderbird, showing the same correct subject that the OP got from the single-line header.
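The manual rearrangement can also be automated. The following is only a minimal sketch, assuming Python 3; the helper names q_encode_char and q_encoded_words and the limit of 66 characters per encoded-word (chosen so that "Subject: " plus the first word stays under 78 characters) are illustrative choices, not part of any library. It encodes one character at a time, so an encoded-word can never end in the middle of a multi-byte character, and it writes spaces as _ so that each encoded-word is a single token.

```python
from email.header import decode_header

# Characters that may appear literally inside a Q-encoded word in any header context.
SAFE = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!*+-/")


def q_encode_char(ch):
    """Q-encode one character: safe ASCII as-is, space as '_', the rest as =XX."""
    if ch == " ":
        return "_"
    if ch in SAFE:
        return ch
    return "".join("=%02X" % byte for byte in ch.encode("utf-8"))


def q_encoded_words(text, limit=66):
    """Split text into =?UTF-8?Q?...?= words without breaking inside a character."""
    prefix, suffix = "=?UTF-8?Q?", "?="
    room = limit - len(prefix) - len(suffix)
    words, current = [], ""
    for ch in text:
        piece = q_encode_char(ch)
        # Start a new encoded-word before this character would overflow the current one.
        if current and len(current) + len(piece) > room:
            words.append(prefix + current + suffix)
            current = ""
        current += piece
    if current:
        words.append(prefix + current + suffix)
    return words


subject = "test — UNIX-утилита для проверки типа файла и сравнения значений"
words = q_encoded_words(subject)
# Fold the header: CRLF followed by one space before each continuation line.
header = "Subject: " + "\r\n ".join(words)
print(header)
```

Each element of words can be decoded on its own with email.header.decode_header, which is a quick way to check that no character was split.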
EDIT2: Also PHP seems to do it right, with this invocation of mb_encode_mimeheader (directly pasteable in .eml file):
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; mb_internal_encoding("UTF-8"); echo mb_encode_mimeheader($a, "UTF-8", "Q")."\n";'
test =?UTF-8?Q?=E2=80=94=20UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82?=
 =?UTF-8?Q?=D0=B0=20=D0=B4=D0=BB=D1=8F=20=D0=BF=D1=80=D0=BE=D0=B2=D0=B5?=
 =?UTF-8?Q?=D1=80=D0=BA=D0=B8=20=D1=82=D0=B8=D0=BF=D0=B0=20=D1=84=D0=B0?=
 =?UTF-8?Q?=D0=B9=D0=BB=D0=B0=20=D0=B8=20=D1=81=D1=80=D0=B0=D0=B2=D0=BD?=
 =?UTF-8?Q?=D0=B5=D0=BD=D0=B8=D1=8F=20=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD?=
 =?UTF-8?Q?=D0=B8=D0=B9?=
The problem with your test.eml is that your RFC 2047 encoding is broken. The Q encoding is based on quoted-printable, but is not entirely the same. In particular, each space needs to be encoded as either =20 or _, and you cannot escape line breaks with a final =.
Fundamentally, each =?...?= sequence needs to be a single, unambiguous token per RFC 822. You can either break up your input into multiple such tokens and leave the spaces unencoded, or encode the spaces. Note that spaces between two such tokens are not significant, so encoding the spaces into the sequences makes more sense.
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test_=E2=80=94_UNIX-=D1=83=D1=82=D0=B8=D0=BB?=
 =?UTF-8?Q?=D0=B8=D1=82=D0=B0_=D0=B4=D0=BB=D1=8F_=D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8_=D1=82=D0=B8=D0=BF?=
 =?UTF-8?Q?=D0=B0_=D1=84=D0=B0=D0=B9=D0=BB=D0=B0_=D0=B8_=D1=81?=
 =?UTF-8?Q?=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F_?=
 =?UTF-8?Q?=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
Of course, with this exposition, quoted-printable isn't really legible at all, and probably takes much more space than base64, so you might prefer to go with the B encoding in the end after all.
Unless you are writing a MIME library yourself, the simple solution is to not care, and let the library piece this together for you. PHP is more problematic (the standard library lacks this functionality, and the third-party libraries are somewhat uneven--find one you trust, and stick to it), but in Python, simply pass in a Unicode string, and the email library will encode it if necessary.
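As a rough illustration of letting the library do the work, here is a minimal sketch assuming Python 3 and only the standard email package. By default the utf-8 charset picks whichever of base64 and Q encoding is shorter for headers; overriding header_encoding on a private Charset instance (the variable name utf8_q is just for this example) asks for Q encoding instead.

```python
from email.charset import QP, Charset
from email.header import Header, decode_header, make_header

subject = "test — UNIX-утилита для проверки типа файла и сравнения значений"

# A private Charset object, so the global utf-8 defaults stay untouched.
utf8_q = Charset("utf-8")
utf8_q.header_encoding = QP

# header_name lets the library account for "Subject: " on the first line.
header = Header(subject, charset=utf8_q, header_name="Subject")
folded = header.encode()  # continuation lines are joined with "\n "
print("Subject: " + folded)

# Round trip: decode the folded value back into the original text.
restored = str(make_header(decode_header(folded)))
print(restored == subject)
```

The library splits the text character by character, so no encoded-word ends in the middle of a multi-byte sequence. When writing an .eml file by hand, the "\n" line separators still need to become CRLF.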
Related
How to calculate Content-length properly in tclhttpd?
My Tcl source files are in utf-8. Tclhttpd would not send national characters properly, so I modified it a bit. However, I also send binary stuff like jpg images and sometimes binary chunks are present in my otherwise utf-8 HTML. I have difficulty calculating the proper Content-length to match exactly what the browser receives (otherwise some trailing characters clobber the next-request headers or the browser keeps waiting 30 sec per request, until a timeout). In other words, can I please know how many bytes did puts $socket write into the socket?
I have discovered a particular 11-byte sequence that messes up counting:
proc dump3 string {
    binary scan $string c* c
    binary scan $string H* hex
    return [sdump $string]\n$c\n$hex
};#dump3
proc Httpd_ReturnData {sock type content {code 200} {close 0}} {
    global Httpd
    upvar #0 Httpd$sock data
    #...skip non-pertinent code...
    set content \x4f\x4e\xc2\x00\x03\xff\xff\x80\x00\x3c\x2f
    #content=ONÂÿÿ�</
    #79 78 -62 0 3 -1 -1 -128 0 60 47
    #4f4ec20003ffff80003c2f
    puts content=[dump3 $content]
    puts utf8=[dump3 [encoding convertto utf-8 $content]]
    if {[catch {
        puts "string length=[string length $content] type=$type"
        puts "stringblength=[string bytelength $content]"
        set len [string length $content]
        if [string match -nocase *utf-8* $type] {
            fconfigure $sock -encoding utf-8
            set len [string bytelength $content]
        }
        puts "len=$len fcon=[fconfigure $sock]"
        HttpdRespondHeader $sock $type $close $len $code
        HttpdSetCookie $sock
        puts $sock ""
        if {$data(proto) != "HEAD"} {
            ##fconfigure $sock -translation binary -blocking $Httpd(sockblock)
            ##native: -translation {auto crlf}
            fconfigure $sock -translation lf -blocking $Httpd(sockblock)
            puts -nonewline $sock $content
        }
        Httpd_SockClose $sock $close
    } err]} {
        HttpdCloseFinal $sock $err
    }
}
The output on console is:
content=ONÂÿÿ�</
79 78 -62 0 3 -1 -1 -128 0 60 47
4f4ec20003ffff80003c2f
utf8=ONÃ�ÿÿÂ�</
79 78 -61 -126 0 3 -61 -65 -61 -65 -62 -128 0 60 47
4f4ec3820003c3bfc3bfc280003c2f
string length=11 type=text/html;charset=utf-8
stringblength=17
len=17 fcon=-blocking 0 -buffering full -buffersize 16384 -encoding utf-8 -eofchar {{} {}} -translation {auto crlf} -peername {128.0.0.71 128.0.0.71 55305} -sockname {128.0.0.8 gen 8016}
HttpdRespondHeader 17
The resultant Content-Length: 17 is too much, the browser keeps waiting. If I only could know beforehand, how many bytes puts will make out of my string, the rest would be easy. Is there a way?
For data going over HTTP, the content length should be the number of bytes in the data as observed on the wire. When working with Httpd_ReturnData you need to ensure that you provide it the binary data to transfer; it does not handle encoding the data for you.
To send binary data with a length it's actually easy, and you do:
set binaryData [...]
Httpd_ReturnData $sock "application/octet-stream" $binaryData
# There are many other binary encodings; that's just the most universal one
# Choose the right one for your application, of course
To send text data with a length, you need to do a little more work with encoding convertto:
set textData [...]
Httpd_ReturnData $sock "text/plain; charset=utf-8" \
        [encoding convertto utf-8 $textData]
# Similarly, text/plain is a decent fallback here too
(Yes, if you choose a different encoding then you should mention that in both places. You probably ought to use UTF-8 for all text content in this day and age.)
If you can pull the data from a file, you should do so; Httpd_ReturnFile is more efficient than Httpd_ReturnData as it can move the data using efficient data transfer techniques. If sending a text file, you need to be careful to describe the encoding of the file correctly. By far the easiest way to do that is by convention, such as deciding that all text files on your system are UTF-8...
You should virtually never use string bytelength, as that reports in units that are one of Tcl's internal-only encodings (a lightly-denormalized almost-UTF-8). The measure it returns is only correct when you're doing something very weird like generating C code that needs to know buffer sizes that contain strings that will be fed into Tcl's implementation, which is very much not what you're doing (I've only done that sort of thing once in more than 20 years of using Tcl; I've never heard of another legitimate use). I believe it is deprecated precisely because it has a bunch of subtle bugs in how it is used by all too many people.
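The three different counts from the question can be reproduced outside Tcl. The sketch below is in Python rather than Tcl and is only an illustration; the explanation of the value 17 rests on the assumption that Tcl's internal encoding writes each NUL character as the two bytes C0 80, which would account for the two extra bytes.

```python
# The 11 characters from the question, each taken as a code point below U+0100.
content = "\x4f\x4e\xc2\x00\x03\xff\xff\x80\x00\x3c\x2f"

# What actually goes on the wire when the text is sent as UTF-8.
wire_bytes = content.encode("utf-8")

print(len(content))      # 11: the character count ("string length")
print(len(wire_bytes))   # 15: the byte count a Content-Length header needs
print(wire_bytes.hex())  # 4f4ec3820003c3bfc3bfc280003c2f

# Assumption: the internal form spends one extra byte on every NUL character.
internal_length = len(wire_bytes) + content.count("\x00")
print(internal_length)   # 17: matches the "string bytelength" in the question
```

The hex dump matches the utf8= line in the question's console output, so the correct Content-Length for this payload sent as UTF-8 is 15, not 11 or 17.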
How to detect and fix incorrect character encoding
An upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, applies ISO-8859-1 to UTF-8 encoding, and sends them to my service, labeled as UTF-8. The upstream service is out of my control. They may fix it, or it may never be fixed. I know that I can fix the encoding by applying UTF-8 to ISO-8859-1 encoding then labeling the bytes as UTF-8. But what happens if my upstream fixes their issue? Is there any way to detect this issue and fix the encoding only when I find a bad encoding? I'm also not sure that the upstream encoding is ISO-8859-1. I think the upstream is perl so that encoding makes sense and each sample I've tried decoded correctly when I apply ISO-8859-1 encoding.
When the source sends e2 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â).
utf-8 string ✔ as bytes: e2 9c 94
bytes e2 9c 94 as latin1 string: â
utf-8 string â as bytes: c3 a2 c2 9c c2 94
I can fix it applying upstream.encode('ISO-8859-1').force_encoding('UTF-8') but it will break as soon as the upstream issue is fixed.
Since you know how it is mangled, you can try to unmangle it by decoding the received UTF-8 bytes, encoding to latin1, and decoding as UTF-8 again. Only your mangled strings, pure ASCII strings, or very unlikely latin-1 string combinations will successfully decode twice. If that decoding fails, assume the upstream was fixed and just decode once as UTF-8. A pure ASCII string will correctly decode with either method so there is no issue there as well. There are valid UTF-8-encoded sequences that survive a double-decode but they are unlikely to occur in normal text.
Here's an example in Python (you didn't mention a language...):
# Assume bytes are latin1, but return encoded UTF-8.
def bad(b):
    return b.decode('latin1').encode('utf8')

# Assume bytes are UTF-8, and pass them along.
def good(b):
    return b

def decoder(b):
    try:
        return b.decode('utf8').encode('latin1').decode('utf8')
    except UnicodeError:
        return b.decode('utf8')

b = '✔'.encode('utf8')
print(decoder(bad(b)))
print(decoder(good(b)))
Output:
✔
✔
Bare ISO 8859-1 is almost guaranteed to be invalid UTF-8. Attempting to decode as ISO 8859-1 and then as UTF-8, and falling back to simply decoding as UTF-8 if this produces invalid byte sequences should work for this specific case. In some more detail, the UTF-8 encoding severely restricts which non-ASCII character sequences are allowed. The allowed patterns are extremely unlikely in ISO-8859-1 because in this encoding, they represent sequences like à followed by an unprintable control character or mathematical operator, which simply do not tend to occur in any valid text.
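A small check makes this concrete. This is only a sketch, assuming Python 3; is_valid_utf8 is a helper name invented for this example.

```python
def is_valid_utf8(data):
    """Return True if the bytes can be decoded as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Ordinary accented text stored as ISO-8859-1 is not valid UTF-8:
# the byte E9 announces a multi-byte sequence, but a plain "s" follows.
print(is_valid_utf8("Résumé".encode("latin-1")))  # False
print(is_valid_utf8("Résumé".encode("utf-8")))    # True

# The unlikely exception: the Latin-1 text "Ã©" is the byte pair C3 A9,
# which happens to be the valid UTF-8 encoding of "é".
print(is_valid_utf8("Ã©".encode("latin-1")))      # True
```

The last case is exactly the kind of sequence that rarely occurs in genuine ISO-8859-1 text, which is why the decode-and-fall-back approach is safe in practice.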
Based on Mark Tolonen's answer, again in Python 3:
def maybe_fix_encoding(utf8_string, possible_codec="cp1252"):
    """Attempts to fix mangled text caused by interpreting UTF8 as cp1252
    (or other codec: https://docs.python.org/3/library/codecs.html)"""
    try:
        return utf8_string.encode(possible_codec).decode('utf8')
    except UnicodeError:
        return utf8_string

>>> maybe_fix_encoding("some normal text and some scandinavian characters Ã¦ Ã¸ Ã¥ Ã† Ã˜ Ã…")
'some normal text and some scandinavian characters æ ø å Æ Ø Å'
Based on turpachull's answer, and the python3 list of standard encodings (& Mark Amery's answer listing the set for various versions of python), here's a script that will attempt every encoding transform on stdin and output each version if it's different from the plain utf_8.
#!/usr/bin/env python3
import sys
import fileinput

encodings = ["ascii", "big5hkscs", "cp1006", "cp1125", "cp1250", "cp1252", "cp1254", "cp1256", "cp1258", "cp273", "cp437", "cp720", "cp775", "cp852", "cp856", "cp858", "cp861", "cp863", "cp865", "cp869", "cp875", "cp949", "euc_jis_2004", "euc_kr", "gbk", "hz", "iso2022_jp_1", "iso2022_jp_2004", "iso2022_jp_ext", "iso8859_11", "iso8859_14", "iso8859_16", "iso8859_3", "iso8859_5", "iso8859_7", "iso8859_9", "koi8_r", "koi8_u", "latin_1", "mac_cyrillic", "mac_iceland", "mac_roman", "ptcp154", "shift_jis_2004", "utf_16_be", "utf_32", "utf_32_le", "utf_7", "utf_8_sig", "big5", "cp037", "cp1026", "cp1140", "cp1251", "cp1253", "cp1255", "cp1257", "cp424", "cp500", "cp737", "cp850", "cp855", "cp857", "cp860", "cp862", "cp864", "cp866", "cp874", "cp932", "cp950", "euc_jisx0213", "euc_jp", "gb18030", "gb2312", "iso2022_jp", "iso2022_jp_2", "iso2022_jp_3", "iso2022_kr", "iso8859_10", "iso8859_13", "iso8859_15", "iso8859_2", "iso8859_4", "iso8859_6", "iso8859_8", "johab", "koi8_t", "kz1048", "mac_greek", "mac_latin2", "mac_turkish", "shift_jis", "shift_jisx0213", "utf_16", "utf_16_le", "utf_32_be", "utf_8"]

def maybe_fix_encoding(utf8_string, possible_codec="utf_8"):
    try:
        return utf8_string.encode(possible_codec).decode('utf_8')
    except UnicodeError:
        return utf8_string

for line in sys.stdin:
    for e in encodings:
        i=line.rstrip('\n')
        result=maybe_fix_encoding(i, e)
        if result != i or e == 'utf_8':
            print("\t".join([e, result]))
    print("\n")
usage e.g.:
$ echo 'Requiem der morgenrÃ¶te' | ~/decode_string.py
cp1252 Requiem der morgenröte
cp1254 Requiem der morgenröte
iso2022_jp_1 Requiem der morgenr(D**B"yte
iso2022_jp_2 Requiem der morgenr(D**B"yte
iso2022_jp_2004 Requiem der morgenr(Q):B"yte
iso2022_jp_3 Requiem der morgenr(O):B"yte
iso2022_jp_ext Requiem der morgenr(D**B"yte
latin_1 Requiem der morgenröte
iso8859_9 Requiem der morgenröte
iso8859_14 Requiem der morgenröte
iso8859_15 Requiem der morgenröte
mac_iceland Requiem der morgenr̦te
mac_roman Requiem der morgenr̦te
mac_turkish Requiem der morgenr̦te
utf_7 Requiem der morgenr+AMMAtg-te
utf_8 Requiem der morgenrÃ¶te
utf_8_sig Requiem der morgenrÃ¶te
Proper way to validate DKIM-Signature (b= part)
I'm trying to develop my home email server (with NodeJS on server side but it's not important as I try to figure out principles). I use this documentation to guide myself through DKIM-Signature validation routine, but it requires some complicated steps and I can't figure out where is my mistake. For an email example I used one sent from Mail.ru server. It should be totally valid. There is it's header: DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2; h=References:In-Reply-To:Content-Type:Message-ID:Reply-To:Date:MIME-Version:Subject:To:From; bh=gCWDSCJf58CbaR+wjAV9dydu9JTKkvo1o+0zkj8bNr0=; b=pheltY+k/mio2x4CFQV8cXZxNiR7oSTkIsWTOZa1CGpEyK8KVSHY07OWSdZ1aFVtuaV32PbI0mNY0yliuqIbYTsnreFUYFM/iVR5PU74QHAe8yp46ydAYRbzLQu8dy+AkFhPtEdb8CAgoZKXgPLc888/Q6MsVAh6iH1L3SZj87Y=; Received: by f427.i.mail.ru with local (envelope-from <[my name]#mail.ru>) id 1dbP18-0003I9-L7 for madbr#[domain]; Sat, 29 Jul 2017 13:30:42 +0300 Received: by e.mail.ru with HTTP; Sat, 29 Jul 2017 13:30:42 +0300 From: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru> To: madbr#[domain] Subject: =?UTF-8?B?UmU6IA==?= MIME-Version: 1.0 X-Mailer: Mail.Ru Mailer 1.0 Date: Sat, 29 Jul 2017 13:30:42 +0300 Reply-To: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru> X-Priority: 3 (Normal) Message-ID: <1501324242.448202607#f427.i.mail.ru> Content-Type: multipart/mixed; boundary="----uEhsLqzDWmmGeA9EZ3XNsqSIGjlgVTmA-NI9QMhpqxNHWLEDT-1501324242" Authentication-Results: f427.i.mail.ru; auth=pass smtp.auth=[my name]#mail.ru smtp.mailfrom=[my name]#mail.ru X-7FA49CB5: 0D63561A33F958A58B4AE7CD4FB69874B38CA0D04717BA57612FFEEC28D99E31725E5C173C3A84C325A81A29FB5043FD044813140D6DB928F1C9CF18C8EB2269C4224003CC836476C0CAF46E325F83A50BF2EBBBDD9D6B0F2AF38021CC9F462D574AF45C6390F7469DAA53EE0834AAEE X-Mailru-Sender: 
080178E06F6B3F48806FD386034E228604900381AF51F7DD303A634C9E25199A8DFBC783E67F8C0305D8C6CDFE81985CCFB2E39DA8E91CCEEEC687A792225BA622DF1A08BD40178CA471C22AD050A14893AC9912533B2342AE208404248635DF X-Mras: OK X-Spam: undefined In-Reply-To: <1500037364.788302144#mx47.mail.ru> References: <1500037364.788302144#mx47.mail.ru> Validation instruction says: In hash step 1, the Signer/Verifier MUST hash the message body, canonicalized using the body canonicalization algorithm specified in the "c=" tag and then truncated to the length specified in the "l=" tag. That hash value is then converted to base64 form and inserted into (Signers) or compared to (Verifiers) the "bh=" tag of the DKIM- Signature header field. In hash step 2, the Signer/Verifier MUST pass the following to the hash algorithm in the indicated order. 1. The header fields specified by the "h=" tag, in the order specified in that tag, and canonicalized using the header canonicalization algorithm specified in the "c=" tag. Each header field MUST be terminated with a single CRLF. 2. The DKIM-Signature header field that exists (verifying) or will be inserted (signing) in the message, with the value of the "b=" tag (including all surrounding whitespace) deleted (i.e., treated as the empty string), canonicalized using the header canonicalization algorithm specified in the "c=" tag, and without a trailing CRLF. The first step is easy: I've get message body, canonicalized it using relaxed: function (data) { return data.replace(/[ \t]+\r\n/g, '\r\n').replace(/[ \t]+/g, ' ').replace(/\r\n{2,}$/g, CONST.CRLF); } and created sha256 (according to a= tag) hash of it. It matched bh= tag in DKIM-Signature header and yet I'm happy. For a next step I perform next actions: 1) Get all required headers from message in order given in h= signature tag. 
References: <1500037364.788302144#mx47.mail.ru> In-Reply-To: <1500037364.788302144#mx47.mail.ru> Content-Type: multipart/mixed; boundary="----uEhsLqzDWmmGeA9EZ3XNsqSIGjlgVTmA-NI9QMhpqxNHWLEDT-1501324242" Message-ID: <1501324242.448202607#f427.i.mail.ru> Reply-To: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru> Date: Sat, 29 Jul 2017 13:30:42 +0300 MIME-Version: 1.0 Subject: =?UTF-8?B?UmU6IA==?= To: madbr#[domain] From: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru> 2) Canonicalized it: references:<1500037364.788302144#mx47.mail.ru> in-reply-to:<1500037364.788302144#mx47.mail.ru> content-type:multipart/mixed; boundary="----uEhsLqzDWmmGeA9EZ3XNsqSIGjlgVTmA-NI9QMhpqxNHWLEDT-1501324242" message-id:<1501324242.448202607#f427.i.mail.ru> reply-to:=?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru> date:Sat, 29 Jul 2017 13:30:42 +0300 mime-version:1.0 subject:=?UTF-8?B?UmU6IA==?= to:madbr#[domain] from:=?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru> 3) Get DKIM-Signature, removed b= tag and also canonalized it (trailing \r\n was also removed according to documentation): dkim-signature:v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2; h=References:In-Reply-To:Content-Type:Message-ID:Reply-To:Date:MIME-Version:Subject:To:From; bh=gCWDSCJf58CbaR+wjAV9dydu9JTKkvo1o+0zkj8bNr0==; 4) Get public key from DNS TXT record and appended -----BEGIN PUBLIC KEY-----...-----END PUBLIC KEY----- for PEM format compatibility. 5) At last I used standard RSA validation function to validate it: crypto.createVerify('sha256') .update(header + dkimHeader) .verify(publicKey, Buffer.from(signature.b, CONST.BASE64)); But it failed, and I don't really know which actions to blame. In last step I concatenated header and DKIM-Signature, because I don't really understand what does "pass the following to the hash algorithm in the indicated order" mean. Tried to use .update(header).update(dkimHeader), but it made no difference. Can someone explain please, what do I do wrong?
From section 3.7, "Computing the Message Hashes", of the RFC:
In hash step 2, the Signer/Verifier MUST pass the following to the hash algorithm in the indicated order.
1. The header fields specified by the "h=" tag, in the order specified in that tag, and canonicalized using the header canonicalization algorithm specified in the "c=" tag. Each header field MUST be terminated with a single CRLF.
2. The DKIM-Signature header field that exists (verifying) or will be inserted (signing) in the message, with the value of the "b=" tag (including all surrounding whitespace) deleted (i.e., treated as the empty string), canonicalized using the header canonicalization algorithm specified in the "c=" tag, and without a trailing CRLF.
The important part is that the value of the "b=" tag is deleted: only the value should be removed, not the complete tag. So the correct last line of the input is (note the b=; at the end):
dkim-signature:v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2; h=References:In-Reply-To:Content-Type:Message-ID:Reply-To:Date:MIME-Version:Subject:To:From; bh=gCWDSCJf58CbaR+wjAV9dydu9JTKkvo1o+0zkj8bNr0=; b=;
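To make the difference between deleting the tag and deleting its value concrete, here is a rough sketch in Python (not the asker's Node.js code). The function names blank_b_value and relaxed_header are invented for this example, the header value is a made-up sample rather than the real mail.ru signature, and the canonicalization covers only the relaxed header rules of unfolding, lower-casing the name and collapsing whitespace.

```python
import re


def blank_b_value(value):
    """Empty the value of the b= tag but keep the tag itself ("b=")."""
    # Anchor on the start or a ';' so that bh= and base64 text ending in "b=" are left alone.
    return re.sub(r"((?:^|;)\s*b\s*=)[^;]*", r"\1", value)


def relaxed_header(name, value):
    """Relaxed header canonicalization: unfold, collapse whitespace, lower-case the name."""
    value = re.sub(r"\r\n(?=[ \t])", "", value)          # unfold continuation lines
    value = re.sub(r"[ \t]+", " ", value).strip(" ")     # collapse runs, trim the ends
    return name.strip().lower() + ":" + value


# Hypothetical folded DKIM-Signature value, without its trailing CRLF.
raw_value = " v=1; a=rsa-sha256; d=example.com;\r\n\tbh=AAAA+b=;  b=QUJD\r\n REVG=;"

# Last item fed to the hash: no trailing CRLF is appended.
signing_input_tail = relaxed_header("DKIM-Signature", blank_b_value(raw_value))
print(signing_input_tail)
# dkim-signature:v=1; a=rsa-sha256; d=example.com; bh=AAAA+b=; b=;
```

The other signed header fields would be canonicalized the same way, each followed by a single CRLF, and placed before this last line.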
Character encoding with Ruby 1.9.3 and the mail gem
I'm trying to parse email strings with the Ruby mail gem, and I'm having a devil of a time with character encodings. Take the following email:
MIME-Version: 1.0
Sender: foobar#example.com
Received: by 10.142.239.17 with HTTP; Thu, 14 Jun 2012 06:00:18 -0700 (PDT)
Date: Thu, 14 Jun 2012 09:00:18 -0400
Delivered-To: foobar#gmail.com
X-Google-Sender-Auth: MxfFrMybNjBoBt4O4GwAn9cMsko
Message-ID: <CAGErOzF3FV5NvzN3zUpLGPok96SFzK18Z4HerzyYNALnzgMVaA#mail.gmail.com>
Subject: Re: [Lorem Ipsum] Foo updated the forum topic 'Reply by email test'
From: Foo Bar <foo#example.com>
To: Foo <c49964d167e08e7d4a1930e6565f23c258be19a0#foo.example.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

This email has accents:=A0R=E9sum=E9
>
> --------- Reply Above This Line ------------
>
> Email parsing with accents: R=E9sum=E9
>
> Click here to view this post in your browser
The email body, when properly encoded, should be:
This reply has accents: Résumé
>
> --------- Reply Above This Line ------------
>
> Email parsing with accents: Résumé
>
> Click here to view this post in your browser
However, I'm having a devil of a time actually getting the accent marks to come through. Here's what I've tried:
message = Mail.new(email_string)
body = message.body.decoded
That gets me a string that starts like this:
This reply has accents:\xA0R\xE9sum\xE9\r\n>\r\n> --------- Reply Above This Line ------------
Finally, I try this:
body.encoding # => <Encoding:ASCII-8BIT>
body.encode("UTF-8") # => Encoding::UndefinedConversionError: "\xA0" from ASCII-8BIT to UTF-8
Does anyone have any suggestions on how to deal with this? I'm pretty sure it has to do with the "charset=ISO-8859-1" setting in the email, but I'm not sure how to use that, or if there's a way to easily extract that using the mail gem.
After playing a bit, I found this:
body.decoded.force_encoding("ISO-8859-1").encode("UTF-8") # => "This reply has accents: Résumé..."
message.parts.map { |part| part.decoded.force_encoding("ISO-8859-1").encode(part.charset) } # multi-part
You can extract the charset from the message like so.
message.charset #=> for simple, non-multipart
message.parts.map { |part| part.charset } #=> for multipart, each part can have its own charset
Be careful with non-multipart, as the following can cause trouble:
body.charset #=> returns "US-ASCII" which is WRONG!
body.force_encoding(body.charset).encode("UTF-8") #=> Conversion error...
body.force_encoding(message.charset).encode("UTF-8") #=> Correct conversion :)
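The underlying principle (undo the transfer encoding to get raw bytes, then interpret those bytes using the charset declared in Content-Type) is not specific to Ruby. For comparison only, here is a minimal sketch of the same two steps with Python's standard email package rather than the mail gem; the message is a shortened, made-up version of the one in the question.

```python
import email

# Shortened stand-in for the message in the question.
raw = (
    "MIME-Version: 1.0\r\n"
    "Content-Type: text/plain; charset=ISO-8859-1\r\n"
    "Content-Transfer-Encoding: quoted-printable\r\n"
    "\r\n"
    "This email has accents:=A0R=E9sum=E9\r\n"
)

message = email.message_from_string(raw)

# Step 1: undo quoted-printable; the result is bytes with no text encoding attached.
payload = message.get_payload(decode=True)

# Step 2: interpret those bytes using the charset from the Content-Type header.
charset = message.get_content_charset()
text = payload.decode(charset)

print(charset)
print(repr(text))
```

As in the Ruby answer, the charset has to come from the message (or part) headers; the decoded body bytes carry no reliable charset of their own.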
This didn't work for me, so thought I'd stick up the solution I got to in case it helps anyone... Basically had to add encoding defaults and tweak the output into sensible strings. https://stackoverflow.com/a/26604049/2386548
Explain what those escaped numbers mean in unicode encoding in ruby 1.8.7
0186 is the unicode "code". Where do 198 and 134 come from? How can I go the other way around, from these byte codes to unicode strings?
>> c = JSON '["\\u0186"]'
[
    [0] "Ɔ"
]
>> c[0][0]
198
>> c[0][1]
134
>> c[0][2]
nil
Another confusing thing is unpack. Another seemingly arbitrary number. Where does that come from? Is it even correct? From the 1.8.7 String#unpack documentation:
U | Integer | UTF-8 characters as unsigned integers
>> c[0].unpack('U')
[
    [0] 390
]
>
You can find your answers on the reference page for Unicode Character 'LATIN CAPITAL LETTER OPEN O' (U+0186). Note that 186 (hexadecimal) === 390 (decimal).
C/C++/Java source code : "\u0186"
UTF-32 (decimal) : 390
UTF-8 (hex) : 0xC6 0x86 (i.e. 198 134)
You can read more about UTF-8 encoding on Wikipedia's article on UTF-8:
UTF-8 (UCS Transformation Format, 8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.
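The relationship between the three numbers can be worked out by hand. The sketch below uses Python 3 rather than Ruby 1.8.7, purely to show the arithmetic: code points from U+0080 to U+07FF are stored in UTF-8 as two bytes of the form 110xxxxx 10xxxxxx.

```python
code_point = 0x0186
print(code_point)              # 390, the number unpack('U') reported

# Let the codec do the work: the two bytes seen as c[0][0] and c[0][1].
encoded = chr(code_point).encode("utf-8")
print(list(encoded))           # [198, 134]

# The same by hand: the top five bits go into the lead byte, the low six into the trail byte.
lead = 0b11000000 | (code_point >> 6)
trail = 0b10000000 | (code_point & 0b00111111)
print(lead, trail)             # 198 134

# The other way around: from the bytes back to the code point.
decoded = bytes([198, 134]).decode("utf-8")
print(hex(ord(decoded)))       # 0x186
```

So 390 is the code point in decimal, while 198 and 134 are the individual bytes of its UTF-8 encoding, which is what Ruby 1.8.7 returns when a string is indexed byte by byte.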