Problem
I have an e-mail that I can't read into Python with the correct encoding, although Outlook displays it correctly.
Reading a message
import email
# Open the file in a with-block so it is closed after parsing
with open('/path/to/file.eml') as f:
    message = email.message_from_file(f)
Example
This is part of an email which should be decoded to "Denne her får du":
Content-Type: text/plain;
charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Denne=20her=20f=EF=BF=BDr=20du
When I decode it with UTF-8, I get the following:
import quopri
mystring = 'Denne=20her=20f=EF=BF=BDr=20du'
decoded_string = quopri.decodestring(mystring)
print(decoded_string.decode('utf-8'))
>Denne her f�r du
When I just open the e-mail in Outlook I get:
"Denne her får du"
How do I decode it correctly?
Is the correct encoding stated somewhere else in the e-mail? There must be, right? Otherwise, how is Outlook able to decode the message?
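One thing worth checking before blaming the decoder: the bytes EF BF BD are the UTF-8 encoding of U+FFFD, the replacement character. The sketch below only inspects the sample line from above. It suggests that the "å" was already replaced before this part was quoted-printable encoded, in which case no charset choice can bring it back from this part alone. Where Outlook gets its "å" from cannot be told from this snippet.

```python
import quopri

# The quoted-printable line from the e-mail, as bytes
raw = b"Denne=20her=20f=EF=BF=BDr=20du"

decoded_bytes = quopri.decodestring(raw)
text = decoded_bytes.decode("utf-8")

# U+FFFD (REPLACEMENT CHARACTER) is what UTF-8 decodes EF BF BD to,
# so the decoding itself is correct; the original letter is simply absent.
has_replacement = "\ufffd" in text
print(repr(text), has_replacement)
```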
Related
I have several clients using a mail client that I wrote myself. They have recently stumbled upon emails where attachment file names arrive as gibberish.
When I examined these emails, I have discovered that there is apparently a local webmail service that sends attachment names as follows:
Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;
name*="UTF-8''%D7%A2%D7%A8%D7%9B%D7%AA%20%D7%94%D7%A8%D7%A9%D7%9E%D7%94%20TCMP.docx"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename*=UTF-8''%D7%A2%D7%A8%D7%9B%D7%AA%20%D7%94%D7%A8%D7%A9%D7%9E%D7%94%20TCMP.docx
This is a totally invalid mime header according to RFC 2047. It has no quoted-printable identifier (?Q?), the different bytes are encoded with % instead of =, and the entire encoded-word should begin with =? and end with ?=, which it doesn't.
When I fix it to the correct format like so:
Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document;
name="=?UTF-8?Q?=D7=A2=D7=A8=D7=9B=D7=AA=20=D7=94=D7=A8=D7=A9=D7=9E=D7=94=20TCMP.docx?="
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="=?UTF-8?Q?=D7=A2=D7=A8=D7=9B=D7=AA=20=D7=94=D7=A8=D7=A9=D7=9E=D7=94=20TCMP.docx?="
then the header gets decoded correctly.
Can anyone tell me if I'm missing something here? Is there a new extension to RFC2047 that allows for these headers, or are they just completely wrong?
As mentioned by @alex-k, the name*= syntax is defined in RFC 2231, which was written after RFC 2047.
But to answer the question as asked, no. Neither set of headers is RFC2047 compliant.
The *= syntax was not in existence when RFC2047 was written, so the original ones do not conform.
The second set, with MIME encoded words, is invalid because it breaks the rules about where MIME encoded words are allowed according to section 5 of RFC2047, specifically both of these rules:
+ An 'encoded-word' MUST NOT appear within a 'quoted-string'.
+ An 'encoded-word' MUST NOT be used in parameter of a MIME
Content-Type or Content-Disposition field, or in any structured
field body except within a 'comment' or 'phrase'.
(Those rules are not consecutive in the RFC.)
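For comparison, here is a minimal sketch of how the RFC 2231 form from the question can be read with Python's standard email package. The message is a cut-down, hypothetical one that keeps only the Content-Disposition header from the question. This is one way to decode it, not a claim about how any particular mail client does it.

```python
import email
import urllib.parse

# Hypothetical minimal message; only the Content-Disposition header is taken
# from the question, the Content-Type is a placeholder.
raw = (
    "Content-Type: application/octet-stream\n"
    "Content-Disposition: attachment;\n"
    " filename*=UTF-8''%D7%A2%D7%A8%D7%9B%D7%AA%20%D7%94%D7%A8%D7%A9%D7%9E%D7%94%20TCMP.docx\n"
    "\n"
)

msg = email.message_from_string(raw)

# get_filename() handles the charset''percent-encoded form of RFC 2231
filename = msg.get_filename()

# Cross-check: percent-decoding the value part as UTF-8 by hand
expected = urllib.parse.unquote(
    "%D7%A2%D7%A8%D7%9B%D7%AA%20%D7%94%D7%A8%D7%A9%D7%9E%D7%94%20TCMP.docx"
)
print(filename == expected)
```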
I am trying to decode a message which doesn't completely conform to the Quoted Printable String idea.
One of the snippets, shown below, has an = where there should be an =3D, and this occurs in a number of places. In fact, there are two offences here:
------=_Part_7575500_2105086112.1449628640342
Content-Type: text/html; charset="UTF-8"
I'm decoding it as follows:
qpr := quotedprintable.NewReader(msg.Body)
cleanBody, err := ioutil.ReadAll(qpr)
The resulting error (complaining about the _ after the first =) is:
quotedprintable: invalid hex byte 0x5f
How can I get this to work, please? Thank you.
You don't just have quoted-printable data, it's part of a MIME multipart message. The =_ pattern is specifically used because it can never occur in a quoted-printable message.
Use a multipart.Reader to get the contents of each part.
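To illustrate the idea (split on the boundary first, then decode each part according to its own Content-Transfer-Encoding), here is a rough sketch using Python's email package rather than Go's multipart.Reader. The message is a made-up sample: the boundary is taken from the question, while the HTML body and the quoted-printable header on the part are assumptions.

```python
import email

# Hypothetical sample; boundary copied from the question, body invented.
raw = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/alternative;\n'
    ' boundary="----=_Part_7575500_2105086112.1449628640342"\n'
    '\n'
    '------=_Part_7575500_2105086112.1449628640342\n'
    'Content-Type: text/html; charset="UTF-8"\n'
    'Content-Transfer-Encoding: quoted-printable\n'
    '\n'
    '<p class=3D"x">caf=C3=A9</p>\n'
    '------=_Part_7575500_2105086112.1449628640342--\n'
)

msg = email.message_from_string(raw)

bodies = []
for part in msg.walk():
    if part.is_multipart():
        continue  # the container itself carries no quoted-printable data
    # decode=True undoes the part's Content-Transfer-Encoding
    payload = part.get_payload(decode=True)
    charset = part.get_content_charset() or "utf-8"
    bodies.append(payload.decode(charset))

print(bodies)
```

The boundary lines never reach the quoted-printable decoder here, which is why the "=_" sequence stops being a problem.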
I am new to MIME, and I don't know if the following situation is valid:
Consider two nested MIME messages: the top-level message has Content-Transfer-Encoding: 7bit
The body of the top-level message is a nested MIME message that has Content-Transfer-Encoding: binary. The body of the internal message has lines that end in LF only, rather than CRLF.
I think this message is invalid, because the rules for 7bit say that LF by itself is not valid. However, a colleague is arguing that this message is valid, because the Content-Transfer-Encoding of the inner message is binary, which doesn't have any restrictions around CR LF.
My argument is that the entire body of the top-level message needs to conform to its encoding (7bit), regardless of the Content-Transfer-Encoding of any nested messages.
I've searched the web and tried to find the answer in the MIME spec, but was not able to find anything that seemed to address this particular situation.
Found an answer in section 6.4 of RFC 2045:
It should also be noted that, by definition, if a composite entity has
a transfer-encoding value such as "7bit", but one of the enclosed
entities has a less restrictive value such as "8bit", then either the
outer "7bit" labelling is in error, because 8bit data are included, or
the inner "8bit" labelling placed an unnecessarily high demand on the
transport system because the actual included data were actually
7bit-safe.
So the message in my example is invalid.
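To make the rule concrete, here is a small sketch of a checker for the "7bit" constraints as I read them in RFC 2045 section 2.7: no NUL octets, no octets above 127, CR and LF only as CRLF pairs, and at most 998 octets per line. The function name is_7bit_safe is made up for this example.

```python
def is_7bit_safe(body: bytes) -> bool:
    """Return True if body satisfies the 7bit constraints sketched above."""
    # No NULs and no octets with the high bit set
    if any(b == 0 or b > 127 for b in body):
        return False
    for line in body.split(b"\r\n"):
        # Any CR or LF left after splitting on CRLF is a bare one
        if b"\r" in line or b"\n" in line:
            return False
        # Lines are limited to 998 octets, excluding the CRLF
        if len(line) > 998:
            return False
    return True


print(is_7bit_safe(b"line one\r\nline two\r\n"))
print(is_7bit_safe(b"line one\nline two\n"))
```

Applied to the whole body of the top-level entity, a check like this fails on the LF-only lines of the nested message, whatever the inner part declares.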
I'm working on an e-mail app for fun and practice in Ruby and one of the mails has this subject:
=?UTF-8?B?4p22IEFuZHJvaWQgc3RpY2sgbWsgODA5aXYgKyB1c2IyZXRoZXJuZXQgYWRh?=\r\n
=?UTF-8?B?cHRlciAtNDYlIOKdtyBKb3NlcGggSm9zZXBoIGtldWtlbmNhcnJvdXNlbCAt?=\r\n
=?UTF-8?B?NTUlIOKduCA0IENlcnJ1dGkgYm94ZXJzaG9ydHMgLTcxJSDinbkgQXJub3Zh?=\r\n
=?UTF-8?B?IDkwIEc0IHRhYmxldCAtNDIl?=
I found out that I'm looking at Base64 strings, and that the parts between =?UTF-8?B? and ?= need to be decoded from Base64 to UTF-8.
Can someone explain how I need to decode a string like this in Ruby?
Try the Base64 module of ruby-1.9 stdlib, see example:
require "base64"
enc = Base64.encode64('Send reinforcements')
# -> "U2VuZCByZWluZm9yY2VtZW50cw==\n"
plain = Base64.decode64(enc)
# -> "Send reinforcements"
The =?UTF-8?B? prefix states the charset (UTF-8) and the encoding (B for Base64) of the original string, so it has to be present in the email header for the text to be decoded correctly. Header text that is not wrapped in such an encoded-word is supposed to be plain US-ASCII.
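For reference, the whole procedure (strip the =?charset?B? wrapper, Base64-decode each encoded-word, join the pieces, interpret the bytes in the stated charset) looks like this in Python's standard library. This is only a sketch of the steps, not Ruby code; the subject is the one from the question with the folded lines joined by CRLF plus a space.

```python
from email.header import decode_header, make_header

# Subject from the question; continuation lines start with a space (folding)
raw_subject = (
    "=?UTF-8?B?4p22IEFuZHJvaWQgc3RpY2sgbWsgODA5aXYgKyB1c2IyZXRoZXJuZXQgYWRh?=\r\n"
    " =?UTF-8?B?cHRlciAtNDYlIOKdtyBKb3NlcGggSm9zZXBoIGtldWtlbmNhcnJvdXNlbCAt?=\r\n"
    " =?UTF-8?B?NTUlIOKduCA0IENlcnJ1dGkgYm94ZXJzaG9ydHMgLTcxJSDinbkgQXJub3Zh?=\r\n"
    " =?UTF-8?B?IDkwIEc0IHRhYmxldCAtNDIl?="
)

# decode_header returns (bytes, charset) pairs; make_header joins them and
# decodes the bytes with the charset given in each encoded-word.
subject = str(make_header(decode_header(raw_subject)))
print(subject)
```

The whitespace between adjacent encoded-words is dropped, so words split across lines (such as "adapter") come back in one piece.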
When encoding a picture, say, into a MIME base64 string, is there a standard way of also including its filename, or at least a suggested filename?
Content-Disposition: attachment; filename="picture.jpg". The Content-Type header can also contain a name= attribute although it is not recommended.
I am assuming email, but IIRC the same goes for HTTP.
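A short sketch of what that looks like when a message is built with Python's email package. The image bytes and the filename are placeholders; add_attachment Base64-encodes the data and writes the Content-Disposition header with the filename parameter.

```python
from email.message import EmailMessage

# Placeholder bytes standing in for a real JPEG file
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16

msg = EmailMessage()
msg.set_content("See the attached picture.")
msg.add_attachment(
    fake_jpeg,
    maintype="image",
    subtype="jpeg",
    filename="picture.jpg",  # the suggested filename for the recipient
)

attachment = next(msg.iter_attachments())
print(attachment["Content-Disposition"])
print(attachment.get_filename())
```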