I'm trying to parse email strings with the Ruby mail gem, and I'm having a devil of a time with character encodings. Take the following email:
MIME-Version: 1.0
Sender: foobar#example.com
Received: by 10.142.239.17 with HTTP; Thu, 14 Jun 2012 06:00:18 -0700 (PDT)
Date: Thu, 14 Jun 2012 09:00:18 -0400
Delivered-To: foobar#gmail.com
X-Google-Sender-Auth: MxfFrMybNjBoBt4O4GwAn9cMsko
Message-ID: <CAGErOzF3FV5NvzN3zUpLGPok96SFzK18Z4HerzyYNALnzgMVaA#mail.gmail.com>
Subject: Re: [Lorem Ipsum] Foo updated the forum topic 'Reply by email test'
From: Foo Bar <foo#example.com>
To: Foo <c49964d167e08e7d4a1930e6565f23c258be19a0#foo.example.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
This reply has accents:=A0R=E9sum=E9
>
> --------- Reply Above This Line ------------
>
> Email parsing with accents: R=E9sum=E9
>
> Click here to view this post in your browser
The email body, when properly encoded, should be:
This reply has accents: Résumé
>
> --------- Reply Above This Line ------------
>
> Email parsing with accents: Résumé
>
> Click here to view this post in your browser
However, I can't get the accent marks to come through. Here's what I've tried:
message = Mail.new(email_string)
body = message.body.decoded
That gets me a string that starts like this:
This reply has accents:\xA0R\xE9sum\xE9\r\n>\r\n> --------- Reply Above This Line ------------
Finally, I try this:
body.encoding # => <Encoding:ASCII-8BIT>
body.encode("UTF-8") # => Encoding::UndefinedConversionError: "\xA0" from ASCII-8BIT to UTF-8
Does anyone have any suggestions on how to deal with this? I'm pretty sure it has to do with the "charset=ISO-8859-1" setting in the email, but I'm not sure how to use that, or if there's a way to easily extract that using the mail gem.
After experimenting a bit, I found that this works:
message.body.decoded.force_encoding("ISO-8859-1").encode("UTF-8") # => "This reply has accents: Résumé..."
message.parts.map { |part| part.decoded.force_encoding(part.charset).encode("UTF-8") } # multipart
You can extract the charset from the message like so:
message.charset #=> for simple, non-multipart
message.parts.map { |part| part.charset } #=> for multipart, each part can have its own charset
Be careful with non-multipart messages, as asking the body for its charset can cause trouble:
message.body.charset #=> returns "US-ASCII", which is WRONG!
message.body.decoded.force_encoding(message.body.charset).encode("UTF-8") #=> Conversion error...
message.body.decoded.force_encoding(message.charset).encode("UTF-8") #=> Correct conversion :)
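For anyone who wants to see what is going on underneath the mail gem, the same two steps (undo the quoted-printable transfer encoding, then reinterpret the bytes in the declared charset) can be sketched with the standard library alone. The helper names charset_from and decode_qp_body are made up for this illustration and are not part of the mail gem; a real parser has to handle more Content-Type edge cases than this regexp does.

```ruby
# Hypothetical helpers, plain Ruby, no gems required.

# Pull the charset parameter out of a Content-Type header value.
def charset_from(content_type, default = "US-ASCII")
  content_type[/charset="?([^";\s]+)"?/i, 1] || default
end

# Undo quoted-printable, then transcode from the declared charset to UTF-8.
def decode_qp_body(raw_body, charset)
  bytes = raw_body.unpack("M").first   # raw bytes, tagged ASCII-8BIT
  bytes.force_encoding(charset).encode("UTF-8")
end

charset = charset_from("text/plain; charset=ISO-8859-1")
puts decode_qp_body("This reply has accents:=A0R=E9sum=E9", charset)
```

The key point is that force_encoding only relabels the bytes, while encode actually converts them, which is why calling encode directly on the ASCII-8BIT string fails.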
This didn't work for me, so I thought I'd post the solution I arrived at in case it helps anyone.
Basically, I had to add encoding defaults and tweak the output into sensible strings.
https://stackoverflow.com/a/26604049/2386548
I'm trying to develop my own home email server (with Node.js on the server side, though that isn't important, as I'm trying to understand the principles).
I'm using this documentation to guide me through the DKIM-Signature validation routine, but it involves some complicated steps and I can't figure out where my mistake is. As an example I used an email sent from the Mail.ru server, which should be completely valid. Here are its headers:
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2;
h=References:In-Reply-To:Content-Type:Message-ID:Reply-To:Date:MIME-Version:Subject:To:From; bh=gCWDSCJf58CbaR+wjAV9dydu9JTKkvo1o+0zkj8bNr0=;
b=pheltY+k/mio2x4CFQV8cXZxNiR7oSTkIsWTOZa1CGpEyK8KVSHY07OWSdZ1aFVtuaV32PbI0mNY0yliuqIbYTsnreFUYFM/iVR5PU74QHAe8yp46ydAYRbzLQu8dy+AkFhPtEdb8CAgoZKXgPLc888/Q6MsVAh6iH1L3SZj87Y=;
Received: by f427.i.mail.ru with local (envelope-from <[my name]#mail.ru>)
id 1dbP18-0003I9-L7
for madbr#[domain]; Sat, 29 Jul 2017 13:30:42 +0300
Received: by e.mail.ru with HTTP;
Sat, 29 Jul 2017 13:30:42 +0300
From: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
To: madbr#[domain]
Subject: =?UTF-8?B?UmU6IA==?=
MIME-Version: 1.0
X-Mailer: Mail.Ru Mailer 1.0
Date: Sat, 29 Jul 2017 13:30:42 +0300
Reply-To: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
X-Priority: 3 (Normal)
Message-ID: <1501324242.448202607#f427.i.mail.ru>
Content-Type: multipart/mixed;
boundary="----uEhsLqzDWmmGeA9EZ3XNsqSIGjlgVTmA-NI9QMhpqxNHWLEDT-1501324242"
Authentication-Results: f427.i.mail.ru; auth=pass smtp.auth=[my name]#mail.ru smtp.mailfrom=[my name]#mail.ru
X-7FA49CB5: 0D63561A33F958A58B4AE7CD4FB69874B38CA0D04717BA57612FFEEC28D99E31725E5C173C3A84C325A81A29FB5043FD044813140D6DB928F1C9CF18C8EB2269C4224003CC836476C0CAF46E325F83A50BF2EBBBDD9D6B0F2AF38021CC9F462D574AF45C6390F7469DAA53EE0834AAEE
X-Mailru-Sender: 080178E06F6B3F48806FD386034E228604900381AF51F7DD303A634C9E25199A8DFBC783E67F8C0305D8C6CDFE81985CCFB2E39DA8E91CCEEEC687A792225BA622DF1A08BD40178CA471C22AD050A14893AC9912533B2342AE208404248635DF
X-Mras: OK
X-Spam: undefined
In-Reply-To: <1500037364.788302144#mx47.mail.ru>
References: <1500037364.788302144#mx47.mail.ru>
The validation instructions say:
In hash step 1, the Signer/Verifier MUST hash the message body,
canonicalized using the body canonicalization algorithm specified in
the "c=" tag and then truncated to the length specified in the "l="
tag. That hash value is then converted to base64 form and inserted
into (Signers) or compared to (Verifiers) the "bh=" tag of the DKIM-
Signature header field.
In hash step 2, the Signer/Verifier MUST pass the following to the
hash algorithm in the indicated order.
1. The header fields specified by the "h=" tag, in the order
specified in that tag, and canonicalized using the header
canonicalization algorithm specified in the "c=" tag. Each
header field MUST be terminated with a single CRLF.
2. The DKIM-Signature header field that exists (verifying) or will
be inserted (signing) in the message, with the value of the "b="
tag (including all surrounding whitespace) deleted (i.e., treated
as the empty string), canonicalized using the header
canonicalization algorithm specified in the "c=" tag, and without
a trailing CRLF.
The first step is easy: I took the message body and canonicalized it using
relaxed: function (data) {
return data.replace(/[ \t]+\r\n/g, '\r\n').replace(/[ \t]+/g, ' ').replace(/(?:\r\n){2,}$/g, CONST.CRLF);
}
and created a sha256 hash of it (according to the a= tag). It matched the bh= tag in the DKIM-Signature header, so far so good.
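For comparison, here is how I understand the relaxed body canonicalization and the bh= computation, sketched in Ruby rather than Node.js. The method names relaxed_body and body_hash are invented for this sketch; it assumes the body already uses CRLF line endings and it ignores the optional l= length tag.

```ruby
require "digest"
require "base64"

# Relaxed body canonicalization (hypothetical helper name).
def relaxed_body(body)
  canon = body.gsub(/[ \t]+(?=\r\n|\z)/, "")  # drop whitespace at the end of each line
              .gsub(/[ \t]+/, " ")            # collapse remaining runs to one space
              .sub(/(?:\r\n)*\z/, "")         # drop every empty line at the end of the body
  canon.empty? ? "" : canon + "\r\n"          # a non-empty body ends with exactly one CRLF
end

# Base64 of the SHA-256 digest, i.e. the value to compare against bh=.
def body_hash(body)
  Base64.strict_encode64(Digest::SHA256.digest(relaxed_body(body)))
end

puts body_hash("Hello \t world \r\n\r\n\r\n")
```

Note that all trailing empty lines are removed, not just runs of two or more, and that an empty body stays empty.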
For the next step I performed the following actions:
1) Got all required headers from the message, in the order given in the h= signature tag.
References: <1500037364.788302144#mx47.mail.ru>
In-Reply-To: <1500037364.788302144#mx47.mail.ru>
Content-Type: multipart/mixed;
boundary="----uEhsLqzDWmmGeA9EZ3XNsqSIGjlgVTmA-NI9QMhpqxNHWLEDT-1501324242"
Message-ID: <1501324242.448202607#f427.i.mail.ru>
Reply-To: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
Date: Sat, 29 Jul 2017 13:30:42 +0300
MIME-Version: 1.0
Subject: =?UTF-8?B?UmU6IA==?=
To: madbr#[domain]
From: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
2) Canonicalized them:
references:<1500037364.788302144#mx47.mail.ru>
in-reply-to:<1500037364.788302144#mx47.mail.ru>
content-type:multipart/mixed; boundary="----uEhsLqzDWmmGeA9EZ3XNsqSIGjlgVTmA-NI9QMhpqxNHWLEDT-1501324242"
message-id:<1501324242.448202607#f427.i.mail.ru>
reply-to:=?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
date:Sat, 29 Jul 2017 13:30:42 +0300
mime-version:1.0
subject:=?UTF-8?B?UmU6IA==?=
to:madbr#[domain]
from:=?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
3) Took the DKIM-Signature header, removed the b= tag, and canonicalized it as well (the trailing \r\n was also removed, per the documentation):
dkim-signature:v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2; h=References:In-Reply-To:Content-Type:Message-ID:Reply-To:Date:MIME-Version:Subject:To:From; bh=gCWDSCJf58CbaR+wjAV9dydu9JTKkvo1o+0zkj8bNr0==;
4) Got the public key from the DNS TXT record and wrapped it in -----BEGIN PUBLIC KEY-----...-----END PUBLIC KEY----- for PEM format compatibility.
5) Finally, I used the standard RSA verification function to validate it:
crypto.createVerify('sha256')
.update(header + dkimHeader)
.verify(publicKey, Buffer.from(signature.b, CONST.BASE64));
But it failed, and I don't know which step is to blame.
In the last step I concatenated the headers and the DKIM-Signature, because I don't really understand what "pass the following to the hash algorithm in the indicated order" means. I also tried .update(header).update(dkimHeader), but it made no difference.
Can someone please explain what I'm doing wrong?
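On the "in the indicated order" wording: feeding a hash piece by piece gives the same digest as hashing the concatenation, which is why switching between update(header + dkimHeader) and update(header).update(dkimHeader) could not change the outcome. A small Ruby illustration with placeholder strings (the values are made up, not the real canonicalized headers):

```ruby
require "digest"

headers     = "subject:hello\r\nfrom:someone@example.com\r\n"  # placeholder input
dkim_header = "dkim-signature:v=1; a=rsa-sha256; b="           # placeholder input

# Hash everything at once...
one_shot = Digest::SHA256.hexdigest(headers + dkim_header)
# ...or feed the same bytes in two steps, in the same order.
streamed = Digest::SHA256.new.update(headers).update(dkim_header).hexdigest

puts one_shot == streamed
```

So the failure has to come from the bytes being hashed, not from how they are passed in.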
From Section 3.7, "Computing the Message Hashes", of the RFC:
In hash step 2, the Signer/Verifier MUST pass the following to the
hash algorithm in the indicated order.
1. The header fields specified by the "h=" tag, in the order
specified in that tag, and canonicalized using the header
canonicalization algorithm specified in the "c=" tag. Each
header field MUST be terminated with a single CRLF.
2. The DKIM-Signature header field that exists (verifying) or will
be inserted (signing) in the message, with the value of the "b="
tag (including all surrounding whitespace) deleted (i.e., treated
as the empty string), canonicalized using the header
canonicalization algorithm specified in the "c=" tag, and without
a trailing CRLF.
The important part is "with the value of the "b=" tag (including all surrounding whitespace) deleted": only the value should be deleted, not the complete tag.
So the correct last line of the input is (note the b=; at the end):
dkim-signature:v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2; h=References:In-Reply-To:Content-Type:Message-ID:Reply-To:Date:MIME-Version:Subject:To:From; bh=gCWDSCJf58CbaR+wjAV9dydu9JTKkvo1o+0zkj8bNr0=; b=;
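To make the difference concrete, the relaxed header canonicalization and the emptying of the b= value could look roughly like this in Ruby. Both method names are hypothetical, the sample header is shortened, and the regular expression assumes a well-formed tag list; it is a sketch of the rule quoted above, not a complete DKIM verifier.

```ruby
# Relaxed header canonicalization: lowercase name, unfold, squeeze whitespace.
def relaxed_header(name, value)
  unfolded = value.gsub(/\r\n(?=[ \t])/, "")   # unfold continuation lines
  "#{name.downcase.strip}:#{unfolded.gsub(/[ \t]+/, " ").strip}"
end

# Keep the "b=" tag itself but delete its value (and must not touch "bh=").
def empty_b_value(dkim_value)
  dkim_value.sub(/(\A|;)(\s*b\s*=)[^;]*/) { "#{$1}#{$2}" }
end

raw = "v=1; a=rsa-sha256; bh=abc=;\r\n b=pheltY+k/mio\r\n 2x4CFQ=;"
puts relaxed_header("DKIM-Signature", empty_b_value(raw))
```

The resulting line keeps a trailing "b=;" and is appended to the canonicalized headers without a final CRLF.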
I need to generate an HTTP (GMT) date string in Ruby. This is because of a requirement of an API that I'm consuming.
What is an easy (out of the box) way to generate it?
I found that Ruby comes with a method for the Time class out of the box for this:
Time.now.httpdate # => "Thu, 06 Oct 2011 02:26:12 GMT"
The Time class also supports the following methods:
Time.now.iso8601 # => "2011-10-05T22:26:12-04:00"
Time.now.rfc2822 # => "Wed, 05 Oct 2011 22:26:12 -0400"
Source: http://ruby-doc.org/stdlib-2.0.0/libdoc/time/rdoc/Time.html#class-Time-label-Converting+to+a+String
Does Time.now.getgm work for you?
Time.now.gmt? #=> false
Time.now.getgm.gmt? #=> true
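One thing worth spelling out: httpdate and rfc2822 (and, on older Rubies, iso8601) live in the time standard library, so a plain script needs require "time" first. A small sketch using a fixed timestamp chosen to match the example output above; as far as I can tell httpdate converts to GMT by itself, so calling getgm first is not needed for it.

```ruby
require "time"

# Fixed local time with a -04:00 offset, so the output is reproducible.
t = Time.new(2011, 10, 5, 22, 26, 12, "-04:00")

puts t.httpdate   # rendered in GMT
puts t.iso8601    # keeps the local offset
puts t.rfc2822    # keeps the local offset

# Time.httpdate parses the HTTP format back into a Time.
puts Time.httpdate(t.httpdate) == t
```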
When I send a message with attachment using the Gmail API, the recipient receives the message without the attachment.
What is strange though is that:
1: in the sent folder of the sender, I do see the attachment properly
2: if I send to myself, both messages are fine (in the sent folder and in the inbox)
3: if I use Gmail SMTP with the same raw message, it works fine
4: if I use a third-party SMTP server with the same raw message, it works fine.
Points 1 and 2 are especially puzzling.
Here is the source of the original message in the sent folder:
Received: from 13936824666 named unknown by gmailapi.google.com with HTTPREST; Wed, 25 Jan 2017 18:44:30 -0500
Date: Wed, 25 Jan 2017 18:44:30 -0500
From: Jeremy Chatelaine <source#gmail.com>
To: Jeremy <target#domain.com>
Message-Id: <CABX8Avad0vTtu8=jotRD5HM1r0My-ZKeV7RFRo0TmTxf5PNd0g#mail.gmail.com>
Subject: Export
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="--==_mimepart_5889385be5066_a133fd0f785e20837629"; charset=UTF-8
Content-Transfer-Encoding: 7bit
----==_mimepart_5889385be5066_a133fd0f785e20837629
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Please find attached the message
----==_mimepart_5889385be5066_a133fd0f785e20837629
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit
Please find attached the message
----==_mimepart_5889385be5066_a133fd0f785e20837629
Content-Type: text/csv; charset=UTF-8; filename=export.csv
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename=export.csv
Content-ID: <5889385be6fd7_a133fd0f785e20837735#jeremy.mail>
Some text...
----==_mimepart_5889385be5066_a133fd0f785e20837629--
Here is the source of the original message in the recipient inbox folder:
Delivered-To: target#domain.com
Received: by 10.55.110.193 with SMTP id j184csp1999707qkc;
Wed, 25 Jan 2017 16:30:42 -0800 (PST)
X-Received: by 10.107.34.213 with SMTP id i204mr540101ioi.203.1485390642385;
Wed, 25 Jan 2017 16:30:42 -0800 (PST)
Return-Path: <source#domain.com>
Received: from mail-it0-x22b.google.com (mail-it0-x22b.google.com. [2607:f8b0:4001:c0b::22b])
by mx.google.com with ESMTPS id 88si336719ioq.54.2017.01.25.16.30.42
for <target#domain.com>
(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
Wed, 25 Jan 2017 16:30:42 -0800 (PST)
Received-SPF: pass (google.com: domain of source#domain.com designates 2607:f8b0:4001:c0b::22b as permitted sender) client-ip=2607:f8b0:4001:c0b::22b;
Authentication-Results: mx.google.com;
dkim=pass header.i=#domain.com;
spf=pass (google.com: domain of source#domain.com designates 2607:f8b0:4001:c0b::22b as permitted sender) smtp.mailfrom=source#domain.com
Received: by mail-it0-x22b.google.com with SMTP id 203so119411500ith.0
for <target#domain.com>; Wed, 25 Jan 2017 16:30:42 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=domain.com; s=domain;
h=from:mime-version:date:message-id:subject:to;
bh=wx26/V0bJk9VItDp3TAvKl28UAn7IRQq4NITJZDM+Co=;
b=L3KTzPTCoIJUfAacuJy+PE8jHnY9iwGuXUWSpZzneRs5bvMysigSMyPGn1YicyIvQ6
d/LvbEJPlsu+S0zElhIVPITjAmXKDKNIKwLQDHpkcKnnI3btBUrENN923fMtS1fDdHyV
3At0QenKrb34uQqYYoHtX2WU4nyYrISbYKL62=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=1e100.net; s=20161025;
h=x-gm-message-state:from:mime-version:date:message-id:subject:to;
bh=wx26/V0bJk9VItDp3TAvKl28UAn7IRQq4NITJZDM+Co=;
b=tGPSHjI3iRzjrI0jGHVX76uTw7OXYC4B2bP0qQKgQotBWru+Pn5Ci9A1Qop9oGN1Ys
fnxCgLOLG8ZJU165ODBNX1DGjPa8ud9SWg18FTsxIjNw9qTr1yJqbWr0LToJi7HdQUr8
7Aaiqil7PbPUf5SdxLCqwBNf660Rn9Sd/ADZeT1Bc2+iYQcizjiK/rOPPX+X1ZndvqxP
Ok3Ac2yyIWxi+m4xaPEztcF4JXFZDlJWFdclUDv4s5Jdc0eb1HmB5d2r1qroGLo5MTjd
d7jO1nRsKTO5I9I69p9AgC+LpDiWBxgzZMBVsU6vVpeZ03/pCroyk9DHDUAjn3ijtlFh
O2vw==
X-Gm-Message-State: AIkVDXJo2tPJlkHkthgEjnxYp7vz5Rfpn91pWCS7zEurkSiDhJtyzLpUoSDORq37K/7ATCRSyLAypKIYrdYwEiL2
X-Received: by 10.36.84.148 with SMTP id t142mr20263701ita.90.1485390641934; Wed, 25 Jan 2017 16:30:41 -0800 (PST)
Received: from 13936824667 named unknown by gmailapi.google.com with HTTPREST; Wed, 25 Jan 2017 19:30:41 -0500
From: Jeremy Chatelaine <source#domain.com>
Mime-Version: 1.0
Date: Wed, 25 Jan 2017 19:30:41 -0500
Message-ID: <CABK5xHonVstaJHWHfvBJFF1BK5Y0B8NJsAJTPFokPNAUcawhGA#mail.gmail.com>
Subject: Export
To: Jeremy <target#domain.com>
Content-Type: multipart/alternative; boundary=001a1143a6cc90c9040546f4753f
--001a1143a6cc90c9040546f4753f
Content-Type: text/plain; charset=UTF-8
Please find attached your requested export
--001a1143a6cc90c9040546f4753f
Content-Type: text/html; charset=UTF-8
Please find attached your requested export
--001a1143a6cc90c9040546f4753f--
As you can see, the mime part where I had my text attachment vanished.
Here is how the message is produced (cut for clarity)
msg = Mail.new
html_part = Mail::Part.new do
content_type 'text/html; charset=UTF-8'
body html_body
end
msg.html_part = html_part
new_text = plan_text
text_part = Mail::Part.new do
body new_text
end
msg.text_part = text_part
file_paths.each do |file_path|
msg.add_file(file_path)
# I also tried like that, same result
#open(file_path) do |file|
# msg.attachments[file_path] = file.read
#end
end
raw_message = msg.to_s
Here is how I send it with the Gmail API:
client = Google::APIClient.new(:application_name => "app", :application_version => "1")
client.authorization.client_id = "someverylongnumbers.apps.googleusercontent.com"
client.authorization.client_secret = "morerandomletters"
client.authorization.access_token = token
client.authorization.scope = [
"https://www.googleapis.com/auth/gmail.modify"
]
gmail_api = client.discovered_api('gmail', 'v1') # https://www.googleapis.com/auth/gmail.modify
result = client.execute(
:api_method => gmail_api.users.messages.to_h['gmail.users.messages.send'],
:parameters => {
'userId' => "me"
},
:body_object => {
'raw' => Base64.urlsafe_encode64(raw_message)
}
)
What is wrong with this?
Let's say I want to compose an email header with a UTF-8, quoted-printable-encoded subject, which is "test — UNIX-утилита для проверки типа файла и сравнения значений". I can confirm the bytes of the characters using:
$ echo "UNIX-утилита ..." | perl utfinfo.pl
Got 16 uchars
Char: 'U' u: 85 [0x0055] b: 85 [0x55] n: LATIN CAPITAL LETTER U [Basic Latin]
Char: 'N' u: 78 [0x004E] b: 78 [0x4E] n: LATIN CAPITAL LETTER N [Basic Latin]
Char: 'I' u: 73 [0x0049] b: 73 [0x49] n: LATIN CAPITAL LETTER I [Basic Latin]
Char: 'X' u: 88 [0x0058] b: 88 [0x58] n: LATIN CAPITAL LETTER X [Basic Latin]
Char: '-' u: 45 [0x002D] b: 45 [0x2D] n: HYPHEN-MINUS [Basic Latin]
Char: 'у' u: 1091 [0x0443] b: 209,131 [0xD1,0x83] n: CYRILLIC SMALL LETTER U [Cyrillic]
Char: 'т' u: 1090 [0x0442] b: 209,130 [0xD1,0x82] n: CYRILLIC SMALL LETTER TE [Cyrillic]
Char: 'и' u: 1080 [0x0438] b: 208,184 [0xD0,0xB8] n: CYRILLIC SMALL LETTER I [Cyrillic]
...
So, I'm trying to get the UTF-8, quoted printable representation of this. For instance, using Python's quopri:
$ python -c 'import quopri; a="test — UNIX-утилита для проверки типа файла и сравнения значений"; print(quopri.encodestring(a));'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
... or PHP's quoted_printable_encode, which gives the exact same output:
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; echo quoted_printable_encode($a)."\n";'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
So, to test, I make a text file called test.eml, and try to simply wrap this output in the =?UTF-8?Q? ... ?= tags for the Subject: line, making sure that line endings are CRLF \r\n:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... but if I open this in Thunderbird, I get a corrupt output:
I've read somewhere that multi-line long header fields are covered by RFC 822, "LONG HEADER FIELDS", and that basically each continuation line should start with a space. So I indent the continuation lines by one space:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
 =D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
 =D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
 =D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... and I get a slightly different subject in Thunderbird, but still corrupt:
Now, if I delete =\r\n from the first three continuation lines, so the subject is all in one line:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... then actually Thunderbird shows the subject line well:
... but then my header is in conflict with the recommendation from RFC 2822 - 2.1.1. Line Length Limits which says "Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."; specifically the line limit of 78 characters.
So, how can I obtain the proper multi-line quoted-printable representation of a UTF-8 Subject header string, so I can use it in an .eml file split at 78 characters - and have Thunderbird correctly read it?
When I ask python to create an email with that subject, here's what it does:
$ python
Python 2.7.9 (default, Mar 1 2015, 18:22:53)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.message import Message
>>> from email.header import Header
>>> msg = Message()
>>> import quopri
>>> h = Header(quopri.decodestring('test =E2=80=94 UNIX-'
'=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F'
'=D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8'
'=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8'
'=D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F '
'=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?='), 'UTF-8')
>>> msg['Subject'] = h
>>> print msg.as_string()
Subject: =?utf-8?b?dGVzdCDigJQgVU5JWC3Rg9GC0LjQu9C40YLQsCDQtNC70Y8g0L/RgNC+0LI=?=
 =?utf-8?b?0LXRgNC60Lgg0YLQuNC/0LAg0YTQsNC50LvQsCDQuCDRgdGA0LDQstC90LU=?=
 =?utf-8?b?0L3QuNGPINC30L3QsNGH0LXQvdC40Lk/?=
>>>
So it uses base64 encoding instead of quoted-printable, but my strong suspicion, based on this, is that the answer is that each line must begin and end the escape.
Indeed:
>>> import email
>>> s = '''Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0?=
... =?UTF-8?Q?=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0?=
... =?UTF-8?Q?=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
... =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0?=
... =?UTF-8?Q?=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1?=
... =?UTF-8?Q?=87=D0=B5=D0=BD=D0=B8=D0=B9?=
...
... Hello.
... '''
>>> e = email.message_from_string(s.replace('\n', '\r\n'))
>>> email.header.decode_header(e['Subject'])
[('test \xe2\x80\x94 UNIX-\xd1\x83\xd1\x82\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x82\xd0\xb0 \xd0\xb4\xd0\xbb\xd1\x8f \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb5\xd1\x80\xd0\xba\xd0\xb8 \xd1\x82\xd0\xb8\xd0\xbf\xd0\xb0 \xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb\xd0\xb0 \xd0\xb8 \xd1\x81\xd1\x80\xd0\xb0\xd0\xb2\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xb7\xd0\xbd\xd0\xb0\xd1\x87\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb9', 'utf-8')]
>>> decoded = email.header.decode_header(e['Subject'])
>>> print decoded[0][0].decode(decoded[0][1])
test — UNIX-утилита для проверки типа файла и сравнения значений
EDIT: However, even with the above added in .eml file, Thunderbird fails again:
... but this time it indicates it got some of the chars correct. And indeed, breakage occurs where lines are broken "in the middle of a character"; say if for the sequence 0xD1, 0x83 for the character у, the =D1?= ends one line, and the Q?=83 starts the other, then Thunderbird cannot parse that. So after manual rearrangement, this snippet can be obtained:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8?=
 =?UTF-8?Q?=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
 =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0?=
 =?UTF-8?Q?=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0?=
 =?UTF-8?Q?=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... which opens fine as an .eml message in Thunderbird (same as this image from OP).
EDIT2: Also PHP seems to do it right, with this invocation of mb_encode_mimeheader (directly pasteable in .eml file):
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; mb_internal_encoding("UTF-8"); echo mb_encode_mimeheader($a, "UTF-8", "Q")."\n";'
test =?UTF-8?Q?=E2=80=94=20UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82?=
 =?UTF-8?Q?=D0=B0=20=D0=B4=D0=BB=D1=8F=20=D0=BF=D1=80=D0=BE=D0=B2=D0=B5?=
 =?UTF-8?Q?=D1=80=D0=BA=D0=B8=20=D1=82=D0=B8=D0=BF=D0=B0=20=D1=84=D0=B0?=
 =?UTF-8?Q?=D0=B9=D0=BB=D0=B0=20=D0=B8=20=D1=81=D1=80=D0=B0=D0=B2=D0=BD?=
 =?UTF-8?Q?=D0=B5=D0=BD=D0=B8=D1=8F=20=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD?=
 =?UTF-8?Q?=D0=B8=D0=B9?=
The problem with your test.eml is that your RFC 2047 encoding is broken. The Q encoding is based on quoted-printable, but is not entirely the same. In particular, each space needs to be encoded as either =20 or _, and you cannot escape line breaks with a final =.
Fundamentally, each =?...?= sequence needs to be a single, unambiguous token per RFC 822. You can either break up your input into multiple such tokens and leave the spaces unencoded, or encode the spaces. Note that spaces between two such tokens are not significant, so encoding the spaces into the sequences makes more sense.
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test_=E2=80=94_UNIX-=D1=83=D1=82=D0=B8=D0=BB?=
 =?UTF-8?Q?=D0=B8=D1=82=D0=B0_=D0=B4=D0=BB=D1=8F_=D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8_=D1=82=D0=B8=D0=BF?=
 =?UTF-8?Q?=D0=B0_=D1=84=D0=B0=D0=B9=D0=BB=D0=B0_=D0=B8_=D1=81?=
 =?UTF-8?Q?=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F_?=
 =?UTF-8?Q?=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
Of course, with this exposition, quoted-printable isn't really legible at all, and probably takes much more space than base64, so you might prefer to go with the B encoding in the end after all.
Unless you are writing a MIME library yourself, the simple solution is to not care, and let the library piece this together for you. PHP is more problematic (the standard library lacks this functionality, and the third-party libraries are somewhat uneven--find one you trust, and stick to it), but in Python, simply pass in a Unicode string, and the email library will encode it if necessary.
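If you do end up generating the header by hand, the rules above (spaces as _, no soft line breaks, every encoded-word complete on its own, no multi-byte character split across two words) can be sketched in a few lines. This is in Ruby purely for illustration; q_encode_char and encode_header are made-up names, and the 76-column budget reflects my reading of the RFC 2047 limits (an encoded-word of at most 75 characters on a line of at most 76).

```ruby
# Hypothetical helpers illustrating RFC 2047 "Q" encoding; not a library API.

def q_encode_char(ch)
  return "_" if ch == " "
  return ch if ch =~ /\A[A-Za-z0-9!*+\-\/]\z/        # safe to leave as-is
  ch.bytes.map { |b| format("=%02X", b) }.join       # everything else, byte by byte
end

def encode_header(name, text, charset = "UTF-8")
  prefix = "=?#{charset}?Q?"
  suffix = "?="
  overhead = prefix.length + suffix.length
  limit = 76 - "#{name}: ".length - overhead   # first line also carries the field name
  words = []
  current = "".dup
  text.each_char do |ch|
    enc = q_encode_char(ch)                    # a whole character, never half of one
    if current.length + enc.length > limit
      words << current
      current = "".dup
      limit = 75 - overhead                    # continuation line: one space + word
    end
    current << enc
  end
  words << current unless current.empty?
  # Fold with CRLF plus one space so each encoded-word sits on its own line.
  "#{name}: " + words.map { |w| prefix + w + suffix }.join("\r\n ")
end

puts encode_header("Subject", "test — UNIX-утилита для проверки типа файла и сравнения значений")
```

Because the loop works on characters rather than bytes, a sequence such as =D1=83 always lands inside a single encoded-word, which is the property Thunderbird turned out to care about.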
How do I get the word after a particular word in a Ruby string?
For example:
From:Ysxrb<abc#gmail.com>\nTo: <xyzn#gmail.com>Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>
I just want to get:
Ysxrb<abc#gmail.com
xyzabc
I think your question/requirement may need a bit of refinement.
You state: "How to get the word after a particular word in a ruby string?" and your example text is this : "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
and then you finally say that what you really want out of this string are the following words:
"'Ysxrb' and 'xyzabc'".
Will you always be parsing email text, which is what this appears to be? If so, then there are some more specific approaches you could take. For instance, in this example you could do something like this:
eml = "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
tokens = eml.split(/[\s\:]/)
which would yield this:
["From", "Ysxrb", "To", "", "Subject", "", "xyzabc", "Date", "", "Tue,", "19", "Jun", "2012", "03", "26", "56", "-0700", "Message-ID", "", "<9D.A1.02635.ABB40EF4#ecout1>"]
At this point, if the words following "From" and "Subject" are what you're after, you could simply get the first non-blank array element after each one, like this:
tokens[tokens.find_index("From") + 1] # => "Ysxrb"
tokens[tokens.find_index("Subject") + 2] # => "xyzabc" (+ 2 is needed because the ": " after Subject yields an empty token)
You can use a regular expression; try this in an irb console:
string = "From:Ysxrb<abc#gmail.com>\nTo: <xyzn#gmail.com>Subject:"
/From:(.+)\n/.match string
$1
$1 holds the backreference captured by the parentheses in the regular expression.
You could try a regexp, here's an example:
>> s = "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
=> "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
>> m, w1, w2 = s.match(/^From:(\w*)\W+.*Subject: (\w*)/).to_a
=> ["From:Ysxrb\nTo: Subject: xyzabc", "Ysxrb", "xyzabc"]
>> w1
=> "Ysxrb"
>> w2
=> "xyzabc"
To find a good regexp for your requirements, you may use Rubular, a Ruby regular expression editor.
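Putting the regexp suggestions together for the original string (with the angle-bracket addresses intact), here is a short sketch using String#[] with a capture group. In Ruby, ^ and $ match at line boundaries and . does not cross a newline, so each pattern stays within its own line. The variable names are just for this example.

```ruby
raw = "From:Ysxrb<abc#gmail.com>\nTo: <xyzn#gmail.com>Subject: xyzabc\n" \
      "Date: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"

from    = raw[/^From:\s*(.*)$/, 1]     # everything after "From:" up to the line end
subject = raw[/Subject:\s*(.*)$/, 1]   # not anchored: "Subject:" sits mid-line here

puts from
puts subject
```

String#[] returns nil when the pattern does not match, so a missing header does not raise.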