I'm trying to develop my own home email server (with Node.js on the server side, though that's not important, as I'm trying to figure out the principles).
I'm using this documentation to guide me through the DKIM-Signature validation routine, but it requires some complicated steps and I can't figure out where my mistake is. As a test email I used one sent from a Mail.ru server, so it should be completely valid. Here are its headers:
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2;
h=References:In-Reply-To:Content-Type:Message-ID:Reply-To:Date:MIME-Version:Subject:To:From; bh=gCWDSCJf58CbaR+wjAV9dydu9JTKkvo1o+0zkj8bNr0=;
b=pheltY+k/mio2x4CFQV8cXZxNiR7oSTkIsWTOZa1CGpEyK8KVSHY07OWSdZ1aFVtuaV32PbI0mNY0yliuqIbYTsnreFUYFM/iVR5PU74QHAe8yp46ydAYRbzLQu8dy+AkFhPtEdb8CAgoZKXgPLc888/Q6MsVAh6iH1L3SZj87Y=;
Received: by f427.i.mail.ru with local (envelope-from <[my name]#mail.ru>)
id 1dbP18-0003I9-L7
for madbr#[domain]; Sat, 29 Jul 2017 13:30:42 +0300
Received: by e.mail.ru with HTTP;
Sat, 29 Jul 2017 13:30:42 +0300
From: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
To: madbr#[domain]
Subject: =?UTF-8?B?UmU6IA==?=
MIME-Version: 1.0
X-Mailer: Mail.Ru Mailer 1.0
Date: Sat, 29 Jul 2017 13:30:42 +0300
Reply-To: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
X-Priority: 3 (Normal)
Message-ID: <1501324242.448202607#f427.i.mail.ru>
Content-Type: multipart/mixed;
boundary="----uEhsLqzDWmmGeA9EZ3XNsqSIGjlgVTmA-NI9QMhpqxNHWLEDT-1501324242"
Authentication-Results: f427.i.mail.ru; auth=pass smtp.auth=[my name]#mail.ru smtp.mailfrom=[my name]#mail.ru
X-7FA49CB5: 0D63561A33F958A58B4AE7CD4FB69874B38CA0D04717BA57612FFEEC28D99E31725E5C173C3A84C325A81A29FB5043FD044813140D6DB928F1C9CF18C8EB2269C4224003CC836476C0CAF46E325F83A50BF2EBBBDD9D6B0F2AF38021CC9F462D574AF45C6390F7469DAA53EE0834AAEE
X-Mailru-Sender: 080178E06F6B3F48806FD386034E228604900381AF51F7DD303A634C9E25199A8DFBC783E67F8C0305D8C6CDFE81985CCFB2E39DA8E91CCEEEC687A792225BA622DF1A08BD40178CA471C22AD050A14893AC9912533B2342AE208404248635DF
X-Mras: OK
X-Spam: undefined
In-Reply-To: <1500037364.788302144#mx47.mail.ru>
References: <1500037364.788302144#mx47.mail.ru>
The validation instructions say:
In hash step 1, the Signer/Verifier MUST hash the message body,
canonicalized using the body canonicalization algorithm specified in
the "c=" tag and then truncated to the length specified in the "l="
tag. That hash value is then converted to base64 form and inserted
into (Signers) or compared to (Verifiers) the "bh=" tag of the DKIM-
Signature header field.
In hash step 2, the Signer/Verifier MUST pass the following to the
hash algorithm in the indicated order.
1. The header fields specified by the "h=" tag, in the order
specified in that tag, and canonicalized using the header
canonicalization algorithm specified in the "c=" tag. Each
header field MUST be terminated with a single CRLF.
2. The DKIM-Signature header field that exists (verifying) or will
be inserted (signing) in the message, with the value of the "b="
tag (including all surrounding whitespace) deleted (i.e., treated
as the empty string), canonicalized using the header
canonicalization algorithm specified in the "c=" tag, and without
a trailing CRLF.
The first step is easy: I took the message body, canonicalized it using
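relaxed: function (data) {
    // Strip trailing whitespace on each line, collapse whitespace runs,
    // then reduce a run of trailing CRLFs to a single CRLF.
    return data.replace(/[ \t]+\r\n/g, '\r\n').replace(/[ \t]+/g, ' ').replace(/(\r\n){2,}$/, CONST.CRLF);
}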
and created a sha256 hash of it (according to the a= tag). It matched the bh= tag in the DKIM-Signature header, so far so good.
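For comparison, here is a minimal Python sketch of hash step 1 with relaxed body canonicalization, as I read RFC 6376 section 3.4.4. The function names relaxed_body and body_hash are made up for this example, and the sketch ignores the optional l= tag.

```python
import base64
import hashlib
import re


def relaxed_body(body: bytes) -> bytes:
    """Relaxed body canonicalization (sketch; the l= tag is not handled)."""
    lines = body.split(b"\r\n")
    # Collapse runs of spaces/tabs to one space and drop whitespace at line ends.
    lines = [re.sub(rb"[ \t]+", b" ", line).rstrip(b" ") for line in lines]
    # Drop all empty lines at the end of the body.
    while lines and lines[-1] == b"":
        lines.pop()
    # A non-empty body ends with exactly one CRLF; an empty body stays empty.
    return b"".join(line + b"\r\n" for line in lines)


def body_hash(body: bytes) -> str:
    """Base64 SHA-256 of the canonicalized body, to compare with the bh= tag."""
    digest = hashlib.sha256(relaxed_body(body)).digest()
    return base64.b64encode(digest).decode("ascii")


print(relaxed_body(b"Hello  \t world \r\n\r\n\r\n"))
print(body_hash(b"Hello  \t world \r\n\r\n\r\n"))
```

The value returned by body_hash is what would be compared against bh= for a message signed with a=rsa-sha256 and relaxed body canonicalization.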
For the next step I performed the following actions:
1) Got all the required headers from the message, in the order given in the h= signature tag.
References: <1500037364.788302144#mx47.mail.ru>
In-Reply-To: <1500037364.788302144#mx47.mail.ru>
Content-Type: multipart/mixed;
boundary="----uEhsLqzDWmmGeA9EZ3XNsqSIGjlgVTmA-NI9QMhpqxNHWLEDT-1501324242"
Message-ID: <1501324242.448202607#f427.i.mail.ru>
Reply-To: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
Date: Sat, 29 Jul 2017 13:30:42 +0300
MIME-Version: 1.0
Subject: =?UTF-8?B?UmU6IA==?=
To: madbr#[domain]
From: =?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
2) Canonicalized them:
references:<1500037364.788302144#mx47.mail.ru>
in-reply-to:<1500037364.788302144#mx47.mail.ru>
content-type:multipart/mixed; boundary="----uEhsLqzDWmmGeA9EZ3XNsqSIGjlgVTmA-NI9QMhpqxNHWLEDT-1501324242"
message-id:<1501324242.448202607#f427.i.mail.ru>
reply-to:=?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
date:Sat, 29 Jul 2017 13:30:42 +0300
mime-version:1.0
subject:=?UTF-8?B?UmU6IA==?=
to:madbr#[domain]
from:=?UTF-8?B?0KHQtdGA0LPQtdC5?= <[my name]#mail.ru>
3) Got the DKIM-Signature header, removed the b= tag and canonicalized it as well (the trailing \r\n was also removed, according to the documentation):
dkim-signature:v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2; h=References:In-Reply-To:Content-Type:Message-ID:Reply-To:Date:MIME-Version:Subject:To:From; bh=gCWDSCJf58CbaR+wjAV9dydu9JTKkvo1o+0zkj8bNr0==;
4) Got the public key from the DNS TXT record and wrapped it in -----BEGIN PUBLIC KEY-----...-----END PUBLIC KEY----- for PEM format compatibility.
5) Finally, I used the standard RSA verification function to validate it:
crypto.createVerify('sha256')
.update(header + dkimHeader)
.verify(publicKey, Buffer.from(signature.b, CONST.BASE64));
But it failed, and I don't really know which step to blame.
In the last step I concatenated the headers and the DKIM-Signature, because I don't really understand what "pass the following to the hash algorithm in the indicated order" means. I also tried .update(header).update(dkimHeader), but it made no difference.
Can someone please explain what I am doing wrong?
From section 3.7. Computing the Message Hashes of the RFC:
In hash step 2, the Signer/Verifier MUST pass the following to the
hash algorithm in the indicated order.
The header fields specified by the "h=" tag, in the order
specified in that tag, and canonicalized using the header
canonicalization algorithm specified in the "c=" tag. Each
header field MUST be terminated with a single CRLF.
The DKIM-Signature header field that exists (verifying) or will
be inserted (signing) in the message, with the value of the "b="
tag (including all surrounding whitespace) deleted (i.e., treated
as the empty string), canonicalized using the header
canonicalization algorithm specified in the "c=" tag, and without
a trailing CRLF.
The important part is "with the value of the "b=" tag (including all surrounding whitespace) deleted": only the value should be deleted, not the complete tag.
So the correct last line of the input is (note the b=; at the end):
dkim-signature:v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2; h=References:In-Reply-To:Content-Type:Message-ID:Reply-To:Date:MIME-Version:Subject:To:From; bh=gCWDSCJf58CbaR+wjAV9dydu9JTKkvo1o+0zkj8bNr0=; b=;
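To make hash step 2 concrete, here is a rough Python sketch of how the signed data could be assembled under relaxed header canonicalization. It is only an illustration of my reading of RFC 6376: the helper names (relaxed_header, strip_b_value, signed_data) are invented for this example, the headers are assumed to be already split into (name, value) pairs, and the RSA check itself is left out.

```python
import re


def relaxed_header(name: str, value: str) -> str:
    """Relaxed header canonicalization: lowercase name, unfold, collapse whitespace."""
    value = re.sub(r"\r\n(?=[ \t])", "", value)         # unfold continuation lines
    value = re.sub(r"[ \t]+", " ", value).strip(" ")    # collapse and trim whitespace
    return f"{name.strip().lower()}:{value}"


def strip_b_value(sig_value: str) -> str:
    """Empty the value of the b= tag but keep the tag itself (bh= is untouched)."""
    return re.sub(r"(^|;)(\s*b\s*=)[^;]*", r"\1\2", sig_value)


def signed_data(headers, h_tag: str, sig_value: str) -> str:
    """headers: list of (name, value) pairs in message order, top to bottom."""
    remaining = list(headers)
    parts = []
    for wanted in h_tag.split(":"):
        wanted = wanted.strip().lower()
        # Repeated names in h= consume header instances from the bottom up.
        for i in range(len(remaining) - 1, -1, -1):
            if remaining[i][0].strip().lower() == wanted:
                name, value = remaining.pop(i)
                parts.append(relaxed_header(name, value) + "\r\n")
                break
    # The DKIM-Signature header comes last, with b= emptied and no trailing CRLF.
    parts.append(relaxed_header("DKIM-Signature", strip_b_value(sig_value)))
    return "".join(parts)


example_headers = [("From", " Alice <alice@example.com>"), ("To", " bob@example.com")]
example_sig = "v=1; a=rsa-sha256; c=relaxed/relaxed; h=To:From; bh=abc=;\r\n\tb=QUJD\r\n REVG;"
print(signed_data(example_headers, "To:From", example_sig))
```

The string returned by signed_data is the single input that goes to the signature verifier, which is why .update(header + dkimHeader) and .update(header).update(dkimHeader) behave the same.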
I need to extract specific lines from a raw email that arrives at my webhook.site URL. I figured it would be an XPath extract or some custom action, but my knowledge only goes so far.
I need to extract the Subject: header, which is the one above the Content-Type: text/plain part.
What path and layout should I use to do so?
Thanks
Return-Path: <leadnotification-noreply2#ylopo.com>
Received: from mail-lf1-f53.google.com (mail-lf1-f53.google.com [209.85.167.53])
by inbound-smtp.eu-west-1.amazonaws.com with SMTP id nqoso8q41s14sho14modqhpqbqvvvcqb0e3hea81
for OsirsYlopoPriorityEmail#email.webhook.site;
Wed, 27 Oct 2021 19:21:34 +0000 (UTC)
X-SES-Spam-Verdict: PASS
X-SES-Virus-Verdict: PASS
Received-SPF: pass (spfCheck: domain of ylopo.com designates 209.85.167.53 as permitted sender) client-ip=209.85.167.53; envelope-from=leadnotification-noreply2#ylopo.com; helo=mail-lf1-f53.google.com;
Authentication-Results: amazonses.com;
spf=pass (spfCheck: domain of ylopo.com designates 209.85.167.53 as permitted sender) client-ip=209.85.167.53; envelope-from=leadnotification-noreply2#ylopo.com; helo=mail-lf1-f53.google.com;
dkim=pass header.i=#ylopo.com;
dmarc=pass header.from=ylopo.com;
X-SES-RECEIPT: AEFBQUFBQUFBQUFHdlpaMkNVNS9wd25QYmtqSU9xbGJBVzZJa0tSV1dCNWcwWFFwZFNUS1lweHpxY1A1NlRoSlZyU1NEM0drMlp3Q0Jpd0d2ZG1RUC9VYVRCbWt6UUhMdkFwOUJLS1NGYnFCSzQyVGpQK1loZzU0SkpIcy9pNnQ4aHhScnV2dG9sV2M5b2VnSzY4MU0vMm9JaWx5VDJOcW5WWllPRzhvNkp6VHdJSWNmbmJBd1lvZlF1WHdEQUNFRzkyM0dQQVkxdGgwS0NwUHAzcVo5dit5clgvOXFzdG8xMFJaTVpuMVRLcXUwVDNNMVRod3lWSDB0NStRNjNUcEMrZ2RwZ0dSZFBQcEQ2OXZ2bUFOMkI3UUhJYUNZdG5iM2hqMUZhV1NST0FBbHdwSUM5TDIySk11dWh2alN3VE9BbVhGWjlKditMUzQ9
X-SES-DKIM-SIGNATURE: a=rsa-sha256; q=dns/txt; b=AXzUmXzoB1/c89r/6eJBBuEUriK6kdQMMJRXPnBwlUC5oCKD1Apk+Av3zpg5yxj+4djsbxeCsSPIwkcYX6xlXUfWUQ/mi7pgHViuVh+r/NrEKptMjb5efdeH7/mls8tRzyaQQF+12LbSb0wBo2bTkcQLXkP3WvKP5OSFde3B620=; c=relaxed/simple; s=uku4taia5b5tsbglxyj6zym32efj7xqv; d=amazonses.com; t=1635362495; v=1; bh=5WqfRbOb11+ge2mx0Egi4AqA6n1lyVIU1IpAVg4H0dA=; h=From:To:Cc:Bcc:Subject:Date:Message-ID:MIME-Version:Content-Type:X-SES-RECEIPT;
Received: by mail-lf1-f53.google.com with SMTP id y26so8295478lfa.11
for <OsirsYlopoPriorityEmail#email.webhook.site>; Wed, 27 Oct 2021 12:21:34 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=ylopo.com; s=google;
h=mime-version:from:date:message-id:subject:to;
bh=nrZwBjOzEgZN0tQ61MtDEDWEj3XJn2q+y4Qx1/acY5A=;
b=gtj7GzFH9q0O2S1ynt3Qvhp5zgomrYmufqbSQ0qIjEalwk9Dd0lSI7MeOMrgNtjDFL
sGwRBO9L4ZW3yE5ZKmP/wSYKmVlerL51ZlTQQhuTXsxioymJto3j0ERWirJQj+BapzGT
HBxScQEwYkpqZqWX6KkCTjCzCGZqW+fp9vitHmgfqt1/nLiyZp+7WEbluw+rPQO0G7dR
CGObjTeYa0Fd+Dc8h/k/a7suZ2umrqqnl/HYaoY7BeMxhAJDP5TuaoAsjQh1EU9zqHY8
TJxcJZoo83n+7f8qVMSNpAstVynlmsH6h7nzW1q27pfeWWY6LgMRUjkKqrYIc4F8lsqR
98SQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=1e100.net; s=20210112;
h=x-gm-message-state:mime-version:from:date:message-id:subject:to;
bh=nrZwBjOzEgZN0tQ61MtDEDWEj3XJn2q+y4Qx1/acY5A=;
b=D/VtUGyzcVfTxi2x63V/3OC7ckHj1uuLlILVUNmOkMrvW6GWUJq7mKI9/D4UYXWc5i
Ybds0/Dktx6cDJUZbAiAYWOiYd56XMMc2O+Yoe36u+eREry29IJQXOLTcRj2KFcLGSWa
Me5GjcFPVBTuhjtxlPb41wCKhGmDevYEHDkbGIcoNp5w3weGobSPg8bLQNsEO/Hspn6y
4Q18s7uNIJ8X2o5DEevA8DrZfibThQ3X5HUFmpuaRT3089Qm3H92wiHv3rkn3fQFDnVY
4p/PcWw9bxK6pU47bBtO/qtJ2ce/3Q7OQq/NCvJ5ZjD83lRai1mKIaCtiyt3gccQrd5S
hc5g==
X-Gm-Message-State: AOAM531woVO34G4bG9aVBLV9ae6QgX8pklYV4DDYs4VALPxuz7W9GStL
QM3YDolGrDOXbzaZghT/1+So8DR3WqejlOCELMsa+zO2Xxs=
X-Google-Smtp-Source: ABdhPJylxjs2zEdxsuBr2M1l9CnsOHrWvk+53vOLw2bs77bdl75s4022uZaKlqYx/GG+UyEsRrt8DT6mRRCAQZSebbE=
X-Received: by 2002:a05:6512:acf:: with SMTP id n15mr578213lfu.222.1635362493750;
Wed, 27 Oct 2021 12:21:33 -0700 (PDT)
Received: from 927538837578 named unknown by gmailapi.google.com with
HTTPREST; Wed, 27 Oct 2021 12:21:33 -0700
MIME-Version: 1.0
From: leadnotification-noreply2#ylopo.com
Date: Wed, 27 Oct 2021 12:21:33 -0700
Message-ID: <CAN2r-3o0aD_98y2GrdxdBW7W6UHi8RMS9-JmvsHrftheurwMeQ#mail.gmail.com>
Subject: Ylopo Priority Alert - Party: Daniel Askew 19293 -
PRIORITY_LEAD_EVENT - massaquoimartha#yahoo.com - 8562838525
To: OsirsYlopoPriorityEmail#email.webhook.site, qojfsghi#mailparser.io
Content-Type: multipart/alternative; boundary="00000000000084f8be05cf5a80dc"
--00000000000084f8be05cf5a80dc
Content-Type: text/plain; charset="UTF-8"
Lead Name: Martha Mansaray
Lead Email: massaquoimartha#yahoo.com
Lead Phone: 8562838525
Text:
Ylopo PRIORITY LEAD ALERT: Martha Mansaray (856) 283-8525
Martha Mansaray VIEWED 6185 Old Highway 31E, Bethpage, TN
<https://andrea.livetn.com/listing-detail/124037148> 29 TIMES.
Recommend actions:
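I think you need a regex: XPath finds nodes in XML/HTML documents, whereas a regex works on any text. To get the value of the "Subject" header you can use the regex \nSubject: (.*). Note that (.*) stops at the end of the line, so for a header folded across several lines it captures only the first one.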
Python example:
import re
sample = """
...
<text you want to parse>
...
Subject: Ylopo Priority Alert - Party: Daniel Askew 19293 -
"""
if match := re.search(r"\nSubject: (.*)", sample):
    print(match.group(1)) # output: Ylopo Priority Alert - Party: Daniel Askew 19293 -
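If the raw message is available as a whole, an alternative is to let Python's standard email package parse it, since it also joins folded header lines. The sketch below is only an illustration: the raw string is a shortened stand-in for the message in the question, and the addresses in it are placeholders.

```python
from email import message_from_string
from email.policy import default

# Shortened stand-in for the raw message; the Subject is folded over two lines.
raw = (
    "MIME-Version: 1.0\r\n"
    "Subject: Ylopo Priority Alert - Party: Daniel Askew 19293 -\r\n"
    " PRIORITY_LEAD_EVENT - lead@example.com - 8562838525\r\n"
    "To: inbox@example.com\r\n"
    "Content-Type: text/plain; charset=\"UTF-8\"\r\n"
    "\r\n"
    "Lead Name: Example\r\n"
)

msg = message_from_string(raw, policy=default)
subject = str(msg["Subject"])   # the folded lines come back as one string
print(subject)
```

With policy=default the header value is returned unfolded, so the second physical line of the Subject is not lost.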
Let's say I want to compose an email header with UTF-8, quoted-printable encoded subject, which is "test — UNIX-утилита для проверки типа файла и сравнения значений". I can confirm the bytes of the characters using:
$ echo "UNIX-утилита ..." | perl utfinfo.pl
Got 16 uchars
Char: 'U' u: 85 [0x0055] b: 85 [0x55] n: LATIN CAPITAL LETTER U [Basic Latin]
Char: 'N' u: 78 [0x004E] b: 78 [0x4E] n: LATIN CAPITAL LETTER N [Basic Latin]
Char: 'I' u: 73 [0x0049] b: 73 [0x49] n: LATIN CAPITAL LETTER I [Basic Latin]
Char: 'X' u: 88 [0x0058] b: 88 [0x58] n: LATIN CAPITAL LETTER X [Basic Latin]
Char: '-' u: 45 [0x002D] b: 45 [0x2D] n: HYPHEN-MINUS [Basic Latin]
Char: 'у' u: 1091 [0x0443] b: 209,131 [0xD1,0x83] n: CYRILLIC SMALL LETTER U [Cyrillic]
Char: 'т' u: 1090 [0x0442] b: 209,130 [0xD1,0x82] n: CYRILLIC SMALL LETTER TE [Cyrillic]
Char: 'и' u: 1080 [0x0438] b: 208,184 [0xD0,0xB8] n: CYRILLIC SMALL LETTER I [Cyrillic]
...
So, I'm trying to get the UTF-8, quoted printable representation of this. For instance, using Python's quopri:
$ python -c 'import quopri; a="test — UNIX-утилита для проверки типа файла и сравнения значений"; print(quopri.encodestring(a));'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
... or PHP's quoted_printable_encode, which gives the exact same output:
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; echo quoted_printable_encode($a)."\n";'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9
So, to test, I make a text file called test.eml, and try to simply wrap this output in the =?UTF-8?Q? ... ?= tags for the Subject: line, making sure that line endings are CRLF \r\n:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... but if I open this in Thunderbird, I get a corrupt output:
I've read somewhere that folding of long header fields is covered by RFC 822 "LONG HEADER FIELDS", and that basically each line ending should be followed by a space. So I indent the continuation lines by one space:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
 =D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
 =D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
 =D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... and I get a slightly different subject in Thunderbird, but still corrupt:
Now, if I delete =\r\n from the first three continuation lines, so the subject is all in one line:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... then actually Thunderbird shows the subject line well:
... but then my header is in conflict with the recommendation from RFC 2822 - 2.1.1. Line Length Limits which says "Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."; specifically the line limit of 78 characters.
So, how can I obtain the proper multi-line quoted-printable representation of an UTF-8 Subject header string, so I can use it in an .eml file split at 78 characters - and have Thunderbird correctly read it?
When I ask python to create an email with that subject, here's what it does:
$ python
Python 2.7.9 (default, Mar 1 2015, 18:22:53)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.message import Message
>>> from email.header import Header
>>> msg = Message()
>>> import quopri
>>> h = Header(quopri.decodestring('test =E2=80=94 UNIX-'
'=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F '
'=D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8'
'=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 '
'=D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F '
'=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?='), 'UTF-8')
>>> msg['Subject'] = h
>>> print msg.as_string()
Subject: =?utf-8?b?dGVzdCDigJQgVU5JWC3Rg9GC0LjQu9C40YLQsCDQtNC70Y8g0L/RgNC+0LI=?=
 =?utf-8?b?0LXRgNC60Lgg0YLQuNC/0LAg0YTQsNC50LvQsCDQuCDRgdGA0LDQstC90LU=?=
 =?utf-8?b?0L3QuNGPINC30L3QsNGH0LXQvdC40Lk/?=
>>>
So it uses base64 encoding instead of quoted-printable, but my strong suspicion, based on this, is that each line must be a complete encoded word of its own, opening with =?charset?encoding? and closing with ?=.
Indeed:
>>> import email
>>> s = '''Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0?=
...  =?UTF-8?Q?=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0?=
...  =?UTF-8?Q?=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
...  =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0?=
...  =?UTF-8?Q?=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1?=
...  =?UTF-8?Q?=87=D0=B5=D0=BD=D0=B8=D0=B9?=
...
... Hello.
... '''
>>> e = email.message_from_string(s.replace('\n', '\r\n'))
>>> email.header.decode_header(e['Subject'])
[('test \xe2\x80\x94 UNIX-\xd1\x83\xd1\x82\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x82\xd0\xb0 \xd0\xb4\xd0\xbb\xd1\x8f \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb5\xd1\x80\xd0\xba\xd0\xb8 \xd1\x82\xd0\xb8\xd0\xbf\xd0\xb0 \xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb\xd0\xb0 \xd0\xb8 \xd1\x81\xd1\x80\xd0\xb0\xd0\xb2\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xb7\xd0\xbd\xd0\xb0\xd1\x87\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb9', 'utf-8')]
>>> decoded = email.header.decode_header(e['Subject'])
>>> print decoded[0][0].decode(decoded[0][1])
test — UNIX-утилита для проверки типа файла и сравнения значений
EDIT: However, even with the above added in .eml file, Thunderbird fails again:
... but this time it indicates it got some of the chars correct. And indeed, breakage occurs where lines are broken "in the middle of a character"; say if for the sequence 0xD1, 0x83 for the character у, the =D1?= ends one line, and the Q?=83 starts the other, then Thunderbird cannot parse that. So after manual rearrangement, this snippet can be obtained:
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8?=
 =?UTF-8?Q?=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
 =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0?=
 =?UTF-8?Q?=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0?=
 =?UTF-8?Q?=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
... which opens fine as an .eml message in Thunderbird (same as this image from OP).
EDIT2: Also PHP seems to do it right, with this invocation of mb_encode_mimeheader (directly pasteable in .eml file):
$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; mb_internal_encoding("UTF-8"); echo mb_encode_mimeheader($a, "UTF-8", "Q")."\n";'
test =?UTF-8?Q?=E2=80=94=20UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82?=
 =?UTF-8?Q?=D0=B0=20=D0=B4=D0=BB=D1=8F=20=D0=BF=D1=80=D0=BE=D0=B2=D0=B5?=
 =?UTF-8?Q?=D1=80=D0=BA=D0=B8=20=D1=82=D0=B8=D0=BF=D0=B0=20=D1=84=D0=B0?=
 =?UTF-8?Q?=D0=B9=D0=BB=D0=B0=20=D0=B8=20=D1=81=D1=80=D0=B0=D0=B2=D0=BD?=
 =?UTF-8?Q?=D0=B5=D0=BD=D0=B8=D1=8F=20=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD?=
 =?UTF-8?Q?=D0=B8=D0=B9?=
The problem with your test.eml is that your RFC 2047 encoding is broken. The Q encoding is based on quoted-printable, but is not entirely the same. In particular, each space needs to be encoded as either =20 or _, and you cannot escape line breaks with a final =.
Fundamentally, each =?...?= sequence needs to be a single, unambiguous token per RFC 822. You can either break up your input into multiple such tokens and leave the spaces unencoded, or encode the spaces. Note that spaces between two such tokens are not significant, so encoding the spaces into the sequences makes more sense.
Message-Id: <4c428d27a41043e2b2b07e#example.com>
Subject: =?UTF-8?Q?test_=E2=80=94_UNIX-=D1=83=D1=82=D0=B8=D0=BB?=
 =?UTF-8?Q?=D0=B8=D1=82=D0=B0_=D0=B4=D0=BB=D1=8F_=D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8_=D1=82=D0=B8=D0=BF?=
 =?UTF-8?Q?=D0=B0_=D1=84=D0=B0=D0=B9=D0=BB=D0=B0_=D0=B8_=D1=81?=
 =?UTF-8?Q?=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F_?=
 =?UTF-8?Q?=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Hello world
Of course, with this exposition, quoted-printable isn't really legible at all, and probably takes much more space than base64, so you might prefer to go with the B encoding in the end after all.
Unless you are writing a MIME library yourself, the simple solution is to not care, and let the library piece this together for you. PHP is more problematic (the standard library lacks this functionality, and the third-party libraries are somewhat uneven--find one you trust, and stick to it), but in Python, simply pass in a Unicode string, and the email library will encode it if necessary.
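As a rough illustration of letting the library do the work, here is a Python 3 sketch using email.header.Header. As far as I know the utf-8 charset picks the shorter of base64 and Q by default, so the sketch forces Q encoding through a Charset object purely so the result can be compared with the hand-built header above; the variable names are my own.

```python
from email.charset import QP, Charset
from email.header import Header, decode_header, make_header

subject = "test — UNIX-утилита для проверки типа файла и сравнения значений"

# Assumption for this demo: force Q encoding instead of the library default.
q_utf8 = Charset("utf-8")
q_utf8.header_encoding = QP

# header_name lets the first line leave room for the "Subject: " prefix.
header = Header(subject, charset=q_utf8, header_name="Subject", maxlinelen=76)
encoded = header.encode()       # encoded words joined by a newline plus one space
print("Subject: " + encoded)

# Decoding the folded value again should give back the original text.
round_trip = str(make_header(decode_header(encoded)))
print(round_trip)
```

Each output line is a complete encoded word, and the library splits on character boundaries, so no multi-byte character is cut in half between two lines.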
We are using a Joomla component and all the messages for the "contact this advertiser" web form go directly to spam. We have tested many email accounts and have made every change I can think of. Any assistance would be great.
Here is the page where the web form is:
http://www.shopforbiz.com/buy-a-business/ad/restaurants,26/franchise-sub-shop,31
Here is the email header information:
Delivered-To: adammotta#gmail.com
Received: by 10.194.27.195 with SMTP id v3csp56291wjg;
Thu, 7 Aug 2014 06:01:33 -0700 (PDT)
X-Received: by 10.68.113.133 with SMTP id iy5mr17521168pbb.135.1407416493076;
Thu, 07 Aug 2014 06:01:33 -0700 (PDT)
Return-Path: <inquiry#shopforbiz.com>
Received: from see.seekmomentum.com (see.seekmomentum.com. [198.57.217.77])
by mx.google.com with ESMTPS id ym5si3469643pab.6.2014.08.07.06.01.32
for <adammotta#gmail.com>
(version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
Thu, 07 Aug 2014 06:01:32 -0700 (PDT)
Received-SPF: pass (google.com: domain of inquiry#shopforbiz.com designates 198.57.217.77 as permitted sender) client-ip=198.57.217.77;
Authentication-Results: mx.google.com;
spf=pass (google.com: domain of inquiry#shopforbiz.com designates 198.57.217.77 as
permitted sender) smtp.mail=inquiry#shopforbiz.com
Received: from localhost ([127.0.0.1]:35995 helo=www.shopforbiz.com)
by see.seekmomentum.com with esmtpa (Exim 4.82)
(envelope-from <inquiry#shopforbiz.com>)
id 1XFNJz-0003RC-Jy
for adammotta#gmail.com; Thu, 07 Aug 2014 09:01:31 -0400
Date: Thu, 7 Aug 2014 09:01:31 -0400
To: adammotta#gmail.com
From: Shop For Biz <inquiry#shopforbiz.com>
Reply-To: inquiry#shopforbiz.com
Subject: New Inquiry from ShopforBiz
Message-ID: <18888aba9485a3ef865449febf2667c2#www.shopforbiz.com>
X-Priority: 3
X-Mailer: PHPMailer 5.2.6 (https://github.com/PHPMailer/PHPMailer/)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-AntiAbuse: This header was added to track abuse, please include it with any abuse report
X-AntiAbuse: Primary Hostname - see.seekmomentum.com
X-AntiAbuse: Original Domain - seekmomentum.com
X-AntiAbuse: Originator/Caller UID/GID - [47 12] / [47 12]
X-AntiAbuse: Sender Address Domain - shopforbiz.com
X-Get-Message-Sender-Via: see.seekmomentum.com: authenticated_id: inquiry#shopforbiz.com
X-Source:
X-Source-Args:
X-Source-Dir:
From user: adam
User email: adamd#yaho.com
Your advertisement ''Carry Out Chicken, Ribs & Pizza'' enquiry
test
How do I get the word after a particular word in a Ruby string?
For example:
From:Ysxrb<abc#gmail.com>\nTo: <xyzn#gmail.com>Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>
I just want to get:
Ysxrb<abc#gmail.com
xyzabc
I think your question/requirement may need a bit of refinement.
You state: "How to get the word after a particular word in a ruby string?" and your example text is this : "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
and then you finally say that what you really want out of this string are the following words:
"'Ysxrb' and 'xyzabc'".
Will you always be parsing email text, which is what this appears to be? If so, then there are some more specific approaches you could take. For instance, in this example you could do something like this:
eml = "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
tokens = eml.split(/[\s\:]/)
which would yield this:
["From", "Ysxrb", "To", "", "Subject", "", "xyzabc", "Date", "", "Tue,", "19", "Jun", "2012", "03", "26", "56", "-0700", "Message-ID", "", "<9D.A1.02635.ABB40EF4#ecout1>"]
At this point, if the words following "From" and "Subject" are what you're after, you could simply get the first non-blank array element after each one, like this:
tokens[tokens.find_index("From") + 1] # => "Ysxrb"
tokens[tokens.find_index("Subject") + 2] # => "xyzabc" (+ 2 skips the empty token produced by the space after the colon)
You can use a regular expression; try this in an irb console:
string = "From:Ysxrb<abc#gmail.com>\nTo: <xyzn#gmail.com>Subject:"
/From:(.+)\n/.match string
$1
$1 holds the group captured by the parentheses in the regular expression.
You could try a regexp, here's an example:
>> s = "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
=> "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
>> m, w1, w2 = s.match(/^From:(\w*)\W+.*Subject: (\w*)/).to_a
=> ["From:Ysxrb\nTo: Subject: xyzabc", "Ysxrb", "xyzabc"]
>> w1
=> "Ysxrb"
>> w2
=> "xyzabc"
To work out a good regexp for your requirements, you can use Rubular, a Ruby regular expression editor.
I'm trying to parse email strings with the Ruby mail gem, and I'm having a devil of a time with character encodings. Take the following email:
MIME-Version: 1.0
Sender: foobar#example.com
Received: by 10.142.239.17 with HTTP; Thu, 14 Jun 2012 06:00:18 -0700 (PDT)
Date: Thu, 14 Jun 2012 09:00:18 -0400
Delivered-To: foobar#gmail.com
X-Google-Sender-Auth: MxfFrMybNjBoBt4O4GwAn9cMsko
Message-ID: <CAGErOzF3FV5NvzN3zUpLGPok96SFzK18Z4HerzyYNALnzgMVaA#mail.gmail.com>
Subject: Re: [Lorem Ipsum] Foo updated the forum topic 'Reply by email test'
From: Foo Bar <foo#example.com>
To: Foo <c49964d167e08e7d4a1930e6565f23c258be19a0#foo.example.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
This email has accents:=A0R=E9sum=E9
>
> --------- Reply Above This Line ------------
>
> Email parsing with accents: R=E9sum=E9
>
> Click here to view this post in your browser
The email body, when properly encoded, should be:
This reply has accents: Résumé
>
> --------- Reply Above This Line ------------
>
> Email parsing with accents: Résumé
>
> Click here to view this post in your browser
However, I'm having a devil of a time actually getting the accent marks to come through. Here's what I've tried:
message = Mail.new(email_string)
body = message.body.decoded
That gets me a string that starts like this:
This reply has accents:\xA0R\xE9sum\xE9\r\n>\r\n> --------- Reply Above This Line ------------
Finally, I try this:
body.encoding # => <Encoding:ASCII-8BIT>
body.encode("UTF-8") # => Encoding::UndefinedConversionError: "\xA0" from ASCII-8BIT to UTF-8
Does anyone have any suggestions on how to deal with this? I'm pretty sure it has to do with the "charset=ISO-8859-1" setting in the email, but I'm not sure how to use that, or if there's a way to easily extract that using the mail gem.
After playing a bit, I found this:
body.decoded.force_encoding("ISO-8859-1").encode("UTF-8") # => "This reply has accents: Résumé..."
message.parts.map { |part| part.decoded.force_encoding("ISO-8859-1").encode(part.charset) } # multi-part
You can extract the charset from the message like so.
message.charset #=> for simple, non-multipart
message.parts.map { |part| part.charset } #=> for multipart, each part can have its own charset
Be careful with non-multipart, as the following can cause trouble:
body.charset #=> returns "US-ASCII" which is WRONG!
body.force_encoding(body.charset).encode("UTF-8") #=> Conversion error...
body.force_encoding(message.charset).encode("UTF-8") #=> Correct conversion :)
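The underlying principle is not specific to Ruby: the decoded body is just bytes, and it has to be interpreted with the charset declared in Content-Type before it can be converted to UTF-8. As a comparison only (this is Python's standard email package, not the mail gem, and the message is a cut-down version of the one in the question), the same steps look like this:

```python
from email import message_from_bytes

# Cut-down version of the message from the question.
raw = (
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"This email has accents:=A0R=E9sum=E9\r\n"
)

msg = message_from_bytes(raw)
charset = msg.get_content_charset()      # charset declared in Content-Type
payload = msg.get_payload(decode=True)   # undoes quoted-printable, still bytes
text = payload.decode(charset)           # interpret the bytes as ISO-8859-1
utf8_bytes = text.encode("utf-8")        # now safe to re-encode as UTF-8
print(charset, repr(text))
```

Trying to treat the payload as UTF-8 directly fails for the same reason the Ruby conversion from ASCII-8BIT does: the bytes 0xA0 and 0xE9 only make sense once the source charset is known.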
This didn't work for me, so thought I'd stick up the solution I got to in case it helps anyone...
Basically had to add encoding defaults and tweak the output into sensible strings.
https://stackoverflow.com/a/26604049/2386548