Strip signatures and replies from emails - ruby

I'm currently working on a system that allows users to reply to notification emails that are sent out (sigh).
I need to strip out the replies and signatures, so that I'm left with the actual content of the reply, without all the noise.
Does anyone have any suggestions about the best way to do this?

If your system is in-house and/or you have a limited number of reply formats, it's possible to do a pretty good job. Here are the filters we have set up for email responses to trac tickets:
Drop all text after and including:
1. Lines that equal '-- \n' (standard email sig delimiter)
2. Lines that equal '--\n' (people often forget the space in the sig delimiter, and this is not that common outside sigs)
3. Lines that begin with '-----Original Message-----' (MS Outlook default)
4. Lines that begin with '________________________________' (32 underscores, Outlook again)
5. Lines that begin with 'On ' and end with ' wrote:\n' (OS X Mail.app default)
6. Lines that begin with 'From: ' (failsafe for Outlook and some other reply formats)
7. Lines that begin with 'Sent from my iPhone'
8. Lines that begin with 'Sent from my BlackBerry'
Numbers 3 and 4 are 'begin with' instead of 'equals' because users will sometimes squash lines together by accident.
We try to be liberal about stripping out replies, since having reply garbage is much more of an annoyance (to us) than correcting missing text.
Anybody have other formats from the wild that they want to share?
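The filters listed above could be sketched in Ruby roughly as follows. This is only a minimal sketch, not the code used for the trac setup; the names CUTOFF_PATTERNS and strip_reply are made up for illustration, and each pattern is checked against one line with its line ending removed.

```ruby
# Patterns that mark where the signature or quoted reply starts.
# Each one is matched against a single line with its line ending removed.
CUTOFF_PATTERNS = [
  /\A-- ?\z/,                      # "-- " (standard delimiter) or "--"
  /\A-----Original Message-----/,  # MS Outlook default
  /\A_{32}/,                       # 32 underscores, Outlook again
  /\AOn .* wrote:\z/,              # OS X Mail.app default
  /\AFrom: /,                      # failsafe for Outlook and others
  /\ASent from my iPhone/,
  /\ASent from my BlackBerry/
].freeze

# Returns the text above the first line that matches any cutoff pattern.
def strip_reply(body)
  kept = []
  body.each_line do |line|
    text = line.chomp # drops "\n" or "\r\n"
    break if CUTOFF_PATTERNS.any? { |pattern| pattern.match?(text) }
    kept << text
  end
  kept.join("\n").rstrip
end
```

Because everything from the first match onward is dropped, adding a new format from the wild only means adding one more pattern to the list.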

Check out the email_reply_parser gem - https://github.com/github/email_reply_parser . It does a nice job handling this problem.

I don't believe you can do this reliably (signatures used to begin with '--' but I don't see that anymore). Perhaps you're better off asking people to reply in between text markers and then simply extracting the reply from between them? It's not elegant, but it is perhaps more reliable.
e.g.
REPLY BETWEEN HERE -->
AND HERE -->
so you'd simply look for the required markers above and take what's in between.
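A minimal Ruby sketch of that marker approach is below. The marker strings are the ones from the example; the method name extract_marked_reply is hypothetical, and the sketch assumes each marker sits on its own line (possibly prefixed with quote characters by the mail client).

```ruby
START_MARKER = "REPLY BETWEEN HERE -->"
END_MARKER   = "AND HERE -->"

# Returns the text between the two marker lines, or nil if either is missing.
def extract_marked_reply(body)
  lines = body.lines.map(&:chomp)
  first = lines.index { |line| line.include?(START_MARKER) }
  return nil if first.nil?

  rest = lines.drop(first + 1)
  last = rest.index { |line| line.include?(END_MARKER) }
  return nil if last.nil?

  rest.first(last).join("\n").strip
end
```

Returning nil when a marker is missing lets the caller fall back to some other strategy, since users can easily delete or mangle the markers.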

If you want something powerful & robust, and don't mind reading academic publications, you might check out this:
Learning to Extract Signature and Reply Lines from Email
Here's the homepage for one of the authors, with more info & some downloads:
Vitor R. Carvalho - Software and Datasets - (Vitor Carvalho)

An approach that can be used for signatures only (in addition to detecting __ or --) is to test whether the first name and/or family name of the sender appears on a short line (containing roughly 3 to 4 words at most).
The sender name is in the raw email header, most of the time next to the email address, like in:
From: John Doe <jdoe@provider.com>
This is based on the assumption that you rarely write your own name in an email, and if you do, it is probably in a long sentence.
Of course there will be some false positives, but that may not be a big problem depending on what you do (we use it to fold quoted text and the signature behind a gmail-style "..." button, so overdetection does not lose any content, it is just misplaced).
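The name heuristic above might look something like this in Ruby. It is a rough sketch under stated assumptions: the method name signature_start_index and the max_words default are invented here, and the From header is assumed to have the simple "Name <address>" shape shown in the example.

```ruby
# Returns the index of the first short line that contains part of the
# sender's display name, or nil when no such line is found.
def signature_start_index(from_header, body_lines, max_words: 4)
  # Keep only the display name: drop "From:", the <address> part and quotes.
  display_name = from_header.sub(/\AFrom:\s*/i, "").sub(/<[^>]*>/, "").delete('"').strip
  name_parts = display_name.downcase.scan(/[[:alpha:]]+/)
  return nil if name_parts.empty?

  body_lines.index do |line|
    words = line.downcase.scan(/[[:alpha:]]+/)
    !words.empty? && words.length <= max_words && !(words & name_parts).empty?
  end
end
```

A caller could fold everything from the returned index onward, which keeps false positives harmless in the way described above.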

If you can assume that these emails are in plain text, just strip lines that begin with ">" as replies, and treat a "-- " line as the signature delimiter. But those assumptions might not hold, as not everyone on the internet uses software that follows the rules.

There's a really nice PHP library dedicated to email parsing:
http://williamdurand.fr/EmailReplyParser/
https://github.com/willdurand/EmailReplyParser

I made one for golang (https://github.com/web-ridge/email-reply-parser). It detects signatures like:
Karen The Green
Graphic Designer
Office
Tel: +44423423423423
Fax: +44234234234234
karen@webby.com
Street 2, City, Zeeland, 4694EG, NL
www.thing.com
The content of this email is confidential and intended for the recipient specified in message only. It is strictly forbidden to share any part of this message with any third party, without a written consent of the sender. If you received this message by mistake, please reply to this message and follow with its deletion, so that we can ensure such a mistake does not occur in the future.
Met vriendelijke groeten,
Richard Lindhout

The recommended signature delimiter is "-- \n". If people follow this recommendation, stripping signatures should be easy.

Related

What is SCC_BODY_URI_ONLY rule in spam assassin?

My email triggers the SCC_BODY_URI_ONLY rule when checked with SpamAssassin.
Does anybody know about this rule? There is not much documentation on it.
You are right about the documentation. I checked https://www.futurequest.net/docs/SA/, which is a very long list, but it still has no description.
But I did see it had to do with the Meta.
So I looked at the source of the email and saw that the title tags were empty. I just added a title and the email passed. I hope that helps.
As of 2022-06-23, the rule works as follows, as defined under 72_active.cf:
meta SCC_BODY_URI_ONLY T_SCC_BODY_TEXT_LINE < 2 && __HAS_ANY_URI && !__SMIME_MESSAGE
meta T_SCC_BODY_TEXT_LINE __SCC_BODY_TEXT_LINE_FULL - __SCC_SUBJECT_HAS_NON_SPACE
body __SCC_BODY_TEXT_LINE_FULL /^\s*\S/
tflags __SCC_BODY_TEXT_LINE_FULL multiple maxhits=3
header __SCC_SUBJECT_HAS_NON_SPACE Subject =~ /\S/
To summarise, the rule SCC_BODY_URI_ONLY will trigger if all of the following hold:
T_SCC_BODY_TEXT_LINE returns a number less than 2. T_SCC_BODY_TEXT_LINE counts body lines that contain a non-whitespace character (with any amount of whitespace before it), stopping at a maximum of 3 hits, and subtracts 1 if the Subject contains any non-whitespace character. The subtraction is needed because SpamAssassin body rules treat the Subject as the first paragraph of the body.
The email contains any URI
The email does not have a Content-Type header indicating it is an S/MIME email
So, pretty much any email that contains:
At least 1 URI, 1 line in the body and a blank subject, OR
At least 1 URI, 2 lines (counting the Subject as one of them) and a subject with content
The above may become out of date, so check the current state of the rules in your SpamAssassin definitions. More information about writing and interpreting rules can be found here: https://cwiki.apache.org/confluence/display/SPAMASSASSIN/writingrules
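To make the arithmetic concrete, here is a rough Ruby approximation of the score calculation. It is not SpamAssassin code: the method name scc_body_uri_only? is invented, the body is split into plain lines rather than SpamAssassin's rendered paragraphs, and URI detection is reduced to a simple http/https pattern, so treat it purely as an illustration of the counting.

```ruby
# Rough approximation of the SCC_BODY_URI_ONLY meta rule (illustration only).
def scc_body_uri_only?(subject:, body:, smime: false)
  # Body rules see the Subject as the first piece of body text.
  rendered_lines = [subject] + body.lines

  # __SCC_BODY_TEXT_LINE_FULL: lines with visible text, capped by maxhits=3.
  full_hits = [rendered_lines.count { |line| line.match?(/^\s*\S/) }, 3].min

  # __SCC_SUBJECT_HAS_NON_SPACE: 1 when the Subject has visible text.
  subject_hit = subject.match?(/\S/) ? 1 : 0

  text_line_score = full_hits - subject_hit
  has_uri = body.match?(%r{https?://\S+}) # assumption: simplified URI check

  text_line_score < 2 && has_uri && !smime
end
```

With a non-blank subject, one body line holding only a link gives 2 - 1 = 1 and the rule fires; adding a second line of text gives 3 - 1 = 2 and it no longer does.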

How do I create a multiline bot response in Rasa Core?

Can anyone help me get bot responses to span multiple lines?
Also, how do I get bullets in the bot responses? I tried >, * and the enter key, and nothing seems to work. Do Rasa response templates support HTML tags?
The visualization of the message depends on the output channel which you are using.
Hence, it should be possible to provide HTML tags in your bots answers as long as your output channel can then correctly render it. For a simple newline, please try adding a "\n" in your messages, e.g.:
utter_message:
- text: "First line\nSecond line\nThird line"
You can also have a multiline string in your yaml file which then results in a string containing newlines (see here for examples). Use the literal block style (|) for this, because the folded style (>) joins the lines with spaces. The block below gives the same lines as the example above:
utter_message:
- text: |
    First line
    Second line
    Third line
To include bullets, you could simply add the unicode character of a bullet, using the literal block style (|) so that the line breaks are kept, e.g.:
utter_message:
- text: |
    • First line
    • Second line
    • Third line
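How a YAML block scalar is parsed can be checked with any YAML parser. The sketch below uses Ruby's bundled YAML library purely as a parsing demo (it is not Rasa code, and the constant and method names are made up): the literal style (|) keeps the line breaks, while the folded style (>) turns them into spaces.

```ruby
require "yaml"

LITERAL_DOC = "text: |\n  First line\n  Second line\n"
FOLDED_DOC  = "text: >\n  First line\n  Second line\n"

# Parses a small YAML document and returns the value stored under "text".
def parsed_text(yaml_source)
  YAML.load(yaml_source)["text"]
end

p parsed_text(LITERAL_DOC) # line breaks are preserved
p parsed_text(FOLDED_DOC)  # single line breaks are folded into spaces
```

Whether the channel then renders those newlines as separate lines still depends on the output channel.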
I think newlines don't correspond to "multiple bot responses", which I interpret as multiple boxes (message bubbles) on an instant messaging/chat channel. That is how it works in Telegram, for example. So I fear @Tobias's solution isn't definitive.
A way to get separate message boxes could be to split the original single utterance into a sequence of utterances and then insert them one after another in a "story", as described in this Rasa forum reply: https://forum.rasa.com/t/split-utterances-templates-into-multiple-answers/1204/2?u=solyarisoftware
That's more of a workaround, although that is debatable from a conversational design perspective. I may want different boxes not just to pretty-print text with newlines, but to communicate different semantics.
For example, if the user says:
Hello
The bot could reply by answering the greeting and also introducing a new question/prompt to let the dialog continue.
And that could deserve a new box, for a sequence of 2 boxes.
So the bot reply could better be:
Hello!
How are you?

Regex string for Sublime Text to find email address between two commas

I'm new to regexes and Sublime Text and am having issues trying to do a find/replace on all email addresses in a csv file.
I thought it would be reasonably straightforward but seem to be heading down the rabbit hole at a great rate of knots.
Data looks like:
data,data,email@address.com,data,data etc NB: there are about 100 fields per record and about 300 records
My thought was to look for the @ symbol, then go left and right until I get to a comma, and then replace what is in between with my new email address, but I just can't get a win.
Any thoughts or am I using the wrong tool for the job?
(Also tagging with Ruby, because if I need to do some scripting then I'll try to figure it out in Ruby)
Thanks,
Liam
user2141046's expression won't find an email address like "a.b@c.com".
I would suggest using:
[a-zA-Z0-9.!#$%&'+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:.[a-zA-Z0-9-]+)
Source
I'm not familiar with the Ruby language, but a regex that finds what you want is:
\w+\@\w+\.\w+
with the \. maybe unneeded (depending on language).
A Perl one-liner that does the exact thing:
perl -pi -e 's/\w+\@\w+\.\w+/<your new email here>/g' <csv file here>
Note:
Make sure you use \@ in the new email in the one-liner I wrote, meaning new_email\@server.com
Try this:
[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
It worked perfectly on a very long csv file filled with emails and all other kinds of stuff.
[a-zA-Z0-9.!#$%&'+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:.[a-zA-Z0-9-]+)
will not work properly, because some domains have 2 or more levels (like com.br).
Use:
[a-zA-Z0-9.!#$%&'+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:.[\.a-zA-Z0-9-]+)
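Since the question mentions Ruby, here is a minimal sketch of the asker's own idea: find an @ and extend left and right until a comma. The names EMAIL_FIELD and replace_emails are hypothetical, and the pattern assumes unquoted fields with no commas or whitespace inside an address; it makes no attempt to validate the address.

```ruby
# Anything without commas or whitespace on both sides of an "@".
EMAIL_FIELD = /[^,\s]+@[^,\s]+/

# Replaces every email-looking field in the CSV text with new_email.
def replace_emails(csv_text, new_email)
  # Block form so the replacement is inserted literally.
  csv_text.gsub(EMAIL_FIELD) { new_email }
end
```

The same pattern, [^,\s]+@[^,\s]+, should also work in Sublime Text's find/replace with regex mode switched on, since it only uses basic character classes.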

rfc2047 multiple encoded-word in email subject

I need to send an email with the Subject containing cyrillic letters. But my recipients sometimes receive incorrect letters due to some problems with mail server and/or client. I always send emails in windows-1251 encoding, but sometimes a mail client shows letter's Subject and Sender in another encoding (KOI-8R) and our users can't understand the message.
I tried to use an encoded-word tag as described in RFC 2047 Standard. For example, my Subject field in the email now looks like:
Subject: =?WINDOWS-1251?B?wiDt5eTw4PUg8vPt5PD7IOL75PD7IOIg4+Xy8OD1IPL78P/yIOIg4uXk8OAg/+Tw?=
=?WINDOWS-1251?B?4CDq5eTw4C4gwvvw4uDiIPEg4vvk8Psg4iDy8+3k8OUg4+Xy?=
=?WINDOWS-1251?B?8PssIOL78vDzIOL75PDu6SD/5PDgIOrl5PDgLCDi+/Lw8yDj?=
=?WINDOWS-1251?B?5fLw7ukg4vvk8OUg7O7w5PMsIP/k8OAg4iDi5eTw4Cwg4vvk?=
=?WINDOWS-1251?B?8PMg4iDy8+3k8PMu?=
These lines were generated by the Oracle function UTL_ENCODE.MIMEHEADER_ENCODE.
All mail clients (Lotus Notes, gmail.com) show only the first line of such email subject (only first 48 symbols).
What is the problem with my mail subject?
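The problem is that you do not fold the header correctly according to RFC 2822. In a multi-line header field, each continuation line has to start with white space; otherwise the following lines are not treated as part of the Subject, which is why only the first line is shown.
What you need to do is: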
replace(UTL_ENCODE.MIMEHEADER_ENCODE(subject, 'UTF8', UTL_ENCODE.BASE64), UTL_TCP.CRLF, UTL_TCP.CRLF || ' ')
This should solve your problem.
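For comparison, the same folding rule can be sketched outside Oracle. The Ruby below is only an illustration with an invented method name (encoded_subject_header): it assumes UTF-8 rather than windows-1251, splits the subject into chunks of 10 characters so that each encoded-word stays under the 75-character limit of RFC 2047, and joins the words with CRLF followed by a space.

```ruby
# Builds a Subject header made of Base64 encoded-words, folded so that
# every continuation line starts with a space.
def encoded_subject_header(subject, chars_per_word: 10)
  words = subject.encode("UTF-8").chars.each_slice(chars_per_word).map do |chars|
    # pack("m0") is Base64 without line breaks.
    base64 = [chars.join].pack("m0")
    "=?UTF-8?B?#{base64}?="
  end
  folded = words.join("\r\n ") # CRLF + space marks a continuation line
  "Subject: #{folded}"
end
```

Chunking by characters rather than bytes keeps multi-byte characters from being split across two encoded-words.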

Algorithm for re-wrapping hard-wrapped text?

Let's say that I have written a custom e-mail management application for the company that I work for. It reads e-mails from the company's support account and stores cleaned-up, plain text versions of them in a database, doing other neat things like associating it with customer accounts and orders in the process. When an employee replies to a message, my program generates an e-mail that is sent to the customer with a formatted version of the discussion thread. If the customer responds, the app looks for a unique number in the subject line to read the incoming message, strip out the previous discussion, and add it as a new item in the thread. For example:
This is a message from Contoso customer service.
Recently, you requested customer support. Below is a summary of your
request and our reply.
--------------------------------------------------------------------
Contoso (Fred) on Tuesday, December 30, 2008 at 9:04 a.m.
--------------------------------------------------------------------
John:
I've modified your address. You can confirm my work by logging into
"Your Account" on our Web site. Your order should ship out today.
Thanks for shopping at Contoso.
--------------------------------------------------------------------
You on Tuesday, December 30, 2008 at 8:03 a.m.
--------------------------------------------------------------------
Oops, I entered my address incorrectly. Can you change it to
Fred Smith
123 Main St
Anytown, VA 12345
Thanks!
--
Fred Smith
Contoso Product Lover
Generally, this all works great, but there's one area that I've been putting off cleaning up for a while now, and it deals with text wrapping. In order to generate the pretty e-mail format like the one above, I need to re-wrap the text that the customer originally sent.
I've written an algorithm that does this (though looking at the code, I'm not entirely sure how it works anymore--it could use some refactoring). But it can't distinguish between a hard-wrap newline, an "end of paragraph" newline, and a "semantic" newline. For example, a hard-wrap newline is one that the e-mail client inserted within a paragraph to wrap a long line of text, say, at 79 columns. An end of paragraph newline is one that the user added after the last sentence in a paragraph. And a semantic newline would be something like the br tag, such as in the address that Fred typed above.
My algorithm instead only sees two newlines in a row as indicating a new paragraph, so it would make the customer's e-mail be formatted something like the following:
Oops, I entered my address incorrectly. Can you change it to
Fred Smith 123 Main St Anytown, VA 12345
Thanks!
-- Fred Smith Contoso Product Lover
Whenever I try to write a version that would re-wrap this text as intended, I basically hit a wall in that I need to know the semantics of the text: the difference between a "hard-wrap" newline and an "I really meant it like a br"-type newline, such as in the customer's address. (I use two newlines in a row to determine when to start a new paragraph, which coincides with how the majority of people seem to actually type e-mails.)
Anyone have an algorithm that can re-wrap the text as intended? Or is this implementation "good enough" when weighing the complexity of any given solution?
Thanks.
You could try to check if a newline has been inserted to keep the line length below a maximum (aka hard wrap): Just check for the longest line in the text. Then, for any given line, you append the first word of the following line to it. If the resulting line exceeds the maximum length, the line break probably was a hard wrap.
Even simpler you might just consider all breaks in (maxlength - 15) <= length <= maxlength as being hardwraps (with 15 just being an educated guess). This would certainly filter out intentional breaks as in addresses and stuff, and any missed break in this range wouldn't influence the result too badly.
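The first check could be written as a small predicate. This is a sketch with a hypothetical name (hard_wrap_break?), where max_length is assumed to be the length of the longest line in the text, as suggested above.

```ruby
# True when the first word of the next line would not have fit on this
# line, which suggests the break was inserted by a hard wrap.
def hard_wrap_break?(line, next_line, max_length)
  first_word = next_line.split.first
  return false if first_word.nil? # blank next line: a paragraph break

  "#{line} #{first_word}".length > max_length
end
```

For short lines such as an address, the next word would easily have fit, so the break is treated as intentional.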
I have two suggestions, as follows.
Pay attention to punctuation: this will help you to distinguish between a "hard-wrap" newline and an "end of paragraph" newline (because, if the line ends with a full stop, then it's more likely that the user intended it to be an end-of-paragraph).
Pay attention to whether a line is much shorter than the maximum line length: in the example above, you might have text that's being "hard-wrapped" at 79 characters, plus you have address lines which are only 30 characters long; because 30 is much less than 79, you know that the address lines were broken by the user and not by the user's text-wrap algorithm.
Also, pay attention to indents: lines which are indented with whitespace from the left may be supposed to be new paragraphs, broken from the previous lines, as they are on this forum.
Following Ole's advice above, I re-worked my implementation to look at a threshold. It seems to handle most scenarios I throw at it well enough without me having to go nuts and write code that actually understands the English language.
Basically, I first scan through the input string and record the longest line length in the variable inputMaxLineLength. Then, as I'm rewrapping, if I encounter a newline at an index between 85% of inputMaxLineLength and inputMaxLineLength, I replace that newline with a space because I think it's a hard-wrap newline. The exception is when it's immediately followed by another newline, because then I assume that it's just a one-line paragraph that happens to fall within that range. This can happen if someone types out a short bulleted list, for example.
Certainly not perfect, but "good enough" for my scenario, considering the text is usually half-mangled by a previous e-mail client to begin with.
Here's some code, my a-few-hours-old implementation that probably still underwraps in a few edge cases (using C#). It's a lot less complicated than my previous solution, which is nice.
Source Code
And here's some unit tests that exercise that code (using MSTest):
Test Code
If anyone has a better implementation (and no doubt a better implementation exists), I'll be happy to read your thoughts! Thanks.
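For readers who want the gist without the linked C# source, the threshold heuristic described above can be sketched in Ruby. This is not a port of that implementation: the method name unwrap_hard_wraps is invented, and only the 85% threshold and the "followed by another newline" exception are taken from the description.

```ruby
# Replaces probable hard-wrap newlines with spaces. A newline counts as a
# hard wrap when its line is at least `threshold` times as long as the
# longest line and the next line is not blank.
def unwrap_hard_wraps(text, threshold: 0.85)
  lines = text.split("\n", -1)
  max_length = lines.map(&:length).max || 0
  min_wrap_length = max_length * threshold

  result = String.new
  lines.each_with_index do |line, index|
    result << line
    next if index == lines.length - 1 # no newline after the last line

    next_line = lines[index + 1]
    hard_wrap = line.length >= min_wrap_length && !next_line.empty?
    result << (hard_wrap ? " " : "\n")
  end
  result
end
```

The output still has to be re-wrapped to the target width afterwards; the sketch only decides which of the original newlines to keep.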
