What is the SCC_BODY_URI_ONLY rule in SpamAssassin?

My email triggers the SCC_BODY_URI_ONLY rule when checked with SpamAssassin.
Does anybody know what this rule checks? There is not much documentation about it.

You are right about the documentation. I checked https://www.futurequest.net/docs/SA/, which is a very long list but still has no description of this rule.
But I did see that it has to do with meta rules.
So I looked at the source of the email and saw that the title tag was empty. I just added a title and the email passed! Helpful, I hope. Rock on.

As of 2022-06-23, the rule works as follows, as defined under 72_active.cf:
meta SCC_BODY_URI_ONLY T_SCC_BODY_TEXT_LINE < 2 && __HAS_ANY_URI && !__SMIME_MESSAGE
meta T_SCC_BODY_TEXT_LINE __SCC_BODY_TEXT_LINE_FULL - __SCC_SUBJECT_HAS_NON_SPACE
body __SCC_BODY_TEXT_LINE_FULL /^\s*\S/
tflags __SCC_BODY_TEXT_LINE_FULL multiple maxhits=3
header __SCC_SUBJECT_HAS_NON_SPACE Subject =~ /\S/
To summarise, the rule SCC_BODY_URI_ONLY will trigger if all of the following hold:
T_SCC_BODY_TEXT_LINE returns a number less than 2
T_SCC_BODY_TEXT_LINE counts the body lines that contain a non-whitespace character (optionally preceded by any amount of whitespace), counting at most 3 such lines, and then subtracts 1 if the Subject contains any non-whitespace character.
The email contains any URI
The email does not have a Content-Type header indicating it is an S/MIME email
So, pretty much any email that contains:
At least 1 URI, at most 1 matching line in the body and a blank subject, OR
At least 1 URI, at most 2 matching lines and a subject with content
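The arithmetic in the summary above can be sketched in Python. This is only an illustration of the meta expression, not SpamAssassin code: the function name and its parameters are hypothetical, `has_uri` and `is_smime` stand in for the `__HAS_ANY_URI` and `__SMIME_MESSAGE` sub-rules (which are not reproduced here), and `lines` is assumed to be whatever text SpamAssassin hands to body rules.

```python
import re

BODY_TEXT_LINE = re.compile(r"^\s*\S")  # same pattern as __SCC_BODY_TEXT_LINE_FULL
MAX_HITS = 3                            # mirrors "tflags ... multiple maxhits=3"


def scc_body_uri_only(lines, subject, has_uri, is_smime):
    """Approximate the SCC_BODY_URI_ONLY meta rule (illustrative only)."""
    # Count lines with at least one non-whitespace character, capped at MAX_HITS.
    hits = min(sum(1 for line in lines if BODY_TEXT_LINE.search(line)), MAX_HITS)
    # __SCC_SUBJECT_HAS_NON_SPACE contributes 1 when the Subject is not blank.
    subject_hit = 1 if re.search(r"\S", subject) else 0
    text_line_score = hits - subject_hit  # T_SCC_BODY_TEXT_LINE
    return text_line_score < 2 and bool(has_uri) and not is_smime
```

With a blank subject, one matching line still scores 1 and triggers the rule; with a non-blank subject, three or more matching lines score 2 and do not.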
The above may become out of date, so check the current state of the rules in your SpamAssassin definitions. More information about writing and interpreting rules can be found here: https://cwiki.apache.org/confluence/display/SPAMASSASSIN/writingrules

Related

Last URI segment is zero (/0) -> error: disallowed characters

I tried to find a solution for this issue but nothing worked. When my REST API URI request is, e.g., https://serverip/meeting/userlist/0
I always get the error "The URI you submitted has disallowed characters". I have even tried to leave this parameter in the config file blank:
$config['permitted_uri_chars'] = 'a-z 0-9~%.:_-+';
But I get the same error.
Is it not allowed to have a 0 at the end of the URI as the only content of that segment? I need that to retrieve the user with id = 0.
Thanks a lot.
EDIT - SOLVED:
Hi Again,
I finally solved it. I found that a long time ago we commented out a check related to UTF-8 encoding in URI.php:
if ( ! empty($str) && ! empty($this->_permitted_uri_chars) && ! preg_match('/^['.$this->_permitted_uri_chars.']+$/i'.(UTF8_ENABLED ? 'u' : ''), $str))
And we only left the first condition. We had some code issues that no longer seem to reproduce after reverting that change, and /0 now works fine.
So sorry, in the end it was a problem related to our own modifications.
Thanks.
$config['permitted_uri_chars'] is used as a PCRE character class pattern.
When a dash is the last character in the class, it is treated as a literal dash. However, when a dash sits between two characters, it defines a range. So when you append the + (plus) sign after the dash, you get:
[_-+] // a range between underscore and plus in the ASCII table
You might be thinking "So what? Zeros are already allowed previously via 0-9", and you'd be correct, but that's not the problem. The problem is that the plus sign has a lower ASCII number than the underscore, and ranges don't work backwards, so _-+ is invalid and triggers a PCRE compilation failure, which in turn means the entire check fails and nothing is actually allowed.
You would see this if you had error_reporting enabled and/or looked at the error logs.
This doesn't happen if you only append the plus sign to the default pattern - the dash is not only the last character, but also escaped with a backslash - as you'd have this instead:
[_\-+] // Underscore, dash and plus sign as individual characters; not a range
I guess you thought the backslash was an actual character to be allowed and removed it. Just add it back:
$config['permitted_uri_chars'] = 'a-z 0-9~%.:_\-+';
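The range problem can be reproduced outside PHP. The following is a minimal sketch using Python's `re` module as a stand-in for PCRE (an assumption made for illustration; the two engines differ in details, but both reject a reversed range). The helper names `compiles` and `segment_allowed` are hypothetical.

```python
import re

BROKEN = r"a-z 0-9~%.:_-+"   # "_-+" is parsed as a range from "_" down to "+"
FIXED = r"a-z 0-9~%.:_\-+"   # escaped dash: "_", "-" and "+" are literals


def compiles(char_class):
    """Return True if the character class compiles inside ^[...]+$."""
    try:
        re.compile("^[" + char_class + "]+$", re.IGNORECASE)
        return True
    except re.error:
        return False


def segment_allowed(segment, char_class=FIXED):
    """Check a URI segment against the (valid) character class."""
    pattern = "[" + char_class + "]+"
    return re.fullmatch(pattern, segment, re.IGNORECASE) is not None


if __name__ == "__main__":
    print(compiles(BROKEN), compiles(FIXED), segment_allowed("0"))
```

Here the broken class fails to compile at all, which matches the point above: the failure has nothing to do with the zero itself.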

What constitutes a valid URI query parameter key?

I'm looking over Section 3.4 of RFC 3986 trying to understand what constitutes a valid URI query parameter key, but I'm not seeing a clear answer.
The reason I'm asking is because I'm writing a Ruby class that composes a URI with query parameters. When a new parameter is added I want to validate the key. Based on experience, it seems like the key will be invalid if it requires any escaping.
I should also say that I plan to validate the key. I'm not sure how to go about validating this data either, but I do know that in all cases I should escape this value.
Advice is appreciated. Advice in the context of how validation might already be possible through say a Ruby Gem would also be a plus.
I could well be wrong, but that spec seems to say that anything following '?' or '#' is valid as long. I wonder if you should be looking more at the spec for 'application/x-www-form-urlencoded' (ie. the key/value pairs we're all used to)?
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
This is the default content type. Forms submitted with this content
type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
I don't believe key=value is part of the RFC; it's a convention that has emerged. Wikipedia suggests this is a 'W3C recommendation'.
Seems like some good stuff to be found searching on the application/x-www-form-urlencoded content type.
http://www.w3.org/TR/REC-html40/interact/forms.html#form-data-set
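As a rough illustration of the application/x-www-form-urlencoded rules quoted above, here is a sketch using Python's standard `urllib.parse` (the question is about Ruby, so this is an analogy rather than a Ruby answer). The helper names are hypothetical, and treating "needs escaping" as "invalid key" is the asker's own heuristic, not something the RFC states.

```python
from urllib.parse import parse_qsl, quote_plus, urlencode


def build_query(params):
    """Encode key/value pairs as application/x-www-form-urlencoded."""
    return urlencode(params)


def key_needs_escaping(key):
    """True if the key changes when escaped (one possible validity test)."""
    return quote_plus(key) != key


if __name__ == "__main__":
    query = build_query([("user name", "a&b=c")])
    print(query)             # keys and values are both escaped
    print(parse_qsl(query))  # and decode back to the original pair
```

Escaping both the key and the value means any key can be carried safely, so rejecting keys that need escaping is a policy choice rather than a requirement.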

rfc2047 multiple encoded-word in email subject

I need to send an email with the Subject containing cyrillic letters. But my recipients sometimes receive incorrect letters due to some problems with mail server and/or client. I always send emails in windows-1251 encoding, but sometimes a mail client shows letter's Subject and Sender in another encoding (KOI-8R) and our users can't understand the message.
I tried to use an encoded-word tag as described in RFC 2047 Standard. For example, my Subject field in the email now looks like:
Subject: =?WINDOWS-1251?B?wiDt5eTw4PUg8vPt5PD7IOL75PD7IOIg4+Xy8OD1IPL78P/yIOIg4uXk8OAg/+Tw?=
=?WINDOWS-1251?B?4CDq5eTw4C4gwvvw4uDiIPEg4vvk8Psg4iDy8+3k8OUg4+Xy?=
=?WINDOWS-1251?B?8PssIOL78vDzIOL75PDu6SD/5PDgIOrl5PDgLCDi+/Lw8yDj?=
=?WINDOWS-1251?B?5fLw7ukg4vvk8OUg7O7w5PMsIP/k8OAg4iDi5eTw4Cwg4vvk?=
=?WINDOWS-1251?B?8PMg4iDy8+3k8PMu?=
These lines were generated by the Oracle function UTL_ENCODE.MIMEHEADER_ENCODE.
All mail clients (Lotus Notes, gmail.com) show only the first line of such an email subject (only the first 48 characters).
What is the problem with my mail subject?
The problem is that you do not fold the header correctly according to RFC 2822. In a multi-line header field, each continuation line has to start with whitespace.
What you need to do is:
replace(UTL_ENCODE.MIMEHEADER_ENCODE(subject, 'UTF8', UTL_ENCODE.BASE64), UTL_TCP.CRLF, UTL_TCP.CRLF || ' ')
This should solve your problem.
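The same folding idea, sketched in Python rather than PL/SQL. The helper names are hypothetical; the encoded-words are built locally with `base64` instead of Oracle's `UTL_ENCODE`, and the standard `email.header` module is used only to check that the folded value decodes back to the original text.

```python
import base64
from email.header import decode_header, make_header


def encoded_word(text, charset="windows-1251"):
    """Build one RFC 2047 base64 encoded-word for a short piece of text."""
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return "=?{}?B?{}?=".format(charset, payload)


def fold_header_value(words):
    """Join encoded-words so every continuation line starts with a space."""
    return "\r\n ".join(words)


def unfold_and_decode(value):
    """Decode a folded header value back to text."""
    return str(make_header(decode_header(value)))
```

Whitespace between adjacent encoded-words is dropped on decoding, so the leading space on each continuation line does not end up in the displayed subject.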

What is the actual minimum length of an email address as defined by the IETF?

I'm specifically looking for the minimum length of the prefix and domain.
I've seen conflicting information and nothing that looks authoritative.
For reference, I found this page which claims that a one character email address is functional:
http://www.cjvandyk.com/blog/Lists/Posts/Post.aspx?ID=176
I tried validating email addresses at Gmail, which expects a prefix of at least 6 characters.
That is obviously way off.
My web framework expects a prefix of at least 2 characters.
The shortest valid email address may consist of only two parts: name and domain.
name@domain
Since both the name and domain may have the length of 1 character, the minimal total length resolves to 3 characters.
Well, the problem is really the question: it depends on whether the email is sent over the internet or within a closed system (e.g. an intranet). Over the internet, I believe x@y.zz is the shortest email address possible (e.g. Google's g.cn domain for China would give i@g.cn, which is 6 characters long). On an intranet, however, it is an entirely different matter, and i@y would be possible, which is just 3 characters long.
I believe the standard you are looking for is RFC 2822 - Internet Message Format
More specific info on email address restrictions in RFC 3696 - Section 3
To quote the spec:
Contemporary email addresses consist of a "local part" separated from a "domain part" (a fully-qualified domain name) by an at-sign ("@").
So three characters is the shortest.
I originally got this info from Phil Haack's blog post.
Many mail servers will not accept an email address unless there are at least 2 characters before the @.
That doesn't make it an invalid address, but if the servers don't know that, it can lead to a lot of problems.
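The length argument in the answers above can be sketched as a structural check. This is a deliberately minimal illustration, not a full address validator: the function name is hypothetical, and it only tests for a non-empty local part, an "@", and a non-empty domain.

```python
def has_minimal_shape(address):
    """True if the address is <non-empty local part>@<non-empty domain>."""
    # rpartition splits at the last "@", so the domain never contains one.
    local, at, domain = address.rpartition("@")
    return at == "@" and len(local) >= 1 and len(domain) >= 1


if __name__ == "__main__":
    for candidate in ("a@b", "i@g.cn", "@b", "a@", "ab"):
        print(candidate, len(candidate), has_minimal_shape(candidate))
```

Stricter minimums, such as the 2-character or 6-character prefixes mentioned above, are policies of individual servers or providers layered on top of this shape.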

Strip signatures and replies from emails

I'm currently working on a system that allows users to reply to notification emails that are sent out (sigh).
I need to strip out the replies and signatures, so that I'm left with the actual content of the reply, without all the noise.
Does anyone have any suggestions about the best way to do this?
If your system is in-house and/or you have a limited number of reply formats, it's possible to do a pretty good job. Here are the filters we have set up for email responses to trac tickets:
Drop all text after and including:
1. Lines that equal '-- \n' (standard email sig delimiter)
2. Lines that equal '--\n' (people often forget the space in sig delimiter; and this is not that common outside sigs)
3. Lines that begin with '-----Original Message-----' (MS Outlook default)
4. Lines that begin with '________________________________' (32 underscores, Outlook again)
5. Lines that begin with 'On ' and end with ' wrote:\n' (OS X Mail.app default)
6. Lines that begin with 'From: ' (failsafe for Outlook and some other reply formats)
7. Lines that begin with 'Sent from my iPhone'
8. Lines that begin with 'Sent from my BlackBerry'
Numbers 3 and 4 are 'begin with' instead of 'equals' because users sometimes squash lines together by accident.
We try to be liberal about stripping out replies, since having reply garbage is much more of an annoyance (to us) than correcting missing text.
Anybody have other formats from the wild that they want to share?
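The filters listed in this answer can be sketched as follows. This is one possible reading of the list, not the answerer's actual trac code; the names `CUT_EQUALS`, `CUT_PREFIXES`, `is_cut_line` and `strip_reply` are hypothetical.

```python
CUT_EQUALS = ("-- ", "--")  # signature delimiter, with and without the space
CUT_PREFIXES = (
    "-----Original Message-----",  # MS Outlook default
    "_" * 32,                      # 32 underscores, Outlook again
    "From: ",                      # failsafe for Outlook-style headers
    "Sent from my iPhone",
    "Sent from my BlackBerry",
)


def is_cut_line(line):
    """True if everything from this line onward should be dropped."""
    if line in CUT_EQUALS:
        return True
    if line.startswith(CUT_PREFIXES):
        return True
    # "On <date>, <name> wrote:" (Mail.app-style reply header)
    return line.startswith("On ") and line.endswith(" wrote:")


def strip_reply(text):
    """Keep only the lines before the first cut line."""
    kept = []
    for line in text.splitlines():
        if is_cut_line(line):
            break
        kept.append(line)
    return "\n".join(kept).rstrip()
```

Because it cuts at the first match, this errs on the side of dropping too much, which is the trade-off the answer describes.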
Check out the email_reply_parser gem - https://github.com/github/email_reply_parser . It does a nice job handling this problem.
I don't believe you can do this reliably (signatures used to begin with '--' but I don't see that anymore). Perhaps you're better off asking people to reply in between text headers and then simply extracting the reply from between them? It's not elegant, but perhaps more reliable.
e.g.
REPLY BETWEEN HERE -->
AND HERE -->
so you'd simply look for the required headers above and take what's in between.
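A minimal sketch of that marker approach, assuming the two marker strings shown above are inserted verbatim into the outgoing notification and survive in the reply. The function name is hypothetical.

```python
START_MARKER = "REPLY BETWEEN HERE -->"
END_MARKER = "AND HERE -->"


def extract_between_markers(text, start=START_MARKER, end=END_MARKER):
    """Return the text between the two markers, or None if either is missing."""
    start_at = text.find(start)
    if start_at == -1:
        return None
    body_from = start_at + len(start)
    # Search for the end marker only after the start marker.
    end_at = text.find(end, body_from)
    if end_at == -1:
        return None
    return text[body_from:end_at].strip()
```

Returning None when a marker is missing lets the caller fall back to another strategy, since mail clients may rewrap or delete the marker lines.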
If you want something powerful & robust, and don't mind reading academic publications, you might check out this:
Learning to Extract Signature and Reply Lines from Email
Here's the homepage for one of the authors, with more info & some downloads:
Vitor R. Carvalho - Software and Datasets - (Vitor Carvalho)
An approach that can be used for signatures only (in addition to detecting __ or --) is to test whether the first name and/or family name of the sender appears on a short line (containing roughly 3 to 4 words at most).
The sender name is in the raw email header, most of the time next to the email address, as in:
From: John Doe <jdoe@provider.com>
This is based on the assumption that you rarely write your own name in an email, and if you do, it is probably in a long sentence.
Of course there will be some false positives, but that may not be a big problem depending on what you do (we use it to fold quoted text and signatures into a Gmail-style ... button, so overdetection does not lose any content; it is just misplaced).
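The sender-name heuristic might look roughly like this. It is a sketch under the assumptions stated in the answer (a short line containing the sender's first or family name marks the start of the signature); the function names and the `max_words` default are hypothetical, and `email.utils.parseaddr` is used to pull the display name out of the From header.

```python
import re
from email.utils import parseaddr


def sender_name_tokens(from_header):
    """Lower-cased words of the display name in a From header value."""
    display_name, _address = parseaddr(from_header)
    return {token.lower() for token in display_name.split()}


def find_signature_start(lines, from_header, max_words=4):
    """Index of the first short line that mentions the sender, else None."""
    names = sender_name_tokens(from_header)
    for index, line in enumerate(lines):
        words = re.findall(r"\w+", line.lower())
        # Only short lines count; a name inside a long sentence is ignored.
        if 0 < len(words) <= max_words and names.intersection(words):
            return index
    return None
```

As the answer notes, this will produce false positives, so it suits uses where a wrong guess only hides text rather than deleting it.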
If you can assume that these emails are in plain text, just strip lines that begin with ">" as replies, and treat a "-- " line as the signature delimiter. But those assumptions might not hold, as not everyone on the internet uses software that complies with the rules.
There's a really nice PHP library dedicated to email reply parsing:
http://williamdurand.fr/EmailReplyParser/
https://github.com/willdurand/EmailReplyParser
I made one for Go: https://github.com/web-ridge/email-reply-parser. It detects signatures like:
Karen The Green
Graphic Designer
Office
Tel: +44423423423423
Fax: +44234234234234
karen@webby.com
Street 2, City, Zeeland, 4694EG, NL
www.thing.com
The content of this email is confidential and intended for the recipient specified in message only. It is strictly forbidden to share any part of this message with any third party, without a written consent of the sender. If you received this message by mistake, please reply to this message and follow with its deletion, so that we can ensure such a mistake does not occur in the future.
Met vriendelijke groeten,
Richard Lindhout
The recommended signature delimiter is "-- \n". If people follow this recommendation, stripping signatures should be easy.
