Documentation for SpamAssassin rules (HTML_30_40) - spamassassin

I'd like to refine the password reset mails sent by my web application so that they are not mistaken for spam; a customer forwarded a mail header to me which contains several SpamAssassin rule names.
I could find some of the rules, e.g. BAYES_40, but not the others; those are:
HTML_30_40
TO_NO_BRKTS_HTML_ONLY
TO_NO_BRKTS_NORDNS
TO_NO_BRKTS_NORDNS_HTML
What do these rules mean; are there documentation pages somewhere?
The SpamAssassin which reported them is version 3.3.2; the latest version as of now is 3.4.1. Do those rules still exist?

The HTML_30_40 rule is no longer included in SpamAssassin, but if I remember correctly it was a test that fired when the email consisted of 30-40% HTML markup. I cannot see why that has any relevance for spam filtering, and that is probably why it is no longer present. :)
Those other rules still exist in SpamAssassin version 3.4.1. There is no explicit documentation per rule, other than an occasional comment or description along the rule implementation itself:
describe TO_NO_BRKTS_HTML_ONLY To: misformatted and HTML only
describe TO_NO_BRKTS_NORDNS_HTML To: misformatted and no rDNS and HTML only
You are probably sending emails from an IP address with no reverse DNS name, and the To: line is poorly formatted. Things should improve significantly if you get the DNS problems fixed (or relay the emails via your ISP) and format the To: line in the email properly, e.g.
To: "J Random User" <jrnd@email>
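As a rough illustration of the advice above, here is a minimal Python sketch of a password reset mail with a "Name <address>" To: header and a plain-text part next to the HTML part, so the message is not HTML-only. The sender, recipient, subject and reset URL are placeholders (assumptions, not taken from the question), and how much this changes the score on a particular SpamAssassin setup is not guaranteed.

```python
from email.message import EmailMessage
from email.utils import formataddr


def build_reset_mail(name, address, reset_url):
    """Build a multipart/alternative mail with a 'Name <addr>' To: header."""
    msg = EmailMessage()
    # Placeholder sender; use your application's real address.
    msg["From"] = formataddr(("Example App", "noreply@example.com"))
    msg["To"] = formataddr((name, address))
    msg["Subject"] = "Reset your password"
    # Plain-text body first, then the HTML alternative.
    msg.set_content(f"Open this link to reset your password:\n{reset_url}\n")
    msg.add_alternative(
        f'<p>Open <a href="{reset_url}">this link</a> to reset your password.</p>',
        subtype="html",
    )
    return msg


msg = build_reset_mail("J Random User", "jrnd@example.com",
                       "https://example.com/reset/TOKEN")
print(msg["To"])
print(msg.get_content_type())
```

The reverse DNS issue cannot be fixed in application code; that part has to be solved on the sending host or by relaying through a properly configured mail server.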

Related

MS Access email validation rule fails

I have used the rule ((Like "*?@?*.?*") And (Not Like "*[ ,;]*")) in MS Access for email validation. It works fine, but when I type email@youdomain.com@@@hello it is accepted too, despite the extra @ signs. How can I solve this? The rule is taken from here
You can't reliably validate e-mail addresses using an Access SQL statement, or a regex for that matter. See this for an example of a regex that still only works on prepared mail addresses; Access SQL is substantially more limited than regex for text pattern matching.
However, fixing this specific issue is easy:
Just add Not Like "*@*@*" to your statement to disallow multiple @ characters:
((Like "*?@?*.?*") And (Not Like "*[ ,;]*")) And Not Like "*@*@*"
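To see what the three conditions accept and reject, the rule can be mirrored outside Access. The sketch below uses Python's fnmatch module as a stand-in; it is not Access, but for the wildcards used here (*, ? and a bracketed character list) its matching behaves the same way as far as I know. The function name passes_rule is made up for illustration.

```python
from fnmatch import fnmatchcase


def passes_rule(value):
    """Mirror the three Like conditions of the validation rule."""
    return (
        fnmatchcase(value, "*?@?*.?*")         # something@something.something
        and not fnmatchcase(value, "*[ ,;]*")  # no space, comma or semicolon
        and not fnmatchcase(value, "*@*@*")    # no second @ sign
    )


for candidate in ["email@youdomain.com",
                  "email@youdomain.com@@@hello",
                  "a b@c.de"]:
    print(candidate, passes_rule(candidate))
```

Only the first candidate passes; the second is rejected by the added condition and the third by the character-list condition.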

What is spifno1stsp really doing as a rsyslog property?

I was reading the template documentation of rsyslog to find better properties and I stumbled upon this one:
spifno1stsp - expert options for RFC3164 template processing
However, as you can see, the documentation is quite vague. Moreover, I have not been able to find a longer explanation anywhere. The only mentions found with Google are always about the same snippet or the same very short description.
Indeed, there is no explanation of this property:
on the entire rsyslog.com website,
or in the RFC3164,
or anywhere else actually.
It seems as if everybody copies and pastes the same snippet here and there, but it is very difficult to understand what it actually does.
Any ideas?
Think of it as somewhat like an if statement. If a space is present, don't do anything. Otherwise, if a space is not present, add a space.
It is useful for ensuring that just one space is added to the output, often between two strings.
For any cases like this that you find where the docs can be improved please feel free to open an issue with a request for clarification in the official GitHub rsyslog documentation project. The documentation team is understaffed, but team members will assist where they can.
If you're looking for general help, the rsyslog-users mailing list is also a good resource. I've learned a lot over the years by going over the archives and reading prior threads.
Back to your question about the spifno1stsp option:
While you will get a few hits for that option, you will probably find more results by searching for the older string template option, sp-if-no-1st-sp. Here is an example of its use from the documentation page you linked to:
template(name="forwardFormat" type="string"
string="<%PRI%>%TIMESTAMP:::date-rfc3339% %HOSTNAME% %syslogtag:1:32%%msg:::sp-if-no-1st-sp%%msg%"
)
Here is the specific portion that is relevant here:
`%msg:::sp-if-no-1st-sp%%msg%`
From the Property Replacer documentation:
sp-if-no-1st-sp
This option looks scary and should probably not be used by a user. For
any field given, it returns either a single space character or no
character at all. Field content is never returned. A space is returned
if (and only if) the first character of the field’s content is NOT a
space. This option is kind of a hack to solve a problem rooted in RFC
3164: 3164 specifies no delimiter between the syslog tag sequence and
the actual message text. Almost all implementation in fact delimit the
two by a space. As of RFC 3164, this space is part of the message text
itself. This leads to a problem when building the message (e.g. when
writing to disk or forwarding). Should a delimiting space be included
if the message does not start with one? If not, the tag is immediately
followed by another non-space character, which can lead some log
parsers to misinterpret what is the tag and what the message. The
problem finally surfaced when the klog module was restructured and the
tag correctly written. It exists with other message sources, too. The
solution was the introduction of this special property replacer
option. Now, the default template can contain a conditional space,
which exists only if the message does not start with one. While this
does not solve all issues, it should work good enough in the far
majority of all cases. If you read this text and have no idea of what
it is talking about - relax: this is a good indication you will never
need this option. Simply forget about it ;)
In short, sp-if-no-1st-sp (string template option) is analogous to spifno1stsp (standard template option).
Hope that helps.
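To make the behaviour concrete, here is a small Python model of what the quoted documentation describes. It is only an illustration, not rsyslog code; the function names are invented, and returning a space for an empty field is my assumption, since the documentation does not spell that case out.

```python
def sp_if_no_1st_sp(field):
    """Return ' ' unless the field starts with a space; never the field itself."""
    # Assumption: an empty field also yields a single space.
    return "" if field.startswith(" ") else " "


def render(tag, msg):
    # Rough model of "%syslogtag%%msg:::sp-if-no-1st-sp%%msg%"
    return tag + sp_if_no_1st_sp(msg) + msg


# Both messages end up with exactly one space after the tag.
print(repr(render("kernel:", " eth0 up")))
print(repr(render("kernel:", "eth0 up")))
```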

Trying to figure out spamassassin globbing rules

How do the globbing rules for spamassassin work? I've looked at the docs, but they are not clear as to whether sub-domains are included in a whitelist rule. For example, does:
whitelist_from *@somewhere.com
also whitelist addresses from subdomain.somewhere.com? This seems not to be the case, as mails from subdomains are still labeled as spam if they fail checking.
Should I use something like this:
whitelist_from *@*.somewhere.com
I've added this for some addresses to find out, and it passes spamassassin --lint, but it may be a while before I get another email from one of those subdomains, so I thought I'd just ask here.
Thanks
I eventually found the answer. I can use the whitelist_from_rcvd directive instead.
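For what it's worth, the SpamAssassin documentation describes these address patterns as file-glob-style, with only * and ? acting as wildcards and matching done case-insensitively. Under that reading, the following Python sketch shows why the first pattern does not cover subdomains. It is a model of the matching as I understand it, not SpamAssassin's own code, and glob_matches is a made-up helper name.

```python
import re


def glob_matches(pattern, address):
    """Match a glob-style pattern where only * and ? are wildcards."""
    # Escape everything, then turn the escaped wildcards back into regex.
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.fullmatch(regex, address, flags=re.IGNORECASE) is not None


for pattern in ["*@somewhere.com", "*@*.somewhere.com"]:
    for address in ["user@somewhere.com", "user@subdomain.somewhere.com"]:
        print(pattern, address, glob_matches(pattern, address))
```

In this model each pattern matches only one of the two addresses, so covering both the domain and its subdomains takes two whitelist entries.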

Match all email addresses belonging to a specific domain and its subdomains

I am looking to match all email addresses from a specific domain.
Any email coming from example.com or foo.example.com should match, everything else should be rejected. To do this, I could do some basic string matching to check if the given string ends with, or contains, example.com which would work fine but it also means that something like fooexample.com will pass.
Hence, based on the above requirements, I started working on a pattern that would pass the domain and its sub-domain. I was able to come up with the following regex pattern:
`/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.example.com\b/i`
This only matched subdomains, but I have seen the pattern at "How to match all email addresses at a specific domain using regex?" which handles the main domain.
Is there a way to combine these two into something that works for any address from example.com?
How about
/\b(?:(?![_.-])(?!.*[_.-]{2})[a-z0-9_.-]+(?<![_.-]))@(?:(?!-)(?!.*--)[a-z0-9-]+(?<!-)\.)*example\.com\b/i
This one would also match 'tagged' and 'tagged-subdomain' mails like a+b@example.com and a+b@i.example.com
(([A-Za-z0-9]+_+)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*[A-Za-z0-9]+@(?:(?!-)(?!.*--)[a-z0-9-]+(?<!-)\.)*example\.com\b
Hope it helps you
I'd recommend reading "Stop Validating Email Addresses With Your Complex Regex".
From that point, I'd look for:
/@.*\bexample\.com/
For instance:
%w[foo@example.com foo@barexample.com foo@subdomain.example.com].grep(/@.*\bexample\.com/)
=> ["foo@example.com", "foo@subdomain.example.com"]
It's too easy to end up with a regex that is a maintenance nightmare, and that doesn't accomplish what you need. I highly recommend keeping it simple.
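In the same keep-it-simple spirit, the check can also be done without a regex at all: take whatever follows the last @ and compare it with the domain. The Python sketch below does that; is_from_domain is a hypothetical helper, and it deliberately does not validate the local part.

```python
def is_from_domain(address, domain="example.com"):
    """True if the part after the last @ is the domain or one of its subdomains."""
    _, sep, host = address.rpartition("@")
    host = host.lower()
    # The leading dot keeps fooexample.com from passing as a subdomain.
    return bool(sep) and (host == domain or host.endswith("." + domain))


for address in ["foo@example.com", "foo@barexample.com",
                "foo@subdomain.example.com", "foo@example.com.evil.org"]:
    print(address, is_from_domain(address))
```

Unlike a pattern that only looks for example.com somewhere after the @, this also rejects hosts that merely start with the domain, such as example.com.evil.org.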

Check well formatted email address

I have a text file of e-mails like this:
10:info@example.com;dev@example.com
12:john@host.com; "George <g.top@host.com>"
43:jim.p@web.com.;sue-allen@web.com
...
I want to check whether the list contains well formatted entries. Do you know any tool or web-service to check and give me a list of invalid addresses?
Update Dear all, thank you for your input. I was really looking for a basic syntax check, so I will stay with Rafe's idea (I will do it with Java).
Read this so you are doing it the RFC compliant way:
http://www.eph.co.uk/resources/email-address-length-faq/
Probably the simplest way to validate an email is to send a message to it. As Sean points out this can leave you open to DoS attacks, but from your description it seems you have a text file rather than a web page, so this shouldn't be a problem.
Regular expressions are not a good tool for matching emails; there are a lot of valid addresses that naive matching will reject. Check out this comparison of attempts to validate emails with regex for details.
If you have to check them offline, I would split the email into parts (i.e. the parts before and after the @); you could then create a custom validator (or regex) to validate those parts.
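A minimal Python sketch of that split-and-validate idea, applied to the file format from the question, could look like the following. The two patterns are a deliberately conservative approximation (unquoted local part, dot-separated host labels) and not a full RFC implementation, so some unusual but valid addresses will be reported; all names here are illustrative.

```python
import re

# Conservative approximations, not the full RFC grammar.
LOCAL_PART = re.compile(
    r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
)
DOMAIN_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_plausible(address):
    """Check the parts before and after the last @ separately."""
    local, sep, domain = address.rpartition("@")
    if not sep or len(local) > 64 or len(domain) > 255:
        return False
    labels = domain.split(".")
    return (
        LOCAL_PART.fullmatch(local) is not None
        and len(labels) >= 2
        and all(DOMAIN_LABEL.fullmatch(label) for label in labels)
    )


def invalid_entries(lines):
    """Return (line number, entry) pairs for entries that fail the check."""
    bad = []
    for line in lines:
        number, _, rest = line.partition(":")
        for entry in rest.split(";"):
            entry = entry.strip()
            if entry and not is_plausible(entry):
                bad.append((number, entry))
    return bad


sample = [
    "10:info@example.com;dev@example.com",
    '12:john@host.com; "George <g.top@host.com>"',
    "43:jim.p@web.com.;sue-allen@web.com",
]
for number, entry in invalid_entries(sample):
    print(number, entry)
```

On the sample lines this reports the quoted "George" entry on line 12 and the address with the trailing dot on line 43.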
Email validation is not as simple as a regular expression
First, I would read this article I Knew How To Validate An Email Address Until I Read The RFC.
Back in the days of yore, you could just connect to the user's mail server and use the VRFY command and verify that an email address was valid, but spammers abused that privilege and we all lost out.
Now, I would recommend a three part approach:
1. Verify the syntactic validity. You can use the monster regex from the Mail perl module to check that the email address is well formed. Then make sure to blacklist localhost domains/IPs as part of your check.
2. Verify that the domain is live. Do a DNS validation check on the domain. You could take this one step further and use an SMTP check to make sure that you can connect to a valid mail server for the domain. However, there may be some false negative results due to virtual hosting schemes.
3. Send an actual email, but include a single image that links to a script on your server. When the email is read with the image, your server will be notified that the image was downloaded and hence that the address is alive and valid. However, nowadays many email clients do not load images by default for this very reason, so it won't be 100% effective.
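The "domain is live" check can be approximated with the Python standard library alone, as in the sketch below. Note the simplification: socket.getaddrinfo only tells you that the name resolves to an address; it does not look up MX records, which would need a DNS library. The function name and the resolver parameter (there so the lookup can be replaced in tests) are my own additions.

```python
import socket

# Hosts that should never count as a deliverable mail domain.
LOCAL_NAMES = {"localhost", "localhost.localdomain"}


def domain_resolves(domain, resolver=socket.getaddrinfo):
    """Return True if the domain resolves to at least one address."""
    if domain.lower() in LOCAL_NAMES:
        return False
    try:
        # Port 25 is only used to build the lookup; no connection is made.
        return bool(resolver(domain, 25))
    except socket.gaierror:
        return False
```

Calling domain_resolves("example.com") performs a real lookup, so results depend on the network and resolver configuration of the machine running the check.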
Resources
Validating Email Addresses in ASP (online)
Validating Email Addresses in PHP (code examples)
This commercial product does bulk email verification ← This is probably what you are looking for
SO Question: How to check if an email address exists without sending an-email
I wrote a simple Perl script that uses the Email::Address module to validate these addresses:
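#!/usr/bin/env perl
use strict;
use warnings;
use Email::Address;

# Read the list line by line and print every entry that fails to parse.
while (<>) {
    chomp;
    my @addresses = split /;/;
    foreach my $address (@addresses) {
        # parse() returns the addresses it found; none means the entry is malformed
        if (!Email::Address->parse($address)) {
            print $address, "\n";
        }
    }
}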
You'll just need to install the module. Its home page is:
http://emailproject.perl.org/wiki/Email::Address
This problem is harder than it appears. When faced with it, I stole the code from the mf.c module in the NMH sources. I then imported the address parser into Lua so I could handle email addresses from scripts.
Using somebody else's code saved me a world of pain.
