Can you write rules in a programmatic if-then style? - spamassassin

Can you do if-then style statements in SpamAssassin?
I get spam sent to me that uses my email address for the sender's name and I would like to write a general rule for this.
For example, I receive spam messages with From: and To: lines like this:
From: "me#mydomain.org" <spam#spam.com>
To: <me#mydomain.org>
Below I refer to this format as:
From: "Name" <address>
To: <address>
Is it possible to write a rule that says:
if
the (From: name)
is equal to (To: email address)
but not the (From: email address)
then
give it a score?
I am thinking this specifically in case my server automatically sends messages in a similar format, such as: "root#mydomain.org" <root#mydomain.org>.
I don't want the rule to accidentally score emails like that.
I only see how to write positive rules. So I can look for these kinds of simple matches
header LOCAL_FROM_NAME_MyAddress From =~ /\"me#mydomain.org\"/
header LOCAL_FROM_Address_MyAddress From =~ /<me#mydomain.org>/
header LOCAL_TO_Address_MyAddress To =~ /<me#mydomain.org>/
So I could create a score if they all produced a match:
meta LOCAL_FROM_ME_TO_ME ((LOCAL_FROM_NAME_MyAddress + LOCAL_FROM_Address_MyAddress + LOCAL_TO_Address_MyAddress) >2)
score LOCAL_FROM_ME_TO_ME -0.1
But that is as far as I can go. I haven't seen any way to do something more complex.

SpamAssassin meta rules support boolean expressions, so you can use the &&, ||, and ! operators to create more complex matches. In the specific example you've given, the rule is logically equivalent to:
(FROM_NAME equals MyAddress) and (FROM_ADDR does not equal MyAddress)
A ruleset to express this could be:
header __LOCAL_FROM_NAME_MyAddress From:name =~ /me\#mydomain\.org/
header __LOCAL_FROM_ADDR_MyAddress From:addr =~ /me\#mydomain\.org/
meta LOCAL_SPOOFED_FROM (__LOCAL_FROM_NAME_MyAddress && !__LOCAL_FROM_ADDR_MyAddress)
score LOCAL_SPOOFED_FROM 5.0
If meta rules and boolean expressions are not enough, you can write a Perl plugin. Check out the many examples on CPAN, and perhaps specifically Mail::SpamAssassin:FromMailSpoof.
Notes
You can write :name and :addr to parse specific parts of the From and To headers.
You can prefix your sub-rules with __ so that they will not score on their own.
Special characters like # and . should be escaped in regex patterns.
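To sanity-check the general if-then logic from the question outside SpamAssassin, here is a minimal Ruby sketch. It is not SpamAssassin code: parse_address and spoofed_from? are hypothetical helpers, and the parsing assumes the simple "Name" <address> layout shown in the question rather than full RFC 5322 syntax.

```ruby
# Hypothetical helper: split a header value such as '"Name" <addr>' into
# its display name and address. Only handles the simple layout from the
# question, not every legal address format.
def parse_address(header)
  m = header.match(/\A\s*(?:"?([^"<]*?)"?\s*)?<([^>]+)>\s*\z/)
  return nil unless m
  { name: m[1].to_s.strip, addr: m[2].strip }
end

# True when the From display name equals the To address
# but the From address does not.
def spoofed_from?(from_header, to_header)
  from = parse_address(from_header)
  to = parse_address(to_header)
  return false unless from && to
  from[:name].casecmp?(to[:addr]) && !from[:addr].casecmp?(to[:addr])
end

puts spoofed_from?('"me#mydomain.org" <spam#spam.com>', '<me#mydomain.org>')
puts spoofed_from?('"root#mydomain.org" <root#mydomain.org>', '<root#mydomain.org>')
```

The second call models the server-generated message from the question, which should not be scored.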

Related

Jflex ambiguity

I have these two rules from a jflex code:
Bool = true
Ident = [:letter:][:letterdigit:]*
If I try, for example, to analyse the word "trueStat", it gets recognized as an Ident expression and not Bool.
How can I avoid this type of ambiguity in Jflex?
In almost all languages, a keyword is only recognised as such if it is a complete word. Otherwise, you would end up banning identifiers like format, downtime and endurance (which would instead start with the keywords for, do and end, respectively). That's quite confusing for programmers, although it's not unheard-of. Lexical scanner generators like Flex and JFlex generally try to make the common case easy, which is why the snippet you provide recognises trueStat as an identifier. But if you really want to recognise it as a keyword followed by an identifier, you can accomplish that by adding trailing context to all your keywords:
Bool = true/[:letterdigit:]*
Ident = [:letter:][:letterdigit:]*
With that pair of patterns, true will match the Bool rule, even if it occurs as trueStat. The pattern matches true and any alphanumeric string immediately following it, and then rewinds the input cursor so that the token matched is just true.
Note that like Lex and Flex, JFlex accepts the longest match at the current input position; if more than one rule accepts this match, the action corresponding to the first such rule is executed. (See the manual section "How the Input is Matched" for a slightly longer explanation of the matching algorithm.) Trailing context is considered part of the match for the purposes of this rule (but, as noted above, is then removed from the match).
The consequence of this rule is that you should always place more specific patterns before the general patterns they might override, whether or not the specific pattern uses trailing context. So the Bool rule must precede the Ident rule.
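The matching rule described above (longest match wins, first rule wins ties, trailing context counts toward the length but is not part of the token) can be imitated in a few lines of Ruby. This is only an illustrative sketch, not how JFlex is implemented: next_token and RULES are made-up names, Ruby's [[:alpha:]] and [[:alnum:]] stand in for JFlex's [:letter:] and [:letterdigit:], and Ruby's greedy regex matching is assumed to give the longest match for these simple patterns.

```ruby
# Each rule: [token name, pattern, optional trailing-context pattern].
RULES = [
  [:BOOL,  /true/,                    /[[:alnum:]]*/],
  [:IDENT, /[[:alpha:]][[:alnum:]]*/, nil],
].freeze

# Returns [token name, token text] for the best match at the start of input,
# or nil when no rule matches.
def next_token(input, rules = RULES)
  best = nil
  rules.each do |name, pattern, trailing|
    full = trailing ? /\A(#{pattern})#{trailing}/ : /\A(#{pattern})/
    m = full.match(input)
    next unless m
    # m[0] includes the trailing context, m[1] is the token itself.
    # Only a strictly longer match replaces the current best,
    # so on a tie the earlier rule is kept.
    best = [name, m[1], m[0].length] if best.nil? || m[0].length > best[2]
  end
  best && best.first(2)
end

p next_token("trueStat")                # Bool rule listed first
p next_token("trueStat", RULES.reverse) # Ident rule listed first
```

With the rules in the order shown, both rules match eight characters and the first one wins; with the order reversed, the identifier rule wins the same tie.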

how to test my Grammar antlr4 successfully? [duplicate]

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser, and the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every string matched by TITLE is also matched by FILEPATH, and FILEPATH is defined before TITLE. So each token that you expect to be a TITLE is emitted as a FILEPATH.
There are a few hints for that:
keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them in the right order (in your case this will be sufficient).
if you need a parser-driven lexer, you have to change to another parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems).
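To see the effect of rule order on the sample input, here is a rough Ruby imitation of the lexing procedure described above. It is a sketch under assumptions, not ANTLR itself: tokenize is a hypothetical function, and the three regexes are hand translations of the lexer rules from the question.

```ruby
require "strscan"

FILEPATH = /[A-Za-z0-9:\\\/ \-_.]+/
NEWLINE  = /\r?\n/
TITLE    = /[A-Za-z ]+/

# Splits input into [rule name, text] pairs. At each position the longest
# match wins; when several rules match the same length, the rule listed
# first wins.
def tokenize(input, rules)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    best_name = nil
    best_len = 0
    rules.each do |name, pattern|
      len = scanner.match?(pattern) # match length at current position, or nil
      if len && len > best_len
        best_name = name
        best_len = len
      end
    end
    raise "no rule matches at offset #{scanner.pos}" if best_name.nil?
    tokens << [best_name, scanner.peek(best_len)]
    scanner.pos += best_len
  end
  tokens
end

input = "c:\\test.txt\nx"
original  = [[:FILEPATH, FILEPATH], [:NEWLINE, NEWLINE], [:TITLE, TITLE]]
reordered = [[:TITLE, TITLE], [:FILEPATH, FILEPATH], [:NEWLINE, NEWLINE]]

p tokenize(input, original).map(&:first)
p tokenize(input, reordered).map(&:first)
```

In the original order the final x is tagged FILEPATH, so a parser rule expecting TITLE cannot match. With TITLE listed first, the x becomes a TITLE, while the path still lexes as FILEPATH because that match is longer.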
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason for me was that I had placed the new keyword after my VARNAME lexer rule, which tokenized it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

How to write a regex to match .com or .org with a "-" in the domain name

How do I write a regex in Ruby that will look for a "-" and ".org" or ".com", like:
some-thing.org
some-thing.org.sg
some-thing.com
some-thing.com.sg
some-thing.com.* (there are too many countries so for now any suffix is fine- I will deal with this problem later )
but not:
some-thing
some-thing.moc
I wrote : /.-.(org)?|.*(.com)/i
but it fails to stop "some-thing" or "some-thing.moc" :(
Support optional hyphen
I can come up with this regex:
(https?:\/\/)?(www\.)?[a-z0-9-]+\.(com|org)(\.[a-z]{2,3})?
Keep in mind that I used capturing groups for simplicity, but if you want to avoid capturing the content you can use non capturing groups like this:
(?:https?:\/\/)?(?:www\.)?[a-z0-9-]+\.(?:com|org)(?:\.[a-z]{2,3})?
^--- Notice "?:" to use non capturing groups
Additionally, if you don't want to use protocol and www pattern you can use:
[a-z0-9-]+\.(?:com|org)(?:\.[a-z]{2,3})?
Support mandatory hyphen
However, as Greg Hewgill pointed out in his comment, if you want to ensure there is at least one hyphen, you can use this regex:
(?:https?:\/\/)?(?:www\.)?[a-z0-9]+(?:[-][a-z0-9]+)+\.(?:com|org)(?:\.[a-z]{2,3})?
Be aware, though, that this regex can run into heavy backtracking on some inputs.
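As a quick check against the examples from the question, here is a small Ruby sketch of the mandatory-hyphen variant. Two things are assumptions added here, not part of the answer above: the optional protocol and www parts are dropped, and the pattern is anchored with \A and \z so that the whole string has to be a host name. HYPHENATED_DOMAIN and hyphenated_com_or_org? are made-up names.

```ruby
# Host names with at least one hyphen, ending in .com or .org,
# optionally followed by a two or three letter suffix such as .sg.
HYPHENATED_DOMAIN = /\A[a-z0-9]+(?:-[a-z0-9]+)+\.(?:com|org)(?:\.[a-z]{2,3})?\z/i

def hyphenated_com_or_org?(host)
  HYPHENATED_DOMAIN.match?(host)
end

samples = %w[
  some-thing.org some-thing.org.sg some-thing.com some-thing.com.sg
  some-thing some-thing.moc something.com
]
samples.each { |host| puts "#{host}: #{hyphenated_com_or_org?(host)}" }
```

Without the anchors the pattern would also find a match inside longer strings, which may or may not be what you want when scanning free text.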
This may help :
/[a-z0-9]+-?[a-z0-9]+\.(org|com)(\.[a-z]+)?/i
It matches '-' in the middle optionally, i.e. still matches names without '-'.
I had a similar issue when I was writing an HTTP server...
... I ended up using the following Regexp:
m = url.match /(([a-z0-9A-Z]+):\/\/)?(([^\/\:]+))?(:([0-9]+))?([^\?\#]*)(\?([^\#]*))?/
m[2] # => requested protocol (optional), e.g. https, http, ws, ftp (m[1] also includes the "://")
m[4] # => host name (optional), e.g. www.my-site.com
m[6] # => port (optional)
m[7] # => encoded URI path, e.g. /index.htm
If what you are trying to do is validate a host name, you can simply make sure it doesn't contain the few illegal characters (:, /) and contains at least one dot separated string.
If you want to validate only .com or .org (+ country codes), you can do something like this:
def is_legit_url?(url)
  allowed_master_domains = %w{com org}
  allowed_country_domains = %w{sg it uk}
  # The country alternatives need their own group; otherwise the "|" would
  # split the pattern into "\.sg", "it" and "uk".
  url.match(/[^\/\:]+\.(?:#{allowed_master_domains.join '|'})(?:\.(?:#{allowed_country_domains.join '|'}))?/i) && true
end
* Notice that certain countries use .co instead, e.g. the UK uses www.amazon.co.uk
I would convert the Regexp to a constant, for performance reasons:
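module MyURLReview
  ALLOWED_MASTER_DOMAINS = %w{com org}
  ALLOWED_COUNTRY_DOMAINS = %w{sg it uk}
  # Built once at load time. The country alternatives are wrapped in their
  # own group so that the "|" does not split the whole pattern.
  DOMAINS_REGEXP = /[^\/\:]+\.(?:#{ALLOWED_MASTER_DOMAINS.join '|'})(?:\.(?:#{ALLOWED_COUNTRY_DOMAINS.join '|'}))?/i
  def self.is_legit_url?(url)
    url.match(DOMAINS_REGEXP) && true
  end
end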
Good Luck!
/[a-zA-Z0-9]-[a-zA-Z0-9]+?\.(?:org|com)\.?/
Of course, the above could be simplified depending on how lenient your rules are. The following is a simpler pattern, but would allow s0me-th1ng.com-plete to pass through:
/\w-\w+?\.(?:org|com)\b/
You could use a lookahead:
^(?=[^.]+-[^.]+)([^.]+\.(?:org|com).*)
Assuming you are looking for the general pattern of letters-letters where letters could be Unicode, you can do:
^(?=\p{L}+-\p{L}+)([^.]+\.(?:org|com).*)
If you want to add digits:
^(?=[\p{L}0-9]+-[\p{L}0-9]+)([^.]+\.(?:org|com).*)
So that you can match sòme1-thing.com
(Ruby 2.0+ for \p{L} I think...)

spamassassin filter for custom expression

I would like to add a rule that blocks all incoming e-mails that contain a certain expression. Ex: 'Test Phrase'. I have added the line
rawbody NO_SPAMW /Test" *"Phrase/i
but it seems it doesn't work. Can you tell me what is the correct way to parse a space to spamassassin?
Thank you!
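You can match whitespace, including a plain space, with \s. The quote characters in your pattern are matched literally, which is why it does not match.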
rawbody TEST_PHRASE /test\s*phrase/i
score TEST_PHRASE 0.1
describe TEST_PHRASE This is a test
More about writing custom rules here

How do you check for a changing value within a string

I am doing some localization testing and I have to test for strings in both English and Japanese. The English string might be 'Waiting time is {0} minutes.' while the Japanese string might be '待ち時間は{0}分です。' where {0} is a number that can change over the course of a test. Both of these strings come from their respective property files. How would I be able to check for the presence of the string as well as the number, which can change depending on the test that's running?
I should have added that I'm checking these strings on a web page, which is displayed in the relevant language depending on the location it is being viewed from. And I'm using Watir to verify the text.
You can read elsewhere about various theories of the best way to do testing for proper language conversion.
One typical approach is to replace all hard-coded text matches in your code with constants, and then have a file that sets the constants, which can be updated based on the language in use. (I've seen that done by wrapping the require of that file in a case statement based on the language being tested.) Another approach is an array or hash for each value, indexed by a variable with a name like 'language', which lets the tests change the language on the fly. So validations would look something like this:
b.div(:id => "wait-time-message").text.should == WAIT_TIME_MESSAGE[language]
To match text where part is expected to change but falls within a predictable pattern, use a regular expression. I'd recommend a little reading about regular expressions in Ruby, especially Unicode regular expressions, as well as some experimenting with a tool like Rubular to test regexes.
In the case above a regex such as:
/Waiting time is \d+ minutes\./ or /待ち時間は\d+分です。/
would match the messages above and expect one or more digits in the middle. (Note that it would fail if no digits appear; if you want zero or more digits, you would need a * in place of the +.)
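Since the strings in the question come from property files with a {0} placeholder, the regex can also be derived from the template instead of being typed by hand. The following Ruby sketch is one way to do that under stated assumptions: template_to_regex and extract_number are hypothetical helper names, only a single {0} placeholder is handled, and the placeholder is assumed to be a run of ASCII digits.

```ruby
# Turn a template such as "Waiting time is {0} minutes." into an anchored
# regex that captures the number in place of {0}.
def template_to_regex(template)
  escaped = Regexp.escape(template)
  # Regexp.escape turns "{0}" into "\{0\}"; swap that for a digit capture.
  pattern = escaped.sub('\{0\}') { '(\d+)' }
  Regexp.new("\\A#{pattern}\\z")
end

# Returns the number found in text, or nil when text does not fit the template.
def extract_number(template, text)
  m = template_to_regex(template).match(text)
  m && m[1].to_i
end

p extract_number("Waiting time is {0} minutes.", "Waiting time is 12 minutes.")
p extract_number("待ち時間は{0}分です。", "待ち時間は7分です。")
```

In a Watir test the text argument would be whatever the page element returns, so the same check works for either language as long as the matching template is used.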
Don't check for the literal string. Check for some kind of intermediate form that can be used to render the final string.
Sometimes this is done by specifying a message and any placeholder data, like:
[ :waiting_time_in_minutes, 10 ]
Where that would render out as the appropriate localized text.
An alternative is to treat one of the languages as a template, something that's more limited in flexibility but works most of the time. In that case you could use the English version as the string that's returned and use a helper to render it to the final page.
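A minimal Ruby sketch of the intermediate-form idea might look like the following. Everything here is illustrative: MESSAGES stands in for the real property files, render is a hypothetical helper, and the {0} placeholder syntax is taken from the question.

```ruby
# Stand-in for the per-language property files.
MESSAGES = {
  en: { waiting_time_in_minutes: "Waiting time is {0} minutes." },
  ja: { waiting_time_in_minutes: "待ち時間は{0}分です。" },
}.freeze

# Render an intermediate form such as [:waiting_time_in_minutes, 10]
# into the text for the given language.
def render(message, language)
  key, *args = message
  template = MESSAGES.fetch(language).fetch(key)
  # Replace each {n} with the n-th placeholder value.
  template.gsub(/\{(\d+)\}/) { args.fetch(Regexp.last_match(1).to_i).to_s }
end

message = [:waiting_time_in_minutes, 10]
puts render(message, :en)
puts render(message, :ja)
```

A test can then assert on the message key and value directly, and leave the rendered text to a much smaller set of per-language checks.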
