How to split single mail with procmail? - bash

I have a quarantine folder that I periodically have to download and split by recipient inbox or, even better, split into one text file per message. I have ca. 10,000 mails per day and I'm coding something with fetchmail and procmail. The problem is that I can't find out how to split message by message in procmail; they all end up in the same inbox.
I tried to pass every message in a script via a recipe like:
:0
| script_processing_messages.sh
Which contained
read varname
echo "$varname" > test_file
To try to see if I could obtain a single message in the $varname variable but nope, I only obtain a single line of a message each time.
Right now I use
fetchmail --keep
where .fetchmailrc is
poll mail.mymta.my protocol pop3 username "my@inbox.com" password "****" mda "procmail /root/.procmailrc"
and .procmailrc is
VERBOSE=0
DEFAULT=/root/inbox.quarantine
I would like to obtain a file for each message, so:
1.txt
2.txt
3.txt
[...]
10000.txt
I have many recipients and many domains, so I can't write, say, 5000 rules to match every recipient. It would be good if there was some kind of
^To: $USER
that redirects to
/$USER.inbox
so that procmail itself takes care of reading and dynamically creating these inboxes.
I'm not very experienced with fetchmail and procmail recipes. I'm trying hard, but I'm not getting very far.

You seem to have two or three different questions; proper etiquette on Stack Overflow would be to ask each one separately - this also helps future visitors who have just one of your problems.
First off, to split a Berkeley mbox file containing multiple messages and run Procmail on each separately, try
formail -s procmail -m <file.mbox
You might need to read up on the mailbox formats supported by Procmail. A Berkeley mailbox is a single file which contains multiple messages, simply separated by a line beginning with From (with a space after the four alphabetic characters). This separator has to be unique, and so a message which contains those five characters at beginning of a line in the body will need to be escaped somehow (typically by writing a > before From).
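To make the separator rule concrete, here is a minimal sketch that splits a Berkeley mbox into numbered files (1.txt, 2.txt, ...) with awk. The function name split_mbox and its arguments are made up for illustration, and it assumes that body lines starting with "From " have already been escaped as described above; an unescaped one would be treated as the start of a new message.

```shell
# split_mbox MBOX OUTDIR (hypothetical helper):
# write each message of a Berkeley mbox to OUTDIR/1.txt, OUTDIR/2.txt, ...
split_mbox() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        /^From / {                      # separator line: switch to the next output file
            if (out != "") close(out)
            n++
            out = dir "/" n ".txt"
        }
        out != "" { print > out }       # copy the line, separator included
    ' "$1"
}
```

Calling split_mbox file.mbox ./messages would leave one file per message in ./messages. Anything before the first separator line is ignored.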
To save each message in a separate file, choose a different mailbox format than the single-file Berkeley format. Concretely, if the destination is a directory, Procmail will create a new file in that directory. How exactly the new file is named depends on the contents of the directory (if it contains the Maildir subdirectories new, tmp, and cur, the new file is created in new in accordance with Maildir naming conventions) and on how exactly the directory is specified (trailing slash and dot selects MH format; otherwise, mail directory format).
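If you would rather keep piping each message to a script, as in the question, note that read stops at the first newline, which is why only one line showed up. A rough sketch of a script body that copies the whole message from standard input to a fresh file could look like this; save_message and the msg.XXXXXX naming are assumptions for illustration, and the file names are unique rather than sequential.

```shell
# save_message DIR (hypothetical helper):
# copy all of standard input (one message) to a new, uniquely named file in DIR.
save_message() {
    local dir="$1" file
    mkdir -p "$dir" || return 1
    # mktemp creates the file atomically, so parallel deliveries cannot collide
    file=$(mktemp "$dir/msg.XXXXXX") || return 1
    # cat reads until end of input, i.e. the complete message, not just one line
    cat > "$file"
}
```

Using mktemp instead of a counter avoids having to lock or scan the directory when many messages arrive.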
Saving to one mailbox per recipient has a number of pesky corner cases. What if the message was sent to more than one of your local recipients? What if the recipient address is not visible in the headers? etc (the Procmail Mini-FAQ has a section about this, in the context of virtual hosting of a domain, which this is basically a variation of). But if we simply ignore these, you might be able to pull it off with something like
:0 # whitespace before ] is a literal tab
* ^TO_\/[^ @	]+@(yourdomain\.example|example\.info)\>
{
  # Trim domain part from captured MATCH
  :0
  * MATCH ?? ^\/[^@]+
  ./$MATCH/
}
This will capture into $MATCH the first address which matches the regex, then perform another regex match on the captured string to capture just the part before the @ sign. This requires that the addresses you want to match are all in a set of specific domains (here, I used yourdomain.example and example.info; replace those with your actual domain names) and that capturing the first matching address is sufficient (so if a message was To: alice@yourdomain.example and Cc: bob@example.info, whichever one of those is closer to the top of the message will be picked out by this recipe, and the other one will be ignored).
In some more detail, the \/ special token causes Procmail to copy the text which matched the regex after this point into the internal variable MATCH. As this recipe demonstrates, you can then perform a regex match on that variable itself to extract a substring of it (or, in other words, discard part of the captured match).
The action ./$MATCH/ uses the captured string in MATCH as the name of the folder to save into. The leading ./ specifies the current directory (which is equal to the value of the Procmail variable MAILDIR) and the trailing / selects mail directory format.
If your expected recipients cannot be constrained to be in a specific set of domains or otherwise matched by a single regex, my recommendation would be to ask a new question with more limited scope, and enough details to actually identify what you want to accomplish.

I found a solution to a part of my problem.
It seems that there is no way to have procmail itself recognize the To recipient without specifying it in a recipe, so I just obtained a list and created a huge recipe file.
But then I just discovered that to save single mails and to avoid huge mailboxes filled with a lot of mails, one could just write a recipe like:
:0
* ^To: recipient@mail.it
/inbox/folder/recipient@mail.it/
Note the / at the end: this makes procmail create a folder with one file per message instead of writing everything to a single file.

Related

read input containing spaces

I have my bash shell script working, but I need to handle the case where the user input I read contains valid whitespace between the words. It can be multiple words, so I need either a way to read the entire line and parse it, or a way to have the user enter the search string as a single entry and save it as input for my grep search.
Example 1 time out
Example 2 fails to start
Example 3 device failed to respond
Thanks!
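One possible approach, sketched under the assumption that the whole line should be used as a literal search string: read the line with IFS cleared and -r, then pass the variable to grep in double quotes. The function name search_phrase and the file argument are hypothetical.

```shell
# search_phrase FILE (hypothetical helper):
# read one line from standard input and print the lines of FILE that contain it.
search_phrase() {
    local file="$1" phrase
    # IFS= keeps leading and trailing blanks; -r keeps backslashes literal
    IFS= read -r phrase || return 1
    # The double quotes pass the whole line, spaces included, as one argument;
    # -F makes grep treat it as a fixed string rather than a regex
    grep -F -- "$phrase" "$file"
}
```

For example, printf 'fails to start\n' | search_phrase logfile would print every line of logfile that contains "fails to start".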

Processing form input in a Joomla component

I am creating a Joomla component and one of the pages contains a form with a text input for an email address.
When a < character is typed in the input field, that character and everything after is not showing up in the input.
I tried $_POST['field'] and JFactory::getApplication()->input->getCmd('field')
I also tried alternatives for getCmd like getVar, getString, etc. but no success.
E.g. John Doe <j.doe@mail.com> returns only John Doe.
When the < is left out, like John Doe j.doe@mail.com> the value is coming in correctly.
What can I do to also have the < character in the posted variable?
BTW. I had to use &lt; in this question to display it as I want it. This form suffers from the same problem!!
You actually need to set the filtering that you want when you grab the input. Otherwise, you will get some heavy filtering. (Typically, I will also lose @ symbols.)
Replace this line:
JFactory::getApplication()->input->getCmd('field');
with this line:
JFactory::getApplication()->input->getRaw('field');
The name after the get part of the function is the filtering that you will use. Cmd strips everything but alphanumeric characters and ., -, and _. String will run through the HTML tag cleaning feature of Joomla and, depending on your settings, will clean out <>. (That usually doesn't happen for me, but my settings are generally pretty open, to the point of no filtering on super admins and such.)
getRaw should definitely work, but note that there is no filtering at all, which can open security holes in your application.
The default text filter trims html from the input for your field. You should set the property
filter="raw"
in your form's manifest (xml) file, and then use getRaw() to retrieve the value. getCmd removes the non-alphanumeric characters.

BASH - How to check for duplicate email addresses across multiple files?

I'm currently working on a project where I need to send an email to a large number of email addresses. As such I am attempting to avoid any "temporary" glitches with respect to service providers throttling emails etc.
My plan is to take the initial list of email addresses and chop it up into smaller (chopped) lists, so that they can be scheduled in a staggered manner. Due to the sensitive nature of sending emails, I want to ensure that no duplicate email addresses exist across any of the chopped lists. Is there a way to do this via bash?
Side note, I am 100% certain that all email addresses in the master list are unique, due to the nature of the query used to comprise the list, I would just like to ensure, my script which chopped the master list, does not have a defect creating duplicate email addresses across the chopped lists.
You can put the chopped files together (temporarily) via cat and use sort --unique to remove duplicates - then check if the result has as many lines as the original file:
cat original_list | wc -l
and
cat list_part* | sort --unique | wc -l
If the results are the same, there are no duplicates.
Try
cat *.txt | sort | sort -u -c
given that your filenames are ending with .txt.
The first sort command orders all email addresses. The second sort command checks that no two consecutive lines are equal and exits with an error if it finds any that are.
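If you also want to see which addresses are duplicated, a small sketch along the same lines is to sort all chunk files together and let uniq -d print each repeated line once. The function name list_duplicates is made up for illustration, and it assumes one address per line.

```shell
# list_duplicates FILE... (hypothetical helper):
# print every line that occurs more than once across the given files.
list_duplicates() {
    # sort brings equal lines next to each other; uniq -d prints one copy of each repeated line
    sort -- "$@" | uniq -d
}
```

An empty result means no address appears twice, so something like [ -z "$(list_duplicates list_part*)" ] can serve as the check.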
The Problem
You need to sort unique addresses, and then split the ordered list into chunks.
The Solution
Given the following assumptions:
Your emails are stored in files called emails_xxxx.txt. (Note: You can name them anything you like, but a sensible set of filenames that are easy to glob will make your life simpler.)
Each line holds one address.
you can handle this with a short pipeline. Sort will accept a glob pattern or multiple file arguments (e.g. from xargs), so you can avoid the "useless use of cat." You then pipe the output into split, where you can control various aspects of the chunking. For example:
sort --unique emails_*.txt |
split --numeric-suffixes \
--lines=200 \
--suffix-length=4 \
--verbose
This splits the sorted/filtered lines into chunks of up to 200 lines each, and names each chunk with a numeric extension suitable for batch processing. You can adjust the lines and suffix length to suit your requirements.
Sample Output
creating file `x0000'
creating file `x0001'

Parse /var/mail/username file in Ruby

For some reason I need to fetch emails from /var/mail/username file. It seems like an append only file.
My question is, is it safe to parse the content of the /var/mail/username file depending on the first line From username@host Mon Jun 20 16:50:15 2011? What if a similar pattern is found inside the email body?
Furthermore, is there any opensource ruby script available for reference?
Yes, that seems like more or less the right way to parse the mbox format - from a quick scan of the RFC specification:
The structure of the separator lines vary across implementations, but usually contain the exact character sequence of "From", followed by a single Space character (0x20), an email address of some kind, another Space character, a timestamp sequence of some kind, and an end-of-line marker.
And...
Many implementations are also known to escape message body lines that begin with the character sequence of "From ", so as to prevent confusion with overly-liberal parsers that do not search for full separator lines. In the common case, a leading Greater-Than symbol (0x3E) is used for this purpose (with "From " becoming ">From "). However, other implementations are known not to escape such lines unless they are immediately preceded by a blank line or if they also appear to contain an email address and a timestamp. Other implementations are also known to perform secondary escapes against these lines if they are already escaped or quoted, while others ignore these mechanisms altogether.
Update:
There's also this: https://github.com/meh/ruby-mbox

Retrieve the server name from a UNC path

Is there an API in Windows that retrieves the server name from a UNC path? (\\server\share)
Or do I need to make my own?
I found PathStripToRoot but it doesn't do the trick.
I don't know of a Win32 API for parsing a UNC path; however you should check for:
\\computername\share
\\?\UNC\computername\share (people use this to access long paths > 260 chars)
You can optionally also handle this case: smb://computername/share and this case hostname:/directorypath/resource
Read here for more information
This is untested, but maybe a combination of PathIsUNC() and PathFindNextComponent() would do the trick.
I don't know if there is a specific API for this, I would just implement the simple string handling on my own (skip past "\\" or return null, look for next \ or end of string and return that substring) possibly calling PathIsUNC() first
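The string handling described above is simple enough to sketch in a few lines. The following is only an illustration of the steps in shell (skip the leading \\ or \\?\UNC\ prefix, then cut at the next backslash); unc_server is a made-up name, not a Windows API, and the same logic would carry over to whatever language you actually use.

```shell
# unc_server PATH (hypothetical helper):
# print the server part of a UNC path, or return 1 if PATH is not a UNC path.
unc_server() {
    local path="$1" server
    case $path in
        '\\?\UNC\'*) path=${path#'\\?\UNC\'} ;;   # long-path form: \\?\UNC\server\share
        '\\?\'*)     return 1 ;;                   # long local path such as \\?\C:\dir, no server
        '\\'*)       path=${path#'\\'} ;;          # plain form: \\server\share
        *)           return 1 ;;                   # not a UNC path at all
    esac
    server=${path%%\\*}                            # keep everything before the next backslash
    [ -n "$server" ] || return 1
    printf '%s\n' "$server"
}
```

With this, unc_server '\\server\share' prints server, and a non-UNC path such as 'C:\temp' makes the function return a non-zero status.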
If you'll be receiving the data as plain text, you should be able to parse it with a simple regex. I'm not sure what language you use, but I tend to use Perl for quick searches like this. Supposing you have a large document with one path per line, you can search on the leading \\, i.e.
m/\\\\([0-9][0-9][0-9]\.(repeat 3 times, of course not recalling ip address requirements you might need to modify the first one for sure) then\\)? To make it optional and include the trailing slash, and finally (.*)\\/ig it's rough but should do the trick, and the path name should be in $2 for use!
I hope that was clear enough!

Resources