How can I sort an array of emails by the email provider? - ruby

So I dumped all the emails from a DB into a txt file and I'm looking to sort them by email provider, basically anything that comes after the @ sign.
I know I can use regex to validate each email.
However, how do I indicate that I want to sort them by anything that comes after the @ sign?

I know I can use regex to validate each email.
Careful! The range of valid e-mail addresses is much wider than most people think. The only correct regexes for e-mail validation are on the order of a page in length. If you must use a regex, just check for the @ and one dot (.).
However how do I indicate that I want to sort them by anything that comes after the @ sign
email_addresses.sort_by {|addr| addr.split('@').last }
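A slightly fuller sketch of the same idea, assuming the addresses are already in an array and use the usual @ separator. The sample addresses and the `provider` helper are made up for illustration. The downcase is there because domain names are case-insensitive, so "Gmail.com" and "gmail.com" should sort together.

```ruby
# Hypothetical sample data; in practice this would come from the text file,
# e.g. File.readlines("emails.txt", chomp: true).
email_addresses = [
  "carol@yahoo.com",
  "alice@Gmail.com",
  "bob@example.org",
  "dave@gmail.com"
]

# Illustrative helper: everything after the last "@", lowercased.
def provider(addr)
  addr.rpartition("@").last.downcase
end

# Sort by provider first, then by the whole address, so the order within
# one provider is predictable.
sorted = email_addresses.sort_by { |addr| [provider(addr), addr.downcase] }

# Group by provider if one list per provider is more useful than one flat list.
by_provider = email_addresses.group_by { |addr| provider(addr) }

puts sorted
```

Using `rpartition` rather than `split` means a line without any "@" is simply treated as its own provider instead of raising an error.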

Related

How to split single mail with procmail?

I have a quarantine folder that I periodically have to download and split by recipient inbox or, even better, split into one text file per message. I have ca. 10,000 mails per day and I'm coding something with fetchmail and procmail. The problem is that I can't find out how to split message-by-message in procmail; they all end up in the same inbox.
I tried to pass every message in a script via a recipe like:
:0
| script_processing_messages.sh
Which contained
read varname
echo "$varname" > test_file
To try to see if I could obtain a single message in the $varname variable but nope, I only obtain a single line of a message each time.
Right now I use
fetchmail --keep
where .fetchmailrc is
poll mail.mymta.my protocol pop3 username "my@inbox.com" password "****" mda "procmail /root/.procmailrc"
and .procmailrc is
VERBOSE=0
DEFAULT=/root/inbox.quarantine
I would like to obtain a file for each message, so:
1.txt
2.txt
3.txt
[...]
10000.txt
I have many recipients and many domains, so I can't let's say write 5000 rules to match every recipient. It would be good if there was some kind of
^To: $USER
that redirects to
/$USER.inbox
so that procmail itself takes care of reading the recipient and dynamically creating these inboxes.
I'm not very experienced with fetchmail and procmail recipes; I'm trying hard but not getting very far.
You seem to have two or three different questions; proper etiquette on Stack Overflow would be to ask each one separately - this also helps future visitors who have just one of your problems.
First off, to split a Berkeley mbox file containing multiple messages and run Procmail on each separately, try
formail -s procmail -m <file.mbox
You might need to read up on the mailbox formats supported by Procmail. A Berkeley mailbox is a single file which contains multiple messages, simply separated by a line beginning with From (with a space after the four alphabetic characters). This separator has to be unique, and so a message which contains those five characters at beginning of a line in the body will need to be escaped somehow (typically by writing a > before From).
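To make the format concrete, here is a minimal Ruby sketch of splitting mbox text on those separator lines. It is only an illustration of the format, not how formail is implemented, and it assumes body lines were already escaped as ">From " when the mailbox was written. The `split_mbox` name and the sample mailbox are hypothetical.

```ruby
# Split the text of a Berkeley mbox into individual messages.
# Assumption: every line starting with "From " is a separator, which holds
# only if body lines were escaped as ">From " when the mailbox was written.
def split_mbox(text)
  messages = []
  text.each_line do |line|
    # Start a new message at each separator line.
    messages << String.new if line.start_with?("From ")
    # Skip anything that appears before the first separator.
    next if messages.empty?
    messages.last << line
  end
  messages
end

# Hypothetical two-message mailbox used only for illustration.
sample = <<~MBOX
  From alice@example.com Mon Jan  1 00:00:00 2024
  Subject: first

  Hello.
  >From here on, this body line is escaped.

  From bob@example.com Mon Jan  1 00:01:00 2024
  Subject: second

  Bye.
MBOX

messages = split_mbox(sample)
messages.each_with_index do |msg, i|
  # Writing each msg to "#{i + 1}.txt" would give one file per message;
  # printing keeps the sketch free of side effects.
  puts "#{i + 1}.txt: #{msg.lines.first.chomp}"
end
```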
To save each message in a separate file, choose a different mailbox format than the single-file Berkeley format. Concretely, if the destination is a directory, Procmail will create a new file in that directory. How exactly the new file is named depends on the contents of the directory (if it contains the Maildir subdirectories new, tmp, and cur, the new file is created in new in accordance with Maildir naming conventions) and on how exactly the directory is specified (trailing slash and dot selects MH format; otherwise, mail directory format).
Saving to one mailbox per recipient has a number of pesky corner cases. What if the message was sent to more than one of your local recipients? What if the recipient address is not visible in the headers? etc (the Procmail Mini-FAQ has a section about this, in the context of virtual hosting of a domain, which this is basically a variation of). But if we simply ignore these, you might be able to pull it off with something like
:0 # whitespace before ] is a literal tab
* ^TO_\/[^ @	]+@(yourdomain\.example|example\.info)\>
{
# Trim domain part from captured MATCH
:0
* MATCH ?? ^\/[^@]+
./$MATCH/
}
This will capture into $MATCH the first address which matches the regex, then perform another regex match on the captured string to capture just the part before the @ sign. This obviously requires that the addresses you want to match are all in a set of specific domains (here, I used yourdomain.example and example.info; obviously replace those with your actual domain names) and that capturing the first matching address is sufficient (so if a message was To: alice@yourdomain.example and Cc: bob@example.info, whichever one of those is closer to the top of the message will be picked out by this recipe, and the other one will be ignored).
In some more detail, the \/ special token causes Procmail to copy the text which matched the regex after this point into the internal variable MATCH. As this recipe demonstrates, you can then perform a regex match on that variable itself to extract a substring of it (or, in other words, discard part of the captured match).
The action ./$MATCH/ uses the captured string in MATCH as the name of the folder to save into. The leading ./ specifies the current directory (which is equal to the value of the Procmail variable MAILDIR) and the trailing / selects mail directory format.
If your expected recipients cannot be constrained to be in a specific set of domains or otherwise matched by a single regex, my recommendation would be to ask a new question with more limited scope, and enough details to actually identify what you want to accomplish.
I found a solution to a part of my problem.
It seems there is no way to make procmail recognize the recipient by itself without naming it in a recipe, so I obtained a list of recipients and created a huge recipe file.
But then I discovered that, to save single mails and avoid huge mailboxes filled with a lot of mails, one can just write a recipe like:
:0
* ^To: recipient@mail.it
/inbox/folder/recipient@mail.it/
Note the / at the end: this makes procmail create a directory with one file per message instead of writing everything into a single file.

regex email validation Ruby

This is my regex for email validation, but I also want to reject consecutive punctuation: the characters . _ - should never be repeated back to back. Can anyone help me?
/^((?:[a-z]+[0-9_\.-]*)+[a-z0-9_\.-]*[a-z0-9])@((?:[a-z0-9]+[\.-]*)+\.[a-z]{2,4})$/
For example:
test..test@example.com should not be accepted; instead I want test.test@example.com, test_test@example.com, or test-test@example.com
You can use the following regex to reject consecutive periods.
^(?!.*\.{2})\A\S+@.+\.\S+\z
You can also add
^(?!.*\.{2})
at the start of any email regex to make it reject consecutive dots.
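A quick way to see the lookahead at work in Ruby. The first pattern is the one from the answer with the redundant `^` dropped. The second, `NO_REPEATED_PUNCT`, is an assumption on my part rather than something from the answer: it widens the lookahead to `[._-]{2}` so that any two of those characters in a row are rejected, which seems closer to what the question asks. Neither is full e-mail validation.

```ruby
# Permissive shape check plus a negative lookahead that rejects ".."
# anywhere in the string.
NO_DOUBLE_DOT = /\A(?!.*\.{2})\S+@.+\.\S+\z/

# Hypothetical extension: reject any two of . _ - in a row (mixed or repeated).
NO_REPEATED_PUNCT = /\A(?!.*[._-]{2})\S+@.+\.\S+\z/

samples = [
  "test.test@example.com",
  "test_test@example.com",
  "test-test@example.com",
  "test..test@example.com",
  "test__test@example.com"
]

samples.each do |addr|
  dots  = addr.match?(NO_DOUBLE_DOT)     ? "ok" : "rejected"
  punct = addr.match?(NO_REPEATED_PUNCT) ? "ok" : "rejected"
  puts "#{addr}: dots-only=#{dots}, all-punctuation=#{punct}"
end
```

Because the lookahead is anchored at the start and scans the whole string with `.*`, it can be bolted onto an existing pattern without touching the rest of it.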

Clean string to get Email with Regex [closed]

I have some Ruby code that extracts email addresses from a page. My code outputs the email address, but also captures other text as well.
I would like to pull the actual email out of this string. Sometimes the string will include a mailto, sometimes it will not. I was trying to get the single word that occurs before the @, and anything that comes after the @, by using a split, but I'm having trouble. Any ideas? Thanks!
href="mailto:someonesname@domain.rr.com"> | Email</a></td>
Use something prebuilt:
require 'uri'
addresses = URI.extract(<<EOT, :mailto)
this is some text. mailto:foo@bar.com and more text
and some more http://foo@bar.com text
href="mailto:someonesname@domain.rr.com"> | Email</a></td>
EOT
addresses # => ["mailto:foo@bar.com", "mailto:someonesname@domain.rr.com"]
URI comes with Ruby, and the pattern used to parse out URIs is well tested. It's not bullet-proof, but it works pretty well. If you're getting false-positives, you can use a select, reject or grep block to filter out the unwanted entries returned.
If you can't count on having mailto:, the problem becomes harder, because email addresses aren't simple to parse; there's too much variation in them. The problem is akin to validating an email address using a pattern, because, again, the format for addresses varies too much. "Using a regular expression to validate an email address" and "JavaScript Email Validation when there are (soon to be) 1000's of TLD's?" are good reads for more information.
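The filtering step mentioned above can be sketched with `grep`. The candidate list below is hypothetical and stands in for whatever the extraction step returned; the pattern is deliberately loose (one @ with a dot somewhere after it) and is a filter, not a validator.

```ruby
# Hypothetical candidates, as an extraction step might return them.
candidates = [
  "mailto:foo@bar.com",
  "mailto:",                            # scheme with no address
  "mailto:someonesname@domain.rr.com",
  "mailto:not-an-address"
]

# Keep only entries shaped like mailto:local@domain.tld.
LOOKS_LIKE_MAILTO = /\Amailto:[^@\s]+@[^@\s]+\.[^@\s]+\z/

# grep selects the matching entries; delete_prefix drops the scheme.
addresses = candidates.grep(LOOKS_LIKE_MAILTO)
                      .map { |uri| uri.delete_prefix("mailto:") }

p addresses
```

`grep_v` with the same pattern would give the rejected entries instead, which is handy for checking what the filter throws away.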
This should also work nicely though won't account for invalid email formats - it will simply extract the email address based on your two use cases.
string[/[^\"\:](\w+@.*)(?=\")/]
This should work
inputstring[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
Explanation:
Grab the href attribute and its contents
Remove the href= and quotes
Remove the mailto: if it's there
Example:
irb(main):021:0> test = "href=\"mailto:francesco@hawaii.rr.com\"> | Email DuVin</a></td>"
=> "href=\"mailto:francesco@hawaii.rr.com\"> | Email DuVin</a></td>"
irb(main):022:0> test[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
=> "francesco@hawaii.rr.com"
irb(main):023:0> test = "href=\"francesco@hawaii.rr.com\"> | Email DuVin</a></td>"
=> "href=\"francesco@hawaii.rr.com\"> | Email DuVin</a></td>"
irb(main):024:0> test[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
=> "francesco@hawaii.rr.com"

Regex string for Sublime Text to find email address between two commas

I'm new to both regexes and Sublime Text, and I'm having trouble doing a find/replace on all email addresses in a csv file.
I thought it would be reasonably straightforward but seem to be heading down the rabbit hole at a great rate of knots.
Data looks like:
data,data,email@address.com,data,data etc. NB: there are about 100 fields per record and about 300 records.
My thought was to look for the @ symbol, then go left and right until I get to a comma, and then replace what is in between with my new email address, but I just can't get a win.
Any thoughts or am I using the wrong tool for the job?
(Also tagging with Ruby, since if I need to do some scripting I'll try to figure it out in Ruby.)
Thanks,
Liam
user2141046's expression won't find an email address like "a.b@c.com".
I would suggest using:
[a-zA-Z0-9.!#$%&'+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:.[a-zA-Z0-9-]+)
Source
I'm not familiar with the Ruby language, but a regex that finds what you want is:
\w+\@\w+\.\w+
with the \. maybe unneeded (depending on language).
A Perl one-liner that does the exact thing:
perl -pi -e 's/\w+\@\w+\.\w+/<your new email here>/g' <csv file here>
Note:
Make sure you use \@ in the new email in the one-liner I wrote, meaning new_email\@server.com
Try this:
[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
It worked perfectly on a very long csv file filled with emails and all other kinds of stuff.
[a-zA-Z0-9.!#$%&'+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:.[a-zA-Z0-9-]+)
will not work fine, because some domains have 2 or more levels (like com.br)
Use:
[a-zA-Z0-9.!#$%&'+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:.[\.a-zA-Z0-9-]+)
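Since the question mentions falling back to Ruby, here is a rough sketch of the "find the @, then extend left and right to the commas" idea. It does not reuse any of the patterns above; `EMAIL_FIELD` is a simpler pattern of my own that assumes fields never contain quoted commas or spaces inside an address. The replacement address and the sample record are hypothetical.

```ruby
# Loose pattern: a run of non-comma, non-space characters around one "@",
# with at least one dot in the part after it. This handles multi-level
# domains such as example.com.br because dots are allowed in the run.
EMAIL_FIELD = /[^,\s@]+@[^,\s@]+\.[^,\s@]+/

new_email = "new_email@server.com"   # hypothetical replacement address

# Hypothetical record; real input could be read line by line with File.foreach.
line = "data,data,email@address.com,data,a.b@c.com.br,data"

replaced = line.gsub(EMAIL_FIELD, new_email)
puts replaced
```

For files with quoted fields, a real CSV parser would be a safer choice than a regex over raw lines.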

Extract email addresses from a block of text

How can I create an array of email addresses contained within a block of text?
I've tried
addrs = text.scan(/ .+?@.+? /).map{|e| e[1...-1]}
but (not surprisingly) it doesn't work reliably.
How about this for a (slightly) better regular expression:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
You can find this here:
Email Regex
Just an FYI, the problem with your regex is that it allows only one type of separator (a space) before or after an email address. It would also match "@" alone, if separated by spaces.
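Plugged into `scan`, the suggested pattern might be used as follows. The `/i` flag is needed because the pattern only lists upper-case letters, and the sample text is made up. Note that `{2,4}` excludes longer top-level domains, so this is a sketch of the approach rather than a complete extractor.

```ruby
# The pattern from the answer, made case-insensitive with /i.
EMAIL = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Hypothetical block of text for illustration.
text = <<~TEXT
  Contact alice@example.com, or <bob.smith@mail.example.org>.
  A bare @ or "user@localhost" is not picked up.
TEXT

# With no capture groups in the pattern, scan returns an array of strings.
addrs = text.scan(EMAIL)
p addrs
```

Because the word boundaries do the delimiting, addresses followed by commas, angle brackets, or quotes are found without any trimming step.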
