I'm using the following regex to validate emails in an email DB of a Rails App:
/\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
It kinda works, but it lets through invalid addresses with this kind of error:
name@domain..com
How can I rewrite this regex to avoid that, or which is the best regex for cleaning up my email list so that only valid addresses remain? I'm using a method like this one to clean up the list:
def mailSweep
  mails = Email.all.lazy
  for address in mails do
    if address.email.to_s =~ /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
      puts address.email.to_s + " " + "it's valid"
    else
      puts address.email.to_s + " " + "it's invalid, destroying..."
      address.destroy
    end
  end
end
I suggest
/\A[^@\s]+@([-a-zA-Z0-9]{1,63}\.)+[a-zA-Z0-9]{2,63}\z/
Note that this regex does not allow all legal email addresses. You need something insanely complicated to allow for all RFC 822 valid email addresses that have been stripped of comments. If you want to include comments, then it is beyond the power of regular expressions and you have to go to a lexical parser.
Even then, you can't be sure the email is really valid. The best check is to try to send to it and see if the receiving system accepts it. Of course, even then it might be silently discarded....
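As a rough sketch of how the suggested pattern behaves, the snippet below splits a small list into matching and non-matching addresses. The constant name EMAIL_PATTERN and the sample addresses are made up for illustration; in the Rails app the strings would come from the Email records instead.

```ruby
# EMAIL_PATTERN and the sample list are illustrative assumptions,
# not part of the original app.
EMAIL_PATTERN = /\A[^@\s]+@([-a-zA-Z0-9]{1,63}\.)+[a-zA-Z0-9]{2,63}\z/

samples = [
  "name@domain.com",
  "name@domain..com",            # doubled dot, the case from the question
  "first.last@mail.example.org",
  "no-at-sign.example.com"
]

# partition returns [elements where the block is true, the rest]
valid, invalid = samples.partition { |address| EMAIL_PATTERN.match?(address) }

puts "valid:   #{valid.inspect}"
puts "invalid: #{invalid.inspect}"
```

Reviewing the invalid list before calling destroy on anything is probably safer than deleting inside the loop, since the pattern also rejects some legal addresses.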
My goal is to be able to define different sets of sub-strings that can be removed without eliminating the other strings. Open to better ideas.
What I have now is:
@outbound_text = " XTEST this is hidden XTEST hey there, what's up! XTEST this is also hidden XTEST but then I just keep writing here "
I tried the following but realized it was deleting hey there, what's up
if ENV['ENVIRONMENT'] == 'test'
  # this will allow the XTEST string to come through
else # for production and development, remove the XTEST
  unless @outbound_text.gsub!(/XTEST(.*)XTEST/ , '').nil?
    @outbound_text.strip!
  end
  logger.debug "remove XTEST: #{@outbound_text}"
end
Open to different strings bookending what I need to remove (but the number of hidden sub-strings will vary so they can only be a beginning and end).
I'm also open to using Nokogiri to remove the hidden tags, since I already have a number of them that get parsed. I would need to spend some time trying that, though, so I wanted to know whether a simple gsub would do before going down that road.
Just make the repetition non-greedy:
@outbound_text.gsub(/XTEST(.*?)XTEST/ , '').strip
# => "hey there, what's up! but then I just keep writing here"
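To make the difference concrete, here is a small sketch comparing the greedy and non-greedy forms on a made-up string (the variable names and sample text are illustrative). The non-greedy variant here also swallows the whitespace after each closing marker, which is an optional extra that avoids doubled spaces.

```ruby
# Sample text is an assumption for illustration only.
text = "XTEST hidden one XTEST keep this XTEST hidden two XTEST and keep this too"

# Greedy: .* runs from the first XTEST to the LAST one, eating "keep this".
greedy = text.gsub(/XTEST.*XTEST/, "").strip

# Non-greedy: .*? stops at the nearest closing XTEST, so each pair is
# removed separately. \s* also drops the spaces after the marker.
lazy = text.gsub(/XTEST.*?XTEST\s*/, "").strip

puts greedy
puts lazy
```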
I have Ruby code that extracts email addresses from a page. My code outputs the email address, but also captures other text as well.
I would like to pull the actual email out of this string. Sometimes, the string will include a mailto, sometimes it will not. I was trying to get the single word that occurs before the @, and anything that comes after the @ by using a split, but I'm having trouble. Any ideas? Thanks!
href="mailto:someonesname@domain.rr.com"> | Email</a></td>
Use something prebuilt:
require 'uri'
addresses = URI.extract(<<EOT, :mailto)
this is some text. mailto:foo@bar.com and more text
and some more http://foo@bar.com text
href="mailto:someonesname@domain.rr.com"> | Email</a></td>
EOT
addresses # => ["mailto:foo@bar.com", "mailto:someonesname@domain.rr.com"]
URI comes with Ruby, and the pattern used to parse out URIs is well tested. It's not bulletproof, but it works pretty well. If you're getting false positives, you can use select, reject, or grep to filter out the unwanted entries.
If you can't count on having mailto:, the problem becomes harder, because email addresses aren't simple to parse; there's too much variation to them. The problem is akin to validating an email address using a pattern, because, again, the format for addresses varies too much. "Using a regular expression to validate an email address" and "JavaScript Email Validation when there are (soon to be) 1000's of TLD's?" are good reads for more information.
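The filtering step mentioned above could look something like the following sketch. The extracted array stands in for whatever the extraction step returned, and the deliberately loose pattern is an assumption for illustration, not a full validator.

```ruby
# Hypothetical extraction result, including two false positives.
extracted = [
  "mailto:foo@bar.com",
  "mailto:",
  "mailto:someonesname@domain.rr.com",
  "mailto:not-an-address"
]

# Keep only entries that look like mailto:local@domain (a loose check).
plausible = extracted.grep(/\Amailto:[^@\s]+@[^@\s]+\z/)

# Strip the scheme to get bare addresses (delete_prefix needs Ruby 2.5+).
addresses = plausible.map { |uri| uri.delete_prefix("mailto:") }

puts addresses.inspect
```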
This should also work nicely though won't account for invalid email formats - it will simply extract the email address based on your two use cases.
string[/[^\"\:](\w+@.*)(?=\")/]
This should work
inputstring[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
Explanation:
Grab the href attribute and its contents
Remove the href= and quotes
Remove the mailto: if it's there
Example:
irb(main):021:0> test = "href=\"mailto:francesco@hawaii.rr.com\"> | Email DuVin</a></td>"
=> "href=\"mailto:francesco@hawaii.rr.com\"> | Email DuVin</a></td>"
irb(main):022:0> test[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
=> "francesco@hawaii.rr.com"
irb(main):023:0> test = "href=\"francesco@hawaii.rr.com\"> | Email DuVin</a></td>"
=> "href=\"francesco@hawaii.rr.com\"> | Email DuVin</a></td>"
irb(main):024:0> test[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
=> "francesco@hawaii.rr.com"
Here are two possible email string scenarios:
email = "Joe Schmoe <joe@example.com>"
email = "joe@example.com"
I always only want joe@example.com.
So what would the regex or method be that would account for both scenarios?
This passes your examples:
def find_email(string)
  string[/<([^>]*)>$/, 1] || string
end
find_email "Joe Schmoe <joe@example.com>" # => "joe@example.com"
find_email "joe@example.com" # => "joe@example.com"
If you know your email is always going to be in the < >, you can take a substring using those as the starting and ending indexes.
If those are the only two formats, don't use a regex. Just use simple string parsing. If you find a "<>" pair, pull the email address out from between them; if you don't find those characters, treat the whole string as the email address.
Regexes are great when you need them, but if you have very simple patterns, then the overhead of loading in and parsing down the regex and processing with it will be much higher than simple string manipulation. Not loading in extra libraries other than what is very core in a language will almost always be faster than going a different route.
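A minimal sketch of that plain string approach, assuming the two formats from the question are the only ones that occur. The method name extract_address is made up for illustration.

```ruby
# Hypothetical helper: pulls the address out of "Name <addr>" or returns
# the whole (stripped) string when no angle brackets are present.
def extract_address(string)
  open_at  = string.index("<")
  close_at = string.rindex(">")

  if open_at && close_at && open_at < close_at
    # Exclusive range: everything between "<" and ">"
    string[(open_at + 1)...close_at].strip
  else
    string.strip
  end
end

puts extract_address("Joe Schmoe <joe@example.com>")
puts extract_address("joe@example.com")
```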
If you are willing to load an extra library, this has already been solved in the TMail gem:
http://lindsaar.net/2008/4/13/tip-5-cleaning-up-an-verifying-an-email-address-with-ruby-on-rails
TMail::Address.parse('Mikel A. <spam@lindsaar.net>').spec
=> "spam@lindsaar.net"
So I dumped all the emails from a DB into a txt file and I'm looking to sort them by email provider, basically anything that comes after the @ sign.
I know I can use regex to validate each email.
However, how do I indicate that I want to sort them by anything that comes after the @ sign?
I know I can use regex to validate each email.
Careful! The range of valid e-mail addresses is much wider than most people think. The only correct regexes for e-mail validation are on the order of a page in length. If you must use a regex, just check for the @ and one dot.
However how do I indicate that I want to sort them by anything that comes after the @ sign
email_addresses.sort_by {|addr| addr.split('@').last }
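Here is a slightly fuller sketch of the same idea with made-up addresses. It lowercases the domain so that "Gmail.com" and "gmail.com" sort together, and uses the full address as a tie-breaker; both choices are assumptions about what "sorted by provider" should mean.

```ruby
# Sample data is illustrative only.
email_addresses = [
  "bob@yahoo.com",
  "alice@Gmail.com",
  "carol@aol.com",
  "dave@gmail.com"
]

# Sort by [domain, whole address], both case-insensitively.
sorted = email_addresses.sort_by do |addr|
  [addr.split("@").last.downcase, addr.downcase]
end

# group_by gives one bucket per provider if that is more useful.
by_provider = email_addresses.group_by { |addr| addr.split("@").last.downcase }

puts sorted
puts by_provider.keys.sort.inspect
```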
I am retrieving emails using the Fetcher plugin for Rails. It is doing a fine job. But I am trying to split the body of the email on newlines but it appears that it is only one really long line.
What is the best way (in Ruby) to split an email up into multiple lines?
Sounds like you need a word wrapping algorithm. Here is a short and clever way of word wrapping in Ruby that I found on the ruby-talk mailing list (link is to Google's cache because the site seems to be down):
puts $<.read.gsub(/\t/," ").gsub(/.{1,50}(?:\s|\Z)/){($& +
5.chr).gsub(/\n\005/,"\n").gsub(/\005/,"\n")}
Here's a slightly prettier version wrapped in a method:
def wordwrap(str, columns=80)
  str.gsub(/\t/, " ").gsub(/.{1,#{ columns }}(?:\s|\Z)/) do
    ($& + 5.chr).gsub(/\n\005/, "\n").gsub(/\005/, "\n")
  end
end
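For comparison, here is a simpler sketch of regex-based wrapping that returns an array of lines. It is a different, more common variant and not the ruby-talk version above; the method name wrap_lines and the sample sentence are made up for illustration.

```ruby
# Hypothetical helper: breaks text at whitespace so that each line is at
# most `columns` characters long, then returns the lines as an array.
# A single word longer than `columns` is not split by this pattern.
def wrap_lines(text, columns = 80)
  # Capture up to `columns` characters that are followed by whitespace
  # or the end of the string, and replace that whitespace with a newline.
  text.gsub(/(.{1,#{columns}})(\s+|\z)/, "\\1\n").split("\n")
end

lines = wrap_lines("The quick brown fox jumps over the lazy dog", 15)
puts lines
```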