Clean string to get Email with Regex [closed] - ruby

I have Ruby code that extracts email addresses from a page. It outputs the email address, but it also captures other text.
I would like to pull the actual email out of this string. Sometimes the string includes a mailto, sometimes it does not. I was trying to get the single word that occurs before the #, and anything that comes after the #, by using a split, but I'm having trouble. Any ideas? Thanks!
href="mailto:someonesname#domain.rr.com"> | Email</a></td>

Use something prebuilt:
require 'uri'
addresses = URI.extract(<<EOT, :mailto)
this is some text. mailto:foo#bar.com and more text
and some more http://foo#bar.com text
href="mailto:someonesname#domain.rr.com"> | Email</a></td>
EOT
addresses # => ["mailto:foo#bar.com", "mailto:someonesname#domain.rr.com"]
URI comes with Ruby, and the pattern used to parse out URIs is well tested. It's not bullet-proof, but it works pretty well. If you're getting false-positives, you can use a select, reject or grep block to filter out the unwanted entries returned.
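As a rough sketch of that filtering step, assuming an array in the shape shown above (the sample strings are copied from this answer's output rather than freshly extracted, and `delete_prefix` needs Ruby 2.5 or later):

```ruby
# Sample data in the shape URI.extract returned above; not a fresh extraction.
addresses = ["mailto:foo#bar.com", "mailto:someonesname#domain.rr.com"]

# grep keeps only the entries matching a pattern; here, anything ending in rr.com.
rr_only = addresses.grep(/rr\.com\z/)

# Strip the scheme so only the bare address remains.
bare = addresses.map { |uri| uri.delete_prefix("mailto:") }

p rr_only # => ["mailto:someonesname#domain.rr.com"]
p bare    # => ["foo#bar.com", "someonesname#domain.rr.com"]
```

`select` and `reject` work the same way with a block when the condition is more than a single pattern.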
If you can't count on having mailto:, the problem becomes harder, because email addresses aren't simple to parse; there's too much variation in them. The problem is akin to validating an email address using a pattern because, again, the format for addresses varies too much. "Using a regular expression to validate an email address" and "JavaScript Email Validation when there are (soon to be) 1000's of TLD's?" are good reads for more information.

This should also work nicely, though it won't account for invalid email formats; it will simply extract the email address based on your two use cases.
string[/[^\"\:](\w+#.*)(?=\")/]
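One caveat with the pattern above: the greedy `.*` runs to the last double quote on the line, so it can overshoot when more attributes follow the href. A narrower variant, offered as a sketch rather than a validator (the character classes are an assumption about what a typical address contains, and the sample strings are hypothetical):

```ruby
# Hypothetical sample inputs modeled on the question's string.
with_mailto    = 'href="mailto:someonesname#domain.rr.com" class="x"> | Email</a></td>'
without_mailto = 'href="someonesname#domain.rr.com"> | Email</a></td>'

# Word characters, dots, plus and hyphen, then the separator, then host characters.
ADDRESS = /[\w.+-]+#[\w.-]+/

p with_mailto[ADDRESS]    # => "someonesname#domain.rr.com"
p without_mailto[ADDRESS] # => "someonesname#domain.rr.com"
```

Because neither `:` nor `"` is in the character classes, the match starts after mailto: and stops at the closing quote.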

This should work
inputstring[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
Explanation:
Grab the href attribute and its contents
Remove the href= and quotes
Remove the mailto: if it's there
Example:
irb(main):021:0> test = "href=\"mailto:francesco#hawaii.rr.com\"> | Email DuVin</a></td>"
=> "href=\"mailto:francesco#hawaii.rr.com\"> | Email DuVin</a></td>"
irb(main):022:0> test[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
=> "francesco#hawaii.rr.com"
irb(main):023:0> test = "href=\"francesco#hawaii.rr.com\"> | Email DuVin</a></td>"
=> "href=\"francesco#hawaii.rr.com\"> | Email DuVin</a></td>"
irb(main):024:0> test[/href="[^"]+"/][6 .. -2].gsub("mailto:", "")
=> "francesco#hawaii.rr.com"
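Note that the chained version raises NoMethodError when the string has no href attribute, because the first lookup returns nil. A capture group sidesteps that; this is only a sketch, and `href_address` is a hypothetical helper name:

```ruby
# Returns the href value without a leading "mailto:", or nil when no href is present.
def href_address(input)
  input[/href="(?:mailto:)?([^"]+)"/, 1]
end

p href_address('href="mailto:francesco#hawaii.rr.com"> | Email DuVin</a></td>')
p href_address('href="francesco#hawaii.rr.com"> | Email DuVin</a></td>')
p href_address('<td>no link here</td>')
```

The first two calls print the bare address and the third prints nil, so the caller can decide what to do with a missing link.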

Related

Ruby: Remove "double dot" with regex [duplicate]

I'm using the following regex to validate emails in an email DB of a Rails App:
/\A[\w+\-.]+#[a-z\d\-.]+\.[a-z]+\z/i
It kind of works, but it lets through invalid addresses with errors like this one:
name#domain..com
How can I rewrite this regex to avoid that, or which is the best regex to clean up an email list so that only valid email addresses remain? I'm using a method like this one to clean up the list:
def mailSweep
  mails = Email.all.lazy
  for address in mails do
    if address.email.to_s =~ /\A[\w+\-.]+#[a-z\d\-.]+\.[a-z]+\z/i
      puts address.email.to_s + " " + "it's valid"
    else
      puts address.email.to_s + " " + "it's invalid, destroying..."
      address.destroy
    end
  end
end
I suggest
/\A[^#\s]+#([-a-zA-Z0-9]{1,63}\.)+[a-zA-Z0-9]{2,63}\z/
Note that this regex does not allow all legal email addresses. You need something insanely complicated to allow for all RFC 822 valid email addresses that have been stripped of comments. If you want to include comments, then it is beyond the power of regular expressions and you have to go to a lexical parser.
Even then, you can't be sure the email is really valid. The best check is to try to send to it and see if the receiving system accepts it. Of course, even then it might be silently discarded....
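To see how the suggested pattern treats the double-dot case, here is a small sketch. It runs on a plain array of strings rather than on the Email model, and the constant name and sample list are assumptions made for illustration:

```ruby
# The pattern suggested above, stored in a constant (the name is illustrative).
EMAIL_SHAPE = /\A[^#\s]+#([-a-zA-Z0-9]{1,63}\.)+[a-zA-Z0-9]{2,63}\z/

# Hypothetical sample list standing in for the Email records.
candidates = ["name#domain.com", "name#domain..com", "name#sub.domain.org", "no-separator"]

# partition splits the list into matches and non-matches in one pass.
valid, invalid = candidates.partition { |addr| EMAIL_SHAPE.match?(addr) }

p valid   # => ["name#domain.com", "name#sub.domain.org"]
p invalid # => ["name#domain..com", "no-separator"]
```

Each host label must contain at least one character before its dot, which is what rules out `domain..com`. Collecting the rejects first also gives a chance to review them before destroying any records.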

How to pull the email address out of this string?

Here are two possible email string scenarios:
email = "Joe Schmoe <joe#example.com>"
email = "joe#example.com"
I always only want joe#example.com.
So what would the regex or method be that would account for both scenarios?
This passes your examples:
def find_email(string)
  string[/<([^>]*)>$/, 1] || string
end
find_email "Joe Schmoe <joe#example.com>" # => "joe#example.com"
find_email "joe#example.com" # => "joe#example.com"
If you know your email is always going to be inside the < >, then you can take a substring using those as the starting and ending indexes.
If those are the only two formats, don't use a regex. Just use simple string parsing. If you find a "<>" pair, pull the email address out from between them, and if you don't find those characters, treat the whole string as the email address.
Regexes are great when you need them, but if you have very simple patterns, then the overhead of compiling the regex and processing with it can be higher than simple string manipulation. Sticking to a language's core features rather than loading extra libraries will almost always be faster.
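The plain string parsing described above might look like the following sketch. `bare_address` is a hypothetical helper name, and it assumes at most one angle-bracket pair per string:

```ruby
# Returns the text between "<" and ">" if both are present, else the whole string.
def bare_address(string)
  open_at  = string.index("<")
  close_at = string.rindex(">")
  # Fall back to the whole string when there is no well-ordered bracket pair.
  return string.strip unless open_at && close_at && open_at < close_at

  string[(open_at + 1)...close_at].strip
end

p bare_address("Joe Schmoe <joe#example.com>") # => "joe#example.com"
p bare_address("joe#example.com")              # => "joe#example.com"
```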
If you are willing to load an extra library, this has already been solved in the TMail gem:
http://lindsaar.net/2008/4/13/tip-5-cleaning-up-an-verifying-an-email-address-with-ruby-on-rails
TMail::Address.parse('Mikel A. <spam#lindsaar.net>').spec
=> "spam#lindsaar.net"

How can I sort an array of emails by the email provider?

So I dumped all the emails from a DB into a txt file and I'm looking to sort them by email provider, basically anything that comes after the # sign.
I know I can use regex to validate each email.
However how do I indicate that I want to sort them by anything that comes after the # sign?
I know I can use regex to validate each email.
Careful! The range of valid e-mail addresses is much wider than most people think. The only correct regexes for e-mail validation are on the order of a page in length. If you must use a regex, just check for the # and one dot.
However how do I indicate that I want to sort them by anything that comes after the # sign
email_addresses.sort_by {|addr| addr.split('#').last }
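A slightly fuller sketch of the same idea, using a hypothetical sample list. Lowercasing the provider and adding the full address as a tie-breaker are assumptions about the desired ordering, not part of the answer above:

```ruby
# Hypothetical sample list; "#" is the separator, as in the question.
emails = ["bob#yahoo.com", "alice#gmail.com", "carol#aol.com", "dave#gmail.com"]

# Sort by provider first, then by the full address to break ties predictably.
sorted = emails.sort_by { |addr| [addr.split("#").last.downcase, addr.downcase] }

# group_by collects the addresses under each provider instead of only ordering them.
by_provider = emails.group_by { |addr| addr.split("#").last.downcase }

p sorted
# => ["carol#aol.com", "alice#gmail.com", "dave#gmail.com", "bob#yahoo.com"]
p by_provider["gmail.com"]
# => ["alice#gmail.com", "dave#gmail.com"]
```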

What regex can I use to get the domain name from a url in Ruby?

I am trying to construct a regex to extract a domain given a url.
for:
http://www.abc.google.com/
http://abc.google.com/
https://www.abc.google.com/
http://abc.google.com/
should give:
abc.google.com
URI.parse('http://www.abc.google.com/').host
#=> "www.abc.google.com"
Not a regex, but probably more robust than anything we come up with here.
URI.parse('http://www.abc.google.com/').host.gsub(/^www\./, '')
If you want to remove the www. as well this will work without raising any errors if the www. is not there.
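One thing to watch: `host` is nil when the string has no scheme, so the chained `gsub` would raise in that case. A sketch that guards against it, where `host_without_www` is a hypothetical helper name:

```ruby
require "uri"

# Returns the host with a leading "www." removed, or nil if the URL has no host.
def host_without_www(url)
  URI.parse(url).host&.sub(/\Awww\./, "")
end

urls = %w[
  http://www.abc.google.com/
  http://abc.google.com/
  https://www.abc.google.com/
]
p urls.map { |u| host_without_www(u) }
# => ["abc.google.com", "abc.google.com", "abc.google.com"]
```

Anchoring with `\A` and using `sub` limits the change to a single leading `www.`, so a host such as `abc.www.example.com` is left alone.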
I don't know much about Ruby, but this regex pattern gives you the last 3 parts of the URL, excluding the trailing slash, with a minimum of 2 characters per part.
([\w-]{2,}\.[\w-]{2,}\.[\w-]{2,})/$
You may be able to use the domain_name gem for this kind of work. From the README:
require "domain_name"
host = DomainName("a.b.example.co.uk")
host.domain #=> "example.co.uk"
Your question is a little bit vague. Can you give a precise specification of what exactly you want to do? (Preferably with a test suite.) Right now, all your question says is that you want a method that always returns 'abc.google.com'. That's easy:
def extract_domain
  return 'abc.google.com'
end
But that's probably not what you meant …
Also, you say that you need a Regexp. Why? What's wrong with, for example, using the URI class? After all, parsing and manipulating URIs is exactly what it was made for!
require 'uri'
URI.parse('https://abc.google.com/').host # => 'abc.google.com'
And lastly, you say you are "trying to extract a domain", but you never specify what you mean by "domain". It looks like you sometimes mean the FQDN and sometimes randomly drop parts of the FQDN, but according to what rules? For example, for the FQDN abc.google.com, the domain name is google.com and the host name is abc, but you want it to return abc.google.com, which is not just the domain name but the full FQDN. Why?
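To make the distinction concrete, here is a naive sketch that splits the FQDN into labels. Taking the last two labels as "the domain" is an assumption that breaks for suffixes such as co.uk, which is the gap the domain_name gem is meant to cover:

```ruby
require "uri"

fqdn   = URI.parse("https://abc.google.com/").host
labels = fqdn.split(".")

host_name    = labels.first              # leftmost label
naive_domain = labels.last(2).join(".")  # last two labels only; wrong for co.uk and similar

p fqdn         # => "abc.google.com"
p host_name    # => "abc"
p naive_domain # => "google.com"
```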

Extract email addresses from a block of text

How can I create an array of email addresses contained within a block of text?
I've tried
addrs = text.scan(/ .+?#.+? /).map{|e| e[1...-1]}
but (not surprisingly) it doesn't work reliably.
How about this for a (slightly) better regular expression:
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
You can find this here:
Email Regex
Just an FYI, the problem with your regex is that it allows only one type of separator, a space, before or after an email address. It would also match a "#" on its own, if separated by spaces.
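As a sketch of how that pattern could be used with `scan`: the `/i` flag is added here because the pattern lists only uppercase letters, and the constant name and sample text are assumptions for illustration:

```ruby
# The pattern above with /i added so lowercase addresses match too.
EMAIL_IN_TEXT = /\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Hypothetical block of text with punctuation around the addresses.
text = "Contact joe#example.com, or sue.smith#mail.example.org. Thanks!"

# scan returns every non-overlapping match as an array of strings.
addrs = text.scan(EMAIL_IN_TEXT)

p addrs # => ["joe#example.com", "sue.smith#mail.example.org"]
```

The `\b` anchors let punctuation or line breaks delimit an address, so no surrounding spaces need trimming. The `{2,4}` limit means top-level domains longer than four letters are not matched.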
