Ruby Regex to extract domain from email address

I have no real previous experience using regex, just saying.
I want to extract domain names from email addresses with the below format.
richardc#mydomain.com
so that the regex returns just: mydomain
With an explanation of how/why it works if possible!
Cheers

Here the domain name is captured with (...) as group \1, and gsub replaces the whole string with that capture, which leaves only the domain name.
email = 'richardc#mydomain.com'
domain = email.gsub(/.+#([^.]+).+/, '\1')
# => mydomain
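.+ matches one or more of any character (except \n), so the pattern as a whole matches the entire email string. The domain name is captured by ([^.]+), which matches one or more characters that are not a dot; the trailing .+ then consumes the rest of the string (.com).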
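If only the captured part is needed, String#[] with a regex and a group index avoids replacing the whole string. The following is a minimal sketch, not the answer's own method: domain_label and its separator parameter are hypothetical names, and the default separator is '#' only because the addresses on this page are written that way.

```ruby
# Hypothetical helper: returns the first label of the domain, or nil if
# the separator is not found. '#' is the default because the examples
# above are written that way; pass '@' for ordinary addresses.
def domain_label(email, separator = '#')
  # Escape the separator, then capture everything up to the next dot.
  pattern = /#{Regexp.escape(separator)}([^.]+)\./
  email[pattern, 1]
end

puts domain_label('richardc#mydomain.com')      # prints "mydomain"
puts domain_label('richardc@mydomain.com', '@') # prints "mydomain"
```

Unlike the gsub version, this returns nil when nothing matches instead of handing back the input unchanged.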

If you want to take the parsing route instead, the mail gem will do the job:
Mail::Address.new("richardc#mydomain.com").domain

Related

How to write a regex to match .com or .org with a "-" in the domain name

How do I write a regex in ruby that will look for a "-" and ".org" or "com" like:
some-thing.org
some-thing.org.sg
some-thing.com
some-thing.com.sg
some-thing.com.* (there are too many countries so for now any suffix is fine- I will deal with this problem later )
but not:
some-thing
some-thing.moc
I wrote : /.-.(org)?|.*(.com)/i
but it fails to stop "some-thing" or "some-thing.moc" :(

Support optional hyphen
I came up with this regex:
(https?:\/\/)?(www\.)?[a-z0-9-]+\.(com|org)(\.[a-z]{2,3})?
Keep in mind that I used capturing groups for simplicity, but if you want to avoid capturing the content you can use non capturing groups like this:
(?:https?:\/\/)?(?:www\.)?[a-z0-9-]+\.(?:com|org)(?:\.[a-z]{2,3})?
^--- Notice "?:" to use non capturing groups
Additionally, if you don't want to use protocol and www pattern you can use:
[a-z0-9-]+\.(?:com|org)(?:\.[a-z]{2,3})?
Support mandatory hyphen
However, as Greg Hewgill pointed out in his comment, if you want to ensure there is at least one hyphen, you can use this regex:
(?:https?:\/\/)?(?:www\.)?[a-z0-9]+(?:[-][a-z0-9]+)+\.(?:com|org)(?:\.[a-z]{2,3})?
Be aware, though, that this regex can run into heavy backtracking on some inputs.
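To check the mandatory-hyphen pattern against the names from the question, it can be tried in Ruby roughly as follows. This is a sketch under assumptions: the \A and \z anchors and the /i flag are additions (without anchors the pattern can match in the middle of a longer string), the protocol and www parts are left out, and HYPHENATED_DOMAIN is a hypothetical name.

```ruby
# Anchored variant of the mandatory-hyphen pattern (anchors and /i are additions).
HYPHENATED_DOMAIN = /\A[a-z0-9]+(?:-[a-z0-9]+)+\.(?:com|org)(?:\.[a-z]{2,3})?\z/i

names = %w[
  some-thing.org some-thing.org.sg some-thing.com some-thing.com.sg
  some-thing some-thing.moc
]

# Print each name with true/false depending on whether it matches.
names.each do |name|
  puts "#{name}: #{HYPHENATED_DOMAIN.match?(name)}"
end
```

The first four names match; some-thing and some-thing.moc do not.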

This may help :
/[a-z0-9]+-?[a-z0-9]+\.(org|com)(\.[a-z]+)?/i
It matches '-' in the middle optionally, i.e. still matches names without '-'.

I had a similar issue when I was writing an HTTP server...
... I ended up using the following Regexp:
m = url.match /(([a-z0-9A-Z]+):\/\/)?(([^\/\:]+))?(:([0-9]+))?([^\?\#]*)(\?([^\#]*))?/
m[2] # => requested protocol (optional), e.g. https, http, ws, ftp (m[1] also includes the "://")
m[4] # => host name (optional), e.g. www.my-site.com
m[6] # => port (optional)
m[7] # => encoded URI path, e.g. /index.htm
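Counting parentheses to find the right group index is error-prone, so named captures can make the same idea easier to read. The sketch below is an illustrative rewrite, not the answer's original code: URL_PARTS and the group names are hypothetical, and the \A anchor and /i flag are additions.

```ruby
# Same overall shape as the pattern above, with named groups instead of indices.
URL_PARTS = %r{\A(?:(?<scheme>[a-z0-9]+)://)?(?<host>[^/:]+)?(?::(?<port>[0-9]+))?(?<path>[^?#]*)(?:\?(?<query>[^#]*))?}i

m = URL_PARTS.match('https://www.my-site.com:8080/index.htm?x=1')

puts m[:scheme] # prints "https"
puts m[:host]   # prints "www.my-site.com"
puts m[:port]   # prints "8080"
puts m[:path]   # prints "/index.htm"
puts m[:query]  # prints "x=1"
```

Optional parts that are absent (for example the port) come back as nil.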
If what you are trying to do is validate a host name, you can simply make sure it doesn't contain the few illegal characters (:, /) and contains at least one dot separated string.
If you want to validate only .com or .org (+ country codes), you can do something like this:
def is_legit_url?(url)
  allowed_master_domains = %w{com org}
  allowed_country_domains = %w{sg it uk}
  # The country codes need their own group; otherwise the leading \. applies only to the first one
  url.match(/[^\/\:]+\.(#{allowed_master_domains.join '|'})(\.(?:#{allowed_country_domains.join '|'}))?/i) && true
end
* Notice that certain countries use .co instead, e.g. the UK uses www.amazon.co.uk
I would convert the Regexp to a constant, for performance reasons:
module MyURLReview
  ALLOWED_MASTER_DOMAINS = %w{com org}
  ALLOWED_COUNTRY_DOMAINS = %w{sg it uk}
  # The country codes need their own group; otherwise the leading \. applies only to the first one
  DOMAINS_REGEXP = /[^\/\:]+\.(#{ALLOWED_MASTER_DOMAINS.join '|'})(\.(?:#{ALLOWED_COUNTRY_DOMAINS.join '|'}))?/i
  def self.is_legit_url?(url)
    url.match(DOMAINS_REGEXP) && true
  end
end
Good Luck!

/[a-zA-Z0-9]-[a-zA-Z0-9]+?\.(?:org|com)\.?/
Of course, the above could be simplified depending on how lenient your rules are. The following is a simpler pattern, but would allow s0me-th1ng.com-plete to pass through:
/\w-\w+?\.(?:org|com)\b/

You could use a lookahead:
^(?=[^.]+-[^.]+)([^.]+\.(?:org|com).*)
Assuming you are looking for the general pattern of letters-letters where letters could be Unicode, you can do:
^(?=\p{L}+-\p{L}+)([^.]+\.(?:org|com).*)
If you want to add digits:
^(?=[\p{L}0-9]+-[\p{L}0-9]+)([^.]+\.(?:org|com).*)
So that you can match sòme1-thing.com
(Ruby 2.0+ for \p{L} I think...)
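A quick way to try the lookahead version in Ruby is sketched below. The only changes from the pattern above are assumptions made for illustration: \A replaces ^ so that only the start of the string is accepted (in Ruby, ^ matches at the start of every line), and HYPHEN_BEFORE_DOT is a hypothetical name.

```ruby
# Lookahead: a letters/digits run, a hyphen, and another run must open the string.
HYPHEN_BEFORE_DOT = /\A(?=[\p{L}0-9]+-[\p{L}0-9]+)([^.]+\.(?:org|com).*)/

samples = ["some-thing.org", "s\u00F2me1-thing.com", "something.com", "some-thing.moc"]

# Print each sample with true/false depending on whether it matches.
samples.each do |name|
  puts "#{name}: #{HYPHEN_BEFORE_DOT.match?(name)}"
end
```

The first two samples match; something.com fails the lookahead (no hyphen) and some-thing.moc fails on the suffix.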

How to visit a link inside an email using capybara

I am new to Cucumber with Capybara. I have an application to test whose flow is: after submitting a form, an email is sent to the user containing a link to another app. To access that app, we have to open the email and click the link, which redirects to the app. I don't have access to the mail account. Is there any way to extract that link and continue with the flow?
Please, give some possible way to do it.
Regards,
Abhisek Das

In your test, use whatever means you need in order to trigger the sending of the email by your application. Once the email is sent, use a regular expression to find the URL from the link within the email body (note this will work only for an email that contains a single link), and then visit the path from that URL with Capybara to continue with your test:
path_regex = /(?:"https?\:\/\/.*?)(\/.*?)(?:")/
email = ActionMailer::Base.deliveries.last
path = email.body.match(path_regex)[1]
visit(path)
Regular expression explained
A regular expression (regex) itself is demarcated by forward slashes, and this regex in particular consists of three groups, each demarcated by pairs of parentheses. The first and third groups both begin with ?:, indicating that they are non-capturing groups, while the second is a capturing group (no ?:). I will explain the significance of this distinction below.
The first group, (?:"https?\:\/\/.*?), is a:
- non-capturing group, ?:
- that matches a single double quote, "
  - we match a quote since we anticipate the URL to be in the href="..." attribute of a link tag
- followed by the string http
- optionally followed by a lowercase s, s?
  - the question mark makes the preceding match, in this case s, optional
- followed by a colon and two forward slashes, \:\/\/
  - note the backslashes, which are used to escape characters that otherwise have a special meaning in a regex
- followed by a wildcard, .*?, which will match any character any number of times up until the next match in the regex is reached
  - the period, or wildcard, matches any character
  - the asterisk, *, repeats the preceding match up to an unlimited number of times, depending on the successive match that follows
  - the question mark makes this a lazy match, meaning the wildcard will match as few characters as possible while still allowing the next match in the regex to be satisfied
The second group, (\/.*?) is a capturing group that:
- matches a single forward slash, \/
  - this will match the first forward slash after the host portion of the URL (e.g. the slash at the end of http://www.example.com/) since the slashes in http:// were already matched by the first group
- followed by another lazy wildcard, .*?
The third group, (?:"), is:
- another non-capturing group, ?:
- that matches a single double quote, "
And thus, our second group will match the portion of the URL starting with the forward slash after the host and going up to, but not including, the double quote at the end of our href="...".
When we call the match method using our regex, it returns an instance of MatchData, which behaves much like an array. The element at index 0 is a string containing the entire matched string (from all of the groups in the regex), while elements at subsequent indices contain only the portions of the string matched by the regex's capturing groups (only our second group, in this case). Thus, to get the corresponding match of our second group—which is the path we want to visit using Capybara—we grab the element at index 1.
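The difference between index 0 and index 1 can be seen without sending a real email by running the regex against a plain string. In this sketch the HTML snippet and its URL are made up for illustration; only path_regex comes from the answer above.

```ruby
path_regex = /(?:"https?\:\/\/.*?)(\/.*?)(?:")/

# Hypothetical email body containing a single link.
body = '<p><a href="http://www.example.com/users/confirm?token=abc123">Confirm</a></p>'

match = body.match(path_regex)

puts match[0] # prints "http://www.example.com/users/confirm?token=abc123" including the surrounding double quotes
puts match[1] # prints /users/confirm?token=abc123
```

If the body contains no quoted URL, match returns nil, so a test would normally check for that before indexing.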

You can use Nokogiri to parse the email body and find the link you want to click.
Imagine you want to click a link Change my password:
email = ActionMailer::Base.deliveries.last
html = Nokogiri::HTML(email.html_part.body.to_s)
target_url = html.at("a:contains('Change my password')")['href']
visit target_url
I think this is more semantic and robust than using regular expressions. For example, this would work even if the email has many links.

If you're using or willing to use the capybara-email gem, there's now a simpler way of doing this. Let's say you've generated an email to recipient#email.com, which contains the link 'fancy link'.
Then you can just do this in your test suite:
open_email('recipient#email.com') # Allows the current_email method
current_email.click_link 'fancy link'

How do I select all the characters after any of these three extensions with Regex?

My test string :
http://website.me/stuffs/5715?vars=
So my url can be website.com, website.me, or website.dev.
And I basically want a regex statement that would capture all the content after this part:
http://website.me:3000/
So that it returns :
stuffs/5715?vars=

You should really use the URI class from Ruby core:
require 'uri'
URI.parse('http://website.me/stuffs/5715?vars=').request_uri
#=> "/stuffs/5715?vars="

http://[^/]+/(.*)
The (.*) part should capture everything after the domain name (and port number if it's included) and store it in capture group 1. How you go about accessing the capture group is language/implementation specific. This regex works by just matching everything after the first / that appears after the initial http://.
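In Ruby, one way to read capture group 1 is String#[] with the regex and a group index. A minimal sketch, assuming the URLs from the question: AFTER_HOST and after_host are hypothetical names, and the \A anchor and the optional s in https are additions to the pattern above.

```ruby
# %r{...} avoids having to escape the forward slashes.
AFTER_HOST = %r{\Ahttps?://[^/]+/(.*)}

# Returns everything after the host (and port, if present), or nil on no match.
def after_host(url)
  url[AFTER_HOST, 1]
end

puts after_host('http://website.me/stuffs/5715?vars=')      # prints "stuffs/5715?vars="
puts after_host('http://website.me:3000/stuffs/5715?vars=') # prints "stuffs/5715?vars="
```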

How to pull the email address out of this string?

Here are two possible email string scenarios:
email = "Joe Schmoe <joe#example.com>"
email = "joe#example.com"
I always only want joe#example.com.
So what would the regex or method be that would account for both scenarios?

This passes your examples:
def find_email(string)
string[/<([^>]*)>$/, 1] || string
end
find_email "Joe Schmoe <joe#example.com>" # => "joe#example.com"
find_email "joe#example.com" # => "joe#example.com"

If you know your email is always going to be in the < > then you can do a sub string with those as the starting and ending indexes.

If those are the only two formats, don't use a regex. Just use simple string parsing. If you find a "<>" pair, then pull the email address out from between them, and if you don't find those characters, treat the whole string as the email address.
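The string-parsing approach described here might look like the following sketch. extract_address is a hypothetical name, and the code assumes the input has at most one "<...>" pair, as in the two formats from the question.

```ruby
# Returns the text between '<' and '>' if both are present in that order,
# otherwise the whole string (stripped of surrounding whitespace).
def extract_address(string)
  open_index = string.index('<')
  close_index = string.rindex('>')
  return string.strip unless open_index && close_index && open_index < close_index

  string[(open_index + 1)...close_index].strip
end

puts extract_address('Joe Schmoe <joe#example.com>') # prints "joe#example.com"
puts extract_address('joe#example.com')              # prints "joe#example.com"
```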
Regexes are great when you need them, but if you have very simple patterns, then the overhead of loading in and parsing down the regex and processing with it will be much higher than simple string manipulation. Not loading in extra libraries other than what is very core in a language will almost always be faster than going a different route.

If you are willing to load an extra library, this has already been solved in the TMail gem:
http://lindsaar.net/2008/4/13/tip-5-cleaning-up-an-verifying-an-email-address-with-ruby-on-rails
TMail::Address.parse('Mikel A. <spam#lindsaar.net>').spec
=> "spam#lindsaar.net"

Extract email addresses from a block of text

How can I create an array of email addresses contained within a block of text?
I've tried
addrs = text.scan(/ .+?#.+? /).map{|e| e[1...-1]}
but (not surprisingly) it doesn't work reliably.

How about this for a (slightly) better regular expression:
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
You can find this here:
Email Regex
Just an FYI, the problem with your regex is that it allows only one type of separator (a space) before or after an email address. It would also match a lone "#" that is merely separated by spaces from the surrounding words.
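Used with String#scan, the pattern above returns the array asked for in the question. This is a sketch under assumptions: the /i flag is added because the character classes list uppercase letters only, EMAIL_LIKE and the sample text are made up, and the separator is '#' as in the addresses on this page (swap it for '@' to match ordinary addresses).

```ruby
# The pattern from above, made case-insensitive with /i (an addition).
EMAIL_LIKE = /\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Hypothetical sample text.
text = 'Contact joe#example.com or jane.doe#mail.example.org, thanks.'

addrs = text.scan(EMAIL_LIKE)
p addrs # prints ["joe#example.com", "jane.doe#mail.example.org"]
```

Because the pattern has no capturing groups, scan returns whole matches, so no map step is needed to trim them.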
