Ruby gsub to remove illegal characters in phone number - ruby

I want to remove all characters in a string that do not belong in a phone number string. The first character may or may not be a "+" and all other characters must be digits.
I had gsub(/\D/, ''), but I want to keep the first character if it is a "+" (or a digit, of course). I then tried some negation, but this is not right, either: gsub(/^(\+?(\d))/, '').
How can I ignore the first character with regex iff it is a "+"?

How about using a negative lookahead at the beginning:
gsub(/(?!^\+)\D*/, '')
Basically, the above regex should remove any series of non-digits where the first character is not a single '+' character at the beginning of the string.
Hope it helps.

Unless you absolutely have to do it in one gsub, it might be simpler to pull the plus sign out separately. You could use the [] method, with something like:
my_string[/^\+/].to_s + my_string.gsub(/\D/, '')
The to_s is necessary since the method will return nil if the plus sign isn't found.

Related

regexp match group with the exception of a member of the group

So, there are a number of regular expression which matches a particular group like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exempt a character out?
Examples:
match all punctuations apart from the question mark
match all whitespace characters apart from the new line
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match 1 or more punctuation symbols but a !.
The puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) code will output ./? (see demo)
Use character class subtraction whenever you need to restrict with single characters, it is more efficient than using lookaheads.
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the ^^ \b is very important since it makes the regex engine check for the whole word go. If you remove it, it will only restrict to the words that do not start with go.
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuations apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

Replace non-word characters, unless given sequence matches

I have a string like this:
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
I want to replace all non-word characters (symbols and whitespace), except the ### delimiters.
I'm currently using:
str.gsub(/[^\w#]+/, 'X')
which yields:
"JimXBobXsXemailX###hl###address###endhl###XisXjb#exampleXcom"
In practice, this is good enough, but it offends me for two reasons:
The # in the email address is not replaced.
The use of [^\w] instead of \W feels sloppy.
How do I replace all non-word characters, unless those characters make up the ###hl### or ###endhl### delimiter strings?
str.gsub(/(###.*?###|\w+)|./) { $1 || "X" }
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
This approach uses the fact that alternations work like case structure: the first matching one consumes the corresponding string, then no further matching is done on it. Thus, ###.*?### will consume a marker (like ###hl###; nothing else will be matched inside it. We also match any sequence of word characters. If any of those are captured, we can just return them as-is ($1). If not, then we match any other character (i.e. not inside a marker, and not a word character) and replace it with "X".
Regarding your second point, I think you are asking too much; there is no simple way to avoid that.
Regarding the first point, a simple way is to temporarily replace "###" with a character that you will never use (let's say you are using a system without "\r", so that that character is not used; we can use that as a temporal replacement).
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
.gsub("###", "\r").gsub(/[^\w\r]/, "X").gsub("\r", "###")
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) Negative lookbehind asserts that the character or substring we are going to match would be preceeded by any but not of a non-space character. So this matches the # which exists at the start or the # which exists next to a space character.
Difference between my answer and hwnd's str.gsub(/\B#/, '') is, mine won't match the # which exists in :# but hwnd's answer does. \B matches between two word characters or two non-word characters.
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

Regex matching plus or minus

Could someone please look at the following function and explain the regex for me as I don't understand it and I don't like using something I don't understand as then I won't be able to replicate it for use in the future and nor do I learn from it.
Also can someone explain the double !! in front, I know single means not so does double mean not "not"?
The function is a patch to String to check if it's capable of being converted to an integer or not.
class String
def is_i?
!!(self =~ /\A[-+]?[0-9]+\z/)
end
end
The main thing that's giving me trouble is [-+] as it makes little sense to me, if you could explain in the context given it would be very helpful.
EDIT:
Since people missed the second part of the question I'll be a little more explicit.
What does !! Mean in front of the check, I know a single ! means NOT but I can't find what !! means.
The [-+] Character Class
[-+] is a character class. It means "match one character specified by the class", i.e. - or +.
Hyphens in Character Classes
I can see how this particular class can be confusing because the hyphen often plays a special role in a character class: it links two characters to form a character range. For instance, [a-z] means "match one character between a and z, and [a-z0-9] means "match one character between a and z or between 0 and 9.
However, in this case, the hypen in [-+] is positioned in a place where it cannot be used to specify a range, and the - is just a literal hyphen.
Decoding the entire expression
Assert position at the beginning of the string \A
Match a single character from the list “-+” [-+]?
Between zero and one times, as many times as possible, giving back as needed (greedy) ?
Match a single character in the range between “0” and “9” [0-9]+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
Assert position at the very end of the string \z
A Character Class defines a set of characters, any one of which can occur in a string for a match to succeed.
For example, the regular expression [-+]?[0-9]+ will match 123, -123, or +123 because it defines a character class (accepting either -, +, or neither one) as its first character.
In context:
\A asserts position at the start of the string.
[-+] any character of: - or + (? optional, meaning between zero and one time)
[0-9] any character of: 0 to 9 (+ quantifier meaning 1 or more times)
\z asserts position at the very end of the string.
What does !! mean?
!! placed together converts the value to a boolean.
explain the regex for me as I don't understand it
Pattern explanation: \A[-+]?[0-9]+\z
\A Start of string
[-+]? plus or minus sign [zero or one time (optional)]
[0-9]+ 0 to 9 any digit [one or more times]
\z End of string
The above regex pattern is able to match any positive and negative integer number that has + or - sign optional.
Read more about Character Classes and test your regex pattern online at Rubular

How can I write a regex in Ruby that will determine if a string meets this criteria?

How can I write a regex in Ruby 1.9.2 that will determine if a string meets this criteria:
Can only include letters, numbers and the - character
Cannot be an empty string, i.e. cannot have a length of 0
Must contain at least one letter
/\A[a-z0-9-]*[a-z][a-z0-9-]*\z/i
It goes like
beginning of string
some (or zero) letters, digits and/or dashes
a letter
some (or zero) letters, digits and/or dashes
end of string
I suppose these two will help you: /\A[a-z0-9\-]{1,}\z/i and /[a-z]{1,}/i. The first one checks on first two rules and the second one checks for the last condition.
No regex:
str.count("a-zA-Z") > 0 && str.count("^a-zA-Z0-9-") == 0
You can take a look at this tutorial for how to use regular expressions in ruby. With regards to what you need, you can use the following:
^[A-Za-z0-9\-]+$
The ^ will instruct the regex engine to start matching from the very beginning of the string.
The [..] will instruct the regex engine to match any one of the characters they contain.
A-Z mean any upper case letter, a-z means any lower case letter and 0-9 means any number.
The \- will instruct the regex engine to match the -. The \ is used infront of it because the - in regex is a special symbol, so it needs to be escaped
The $ will instruct the regex engine to stop matching at the end of the line.
The + instructs the regex engine to match what is contained between the square brackets one or more time.
You can also use the \i flag to make your search case insensitive, so the regex might become something like this:
^[a-z0-9\-]+/i$

Resources