Regular Expression of only spaces, letters, and numbers no special characters - ruby

I am trying to make a regular expression that validates letters, numbers and spaces ONLY. I don't want any special characters accepted (e.g. # )(*&^%$#!,)
I have been trying several things but nothing has given me letters(uppercase & lowercase), numbers, and spaces.
So it should accept something like this...
John Stevens 12
james stevens
willcall12
or
12cell space
but not this
12cell space!
John#Stevens
james 12 Fall#
I have tried the following
^[a-zA-Z0-9]+$
[\w _]+
^[\w_ ]+$
but they allow special characters or don't allow spaces. This is for a Ruby validation.

You almost got it right. You could use this:
/\A[a-z0-9\s]+\Z/i
\s matches whitespace characters, including tab and newline. Use a literal space inside the square brackets instead if only the space character should be accepted.
/i at the end means match is not case sensitive.
Take a look at Rubular for testing your regexes.
EDIT: As pointed out by Jesus Castello, for some scenarios one should use \A and \Z instead of ^ and $ to denote string boundaries. See Difference between \A \Z and ^ $ in Ruby regular expressions for the explanation.
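To see why the anchors matter, here is a minimal sketch that compares line anchors with string anchors on a multi-line input. The constant names and sample strings are made up for illustration. It also shows \z next to \Z, since \Z still matches just before a single trailing newline.

```ruby
# Hypothetical names; the character class follows the answer above,
# with a literal space instead of \s.
LINE_ANCHORED   = /^[a-z0-9 ]+$/i
STRING_ANCHORED = /\A[a-z0-9 ]+\z/i
TRAILING_NL_OK  = /\A[a-z0-9 ]+\Z/i

multi_line = "John Stevens\n<script>"

# ^ and $ match at line boundaries, so one clean line is enough to pass.
line_result = LINE_ANCHORED.match?(multi_line)

# \A and \z match only at the very start and very end of the string.
string_result = STRING_ANCHORED.match?(multi_line)

# \Z tolerates one trailing newline; \z does not.
upper_z_result = TRAILING_NL_OK.match?("John Stevens\n")
lower_z_result = STRING_ANCHORED.match?("John Stevens\n")

p [line_result, string_result, upper_z_result, lower_z_result]
```

For validating user input, the \A...\z form is usually the safer choice, assuming trailing newlines should be rejected as well.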

Here is a working example that will print matching results:
VALIDATION = /\A[a-zA-Z0-9 ]+\Z/
words = ["willcall12", "John Stevens 12", "12cell space!", "John#Stevens"]
words.each do |word|
  m = word.match(VALIDATION)
  puts m[0] if m
end
I can recommend this article if you would like to learn more about regular expressions.
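As a variation on the example above, the sketch below sorts the question's sample inputs into accepted and rejected lists. The constant name and the sample array are assumptions for illustration, and \z is used so that a trailing newline is rejected too.

```ruby
# Hypothetical constant and sample list, modeled on the question's examples.
ALLOWED = /\A[a-zA-Z0-9 ]+\z/

samples = ["John Stevens 12", "james stevens", "willcall12", "12cell space",
           "12cell space!", "John#Stevens", "james 12 Fall#", ""]

# partition returns two arrays: elements for which the block is truthy,
# followed by all the others.
accepted, rejected = samples.partition { |s| ALLOWED.match?(s) }

puts "accepted: #{accepted.inspect}"
puts "rejected: #{rejected.inspect}"
```

The empty string lands in the rejected list because + requires at least one character.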

Related

Split sentence by period followed by a capital letter

I'm trying to find a regex that will split a piece of text into sentences at ./?/! that is followed by a space that is followed by a capital letter.
"Hello there, my friend. In other words, i.e. what's up, man."
should split to:
Hello there, my friend| In other words, i.e. what's up, man|
I can get it to split on ./?/!, but I have no luck getting the space and capital letter criteria.
What I came up with:
.split("/. \s[A-Z]/")
Split a piece of text into sentences based on the criteria that it is a ./?/! that is followed by a space that is followed by a capital letter.
You may use a regex based on a lookahead:
s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/[!?.](?=\s+\p{Lu})/)
See the Ruby demo. In case you also need to split with the punctuation at the end of the string, use /[!?.](?=(?:\s+\p{Lu})|\s*\z)/.
Details:
[!?.] - matches a !, ? or . that is...
(?=\s+\p{Lu}) - (a positive lookahead) followed with 1+ whitespaces followed with 1 uppercase letter immediately to the right of the current location.
See the Rubular demo.
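A short sketch of what the two patterns above return for the sample string. The variable names are made up for illustration. Note that the punctuation is consumed by the split, while the whitespace checked in the lookahead stays in the next piece.

```ruby
s = "Hello there, my friend. In other words, i.e. what's up, man."

# The matched punctuation is removed; the lookahead consumes nothing.
basic = s.split(/[!?.](?=\s+\p{Lu})/)

# Variant that also splits at punctuation that ends the string.
with_end = s.split(/[!?.](?=(?:\s+\p{Lu})|\s*\z)/)

# Later pieces keep their leading space, so strip them if that matters.
p basic
p with_end.map(&:strip)
```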
NOTE: If you need to split regular English text into sentences, you should consider using existing NLP solutions/libraries. See:
Pragmatic Segmenter
srx-english
The latter is based on regex, and can easily be extended with more regular expressions.
Apart from Wiktor's answer, you can also use lookarounds to find a zero-width position and split on it.
Regex: (?<=[.?!]\s)(?=[A-Z]) finds a zero-width position that is preceded by one of [.?!] plus a whitespace character and followed by an uppercase letter.
s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/(?<=[.?!]\s)(?=[A-Z])/)
Output
Hello there, my friend.
In other words, i.e. what's up, man.
Ruby Demo
Update: Based on Cary Swoveland's comment.
If the OP wanted to break the string into sentences I'd suggest (?<=[.?!])\s+(?=[A-Z]), as it removes the spaces between sentences and permits the number of such spaces to be greater than one.
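A minimal sketch of that last pattern on a slightly longer input. The sample text is invented for illustration and contains a double space and an abbreviation.

```ruby
text = "Hello there, my friend.  In other words, i.e. what's up, man. Really? Yes!"

# The lookbehind keeps the punctuation with its sentence, \s+ swallows the
# gap, and the lookahead requires an uppercase ASCII letter to follow.
sentences = text.split(/(?<=[.?!])\s+(?=[A-Z])/)

puts sentences
```

The "i.e. what's" part is not split because a lowercase letter follows the space. An abbreviation followed by a capitalized word would still be split, which is one reason a dedicated segmenter may be the better tool for real text.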

regexp match group with the exception of a member of the group

So, there are a number of regular expressions that match a particular group of characters, like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exclude one character from it?
Examples:
match all punctuations apart from the question mark
match all whitespace characters apart from the new line
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
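The same subtraction syntax covers the first two scenarios from the question. This is a small sketch with made-up sample strings:

```ruby
# Whitespace except the newline: intersect [[:space:]] with "not \n".
ws_no_newline = "a b\tc\nd".scan(/[[:space:]&&[^\n]]/)

# Punctuation except the question mark.
punct_no_qmark = ".?!,".scan(/[[:punct:]&&[^?]]/)

p ws_no_newline
p punct_no_qmark
```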
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match 1 or more punctuation symbols except !.
The puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) code will output ./? (see demo)
Use character class subtraction whenever you need to exclude single characters; it is more efficient than using lookaheads.
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will evaluate the lookahead at every position in the input string. If you put a \b at the beginning, the lookahead is only evaluated at word boundary positions, and the pattern will perform better. Also note that the \b marked with ^^ (the one inside the lookahead) is very important, since it makes the regex engine check for the whole word go. If you remove it, the pattern will reject every word that starts with go, not just the word go itself.
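A short sketch of the difference the inner \b makes, using an invented sample string:

```ruby
text = "go gone ago going"

# With \b inside the lookahead, only the exact word "go" is excluded.
whole_word_only = text.scan(/\b(?!go\b)\w+\b/)

# Without it, every word that merely starts with "go" is excluded too.
prefix_excluded = text.scan(/\b(?!go)\w+\b/)

p whole_word_only
p prefix_excluded
```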
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuations apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

Matching only a single standalone letter

I'm trying to write a regular expression that matches only a single standalone letter, such as a, C, f, G, but NOT abc or de, for instance.
I tried [a-zA-z], but all of the above match.
What should I do in this case?
^[a-zA-Z]$
Add the ^ and $ anchors to limit the match to a string that consists of exactly one letter.
or
(?:^|(?<=[^a-zA-Z]))[a-zA-Z](?=[^a-zA-Z]|$)
There are several ways to do this, depending on your content. This could work:
[^a-zA-Z][a-zA-Z][^a-zA-Z]
Or there's a regex code for that, the \b:
\b[a-zA-Z]\b
which is more useful since it allows matches at the start and end of a line.
Your regex [a-zA-z] matches not only letters but also matches [, ], \, ^, _ and `. Moreover, it has no anchors and thus will match both a and t in at.
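The A-z range problem is easy to check, since the six extra characters sit between Z (code 90) and a (code 97) in ASCII. A quick sketch, with made-up variable names:

```ruby
# ASCII codes 91..96 are the characters [ \ ] ^ _ and the backtick.
between = (91..96).map(&:chr)

# The mistyped range A-z accepts all six of them.
matched_by_typo = between.grep(/[a-zA-z]/)

# The intended range A-Z accepts none.
matched_by_fix = between.grep(/[a-zA-Z]/)

p matched_by_typo
p matched_by_fix
```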
You can make use of the POSIX bracket expression alpha to match a single letter substring together with a word boundary \b:
puts 'a,C,f,G, but, NOT abc de'.scan(/\b[[:alpha:]]\b/)
See IDEONE demo
Output:
a
C
f
G
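One caveat with \b is that digits and underscores count as word characters, so a letter directly next to a digit is not seen as standalone. If such letters should match too, a lookaround-based variant is one option. The sketch below uses an invented sample string and assumes that "standalone" means "not adjacent to another letter".

```ruby
text = "x1 a bc D"

# \b treats "x1" as one word, so x is skipped.
with_boundary = text.scan(/\b[[:alpha:]]\b/)

# The lookarounds only require that no letter is adjacent.
with_lookarounds = text.scan(/(?<![[:alpha:]])[[:alpha:]](?![[:alpha:]])/)

p with_boundary
p with_lookarounds
```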

Regular expression to find first letter in a string

Consider this example string:
mystr ="1. moody"
I want to capitalize the first letter that occurs in mystr. I am trying this regular expression in Ruby, but it still returns all the letters in mystr (moody) instead of the letter m only.
puts mystr.scan(/[a-zA-Z]{1}/)
Any help appreciated!
Do as below using String#sub
(arup~>~)$ pry --simple-prompt
>> s = "1. moody"
=> "1. moody"
>> s.sub(/[a-z]/i,&:upcase)
=> "1. Moody"
>>
If you want to modify the source string, use s.sub!(/[a-z]/i, &:upcase).
Just for completeness, although it doesn’t directly answer your question as posed but could be relevant, consider this variation:
mystr ="1. école"
The line mystr.sub(/[a-z]/i,&:upcase) (as in Arup Rakshit’s answer) will match the second letter of the word, producing
1. éCole
The line mystr.sub /\b\s?[a-zA-Z]{1}/, &:upcase (diego.greyrobot’s answer) won’t match at all and so the line will be unchanged.
There are two problems here. The first is that [a-zA-Z] doesn’t match accented characters, so é isn’t matched. The fix for this is to use the \p{Letter} character property:
mystr.sub /\p{Letter}/, &:upcase
This will match the character in question, but on Ruby versions before 2.4 it won't change it. This is due to the second problem: in those versions, upcase (and downcase) only work on characters in the ASCII range. (Since Ruby 2.4, both methods perform Unicode case mapping.) On older versions this is almost as easy to fix, but relies on using an external library such as unicode_utils:
require 'unicode_utils'
mystr.sub(/\p{Letter}/) { |c| UnicodeUtils.upcase(c)}
This results in:
1. École
which is probably what is wanted in this case.
This may not affect you if you are sure all your data is just ASCII, but is worth knowing for other situations.
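For reference, a minimal sketch of the same comparison without an external library. It assumes Ruby 2.4 or later, where String#upcase handles non-ASCII letters, and a UTF-8 source file. \p{L} is the short form of the letter property.

```ruby
# Assumes Ruby 2.4+ and UTF-8 source encoding (the default).
mystr = "1. école"

# [a-z] skips the accented letter and hits the c instead.
ascii_only = mystr.sub(/[a-z]/i, &:upcase)

# \p{L} matches any Unicode letter, so the é is found first.
any_letter = mystr.sub(/\p{L}/, &:upcase)

puts ascii_only
puts any_letter
```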
The reason your attempt returns all the letters is because you are using the scan method which does just that, it returns all the characters which match the regex, in your case letters. For your use case you should use sub since you only want to substitute 1 letter.
I use http://rubular.com to practice my Ruby Regexes. Here's what I came up with http://rubular.com/r/fAQEDFVEVn
The regex is: /\b[a-z]/
It uses \b to find a word boundary, and then asks for exactly one letter with [a-z].
Finally we'll use sub to replace it with its upcased version:
"1. moody".sub /\b[a-z]/, &:upcase
=> "1. Moody"
Hope that helps.
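To make the difference between the methods concrete, here is a small sketch. The variable names are illustrative only.

```ruby
mystr = "1. moody"

# scan collects every match in the string.
all_letters = mystr.scan(/[a-zA-Z]/)

# String#[] with a regex returns only the first match (or nil).
first_letter = mystr[/[a-zA-Z]/]

# sub replaces only the first match and returns a new string.
capitalized = mystr.sub(/[a-zA-Z]/, &:upcase)

p all_letters
p first_letter
p capitalized
```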

Strip words beginning with a specific letter from a sentence using regex

I'm not sure how to use regular expressions in a function so that I could grab all the words in a sentence starting with a particular letter. I know that I can do:
word =~ /^#{letter}/
to check if the word starts with the letter, but how do I go from word to word. Do I need to convert the string to an array and then iterate through each word or is there a faster way using regex? I'm using ruby so that would look like:
matching_words = Array.new
sentence.split(" ").each do |word|
  matching_words.push(word) if word =~ /^#{letter}/
end
Scan may be a good tool for this:
#!/usr/bin/ruby1.8
s = "I think Paris in the spring is a beautiful place"
p s.scan(/\b[it][[:alpha:]]*/i)
# => ["I", "think", "in", "the", "is"]
\b means "word boundary".
[[:alpha:]] matches an upper- or lowercase letter.
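Since the question wants the letter to be a variable, the scan approach can be wrapped in a small helper. This is only a sketch: the method name words_starting_with is hypothetical, and Regexp.escape is used so that a character such as "." is treated literally.

```ruby
# Hypothetical helper: returns the words in `sentence` that start with
# `letter`, ignoring case.
def words_starting_with(sentence, letter)
  # Regexp.escape keeps regex metacharacters in `letter` from being
  # interpreted as pattern syntax.
  sentence.scan(/\b#{Regexp.escape(letter)}[[:alpha:]]*/i)
end

s = "I think Paris in the spring is a beautiful place"
p words_starting_with(s, "i")
p words_starting_with(s, "t")
```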
You can use \b. It matches word boundaries--the invisible spot just before and after a word. (You can't see them, but oh they're there!) Here's the regex:
/\b(a\w*)\b/
The \w matches a word character, like letters and digits and stuff like that.
You can see me testing it here: http://rubular.com/regexes/13347
Similar to Anon.'s answer:
/\b(a\w*)/g
and then see all the results with (usually) $n, where n is the n-th hit. Many libraries will return /g results as arrays on the $n-th set of parenthesis, so in this case $1 would return an array of all the matching words. You'll want to double-check with whatever library you're using to figure out how it returns matches like this, there's a lot of variation on global search returns, sadly.
As to the \w vs [a-zA-Z], you can sometimes get faster execution by using the built-in definitions of things like that, as it can easily have an optimized path for the preset character classes.
The /g at the end makes it a "global" search, so it'll find more than one match. It's still restricted by line in some languages / libraries, though, so if you wish to check an entire file you'll sometimes need /gm to make it multi-line.
If you want to remove results, like your title (but not question) suggests, try:
/\ba\w*//g
which does a search-and-replace in most languages (/<search>/<replacement>/). Sometimes you need a "s" at the front. Depends on the language / library. In Ruby's case, use:
string.gsub(/(\b)a\w*(\b)/, "\\1\\2")
to retain the non-word characters, and optionally put any replacement text between \1 and \2. gsub for global, sub for the first result.
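A short sketch of the removal case in Ruby, using an invented sentence. Removing the words leaves their surrounding spaces behind, so the example tidies those up afterwards with squeeze and strip; that cleanup step is an addition for illustration, not part of the answer above.

```ruby
text = "an apple fell today"

# Remove every word that starts with a lowercase "a".
removed = text.gsub(/\ba\w*\b/, "")

# Collapse the leftover runs of spaces and trim the ends.
tidy = removed.squeeze(" ").strip

p removed
p tidy
```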
/\ba[a-z]*\b/i
will match any word starting with 'a'.
The \b indicates a word boundary - we want to only match starting from the beginning of a word, after all.
Then there's the character we want our word to start with.
Then we have as many as possible letter characters, followed by another word boundary.
To match all words starting with t, use:
\bt\w+
That will match test but not footest; \b means "word boundary".
Personally, I think regex is overkill for this application; simply running a select is more than capable of solving this particular problem.
"this is a test".split(' ').select{ |word| word[0,1] == 't' }
result => ["this", "test"]
or if you are determined to use regex then go with grep
"this is a test".split(' ').grep(/^t/)
result => ["this", "test"]
Hope this helps.
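A sketch of the same idea with String#start_with?, which avoids both the substring slice and the regex. The variable names are made up; split with no argument splits on whitespace.

```ruby
words = "this is a test".split

# Plain prefix check, no regex involved.
by_prefix = words.select { |w| w.start_with?("t") }

# Equivalent regex filter, anchored to the start of each word.
by_regex = words.grep(/\At/)

p by_prefix
p by_regex
```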

Resources