Regex matching a character different from the first - ruby

I'm trying to use a regex to match a pattern like this:
(any letter) (a different letter) (the same letter again)
so for example:
these are all valid examples:
aba
bcb
dbd
these are not valid:
aab
aaa
bac
I'm trying to do it this way:
(.)[^\1]\1
However, this still matches cases where the second letter is the same as the first letter (e.g. aaa). See here: http://rubular.com/r/TTGEcyhE9g
Is there a way in regex to match any letter except the captured one?

Backreferences are not valid inside character classes. As explained by Wiktor Stribiżew below, a character class defines raw characters, so in your case \1 is read as the \x01 (SOH, Start of Heading) character.
As a workaround, you could use a negative lookahead as follows:
(.)(?!\1).\1
Here, (.) captures any character, (?!\1) asserts that the next character is not the captured one (without consuming it), . then matches that different character, and \1 matches the first character again.
You can learn more about lookahead and lookbehind in the Ruby documentation.
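As a quick check, the lookahead pattern can be run against the examples from the question. This is only a sketch: the constant and method names are made up for illustration, and \A and \z are added on the assumption that the whole string should be exactly three characters.

```ruby
# ABA_PATTERN and aba? are hypothetical names. \A and \z are an added
# assumption so that the entire string must be the three-character pattern.
ABA_PATTERN = /\A(.)(?!\1).\1\z/

def aba?(str)
  str.match?(ABA_PATTERN)
end

# Print the verdict for each example from the question.
%w[aba bcb dbd aab aaa bac].each do |s|
  puts "#{s}: #{aba?(s) ? 'match' : 'no match'}"
end
```

Without the anchors, the same pattern would also find an aba-style triple anywhere inside a longer string.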

If you like using regex, then Wiktor's suggestion has you covered. But, it is easy enough to write a basic Ruby script which does the assertions:
input = "aea hello"
if input[0] == input[2] && input[0] != input[1]
  print "match"
else
  print "no match"
end
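The script above only looks at the first three characters. If the same comparison should be applied at every position, one possible extension (a sketch with a hypothetical helper name, not part of the original answer) slides a three-character window over the string:

```ruby
# Hypothetical helper: returns every three-character window whose first and
# third characters are equal and whose middle character differs.
def aba_windows(str)
  str.each_char.each_cons(3)
     .select { |a, b, c| a == c && a != b }
     .map(&:join)
end

p aba_windows("abacdc")
```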

Related

Split sentence by period followed by a capital letter

I'm trying to find a regex that will split a piece of text into sentences at ./?/! that is followed by a space that is followed by a capital letter.
"Hello there, my friend. In other words, i.e. what's up, man."
should split to:
Hello there, my friend| In other words, i.e. what's up, man|
I can get it to split on ./?/!, but I have no luck getting the space and capital letter criteria.
What I came up with:
.split("/. \s[A-Z]/")
Split a piece of text into sentences based on the criteria that it is a ./?/! that is followed by a space that is followed by a capital letter.
You may use a regex based on a lookahead:
s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/[!?.](?=\s+\p{Lu})/)
See the Ruby demo. In case you also need to split with the punctuation at the end of the string, use /[!?.](?=(?:\s+\p{Lu})|\s*\z)/.
Details:
[!?.] - matches a !, ? or . that is...
(?=\s+\p{Lu}) - (a positive lookahead) followed by 1 or more whitespace characters and then an uppercase letter, immediately to the right of the current location.
See the Rubular demo.
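To see the difference between the two patterns, here is a small sketch. The method name and the include_final keyword are invented for this example; the regexes are the ones given above.

```ruby
# split_sentences and include_final are hypothetical names for illustration.
def split_sentences(text, include_final: false)
  pattern = if include_final
              /[!?.](?=(?:\s+\p{Lu})|\s*\z)/ # also consume punctuation at the end of the string
            else
              /[!?.](?=\s+\p{Lu})/           # only punctuation followed by whitespace + uppercase
            end
  text.split(pattern)
end

s = "Hello there, my friend. In other words, i.e. what's up, man."
p split_sentences(s)
p split_sentences(s, include_final: true)
```

Note that the whitespace after the punctuation is only looked at, not consumed, so the second piece keeps its leading space.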
NOTE: If you need to split regular English text into sentences, you should consider using existing NLP solutions/libraries. See:
Pragmatic Segmenter
srx-english
The latter is based on regex, and can easily be extended with more regular expressions.
Apart from Wiktor's answer, you can also use lookarounds to find a zero-width position and split on it.
Regex: (?<=[.?!]\s)(?=[A-Z]) finds a zero-width position that is preceded by one of [.?!] plus a whitespace character and followed by an uppercase letter.
s = "Hello there, my friend. In other words, i.e. what's up, man."
puts s.split(/(?<=[.?!]\s)(?=[A-Z])/)
Output
Hello there, my friend.
In other words, i.e. what's up, man.
Ruby Demo
Update: Based on Cary Swoveland's comment.
If the OP wanted to break the string into sentences I'd suggest (?<=[.?!])\s+(?=[A-Z]), as it removes spaces between sentences and permits the number of such spaces to be greater than one.
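A short comparison of the two lookaround patterns, as a sketch (the constant and method names are made up here). It uses p rather than puts so that trailing spaces are visible:

```ruby
# Hypothetical names; the two regexes are the ones from the answer.
ZERO_WIDTH   = /(?<=[.?!]\s)(?=[A-Z])/    # splits at a position, keeps the space
SPACE_EATING = /(?<=[.?!])\s+(?=[A-Z])/   # consumes one or more spaces

def split_zero_width(text)
  text.split(ZERO_WIDTH)
end

def split_on_spaces(text)
  text.split(SPACE_EATING)
end

one_space  = "Hello there, my friend. In other words, i.e. what's up, man."
two_spaces = "Hello there, my friend.  In other words, i.e. what's up, man."

p split_zero_width(one_space)
p split_on_spaces(two_spaces)
p split_zero_width(two_spaces)
```

With two spaces between sentences, the zero-width pattern finds no split point at all, because its lookbehind expects the punctuation exactly one whitespace character before the capital letter.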

regexp match group with the exception of a member of the group

So, there are a number of regular expressions that match a particular group of characters, like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exclude one character from it?
Examples:
match all punctuation apart from the question mark
match all whitespace characters apart from the newline
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match 1 or more punctuation symbols other than !.
The puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) code will output ./? (see demo)
Use character class subtraction whenever you need to restrict with single characters, it is more efficient than using lookaheads.
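The first two examples from the question can be written with the same subtraction syntax. A minimal sketch (the constant names are invented for illustration):

```ruby
# Hypothetical constant names; the syntax is the [...&&[^...]] form quoted above.
WS_EXCEPT_NEWLINE     = /[[:space:]&&[^\n]]/  # any whitespace except a newline
PUNCT_EXCEPT_QUESTION = /[[:punct:]&&[^?]]/   # any punctuation except ?

p "a b\tc\nd".scan(WS_EXCEPT_NEWLINE)
p "./?!".scan(PUNCT_EXCEPT_QUESTION)
```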
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the \b marked with ^^ is very important, since it makes the regex engine check for the whole word go. If you remove it, the pattern will also reject every word that merely starts with go.
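The effect of the \b inside the lookahead can be seen on a short sample. This is a sketch with made-up constant names and a made-up sample string:

```ruby
# Hypothetical names. WHOLE_WORD keeps the \b after "go" in the lookahead;
# PREFIX drops it, so any word starting with "go" is rejected as well.
WHOLE_WORD = /\b(?!go\b)\w+\b/
PREFIX     = /\b(?!go)\w+\b/

sample = "go gone ago goat"
p sample.scan(WHOLE_WORD)
p sample.scan(PREFIX)
```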
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuation apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) is a negative lookbehind asserting that the character we are about to match is not preceded by a non-space character. So this matches a # at the start of the string or a # that comes right after a whitespace character.
The difference between my answer and hwnd's str.gsub(/\B#/, '') is that mine won't match the # in :#, but hwnd's does. \B matches between two word characters or between two non-word characters.
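The two patterns agree on the string from the question and differ only when a # follows another non-word character. A small sketch (the method names are invented; "a:#b" is a made-up input to show the difference):

```ruby
# Hypothetical helper names wrapping the two gsub calls discussed above.
def strip_with_lookbehind(str)
  str.gsub(/(?<!\S)#/, '')   # removes # only at the start or after whitespace
end

def strip_with_non_boundary(str)
  str.gsub(/\B#/, '')        # removes # wherever there is no word boundary before it
end

str = "#jackie#test.com, #mike#test.com"
p strip_with_lookbehind(str).split(',').map(&:strip)
p strip_with_non_boundary(str).split(',').map(&:strip)

# Made-up input where the two approaches differ:
p strip_with_lookbehind("a:#b")
p strip_with_non_boundary("a:#b")
```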
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

How can I write a regex in Ruby that will determine if a string meets this criteria?

How can I write a regex in Ruby 1.9.2 that will determine if a string meets this criteria:
Can only include letters, numbers and the - character
Cannot be an empty string, i.e. cannot have a length of 0
Must contain at least one letter
/\A[a-z0-9-]*[a-z][a-z0-9-]*\z/i
It goes like
beginning of string
some (or zero) letters, digits and/or dashes
a letter
some (or zero) letters, digits and/or dashes
end of string
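The breakdown above can be checked against a few inputs. This is a sketch; the constant, the method name and the sample strings are made up for illustration:

```ruby
# Hypothetical names; the regex is the one from the answer.
NAME_PATTERN = /\A[a-z0-9-]*[a-z][a-z0-9-]*\z/i

def valid_name?(str)
  str.match?(NAME_PATTERN)
end

["abc", "a-1", "1-A", "123", "---", "", "a_b"].each do |s|
  puts "#{s.inspect}: #{valid_name?(s)}"
end
```

The letter in the middle of the pattern is what enforces both "not empty" and "at least one letter" at once.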
I suppose these two will help you: /\A[a-z0-9\-]{1,}\z/i and /[a-z]{1,}/i. The first one checks the first two rules and the second one checks the last condition.
No regex:
str.count("a-zA-Z") > 0 && str.count("^a-zA-Z0-9-") == 0
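For reference, here is the String#count version wrapped in a method and tried on a few inputs. The method name and the samples are assumptions made for this sketch:

```ruby
# Hypothetical helper name around the expression from the answer.
def valid_by_count?(str)
  # At least one letter, and no character outside letters, digits and "-".
  # A leading ^ in the selector negates it; the trailing - is a literal dash.
  str.count("a-zA-Z") > 0 && str.count("^a-zA-Z0-9-") == 0
end

["a-1", "123", "", "a_b"].each do |s|
  puts "#{s.inspect}: #{valid_by_count?(s)}"
end
```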
You can take a look at this tutorial for how to use regular expressions in ruby. With regards to what you need, you can use the following:
^[A-Za-z0-9\-]+$
The ^ anchors the match at the beginning of a line. In Ruby, use \A instead to anchor at the very beginning of the string.
The [..] will instruct the regex engine to match any one of the characters they contain.
A-Z means any upper case letter, a-z means any lower case letter and 0-9 means any digit.
The \- will instruct the regex engine to match the -. The \ is used in front of it because - inside square brackets normally denotes a range, so it needs to be escaped.
The $ anchors the match at the end of a line. In Ruby, use \z to anchor at the very end of the string.
The + instructs the regex engine to match what is contained between the square brackets one or more times.
You can also use the i flag to make your search case insensitive, so the regex might become something like this:
/^[a-z0-9\-]+$/i
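In Ruby, ^ and $ match at line boundaries rather than string boundaries, which matters for validation. A small sketch with made-up constant names and a made-up multi-line input:

```ruby
# Hypothetical names. Both patterns use the same character class; only the
# anchors differ.
LINE_ANCHORED   = /^[a-z0-9\-]+$/i
STRING_ANCHORED = /\A[a-z0-9\-]+\z/i

multi_line = "abc\n!!!"
p multi_line.match?(LINE_ANCHORED)    # the first line alone satisfies ^...$
p multi_line.match?(STRING_ANCHORED)  # the whole string must satisfy \A...\z

# Neither pattern checks the "at least one letter" rule from the question:
p "123".match?(STRING_ANCHORED)
```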

Regular expression Unix shell script

I need to filter all lines with words starting with a letter followed by zero or more letters or numbers, but no special characters (basically names which could be used for c++ variable).
egrep '^[a-zA-Z][a-zA-Z0-9]*'
This works fine for words such as "a", "ab10", but it also includes words like "b.b". I understand that the * at the end of the expression is the problem. If I replace * with + (one or more) it skips the words which contain one letter only, so it doesn't help.
EDIT:
I should be more precise. I want to find lines with any number of possible words as described above. Here is an example:
int = 5;
cout << "hello";
//some comments
In that case it should print all of the lines above, as they all include at least one word which fits the described conditions; a line does not have to begin with a letter.
Your solution will look roughly like this example. In this case, the regex requires that the "word" be preceded by space or start-of-line and then followed by space or end-of-line. You will need to modify the boundary requirements (the parenthesized stuff) as needed.
'(^| )[a-zA-Z][a-zA-Z0-9]*( |$)'
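To try the boundary idea without a shell, here is a sketch in Ruby, the language used elsewhere on this page. Array#grep stands in for egrep, and the method name is made up; the pattern itself is the one shown above.

```ruby
# Hypothetical helper: keeps the lines that contain a "word" delimited by
# a space or the start/end of the line, as in the egrep pattern above.
BOUNDED_WORD = /(^| )[a-zA-Z][a-zA-Z0-9]*( |$)/

def lines_with_word(lines)
  lines.grep(BOUNDED_WORD)
end

lines = ["int = 5;", 'cout << "hello";', "//some comments", "b.b", "ab10"]
puts lines_with_word(lines)
```

Only "b.b" is filtered out here, because neither b is followed by a space or the end of the line.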
Assuming the line ends after the word:
'^[a-zA-Z][a-zA-Z0-9]+$|^[a-zA-Z]$'
You have to add something after the word: either allow the rest of the line to be whitespace, or simply append the end-of-line anchor (as far as I remember, that is $).
Your problem lies in the ^ and $ anchors that match the start and end of the line respectively. You want the line to match if it contains a word; getting rid of the anchors does what you want:
egrep '[a-zA-Z][a-zA-Z0-9]+'
Note that the + only matches words of length 2 and higher; a * in that place would match single characters too.
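The difference between + and * in the unanchored pattern can be sketched in Ruby, with Array#grep standing in for egrep. The constant names and sample lines are assumptions made for this example:

```ruby
# Hypothetical names; the character classes are the ones from the answer.
TWO_OR_MORE = /[a-zA-Z][a-zA-Z0-9]+/  # a letter plus at least one more letter/digit
ONE_OR_MORE = /[a-zA-Z][a-zA-Z0-9]*/  # a single letter is enough

lines = ["int = 5;", "x = 1;", "42;"]
p lines.grep(TWO_OR_MORE)
p lines.grep(ONE_OR_MORE)
```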

Resources