Skipping Characters in Regex - ruby

I have the following data
Animals = Dog Cat Turtle \
Mouse Parrot \
Snake
I would like the regex to construct a match of just the animals with none of the backslashes: Dog Cat Turtle Mouse Parrot Snake
I've got a regex, but need some help finishing it off.
/ANIMALS\s*=\s*([^\\\n]*)/

Since you specified a language, I need to ask you this: Why are you relying on the regex for everything? Don't make the problem harder than it has to be by forcing the regex to do everything.
Try this approach instead...
Use gsub! to get rid of the backslashes.
split the string on any whitespace.
Shift out the first two tokens ("Animals", "=").
Join the array with a single space, or whatever other delimiter.
So the only regex you might need is one for the whitespace delimiter split. I don't know Ruby well enough to say exactly how you would do that, though.

How 'bout the regex \b(?!Animals\b)\w+\b which matches all words that aren't Animals? Use the scan method to collect all such matches, e.g.
matchArray = sourceString.scan(/\b(?!Animals\b)\w+\b/)

Make sure you are matching with ignore-case, because ANIMALS will not match Animals without it.

Related

Non-greedy regex for watir-webdriver

Say I want to get rid of the first occurrence of 'my' in the following string:
my Marsha Tammy
My current regex setup is greedy I think:
.sub(/my/,"")
Which gets rid of all instances. Data will look like this:
my Bill Port
my Samy Gonzalez
my Ulm Germany
Only want first occurrence of 'my' gone.
Try .sub(/^my/,"")
This is a working example:
https://regex101.com/r/tD2jI2/1
EDIT: Or even better - .sub(/^my /,"") to get rid of the trailing space
According to the Ruby docs for String#sub:
Returns a copy of str with the first occurrence of pattern replaced by the second argument.
So, you should be in the clear in terms of it only replacing the first instance of your regexp. If you wanted it to replace all instances then you need to use String#gsub.

Regex negative lookbehinds with a wildcard

I'm trying to match some text if it does not have another block of text in its vicinity. For example, I would like to match "bar" if "foo" does not precede it. I can match "bar" if "foo" does not immediately precede it using negative look behind in this regex:
/(?<!foo)bar/
but I also like to not match "foo 12345 bar". I tried:
/(?<!foo.{1,10})bar/
but using a wildcard + a range appears to be an invalid regex in Ruby. Am I thinking about the problem wrong?
You are thinking about it the right way. But unfortunately lookbehinds usually have be of fixed-length. The only major exception to that is .NET's regex engine, which allows repetition quantifiers inside lookbehinds. But since you only need a negative lookbehind and not a lookahead, too. There is a hack for you. Reverse the string, then try to match:
/rab(?!.{0,10}oof)/
Then reverse the result of the match or subtract the matching position from the string's length, if that's what you are after.
Now from the regex you have given, I suppose that this was only a simplified version of what you actually need. Of course, if bar is a complex pattern itself, some more thought needs to go into how to reverse it correctly.
Note that if your pattern required both variable-length lookbehinds and lookaheads, you would have a harder time solving this. Also, in your case, it would be possible to deconstruct your lookbehind into multiple variable length ones (because you use neither + nor *):
/(?<!foo)(?<!foo.)(?<!foo.{2})(?<!foo.{3})(?<!foo.{4})(?<!foo.{5})(?<!foo.{6})(?<!foo.{7})(?<!foo.{8})(?<!foo.{9})(?<!foo.{10})bar/
But that's not all that nice, is it?
As m.buettner already mentions, lookbehind in Ruby regex has to be of fixed length, and is described so in the document. So, you cannot put a quantifier within a lookbehind.
You don't need to check all in one step. Try doing multiple steps of regex matches to get what you want. Assuming that existence of foo in front of a single instance of bar breaks the condition regardless of whether there is another bar, then
string.match(/bar/) and !string.match(/foo.*bar/)
will give you what you want for the example.
If you rather want the match to succeed with bar foo bar, then you can do this
string.scan(/foo|bar/).first == "bar"

How do I split names with a regular expression?

I am new to ruby and regular expressions and trying to figure out how to attack seperating the attached string of baseball players into first/last name combinations.
This is a sample string:
"JohnnyCuetoJ.J.PutzBrianMcCann"
This is the desired output:
Johnny Cueto
J.J. Putz
Brian McCann
I have figured out how to separate by capital letters which gets me close, but the outlier names like J.J. and McCann mess that pattern up. Anyone have ideas on the best way to approach this?
If you don't have to do it in one single gsub than it gets a bit easier.
string = "JohnnyCuetoJ.J.PutzBrianMcCann"
string.gsub!(/([A-Z][^A-Z]+)/, '\1 ') # separate by capital letters
string.gsub!(/(\.) ([A-Z]\.)/, '\1\2') # paste together "J. J." -> "J.J."
string.gsub!(/Mc /, 'Mc') # Remove the space in "Mc "
string.strip # Remove the extra space after "Cann "
...and of course you can put this on a single line by chaining the gsub calls, but that will basically kill the readability of the code (but on the other hand, how readable is a block of regexen anyway?)

How to conflate consecutive gsubs in ruby

I have the following
address.gsub(/^\d*/, "").gsub(/\d*-?\d*$/, "").gsub(/\# ?\d*/,"")
Can this be done in one gsub? I would like to pass a list of patterns rather then just one pattern - they are all being replaced by the same thing.
You could combine them with an alternation operator (|):
address = '6 66-666 #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
# " 66-666 "
address = 'pancakes 6 66-666 # pancakes #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/,"")
# "pancakes 6 66-666 pancakes "
You might want to add little more whitespace cleanup. And you might want to switch to one of:
/\A\d*|\d*-?\d*\z|\# ?\d*/
/\A\d*|\d*-?\d*\Z|\# ?\d*/
depending on what your data really looks like and how you need to handle newlines.
Combining the regexes is a good idea--and relatively simple--but I'd like to recommend some additional changes. To wit:
address.gsub(/^\d+|\d+(?:-\d+)?$|\# *\d+/, "")
Of your original regexes, ^\d* and \d*-?\d*$ will always match, because they don't have to consume any characters. So you're guaranteed to perform two replacements on every line, even if that's just replacing empty strings with empty strings. Of my regexes, ^\d+ doesn't bother to match unless there's at least one digit at the beginning of the line, and \d+(?:-\d+)?$ matches what looks like an integer-or-range expression at the end of the line.
Your third regex, \# ?\d*, will match any # character, and if the # is followed by a space and some digits, it'll take those as well. Judging by your other regexes and my experience with other questions, I suspect you meant to match a # only if it's followed by one or more digits, with optional spaces intervening. That's what my third regex does.
If any of my guesses are wrong, please describe what you were trying to do, and I'll do my best to come up with the right regex. But I really don't think those first two regexes, at least, are what you want.
EDIT (in answer to the comment): When working with regexes, you should always be aware of the distinction between a regex the matches nothing and a regex that doesn't match. You say you're applying the regexes to street addresses. If an address doesn't happen to start with a house number, ^\d* will match nothing--that is, it will report a successful match, said match consisting of the empty string preceding the first character in the address.
That doesn't matter to you, you're just replacing it with another empty string anyway. But why bother doing the replacement at all? If you change the regex to ^\d+, it will report a failed match and no replacement will be performed. The result is the same either way, but the "matches noting" scenario (^\d*) results in a lot of extra work that the "doesn't match" scenario avoids. In a high-throughput situation, that could be a life-saver.
The other two regexes bring additional complications: \d*-?\d*$ could match a hyphen at the end of the string (e.g. "123-", or even "-"); and \# ?\d* could match a hash symbol anywhere in string, not just as part of an apartment/office number. You know your data, so you probably know neither of those problems will ever arise; I'm just making sure you're aware of them. My regex \d+(?:-\d+)?$ deals with the trailing-hyphen issue, and \# *\d+ at least makes sure there are digits after the hash symbol.
I think that if you combine them together in a single gsub() regex, as an alternation,
it changes the context of the starting search position.
Example, each of these lines start at the beginning of the result of the previous
regex substitution.
s/^\d*//g
s/\d*-?\d*$//g
s/\# ?\d*//g
and this
s/^\d*|\d*-?\d*$|\# ?\d*//g
resumes search/replace where the last match left off and could potentially produce a different overall output, especially since a lot of the subexpressions search for similar
if not the same characters, distinguished only by line anchors.
I think your regex's are unique enough in this case, and of course changing the order
changes the result.

Strip words beginning with a specific letter from a sentence using regex

I'm not sure how to use regular expressions in a function so that I could grab all the words in a sentence starting with a particular letter. I know that I can do:
word =~ /^#{letter}/
to check if the word starts with the letter, but how do I go from word to word. Do I need to convert the string to an array and then iterate through each word or is there a faster way using regex? I'm using ruby so that would look like:
matching_words = Array.new
sentance.split(" ").each do |word|
matching_words.push(word) if word =~ /^#{letter}/
end
Scan may be a good tool for this:
#!/usr/bin/ruby1.8
s = "I think Paris in the spring is a beautiful place"
p s.scan(/\b[it][[:alpha:]]*/i)
# => ["I", "think", "in", "the", "is"]
\b means 'word boundary."
[:alpha:] means upper or lowercase alpha (a-z).
You can use \b. It matches word boundaries--the invisible spot just before and after a word. (You can't see them, but oh they're there!) Here's the regex:
/\b(a\w*)\b/
The \w matches a word character, like letters and digits and stuff like that.
You can see me testing it here: http://rubular.com/regexes/13347
Similar to Anon.'s answer:
/\b(a\w*)/g
and then see all the results with (usually) $n, where n is the n-th hit. Many libraries will return /g results as arrays on the $n-th set of parenthesis, so in this case $1 would return an array of all the matching words. You'll want to double-check with whatever library you're using to figure out how it returns matches like this, there's a lot of variation on global search returns, sadly.
As to the \w vs [a-zA-Z], you can sometimes get faster execution by using the built-in definitions of things like that, as it can easily have an optimized path for the preset character classes.
The /g at the end makes it a "global" search, so it'll find more than one. It's still restricted by line in some languages / libraries, though, so if you wish to check an entire file you'll sometimes need /gm, to make it multi-line
If you want to remove results, like your title (but not question) suggests, try:
/\ba\w*//g
which does a search-and-replace in most languages (/<search>/<replacement>/). Sometimes you need a "s" at the front. Depends on the language / library. In Ruby's case, use:
string.gsub(/(\b)a\w*(\b)/, "\\1\\2")
to retain the non-word characters, and optionally put any replacement text between \1 and \2. gsub for global, sub for the first result.
/\ba[a-z]*\b/i
will match any word starting with 'a'.
The \b indicates a word boundary - we want to only match starting from the beginning of a word, after all.
Then there's the character we want our word to start with.
Then we have as many as possible letter characters, followed by another word boundary.
To match all words starting with t, use:
\bt\w+
That will match test but not footest; \b means "word boundary".
Personally i think that regex is overkill for this application, simply running a select is more than capable of solving this particular problem.
"this is a test".split(' ').select{ |word| word[0,1] == 't' }
result => ["this", "test"]
or if you are determined to use regex then go with grep
"this is a test".split(' ').grep(/^t/)
result => ["this", "test"]
Hope this helps.

Resources