Ignoring a character along with word boundary in regex - ruby

I am using gsub in Ruby to make a word within text bold. I am using a word boundary so as to not make letters within other words bold, but am finding that this ignores words that have a quote after them. For example:
text.gsub(/#{word}\b/i, "<b>#{word}</b>")
text = "I said, 'look out below'"
word = below
In this case the word below is not made bold. Is there any way to ignore certain characters along with a word boundary?

All that escaping in the Regexp.new is looking quite ugly. You could greatly simplify that by using a Regexp literal:
word = 'below'
text = "I said, 'look out below'"
reg = /\b#{word}\b/i
text.gsub!(reg, '<b>\0</b>')
Also, you could use the modifier form of gsub! directly, unless that string is aliased in some other place in your code that you are not showing us. Lastly, if you use the single quoted string literal inside your gsub call, you don't need to escape the backslash.

Be very careful with your \b boundaries. Here’s why.

The #{word} syntax doesn't work for regular expressions. Use Regexp.new instead:
word = "below"
text = "I said, 'look out below'"
reg = Regexp.new("\\b#{word}\\b", true)
text = text.gsub(reg, "<b>\\0</b>")
Note that when using sting you need to escape \b to \\b, or it is interpreted as a backspace. If word may contain special regex characters, escape it using Regexp.escape.
Also, by replacing the string to <b>#{word}</b> you may change casing of the string: "BeloW" will be replaced to "below". \0 corrects this by replacing with the found word. In addition, I added \\b at the beginning, you don't want to look for "day" and end up with "sunday".

Related

regexp match group with the exception of a member of the group

So, there are a number of regular expression which matches a particular group like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exempt a character out?
Examples:
match all punctuations apart from the question mark
match all whitespace characters apart from the new line
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match 1 or more punctuation symbols but a !.
The puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) code will output ./? (see demo)
Use character class subtraction whenever you need to restrict with single characters, it is more efficient than using lookaheads.
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the ^^ \b is very important since it makes the regex engine check for the whole word go. If you remove it, it will only restrict to the words that do not start with go.
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuations apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

Faster alternative than a regex? Maybe also the regex approach wrong [duplicate]

I am using gsub in Ruby to make a word within text bold. I am using a word boundary so as to not make letters within other words bold, but am finding that this ignores words that have a quote after them. For example:
text.gsub(/#{word}\b/i, "<b>#{word}</b>")
text = "I said, 'look out below'"
word = below
In this case the word below is not made bold. Is there any way to ignore certain characters along with a word boundary?
All that escaping in the Regexp.new is looking quite ugly. You could greatly simplify that by using a Regexp literal:
word = 'below'
text = "I said, 'look out below'"
reg = /\b#{word}\b/i
text.gsub!(reg, '<b>\0</b>')
Also, you could use the modifier form of gsub! directly, unless that string is aliased in some other place in your code that you are not showing us. Lastly, if you use the single quoted string literal inside your gsub call, you don't need to escape the backslash.
Be very careful with your \b boundaries. Here’s why.
The #{word} syntax doesn't work for regular expressions. Use Regexp.new instead:
word = "below"
text = "I said, 'look out below'"
reg = Regexp.new("\\b#{word}\\b", true)
text = text.gsub(reg, "<b>\\0</b>")
Note that when using sting you need to escape \b to \\b, or it is interpreted as a backspace. If word may contain special regex characters, escape it using Regexp.escape.
Also, by replacing the string to <b>#{word}</b> you may change casing of the string: "BeloW" will be replaced to "below". \0 corrects this by replacing with the found word. In addition, I added \\b at the beginning, you don't want to look for "day" and end up with "sunday".

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) Negative lookbehind asserts that the character or substring we are going to match would be preceeded by any but not of a non-space character. So this matches the # which exists at the start or the # which exists next to a space character.
Difference between my answer and hwnd's str.gsub(/\B#/, '') is, mine won't match the # which exists in :# but hwnd's answer does. \B matches between two word characters or two non-word characters.
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

CamelCase regexp not accounting for spaces

I created a regexp to match the following scenerios: SomethingCool, HelloWorld, MyNameIsDonato, etc. However, it does not account for spaces:
> 'Something Cooler' =~ /([A-Z][a-z0-9]+)+/
=> 0
That passes and it should not pass. A space is not an alphanumeric character. So why does this pass and how can I fix it?
You need to anchor the regex to the beginning and end of the string, or it will just match one of the words:
^([A-Z][a-z0-9]+)+$
^ and $ anchor the beginnings and ends of lines, respectively. To anchor to the beginning and end of the string, use \A and \Z.
It's worth noting that this is useless if you're trying to find camelcase names within a larger string. For that, use your original regex.

Search and Replace ENTIRE WORDS ONLY in VB

A word processor program features a search and replace function. However, partial words (character combinations found within words) are also replaced. To fix this, I plan to remove extra spaces and use the split function to change the string into an array of words by using " " as a delimiter.
However, once I search through the array, replace the appropriate words, and put the array back into a string separated by spaces, the original formatting of the user will be lost. For example, if the original string was "This is a sentence." and the user wanted "a" to be replaced with "the", the output will be "This is the sentence.", with no additional spaces.
So, my question is whether there is any way to search and replace entire words only while still preserving the formatting (extra spaces) of the user in Visual Basic.
What about using a regex?
In a regex the code \b is a word boundary so for example the regex \ba\b will match a only when a is a whole word.
So for example your code would be:
Dim strPattern As String: strPattern = "\ba\b"
Dim regex As New RegExp
regex.Global = True
regex.Pattern = strPattern
result = regex.Replace("This is a sentence.", "the")
If you use the Split function without removing your extra spaces first your array will have empty items in it so you would not lose the extra spaces and can reconstruct your document with the original formatting in tact.
Why is your formatting lost? If you split the text by space, just attach a space after each element when composing it back from an array. But you will also have to take into account words that end not with a space but punctuation.
in "This is a simple sentence, eh?", "eh" will be stored as "eh?" because u split by space. So you will have to program a complex punctuation-friendly formula or simply use regex. Be prepared - regex is... tricky.

Resources