Faster alternative than a regex? Maybe also the regex approach wrong [duplicate] - ruby

I am using gsub in Ruby to make a word within text bold. I am using a word boundary so as to not make letters within other words bold, but am finding that this ignores words that have a quote after them. For example:
text.gsub(/#{word}\b/i, "<b>#{word}</b>")
text = "I said, 'look out below'"
word = below
In this case the word below is not made bold. Is there any way to ignore certain characters along with a word boundary?

All that escaping in the Regexp.new is looking quite ugly. You could greatly simplify that by using a Regexp literal:
word = 'below'
text = "I said, 'look out below'"
reg = /\b#{word}\b/i
text.gsub!(reg, '<b>\0</b>')
Also, you could use the modifier form of gsub! directly, unless that string is aliased in some other place in your code that you are not showing us. Lastly, if you use the single quoted string literal inside your gsub call, you don't need to escape the backslash.

Be very careful with your \b boundaries. Here’s why.

The #{word} syntax doesn't work for regular expressions. Use Regexp.new instead:
word = "below"
text = "I said, 'look out below'"
reg = Regexp.new("\\b#{word}\\b", true)
text = text.gsub(reg, "<b>\\0</b>")
Note that when using sting you need to escape \b to \\b, or it is interpreted as a backspace. If word may contain special regex characters, escape it using Regexp.escape.
Also, by replacing the string to <b>#{word}</b> you may change casing of the string: "BeloW" will be replaced to "below". \0 corrects this by replacing with the found word. In addition, I added \\b at the beginning, you don't want to look for "day" and end up with "sunday".

Related

Regex that considers apostrophes as word characters? [duplicate]

So I want to split a string in java on any non-alphanumeric characters.
Currently I have been doing it like this
words= Str.split("\\W+");
However I want to keep apostrophes("'") in there. Is there any regular expression to preserve apostrophes but kick the rest of the junk? Thanks.
words = Str.split("[^\\w']+");
Just add it to the character class. \W is equivalent to [^\w], which you can then add ' to.
Do note, however, that \w also actually includes underscores. If you want to split on underscores as well, you should be using [^a-zA-Z0-9'] instead.
For basic English characters, use
words = Str.split("[^a-zA-Z0-9']+");
If you want to include English words with special characters (such as fiancé) or for languages that use non-English characters, go with
words = Str.split("[^\\p{L}0-9']+");

Search and Replace ENTIRE WORDS ONLY in VB

A word processor program features a search and replace function. However, partial words (character combinations found within words) are also replaced. To fix this, I plan to remove extra spaces and use the split function to change the string into an array of words by using " " as a delimiter.
However, once I search through the array, replace the appropriate words, and put the array back into a string separated by spaces, the original formatting of the user will be lost. For example, if the original string was "This is a sentence." and the user wanted "a" to be replaced with "the", the output will be "This is the sentence.", with no additional spaces.
So, my question is whether there is any way to search and replace entire words only while still preserving the formatting (extra spaces) of the user in Visual Basic.
What about using a regex?
In a regex the code \b is a word boundary so for example the regex \ba\b will match a only when a is a whole word.
So for example your code would be:
Dim strPattern As String: strPattern = "\ba\b"
Dim regex As New RegExp
regex.Global = True
regex.Pattern = strPattern
result = regex.Replace("This is a sentence.", "the")
If you use the Split function without removing your extra spaces first your array will have empty items in it so you would not lose the extra spaces and can reconstruct your document with the original formatting in tact.
Why is your formatting lost? If you split the text by space, just attach a space after each element when composing it back from an array. But you will also have to take into account words that end not with a space but punctuation.
in "This is a simple sentence, eh?", "eh" will be stored as "eh?" because u split by space. So you will have to program a complex punctuation-friendly formula or simply use regex. Be prepared - regex is... tricky.

How can I remove the string "\n" from within a Ruby string?

I have this string:
"some text\nandsomemore"
I need to remove the "\n" from it. I've tried
"some text\nandsomemore".gsub('\n','')
but it doesn't work. How do I do it? Thanks for reading.
You need to use "\n" not '\n' in your gsub. The different quote marks behave differently.
Double quotes " allow character expansion and expression interpolation ie. they let you use escaped control chars like \n to represent their true value, in this case, newline, and allow the use of #{expression} so you can weave variables and, well, pretty much any ruby expression you like into the text.
While on the other hand, single quotes ' treat the string literally, so there's no expansion, replacement, interpolation or what have you.
In this particular case, it's better to use either the .delete or .tr String method to delete the newlines.
See here for more info
If you want or don't mind having all the leading and trailing whitespace from your string removed you can use the strip method.
" hello ".strip #=> "hello"
"\tgoodbye\r\n".strip #=> "goodbye"
as mentioned here.
edit The original title for this question was different. My answer is for the original question.
When you want to remove a string, rather than replace it you can use String#delete (or its mutator equivalent String#delete!), e.g.:
x = "foo\nfoo"
x.delete!("\n")
x now equals "foofoo"
In this specific case String#delete is more readable than gsub since you are not actually replacing the string with anything.
You don't need a regex for this. Use tr:
"some text\nandsomemore".tr("\n","")
use chomp or strip functions from Ruby:
"abcd\n".chomp => "abcd"
"abcd\n".strip => "abcd"

Ignoring a character along with word boundary in regex

I am using gsub in Ruby to make a word within text bold. I am using a word boundary so as to not make letters within other words bold, but am finding that this ignores words that have a quote after them. For example:
text.gsub(/#{word}\b/i, "<b>#{word}</b>")
text = "I said, 'look out below'"
word = below
In this case the word below is not made bold. Is there any way to ignore certain characters along with a word boundary?
All that escaping in the Regexp.new is looking quite ugly. You could greatly simplify that by using a Regexp literal:
word = 'below'
text = "I said, 'look out below'"
reg = /\b#{word}\b/i
text.gsub!(reg, '<b>\0</b>')
Also, you could use the modifier form of gsub! directly, unless that string is aliased in some other place in your code that you are not showing us. Lastly, if you use the single quoted string literal inside your gsub call, you don't need to escape the backslash.
Be very careful with your \b boundaries. Here’s why.
The #{word} syntax doesn't work for regular expressions. Use Regexp.new instead:
word = "below"
text = "I said, 'look out below'"
reg = Regexp.new("\\b#{word}\\b", true)
text = text.gsub(reg, "<b>\\0</b>")
Note that when using sting you need to escape \b to \\b, or it is interpreted as a backspace. If word may contain special regex characters, escape it using Regexp.escape.
Also, by replacing the string to <b>#{word}</b> you may change casing of the string: "BeloW" will be replaced to "below". \0 corrects this by replacing with the found word. In addition, I added \\b at the beginning, you don't want to look for "day" and end up with "sunday".

Strip words beginning with a specific letter from a sentence using regex

I'm not sure how to use regular expressions in a function so that I could grab all the words in a sentence starting with a particular letter. I know that I can do:
word =~ /^#{letter}/
to check if the word starts with the letter, but how do I go from word to word. Do I need to convert the string to an array and then iterate through each word or is there a faster way using regex? I'm using ruby so that would look like:
matching_words = Array.new
sentance.split(" ").each do |word|
matching_words.push(word) if word =~ /^#{letter}/
end
Scan may be a good tool for this:
#!/usr/bin/ruby1.8
s = "I think Paris in the spring is a beautiful place"
p s.scan(/\b[it][[:alpha:]]*/i)
# => ["I", "think", "in", "the", "is"]
\b means 'word boundary."
[:alpha:] means upper or lowercase alpha (a-z).
You can use \b. It matches word boundaries--the invisible spot just before and after a word. (You can't see them, but oh they're there!) Here's the regex:
/\b(a\w*)\b/
The \w matches a word character, like letters and digits and stuff like that.
You can see me testing it here: http://rubular.com/regexes/13347
Similar to Anon.'s answer:
/\b(a\w*)/g
and then see all the results with (usually) $n, where n is the n-th hit. Many libraries will return /g results as arrays on the $n-th set of parenthesis, so in this case $1 would return an array of all the matching words. You'll want to double-check with whatever library you're using to figure out how it returns matches like this, there's a lot of variation on global search returns, sadly.
As to the \w vs [a-zA-Z], you can sometimes get faster execution by using the built-in definitions of things like that, as it can easily have an optimized path for the preset character classes.
The /g at the end makes it a "global" search, so it'll find more than one. It's still restricted by line in some languages / libraries, though, so if you wish to check an entire file you'll sometimes need /gm, to make it multi-line
If you want to remove results, like your title (but not question) suggests, try:
/\ba\w*//g
which does a search-and-replace in most languages (/<search>/<replacement>/). Sometimes you need a "s" at the front. Depends on the language / library. In Ruby's case, use:
string.gsub(/(\b)a\w*(\b)/, "\\1\\2")
to retain the non-word characters, and optionally put any replacement text between \1 and \2. gsub for global, sub for the first result.
/\ba[a-z]*\b/i
will match any word starting with 'a'.
The \b indicates a word boundary - we want to only match starting from the beginning of a word, after all.
Then there's the character we want our word to start with.
Then we have as many as possible letter characters, followed by another word boundary.
To match all words starting with t, use:
\bt\w+
That will match test but not footest; \b means "word boundary".
Personally i think that regex is overkill for this application, simply running a select is more than capable of solving this particular problem.
"this is a test".split(' ').select{ |word| word[0,1] == 't' }
result => ["this", "test"]
or if you are determined to use regex then go with grep
"this is a test".split(' ').grep(/^t/)
result => ["this", "test"]
Hope this helps.

Resources