How to build a regexp for matching word without `"."` - ruby

Suppose I have a string like this:
w="abc#name,xy.abc=abc"
I want to replace the first and the third "abc" with another string. I used this code:
puts w.gsub(/\babc\b/,"replacer");
# => replacer#name,xy.replacer=replacer
where the second "abc" is replaced, which was not what I expected. Then I changed to the following pattern:
puts w.gsub(/[^\.]\babc\b/,"replacer");
# => abc#name,xy.abcreplacer
where the first "abc" is not replaced. I have no idea now how to fix it.

You can try
/\b(?<!\.)abc\b/
but it's a rather brute-force solution with negative look-behind.
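As a quick check, here is a minimal sketch that applies the look-behind pattern to the string from the question. The variable name result is only for illustration.

```ruby
w = "abc#name,xy.abc=abc"

# (?<!\.) rejects any "abc" that is immediately preceded by a literal dot.
result = w.gsub(/\b(?<!\.)abc\b/, "replacer")
puts result
# => replacer#name,xy.abc=replacer
```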

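Similar to Tass's answer, but using a negative look-ahead. This skips any "abc" that is immediately followed by "=", which in this string is only the second one: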
/\babc\b(?!=)/

I'd simplify the regex and rely on gsub's ability to take a block:
target = 'abc'
replacement = 'foo'
'w="abc#name,xy.abc=abc"'.gsub(/#{ target }#|=#{ target }/) { |s| s.sub(target, replacement) }
=> "w=\"foo#name,xy.abc=foo\""
The patterns you want are simple:
<target>#
=<target>
Find those, then do a simple string substitution.
Doing it this way doesn't encapsulate all the logic in the regex. It breaks the work into two separate steps, which simplifies the logic, speeds up development, and results in code that's easier to maintain.
Regex is a powerful tool, but sometimes you don't need a complicated pneumatic hammer; you need a small, simple claw hammer and a screwdriver.

Related

scripting logic : matching patterns

I am trying to figure out regex/scripting logic to parse something out like this;
RAW DATA
{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}
Here, the value is;
MedGen = C0432243
OMIM = 271640
SNOMED_CT = 254100000
Result: 271640
I am envisaging a convoluted if-else loop to get the result. I just wanted to know if there is any simple way of getting the same result. I much appreciate your answers.
Perhaps something like this (assuming there are always three fields):
(?<=[=:])(?<key>[^:;]+)(?=[:=;](?:[^:;=]+[=;:]){3}(?<val>[^:]+))
The idea is to capture the field values inside a lookahead assertion, so that the captures do not interfere with overlapping substrings.
However, there is probably a cleaner way that uses successive split.
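A rough sketch of what the successive-split approach could look like in Ruby. It assumes the input always has exactly two fields, names first and IDs second, in the same order. The variable names are made up for illustration.

```ruby
str = '{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}'

# Strip the braces, then split into the two "label=values" fields.
names_field, ids_field = str.delete('{}').split(';')

# Drop each field's label (the part before "=") and split the values on ":".
names = names_field.split('=', 2).last.split(':')
ids   = ids_field.split('=', 2).last.split(':')

# Pair each database name with the ID in the same position.
lookup = names.zip(ids).to_h
puts lookup['OMIM']
# => 271640
```

Building a hash means the result can be looked up by name instead of by position.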
It's difficult to tell from the question whether the input string is two lines or one:
str = 'RAW DATA
{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}
'
or
str = '{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}'
but, in either case I'd use a simple pattern:
str = '{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}'
medgen, omim, snomed_ct = str.match(/(\w+):(\w+):(\w+)}/).captures
medgen # => "C0432243"
omim # => "271640"
snomed_ct # => "254100000"
Here's the pattern at Rubular.
"I am envisaging a convoluted if-else loop to get the result."
Well, don't do that. Most programming solutions are surprisingly simple, so start simple. As you learn, your programming toolbox will grow as you become familiar with new ways of doing things, and you'll find certain tools are more useful for certain tasks. Still, always start from "simple", get the basics working, then carefully add to handle the corner cases.
When using a regular expression, it's important to look for landmarks in the string that you can use to locate your target text. In this case the trailing '}' is usable, so I wrote three simple captures to find \w strings separated by :.

Regex negative lookbehinds with a wildcard

I'm trying to match some text if it does not have another block of text in its vicinity. For example, I would like to match "bar" if "foo" does not precede it. I can match "bar" if "foo" does not immediately precede it using negative look behind in this regex:
/(?<!foo)bar/
but I would also like not to match "foo 12345 bar". I tried:
/(?<!foo.{1,10})bar/
but using a wildcard + a range appears to be an invalid regex in Ruby. Am I thinking about the problem wrong?
You are thinking about it the right way, but unfortunately lookbehinds usually have to be of fixed length. The only major exception is .NET's regex engine, which allows repetition quantifiers inside lookbehinds. But since you only need a negative lookbehind and not a lookahead as well, there is a hack for you: reverse the string, then try to match:
/rab(?!.{0,10}oof)/
Then reverse the result of the match or subtract the matching position from the string's length, if that's what you are after.
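A minimal sketch of the reversal trick, assuming you want the position of the match in the original string. The helper name last_bar_without_foo is hypothetical. Because the search runs on the reversed string, the first match it finds corresponds to the last "bar" in the original.

```ruby
# Hypothetical helper: returns the index of the last "bar" that is not
# preceded by "foo" plus at most 10 characters, or nil if there is none.
def last_bar_without_foo(str)
  m = str.reverse.match(/rab(?!.{0,10}oof)/)
  return nil unless m

  # m.end(0) is where "rab" ends in the reversed string, which mirrors
  # where "bar" starts in the original string.
  str.length - m.end(0)
end

p last_bar_without_foo("foo 12345 bar") # => nil
p last_bar_without_foo("a bar")         # => 2
```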
Now from the regex you have given, I suppose that this was only a simplified version of what you actually need. Of course, if bar is a complex pattern itself, some more thought needs to go into how to reverse it correctly.
Note that if your pattern required both variable-length lookbehinds and lookaheads, you would have a harder time solving this. Also, in your case, it would be possible to deconstruct your lookbehind into multiple fixed-length ones (because you use neither + nor *):
/(?<!foo)(?<!foo.)(?<!foo.{2})(?<!foo.{3})(?<!foo.{4})(?<!foo.{5})(?<!foo.{6})(?<!foo.{7})(?<!foo.{8})(?<!foo.{9})(?<!foo.{10})bar/
But that's not all that nice, is it?
As m.buettner already mentions, a lookbehind in a Ruby regex has to be of fixed length, and the documentation says so. So you cannot put a variable-length quantifier within a lookbehind.
You don't need to check everything in one step. Try doing multiple regex matches to get what you want. Assuming that the presence of foo before any instance of bar breaks the condition, regardless of whether there is another bar, then
string.match(/bar/) and !string.match(/foo.*bar/)
will give you what you want for the example.
If you would rather have the match succeed with bar foo bar, then you can do this:
string.scan(/foo|bar/).first == "bar"
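To make the difference between the two checks concrete, here is a small sketch that wraps each one in a method and runs both on a few sample strings. The method names strict_ok? and lenient_ok? are made up for this illustration.

```ruby
# Strict check: any "foo" somewhere before a "bar" rejects the string.
def strict_ok?(string)
  string.match?(/bar/) && !string.match?(/foo.*bar/)
end

# Lenient check: only the first occurrence of either word matters.
def lenient_ok?(string)
  string.scan(/foo|bar/).first == "bar"
end

["bar", "foo 12345 bar", "bar foo bar"].each do |s|
  puts "#{s.inspect}: strict=#{strict_ok?(s)} lenient=#{lenient_ok?(s)}"
end
```

Only "bar foo bar" is treated differently: the strict check rejects it and the lenient check accepts it.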

Is Regex faster than array comparison in this case?

Say I have an incoming string that I want to scan to see if it contains any of the words I have chosen to be "bad." :)
Is it faster to split the string into an array, as well as keep the bad words in an array, and then iterate through each bad word as well as each incoming word and see if there's a match, kind of like:
badwords.each do |badword|
  incoming.each do |word|
    trigger = true if badword == word
  end
end
OR is it faster to do this:
incoming.each do |word|
  trigger = true if badwords.include? word
end
OR is it faster to leave the string as it is and run a .match() with a regex that looks something like:
/\bbadword1\b|\bbadword2\b|\bbadword3\b/
Or is the performance difference almost completely negligible? Been wondering this for a while.
You're giving the regex an advantage by not stopping your loop when it finds a match. Try:
incoming.find{|word| badwords.include? word}
My money is still on the regex though which should be simplified to:
/\b(badword1|badword2|badword3)\b/
or to make it a fair fight:
/\A(badword1|badword2|badword3)\z/
Once compiled, the regex is the fastest in real life (i.e. a really long incoming string, many similar bad words, etc.), since it can run on the incoming string in situ and handles overlapping parts of your "bad words" really well.
The answer probably depends on the number of bad words to check: if there is only one bad word it probably doesn't make a huge difference; if there are 50, then checking an array would probably get slow. On the other hand, with tens or hundreds of thousands of words the regexp probably won't be too fast either.
If you need to handle large numbers of bad words, you might want to consider splitting into individual words and then using a bloomfilter to test whether the word is likely to be bad or not.
This does not exactly answer your question, but it will definitely help you solve it.
Take some examples of what you are trying to achieve and benchmark them.
You can find out how to do benchmarking in Ruby at the links below.
Just put the various forms inside report blocks, compare the benchmarks, and decide for yourself what suits you best.
http://ruby.about.com/od/tasks/f/benchmark.htm
http://ruby-doc.org/stdlib-1.9.3/libdoc/benchmark/rdoc/Benchmark.html
For more reliable results, test with real data.
Benchmarks are always better than discussions :)
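As a starting point, a benchmark along those lines might look like the sketch below. The word lists and the iteration count n are invented sample data, and the Set variant is an extra comparison that is not in the question. Substitute real input before drawing any conclusions from the timings.

```ruby
require 'benchmark'
require 'set'

# Hypothetical sample data: 1,000 harmless words followed by one bad word.
badwords = %w[badword1 badword2 badword3]
incoming = ('word ' * 1_000 + 'badword3').split
text     = incoming.join(' ')

bad_set = Set.new(badwords)
pattern = /\b(?:#{badwords.map { |w| Regexp.escape(w) }.join('|')})\b/

n = 200 # repetitions per report; arbitrary
Benchmark.bm(15) do |x|
  x.report('nested loops')   { n.times { badwords.any? { |b| incoming.any? { |w| b == w } } } }
  x.report('Array#include?') { n.times { incoming.any? { |w| badwords.include?(w) } } }
  x.report('Set#include?')   { n.times { incoming.any? { |w| bad_set.include?(w) } } }
  x.report('regex')          { n.times { text.match?(pattern) } }
end
```

Each variant stops at the first hit, so the comparison is not skewed by one approach scanning further than another.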
If you want to scan a string for occurrences of words, use scan to find them.
Use Regexp.union to build a pattern that will find the strings in your black-list. You will want to wrap the result with \b to force matching word-boundaries, and use a case-insensitive search.
To give you an idea of how Regexp.union can help:
words = %w[foo bar]
Regexp.union(words)
=> /foo|bar/
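'Daniel Foo killed him a bar'.scan(/\b#{Regexp.union(words)}\b/i)
=> ["bar"]
Note that an interpolated Regexp carries its own flags (it is embedded as (?-mix:foo|bar)), so the outer i flag does not apply to the words and "Foo" is not matched.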
You could also build the pattern using Regexp.new or /.../ if you want a bit more control:
Regexp.new('\b(?:' + words.join('|') + ')\b', Regexp::IGNORECASE)
=> /\b(?:foo|bar)\b/i
/\b(?:#{words.join('|')})\b/i
=> /\b(?:foo|bar)\b/i
'Daniel Foo killed him a bar'.scan(/\b(?:#{words.join('|')})\b/i)
=> ["Foo", "bar"]
As a word of advice, black-listing words you find offensive is easily tricked by a user, and often gives results that are wrong because many "offensive" words are only offensive in a certain context. A user can deliberately misspell them or use "l33t" speak and have an almost inexhaustible supply of alternate spellings that will make you constantly update your list. It's a source of enjoyment to some people to fool a system.
I was once given a similar task and wrote a translator to supply alternate spellings for "offensive" words. I started with a list of words and terms I'd gleaned from the Internet and started my code running. After several million alternates were added to the database I pulled the plug and showed management it was a fool's errand because it was trivial to fool it.

Regex can this be achieved

Am I being too ambitious, or is there a way to do the following:
add a string if it is not present,
and
remove the same string if it is present?
I'd like to do all of this using a regex and avoid the if-else statement.
Here is an example. I have the string
"admin,artist,location_manager,event_manager"
Can the substring location_manager be added or removed according to the conditions above?
Basically, I'm looking to avoid the if-else statement and do all of this purely in regex:
"admin,artist,location_manager,event_manager".test(/some_regex/)
Here some_regex would remove location_manager from the string if it is present, and add it otherwise.
Am I being overambitious?
You will need to use some sort of logic.
str += ',location_manager' unless str.gsub!(/location_manager,/,'')
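I'm assuming that if it's not present you append it to the end of the string. Note that the pattern expects a trailing comma, so it will not match location_manager when it is the last item in the string.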
Regex will not actually add or remove anything in any language that I am aware of. It is simply used to match. You must use some other language construct (a regex based replacement function for example) to achieve this functionality. It would probably help to mention your specific language so as to get help from those users.
Here's one kinda off-the-wall solution. It doesn't use regexes, but it also doesn't use any if/else statements either. It's more academic than production-worthy.
Assumptions: Your string is a comma-separated list of titles, and that these are a unique set (no duplicates), and that order doesn't matter:
require 'set'
titles = Set.new(str.split(','))
#=> #<Set: {"admin", "artist", "location_manager", "event_manager"}>
titles_to_toggle = ["location_manager"]
#=> ["location_manager"]
titles ^= titles_to_toggle
#=> #<Set: {"admin", "artist", "event_manager"}>
titles ^= titles_to_toggle
#=> #<Set: {"location_manager", "admin", "artist", "event_manager"}>
titles.to_a.join(",")
#=> "location_manager,admin,artist,event_manager"
All this assumes that you're using a string as a kind of set. If so, you should probably just use a set. If not, and you actually need string-manipulation functions to operate on it, there's probably no way around using if-else or a variant, such as the ternary operator, unless, or Bergi's answer.
Also worth noting regarding regex as a solution: Make sure you consider the edge cases. If 'location_manager' is in the middle of the string, will you remove the extraneous comma? Will you handle removing commas correctly if it's at the beginning or the end of the string? Will you correctly add commas when it's added? For these reasons treating a set as a set or array instead of a string makes more sense.
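For comparison, a minimal sketch of treating the string as an array, which sidesteps the comma edge cases listed above. It assumes the titles are comma-separated with no surrounding whitespace. The helper name toggle_title is hypothetical, and it still uses a ternary.

```ruby
# Hypothetical helper: removes the title if present, appends it otherwise.
def toggle_title(csv, title)
  titles = csv.split(',')
  toggled = titles.include?(title) ? titles - [title] : titles + [title]
  # join re-inserts the commas, so no stray separators are left behind.
  toggled.join(',')
end

puts toggle_title("admin,artist,location_manager,event_manager", "location_manager")
# => admin,artist,event_manager
puts toggle_title("admin,artist,event_manager", "location_manager")
# => admin,artist,event_manager,location_manager
```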
No. Regex can only match/test whether "a string" is present (or not). Then, the function you've used can do something based on that result, for example replace can remove a match.
Yet, you want to do two actions (each can be done with regex), remove if present and add if not. You can't execute them sequentially, because they overlap - you need to execute either the one or the other. This is where if-else structures (or ternary operators) come into play, and they are required if there is no library/native function that contains them to do exactly this job. I doubt there is one in Ruby.
If you want to avoid the if-else-statement (for one-liners or expressions), you can use the ternary operator. Or, you can use a lambda expression returning the correct value:
// kind of pseudo code (JavaScript-style)
string.replace(/location,?|$/, function ($0) { return $0 ? "" : ",location"; })
This matches the string "location" (with optional comma) or the string end, and replaces that with nothing if a match was found or the string ",location" otherwise. I'm sure you can adapt this to Ruby.
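One possible Ruby adaptation of that idea is sketched below. The helper name is hypothetical. Since an empty string is truthy in Ruby, the block tests match.empty? rather than the match itself.

```ruby
# Hypothetical helper adapting the pseudo code to Ruby.
def toggle_location_manager(str)
  # The pattern matches either the role (with an optional trailing comma)
  # or, failing that, the empty string at the very end of the input.
  str.sub(/location_manager,?|\z/) do |match|
    # An empty match means the role was absent, so append it.
    match.empty? ? ",location_manager" : ""
  end
end

puts toggle_location_manager("admin,artist,location_manager,event_manager")
# => admin,artist,event_manager
puts toggle_location_manager("admin,artist,event_manager")
# => admin,artist,event_manager,location_manager
```

This sketch keeps the pattern's limitations: if the role is the last item, removing it leaves a trailing comma, and adding it to an empty string produces a leading comma.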
Removing something that matches a pattern is really easy:
(admin,?|artist,?|location_manager,?|event_manager,?)
Then choose the string to replace the match with (in your case an empty string) and pass everything to the replace method.
The other operation you suggested is more difficult to achieve with regex only. Maybe someone knows a better answer.

Using regex to replace all spaces NOT in quotes in Ruby

I'm trying to write a regex to replace all spaces that are not included in quotes so something like this:
a = 4, b = 2, c = "space here"
would return this:
a=4,b=2,c="space here"
I spent some time searching this site and I found a similar q/a ( Split a string by spaces -- preserving quoted substrings -- in Python ) that would replace all the spaces inside quotes with a token that could be re-substituted in after wiping all the other spaces...but I was hoping there was a cleaner way of doing it.
It's worth noting that any regular expression solution will fail in cases like the following:
a = 4, b = 2, c = "space" here"
While it is true that you could construct a regexp to handle the three-quote case specifically, you cannot solve the problem in the general sense. This is a mathematically provable limitation of simple DFAs, of which regexps are a direct representation. To perform any serious brace/quote matching, you will need the more powerful pushdown automaton, usually in the form of a text parser library (ANTLR, Bison, Parsec).
With that said, it sounds like regular expressions should be sufficient for your needs. Just be aware of the limitations.
This seems to work:
result = string.gsub(/( |(".*?"))/, "\\2")
I consider this very clean:
mystring.scan(/((".*?")|([^ ]))/).map { |x| x[0] }.join
I doubt gsub could do any better (assuming you want a pure regex approach).
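For what it's worth, gsub with a block gives a fairly readable variant. This is a sketch that assumes double quotes are balanced and never escaped. The helper name strip_unquoted_spaces is made up.

```ruby
# Hypothetical helper: removes spaces except those inside double quotes.
def strip_unquoted_spaces(str)
  # Match either a whole quoted section or a run of spaces.
  # Quoted sections are put back unchanged; space runs are dropped.
  str.gsub(/"[^"]*"| +/) { |m| m.start_with?('"') ? m : '' }
end

puts strip_unquoted_spaces('a = 4, b = 2, c = "space here"')
# => a=4,b=2,c="space here"
```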
Try this one. Strings in single or double quotes are matched as well (so you need to filter them out if you only want the spaces):
/( |("([^"\\]|\\.)*")|('([^'\\]|\\.)*'))/
