Regex can this be achieved - ruby

Am I being too ambitious, or is there a way to do this:
add a string if it is not present
and
remove the same string if it is present?
I'd like to do all of this using a regex and avoid the if/else statement.
Here is an example.
I have string
"admin,artist,location_manager,event_manager"
Can the substring location_manager be added or removed according to the conditions above?
Basically, I'm looking to avoid the if/else statement and do all of this purely in regex:
"admin,artist,location_manager,event_manager".test(/some_regex/)
Here some_regex would remove location_manager from the string if it is present; otherwise it would add it.
Am I being overambitious?

You will need to use some sort of logic.
str += ',location_manager' unless str.gsub!(/location_manager,/,'')
I'm assuming that if it's not present you append it to the end of the string
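The one-liner above relies on String#gsub! returning nil when nothing was substituted. Here is a minimal sketch of that behaviour, wrapped in a hypothetical toggle! helper (the helper name and the use of Regexp.escape are assumptions, not part of the answer). Like the original, it only removes the role when it is followed by a comma:

```ruby
# Hypothetical helper illustrating the "gsub! returns nil" idiom.
def toggle!(str, role)
  # gsub! returns the modified string if it replaced something and nil
  # otherwise, so the append runs only when "role," was not found.
  str << ",#{role}" unless str.gsub!(/#{Regexp.escape(role)},/, '')
  str
end

roles = String.new("admin,artist,location_manager,event_manager")
toggle!(roles, "location_manager") # roles is now "admin,artist,event_manager"
toggle!(roles, "location_manager") # roles is now "admin,artist,event_manager,location_manager"
```

Because the pattern requires a trailing comma, a role sitting at the very end of the string is not detected and would be appended a second time, so this is a sketch of the idiom rather than a complete toggle.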

Regex will not actually add or remove anything in any language that I am aware of. It is simply used to match. You must use some other language construct (a regex based replacement function for example) to achieve this functionality. It would probably help to mention your specific language so as to get help from those users.

Here's one kinda off-the-wall solution. It doesn't use regexes, but it also doesn't use any if/else statements either. It's more academic than production-worthy.
Assumptions: Your string is a comma-separated list of titles, and that these are a unique set (no duplicates), and that order doesn't matter:
require 'set'
titles = Set.new(str.split(','))
#=> #<Set: {"admin", "artist", "location_manager", "event_manager"}>
titles_to_toggle = ["location_manager"]
#=> ["location_manager"]
titles ^= titles_to_toggle
#=> #<Set: {"admin", "artist", "event_manager"}>
titles ^= titles_to_toggle
#=> #<Set: {"location_manager", "admin", "artist", "event_manager"}>
titles.to_a.join(",")
#=> "location_manager,admin,artist,event_manager"
All this assumes that you're using a string as a kind of set. If so, you should probably just use a set. If not, and you actually need string-manipulation functions to operate on it, there's probably no way around using if-else or a variant, such as the ternary operator, unless, or the approach in Bergi's answer.
Also worth noting regarding regex as a solution: Make sure you consider the edge cases. If 'location_manager' is in the middle of the string, will you remove the extraneous comma? Will you handle removing commas correctly if it's at the beginning or the end of the string? Will you correctly add commas when it's added? For these reasons treating a set as a set or array instead of a string makes more sense.
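If the order of the titles should be preserved, the same idea can be sketched with plain arrays, which sidesteps the comma edge cases listed above. The toggle_role name is hypothetical, and the sketch assumes the titles are unique:

```ruby
# Hypothetical helper: toggles one role in a comma-separated list.
def toggle_role(csv, role)
  roles = csv.split(',')
  # (roles | [role]) appends the role if it is missing; (roles & [role]) is
  # [role] only if it was already present, so subtracting it removes it then.
  ((roles | [role]) - (roles & [role])).join(',')
end

toggle_role("admin,artist,location_manager,event_manager", "location_manager")
#=> "admin,artist,event_manager"
toggle_role("admin,artist,event_manager", "location_manager")
#=> "admin,artist,event_manager,location_manager"
```

Splitting and joining means commas are never handled by hand, whether the role is at the beginning, middle, or end of the list.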

No. Regex can only match/test whether "a string" is present (or not). Then, the function you've used can do something based on that result, for example replace can remove a match.
Yet, you want to do two actions (each can be done with regex), remove if present and add if not. You can't execute them sequentially, because they overlap - you need to execute either the one or the other. This is where if-else structures (or ternary operators) come into play, and they are required if there is no library/native function that contains them to do exactly this job. I doubt there is one in Ruby.
If you want to avoid the if-else statement (for one-liners or expressions), you can use the ternary operator. Or you can use a lambda expression returning the correct value:
// kind of pseudocode (JavaScript-style)
string.replace(/location,?|$/, function($0) { return $0 ? "" : ",location"; })
This matches the string "location" (with optional comma) or the string end, and replaces that with nothing if a match was found or the string ",location" otherwise. I'm sure you can adapt this to Ruby.
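One possible Ruby adaptation of this idea is sketched below. The toggle_with_regex name is hypothetical; the sketch uses String#sub with a block so that only the first match is replaced, and it tests for an empty match because an empty string is truthy in Ruby:

```ruby
# Hypothetical Ruby adaptation of the pseudocode above.
def toggle_with_regex(str, role)
  pattern = /#{Regexp.escape(role)},?|\z/
  # sub replaces only the first match: the role itself if it is present,
  # otherwise the empty match at the very end of the string.
  str.sub(pattern) { |match| match.empty? ? ",#{role}" : "" }
end

toggle_with_regex("admin,artist,location_manager,event_manager", "location_manager")
#=> "admin,artist,event_manager"
toggle_with_regex("admin,artist,event_manager", "location_manager")
#=> "admin,artist,event_manager,location_manager"
```

The conditional has only moved into the block, and a role at the end of the list leaves a trailing comma behind when removed, so the edge cases still need attention.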

Removing something that matches a pattern is really easy:
(admin,?|artist,?|location_manager,?|event_manager,?)
Then choose the string to replace the match (in your case, an empty string) and pass everything to the replace method.
The other operation you suggested was more difficult to achieve with regex only. Maybe someone knows a better answer

Related

Case-sensitive substitutions with gsub

As an exercise, I'm working on an accent translation dictionary. My dictionary is contained in a hash, and I'm thinking of using #gsub! to run inputted strings through the translator.
I'm wondering if there's any way to make the substitutions case-sensitive. For example, I want "didja" to translate to "did you" and "Didja" to translate to "Did you", but I don't want to have to create multiple dictionary entries to deal with case.
I know I can use regex syntax to find strings to replace case-insensitively, with str.gsub!(/#{x}/i,dictionary[x]) where x is a variable. The problem is that this replaces "Didja" with "did you", rather than matching the original case.
Is there any way to make it match the original case?
Suppose we have:
a method to_key that converts a string str to a key in a hash DICTIONARY; and
a method transform that converts the pair [str, DICTIONARY[to_key(str)]] to the replacement for str.
Then str is to be replaced with:
transform(str, DICTIONARY[to_key(str)])
Without loss of generality, I think we can assume that DICTIONARY's keys and values are all of the same case (say, lower case) and that to_key is simply:
def to_key(str)
  str.downcase
end
So all that is necessary is to define the method transform. However, the specification provided does not imply a unique mapping. We therefore must decide what transform should do.
For example, suppose the rule is simply that, if the first character of str and the first character of the dictionary value are both letters, the latter is to be converted to upper case if the former is upper case. Then:
def transform(str, dict_value)
  (str[0] =~ /[A-Z]/) ? dict_value.capitalize : dict_value
end
(I originally had dict_value[0] = dict_value[0].upcase if..., but came to my senses after reading @sawa's answer.)
Note that if DICTIONARY['cat'] => 'dog', 'Cat' will be converted to 'Dog'.
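A minimal sketch tying these pieces together with gsub is shown below. The one-entry DICTIONARY, the translate helper, and the word-boundary pattern are assumptions made for illustration:

```ruby
# Assumed sample dictionary; keys and values are all lower case.
DICTIONARY = { 'didja' => 'did you' }.freeze

def to_key(str)
  str.downcase
end

def transform(str, dict_value)
  str[0] =~ /[A-Z]/ ? dict_value.capitalize : dict_value
end

# Hypothetical helper: replaces every dictionary word, matching case-insensitively
# and passing the text that was actually matched to transform.
def translate(text)
  pattern = /\b(?:#{Regexp.union(DICTIONARY.keys).source})\b/i
  text.gsub(pattern) { |match| transform(match, DICTIONARY[to_key(match)]) }
end

translate("Didja see that? I bet didja.")
#=> "Did you see that? I bet did you."
```

The block form of gsub is what makes this work: the replacement is computed per match, so the original capitalization is still available when the replacement is chosen.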
One might think that another possibility is that all characters of str that are letters should maintain their case. This is problematic, however, as the dictionary mapping may (without further specification) remove letters, and it may not be clear from DICTIONARY[str] which letters of str were removed, some of which may be lower case and others upper case.
It is not clear what capitalization patterns you have in mind. I assume that you only need to deal with words that are all lowercase or all lowercase except for the first letter.
str.gsub!(/#{x}/i){|x| x.downcase! ? dictionary[x].capitalize : dictionary[x]}
I don't think this is possible since in this scenario you need to specify the exact string that must take place of the replaced string.
With that in mind, this is the best I can suggest:
subs = {'didja' => 'did you'}
subs.clone.each{ |k, v| subs[k.capitalize] = v.capitalize }
# if you want to replace all occurrences i.e. even substrings:
regex = /#{subs.keys.join('|')}/
# if you want to replace complete words only (as the Tin Man points out):
regex = /\b(?:#{subs.keys.join('|')})\b/ # \b checks for word-boundaries
"didja Didja".gsub(regex, subs)
Update:
Because in your example, the case-sensitive character isn't to be replaced by another value, you could use this:
regex = /(?<=(d))idja/i # again, keep in mind the substrings
"didja Didja".gsub(regex, "id you")

Changing "word" to "Word" using a RegEx like [A-Z]([a-z]*)\b

The title sums up my conundrum pretty well. I've been searching around the net for a while, and being new to Ruby and Regular Expressions as a whole, I'm stuck trying to figure out how to alter the case of a single word string using a RegEx "filter" such as [A-Z]([a-z]*)\b.
Basically I want the flow to be
input: woRD
filter: [A-Z]([a-z]*)\b
output: Word
I already have the words filtered into a list, so I don't need to match words; I only need to filter the case of the word using a RegEx filter.
I do not want to use standard capitalization methods, I want this to be done using Regular Expressions.
You can use
"woRD".downcase.capitalize
Ruby provides predefined methods for this type of functionality. Try to use them instead of regex; it saves coding time!
Well, for some reason you want to use regexps. Here you go:
# prepare hashes for gsub
to_down = (to_upper = Hash[('a'..'z').zip('A'..'Z')]).invert
# convert to downcase
downcased = 'woRD'.gsub(/[A-Z]/, to_down)
# ⇒ 'word'
titlecased = downcased.gsub(/^\w/, to_upper)
# ⇒ 'Word'
Hope it helps. Note the usage of String#gsub(re, hash) method.
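When a hash is passed to gsub or sub, matched text that is not a key of the hash is replaced with an empty string, so the second pattern is safer if it matches only letters that are in the hash. Here is a sketch of the same approach wrapped in a hypothetical regex_capitalize helper, under that assumption:

```ruby
TO_UPPER = ('a'..'z').zip('A'..'Z').to_h.freeze
TO_LOWER = TO_UPPER.invert.freeze

# Hypothetical helper: capitalizes a word using only regex substitutions.
def regex_capitalize(word)
  # First lower-case every capital letter, then upper-case a leading
  # lower-case letter. /\A[a-z]/ never matches a character missing from TO_UPPER.
  word.gsub(/[A-Z]/, TO_LOWER).sub(/\A[a-z]/, TO_UPPER)
end

regex_capitalize('woRD')   #=> "Word"
regex_capitalize('9lives') #=> "9lives"
```

With /^\w/ instead, a leading digit or underscore would match, would not be found in the hash, and would be dropped from the result.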
You can't use a regex alone to alter a string the way you want.
Please read carefully this topic: How to change case of letters in string using regex in Ruby.
The best way to solve your problem is to use:
"woRD".downcase.capitalize
or
name_of_your_variable.capitalize!
if you want to alter the string in your variable in place without assigning the result to another variable. Bang methods such as downcase! return nil when they make no change, so avoid chaining them.

Variable Declaration Regex

I'm trying to make a simple Ruby regex to detect a JavaScript Declaration, but it fails.
Regex:
lines.each do |line|
  unminifiedvar = /var [0-9a-zA-Z] = [0-9];/.match(line)
  next if unminifiedvar.nil? # no variable declarations on the line
  #...
end
Testing Line:
var testvariable10 = 9;
A variable name can have more than one character, so you need a + after the character-set [...]. (Also, JS variable names can contain other characters besides alphanumerics.) A numeric literal can have more than one character, so you want a + on the RHS too.
More importantly, though, there are lots of other bits of flexibility that you'll find more painful to process with a regular expression. For instance, consider var x = 1+2+3; or var myString = "foo bar baz";. A variable declaration may span several lines. It need not end with a semicolon. It may have comments in the middle of it. And so on. Regular expressions are not really the right tool for this job.
Of course, it may happen that you're parsing code from a particular source with a very special structure and can guarantee that every declaration has the particular form you're looking for. In that case, go ahead, but if there's any danger that the nature of the code you're processing might change then you're going to be facing a painful problem that really isn't designed to be solved with regular expressions.
[EDITED about a day after writing, to fix a mistake kindly pointed out by "the Tin Man".]
You forgot the +, as in, more than one character for the variable name.
var [0-9a-zA-Z]+ = [0-9];
You may also want to add a + after the [0-9]. That way it can match multiple digits.
var [0-9a-zA-Z]+ = [0-9]+;
http://rubular.com/r/kPlNcGRaHA
Try /var [0-9a-zA-Z]+ = \d+;/
Without the +, [0-9a-zA-Z] will only match a single alphanumeric character. With +, it can match 1 or more alphanumeric characters.
By the way, to make it more robust, you may want to make it match any number of spaces between the tokens, not just exactly one space each. You may also want to make the semicolon at the end optional (because Javascript syntax doesn't require a semicolon). You might also want to make it always match against the whole line, not just a part of the line. That would be:
/\Avar\s+[0-9a-zA-Z]+\s*=\s*\d+;?\Z/
(There is a way to write [0-9a-zA-Z] more concisely, but it has slipped my memory; if someone else knows, feel free to edit this answer.)
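A small sketch of the more tolerant pattern in use follows. It swaps in \w as one concise near-equivalent of [0-9a-zA-Z] (an assumption on my part: \w also accepts an underscore) and adds capture groups; the DECLARATION constant and parse_declaration helper are hypothetical names:

```ruby
# Assumed variant of the pattern above, with \w and two capture groups.
DECLARATION = /\Avar\s+(\w+)\s*=\s*(\d+);?\Z/

# Hypothetical helper: returns [name, value] for a simple numeric
# declaration, or nil when the line does not have that exact shape.
def parse_declaration(line)
  match = DECLARATION.match(line)
  match && [match[1], match[2].to_i]
end

parse_declaration("var testvariable10 = 9;") #=> ["testvariable10", 9]
parse_declaration("var x=42")                #=> ["x", 42]
parse_declaration('var s = "foo bar";')      #=> nil
```

As the other answer points out, this only covers declarations of one very specific form; anything with expressions, strings, or comments falls through to nil.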

A more elegant way to parse a string with ruby regular expression using variable grouping?

At the moment I have a regular expression that looks like this:
^(cat|dog|bird){1}(cat|dog|bird)?(cat|dog|bird)?$
It matches at least 1, and at most 3 instances of a long list of words and makes the matching words for each group available via the corresponding variable.
Is there a way to revise this so that I can return the result for each word in the string without specifying the number of groups beforehand?
^(cat|dog|bird)+$
works, but only returns the last match separately, because there is only one group.
OK, so I found a solution to this.
It doesn't look like it is possible to create an unknown number of groups, so I went digging for another way of achieving the desired outcome: To be able to tell if a string was made up of words in a given list; and to match the longest words possible in each position.
I have been reading Mastering Regular Expressions by Jeffrey E. F. Friedl and it shed some light on things for me. It turns out that NFA based Regexp engines (like the one used in Ruby) are sequential as well as lazy/greedy. This means that you can dictate how a pattern is matched using the order in which you give it choices. This explains why scan was returning variable results, it was looking for the first word in the list that matched the criteria and then moved on to the next match. By design it was not looking for the longest match, but the first one. So in order to rectify this all I needed to do was reorder the array of words used to generate the regular expression from alphabetical order, to length order (longest to shortest).
array = %w[ as ascarid car id ]
list = array.sort_by {|word| -word.length }
regexp = Regexp.union(list)
Now the first match found by scan will be the longest word available. It is also pretty simple to tell if a string contains only words in the list using scan:
word = "ascarid"
if word.scan(regexp).join.length == word.length
  return true
else
  return false
end
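The effect of the ordering can be seen by scanning the same string with both versions of the pattern. This is a small sketch using the word list above; the variable names are illustrative:

```ruby
words = %w[as ascarid car id]

# Alternation is tried left to right, so list order decides which word wins.
alphabetical  = Regexp.union(words)
longest_first = Regexp.union(words.sort_by { |w| -w.length })

"ascarid".scan(alphabetical)  #=> ["as", "car", "id"]
"ascarid".scan(longest_first) #=> ["ascarid"]
```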
Thanks to everyone that posted in response to this question, I hope that this will help others in the future.
You could do it in two steps:
Use /^(cat|dog|bird)+$/ (or better /\A(cat|dog|bird)+\z/) to make sure it matches.
Then string.scan(/cat|dog|bird/) to get the pieces.
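The two steps can be sketched as a single helper. The split_words name and the WORD constant are hypothetical:

```ruby
WORD = /cat|dog|bird/

# Hypothetical helper: returns the list of words if the whole string is
# made up of them, or nil otherwise.
def split_words(string)
  # Step 1: validate the whole string. Interpolating a Regexp object embeds
  # it as a self-contained group, so the + applies to the full alternation.
  return nil unless string.match?(/\A(?:#{WORD})+\z/)
  # Step 2: extract the pieces.
  string.scan(WORD)
end

split_words("birddogcatbird") #=> ["bird", "dog", "cat", "bird"]
split_words("birdfish")       #=> nil
```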
You could also use split and a Set to do both at once. Suppose you have your words in the array a and your string in s, then:
require 'set'
words = Set.new(a)
re = /(#{a.map{|w| Regexp.quote(w)}.join('|')})/
parts = s.split(re).reject(&:empty?)
if parts.any? { |w| !words.include?(w) }
  # 's' didn't match what you expected so throw a
  # hissy fit, format the hard drive, set fire to
  # the backups, or whatever is appropriate.
else
  # Everything you were looking for is in 'parts'
  # so you can check the length (if you care about
  # how many matches there were) or something useful
  # and productive.
end
When you use split with a pattern that contains groups then
the respective matches will be returned in the array as well.
In this case, the split will hand us something like ["", "cat", "", "dog"]; the empty strings will only occur between the separators that we're looking for, so we can reject them and pretend they don't exist. This may be an unexpected use of split, since we're more interested in the delimiters than in what is being delimited (except to make sure that nothing is being delimited), but it gets the job done.
Based on your comments, it looks like you want an ordered alternation so that (ascarid|car|as|id) would try to match from left to right. I can't find anything in the Ruby Oniguruma (the Ruby 1.9 regex engine) docs that says that | is ordered or unordered; Perl's alternation appears to be specified (or at least strongly implied) to be ordered and Ruby's certainly behaves as though it is ordered:
>> 'pancakes' =~ /(pan|pancakes)/; puts $1
pan
So you could sort your words from longest to shortest when building your regex:
re = /(#{a.sort_by{|w| -w.length}.map{|w| Regexp.quote(w)}.join('|')})/
and hope that Oniguruma really will match alternations from left to right. AFAIK, Ruby's regexes will be eager because they support backreferences and lazy/non-greedy matching so this approach should be safe.
Or you could be properly paranoid and parse it in steps; first you'd make sure your string looks like what you want:
if s !~ /\A(#{a.map{|w| Regexp.quote(w)}.join('|')})+\z/
  # Bail out and complain that 's' doesn't look right
end
Then group your words by length:
by_length = a.group_by(&:length)
and scan for the groups from the longest words to the shortest words:
# This loses the order of the substrings within 's'...
matches = [ ]
by_length.keys.sort_by { |k| -k }.each do |group|
  # Build the pattern from the words of this length only
  re = /(#{by_length[group].map{|w| Regexp.quote(w)}.join('|')})/
  s.gsub!(re) { |w| matches.push(w); '' }
end
# 's' should now be empty and the matched substrings will be
# in 'matches'
There is still room for possible overlaps in these approaches but at least you'd be extracting the longest matches.
If you need to repeat parts of a regex, one option is to store the repeated part in a variable and just reference that, for example:
r = "(cat|dog|bird)"
str.match(/#{r}#{r}?#{r}?/)
You can do it with .Net regular expressions. If I write the following in PowerShell
$pat = [regex] "^(cat|dog|bird)+$"
$m = $pat.match('birddogcatbird')
$m.groups[1].captures | %{$_.value}
then I get
bird
dog
cat
bird
when I run it. I know even less about IronRuby than I do about PowerShell, but perhaps this means you can do it in IronRuby as well.

Very odd issue with Ruby and regex

I am getting completely different results from string.scan and several regex testers...
I am just trying to grab the domain from the string, it is the last word.
The regex in question:
/([a-zA-Z0-9\-]*\.)*\w{1,4}$/
The string (1 single line, verified in Ruby's runtime btw)
str = 'Show more results from software.informer.com'
It works fine in the regex testers, but in Ruby...
irb(main):050:0> str.scan /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
=> [["informer."]]
I would think that I would get a match on software.informer.com, which is my goal.
Your regex is correct, the result has to do with the way String#scan behaves. From the official documentation:
"If the pattern contains groups, each individual result is itself an array containing one entry per group."
Basically, if you put parentheses around the whole regex, the first element of each array in your results will be what you expect.
It does not look as if you expect more than one result (especially as the regex is anchored). In that case there is no reason to use scan.
'Show more results from software.informer.com'[ /([a-zA-Z0-9\-]*\.)*\w{1,4}$/ ]
#=> "software.informer.com"
If you do need to use scan (in which case you obviously need to remove the anchor), you can use (?:) to create non-capturing groups.
'foo.bar.baz lala software.informer.com'.scan( /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}/ )
#=> ["foo.bar.baz", "lala", "software.informer.com"]
You are getting a match on software.informer.com. Check the value of $&. The return of scan is an array of the captured groups. Add capturing parentheses around the suffix, and you'll get the .com as part of the return value from scan as well.
The regex testers and Ruby are not disagreeing about the fundamental issue (the regex itself). Rather, their interfaces are differing in what they are emphasizing. When you run scan in irb, the first thing you'll see is the return value from scan (an Array of the captured subpatterns), which is not the same thing as the matched text. Regex testers are most likely oriented toward displaying the matched text.
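The difference between the matched text and the captured groups can be seen side by side in a short sketch. It uses MatchData rather than $& (an illustrative choice; match[0] holds the same text as $&):

```ruby
str   = 'Show more results from software.informer.com'
regex = /([a-zA-Z0-9\-]*\.)*\w{1,4}$/

match = str.match(regex)
match[0] #=> "software.informer.com" (the whole match)
match[1] #=> "informer." (the last repetition captured by group 1)

# scan reports captures when the pattern has groups...
str.scan(regex) #=> [["informer."]]
# ...and whole matches when it has none.
str.scan(/(?:[a-zA-Z0-9\-]*\.)*\w{1,4}$/) #=> ["software.informer.com"]
```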
How about doing this:
/([a-zA-Z0-9\-]*\.*\w{1,4})$/
This returns
informer.com
on your test string.
http://rubular.com/regexes/13670
