A more elegant way to parse a string with ruby regular expression using variable grouping? - ruby

At the moment I have a regular expression that looks like this:
^(cat|dog|bird){1}(cat|dog|bird)?(cat|dog|bird)?$
It matches at least 1, and at most 3 instances of a long list of words and makes the matching words for each group available via the corresponding variable.
Is there a way to revise this so that I can return the result for each word in the string without specifying the number of groups beforehand?
^(cat|dog|bird)+$
works, but it only returns the last match separately, because there is only one group.

OK, so I found a solution to this.
It doesn't look like it is possible to create an unknown number of groups, so I went digging for another way of achieving the desired outcome: To be able to tell if a string was made up of words in a given list; and to match the longest words possible in each position.
I have been reading Mastering Regular Expressions by Jeffrey E. F. Friedl and it shed some light on things for me. It turns out that NFA-based regexp engines (like the one used in Ruby) try alternatives sequentially, as well as supporting lazy and greedy quantifiers. This means that you can dictate how a pattern is matched through the order in which you give it choices. This explains why scan was returning variable results: it was looking for the first word in the list that matched the criteria and then moving on to the next match. By design it was not looking for the longest match, but the first one. So to rectify this, all I needed to do was reorder the array of words used to generate the regular expression from alphabetical order to length order (longest to shortest).
array = %w[ as ascarid car id ]
list = array.sort_by {|word| -word.length }
regexp = Regexp.union(list)
Now the first match found by scan will be the longest word available. It is also pretty simple to tell if a string contains only words in the list using scan:
word = "ascarid"
# true when the scanned pieces account for every character of the string
word.scan(regexp).join.length == word.length
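To make the effect of the ordering concrete, here is a minimal sketch using the word list from this answer. The helper name made_of_words? is only an illustration, not something from the original code.

```ruby
# Word list taken from the answer above.
words = %w[as ascarid car id]

alphabetical  = Regexp.union(words.sort)
longest_first = Regexp.union(words.sort_by { |w| -w.length })

# Hypothetical helper: true when the scan results cover the whole string.
def made_of_words?(string, regexp)
  string.scan(regexp).join.length == string.length
end

p "ascarid".scan(alphabetical)              # => ["as", "car", "id"]
p "ascarid".scan(longest_first)             # => ["ascarid"]
p made_of_words?("ascarid", longest_first)  # => true
p made_of_words?("ascarixd", longest_first) # => false
```

The alternation stops at the first alternative that matches at a given position, which is why the alphabetical list splits "ascarid" into three shorter words.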
Thanks to everyone that posted in response to this question, I hope that this will help others in the future.

You could do it in two steps:
Use /^(cat|dog|bird)+$/ (or better /\A(cat|dog|bird)+\z/) to make sure it matches.
Then string.scan(/cat|dog|bird/) to get the pieces.
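The two steps could be wrapped up roughly like this. This is a sketch only; the constant names and the word_pieces method are made up for the example, and String#match? assumes Ruby 2.4 or later.

```ruby
WORDS       = %w[cat dog bird].freeze
ALTERNATION = Regexp.union(WORDS)
# Interpolating a Regexp embeds it as a group, so the + applies to the whole alternation.
FULL_MATCH  = /\A#{ALTERNATION}+\z/

# Hypothetical helper: the matched words in order, or nil if the string
# contains anything other than the listed words.
def word_pieces(string)
  return nil unless string.match?(FULL_MATCH)  # step 1: validate
  string.scan(ALTERNATION)                     # step 2: extract
end

p word_pieces('birddogcatbird')  # => ["bird", "dog", "cat", "bird"]
p word_pieces('birdcow')         # => nil
```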
You could also use split and a Set to do both at once. Suppose you have your words in the array a and your string in s, then:
words = Set.new(a)
re = /(#{a.map{|w| Regexp.quote(w)}.join('|')})/
parts = s.split(re).reject(&:empty?)
if(parts.any? {|w| !words.include?(w) })
# 's' didn't match what you expected so throw a
# hissy fit, format the hard drive, set fire to
# the backups, or whatever is appropriate.
else
# Everything you were looking for is in 'parts'
# so you can check the length (if you care about
# how many matches there were) or something useful
# and productive.
end
When you use split with a pattern that contains groups then
the respective matches will be returned in the array as well.
In this case, the split will hand us something like ["", "cat", "", "dog"] and the empty strings will only occur between the separators that we're looking for, so we can reject them and pretend they don't exist. This may be an unexpected use of split since we're more interested in the delimiters than in what is being delimited (except to make sure that nothing is being delimited), but it gets the job done.
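A small demonstration of that split behaviour, with one valid and one invalid input. The sample strings and the variable names good, bad and stray are illustrative only.

```ruby
require 'set'

a     = %w[cat dog bird]
words = Set.new(a)
re    = /(#{a.map { |w| Regexp.quote(w) }.join('|')})/

# Because the pattern has a capture group, the delimiters are kept.
good = 'catdog'.split(re)
bad  = 'catXdog'.split(re)
p good  # => ["", "cat", "", "dog"]
p bad   # => ["", "cat", "X", "dog"]

# Anything left over that is not in the word set is a stray fragment.
stray = bad.reject(&:empty?).reject { |w| words.include?(w) }
p stray  # => ["X"]
```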
Based on your comments, it looks like you want an ordered alternation so that (ascarid|car|as|id) would try to match from left to right. I can't find anything in the Ruby Oniguruma (the Ruby 1.9 regex engine) docs that says that | is ordered or unordered; Perl's alternation appears to be specified (or at least strongly implied) to be ordered and Ruby's certainly behaves as though it is ordered:
>> 'pancakes' =~ /(pan|pancakes)/; puts $1
pan
So you could sort your words from longest to shortest when building your regex:
re = /(#{a.sort_by{|w| -w.length}.map{|w| Regexp.quote(w)}.join('|')})/
and hope that Oniguruma really will match alternations from left to right. AFAIK, Ruby's regexes will be eager because they support backreferences and lazy/non-greedy matching so this approach should be safe.
Or you could be properly paranoid and parse it in steps; first you'd make sure your string looks like what you want:
if(s !~ /\A(#{a.map{|w| Regexp.quote(w)}.join('|')})+\z/)
# Bail out and complain that 's' doesn't look right
end
Then group your words by length:
by_length = a.group_by(&:length)
and scan for the groups from the longest words to the shortest words:
# This loses the order of the substrings within 's'...
matches = [ ]
by_length.keys.sort_by { |k| -k }.each do |group|
re = /(#{by_length[group].map{|w| Regexp.quote(w)}.join('|')})/  # only the words of this length
s.gsub!(re) { |w| matches.push(w); '' }
end
# 's' should now be empty and the matched substrings will be
# in 'matches'
There is still room for possible overlaps in these approaches but at least you'd be extracting the longest matches.

If you need to repeat parts of a regex, one option is to store the repeated part in a variable and just reference that, for example:
r = "(cat|dog|bird)"
str.match(/#{r}#{r}?#{r}?/)
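For illustration, here is how the captures come back from a pattern built this way. The \A and \z anchors are an addition for this sketch, so that strings with more than three words are rejected.

```ruby
r = "(cat|dog|bird)"
pattern = /\A#{r}#{r}?#{r}?\z/  # anchors added for this example

m = "catdog".match(pattern)
p m.captures          # => ["cat", "dog", nil]
p m.captures.compact  # => ["cat", "dog"]

# Four words exceed the three groups, so there is no match at all.
p "catdogbirdcat".match(pattern)  # => nil
```

Unused optional groups show up as nil, so compact is handy if only the matched words are wanted.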

You can do it with .Net regular expressions. If I write the following in PowerShell
$pat = [regex] "^(cat|dog|bird)+$"
$m = $pat.match('birddogcatbird')
$m.groups[1].captures | %{$_.value}
then I get
bird
dog
cat
bird
when I run it. I know even less about IronRuby than I do about PowerShell, but perhaps this means you can do it in IronRuby as well.

Related

Regular expression for validating long, complicated dns targets

The DNS entries I am trying to validate are quite long. Here's an example of what the structure might look like:
qwer-0123a4bcd567890e1-uuuuu3xx.qwer-gfd-1e098765dcb4a3210.ps-sdlk-6.qwer.domain.com
These entries can be thought of as three distinct parts:
qwer-0123a4bcd567890e1-uuuuu3xx.qwer-gfd-1e098765dcb4a3210.
Always starts with qwer-
Followed by 17 alphanumerics, a -, 8 more alphanumerics
Followed by qwer-gfd-
Followed by 17 more alphanumerics and a .
ps-sdlk-6
Always starts with ps-sdlk-
Followed by either one or two alphanumerics. In this case it could be ps-sdlk-6 or something like ps-sdlk-6e
.qwer.domain.com
The domain target always ends with .qwer.domain.com
I've been hacking together a regex and came up with this monstrosity:
qwer-[\w]{17}-[\w]{8}.qwer-gfd-[\w]{17}.(.*)(qwer.domain.com)
That solution is pretty hideous, and it returns multiple match groups, which doesn't give me much confidence in its accuracy. I'm using Ruby 2.5, but anything outside the standard library is difficult to import in this case.
Is there a more sensible and complete/accurate regex to confirm the validity of these dns targets? Is there a better way to do this without regex?
Considering the complexity of testing longish regular expressions, and also the possibility--if not the probability--that changes will be needed in future, I would be inclined to split the string on hyphens and test each string in the resulting array.
PIECES = [['qwer'],
['0123a4bcd567890e1'.size],
['uuuuu3xx'.size, '.qwer'],
['gfd'],
['1e098765dcb4a3210'.size, '.ps'],
['sdlk'],
[[1, 2], '.qwer.domain.com']].
map do |a|
Regexp.new(
a.each_with_object('\A') do |o,s|
case o
when String
s << o.gsub('.', '\.')
when Integer
s << "\\p{Alnum}{#{o}}"
else # array
s << "\\p{Alnum}{#{o.first},#{o.last}}"
end
end << '\z')
end
#=> [/\Aqwer\z/, /\A\p{Alnum}{17}\z/, /\A\p{Alnum}{8}\.qwer\z/,
# /\Agfd\z/, /\A\p{Alnum}{17}\.ps\z/, /\Asdlk\z/,
# /\A\p{Alnum}{1,2}\.qwer\.domain\.com\z/]
Notice that I've used single quotes in most places to be able to write, for example, '\A' rather than "\\A". However, double quotes are needed for the two lines where interpolation is performed (#{o}). I've also used strings from the example to determine the lengths of various runs of alphanumeric characters and have escaped periods and added anchors in simple code. I did that to both reduce the chance of counting errors and help readers of the code understand what is being done. Though the elements of PIECES (regular expressions) are here being used to test the string used to construct PIECES that is of course irrelevant, assuming, as we must, that all strings to be tested will have the same pattern.
def valid?(str)
arr = str.split('-')
return false unless arr.size == PIECES.size
arr.zip(PIECES).all? { |s,r| s.match? r }
end
If Enumerable#all?'s block returns false all? immediately returns false. This is sometimes referred to as short-circuiting behaviour.
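A quick way to see the short-circuiting is to record which elements the block actually visits. The array and the checked variable here are made up for the demonstration.

```ruby
checked = []
result = [1, 2, 3, 4].all? do |n|
  checked << n  # remember every element the block is called with
  n < 2
end

p result   # => false
p checked  # => [1, 2]  (3 and 4 are never examined)
```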
For the string given in the example, str,
valid?(str)
#=> true
Note the following intermediate calculation.
str.split('-').zip(PIECES)
#=> [["qwer", /\Aqwer\z/],
# ["0123a4bcd567890e1", /\A\p{Alnum}{17}\z/],
# ["uuuuu3xx.qwer", /\A\p{Alnum}{8}\.qwer\z/],
# ["gfd", /\Agfd\z/],
# ["1e098765dcb4a3210.ps", /\A\p{Alnum}{17}\.ps\z/],
# ["sdlk", /\Asdlk\z/],
# ["6.qwer.domain.com", /\A\p{Alnum}{1,2}\.qwer\.domain\.com\z/]]
This may seem overkill (and I'm not certain it isn't), but it does facilitate debugging and testing and if, in future, the string pattern changes (within limits) it should be relatively easy to modify the matching test (by changing the array of arrays above from which PIECES is derived).
I think given your input data you have no choice but an ugly regex e.g.
^qwer-\w{17}-\w{8}\.qwer-gfd-\w{17}\.ps-sdlk-\w{1,2}\.qwer\.domain\.com$
Note that I have used \w as you did, however \w also matches _ as well as alphanumeric characters, so you may want to replace it with [A-Za-z0-9]. Also, . will match any character, so to specifically match a . you need \. in your regex.
Demo on regex101.com
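In Ruby, one way to keep a single long pattern readable is the /x flag, which ignores whitespace inside the regex. The sketch below applies the suggestions above ([A-Za-z0-9] instead of \w, escaped dots) and uses \A and \z rather than ^ and $ so that multi-line input cannot slip through. The names DNS_TARGET and valid_target? are assumptions for this example.

```ruby
DNS_TARGET = /
  \A
  qwer-[A-Za-z0-9]{17}-[A-Za-z0-9]{8}\.
  qwer-gfd-[A-Za-z0-9]{17}\.
  ps-sdlk-[A-Za-z0-9]{1,2}
  \.qwer\.domain\.com
  \z
/x

# Hypothetical helper wrapping the match.
def valid_target?(str)
  str.match?(DNS_TARGET)
end

sample = 'qwer-0123a4bcd567890e1-uuuuu3xx.qwer-gfd-1e098765dcb4a3210.ps-sdlk-6.qwer.domain.com'
p valid_target?(sample)                            # => true
p valid_target?(sample.sub('ps-sdlk-6', 'ps-sdlk-6e'))   # => true
p valid_target?(sample.sub('uuuuu3xx', 'uuuu_3xx'))      # => false
```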

Case-sensitive substitutions with gsub

As an exercise, I'm working on an accent translation dictionary. My dictionary is contained in a hash, and I'm thinking of using #gsub! to run inputted strings through the translator.
I'm wondering if there's any way to make the substitutions case-sensitive. For example, I want "didja" to translate to "did you" and "Didja" to translate to "Did you", but I don't want to have to create multiple dictionary entries to deal with case.
I know I can use regex syntax to find strings to replace case-insensitively, with str.gsub!(/#{x}/i,dictionary[x]) where x is a variable. The problem is that this replaces "Didja" with "did you", rather than matching the original case.
Is there any way to make it match the original case?
Suppose we have:
a method to_key that converts a string str to a key in a hash DICTIONARY; and
a method transform that converts the pair [str, DICTIONARY[to_key(str)]] to the replacement for str.
Then str is to be replaced with:
transform(str, DICTIONARY[to_key(str)])
Without loss of generality, I think we can assume that DICTIONARY's keys and values are all of the same case (say, lower case) and that to_key is simply:
def to_key(str)
str.downcase
end
So all that is necessary is to define the method transform. However, the specification provided does not imply a unique mapping. We therefore must decide what transform should do.
For example, suppose the rule is simply that, if the first character of str and the first character of the dictionary value are both letters, the latter is to be converted to upper case if the former is upper case. Then:
def transform(str, dict_value)
(str[0] =~ /[A-Z]/) ? dict_value.capitalize : dict_value
end
(I originally had dict_value[0] = dict_value[0].upcase if..., but came to my senses after reading @sawa's answer.)
Note that if DICTIONARY['cat'] => 'dog', 'Cat' will be converted to 'Dog'.
One might think that another possibility is that all characters of str that are letters should maintain their case. This is problematic, however, as the dictionary mapping may (without further specification) remove letters, and it may not be clear from DICTIONARY[str] which letters of str were removed, some of which may be lower case and others upper case.
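Putting the pieces together, a complete version under the first-letter rule might look like the sketch below. The dictionary contents and the translate method are assumptions for illustration; to_key and transform are as defined above.

```ruby
DICTIONARY = { 'didja' => 'did you' }.freeze  # sample entry only

def to_key(str)
  str.downcase
end

def transform(str, dict_value)
  str[0] =~ /[A-Z]/ ? dict_value.capitalize : dict_value
end

# Hypothetical wrapper: replace whole words found in DICTIONARY,
# matching case-insensitively and restoring the leading capital.
def translate(text)
  pattern = /\b(?:#{Regexp.union(DICTIONARY.keys).source})\b/i
  text.gsub(pattern) { |m| transform(m, DICTIONARY[to_key(m)]) }
end

p translate("Didja see it? didja?")  # => "Did you see it? did you?"
```

Using the block form of gsub is what makes this work: the block sees each matched word with its original case before choosing the replacement.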
It is not clear what capitalization patterns you have in mind. I assume that you only need to deal with words that are all lower case, or all lower case except for the first letter.
str.gsub!(/#{x}/i) { |m| m.downcase! ? dictionary[m].capitalize : dictionary[m] }
I don't think this is possible, since in this scenario you need to specify the exact string that must take the place of the replaced string.
With that in mind, this is the best I can suggest:
subs = {'didja' => 'did you'}
subs.clone.each{ |k, v| subs[k.capitalize] = v.capitalize }
# if you want to replace all occurrences i.e. even substrings:
regex = /#{subs.keys.join('|')}/
# if you want to replace complete words only: (as the Tin Man points out)
regex = /\b(?:#{subs.keys.join('|')})\b/ # \b checks for word-boundaries
"didja Didja".gsub(regex, subs)
Update:
Because in your example, the case-sensitive character isn't to be replaced by another value, you could use this:
regex = /(?<=(d))idja/i # again, keep in mind the substrings
"didja Didja".gsub(regex, "id you")

How to match full words and not substrings in Ruby

This is my code
stopwordlist = "a|an|all"
File.open('0_9.txt').each do |line|
line.downcase!
line.gsub!( /\b#{stopwordlist}\b/,'')
File.open('0_9_2.txt', 'w') { |f| f.write(line) }
end
I wanted to remove words - a,an and all
But, instead it matches substrings also and removes them
For an example input -
Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life
I get the output -
bromwell high is cartoon comedy. it r t the same time s some other programs bout school life
As you can see, it matched the substring.
How do I make it just match the word and not substrings ?
The | operator in regex takes the widest scope possible. Your original regex matches either \ba or an or all\b.
Change the whole regex to:
/\b(?:#{stopwordlist})\b/
or change stopwordlist into a regex instead of a string.
stopwordlist = /a|an|all/
Even better, you may want to use Regexp.union.
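A minimal sketch of the Regexp.union variant, assuming the stop words are kept in an array. The sample line and the squeeze call (to collapse the doubled spaces left behind) are additions for the example.

```ruby
stopwords = %w[a an all]

# Interpolating the union keeps it grouped, so \b applies to every alternative.
pattern = /\b#{Regexp.union(stopwords)}\b/

line = "it ran at the same time as all other programs about a school"
cleaned = line.gsub(pattern, '').squeeze(' ')
p cleaned  # => "it ran at the same time as other programs about school"
```

Regexp.union also escapes any regex metacharacters in the words, which joining with | by hand does not.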
Try this. It will look for word boundaries around each word:
\ba\b|\ban\b|\ball\b

Regex Match Until Word Contained in Array

Using Ruby 1.8.7
I need to grab everything up to a certain word - and I would like to match against words in an array. Example:
match_words = ['title','author','pages']
item = "Title: Jurassic Park\n"
item += "Author: Michael Crichton\n"
if item =~ /title: (.*)#{match any word in match_words array}/i
#do something here
end
So, this would ideally return "Jurassic Park\n". I am currently matching on newlines but have found that the data I will be matching against might have newlines in strange places, like the middle of the sentence. So, I think matching to the next match_word would be a good idea.
Is this possible, or maybe can be done another way?
Try this on for size
item.scan(/(title|author|pages):\s*?(.+)/i)
What this says is: find all the results that start (case-insensitively) with either title, author or pages, followed by a colon, optional whitespace and then some characters. Capture the label and then the characters following the whitespace. The scan method will match as many times as it can.
Just iterate over the match words and do the regex compare as you normally would.
match_words.each do |word|
if item =~ /#{word}/ # Plus case sensitivity, start/end of item, etc.
# etc.
end
end
But if you know that the things you care about are at the beginning of the lines, then split the input string on \n and just use start_with instead of bothering with the regex--that partially depends on what the real data looks like.
First, create a | separated list of keywords from match_words.
Then, use string.scan to split the string apart, giving you an array of arrays with your results. See the end of this tutorial for a reference.
Here's my best shot:
keywords = match_words.join('|')
results = item.scan(/(#{keywords}):\s*(.+?)\s*(?=(?:#{keywords}):|\z)/im)
Results: [["Title", "Jurassic Park"], ["Author", "Michael Crichton"]]
Don't forget to use the /m switch to indicate that you want . to match newlines.
To explain the pattern: we look for a keyword, then use a "look ahead" (?= ) to find the next keyword (or the end of the string, \z, for the last entry) without consuming or capturing it. We capture all characters in between using a "lazy" expression .+?, so that we don't capture other keywords.
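As a rough end-to-end check of the lookahead idea, the sketch below feeds in a record with a newline in the middle of a value and turns the scan results into a hash. The sample data, the whitespace normalisation and the fields variable are assumptions for the example; it was written with a current Ruby in mind rather than 1.8.7.

```ruby
match_words = %w[title author pages]
keywords    = match_words.join('|')

# Lazy capture up to the next "keyword:" or the end of the string.
pattern = /(#{keywords}):\s*(.+?)\s*(?=(?:#{keywords}):|\z)/im

item = "Title: Jurassic\nPark\nAuthor: Michael Crichton\nPages: 400\n"

fields = item.scan(pattern)
             .map { |label, value| [label.downcase, value.gsub(/\s+/, ' ')] }
             .to_h
p fields
# => {"title"=>"Jurassic Park", "author"=>"Michael Crichton", "pages"=>"400"}
```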

Very odd issue with Ruby and regex

I am getting completely different results from string.scan and several regex testers...
I am just trying to grab the domain from the string, it is the last word.
The regex in question:
/([a-zA-Z0-9\-]*\.)*\w{1,4}$/
The string (1 single line, verified in Ruby's runtime btw)
str = 'Show more results from software.informer.com'
This works fine in the regex testers, but in Ruby...
irb(main):050:0> str.scan /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
=> [["informer."]]
I would think that I would get a match on software.informer.com, which is my goal.
Your regex is correct, the result has to do with the way String#scan behaves. From the official documentation:
"If the pattern contains groups, each individual result is itself an array containing one entry per group."
Basically, if you put parentheses around the whole regex, the first element of each array in your results will be what you expect.
It does not look as if you expect more than one result (especially as the regex is anchored). In that case there is no reason to use scan.
'Show more results from software.informer.com'[ /([a-zA-Z0-9\-]*\.)*\w{1,4}$/ ]
#=> "software.informer.com"
If you do need to use scan (in which case you obviously need to remove the anchor), you can use (?:) to create non-capturing groups.
'foo.bar.baz lala software.informer.com'.scan( /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}/ )
#=> ["foo.bar.baz", "lala", "software.informer.com"]
You are getting a match on software.informer.com. Check the value of $&. The return of scan is an array of the captured groups. Add capturing parentheses around the suffix, and you'll get the .com as part of the return value from scan as well.
The regex testers and Ruby are not disagreeing about the fundamental issue (the regex itself). Rather, their interfaces are differing in what they are emphasizing. When you run scan in irb, the first thing you'll see is the return value from scan (an Array of the captured subpatterns), which is not the same thing as the matched text. Regex testers are most likely oriented toward displaying the matched text.
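The difference between what scan returns and what was actually matched can be seen side by side. The variable names capturing and non_capturing are only labels for this sketch.

```ruby
str = 'Show more results from software.informer.com'

capturing     = /([a-zA-Z0-9\-]*\.)*\w{1,4}$/
non_capturing = /(?:[a-zA-Z0-9\-]*\.)*\w{1,4}$/

# With a capture group, scan reports the group's last repetition per match.
p str.scan(capturing)      # => [["informer."]]

# String#[] returns the whole matched text regardless of groups.
p str[capturing]           # => "software.informer.com"

# Without capture groups, scan reports the whole matched text.
p str.scan(non_capturing)  # => ["software.informer.com"]
```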
How about doing this :
/([a-zA-Z0-9\-]*\.*\w{1,4})$/
This returns
informer.com
On your test string.
http://rubular.com/regexes/13670
