Ruby regex to remove all consecutive letters from string - ruby

Here is the problem (from Codeforces)
Polycarp thinks about the meaning of life very often. He does this constantly, even when typing in the editor. Every time he starts brooding he can no longer fully concentrate and repeatedly presses the keys that need to be pressed only once. For example, instead of the phrase "how are you" he can type "hhoow aaaare yyoouu".
Polycarp decided to automate the process of correcting such errors. He decided to write a plug-in to the text editor that will remove pairs of identical consecutive letters (if there are any in the text). Of course, this is not exactly what Polycarp needs, but he's got to start from something!
Help Polycarp and write the main plug-in module. Your program should remove from a string all pairs of identical letters, which are consecutive. If after the removal there appear new pairs, the program should remove them as well. Technically, its work should be equivalent to the following: while the string contains a pair of consecutive identical letters, the pair should be deleted. Note that deleting of the consecutive identical letters can be done in any order, as any order leads to the same result.
Here is my solution.. For some reason it fails an extremely large test case. Mine seems to get rid of more letters than it is supposed to. Is this regular expression incorrect?
str = gets.chomp
while str =~ /(.)\1/
str.gsub!(/(.)\1+/,'')
end
puts str
EDIT -- This solution doesn't work because it gets rid of all consecutive groups of characters. It should only get rid of duplicates. If I do it this way, which I believe to be correct, it times out on extremely large strings:
str = gets.chomp
while str =~ /(.)\1/
str.gsub!(/(.)\1/,'')
end
puts str

Why does it have to be a regular expression?
'foobar'.squeeze
=> "fobar"
"hhoow aaaare yyoouu".squeeze
=> "how are you"
squeeze is a useful tool for compressing runs of all characters, or specific ones. Here are some examples from the documentation:
"yellow moon".squeeze #=> "yelow mon"
" now is the".squeeze(" ") #=> " now is the"
"putters shoot balls".squeeze("m-z") #=> "puters shot balls"
If "aab" becomes "b", then you're not following the example given in the question, which is "hhoow" turns into "how". By your statement it would be "w", and "yyoouu" would be "". I think you're reading too much into it and not understanding the problem based on their sample input and sample output.

ha, harder than I thought when I first read. What about
s = "hhoow aaaareer yyoouu"
while s.gsub!(/(.)\1+/, '')
end
puts s
which leaves s == 'w' if i understand the problem correctly.

Related

Regular expression for validating long, complicated dns targets

The DNS entries i am trying to validate are quite long. Here's an example of what the structure might look like:
qwer-0123a4bcd567890e1-uuuuu3xx.qwer-gfd-1e098765dcb4a3210.ps-sdlk-6.qwer.domain.com
These entries can be thought of as three distinct parts:
qwer-0123a4bcd567890e1-uuuuu3xx.qwer-gfd-1e098765dcb4a3210.
Always starts with qwer-
Followed by 17 alphanumerics, a -, 8 more alphanumerics
Followed by qwer-gfd-
Followed by 17 more alphanumerics and a .
ps-sdlk-6
Always starts with ps-sdlk-
Followed by either one or two alphanumeric. In this case it could be ps-sdlk-6 or something like ps-sdlk-6e
.qwer.domain.com
The domain target always ends with .qwer.domain.com
I've been hacking together a regex and came up with this monstrosity:
qwer-[\w]{17}-[\w]{8}.qwer-gfd-[\w]{17}.(.*)(qwer.domain.com)
That solution is pretty hideous and it returns multiple match groups which doesn't give me much confidence in the accuracy. I'm using ruby 2.5 but non std lib stuff is difficult to import in this case.
Is there a more sensible and complete/accurate regex to confirm the validity of these dns targets? Is there a better way to do this without regex?
Considering the complexity of testing longish regular expressions, and also the possibility--if not the probability--that changes will be needed in future, I would be inclined to split the string on hyphens and test each string in the resulting array.
PIECES = [['qwer'],
['0123a4bcd567890e1'.size],
['uuuuu3xx'.size, '.qwer'],
['gfd'],
['1e098765dcb4a3210'.size, '.ps'],
['sdlk'],
[[1, 2], '.qwer.domain.com']].
map do |a|
Regexp.new(
a.each_with_object('\A') do |o,s|
case o
when String
s << o.gsub('.', '\.')
when Integer
s << "\\p{Alnum}{#{o}}"
else # array
s << "\\p{Alnum}{#{o.first},#{o.last}}"
end
end << '\z')
end
#=> [/\Aqwer\z/, /\A\p{Alnum}{17}\z/, /\A\p{Alnum}{8}\.qwer\z/,
# /\Agfd\z/, /\A\p{Alnum}{17}\.ps\z/, /\Asdlk\z/,
# /\A\p{Alnum}{1,2}\.qwer\.domain\.com\z/]
Notice that I've used single quotes in most places to be able to write, for example, '\A' rather than "\\A". However, double quotes are needed for the two lines where interpolation is performed (#{o}). I've also used strings from the example to determine the lengths of various runs of alphanumeric characters and have escaped periods and added anchors in simple code. I did that to both reduce the chance of counting errors and help readers of the code understand what is being done. Though the elements of PIECES (regular expressions) are here being used to test the string used to construct PIECES that is of course irrelevant, assuming, as we must, that all strings to be tested will have the same pattern.
def valid?(str)
arr = str.split('-')
return false unless arr.size == PIECES.size
arr.zip(PIECES).all? { |s,r| s.match? r }
end
If Enumerable#all?'s block returns false all? immediately returns false. This is sometimes referred to as short-circuiting behaviour.
For the string given in the example, str,
valid?(str)
#=> true
Note the following intermediate calculation.
str.split('-').zip(PIECES)
#=> [["qwer", /\Aqwer\z/],
# ["0123a4bcd567890e1", /\A\p{Alnum}{17}\z/],
# ["uuuuu3xx.qwer", /\A\p{Alnum}{8}\.qwer\z/],
# ["gfd", /\Agfd\z/],
# ["1e098765dcb4a3210.ps", /\A\p{Alnum}{17}\.ps\z/],
# ["sdlk", /\Asdlk\z/],
# ["6.qwer.domain.com", /\A\p{Alnum}{1,2}\.qwer\.domain\.com\z/]]
This may seem overkill (and I'm not certain it isn't), but it does facilitate debugging and testing and if, in future, the string pattern changes (within limits) it should be relatively easy to modify the matching test (by changing the array of arrays above from which PIECES is derived).
I think given your input data you have no choice but an ugly regex e.g.
^qwer-\w{17}-\w{8}\.qwer-gfd-\w{17}\.ps-sdlk-\w{1,2}\.qwer\.domain\.com$
Note that I have used \w as you did, however \w also matches _ as well as alphanumeric characters, so you may want to replace it with [A-Za-z0-9]. Also, . will match any character, so to specifically match a . you need \. in your regex.
Demo on regex101.com

Finding and Editing Multiple Regex Matches on the Same Line

I want to add markdown to key phrases in a (gollum) wiki page that will link to the relevant wiki page in the form:
This is the key phrase.
Becomes
This is the [[key phrase|Glossary#key phrase]].
I have a list of key phrases such as:
keywords = ["golden retriever", "pomeranian", "cat"]
And a document:
Sue has 1 golden retriever. John has two cats.
Jennifer has one pomeranian. Joe has three pomeranians.
I want to iterate over every line and find every match (that isn't already a link) for each keyword. My current attempt looks like this:
File.foreach(target_file) do |line|
glosses.each do |gloss|
len = gloss.length
# Create the regex. Avoid anything that starts with [
# or (, ends with ] or ), and ignore case.
re = /(?<![\[\(])#{gloss}(?![\]\)])/i
# Find every instance of this gloss on this line.
positions = line.enum_for(:scan, re).map {Regexp.last_match.begin(0) }
positions.each do |pos|
line.insert(pos, "[[")
# +2 because we just inserted 2 ahead.
line.insert(pos+len+2, "|#{page}\##{gloss}]]")
end
end
puts line
end
However, this will run into a problem if there are two matches for the same key phrase on the same line. Because I insert things into the line, the position I found for each match isn't accurate after the first one. I know I could adjust for the size of my insertions every time but, because my insertions are a different size for each gloss, it seems like the most brute-force, hacky solution.
Is there a solution that allows me to make multiple insertions on the same line at the same time without several arbitrary adjustments each time?
After looking at #BryceDrew's online python version, I realized ruby probably also has a way to fill in the match. I now have a much more concise and faster solution.
First, I needed to make regexes of my glosses:
glosses.push(/(?<![\[\(])#{gloss}(?![\]\)])/i)
Note: The majority of that regex is look-ahead and look-behind assertions to prevent catching a phrase that's already part of a link.
Then, I needed to make a union of all of them:
re = Regexp.union(glosses)
After that, it's as simple as doing gsub on every line, and filling in my matches:
File.foreach(target_file) do |line|
line = line.gsub(re) {|match| "[[#{match}|Glossary##{match.downcase}]]"}
puts line
end

Case-sensitive substitutions with gsub

As an exercise, I'm working on an accent translation dictionary. My dictionary is contained in a hash, and I'm thinking of using #gsub! to run inputted strings through the translator.
I'm wondering if there's any way to make the substitutions case-sensitive. For example, I want "didja" to translate to "did you" and "Didja" to translate to "Did you", but I don't want to have to create multiple dictionary entries to deal with case.
I know I can use regex syntax to find strings to replace case-insensitively, with str.gsub!(/#{x}/i,dictionary[x]) where x is a variable. The problem is that this replaces "Didja" with "did you", rather than matching the original case.
Is there any way to make it match the original case?
Suppose we have:
a method to_key that converts a string str to a key in a hash DICTIONARY; and
a method transform that converts the pair [str, DICTIONARY[to_key(str)]] to the replacement for str.
Then str is to be replaced with:
transform(str, DICTIONARY[to_key(str)]])
Without lose of generality, I think we can assume that DICTIONARY's keys and values are all of the same case (say, lower case) and that to_key is simply:
def to_key(str)
str.downcase
end
So all that is necessary is to define the method transform. However, the specification provided does not imply a unique mapping. We therefore must decide what transform should do.
For example, suppose the rule is simply that, if the first character of str and the first character of the dictionary value are both letters, the latter is to be converted to upper case if the former is upper case. Then:
def transform(str, dict_value)
(str[0] =~ /[A-Z]/) ? dict_value.capitalize : dict_value
end
(I originally had dict_value[0] = dict_value[0].upcase if..., but came to my senses after reading #sawa's answer.)
Note that if DICTIONARY['cat'] => 'dog', 'Cat' will be converted to 'Dog'.
One might think that another possibility is that all characters of str that are letters should maintain their case. This is problematic, however, as the dictionary mapping may (without further specification) remove letters, and it may not be clear from DICTIONARY[str] which letters of str were removed, some of which may be lower case and others upper case.
It is not clear what capitalization patterns you have in mind. I assume that you only need to deal with words that are all low case or all low case except the first letter.
str.gsub!(/#{x}/i){|x| x.downcase! ? dictionary[x].capitalize : dictionary[x]}
I don't think this is possible since in this scenario you need to specify the exact string that must take place of the replaced string.
With that in mind, this is the best I can suggest:
subs = {'didja' => 'did you'}
subs.clone.each{ |k, v| subs[k.capitalize] = v.capitalize }
# if you want to replace all occurrences i.e. even substrings:
regex = /#{subs.keys.join('|')}/
# if you want to remove complete words only: (as the Tin man points out)
regex = /\b(?:#{subs.keys.join('|')})\b/ # \b checks for word-boundaries
"didja Didja".gsub(regex, subs)
Update:
Because in your example, the case-sensitive character isn't to be replaced by another value, you could use this:
regex = /(?<=(d))idja/i # again, keep in mind the substrings
"didja Didja".gsub(regex, "id you")

Regular expression to find first letter in a string

Consider this example string:
mystr ="1. moody"
I want to capitalize the first letter that occurs in mystr. I am trying this regular expression in Ruby but still returns all the letters in mystr (moody) instead of the letter m only.
puts mystr.scan(/[a-zA-Z]{1}/)
Any help appreciated!
Do as below using String#sub
(arup~>~)$ pry --simple-prompt
>> s = "1. moody"
=> "1. moody"
>> s.sub(/[a-z]/i,&:upcase)
=> "1. Moody"
>>
If you want to modify the source string use s.sub!(/[a-z]/,&:upcase).
Just for completeness, although it doesn’t directly answer your question as posed but could be relevant, consider this variation:
mystr ="1. école"
The line mystr.sub(/[a-z]/i,&:upcase) (as in Arup Rakshit’s answer) will match the second letter of the word, producing
1. éCole
The line mystr.sub /\b\s?[a-zA-Z]{1}/, &:upcase (diego.greyrobot’s answer) won’t match at all and so the line will be unchanged.
There are two problems here. The first is that [a-zA-Z] doesn’t match accented characters, so é isn’t matched. The fix for this is to use the \p{Letter} character property:
mystr.sub /\p{Letter}/, &:upcase
This will match the character in question, but won’t change it. This is due to the second problem, which is that upcase (and downcase) only works on characters in the ASCII range. This is almost as easy to fix, but relies on using an external library such as unicode_utils:
require 'unicode_utils'
mystr.sub(/\p{Letter}/) { |c| UnicodeUtils.upcase(c)}
This results in:
1. École
which is probably what is wanted in this case.
This may not affect you if you are sure all your data is just ASCII, but is worth knowing for other situations.
The reason your attempt returns all the letters is because you are using the scan method which does just that, it returns all the characters which match the regex, in your case letters. For your use case you should use sub since you only want to substitute 1 letter.
I use http://rubular.com to practice my Ruby Regexes. Here's what I came up with http://rubular.com/r/fAQEDFVEVn
The regex is: /\b[a-z]/
It uses \b to find a word boundary, and finally we ask for one letter only with [a-zA-Z]
Finally we'll use sub to replace it with its upcased version:
"1. moody".sub /\b[a-z]/, &:upcase
=> "1. Moody"
Hope that helps.

A more elegant way to parse a string with ruby regular expression using variable grouping?

At the moment I have a regular expression that looks like this:
^(cat|dog|bird){1}(cat|dog|bird)?(cat|dog|bird)?$
It matches at least 1, and at most 3 instances of a long list of words and makes the matching words for each group available via the corresponding variable.
Is there a way to revise this so that I can return the result for each word in the string without specifying the number of groups beforehand?
^(cat|dog|bird)+$
works but only returns the last match separately , because there is only one group.
OK, so I found a solution to this.
It doesn't look like it is possible to create an unknown number of groups, so I went digging for another way of achieving the desired outcome: To be able to tell if a string was made up of words in a given list; and to match the longest words possible in each position.
I have been reading Mastering Regular Expressions by Jeffrey E. F. Friedl and it shed some light on things for me. It turns out that NFA based Regexp engines (like the one used in Ruby) are sequential as well as lazy/greedy. This means that you can dictate how a pattern is matched using the order in which you give it choices. This explains why scan was returning variable results, it was looking for the first word in the list that matched the criteria and then moved on to the next match. By design it was not looking for the longest match, but the first one. So in order to rectify this all I needed to do was reorder the array of words used to generate the regular expression from alphabetical order, to length order (longest to shortest).
array = %w[ as ascarid car id ]
list = array.sort_by {|word| -word.length }
regexp = Regexp.union(list)
Now the first match found by scan will be the longest word available. It is also pretty simple to tell if a string contains only words in the list using scan:
if "ascarid".scan(regexp).join.length == word.length
return true
else
return false
end
Thanks to everyone that posted in response to this question, I hope that this will help others in the future.
You could do it in two steps:
Use /^(cat|dog|bird)+$/ (or better /\A(cat|dog|bird)+\z/) to make sure it matches.
Then string.scan(/cat|dog|bird/) to get the pieces.
You could also use split and a Set to do both at once. Suppose you have your words in the array a and your string in s, then:
words = Set.new(a)
re = /(#{a.map{|w| Regexp.quote(w)}.join('|')})/
parts = s.split(re).reject(&:empty?)
if(parts.any? {|w| !words.include?(w) })
# 's' didn't match what you expected so throw a
# hissy fit, format the hard drive, set fire to
# the backups, or whatever is appropriate.
else
# Everything you were looking for is in 'parts'
# so you can check the length (if you care about
# how many matches there were) or something useful
# and productive.
end
When you use split with a pattern that contains groups then
the respective matches will be returned in the array as well.
In this case, the split will hand us something like ["", "cat", "", "dog"] and the empty strings will only occur between the separators that we're looking for and so we can reject them and pretend they don't exist. This may be an unexpected use of split since we're more interested in the delimiters more than what is being delimited (except to make sure that nothing is being delimited) but it gets the job done.
Based on your comments, it looks like you want an ordered alternation so that (ascarid|car|as|id) would try to match from left to right. I can't find anything in the Ruby Oniguruma (the Ruby 1.9 regex engine) docs that says that | is ordered or unordered; Perl's alternation appears to be specified (or at least strongly implied) to be ordered and Ruby's certainly behaves as though it is ordered:
>> 'pancakes' =~ /(pan|pancakes)/; puts $1
pan
So you could sort your words from longest to shortest when building your regex:
re = /(#{a.sort_by{|w| -w.length}.map{|w| Regexp.quote(w)}.join('|')})/
and hope that Oniguruma really will match alternations from left to right. AFAIK, Ruby's regexes will be eager because they support backreferences and lazy/non-greedy matching so this approach should be safe.
Or you could be properly paranoid and parse it in steps; first you'd make sure your string looks like what you want:
if(s !~ /\A(#{a.map{|w| Regexp.quote(w)}.join('|')})+\z/)
# Bail out and complain that 's' doesn't look right
end
The group your words by length:
by_length = a.group_by(&:length)
and scan for the groups from the longest words to the shortest words:
# This loses the order of the substrings within 's'...
matches = [ ]
by_length.keys.sort_by { |k| -k }.each do |group|
re = /(#{a.map{|w| Regexp.quote(w)}.join('|')})/
s.gsub!(re) { |w| matches.push(w); '' }
end
# 's' should now be empty and the matched substrings will be
# in 'matches'
There is still room for possible overlaps in these approaches but at least you'd be extracting the longest matches.
If you need to repeat parts of a regex, one option is to store the repeated part in a variable and just reference that, for example:
r = "(cat|dog|bird)"
str.match(/#{r}#{r}?#{r}?/)
You can do it with .Net regular expressions. If I write the following in PowerShell
$pat = [regex] "^(cat|dog|bird)+$"
$m = $pat.match('birddogcatbird')
$m.groups[1].captures | %{$_.value}
then I get
bird
dog
cat
bird
when I run it. I know even less about IronRuby than I do about PowerShell, but perhaps this means you can do it in IronRuby as well.

Resources