Regexp to match repeated substring - ruby

I would like to verify a string containing repeated substrings. The substrings have a particular structure, and so does the whole string (substrings separated by "|"). For instance, the string can be:
1=23.00|6=22.12|12=21.34|112=20.34
1=23.00|6=22.12|12=21.34
1=23.00|12=21.34
1=23.00**
How can I check that all repeated substrings match a regexp? I tried to check it with:
"1=23.00|6=22.12|12=21.34".match(/([1-9][0-9]*[=][0-9\.]+)+/)
But the match succeeds (it returns a truthy MatchData) even when several substrings do not match the regexp:
"1=23.00|6=ass|=21.34".match(/([1-9][0-9]*[=][0-9\.]+)+/)
# => #<MatchData "1=23.00" 1:"1=23.00">

The question is whether every repeated substring matches a regex. I understand that the substrings are separated by the character | or $/, the latter being the end of a line. We first need to obtain the repeated substrings:
a = str.split(/[#{$/}\|]/)
.map(&:strip)
.group_by {|s| s}
.select {|_,v| v.size > 1 }
.keys
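Next we specify whatever regex you wish to use. I am assuming it is this:
REGEX = /[1-9][0-9]*=[1-9]+\.[0-9]+/
but it could be altered if you have other requirements. As written it is unanchored, so a string such as "1=23.00**" still matches, and [1-9]+ rejects values with a zero before the decimal point, such as 20.34; add \A and \z and use [0-9]+ if that matters.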
As we wish to determine if all repeated substrings match the regex, that is simply:
a.all? {|s| s =~ REGEX}
Here are the calculations:
str =<<_
1=23.00|6=22.12|12=21.34|112=20.34
1=23.00|6=22.12|12=21.34
1=23.00|12=21.34
1=23.00**
_
c = str.split(/[#{$/}\|]/)
#=> ["1=23.00", "6=22.12", "12=21.34", "112=20.34", "1=23.00",
# "6=22.12", "12=21.34", "1=23.00", "12=21.34", "1=23.00**"]
d = c.map(&:strip)
# same as c, possibly not needed or not wanted
e = d.group_by {|s| s}
# => {"1=23.00" =>["1=23.00", "1=23.00", "1=23.00"],
# "6=22.12" =>["6=22.12", "6=22.12"],
# "12=21.34" =>["12=21.34", "12=21.34", "12=21.34"],
# "112=20.34"=>["112=20.34"], "1=23.00**"=>["1=23.00**"]}
f = e.select {|_,v| v.size > 1 }
#=> {"1=23.00"=>["1=23.00", "1=23.00" , "1=23.00"],
# "6=22.12"=>["6=22.12", "6=22.12"],
# "12=21.34"=>["12=21.34", "12=21.34", "12=21.34"]}
a = f.keys
#=> ["1=23.00", "6=22.12", "12=21.34"]
a.all? {|s| s =~ REGEX}
#=> true
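The steps above can be collapsed into one helper. This is only a sketch: the method name repeated_fields_valid? and the constant FIELD are hypothetical, Enumerable#tally needs Ruby 2.7 or later, and the pattern is an anchored variant of the REGEX assumed above rather than anything stated in the question.

```ruby
# Anchored variant of the assumed pattern (an assumption; adjust as needed).
FIELD = /\A[1-9][0-9]*=[0-9]+\.[0-9]+\z/

# Hypothetical helper: true when every field that occurs more than once
# matches the pattern. Fields are separated by "|" or a newline.
def repeated_fields_valid?(str, pattern = FIELD)
  counts = str.split(/[\n|]/).map(&:strip).reject(&:empty?).tally
  repeated = counts.select { |_, n| n > 1 }.keys
  repeated.all? { |field| field.match?(pattern) }
end

text = "1=23.00|6=22.12|12=21.34\n1=23.00|12=21.34\n"
puts repeated_fields_valid?(text)
```

Because all? is true for an empty collection, a string with no repeated fields also passes; check for that case separately if it should be rejected.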

This returns true if any value (the part after the =) appears more than once, false otherwise:
s = "1=23.00|6=22.12|12=21.34|112=20.34|3=23.00"
arr = s.split("|").map { |field| field.sub(/\A\d+=/, "") }
arr != arr.uniq # => true

If you want to solve it with a single regexp rather than Ruby code, you should match the whole string, not individual substrings. I added the | separator to your regexp and anchored it at both ends, so it should work the way you want:
\A[1-9][0-9]*=[0-9.]+(\|[1-9][0-9]*=[0-9.]+)*\z
Try it out.
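A whole-string check of this kind might look like the sketch below. The names FIELD_SRC, WHOLE_LINE and valid_line? are hypothetical, and the field grammar (a positive integer key, "=", then digits with a decimal part) is an assumption based on the sample strings in the question.

```ruby
# One field: positive integer key, "=", digits, ".", digits (assumed grammar).
FIELD_SRC = /[1-9][0-9]*=[0-9]+\.[0-9]+/

# The whole string: one field, then zero or more "|field" groups,
# anchored at both ends so nothing else may appear.
WHOLE_LINE = /\A#{FIELD_SRC}(?:\|#{FIELD_SRC})*\z/

def valid_line?(line)
  line.match?(WHOLE_LINE)
end

puts valid_line?("1=23.00|6=22.12|12=21.34")
puts valid_line?("1=23.00|6=ass|=21.34")
```

Anchoring with \A and \z is what makes a single bad field fail the whole match, instead of the regex quietly matching the first good field.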

Related

Regex doesn't work on the first time

I have a string e.g. 02112016. I want to make a datetime from this string.
I have tried:
s = "02112016"
s.sub(/(\d{2})(\d{2})(\d{4})/, "#{$1}-#{$2}-#{$3}")
But there is a problem. It returns "--".
If I try this s.sub(/(\d{2})(\d{2})(\d{4})/, "#{$1}-#{$2}-#{$3}") again, it works: "02-11-2016". Now I can use to_datetime method.
But why doesn't the s.sub(/(\d{2})(\d{2})(\d{4})/, "#{$1}-#{$2}-#{$3}") work on the first time?
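It's really a simple change here. The replacement string "#{$1}-#{$2}-#{$3}" is interpolated before sub is even called, so $1 and friends still hold whatever the previous match left behind (nil the first time). They are only assigned after a match succeeds, not during the match itself. If you want the values from the current match, do this: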
s = "02112016"
s.sub(/(\d{2})(\d{2})(\d{4})/, '\1-\2-\3')
# => "02-11-2016"
Here \1 corresponds to what will be assigned to $1. This is especially important if you're using gsub since $1 tends to be the last match only while \1 is evaluated for each match individually.
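Another way to get per-match values is the block form of sub and gsub, since the block runs after each match has set $1 and friends. A small sketch (the second input string is made up for illustration):

```ruby
pattern = /(\d{2})(\d{2})(\d{4})/

# Block form: the block is evaluated after the match, so $1..$3 are current.
one = "02112016".sub(pattern) { "#{$1}-#{$2}-#{$3}" }

# With gsub the block is evaluated again for every match.
many = "02112016 03122017".gsub(pattern) { "#{$1}-#{$2}-#{$3}" }

puts one
puts many
```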
I prefer the following.
r = /
\d{2} # match two digits
(?=\d{4}) # match four digits in a positive lookahead
/x # free-spacing regex definition mode
which is the same as
r = /\d{2}(?=\d{4})/
to be used with String#gsub:
s.gsub(r) { |s| "#{s}-" }
Try it:
"02112016".gsub(r) { |s| "#{s}-" }
#=> "02-11-2016"
What is happening is that the first time you run it, $1, $2 and $3 are nil, which interpolates as an empty string.
You are essentially replacing the digits with empty strings joined by hyphens.
So if we do
s = "02112016"
p $1 #=> nil
p $2 #=> nil
p $3 #=> nil
s.sub(/(\d{2})(\d{2})(\d{4})/, "#{$1}-#{$2}-#{$3}") #=> "--"
p $1 #=> "02"
p $2 #=> "11"
p $3 #=> "2016"
s.sub(/(\d{2})(\d{2})(\d{4})/, "#{$1}-#{$2}-#{$3}") #=> "02-11-2016"
That is why it works the second time.
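Since the end goal is a datetime, the intermediate string may not be needed at all. The sketch below parses the digits directly with the standard date library; it assumes the layout is day, month, year (DDMMYYYY), which the question does not state.

```ruby
require "date"

s = "02112016"

# Assumption: the digits are day, month, year. Use "%m%d%Y" if the
# month comes first.
dt = DateTime.strptime(s, "%d%m%Y")

puts dt.strftime("%d-%m-%Y")
```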
Since the string is always the same length, you can use the [] method to break it up.
s = "#{s[0..1]}-#{s[2..3]}-#{s[4..-1]}"
This will return the desired result
"02-11-2016"

Remove a string pattern and symbols from string

I need to clean up a string from the phrase "not" and hashtags(#). (I also have to get rid of spaces and capslock and return them in arrays, but I got the latter three taken care of.)
Expectation:
"not12345" #=> ["12345"]
" notabc " #=> ["abc"]
"notone, nottwo" #=> ["one", "two"]
"notCAPSLOCK" #=> ["capslock"]
"##doublehash" #=> ["doublehash"]
"h#a#s#h" #=> ["hash"]
"#notswaggerest" #=> ["swaggerest"]
This is the code I have
def some_method(string)
string.split(", ").map{|n| n.sub(/(not)/,"").downcase.strip}
end
All of the above tests do what I need except for the hash ones. I don't know how to get rid of the hashes; I have tried modifying the regex part: n.sub(/(#not)/), n.sub(/#(not)/), n.sub(/[#]*(not)/) to no avail. How can I make the regex remove the #?
arr = ["not12345", " notabc", "notone, nottwo", "notCAPSLOCK",
"##doublehash:", "h#a#s#h", "#notswaggerest"]
arr.flat_map { |str| str.downcase.split(',').map { |s| s.gsub(/#|not|\s+/,"") } }
#=> ["12345", "abc", "one", "two", "capslock", "doublehash:", "hash", "swaggerest"]
When the block variable str is set to "notone, nottwo",
s = str.downcase
#=> "notone, nottwo"
a = s.split(',')
#=> ["notone", " nottwo"]
b = a.map { |s| s.gsub(/#|not|\s+/,"") }
#=> ["one", "two"]
Because I used Enumerable#flat_map, "one" and "two" are added to the array being returned. When str is "notCAPSLOCK",
s = str.downcase
#=> "notcapslock"
a = s.split(',')
#=> ["notcapslock"]
b = a.map { |s| s.gsub(/#|not|\s+/,"") }
#=> ["capslock"]
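Here is one more solution that uses a different technique: capturing what you want rather than dropping what you don't want (for the most part).
a = ["not12345", " notabc", "notone, nottwo",
"notCAPSLOCK", "##doublehash:","h#a#s#h", "#notswaggerest"]
a.map do |s|
s.downcase.strip.delete("#").scan(/(?<=not)\w+|^[^not]\w+/)
end
#=> [["12345"], ["abc"], ["one", "two"], ["capslock"], ["doublehash"], ["hash"], ["swaggerest"]]
The strip is needed because [^not] would otherwise match a leading space and pull "notabc" into the result. I had to delete the # because of h#a#s#h; otherwise delete could have been avoided with a regex like /(?<=not|^#[^not])\w+/. Note that [^not] is a character class that excludes any single n, o or t, so a word that starts with one of those letters but lacks the "not" prefix is skipped.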
You can use this regex with sub and an empty replacement to strip the leading prefix. It handles the test cases where the unwanted part is at the start.
/^\s*#*(not)*/
^ matches the start of the string (strictly, the start of a line)
\s* matches any whitespace at the start
#* matches 0 or more #
(not)* matches the phrase "not" zero or more times.
Note: this regex only strips a leading prefix. It won't work where "not" comes before "#" (not#hash would return #hash), nor for hashes inside the word such as h#a#s#h.
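The same kind of prefix pattern can be written in free-spacing mode with each part commented. In that mode an unescaped # starts a comment, so the literal hash has to be escaped. The constant name LEADING_JUNK is hypothetical, and this variant uses \A and an optional non-capturing group rather than the exact pattern above.

```ruby
# Free-spacing (x) mode: whitespace in the pattern is ignored and
# an unescaped # begins a comment.
LEADING_JUNK = /
  \A \s*    # leading whitespace
  \#*       # any number of literal hashes, escaped because of x mode
  (?:not)?  # an optional "not" prefix
/x

puts " #notswaggerest".sub(LEADING_JUNK, "")
puts "##doublehash".sub(LEADING_JUNK, "")
```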
Fun problem because it can use the most common string functions in Ruby:
result = values.map do |string|
string.strip # Remove spaces in front and back.
.tr('#','') # Transform single characters. In this case remove #
.gsub('not','') # Substitute patterns
.split(', ') # Split into arrays.
end
p result #=>[["12345"], ["abc"], ["one", "two"], ["CAPSLOCK"], ["doublehash"], ["hash"], ["swaggerest"]]
I prefer this way rather than a regexp as it is easy to understand the logic of each line.
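In free-spacing mode (the x flag), Ruby regular expressions allow comments introduced by #, so to match a literal octothorpe (#) there you must escape it. Escaping it is harmless in an ordinary regex too: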
"#foo".sub(/\#/, "") #=> "foo"

Removing trailing zeros in string

I have a string and I need to remove trailing zeros after the 2nd decimal place:
remove_zeros("1,2,3,4.2300") #=> "1,2,3,4.23"
remove_zeros("1,2,3,4.20300") #=> "1,2,3,4.203"
remove_zeros("1,2,3,4.0200") #=> "1,2,3,4.02"
remove_zeros("1,2,3,4.0000") #=> "1,2,3,4.00"
Missing zeros don't have to be appended, i.e.
remove_zeros("1,2,3,4.0") #=> "1,2,3,4.0"
How could I do this in Ruby? I tried with converting into Float but it terminates the string when I encounter a ,. Can I write any regular expression for this?
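Yes, a regular expression could be used. Note that the one below removes every trailing zero after the decimal point, so it does not keep a minimum of two decimal places (see the last example).
R = /
\. # match a decimal point
\d*? # match zero or more digits lazily
\K # discard everything matched so far from the reported match
0+ # match one or more zeroes
(?!\d) # do not match if a digit follows (negative lookahead)
/x # free-spacing regex definition mode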
def truncate_floats(str)
str.gsub(R,"")
end
truncate_floats "1,2,3,4.2300"
#=> "1,2,3,4.23"
truncate_floats "1.34000,2,3,4.23000"
#=> "1.34,2,3,4.23"
truncate_floats "1,2,3,4.23003500"
#=> "1,2,3,4.230035"
truncate_floats "1,2,3,4.3"
#=> "1,2,3,4.3"
truncate_floats "1,2,3,4.000"
#=> "1,2,3,4."
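To keep at least two decimal places, as the question's examples require, the pattern can consume two digits before it starts looking for removable zeros. This is a sketch of that variation; the constant name KEEP_TWO is hypothetical and remove_zeros is the method name used in the question.

```ruby
# Keep the first two decimals, then drop zeros that end the number.
KEEP_TWO = /
  \.\d{2}  # decimal point and the two digits that must stay
  \d*?     # any further digits, as few as possible
  \K       # keep everything so far out of the replaced text
  0+       # the trailing zeros to remove
  (?!\d)   # only when no digit follows
/x

def remove_zeros(str)
  str.gsub(KEEP_TWO, "")
end

puts remove_zeros("1,2,3,4.0000")
puts remove_zeros("1,2,3,4.20300")
```

A number with fewer than two decimals, such as 4.0, never matches and is left unchanged.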
> a = "1,2,3,4.2300"
> a.split(",").map{|e| e.include?(".") ? e.to_f : e}.join(",")
#=> "1,2,3,4.23"
> a = "1,2,3,4.20300"
> a.split(",").map{|e| e.include?(".") ? e.to_f : e}.join(",")
#=> "1,2,3,4.203"
First, you need to parse the string into its component numbers, then remove the trailing zeros on each number. This can be done by:
1) splitting the string on ',' to get an array of numeric strings
2) for each numeric string, convert it to a Float, then back to a string:
#!/usr/bin/env ruby
def parse_and_trim(string)
number_strings = string.split(',')
number_strings.map { |s| Float(s).to_s }.join(',')
end
p parse_and_trim('1,2,3,4.2300') # => "1.0,2.0,3.0,4.23"
If you really want to remove the trailing '.0' fragments, you could replace the script with this one:
#!/usr/bin/env ruby
def parse_and_trim_2(string)
original_strings = string.split(',')
converted_strings = original_strings.map { |s| Float(s).to_s }
trimmed_strings = converted_strings.map do |s|
s.end_with?('.0') ? s[0..-3] : s
end
trimmed_strings.join(',')
end
p parse_and_trim_2('1,2,3,4.2300') # => "1,2,3,4.23"
These could of course be made more concise, but I've used intermediate variables to clarify what's going on.

Ruby transform string of range measurements into a list of the measurements?

I have a sample string that I would like to transform, from this:
#21inch-#25inch
to this:
#21inch #22inch #23inch #24inch #25inch
Using Ruby, please show me how this can be done.
You can scan the string for the numbers and work with a range of strings:
numbers = "#21inch-#25inch".scan(/\d+/)
=> ["21", "25"]
Range.new(*numbers).map{ |s| "##{s}inch" }.join(" ")
=> "#21inch #22inch #23inch #24inch #25inch"
This solution works only if your string has the same format as in your example. For other cases you will need to write your own specific solution.
R = /
(\D*) # match zero or more non-digits in capture group 1
(\d+) # match one or more digits in capture group 2
([^\d-]+) # match one or more chars other than digits and hyphens in capture group 3
/x # free-spacing regex definition mode
def spin_out(str)
(prefix, first, units),(_, last, _) = str.scan(R)
(first..last).map { |s| "%s%s%s" % [prefix,s,units] }.join(' ')
end
spin_out "#21inch-#25inch"
#=> "#21inch #22inch #23inch #24inch #25inch"
spin_out "#45cm-#53cm"
#=> "#45cm #46cm #47cm #48cm #49cm #50cm #51cm #52cm #53cm"
spin_out "sz 45cm-sz 53cm"
#=> "sz 45cm sz 46cm sz 47cm sz 48cm sz 49cm sz 50cm sz 51cm sz 52cm sz 53cm"
spin_out "45cm-53cm"
#=> "45cm 46cm 47cm 48cm 49cm 50cm 51cm 52cm 53cm"
For str = "#21inch-#25inch", we obtain
(prefix, first, units),(_, last, _) = str.scan(R)
#=> [["#", "21", "inch"], ["-#", "25", "inch"]]
prefix
#=> "#"
first
#=> "21"
units
#=> "inch"
last
#=> "25"
The subsequent mapping is straightforward.
You can use a regex gsub with a block match replacement, like this:
string = "#21inch-#25inch"
new_string = string.gsub(/#\d+\w+-#\d+\w+/) do |match|
first_capture, last_capture = match.split("-")
first_num = first_capture.gsub(/\D+/, "").to_i
last_num = last_capture.gsub(/\D+/, "").to_i
pattern = first_capture.split(/\d+/)
(first_num..last_num).map {|num| pattern.join(num.to_s) }.join(" ")
end
puts "#{new_string}"
Running this prints:
#21inch #22inch #23inch #24inch #25inch
For reference, the intermediate values inside the block are:
first_capture: "#21inch", last_capture: "#25inch"
first_num: 21, last_num: 25
pattern: ["#", "inch"]
This approach works for other units in the same format, such as:
#32ft-#49ft
If you make the second # optional in the regex (/#\d+\w+-#?\d+\w+/), it also handles:
#1mm-5mm
#2acres-5acres
Adapting this to other formats is fairly simple. With a slight variation in the regex, you could also support a range format such as #21inch..#25inch, provided the block also splits on /[-.]+/ instead of "-":
/(#\d+\w+)[-.]+(#\d+\w+)/
Happy parsing!
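One way to cover those variations in a single pass is to capture the parts directly instead of splitting the match. This is only a sketch: the names RANGE_PATTERN and expand_ranges are hypothetical, and it assumes the unit consists of ASCII letters and is the same on both sides of the range.

```ruby
# prefix (optional #), first number, unit, "-" or "..", optional #,
# last number, then the same unit again (\3).
RANGE_PATTERN = /(\#?)(\d+)([A-Za-z]+)(?:-|\.\.)\#?(\d+)\3/

def expand_ranges(string)
  string.gsub(RANGE_PATTERN) do
    prefix, first, unit, last = $1, $2.to_i, $3, $4.to_i
    (first..last).map { |n| "#{prefix}#{n}#{unit}" }.join(" ")
  end
end

puts expand_ranges("#21inch-#25inch")
puts expand_ranges("#1mm-5mm")
```

Input that does not match the pattern is returned unchanged, because gsub only rewrites the matched parts.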

How to create an ordered list of matches from multiple Regexps in a string?

How can one get a list of matches in a string from multiple different Regexps, and have these matches ordered relatively by their position in the string?
The string can contain multiple matches from the same Regexp.
Based on sepp2k's answer, here's the solution I implemented (simplified example):
test_data = "
a_word
another_word
23445
12432423
third_word
"
regexps = /(?<word>[a-zA-Z_]+)/, /(?<number>[\d]+)/
words = regexps.flat_map(&:names)
matches = []
test_data.scan(Regexp.union(regexps)) do
words.each do |word|
m = Regexp.last_match
matches << {word => m.to_s} if m[word]
end
end
p matches
This outputs:
[{"word"=>"a_word"}, {"word"=>"another_word"}, {"number"=>"23445"}, {"number"=>"12432423"}, {"word"=>"third_word"}]
You can use Regexp.union to turn all the regexps into one regexp and then use String#scan to find all matches. The array returned by scan will be ordered by the position of the match.
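A compact sketch of that idea follows. The method name ordered_matches is hypothetical; it assumes every regexp wraps its pattern in one named group, so the group that took part in a match identifies which regexp matched.

```ruby
regexps = [/(?<word>[a-zA-Z_]+)/, /(?<number>\d+)/]
union = Regexp.union(regexps)

# Returns one {name => text} hash per match, in order of position.
def ordered_matches(text, union)
  found = []
  text.scan(union) do
    m = Regexp.last_match
    name = m.names.find { |n| m[n] } # the named group that took part
    found << { name => m[0] }
  end
  found
end

p ordered_matches("a_word 23445 third_word", union)
```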
That seems awfully complex when inject and a case statement will do IMHO:
> %w{a_word another_word 23445 12432423 third_word}.inject([]) {|s,v| s << case v when /^[a-zA-Z_]+$/ then {'word' => v} when /^\d+$/ then {'number' => v} end }
=> [{"word"=>"a_word"}, {"word"=>"another_word"}, {"number"=>"23445"}, {"number"=>"12432423"}, {"word"=>"third_word"}]
For readability you could have the following:
data = <<EOD
a_word
another_word
23445
12432423
third_word
EOD
data.split.inject([]) do |s,v|
s << case v
when /^[a-zA-Z_]+$/
{'word' => v}
when /^\d+$/
{'number' => v}
end
end
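With inject and case, a token that matches neither pattern adds nil to the result. If that is unwanted, Enumerable#filter_map (Ruby 2.7 or later) drops such tokens. A sketch, with a made-up token "???" to show the effect:

```ruby
tokens = %w[a_word another_word 23445 ??? third_word]

# case returns nil when no branch matches; filter_map discards nil.
result = tokens.filter_map do |v|
  case v
  when /\A[a-zA-Z_]+\z/ then { "word" => v }
  when /\A\d+\z/ then { "number" => v }
  end
end

p result
```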
