Optional whitespace in regexp

I want to create a regexp that extracts the words between parentheses. For example, I have strings like these:
"something(a,b)"
and
"something(c, d)"
and I want to extract the letters from between the parentheses.
For the first string I want the array ['a','b']; for the second, the array ['c','d'].
I have the following method:
def suffixes(t)
  (t.scan /\((\w+),(\w+)\)/).flatten
end
but it works only for the first case. For the second variant I have:
def suffixes(t)
  (t.scan /\((\w+),[\s](\w+)\)/).flatten
end
but this works only for the second case. I don't know what regexp will work in both cases.

You can use:
def suffixes(t)
  (t.scan /\((\w+)\s*,\s*(\w+)\)/).flatten
end
\s* matches zero or more whitespace characters before and after the comma.
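As a quick check of the pattern above, here is a small sketch that runs the same method against both inputs from the question. The third input, with extra spaces around the comma, is made up for illustration.

```ruby
def suffixes(t)
  t.scan(/\((\w+)\s*,\s*(\w+)\)/).flatten
end

p suffixes("something(a,b)")     # => ["a", "b"]
p suffixes("something(c, d)")    # => ["c", "d"]
p suffixes("something(e ,  f)")  # => ["e", "f"]  (made-up input)
```

Note that this pattern still expects exactly two words between the parentheses.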

Make the \s after the comma optional:
def suffixes(t)
  (t.scan /\((\w+),\s?(\w+)\)/).flatten
end
The ? after \s makes the whitespace character optional (it matches zero or one).
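To see where \s? stops being enough, here is a minimal sketch. The method name suffixes_optional_space is hypothetical (used only to keep this variant apart from the others), and the last two inputs are made up.

```ruby
# Hypothetical name; the body is the \s? pattern from this answer.
def suffixes_optional_space(t)
  t.scan(/\((\w+),\s?(\w+)\)/).flatten
end

p suffixes_optional_space("something(a,b)")    # => ["a", "b"]
p suffixes_optional_space("something(c, d)")   # => ["c", "d"]
p suffixes_optional_space("something(c,  d)")  # => []  (two spaces after the comma)
p suffixes_optional_space("something(c ,d)")   # => []  (space before the comma)
```

So \s? covers exactly the two inputs in the question; if the spacing can vary more than that, \s* on both sides of the comma is the more forgiving choice.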

I would suggest separating "scanning" for the text between the parentheses from "splitting" the result by comma:
s = "something(c, d)"
s.match( /\((.+)\)/ )[1] # find the text between parentheses
  .split(/,/) # split the result by comma
  .map(&:strip) # strip whitespace from the values
It's more Ruby-like, in my understanding. Hope it helps.
UPD: Thanks to @theTinMan, there are two ways to improve this answer. First, s[/\((.+)\)/, 1] looks better and executes faster than s.match( /\((.+)\)/ )[1]. Second, splitting by a string is faster than splitting by a regexp. To sum up:
s = "something(c, d)"
s[ /\((.+)\)/, 1 ] # find the text between parentheses
  .split(',') # split the result by comma
  .map(&:strip) # strip whitespace from the values
Proof:
require 'benchmark'
n = 1_000_000
s = "something(c, d)"
Benchmark.bm do |x|
  x.report { n.times { s.match( /\((.+)\)/ )[1].split(/,/).map(&:strip) } }
  x.report { n.times { s.match( /\((.+)\)/ )[1].split(',').map(&:strip) } }
  x.report { n.times { s[/\((.+)\)/, 1].split(/,/).map(&:strip) } }
  x.report { n.times { s[/\((.+)\)/, 1].split(',').map(&:strip) } }
end
#       user     system      total        real
#   3.590000   0.000000   3.590000 (  3.598151)
#   3.030000   0.000000   3.030000 (  3.028137)
#   2.940000   0.000000   2.940000 (  2.942490)
#   2.180000   0.000000   2.180000 (  2.182447)
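One thing to watch with the scan-then-split approach: s[regexp, 1] returns nil when the string has no parentheses, so the chained .split would raise. A hedged sketch of a wrapper follows. The name suffixes_split and the to_s fallback are my additions, not part of the answer above.

```ruby
# Hypothetical wrapper around the scan-then-split idea.
def suffixes_split(s)
  s[/\((.+)\)/, 1]   # text between the parentheses, or nil
    .to_s            # nil becomes "" so the chain keeps working
    .split(',')      # split by a plain string
    .map(&:strip)    # drop surrounding whitespace
end

p suffixes_split("something(c, d)")   # => ["c", "d"]
p suffixes_split("something(a,b,c)")  # => ["a", "b", "c"]
p suffixes_split("no parentheses")    # => []
```

Unlike the fixed two-group regexps, this handles any number of comma-separated values.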

\((\w+)|(?!^)\G\s*,\s*(\w+)
Try this. It will work for any number of arguments. See the demo:
https://regex101.com/r/vN3sH3/27
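For reference, a sketch of how that pattern could be used from Ruby. The constant ARGS and the method name suffixes_any are hypothetical. Each match fills only one of the two capture groups, so the result needs compact after flatten.

```ruby
# \( (\w+) picks up the first argument; \G continues from where the
# previous match ended, so each later ", word" is picked up in turn.
ARGS = /\((\w+)|(?!^)\G\s*,\s*(\w+)/

def suffixes_any(t)
  t.scan(ARGS).flatten.compact
end

p suffixes_any("something(a,b)")      # => ["a", "b"]
p suffixes_any("something(c, d, e)")  # => ["c", "d", "e"]
```

Keep in mind that this pattern never checks for the closing parenthesis.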

Related

Regexp to match repeated substring

I would like to verify a string containing repeated substrings. The substrings have a particular structure. Whole string has a particular structure (substring split by "|"). For instance, the string can be:
1=23.00|6=22.12|12=21.34|112=20.34
1=23.00|6=22.12|12=21.34
1=23.00|12=21.34
1=23.00**
How can I check that all repeated substrings match a regexp? I tried to check it with:
"1=23.00|6=22.12|12=21.34".match(/([1-9][0-9]*[=][0-9\.]+)+/)
But checking gives true even when several substrings do not match the regexp:
"1=23.00|6=ass|=21.34".match(/([1-9][0-9]*[=][0-9\.]+)+/)
# => #<MatchData "1=23.00" 1:"1=23.00">

The question is whether every repeated substring matches a regex. I understand that the substrings are separated by the character | or $/, the latter being the end of a line. We first need to obtain the repeated substrings:
a = str.split(/[#{$/}\|]/)
  .map(&:strip)
  .group_by {|s| s}
  .select {|_,v| v.size > 1 }
  .keys
Next we specify whatever regex you wish to use. I am assuming it is this:
REGEX = /[1-9][0-9]*=[1-9]+\.[0-9]+/
but it could be altered if you have other requirements.
As we wish to determine if all repeated substrings match the regex, that is simply:
a.all? {|s| s =~ REGEX}
Here are the calculations:
str =<<_
1=23.00|6=22.12|12=21.34|112=20.34
1=23.00|6=22.12|12=21.34
1=23.00|12=21.34
1=23.00**
_
c = str.split(/[#{$/}\|]/)
#=> ["1=23.00", "6=22.12", "12=21.34", "112=20.34", "1=23.00",
# "6=22.12", "12=21.34", "1=23.00", "12=21.34", "1=23.00**"]
d = c.map(&:strip)
# same as c, possibly not needed or not wanted
e = d.group_by {|s| s}
# => {"1=23.00" =>["1=23.00", "1=23.00", "1=23.00"],
# "6=22.12" =>["6=22.12", "6=22.12"],
# "12=21.34" =>["12=21.34", "12=21.34", "12=21.34"],
# "112=20.34"=>["112=20.34"], "1=23.00**"=>["1=23.00**"]}
f = e.select {|_,v| v.size > 1 }
#=> {"1=23.00"=>["1=23.00", "1=23.00" , "1=23.00"],
# "6=22.12"=>["6=22.12", "6=22.12"],
# "12=21.34"=>["12=21.34", "12=21.34", "12=21.34"]}
a = f.keys
#=> ["1=23.00", "6=22.12", "12=21.34"]
a.all? {|s| s =~ REGEX}
#=> true
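On newer Rubies the group_by/select steps can be written with Enumerable#tally. This is only a sketch of the same idea, assuming Ruby 2.7 or later; the variable name repeated is mine.

```ruby
REGEX = /[1-9][0-9]*=[1-9]+\.[0-9]+/

str = <<~TEXT
  1=23.00|6=22.12|12=21.34|112=20.34
  1=23.00|6=22.12|12=21.34
  1=23.00|12=21.34
  1=23.00**
TEXT

# Count each substring, keep the ones seen more than once.
repeated = str.split(/[\n|]/).map(&:strip).tally.select { |_, n| n > 1 }.keys

p repeated                               # => ["1=23.00", "6=22.12", "12=21.34"]
p repeated.all? { |s| s.match?(REGEX) }  # => true
```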

This will return true if there are any duplicates, false if there are not:
s = "1=23.00|6=22.12|12=21.34|112=20.34|3=23.00"
arr = s.split(/\|/).map { |s| s.gsub(/\d=/, "") }
arr != arr.uniq # => true

If you want to solve it with a regexp alone (not Ruby code), you should match the whole string, not substrings. I added the | symbol and an end-of-line anchor to your regexp, and it should work the way you want.
([1-9][0-9]*[=][0-9\.]+[|]*)+$
Try it out.
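The pattern above anchors only the end of the string. One possible tightening, as a sketch: anchor both ends and require a | between items. The names ITEM, WHOLE and valid_list? are hypothetical, and the stricter digits.digits value format is my assumption about the data.

```ruby
ITEM  = /[1-9][0-9]*=[0-9]+\.[0-9]+/   # one "key=value" item (assumed format)
WHOLE = /\A#{ITEM}(?:\|#{ITEM})*\z/     # items separated by "|", nothing else

def valid_list?(str)
  WHOLE.match?(str)
end

p valid_list?("1=23.00|6=22.12|12=21.34")  # => true
p valid_list?("1=23.00|6=ass|=21.34")      # => false
p valid_list?("1=23.00")                   # => true
```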

How to find word with the greatest number of repeated letters

My goal is to find the word with greatest number of repeated letters in a given string. For example, "aabcc ddeeteefef iijjfff" would return "ddeeteefef" because "e" is repeated five times in this word and that is more than all other repeating characters.
So far this is what I got, but it has many problems and is not complete:
def LetterCountI(str)
  s = str.split(" ")
  i = 0
  result = []
  t = s[i].scan(/((.)\2+)/).map(&:max)
  u = t.max { |a, b| a.length <=> b.length }
  return u.split(//).count
end
The code I have only finds consecutive patterns; if the pattern is interrupted (such as with "aabaaa"), it counts a three times instead of five.

str.scan(/\w+/).max_by{ |w| w.chars.group_by(&:to_s).values.map(&:size).max }
scan(/\w+/) — create an array of all sequences of 'word' characters
max_by{ … } — find the word that gives the largest value inside this block
chars — split the string into characters
group_by(&:to_s) — create a hash mapping each character to an array of all the occurrences
values — just get all the arrays of the occurrences
map(&:size) — convert each array to the number of characters in that array
max — find the largest count and use this as the result for max_by to examine
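To make the steps above concrete, here is what the intermediate values look like for a single word from the question (the variable names are mine):

```ruby
word   = "ddeeteefef"
groups = word.chars.group_by(&:to_s)  # character => array of its occurrences
counts = groups.values.map(&:size)    # how often each character occurs

p groups.keys  # => ["d", "e", "t", "f"]
p counts       # => [2, 5, 1, 2]
p counts.max   # => 5
```

max_by then compares that final number across all the words.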
Edit: Written less compactly:
str.scan(/\w+/).max_by do |word|
  word.chars
    .group_by{ |char| char }
    .map{ |char,array| array.size }
    .max
end
Written less functionally and with less Ruby-isms (to make it look more like "other" languages):
words_by_most_repeated = []
str.split(" ").each do |word|
  count_by_char = {} # hash mapping character to count of occurrences
  word.chars.each do |char|
    count_by_char[ char ] = 0 unless count_by_char[ char ]
    count_by_char[ char ] += 1
  end
  maximum_count = 0
  count_by_char.each do |char,count|
    if count > maximum_count then
      maximum_count = count
    end
  end
  words_by_most_repeated[ maximum_count ] = word
end
most_repeated = words_by_most_repeated.last

I'd do as below :
s = "aabcc ddeeteefef iijjfff"
# intermediate calculation that's happening in the final code
s.split(" ").map { |w| w.chars.max_by { |e| w.count(e) } }
# => ["a", "e", "f"] # getting the max count character from each word
s.split(" ").map { |w| w.count(w.chars.max_by { |e| w.count(e) }) }
# => [2, 5, 3] # getting the max count character's count from each word
# final code
s.split(" ").max_by { |w| w.count(w.chars.max_by { |e| w.count(e) }) }
# => "ddeeteefef"
Update:
each_with_object performs better than the group_by approach.
require 'benchmark'
s = "aabcc ddeeteefef iijjfff"
def phrogz(s)
  s.scan(/\w+/).max_by{ |word| word.chars.group_by(&:to_s).values.map(&:size).max }
end
def arup_v1(s)
  max_string = s.split.max_by do |w|
    h = w.chars.each_with_object(Hash.new(0)) do |e,hsh|
      hsh[e] += 1
    end
    h.values.max
  end
end
def arup_v2(s)
  s.split.max_by { |w| w.count(w.chars.max_by { |e| w.count(e) }) }
end
n = 100_000
Benchmark.bm do |x|
  x.report("Phrogz:") { n.times {|i| phrogz s } }
  x.report("arup_v2:"){ n.times {|i| arup_v2 s } }
  x.report("arup_v1:"){ n.times {|i| arup_v1 s } }
end
output
               user     system      total        real
Phrogz:    1.981000   0.000000   1.981000 (  1.979198)
arup_v2:   0.874000   0.000000   0.874000 (  0.878088)
arup_v1:   1.684000   0.000000   1.684000 (  1.685168)

Similar to sawa's answer:
"aabcc ddeeteefef iijjfff".split.max_by{|w| w.length - w.chars.uniq.length}
=> "ddeeteefef"
In Ruby 2.x, this works as-is because String#chars returns an array. In earlier versions of ruby, String#chars yields an enumerator so you need to add .to_a before applying uniq. I did my testing in Ruby 2.0, and overlooked this until it was pointed out by Stephens.
I believe this is valid, since the question was "greatest number of repeated letters in a given string" rather than greatest number of repeats for a single letter in a given string.
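The two readings of the question can pick different words. A small sketch with a made-up input (the variable names are mine):

```ruby
s = "aaaa aabbccdd"  # made-up input where the two readings disagree

# Total number of repeated letters: word length minus distinct letters.
by_total_repeats = s.split.max_by { |w| w.length - w.chars.uniq.length }

# Highest count of any single letter.
by_single_letter = s.split.max_by { |w| w.chars.group_by(&:to_s).values.map(&:size).max }

p by_total_repeats  # => "aabbccdd"
p by_single_letter  # => "aaaa"
```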

"aabcc ddeeteefef iijjfff"
.split.max_by{|w| w.chars.sort.chunk{|e| e}.map{|e| e.last.length}.max}
# => "ddeeteefef"

Check the characters inside two strings are the same in Ruby

I have two strings, a and b, in Ruby.
a="scar"
b="cars"
What is the easiest way in Ruby to find whether a and b contain the same characters?
UPDATE
I am building an anagram game, so scar is an anagram of cars. I want a way to compare a and b and conclude that they are anagrams.
So c="carcass" should not be a match.

You could do like this:
a = 'scar'
b = 'cars'
a.chars.sort == b.chars.sort
# => true
a = 'cars'
b = 'carcass'
a.chars.sort == b.chars.sort
# => false

Just to benchmark the array, string, and delete comparisons, assuming the strings being compared have equal length.
In a real anagram search you would sort the first word a only once and then compare it to a bunch of b's.
a="scar"
b="cars"
require 'benchmark'
n = 1000000
Benchmark.bm do |x|
x.report('string') { a = a.chars.sort.join; n.times do ; a == b.chars.sort.join ; end }
x.report('arrays') { a = a.chars.sort; n.times do ; a == b.chars.sort ; end }
end
The result:
user system total real
string 6.030000 0.010000 6.040000 ( 6.061088)
arrays 6.420000 0.010000 6.430000 ( 6.473158)
But, if you sort a each time (for delete we don't need to sort any word):
x.report('string') { n.times do ; a.chars.sort.join == b.chars.sort.join ; end }
x.report('arrays') { n.times do ; a.chars.sort == b.chars.sort ; end }
x.report('delete') { n.times do ; a.delete(b).empty? ; end }
The result is:
user system total real
string 11.800000 0.020000 11.820000 ( 11.989071)
arrays 11.210000 0.020000 11.230000 ( 11.263627)
delete 1.680000 0.000000 1.680000 ( 1.673979)

What is the easiest way in Ruby to find whether a and b contain the same characters?
As per the definition of an anagram, the code below should work:
a="scar"
b="cars"
a.size == b.size && a.delete(b).empty?
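One caveat with the size-plus-delete check: String#delete removes every occurrence of the given characters, so it does not compare how often each letter appears. A sketch of a counter-example, together with a count-based alternative; the helper name anagram? is hypothetical and Enumerable#tally assumes Ruby 2.7 or later.

```ruby
# Hypothetical helper: compares per-letter counts instead of letter sets.
def anagram?(a, b)
  a.size == b.size && a.chars.tally == b.chars.tally
end

a, b = "aab", "abb"                       # same length, same letters, different counts
p a.size == b.size && a.delete(b).empty?  # => true, although these are not anagrams
p anagram?(a, b)                          # => false
p anagram?("scar", "cars")                # => true
p anagram?("cars", "carcass")             # => false
```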

require 'set'
Set.new(a.chars) == Set.new(b.chars)
updated to take into account comment from sawa

Convert Input Value to Integer or Float, as Appropriate Using Ruby

I believe I have a good answer to this issue, but I wanted to make sure ruby-philes didn't have a much better way to do this.
Basically, given an input string, I would like to convert the string to an integer, where appropriate, or a float, where appropriate. Otherwise, just return the string.
I'll post my answer below, but I'd like to know if there is a better way out there.
Ex:
to_f_or_i_or_s("0523.49") #=> 523.49
to_f_or_i_or_s("0000029") #=> 29
to_f_or_i_or_s("kittens") #=> "kittens"

I would avoid using regex whenever possible in Ruby. It's notoriously slow.
def to_f_or_i_or_s(v)
  ((float = Float(v)) && (float % 1.0 == 0) ? float.to_i : float) rescue v
end
# Proof of Ruby's slow regex
require 'benchmark'
def regex_float_detection(input)
  input.match('\.')
end
def math_float_detection(input)
  input % 1.0 == 0
end
n = 100_000
Benchmark.bm(30) do |x|
  x.report("Regex") { n.times { regex_float_detection("1.1") } }
  x.report("Math") { n.times { math_float_detection(1.1) } }
end
#                                      user     system      total        real
# Regex                            0.180000   0.000000   0.180000 (  0.181268)
# Math                             0.050000   0.000000   0.050000 (  0.048692)
A more comprehensive benchmark:
def wattsinabox(input)
  input.match('\.').nil? ? Integer(input) : Float(input) rescue input.to_s
end
def jaredonline(input)
  ((float = Float(input)) && (float % 1.0 == 0) ? float.to_i : float) rescue input
end
def muistooshort(input)
  case(input)
  when /\A\s*[+-]?\d+\.\d+\z/
    input.to_f
  when /\A\s*[+-]?\d+(\.\d+)?[eE]\d+\z/
    input.to_f
  when /\A\s*[+-]?\d+\z/
    input.to_i
  else
    input
  end
end
n = 1_000_000
Benchmark.bm(30) do |x|
  x.report("wattsinabox") { n.times { wattsinabox("1.1") } }
  x.report("jaredonline") { n.times { jaredonline("1.1") } }
  x.report("muistooshort") { n.times { muistooshort("1.1") } }
end
#                                      user     system      total        real
# wattsinabox                      3.600000   0.020000   3.620000 (  3.647055)
# jaredonline                      1.400000   0.000000   1.400000 (  1.413660)
# muistooshort                     2.790000   0.010000   2.800000 (  2.803939)

def to_f_or_i_or_s(v)
  v.match('\.').nil? ? Integer(v) : Float(v) rescue v.to_s
end

A pile of regexes might be a good idea if you want to handle numbers in scientific notation (which String#to_f does):
def to_f_or_i_or_s(v)
  case(v)
  when /\A\s*[+-]?\d+\.\d+\z/
    v.to_f
  when /\A\s*[+-]?\d+(\.\d+)?[eE]\d+\z/
    v.to_f
  when /\A\s*[+-]?\d+\z/
    v.to_i
  else
    v
  end
end
You could mash both to_f cases into one regex if you wanted.
This will, of course, fail when fed '3,14159' in a locale that uses a comma as a decimal separator.
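For what it's worth, mashing the two to_f cases together could look roughly like this. It is a sketch only; the constant names FLOAT_RE and INT_RE are mine, and the merged pattern is meant to accept the same strings as the two separate float patterns above.

```ruby
# Digits followed by either ".digits" (with an optional exponent) or a bare exponent.
FLOAT_RE = /\A\s*[+-]?\d+(?:\.\d+(?:[eE]\d+)?|[eE]\d+)\z/
INT_RE   = /\A\s*[+-]?\d+\z/

def to_f_or_i_or_s(v)
  case v
  when FLOAT_RE then v.to_f
  when INT_RE   then v.to_i
  else v
  end
end

p to_f_or_i_or_s("0523.49")  # => 523.49
p to_f_or_i_or_s("0000029")  # => 29
p to_f_or_i_or_s("1e3")      # => 1000.0
p to_f_or_i_or_s("kittens")  # => "kittens"
```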

Depends on security requirements.
def to_f_or_i_or_s s
  eval(s) rescue s
end

I used this method
def to_f_or_i_or_s(value)
  return value if value[/[a-zA-Z]/]
  i = value.to_i
  f = value.to_f
  i == f ? i : f
end

CSV has converters which do this.
require "csv"
strings = ["0523.49", "29","kittens"]
strings.each{|s|p s.parse_csv(converters: :numeric).first}
#523.49
#29
#"kittens"
However for some reason it converts "00029" to a float.

Find consecutive substring indexes

Given a search string and a result string (which is guaranteed to contain all letters of the search string, case-insensitive, in order), how can I most efficiently get an array of ranges representing the indices in the result string corresponding to the letters in the search string?
Desired output:
substrings( "word", "Microsoft Office Word 2007" )
#=> [ 17..20 ]
substrings( "word", "Network Setup Wizard" )
#=> [ 3..5, 19..19 ]
#=> [ 3..4, 18..19 ] # Alternative, acceptable, less-desirable output
substrings( "word", "Watch Network Daemon" )
#=> [ 0..0, 10..11, 14..14 ]
This is for an autocomplete search box. Here's a screenshot from a tool similar to Quicksilver that underlines letters as I'm looking to do. Note that--unlike my ideal output above--this screenshot does not prefer longer single matches.
Benchmark Results
Benchmarking the current working results shows that @tokland's regex-based answer is basically as fast as the StringScanner-based solutions I put forth, with less code:
              user     system      total        real
phrogz1   0.889000   0.062000   0.951000 (  0.944000)
phrogz2   0.920000   0.047000   0.967000 (  0.977000)
tokland   1.030000   0.000000   1.030000 (  1.035000)
Here is the benchmark test:
a=["Microsoft Office Word 2007","Network Setup Wizard","Watch Network Daemon"]
b=["FooBar","Foo Bar","For the Love of Big Cars"]
test = { a=>%w[ w wo wor word ], b=>%w[ f fo foo foobar fb fbr ] }
require 'benchmark'
Benchmark.bmbm do |x|
  %w[ phrogz1 phrogz2 tokland ].each{ |method|
    x.report(method){ test.each{ |words,terms|
      words.each{ |master| terms.each{ |term|
        2000.times{ send(method,term,master) }
      } }
    } }
  }
end

To have something to start with, how about that?
>> s = "word"
>> re = /#{s.chars.map{|c| "(#{c})" }.join(".*?")}/i # /(w).*?(o).*?(r).*?(d)/i
>> match = "Watch Network Daemon".match(re)
=> #<MatchData "Watch Network D" 1:"W" 2:"o" 3:"r" 4:"D">
>> 1.upto(s.length).map { |idx| match.begin(idx) }
=> [0, 10, 11, 14]
And now you only have to build the ranges (if you really need them, I guess the individual indexes are also ok).
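If the ranges are needed, one way to build them from the index list is to group consecutive indexes with Enumerable#slice_when (Ruby 2.2 or later). A minimal sketch; the helper name to_ranges is hypothetical.

```ruby
# Hypothetical helper: turns a sorted list of indexes into ranges of consecutive runs.
def to_ranges(indexes)
  indexes
    .slice_when { |a, b| b != a + 1 }     # start a new run at every gap
    .map { |run| run.first..run.last }    # each run becomes one range
end

p to_ranges([0, 10, 11, 14])    # => [0..0, 10..11, 14..14]
p to_ranges([17, 18, 19, 20])   # => [17..20]
```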

Ruby's Abbrev module is a good starting point. It breaks down a string into a hash consisting of the unique keys that can identify the full word:
require 'abbrev'
require 'pp'
abbr = Abbrev::abbrev(['ruby'])
>> {"rub"=>"ruby", "ru"=>"ruby", "r"=>"ruby", "ruby"=>"ruby"}
For every keypress you can do a lookup and see if there's a match. I'd filter out all keys shorter than a certain length, to reduce the size of the hash.
The keys will also give you a quick set of words to look up the subword matches in your original string.
For fast lookups to see if there's a substring match:
regexps = Regexp.union(
abbr.keys.sort.reverse.map{ |k|
Regexp.new(
Regexp.escape(k),
Regexp::IGNORECASE
)
}
)
Note that it's escaping the patterns, which would allow characters to be entered, such as ?, * or ., and be treated as literals, instead of special characters for regex, like they would normally be treated.
The result looks like:
/(?i-mx:ruby)|(?i-mx:rub)|(?i-mx:ru)|(?i-mx:r)/
Regexp's match will return information about what was found.
Because the union "ORs" the patterns, it will only find the first match, which will be the shortest occurrence in the string. To fix that reverse the sort.
That should give you a good start on what you want to do.
EDIT: Here's some code to directly answer the question. We've been busy at work, so it's taken a couple of days to get back to this:
require 'abbrev'
require 'pp'
abbr = Abbrev::abbrev(['ruby'])
regexps = Regexp.union( abbr.keys.sort.reverse.map{ |k| Regexp.new( Regexp.escape(k), Regexp::IGNORECASE ) } )
target_str ='Ruby rocks, rub-a-dub-dub, RU there?'
str_offset = 0
offsets = []
loop do
match_results = regexps.match(target_str, str_offset)
break if (match_results.nil?)
s, e = match_results.offset(0)
offsets << [s, e - s]
str_offset = 1 + s
end
pp offsets
>> [[0, 4], [5, 1], [12, 3], [27, 2], [33, 1]]
If you want ranges replace offsets << [s, e - s] with offsets << [s .. e] which will return:
>> [[0..4], [5..6], [12..15], [27..29], [33..34]]

Here's a late entrant that's making a move as it nears the finish line.
code
def substrings( search_str, result_str )
  search_chars = search_str.downcase.chars
  next_char = search_chars.shift
  result_str.downcase.each_char.with_index.take_while.with_object([]) do |(c,i),a|
    if next_char == c
      (a.empty? || i != a.last.last+1) ? a << (i..i) : a[-1]=(a.last.first..i)
      next_char = search_chars.shift
    end
    next_char
  end
end
demo
substrings( "word", "Microsoft Office Word 2007" ) #=> [17..20]
substrings( "word", "Network Setup Wizard" ) #=> [3..5, 19..19]
substrings( "word", "Watch Network Daemon" ) #=> [0..0, 10..11, 14..14]
benchmark
user system total real
phrogz1 1.120000 0.000000 1.120000 ( 1.123083)
cary 0.550000 0.000000 0.550000 ( 0.550728)

I don't think there are any built-in methods that will really help with this. Probably the best way is to go through each letter in the word you're searching for and build up the ranges manually. Your next best option would probably be to build a regex like in @tokland's answer.

Here's my implementation:
require 'strscan'
def substrings( search, master )
  [].tap do |ranges|
    scan = StringScanner.new(master)
    init = nil
    last = nil
    prev = nil
    search.chars.map do |c|
      return nil unless scan.scan_until /#{c}/i
      last = scan.pos-1
      if !init || (last-prev) > 1
        ranges << (init..prev) if init
        init = last
      end
      prev = last
    end
    ranges << (init..last)
  end
end
And here's a shorter version using another utility method (also needed by @tokland's answer):
require 'strscan'
def substrings( search, master )
  s = StringScanner.new(master)
  search.chars.map do |c|
    return nil unless s.scan_until(/#{c}/i)
    s.pos - 1
  end.to_ranges
end
class Array
  def to_ranges
    return [] if empty?
    [].tap do |ranges|
      init,last = first
      each do |o|
        if last && o != last.succ
          ranges << (init..last)
          init = o
        end
        last = o
      end
      ranges << (init..last)
    end
  end
end
