Ruby get strings from array which contain substring - ruby

I've got an array of strings. A few of the strings in this array contain a certain substring I'm looking for. I want to get an array of those strings containing the substring.
I would hope to do it like this:
a = ["abc", "def", "ghi"]
o.select(&:include?("c"))
But that gives me this error:
(repl):2: syntax error, unexpected ')', expecting end-of-input
o.select(&:include?("c"))
^

If your array were a file lines.txt:
abc
def
ghi
Then you would select the lines containing c with the grep command-line utility:
$ grep c lines.txt
abc
Ruby has adopted this as Enumerable#grep. You can pass a regular expression as the pattern and it returns the strings matching this pattern:
['abc', 'def', 'ghi'].grep(/c/)
#=> ["abc"]
More specifically, the result array contains all elements for which pattern === element is true:
/c/ === 'abc' #=> true
/c/ === 'def' #=> false
/c/ === 'ghi' #=> false
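As a hedged illustration of the === rule above (the sample arrays are made up for this sketch), grep accepts any pattern object that implements ===, not only regular expressions:

```ruby
words = ['abc', 'def', 'ghi']

# Regexp#=== tests for a match, so grep keeps the matching strings.
with_c = words.grep(/c/)                       #=> ["abc"]

# grep_v (Ruby 2.3+) keeps the elements that do NOT match.
without_c = words.grep_v(/c/)                  #=> ["def", "ghi"]

# Class#=== tests "is this an instance?", so a class works as a pattern too.
only_strings = ['abc', 3, 'def', 12].grep(String)   #=> ["abc", "def"]

# Range#=== tests membership.
small_numbers = [3, 12, 4].grep(1..5)          #=> [3, 4]
```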

You can use the &-shorthand here. It's rather irrational (don't do this), but possible.
If you do manage to find an object and a method so you can make checks in your select like so:
o.select { |e| some_object.some_method(e) }
(the important part is that some_object and some_method need to be the same in all iterations)
...then you can use Object#method to get a block like that. It returns something that implements to_proc (a requirement for &-shorthand) and that proc, when called, calls some_method on some_object, forwarding its arguments to it. Kinda like:
o.m(a, b, c) # <=> o.method(:m).to_proc.call(a, b, c)
Here's how you use this with the &-shorthand:
collection.select(&some_object.method(:some_method))
In this particular case, /c/ and its method =~ do the job:
["abc", "def", "ghi"].select(&/c/.method(:=~))
Kinda verbose, readability is relatively bad.
Once again, don't do this here. But the trick can be helpful in other situations, particularly where the proc is passed in from the outside.
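A minimal sketch of such a situation, assuming a hypothetical helper named pick whose caller decides how strings are matched. Method objects, lambdas and symbols all go through the same & interface:

```ruby
# Hypothetical helper: the matcher is passed in from the outside.
def pick(strings, matcher)
  # & calls matcher.to_proc, which Proc, Method and Symbol objects all provide.
  strings.select(&matcher)
end

words = %w[abc def ghi]

# A Method object bound to the regexp /c/ (Regexp#match? needs Ruby 2.4+).
pick(words, /c/.method(:match?))             #=> ["abc"]

# A lambda works through the same interface.
pick(words, ->(s) { s.start_with?("d") })    #=> ["def"]

# Method references also help with Kernel methods such as Integer().
%w[1 2 3].map(&method(:Integer))             #=> [1, 2, 3]
```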
Note: you may have heard of this shorthand syntax in a pre-release of Ruby 2.7, which was, unfortunately, reverted and didn't make it to 2.7:
["abc", "def", "ghi"].select(&/c/.:=~)

You are almost there, but you cannot pass an argument with the &: shorthand. You can do something like:
o.select{ |e| e.include? 'c' }

Related

How do I make these string substitutions using hash keys in Ruby?

I have a bunch of JSON files, processed in both Python and Ruby, that look something like this:
{
"KEY1": "foo",
"KEY2": "bar",
"URL": "https://{KEY2}.com/{KEY1}",
"IMPORTANT_THING": "repos/{KEY1}",
"NOTE": "This thing is {KEY1}{KEY2}ed",
"PYTHON_ONLY_THING": "{}/test/{}.py"
}
Note that the order that the keys will show up is not consistent, and I'd rather not change the JSON.
Here's my test code showing what I've tried so far:
my_config = {"KEY1"=>"foo",
"KEY2"=>"bar",
"URL"=>"https://{KEY2}.com/{KEY1}",
"IMPORTANT_THING"=>"repos/{KEY1}",
"NOTE"=>"This thing is {KEY1}{KEY2}ed",
"PYTHON_ONLY_THING"=>"{}/test/{}.py"}
my_config.each_key do |key|
# Braindead, hard-coded solution that works:
# my_config[key].gsub!("{KEY1}", my_config["KEY1"])
# my_config[key].gsub!("{KEY2}", my_config["KEY2"])
# More flexible (if it would work):
# my_config[key].gsub!(/{.*}/, my_config['\0'.slice(1,-2)])
my_config[key].gsub!(/{.*}/) {|s| my_config[s.slice(1,-2)]}
end
puts my_config
I'm using the braindead solution for now, which produces the expected output:
{"KEY1"=>"foo", "KEY2"=>"bar", "URL"=>"https://bar.com/foo", "IMPORTANT_THING"=>"repos/foo", "NOTE"=>"This thing is foobared", "PYTHON_ONLY_THING"=>"{}/test/{}.py"}
But I want to make it more flexible and maintainable. The first "better" solution throws an error apparently because slice operates on '\0' itself and not the match, plus I'm not sure it would match more than once. The currently uncommented solution doesn't work because the second part seems to operate on one letter at a time rather than each match like I expected, so it just removes the stuff in curly braces. Worse, it removes everything between the outer braces in the PYTHON_ONLY_THING, which is no good.
I figure I need to change both my regex and Ruby code if this is going to work, but I'm not sure where to look for more help. Or perhaps gsub isn't the right tool for this job. Any ideas?
I am using Ruby 2.3.7 on Linux x86_64.
Use String#gsub with an initial hash for replacements:
my_config.map do |k, v|
[
k,
v.gsub(/(?<={)[^}]+(?=})/, my_config).gsub(/{(?!})|(?<!{)}/, '')
]
end.to_h
#⇒ {"KEY1"=>"foo",
# "KEY2"=>"bar",
# "URL"=>"https://bar.com/foo",
# "IMPORTANT_THING"=>"repos/foo",
# "NOTE"=>"This thing is foobared",
# "PYTHON_ONLY_THING"=>"{}/test/{}.py"}
Starting with Ruby 2.4 (or using Rails) it might be done simpler using Hash#transform_values.
If you dislike the second gsubbing, transform the hash upfront:
my_substs = my_config.map { |k, v| ["{#{k}}", v] }.to_h
my_config.map do |k, v|
[k, v.gsub(/{[^}]+}/, my_substs)]
end.to_h
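For Ruby 2.4 or later, the same idea might look like this with Hash#transform_values. This is only a sketch: expand_placeholders is a hypothetical name, the "OTHER" entry is made up to show an unknown placeholder, and it assumes every value in the hash is a String.

```ruby
# Hypothetical helper; requires Ruby 2.4+ for Hash#transform_values.
def expand_placeholders(config)
  # Map "{KEY}" => value for every key in the config.
  substitutions = config.map { |key, value| ["{#{key}}", value] }.to_h

  config.transform_values do |value|
    # Unknown placeholders are left untouched instead of being replaced with "".
    value.gsub(/\{[^}]+\}/) { |placeholder| substitutions.fetch(placeholder, placeholder) }
  end
end

my_config = {
  "KEY1" => "foo",
  "KEY2" => "bar",
  "URL" => "https://{KEY2}.com/{KEY1}",
  "PYTHON_ONLY_THING" => "{}/test/{}.py",
  "OTHER" => "{UNKNOWN}/{KEY1}" # made-up entry with an unknown placeholder
}

expanded = expand_placeholders(my_config)
expanded["URL"]               #=> "https://bar.com/foo"
expanded["PYTHON_ONLY_THING"] #=> "{}/test/{}.py"
expanded["OTHER"]             #=> "{UNKNOWN}/foo"
```

Using the block form of gsub with fetch is a design choice here: passing the hash directly, as above, replaces unknown placeholders with an empty string.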
Here's a possible solution:
my_config = {"KEY1"=>"foo",
"KEY2"=>"bar",
"URL"=>"https://{KEY2}.com/{KEY1}",
"IMPORTANT_THING"=>"repos/{KEY1}",
"NOTE"=>"This thing is {KEY1}{KEY2}ed",
"PYTHON_ONLY_THING"=>"{}/test/{}.py"}
my_config.each_key do |key|
placeholders = my_config[key].scan(/{([^}]+)}/).flatten
placeholders.each do |placeholder|
my_config[key].gsub!("{#{placeholder}}", my_config[placeholder]) if my_config.keys.include?(placeholder)
end
end
puts my_config
By using scan, this will substitute all matches, not just the first match.
Using [^}]+ in the regex, rather than .*, means you won't "swallow" too much in this part of the match. For example, if the input contains "{FOO} bar {BAZ}", then you want that pattern to only capture FOO and BAZ, not FOO} bar {BAZ.
Grouping the scan result, then calling flatten, is an easy way to reject what's outside the capture group, i.e. in this case the { and } characters. (This just makes the code a little less cryptic than using indexes like slice(1,-2)!)
my_config.keys.include?(placeholder) checks whether this is actually a known value, so you don't replace things with nil.
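The difference between the two patterns can be checked directly. A small sketch using the example input from the explanation above:

```ruby
text = "{FOO} bar {BAZ}"

# Greedy .* runs to the last closing brace and swallows everything in between.
greedy = text.scan(/\{(.*)\}/).flatten          #=> ["FOO} bar {BAZ"]

# [^}]+ stops at the first closing brace, so each placeholder is captured separately.
bounded = text.scan(/\{([^}]+)\}/).flatten      #=> ["FOO", "BAZ"]

# Empty braces contain no characters, so they are not treated as placeholders.
empty = "{}/test/{}.py".scan(/\{([^}]+)\}/).flatten   #=> []
```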

Regular expression don't match with g flag [duplicate]

Is there a quick way to find every match of a regular expression in Ruby? I've looked through the Regex object in the Ruby STL and searched on Google to no avail.
Using scan should do the trick:
string.scan(/regex/)
To find all the matching strings, use String's scan method.
str = "A 54mpl3 string w1th 7 numb3rs scatter36 ar0und"
str.scan(/\d+/)
#=> ["54", "3", "1", "7", "3", "36", "0"]
If you want MatchData, which is the type of the object returned by the Regexp match method, use:
str.to_enum(:scan, /\d+/).map { Regexp.last_match }
#=> [#<MatchData "54">, #<MatchData "3">, #<MatchData "1">, #<MatchData "7">, #<MatchData "3">, #<MatchData "36">, #<MatchData "0">]
The benefit of using MatchData is that you can use methods like offset:
match_datas = str.to_enum(:scan, /\d+/).map { Regexp.last_match }
match_datas[0].offset(0)
#=> [2, 4]
match_datas[1].offset(0)
#=> [7, 8]
See these questions if you'd like to know more:
"How do I get the match data for all occurrences of a Ruby regular expression in a string?"
"Ruby regular expression matching enumerator with named capture support"
"How to find out the starting point for each match in ruby"
Reading about special variables $&, $', $1, $2 in Ruby will be helpful too.
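As a rough illustration of those special variables (describe_match is a hypothetical helper and the sample string is made up), each one mirrors a part of the last MatchData:

```ruby
# Hypothetical helper that snapshots the special variables after a match.
def describe_match(text, pattern)
  return nil unless text =~ pattern   # =~ sets the special variables on success

  {
    whole: $&,            # the matched text, same as $~[0]
    first_group: $1,      # first capture group, same as $~[1]
    second_group: $2,     # second capture group, same as $~[2]
    before: $`,           # text before the match, same as $~.pre_match
    after: $',            # text after the match, same as $~.post_match
    offset: $~.offset(0)  # [start, end] of the whole match
  }
end

info = describe_match("price: 42 USD today", /(\d+) (\w+)/)
info[:whole]        #=> "42 USD"
info[:first_group]  #=> "42"
info[:second_group] #=> "USD"
info[:before]       #=> "price: "
info[:after]        #=> " today"
info[:offset]       #=> [7, 13]
```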
If you have a regexp with groups:
str="A 54mpl3 string w1th 7 numbers scatter3r ar0und"
re=/(\d+)[m-t]/
you can use String's scan method to find matching groups:
str.scan re
#> [["54"], ["1"], ["3"]]
To find the matching pattern:
str.to_enum(:scan,re).map {$&}
#> ["54m", "1t", "3r"]
You can use string.scan(your_regex).flatten. If your regex contains groups, it will return the captures in a single flat array.
string = "A 54mpl3 string w1th 7 numbers scatter3r ar0und"
your_regex = /(\d+)[m-t]/
string.scan(your_regex).flatten
=> ["54", "1", "3"]
Regex can be a named group as well.
string = 'group_photo.jpg'
regex = /\A(?<name>.*)\.(?<ext>.*)\z/
string.scan(regex).flatten
=> ["group_photo", "jpg"]
You can also use gsub, it's just one more way if you want MatchData.
str.gsub(/\d/).map{ Regexp.last_match }
If you have capture groups () inside the regex for other purposes, the proposed solutions with String#scan and String#match are problematic:
String#scan only gets what is inside the capture groups;
String#match only gets the first match, discarding all the others;
String#matches (proposed function) gets all the matches.
In this case, we need a solution that matches the regex without considering the capture groups.
String#matches
With refinements you can monkey patch the String class and implement String#matches; the method is then available only inside the scope that activates the refinement with using. It is a well-contained way to monkey patch classes in Ruby.
Setup
/lib/refinements/string_matches.rb
# This module adds a String refinement to enable multiple String#match results
# 1. `String#scan` only gets what is inside the capture groups (inside the parens)
# 2. `String#match` only gets the first match
# 3. `String#matches` (proposed function) gets all the matches
module StringMatches
refine String do
def matches(regex)
scan(/(?<matching>#{regex})/).flatten
end
end
end
Used: named capture groups
Usage
rails c
> require 'refinements/string_matches'
> using StringMatches
> 'function(1, 2, 3) + function(4, 5, 6)'.matches(/function\((\d), (\d), (\d)\)/)
=> ["function(1, 2, 3)", "function(4, 5, 6)"]
> 'function(1, 2, 3) + function(4, 5, 6)'.scan(/function\((\d), (\d), (\d)\)/)
=> [["1", "2", "3"], ["4", "5", "6"]]
> 'function(1, 2, 3) + function(4, 5, 6)'.match(/function\((\d), (\d), (\d)\)/)[0]
=> "function(1, 2, 3)"
Return an array of MatchData objects
#scan is very limited--only returns a simple array of strings!
Far more powerful/flexible for us to get an array of MatchData objects.
I'll provide two approaches (using same logic), one using a PORO and one using a monkey patch:
PORO:
class MatchAll
def initialize(string, pattern)
raise ArgumentError, 'must pass a String' unless string.is_a?(String)
raise ArgumentError, 'must pass a Regexp pattern' unless pattern.is_a?(Regexp)
@string = string
@pattern = pattern
@matches = []
end
def match_all
recursive_match
end
private
# Collects each MatchData, resuming the search where the previous match ended.
def recursive_match(prev_match = nil)
index = prev_match.nil? ? 0 : prev_match.offset(0)[1]
matching_item = @string.match(@pattern, index)
# A plain nil check, so this also works without Rails' present?
return @matches if matching_item.nil?
@matches << matching_item
recursive_match(matching_item)
end
end
USAGE:
test_string = 'a green frog jumped on a green lilypad'
MatchAll.new(test_string, /green/).match_all
=> [#<MatchData "green">, #<MatchData "green">]
Monkey patch
I don't typically condone monkey-patching, but in this case:
we're doing it the right way by "quarantining" our patch into its own module
I prefer this approach because 'string'.match_all(/pattern/) is more intuitive (and looks a lot nicer) than MatchAll.new('string', /pattern/).match_all
module RubyCoreExtensions
module String
module MatchAll
def match_all(pattern)
raise ArgumentError, 'must pass a Regexp pattern' unless pattern.is_a?(Regexp)
recursive_match(pattern)
end
private
def recursive_match(pattern, matches = [], prev_match = nil)
index = prev_match.nil? ? 0 : prev_match.offset(0)[1]
matching_item = self.match(pattern, index)
return matches if matching_item.nil?
matches << matching_item
recursive_match(pattern, matches, matching_item)
end
end
end
end
I recommend creating a new file and putting the patch (assuming you're using Rails) there /lib/ruby_core_extensions/string/match_all.rb
To use our patch we need to make it available:
# within application.rb
require './lib/ruby_core_extensions/string/match_all.rb'
Then be sure to include it in the String class (you could put this wherever you want; but for example, right under the require statement we just wrote above. After you include it once, it will be available everywhere, even outside the class where you included it).
String.include RubyCoreExtensions::String::MatchAll
USAGE: And now when you use #match_all you get results like:
test_string = 'hello foo, what foo are you going to foo today?'
test_string.match_all /foo/
=> [#<MatchData "foo">, #<MatchData "foo">, #<MatchData "foo">]
test_string.match_all /hello/
=> [#<MatchData "hello">]
test_string.match_all /none/
=> []
I find this particularly useful when I want to match multiple occurrences, and then get useful information about each occurrence, such as which index the occurrence starts and ends (e.g. match.offset(0) => [first_index, last_index])
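A non-recursive variant, sketched under the assumption that plain Ruby (no Rails) is used; match_all_iterative is a hypothetical name. It relies on Regexp#match(string, pos) and steps forward by one character after an empty match, so that a pattern such as /x*/ cannot loop forever:

```ruby
# Hypothetical helper: returns every MatchData for pattern in string.
def match_all_iterative(string, pattern)
  matches = []
  position = 0
  while position <= string.length
    match = pattern.match(string, position)
    break unless match
    matches << match
    # Continue after this match; move one extra character after an empty match.
    position = match.end(0) == match.begin(0) ? match.end(0) + 1 : match.end(0)
  end
  matches
end

test_string = 'a green frog jumped on a green lilypad'
match_all_iterative(test_string, /green/).map { |m| m.offset(0) }
#=> [[2, 7], [25, 30]]
```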

Take an array and a letter as arguments and return a new array with words that contain that letter

I can run a search and find the element I want and can return those words with that letter. But when I start to put arguments in, it doesn't work. I tried select with include? and it throws an error saying, private method. This is my code, which returns what I am expecting:
my_array = ["wants", "need", 3, "the", "wait", "only", "share", 2]
def finding_method(source)
words_found = source.grep(/t/) #I just pick random letter
print words_found
end
puts finding_method(my_array)
# => ["wants", "the", "wait"]
I need to add the second argument, but it breaks:
def finding_method(source, x)
words_found = source.grep(/x/)
print words_found
end
puts finding_method(my_array, "t")
This doesn't work, (it returns an empty array because there isn't an 'x' in the array) so I don't know how to pass an argument. Maybe I'm using the wrong method to do what I'm after. I have to define 'x', but I'm not sure how to do that. Any help would be great.
Regular expressions support string interpolation just like strings.
/x/
looks for the character x.
/#{x}/
will first interpolate the value of the variable and produce /t/, which does what you want. Mostly.
Note that if you are trying to search for any text that might have any meaning in regular expression syntax (like . or *), you should escape it:
/#{Regexp.quote(x)}/
That's the correct answer for any situation where you are including literal strings in regular expression that you haven't built yourself specifically for the purpose of being a regular expression, i.e. 99% of cases where you're interpolating variables into regexps.
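Putting the pieces together, the method from the question might look like this. It is a sketch that keeps the question's names; Regexp.escape is the same method as Regexp.quote, and the second call uses a made-up array to show why escaping matters:

```ruby
def finding_method(source, letter)
  # Escaping makes characters such as "." or "*" match literally.
  source.grep(/#{Regexp.escape(letter)}/)
end

my_array = ["wants", "need", 3, "the", "wait", "only", "share", 2]

finding_method(my_array, "t")          #=> ["wants", "the", "wait"]

# Without escaping, "." would match any character and return both strings.
finding_method(["a.b", "axb"], ".")    #=> ["a.b"]
```

The integers in the array are skipped because Regexp#=== returns false for them.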

Why do parentheses affect hashes?

When I used respond_with and passed a literal hash, it gave me the error:
syntax error, unexpected tASSOC, expecting '}'
`respond_with {:status => "Not found"}`
However, when I enclosed the literal hash in parentheses like so:
respond_with({:status => "Not found"})
the function runs without a hitch. Why do the parentheses make a difference? Isn't a hash an enclosed call?
When calling a method, an opening curly brace directly after the method name is interpreted as the start of a block. This takes precedence over the interpretation as a hash. One way to work around the issue is to use parentheses to force the interpretation as a method argument. As an example, please note the difference in meaning of these two method calls:
# interpreted as a block
[:a, :b, :c].each { |x| puts x }
# interpreted as a hash
{:a => :b}.merge({:c => :d})
Another way is to drop the curly braces altogether, as you can omit them when the hash is the last argument of a method call. Ruby is "clever" enough to interpret everything that looks like a list of key-value pairs at the end of an argument list as a single hash. Please have a look at this example:
def foo(a, b)
puts a.inspect
puts b.inspect
end
foo "hello", :this => "is", :a => "hash"
# prints this:
# "hello"
# {:this=>"is", :a=>"hash"}
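Applied to the call from the question, both workarounds can be compared side by side. In this sketch respond is a hypothetical stand-in for a method like respond_with; it simply returns its argument:

```ruby
# Hypothetical stand-in for respond_with: returns whatever it receives.
def respond(payload)
  payload
end

# Braces inside parentheses are parsed as a hash literal.
with_parens = respond({ :status => "Not found" })

# Trailing key-value pairs without braces are collected into one hash.
without_braces = respond :status => "Not found"

with_parens == without_braces   #=> true

# respond { :status => "Not found" } would be parsed as a block
# and fail with a syntax error, as in the question.
```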

Regex with named capture groups getting all matches in Ruby

I have a string:
s="123--abc,123--abc,123--abc"
I tried using Ruby 1.9's new feature "named groups" to fetch all named group info:
/(?<number>\d*)--(?<chars>\s*)/
Is there an API like Python's findall which returns a matchdata collection? In this case I need to return two matches, because 123 and abc repeat twice. Each match data contains the details of each named capture, so I can use m['number'] to get the match value.
Named captures are suitable only for one matching result.
Ruby's analogue of findall is String#scan. You can either use scan result as an array, or pass a block to it:
irb> s = "123--abc,123--abc,123--abc"
=> "123--abc,123--abc,123--abc"
irb> s.scan(/(\d*)--([a-z]*)/)
=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
irb> s.scan(/(\d*)--([a-z]*)/) do |number, chars|
irb* p [number,chars]
irb> end
["123", "abc"]
["123", "abc"]
["123", "abc"]
=> "123--abc,123--abc,123--abc"
Chiming in super-late, but here's a simple way of replicating String#scan but getting the matchdata instead:
matches = []
foo.scan(regex){ matches << $~ }
matches now contains the MatchData objects that correspond to scanning the string.
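With named groups, the same technique can return one hash per match. A minimal sketch, assuming Ruby 2.4+ for MatchData#named_captures; named_scan is a hypothetical name, and the pattern uses [a-z]+ for the letters:

```ruby
# Hypothetical helper: one hash of named captures per match.
def named_scan(string, pattern)
  results = []
  # Inside the block, Regexp.last_match (same as $~) is the current MatchData.
  string.scan(pattern) { results << Regexp.last_match.named_captures }
  results
end

s = "123--abc,123--abc,123--abc"
rows = named_scan(s, /(?<number>\d+)--(?<chars>[a-z]+)/)

rows.length           #=> 3
rows.first["number"]  #=> "123"
rows.first["chars"]   #=> "abc"
```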
You can extract the used variables from the regexp using names method. So what I did is, I used regular scan method to get the matches, then zipped names and every match to create a Hash.
class String
def scan2(regexp)
names = regexp.names
scan(regexp).collect do |match|
Hash[names.zip(match)]
end
end
end
Usage:
>> "aaa http://www.google.com.tr aaa https://www.yahoo.com.tr ddd".scan2 /(?<url>(?<protocol>https?):\/\/[\S]+)/
=> [{"url"=>"http://www.google.com.tr", "protocol"=>"http"}, {"url"=>"https://www.yahoo.com.tr", "protocol"=>"https"}]
@Nakilon is correct in showing scan with a regex; however, you don't even need to venture into regex land if you don't want to:
s = "123--abc,123--abc,123--abc"
s.split(',')
#=> ["123--abc", "123--abc", "123--abc"]
s.split(',').inject([]) { |a,s| a << s.split('--'); a }
#=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
This returns an array of arrays, which is convenient if you have multiple occurrences and need to see/process them all.
s.split(',').inject({}) { |h,s| n,v = s.split('--'); h[n] = v; h }
#=> {"123"=>"abc"}
This returns a hash, which, because the elements have the same key, has only the unique key value. This is good when you have a bunch of duplicate keys but want the unique ones. Its downside occurs if you need the unique values associated with the keys, but that appears to be a different question.
If using ruby >=1.9 and the named captures, you could:
class String
def scan2(regexp2_str, placeholders = {})
return regexp2_str.to_re(placeholders).match(self)
end
def to_re(placeholders = {})
re2 = self.dup
separator = placeholders.delete(:SEPARATOR) || '' #Returns and removes separator if :SEPARATOR is set.
#Search for the pattern placeholders and replace them with the regex
placeholders.each do |placeholder, regex|
re2.sub!(separator + placeholder.to_s + separator, "(?<#{placeholder}>#{regex})")
end
return Regexp.new(re2, Regexp::MULTILINE) #Returns regex using named captures.
end
end
Usage (ruby >=1.9):
> "1234:Kalle".scan2("num4:name", num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
or
> re="num4:name".to_re(num4:'\d{4}', name:'\w+')
=> /(?<num4>\d{4}):(?<name>\w+)/m
> m=re.match("1234:Kalle")
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
> m[:num4]
=> "1234"
> m[:name]
=> "Kalle"
Using the separator option:
> "1234:Kalle".scan2("#num4#:#name#", SEPARATOR:'#', num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
I needed something similar recently. This should work like String#scan, but return an array of MatchData objects instead.
class String
# This method will return an array of MatchData's rather than the
# array of strings returned by the vanilla `scan`.
def match_all(regex)
match_str = self
match_datas = []
while match_str.length > 0 do
md = match_str.match(regex)
break unless md
match_datas << md
match_str = md.post_match
end
return match_datas
end
end
Running your sample data in the REPL results in the following:
> "123--abc,123--abc,123--abc".match_all(/(?<number>\d*)--(?<chars>[a-z]*)/)
=> [#<MatchData "123--abc" number:"123" chars:"abc">,
#<MatchData "123--abc" number:"123" chars:"abc">,
#<MatchData "123--abc" number:"123" chars:"abc">]
You may also find my test code useful:
describe String do
describe :match_all do
it "it works like scan, but uses MatchData objects instead of arrays and strings" do
mds = "ABC-123, DEF-456, GHI-098".match_all(/(?<word>[A-Z]+)-(?<number>[0-9]+)/)
mds[0][:word].should == "ABC"
mds[0][:number].should == "123"
mds[1][:word].should == "DEF"
mds[1][:number].should == "456"
mds[2][:word].should == "GHI"
mds[2][:number].should == "098"
end
end
end
I really liked @Umut-Utkan's solution, but it didn't quite do what I wanted, so I rewrote it a bit (note: the below might not be beautiful code, but it seems to work):
class String
def scan2(regexp)
names = regexp.names
# Each new key starts with its own empty array, so values can be appended.
captures = Hash.new { |hash, key| hash[key] = [] }
scan(regexp).each do |match|
names.zip(match).each do |name, value|
captures[name.to_sym] << value
end
end
captures
end
end
Now, if you do
p '12f3g4g5h5h6j7j7j'.scan2(/(?<alpha>[a-zA-Z])(?<digit>[0-9])/)
You get
{:alpha=>["f", "g", "g", "h", "h", "j", "j"], :digit=>["3", "4", "5", "5", "6", "7", "7"]}
(ie. all the alpha characters found in one array, and all the digits found in another array). Depending on your purpose for scanning, this might be useful. Anyway, I love seeing examples of how easy it is to rewrite or extend core Ruby functionality with just a few lines!
A year ago I wanted regular expressions that were easier to read, with named captures, so I made the following addition to String (it should maybe not be there, but it was convenient at the time):
scan2.rb:
class String
#Works as scan but stores the result in a hash indexed by variable/constant names (regexp PLACEHOLDERS) within parentheses.
#Example: Given the (constant) strings BTF, RCVR and SNDR and the regexp /#BTF# (#RCVR#) (#SNDR#)/
#the matches will be returned in a hash like: match[:RCVR] = <the match> and match[:SNDR] = <the match>
#Note: The #STRING_VARIABLE_OR_CONST# syntax has to be used. All occurrences of #STRING# will work as #{STRING}
#but are needed for the method to see the names to be used as indices.
def scan2(regexp2_str, mark='#')
regexp = regexp2_str.to_re(mark) #Evaluates the strings. Note: Must be reachable from here!
hash_indices_array = regexp2_str.scan(/\(#{mark}(.*?)#{mark}\)/).flatten #Look for string variable names within (#VAR#) or # replaced by <mark>
match_array = self.scan(regexp)
#Save matches in hash indexed by string variable names:
match_hash = Hash.new
match_array.flatten.each_with_index do |m, i|
match_hash[hash_indices_array[i].to_sym] = m
end
return match_hash
end
def to_re(mark='#')
re = /#{mark}(.*?)#{mark}/
return Regexp.new(self.gsub(re){eval $1}, Regexp::MULTILINE) #Evaluates the strings, creates RE. Note: Variables must be reachable from here!
end
end
Example usage (irb1.9):
> load 'scan2.rb'
> AREA = '\d+'
> PHONE = '\d+'
> NAME = '\w+'
> "1234-567890 Glenn".scan2('(#AREA#)-(#PHONE#) (#NAME#)')
=> {:AREA=>"1234", :PHONE=>"567890", :NAME=>"Glenn"}
Notes:
Of course it would have been more elegant to put the patterns (e.g. AREA, PHONE...) in a hash and add this hash with patterns to the arguments of scan2.
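A sketch of that hash-based variant, which also avoids eval. The helper name template_to_regexp and the lowercase keys are assumptions made for this example; literal parts of the template are used as regexp source unchanged:

```ruby
# Hypothetical helper: turns "#name#" markers into named capture groups,
# looking up each sub-pattern in the given hash instead of calling eval.
def template_to_regexp(template, patterns)
  source = template.gsub(/#(\w+)#/) do
    name = Regexp.last_match(1)
    # fetch raises KeyError when the template names an unknown pattern.
    "(?<#{name}>#{patterns.fetch(name.to_sym)})"
  end
  Regexp.new(source)
end

patterns = { area: '\d+', phone: '\d+', name: '\w+' }
regexp = template_to_regexp('#area#-#phone# #name#', patterns)
match = regexp.match("1234-567890 Glenn")

match[:area]   #=> "1234"
match[:phone]  #=> "567890"
match[:name]   #=> "Glenn"
```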
Piggybacking off of Mark Hubbart's answer, I added the following monkey-patch:
class ::Regexp
def match_all(str)
matches = []
str.scan(self) { matches << $~ }
matches
end
end
which can be used as /(?<letter>\w)/.match_all('word'), and returns:
[#<MatchData "w" letter:"w">, #<MatchData "o" letter:"o">, #<MatchData "r" letter:"r">, #<MatchData "d" letter:"d">]
This relies on, as others have said, the use of $~ in the scan block for the match data.
I like the match_all given by John, but I think it has an error.
The line:
match_datas << md
works if there are no captures () in the regex.
This code gives the whole line up to and including the pattern matched/captured by the regex. (The [0] part of MatchData) If the regex has capture (), then this result is probably not what the user (me) wants in the eventual output.
I think in the case where there are captures () in regex, the correct code should be:
match_datas << md[1]
The eventual output of match_datas will be an array of pattern capture matches starting from match_datas[0]. This is not quite what may be expected if a normal MatchData is wanted which includes a match_datas[0] value which is the whole matched substring followed by match_datas[1], match_datas[2], ... which are the captures (if any) in the regex pattern.
Things are complex - which may be why match_all was not included in native MatchData.
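For reference when choosing between storing md and md[1], here is a small sketch of what a single MatchData exposes. The input string with the "x," prefix is made up for illustration:

```ruby
md = "x,123--abc".match(/(?<number>\d+)--(?<chars>[a-z]+)/)

whole     = md[0]          #=> "123--abc"  (only the matched substring)
first     = md[1]          #=> "123"       (first capture)
by_name   = md[:chars]     #=> "abc"       (named capture)
captures  = md.captures    #=> ["123", "abc"]
preceding = md.pre_match   #=> "x,"        (text before the match)
```

Storing the MatchData itself keeps all of this available, while storing md[1] keeps only the first capture.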
