How do I scan text for multiple strings? - ruby

I'm scanning through a product name to check whether a specific string exists in it. Right now it works for a single string, but how can I scan for multiple strings? E.g. I'd like to scan for both apple and microsoft.
product.name.downcase.scan(/apple/)
If the string is detected I get ["apple"];
if not, it returns an empty array [ ].

You can use regex alternation:
product.name.downcase.scan(/apple|microsoft/)
If all you need to know is whether the string contains any of the specified strings, you are better off using a single match with =~ instead of scan.
str = 'microsoft, apple and microsoft once again'
res = str.scan(/apple|microsoft/) # => ["microsoft", "apple", "microsoft"]
# do something with res
# or
if str =~ /apple|microsoft/
  # do something
end
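When the list of names grows, the alternation can be built from an array instead of typed by hand. A minimal sketch, assuming a hypothetical BRANDS list and helper names that are not part of the question; Regexp.union escapes each string, and the IGNORECASE flag stands in for the downcase call:

```ruby
# BRANDS and both helper names are illustrative assumptions.
BRANDS = %w[apple microsoft].freeze

# Regexp.union escapes each string; rebuilding from its source adds the /i flag.
BRAND_PATTERN = Regexp.new(Regexp.union(BRANDS).source, Regexp::IGNORECASE)

# All brand mentions found in the name, normalised to lower case.
def brand_mentions(name)
  name.scan(BRAND_PATTERN).map(&:downcase)
end

# True when at least one brand appears (String#match? needs Ruby 2.4+).
def mentions_brand?(name)
  name.match?(BRAND_PATTERN)
end
```

brand_mentions collects every hit, while mentions_brand? stops at the first one, which mirrors the scan versus =~ distinction above.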

You could also skip regular expressions altogether:
['apple', 'pear', 'orange'].any?{|s| product.name.downcase.match(s)}
or
['apple', 'pear', 'orange'].any?{|s| product.name.downcase[s]}
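Note that String#match converts a string argument into a Regexp, so only the second form is truly regex-free. A sketch of the same check with String#include?, assuming a hypothetical contains_keyword? helper and keyword list:

```ruby
# KEYWORDS and contains_keyword? are illustrative names, not from the answer.
KEYWORDS = %w[apple pear orange].freeze

def contains_keyword?(text, keywords = KEYWORDS)
  haystack = text.downcase # downcase once, outside the loop
  keywords.any? { |keyword| haystack.include?(keyword) }
end
```

Because include? treats its argument as a literal substring, keywords containing characters such as . or + need no escaping.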

Related

Ruby extract string via regular expression

I have these strings:
'da_report/GY4LFDN6/2017_11/view_mission_join_player_count2017_11/index.html'
'da_report/GY4LFDN6/2017_11/activily_time2017_11/index.html'
From these two strings, I want to extract these two file names:
'2017_11/view_mission_join_player_count2017_11'
'2017_11/activily_time2017_11'
I wrote some regular expressions, but they seem wrong.
str = 'da_report/GY4LFDN6/2017_11/view_mission_join_player_count2017_11/index.html'
str[/([^\/index.html]+)/, 1] # => "a_r"

A regular expression is overkill here, and is prone to errors.
input = [
"da_report/GY4LFDN6/" \
"2017_11/view_mission_join_player_count2017_11" \
"/index.html",
"da_report/GY4LFDN6/" \
"2017_11/activily_time2017_11" \
"/index.html"
]
input.map { |str| str.split('/')[2..3].join('/') }
#⇒ [
# [0] "2017_11/view_mission_join_player_count2017_11",
# [1] "2017_11/activily_time2017_11"
# ]
or, more elegantly:
input.map { |str| str.split('/').grep(/2017_/).join('/') }
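The grep(/2017_/) variant only works while the directory names contain 2017_. A sketch that does not depend on the year, assuming the wanted part is always the last two directories before the file name (last_two_directories is a hypothetical helper name):

```ruby
# Assumption: the target is always the two directories directly above the file.
def last_two_directories(path)
  # File.dirname drops the trailing "/index.html" component.
  File.dirname(path).split('/').last(2).join('/')
end
```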

Use /(?<=GY4LFDN6\/)(.*)(?=\/index\.html)/
str = 'da_report/GY4LFDN6/2017_11/view_mission_join_player_count2017_11/index.html'
str[/(?<=GY4LFDN6\/)(.*)(?=\/index\.html)/]
=> "2017_11/view_mission_join_player_count2017_11"
live demo: http://rubular.com/r/Ued6UOXWDf

This answer assumes that you want to capture beginning with the third component of the path, up to and including the last component of the path before the filename. If so, then we can use the following regex pattern:
(?:[^/]*/){2}(.*)/.*
The quantity in parentheses is the capture group, i.e. what you want to extract from the entire path.
str = 'da_report/GY4LFDN6/2017_11/view_mission_join_player_count2017_11/index.html'
puts str[/(?:[^\/]*\/){2}(.*)\/.*/, 1]

If you are looking for the values near the end of the string, in the form string/string followed by /filename.extension, you could use a positive lookahead for the file name.
\w+\/\w+(?=\/\w+\.\w+$)

Based on your examples, you may be able to use a very simple regex.
def extract(str)
  str[/\d{4}_\d{2}.+\d{4}_\d{2}/]
end
extract 'da_report/GY4LFDN6/2017_11/view_mission_join_player_count2017_11/index.html'
#=> "2017_11/view_mission_join_player_count2017_11"
extract 'da_report/GY4LFDN6/2017_11/activily_time2017_11/index.html'
#=> "2017_11/activily_time2017_11"

Check if one of the values from both lists are present in the string in ruby

I have two arrays with logins and file extensions:
logins = ['bob', 'mark', 'joe']
extensions = ['.doc', '.xls']
I need to check whether at least one value from each list is present in a string (for example str = "aaa bob test.txt test text") and, if so, do some work.
How do I perform this check correctly in Ruby?
Right now I do it with several loops and if statements.

[logins, extensions].all? do |list|
  list.any? { |match| str.include? match }
end
You have two lists, logins and extensions. You want to make sure that all? of them do something and that 'something' is that the string includes any? of their elements.
The regex-based answer probably performs better, though it is a little less simple to write.
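To make the all?/any? combination concrete, here is a small sketch wrapped in a method. The matches_every_list? name and the sample strings are assumptions for illustration:

```ruby
LOGINS = %w[bob mark joe].freeze
EXTENSIONS = %w[.doc .xls].freeze

# True only if the string contains at least one entry from every list.
def matches_every_list?(str, lists)
  lists.all? do |list|
    list.any? { |candidate| str.include?(candidate) }
  end
end
```

With these lists, 'aaa bob test.xls test text' passes, while the question's 'aaa bob test.txt test text' does not, because it contains a login but neither extension.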

You could also use Regexp.union :
str = 'aaa bob test.xls test text'
logins = Regexp.union(['bob', 'mark', 'joe'])
extensions = Regexp.union(['.doc', '.xls'])
str =~ logins && str =~ extensions
# => 12
It returns nil if either one did not match, or an integer if both matched.
As an alternative, with Ruby 2.4 or later:
str.match?(logins) && str.match?(extensions)
which would return a boolean.
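One reason to prefer Regexp.union over hand-joining the strings with | is that it escapes regex metacharacters such as the dot in '.xls'. A short sketch, with office_file? as a hypothetical helper name:

```ruby
# Regexp.union escapes the dots, so the source is \.doc|\.xls
EXTENSION_PATTERN = Regexp.union(['.doc', '.xls'])

# A hand-built pattern for comparison: here each dot matches any character.
NAIVE_PATTERN = Regexp.new(['.doc', '.xls'].join('|'))

def office_file?(name)
  name.match?(EXTENSION_PATTERN)
end
```

With the naive pattern, a name such as 'report_xls' would match because the unescaped dot accepts the underscore; the union-based pattern rejects it.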

You can simply concatenate the arrays and then check whether the string contains any of the values. Note that this tests for a match from either list, not one from each.
str = 'aaa bob test.txt test text'
(logins + extensions).any? { |word| str.include?(word) }
This returns true or false.

Remove all non-alphabetical, non-numerical characters from a string?

If I wanted to remove things like:
.!,'"^-# from an array of strings, how would I go about this while retaining all alphabetical and numeric characters.
Allowed alphabetical characters should also include letters with diacritical marks including à or ç.

You should use a regex with the correct character property. In this case, you can invert the Alnum class (Alphabetic and numeric character):
"◊¡ Marc-André !◊".gsub(/\p{^Alnum}/, '') # => "MarcAndré"
For more complex cases, say you also wanted to keep punctuation, you can build a set of acceptable characters like:
"◊¡ Marc-André !◊".gsub(/[^\p{Alnum}\p{Punct}]/, '') # => "¡Marc-André!"
For all character properties, you can refer to the doc.

string.gsub(/[^[:alnum:]]/, "")
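For UTF-8 strings, the POSIX bracket [[:alnum:]] is Unicode-aware in Ruby, so accented letters survive. A sketch applying it to an array without mutating the originals (strip_non_alnum is a hypothetical name):

```ruby
# Returns new strings; the input array and its strings are left untouched.
def strip_non_alnum(strings)
  strings.map { |s| s.gsub(/[^[:alnum:]]/, '') }
end

# \u00E0 is "à" and \u00E7 is "ç", written as escapes to avoid encoding surprises.
SAMPLE_WORDS = ["\u00E0 la carte!", "gar\u00E7on #1"].freeze
```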

The following will work for an array:
z = ['asfdå', 'b12398!', 'c98347']
z.each { |s| s.gsub! /[^[:alnum:]]/, '' }
puts z.inspect
I borrowed Jeremy's suggested regex.

You might consider a regular expression.
http://www.regular-expressions.info/ruby.html
I'm assuming that you're using Ruby, since you tagged your post with it. You could go through the array, test each string against a regexp, and remove or keep it based on the result.
A regexp you might use could look something like this:
[^.!,^#-]
That matches any character that is not one of the characters inside the brackets (the hyphen goes last so that it is not read as a range). However, I suggest that you read up on regular expressions; you might find a better solution once you know their syntax and usage.
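If only the punctuation listed in the question should be removed, the same kind of character class can be used positively with gsub. A sketch, where the constant and method names are assumptions:

```ruby
# The characters from the question: . ! , ' " ^ - #
# The hyphen is placed last so it is a literal, not a range operator.
LISTED_PUNCTUATION = /[.!,'"^#-]/

def remove_listed_punctuation(str)
  str.gsub(LISTED_PUNCTUATION, '')
end
```

Unlike the alphanumeric-only approaches, this keeps spaces and any other character that is not explicitly listed.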

If you truly have an array (as you state) and it is an array of strings (I'm guessing), e.g.
foo = [ "hello", "42 cats!", "yöwza" ]
then I can imagine that you either want to update each string in the array with a new value, or that you want a modified array that only contains certain strings.
If the former (you want to 'clean' every string in the array) you could do one of the following:
foo.each{ |s| s.gsub! /\p{^Alnum}/, '' } # Change every string in place…
bar = foo.map{ |s| s.gsub /\p{^Alnum}/, '' } # …or make an array of new strings
#=> [ "hello", "42cats", "yöwza" ]
If the latter (you want to select a subset of the strings where each matches your criteria of holding only alphanumerics) you could use one of these:
# Select only those strings that contain ONLY alphanumerics
bar = foo.select{ |s| s =~ /\A\p{Alnum}+\z/ }
#=> [ "hello", "yöwza" ]
# Shorthand method for the same thing
bar = foo.grep /\A\p{Alnum}+\z/
#=> [ "hello", "yöwza" ]
In Ruby, regular expressions of the form /\A………\z/ require the entire string to match, as \A anchors the regular expression to the start of the string and \z anchors to the end.
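The difference from the line anchors ^ and $ shows up with multi-line strings. A small sketch (the constant and method names are illustrative):

```ruby
MULTILINE_INPUT = "abc\n!!!"

LINE_ANCHORED   = /^\p{Alnum}+$/    # matches if any single line is alphanumeric
STRING_ANCHORED = /\A\p{Alnum}+\z/  # matches only if the whole string is

def only_alnum?(str)
  str.match?(STRING_ANCHORED)
end
```

Here MULTILINE_INPUT matches LINE_ANCHORED because its first line is "abc", but only_alnum? rejects it.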

Regex with named capture groups getting all matches in Ruby

I have a string:
s="123--abc,123--abc,123--abc"
I tried using Ruby 1.9's new feature "named groups" to fetch all named group info:
/(?<number>\d*)--(?<chars>\s*)/
Is there an API like Python's findall which returns a collection of MatchData objects? In this case I need three matches returned, because 123 and abc each appear three times. Each MatchData should contain the details of each named capture, so that I can use m['number'] to get the matched value.

Named captures are suitable only for one matching result.
Ruby's analogue of findall is String#scan. You can either use scan result as an array, or pass a block to it:
irb> s = "123--abc,123--abc,123--abc"
=> "123--abc,123--abc,123--abc"
irb> s.scan(/(\d*)--([a-z]*)/)
=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
irb> s.scan(/(\d*)--([a-z]*)/) do |number, chars|
irb* p [number,chars]
irb> end
["123", "abc"]
["123", "abc"]
["123", "abc"]
=> "123--abc,123--abc,123--abc"

Chiming in super-late, but here's a simple way of replicating String#scan but getting the matchdata instead:
matches = []
foo.scan(regex){ matches << $~ }
matches now contains the MatchData objects that correspond to scanning the string.
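Applied to the data from the question, this keeps the named captures available on every match. A sketch, assuming the letters are meant to be matched with [a-z]+ and using Regexp.last_match as the readable spelling of $~ (scan_matches is a hypothetical name):

```ruby
PAIR_PATTERN = /(?<number>\d+)--(?<chars>[a-z]+)/

# Collects one MatchData per occurrence instead of arrays of strings.
def scan_matches(str, pattern)
  matches = []
  str.scan(pattern) { matches << Regexp.last_match }
  matches
end
```

Each element then supports m[:number], m[:chars] and, on Ruby 2.4+, m.named_captures.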

You can extract the capture names from the regexp using the names method. So what I did is, I used the regular scan method to get the matches, then zipped the names with every match to create a Hash.
class String
  def scan2(regexp)
    names = regexp.names
    scan(regexp).collect do |match|
      Hash[names.zip(match)]
    end
  end
end
Usage:
>> "aaa http://www.google.com.tr aaa https://www.yahoo.com.tr ddd".scan2 /(?<url>(?<protocol>https?):\/\/[\S]+)/
=> [{"url"=>"http://www.google.com.tr", "protocol"=>"http"}, {"url"=>"https://www.yahoo.com.tr", "protocol"=>"https"}]

@Nakilon is correct showing scan with a regex, however you don't even need to venture into regex land if you don't want to:
s = "123--abc,123--abc,123--abc"
s.split(',')
#=> ["123--abc", "123--abc", "123--abc"]
s.split(',').inject([]) { |a,s| a << s.split('--'); a }
#=> [["123", "abc"], ["123", "abc"], ["123", "abc"]]
This returns an array of arrays, which is convenient if you have multiple occurrences and need to see/process them all.
s.split(',').inject({}) { |h,s| n,v = s.split('--'); h[n] = v; h }
#=> {"123"=>"abc"}
This returns a hash, which, because the elements have the same key, has only the unique key value. This is good when you have a bunch of duplicate keys but want the unique ones. Its downside occurs if you need the unique values associated with the keys, but that appears to be a different question.
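The same results can be written with map and to_h, and the "all values per key" case can be handled by grouping. A sketch with hypothetical helper names:

```ruby
# [["123", "abc"], ...]: one [key, value] pair per comma-separated item.
def parse_pairs(str)
  str.split(',').map { |pair| pair.split('--', 2) }
end

# Keeps every value seen for a key instead of only the last one.
def group_pairs(str)
  parse_pairs(str).each_with_object(Hash.new { |h, k| h[k] = [] }) do |(key, value), grouped|
    grouped[key] << value
  end
end
```

Calling to_h on the result of parse_pairs gives the same hash as the inject version, with later duplicates overwriting earlier ones.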

If using ruby >=1.9 and the named captures, you could:
class String
  def scan2(regexp2_str, placeholders = {})
    return regexp2_str.to_re(placeholders).match(self)
  end
  def to_re(placeholders = {})
    re2 = self.dup
    separator = placeholders.delete(:SEPARATOR) || '' # Returns and removes the separator if :SEPARATOR is set.
    # Search for the pattern placeholders and replace them with the regex
    placeholders.each do |placeholder, regex|
      re2.sub!(separator + placeholder.to_s + separator, "(?<#{placeholder}>#{regex})")
    end
    return Regexp.new(re2, Regexp::MULTILINE) # Returns a regex using named captures.
  end
end
Usage (ruby >=1.9):
> "1234:Kalle".scan2("num4:name", num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
or
> re="num4:name".to_re(num4:'\d{4}', name:'\w+')
=> /(?<num4>\d{4}):(?<name>\w+)/m
> m=re.match("1234:Kalle")
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">
> m[:num4]
=> "1234"
> m[:name]
=> "Kalle"
Using the separator option:
> "1234:Kalle".scan2("#num4#:#name#", SEPARATOR:'#', num4:'\d{4}', name:'\w+')
=> #<MatchData "1234:Kalle" num4:"1234" name:"Kalle">

I needed something similar recently. This should work like String#scan, but return an array of MatchData objects instead.
class String
  # This method will return an array of MatchData objects rather than the
  # array of strings returned by the vanilla `scan`.
  def match_all(regex)
    match_str = self
    match_datas = []
    while match_str.length > 0 do
      md = match_str.match(regex)
      break unless md
      match_datas << md
      match_str = md.post_match
    end
    return match_datas
  end
end
Running your sample data in the REPL results in the following:
> "123--abc,123--abc,123--abc".match_all(/(?<number>\d*)--(?<chars>[a-z]*)/)
=> [#<MatchData "123--abc" number:"123" chars:"abc">,
#<MatchData "123--abc" number:"123" chars:"abc">,
#<MatchData "123--abc" number:"123" chars:"abc">]
You may also find my test code useful:
describe String do
  describe :match_all do
    it "works like scan, but uses MatchData objects instead of arrays and strings" do
      mds = "ABC-123, DEF-456, GHI-098".match_all(/(?<word>[A-Z]+)-(?<number>[0-9]+)/)
      mds[0][:word].should == "ABC"
      mds[0][:number].should == "123"
      mds[1][:word].should == "DEF"
      mds[1][:number].should == "456"
      mds[2][:word].should == "GHI"
      mds[2][:number].should == "098"
    end
  end
end
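One caveat with re-matching against post_match: a pattern that can match the empty string never shortens the remaining text, so the loop would not terminate. A possible variant, sketched here as a standalone method (match_all_from is a hypothetical name), keeps the original string and advances a position with Regexp#match(str, pos):

```ruby
# Returns an array of MatchData objects, scanning left to right.
def match_all_from(str, regex)
  results = []
  pos = 0
  while pos <= str.length && (md = regex.match(str, pos))
    results << md
    # Step past zero-length matches so the loop always advances.
    pos = md.end(0) == md.begin(0) ? md.end(0) + 1 : md.end(0)
  end
  results
end
```

Because every match is made against the full string, pre_match and offsets refer to the original text rather than to a shrinking remainder.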

I really liked @Umut-Utkan's solution, but it didn't quite do what I wanted so I rewrote it a bit (note, the below might not be beautiful code, but it seems to work)
class String
  def scan2(regexp)
    names = regexp.names
    captures = Hash.new
    scan(regexp).each do |match|
      names.zip(match).each do |name, value|
        # Hash has no #add method; append to a per-name array instead.
        (captures[name.to_sym] ||= []) << value
      end
    end
    return captures
  end
end
Now, if you do
p '12f3g4g5h5h6j7j7j'.scan2(/(?<alpha>[a-zA-Z])(?<digit>[0-9])/)
You get
{:alpha=>["f", "g", "g", "h", "h", "j", "j"], :digit=>["3", "4", "5", "5", "6", "7", "7"]}
(ie. all the alpha characters found in one array, and all the digits found in another array). Depending on your purpose for scanning, this might be useful. Anyway, I love seeing examples of how easy it is to rewrite or extend core Ruby functionality with just a few lines!

A year ago I wanted regular expressions that were easier to read and had named captures, so I made the following addition to String (it should maybe not be there, but it was convenient at the time):
scan2.rb:
class String
  # Works like scan but stores the result in a hash indexed by variable/constant names (regexp PLACEHOLDERS) within parentheses.
  # Example: Given the (constant) strings BTF, RCVR and SNDR and the regexp /#BTF# (#RCVR#) (#SNDR#)/
  # the matches will be returned in a hash like: match[:RCVR] = <the match> and match[:SNDR] = <the match>
  # Note: The #STRING_VARIABLE_OR_CONST# syntax has to be used. All occurrences of #STRING# will work as #{STRING}
  # but the syntax is needed for the method to see the names to be used as indices.
  def scan2(regexp2_str, mark='#')
    regexp = regexp2_str.to_re(mark) # Evaluates the strings. Note: Must be reachable from here!
    hash_indices_array = regexp2_str.scan(/\(#{mark}(.*?)#{mark}\)/).flatten # Look for string variable names within (#VAR#), or with # replaced by <mark>
    match_array = self.scan(regexp)
    # Save matches in hash indexed by string variable names:
    match_hash = Hash.new
    match_array.flatten.each_with_index do |m, i|
      match_hash[hash_indices_array[i].to_sym] = m
    end
    return match_hash
  end
  def to_re(mark='#')
    re = /#{mark}(.*?)#{mark}/
    return Regexp.new(self.gsub(re){eval $1}, Regexp::MULTILINE) # Evaluates the strings, creates RE. Note: Variables must be reachable from here!
  end
end
Example usage (irb1.9):
> load 'scan2.rb'
> AREA = '\d+'
> PHONE = '\d+'
> NAME = '\w+'
> "1234-567890 Glenn".scan2('(#AREA#)-(#PHONE#) (#NAME#)')
=> {:AREA=>"1234", :PHONE=>"567890", :NAME=>"Glenn"}
Notes:
Of course it would have been more elegant to put the patterns (e.g. AREA, PHONE...) in a hash and add this hash with patterns to the arguments of scan2.

Piggybacking off of Mark Hubbart's answer, I added the following monkey-patch:
class ::Regexp
  def match_all(str)
    matches = []
    str.scan(self) { matches << $~ }
    matches
  end
end
which can be used as /(?<letter>\w)/.match_all('word'), and returns:
[#<MatchData "w" letter:"w">, #<MatchData "o" letter:"o">, #<MatchData "r" letter:"r">, #<MatchData "d" letter:"d">]
This relies on, as others have said, the use of $~ in the scan block for the match data.

I like the match_all given by John, but I think it has an error.
The line:
match_datas << md
works if there are no captures () in the regex.
This code gives the whole line up to and including the pattern matched/captured by the regex. (The [0] part of MatchData) If the regex has capture (), then this result is probably not what the user (me) wants in the eventual output.
I think in the case where there are captures () in regex, the correct code should be:
match_datas << md[1]
The eventual output of match_datas will be an array of captured substrings, starting from match_datas[0]. This is not quite what may be expected if a normal MatchData is wanted, in which element [0] is the whole matched substring and elements [1], [2], ... are the captures (if any) in the regex pattern.
Things are complex, which may be why match_all is not built into Ruby.
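For reference when weighing this suggestion, here is a small sketch of what a MatchData object exposes, using a made-up sample string. Element [0] is only the matched substring, and the text before and after it is available separately:

```ruby
# Sample input is illustrative only.
SAMPLE_MATCH = "x id=42; id=7".match(/id=(\d+)/)

WHOLE_MATCH   = SAMPLE_MATCH[0]         # "id=42", the matched substring only
FIRST_CAPTURE = SAMPLE_MATCH[1]         # "42", the first capture group
BEFORE_MATCH  = SAMPLE_MATCH.pre_match  # "x ", text before the match
AFTER_MATCH   = SAMPLE_MATCH.post_match # "; id=7", text after the match
ALL_CAPTURES  = SAMPLE_MATCH.captures   # ["42"]
```

Storing the whole MatchData therefore loses nothing: callers who want only the capture can still ask for md[1] or md.captures.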

Get id from string with Ruby

I have strings like this:
"/detail/205193-foo-var-bar-foo.html"
"/detail/183863-parse-foo.html"
"/detail/1003-bar-foo-bar.html"
How to get ids (205193, 183863, 1003) from it with Ruby?

Just say s[/\d+/]
[
"/detail/205193-foo-var-bar-foo.html",
"/detail/183863-parse-foo.html",
"/detail/1003-bar-foo-bar.html"
].each { |s| puts s[/\d+/] }

You could also do something like this:
"/detail/205193-foo-var-bar-foo.html".gsub(/\/detail\//,'').to_i
=> 205193

regex = /\/detail\/(\d+)-/
s = "/detail/205193-foo-var-bar-foo.html"
id = regex.match s # => #<MatchData "/detail/205193-" 1:"205193">
id[1] # => "205193"
$1 # => "205193"
The MatchData object stores the entire matched portion of the string in its first element (index 0), and any matched subgroups from the second element onward (depending on how many subgroups there are).
Ruby also provides a shortcut to the first subgroup of the most recent match: $1.

One easy way to do it would be to strip out the /detail/ part of your string, and then just call to_i on what's left over:
"/detail/1003-bar-foo-bar.html".gsub('/detail/','').to_i # => 1003

s = "/detail/205193-foo-var-bar-foo.html"
num = (s =~ /detail\/(\d+)-/) ? Integer($1) : nil
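The same idea can be wrapped in a small method that avoids the global $1 by using String#[] with a capture index. A sketch, assuming ids always directly follow /detail/ at the start of the string (detail_id is a hypothetical name):

```ruby
# Returns the id as an Integer, or nil when the path has no /detail/<digits>- prefix.
def detail_id(path)
  digits = path[%r{\A/detail/(\d+)-}, 1]
  # Base 10 is given explicitly so leading zeros are not read as octal.
  digits && Integer(digits, 10)
end
```

The %r{...} literal avoids having to escape the forward slashes in the pattern.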
