Ruby count # of matches in string from array - ruby

I have a string, for example:
'This is a test string'
and an array:
['test', 'is']
I need to find out how many elements of the array are present in the string (in this case, it would be 2). What's the best/Ruby way of doing this? Also, I am doing this thousands of times, so please keep efficiency in mind.
What I tried so far:
count = 0
array.each do |el|
  count += 1 if string.include?(el) # increment counter on a match
end
Thanks

['test', 'is'].count{ |s| /\b#{s}\b/ =~ 'This is a test string' }
Edit: adjusted for full word matching.
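Since the question mentions running this thousands of times, one possible variation is to build the word-boundary patterns once and reuse them. This is only a sketch: the variable names are made up for illustration, Regexp.escape is added on the assumption that the words might contain regex metacharacters, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
string = 'This is a test string'
words  = ['test', 'is']

# Build one whole-word pattern per word, once, escaping any metacharacters.
patterns = words.map { |w| /\b#{Regexp.escape(w)}\b/ }

# match? only reports whether there is a match and skips building MatchData.
match_count = patterns.count { |re| re.match?(string) }
```

Here match_count is 2, because "test" and the standalone word "is" both appear; the "is" inside "This" is not needed for the match.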

['test', 'is'].count { |e| 'This is a test string'.split.include? e }

Your question is ambiguous.
If you are counting the occurrences, then:
('This is a test string'.scan(/\w+/).map(&:downcase) & ['test', 'is']).length
If you are counting the tokens, then:
(['test', 'is'] & 'This is a test string'.scan(/\w+/).map(&:downcase)).length
You can further speed up the calculation by replacing Array#& by some operation using a Hash (or Set).
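A rough sketch of the Set-based lookup suggested above. Note that Array#& returns only unique elements, so counting repeated occurrences needs a different expression; both counts are shown here. The example string is made up so that the two counts differ.

```ruby
require 'set'

# Hypothetical input in which "is" appears twice as a word.
string = 'This is a test string; is it?'
wanted = Set.new(['test', 'is'])

tokens = string.scan(/\w+/).map(&:downcase)

# How many distinct wanted words appear in the string.
distinct_matches = tokens.uniq.count { |t| wanted.include?(t) }

# How many tokens in the string are wanted words (repeats counted).
occurrences = tokens.count { |t| wanted.include?(t) }
```

With this input distinct_matches is 2 and occurrences is 3. Set#include? is a hash lookup, so the cost per token does not grow with the size of the word list.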

Kyle's answer gave you the simple practical way of doing the job. But looking at it, allow me to remark that more efficient algorithms exist to solve your problem, when n (string length and/or number of matched strings) climbs to millions. We commonly encounter such problems in biology.

The following will work provided there are no duplicates in the string or the array.
str = "This is a test string"
arr = ["test", "is"]
match_count = arr.size - (arr - str.split).size # 2 in this example

Related

Find the last occurrence of a string being a certain length

I know there is a method to find the largest string in an array
def longest_word(string_of_words)
x = string_of_words.split(" ").max_by(&:length)
end
However, if there are multiple words with the longest length, how do I return the last instance of the word with the longest length? Is there a method for this, or do I use indexing?
Benjamin
What if we took advantage of reverse?
"asd qweewe lol qwerty df qwsazx".split.reverse_each.max_by(&:length)
=> "qwsazx"
Simply reverse your words array before applying max_by.
The first longest word from the reversed array will be the last one in your sentence.
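You can also do it this way:
> "asd qweewe lol qwerty df qwsazx".split.sort_by(&:length).last
#=> "qwsazx"
Note: this splits the words, sorts them by length in ascending (default) order, and takes the last word. Be aware that sort_by is not guaranteed to be stable, so when several words tie for the longest length it may not always return the last one.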
You can use inject, which replaces the running maximum whenever the next word is at least as long (via <=), so later ties win. With no argument, inject uses the first element of its receiver as the initial value.
str.split.inject { |m,s| m.size <= s.size ? s : m }
str.split.max_by.with_index { |e, i| [e.length, i] }
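The idea of breaking ties by position can be sketched as follows. The method name last_longest_word is made up for illustration; the comparison key is the pair [length, index], so among words of equal length the one with the larger index wins.

```ruby
# Hypothetical helper: returns the last of the longest words, or nil if
# the sentence contains no words.
def last_longest_word(sentence)
  pair = sentence.split.each_with_index.max_by { |word, i| [word.length, i] }
  pair&.first
end

result = last_longest_word("asd qweewe lol qwerty df qwsazx")
```

Here result is "qwsazx", the last of the three six-letter words. Unlike sorting, this makes a single pass and does not depend on sort stability.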
There's no need to convert the string to an array.
def longest_word(str)
str.gsub(/[[:alpha:]]+/).
each_with_object('') {|s,longest| longest.replace(s) if s.size >= longest.size}
end
longest_word "Many dogs love to swim in the sea"
#=> "swim"
Two points.
I've used String#gsub with a single argument and no block, which returns an enumerator that feeds words to Enumerable#each_with_object. The string argument is not modified. This is an unusual use of gsub that I've been able to use to advantage in several situations.
Within the block it's necessary to use longest.replace(s) rather than longest = s. That's because each_with_object returns the originally given object (usually modified by the block), but does not update that object on each iteration. longest = s merely returns s (is equivalent to just s) but does not alter the value of the block variable. By contrast, longest.replace(s) modifies the original object.
With regard to the second of these two points, it is interesting to contrast the use of each_with_object with Enumerable#reduce (aka inject).
str.gsub(/[[:alpha:]]+/).
reduce('') {|longest,s| s.size >= longest.size ? s : longest }
#=> "swim"

Gsub-ing multiple substrings to an empty string

I often remove substrings from strings by doing this:
"don't use bad words like".gsub("bad", "").gsub("words", "").gsub("like", "")
What's a more concise/better way of excising long lists of substrings from a string in Ruby?
I would go with nronas' answer, however people tend to forget about Regexp.union:
str = "don't use bad words like"
str.gsub(Regexp.union('bad', 'words', 'like'), '')
# or
str.gsub(Regexp.union(['bad', 'words', 'like']), '')
You can always use a regex when you're gsubbing :P, like this:
str = "don't use bad words like"
str.gsub(/bad|words|like/, '')
I hope that helps
Edit2: Upon reflection, I think what I have below (or any solution that first breaks the string into an array of words) is really what you want. Suppose:
str = "Becky darned her socks before playing badmitten."
bad_words = ["bad", "darn", "socks"]
Which of the following would you want?
str.gsub(Regexp.union(*bad_words), '')
#=> "Becky ed her  before playing mitten."
or
(str.split - bad_words).join(' ')
#=> "Becky darned her before playing badmitten."
Alternatively,
bad_words.reduce(str.split) { |arr,bw| arr.delete(bw); arr }.join(' ')
#=> "Becky darned her before playing badmitten."
:2tidE
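A middle ground between the two approaches in Edit2, offered only as a sketch: keep Regexp.union, but wrap it in word boundaries so that only whole words are removed, then collapse the doubled spaces that removal leaves behind. The names pattern and cleaned are made up for illustration.

```ruby
str = "Becky darned her socks before playing badmitten."
bad_words = ["bad", "darn", "socks"]

# Interpolating a Regexp embeds it as a group, so \b applies to every alternative.
pattern = /\b#{Regexp.union(bad_words)}\b/

# Remove whole-word matches, then squeeze runs of spaces down to one.
cleaned = str.gsub(pattern, '').squeeze(' ')
```

Here cleaned is "Becky darned her before playing badmitten.": "socks" is removed, while "darned" and "badmitten" survive because the bad words are not whole words there. Unlike split and join, this leaves punctuation attached to neighbouring words untouched.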
Edit1: I've come to my senses and purged my solution. It was much too elaborate (and inefficient) for such a simple problem. I've just left an observation. :1tidE
If you want to end up with just a single space between words, you need to take a different tack:
(str.split - bad_words).join(' ')
#=> "don't use"
I already suggested this to Cary, but it's here:
bad_words = %w[bad words like]
h = Hash.new{|h, k| k}.merge(bad_words.product(['']).to_h)
"don't use bad words like".gsub(/\w+/, h)

Find the longest substring in a string

I would like to find the longest sequence of repeated characters in a string.
ex:
"aabbccc" #=> ccc
"aabbbddccdddd" #=> dddd
etc
In the first example, ccc is the longest sequence because c is repeated 3 times. In the second example, dddd is the longest sequence because d is repeated 4 times.
It should be something like this:
b = []
a.scan(/(.)(.)(.)/) do |x,y,z|
b<<x<<y<<z if x==y && y==z
end
but with some flags to keep the count of repeating, I guess
This should work:
string = 'aabbccc'
string.chars.chunk {|a| a}.max_by {|_, ary| ary.length}.last.join
Update:
Explanation of |_, ary|: at this point we have an array of 2-element arrays. We only need the second element of each and ignore the first. If we wrote |char, ary| instead, some IDEs would complain about an unused local variable. Naming it _ signals that the value is intentionally unused.
Using regex:
We can achieve same thing with regex:
string.scan(/([a-z])(\1*)/).map(&:join).max_by(&:length)
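The pattern above only covers lowercase a-z. A possible generalisation to any character, shown as a sketch with a made-up method name, captures the whole run in the first group and uses a backreference to the single character in the second group:

```ruby
# Hypothetical helper: longest run of one repeated character, or nil for "".
def longest_run(string)
  # Group 1 is the full run, group 2 is its first character.
  string.scan(/((.)\2*)/).map(&:first).max_by(&:length)
end

run_a = longest_run("aabbccc")
run_b = longest_run("aabbbddccdddd")
```

Here run_a is "ccc" and run_b is "dddd". If two runs tie for the longest, max_by returns the first one it encounters.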
Here's a solution using a regular expression:
LETTER_MATCH = Regexp.new(('a'..'z').collect do |letter|
"#{letter}+"
end.join('|'))
def repeated(string)
string.scan(LETTER_MATCH).sort_by(&:length).last
end
Here's another solution. It's longer, but it still works:
def most_friquent_char_in_a_row(my_str)
my_str = my_str.chars
temp=[]
ary=[]
for i in 0..my_str.count-1
if my_str[i]==my_str[i+1]
temp<<my_str[i] unless temp.include?(my_str[i])
temp<<my_str[i+1]
else
ary<<temp
temp=[]
end
end
result = ary.max_by(&:size).join
p "#{result} - #{result.size}"
end

Fastest way to count occurrences of a list of substrings in Ruby

My problem is simple, I have a list of substrings, and I have to count how many substrings are included in a specific string.
Here is my code :
string = "..."
substrings = ["hello", "foo", "bar", "brol"]
count = 0
substrings.each do |sub|
count += 1 if string.include?(sub)
end
In this example, we run through the entire string 4 times, which is quite time-consuming.
How would you optimize this process?
This uses a Regexp.union to run through the string only once:
string = 'hello there! this is foobar!'
substrings = ["hello", "foo", "bar", "brol"]
string.scan(Regexp.union(substrings)).count
# => 3
Though this solution is markedly slower with small input, it has lower complexity: for a string of length n and substrings of length m, the original solution has a complexity of O(m*n), while this solution has a complexity of O(m+n).
Update
After reading the question and my answer again, I've come to the conclusion that not only is this a premature optimization (as @Max has noted), but my answer is also semantically different from the OP's code.
Let me explain: the OP's code counts how many of the substrings appear at least once in the string, while my solution counts how many appearances there are of any of the substrings:
op_solution('hello hello there', ["hello", "foo", "bar", "brol"])
# => 1
uri_solution('hello hello there', ["hello", "foo", "bar", "brol"])
# => 2
This also explains why my solution is so slow, even for long strings - although it has only one pass on the input string, it has to pass all of it, while the original code stops at the first occurrence of a word.
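For concreteness, the two helpers named in the comparison above could be defined roughly as follows. These definitions are an assumption reconstructed from the code shown earlier in the thread, and distinct_in_one_pass is a made-up third variant.

```ruby
# Counts how many of the substrings occur at least once (the OP's semantics).
def op_solution(string, substrings)
  substrings.count { |sub| string.include?(sub) }
end

# Counts every occurrence of any substring in a single scan.
def uri_solution(string, substrings)
  string.scan(Regexp.union(substrings)).count
end

# Hypothetical variant: a single scan, then count the distinct matches.
def distinct_in_one_pass(string, substrings)
  string.scan(Regexp.union(substrings)).uniq.size
end

subs = ["hello", "foo", "bar", "brol"]
a = op_solution('hello hello there', subs)
b = uri_solution('hello hello there', subs)
c = distinct_in_one_pass('hello hello there', subs)
```

Here a is 1, b is 2, and c is 1. The third variant only agrees with the OP's semantics when the substrings cannot overlap one another in the text, because scan consumes each match and never revisits those characters.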
My conclusion is: go with @Arup's solution. It will not be faster than yours, it is just more succinct, but I can't think of anything better :)
Write it as:
substrings.count { |sub| string.include?(sub) }
substrings.collect { |i| string.scan(i).count }.sum
Elegant.

Case-insensitive Array#include?

I want to know the best way to make String#include? ignore case. Currently I'm doing the following. Any suggestions? Thanks!
a = "abcDE"
b = "CD"
result = a.downcase.include? b.downcase
Edit:
How about Array#include? All elements of the array are strings.
Summary
If you are only going to test a single word against an array, or if the contents of your array changes frequently, the fastest answer is Aaron's:
array.any?{ |s| s.casecmp(mystr)==0 }
If you are going to test many words against a static array, it's far better to use a variation of farnoy's answer: create a copy of your array that has all-lowercase versions of your words, and use include?. (This assumes that you can spare the memory to create a mutated copy of your array.)
# Do this once, or each time the array changes
downcased = array.map(&:downcase)
# Test lowercase words against that array
downcased.include?( mystr.downcase )
Even better, create a Set from your array.
# Do this once, or each time the array changes
downcased = Set.new array.map(&:downcase)
# Test lowercase words against that array
downcased.include?( mystr.downcase )
My original answer below is a very poor performer and generally not appropriate.
Benchmarks
Following are benchmarks for looking for 1,000 words with random casing in an array of slightly over 100,000 words, where 500 of the words will be found and 500 will not.
The 'regex' test is my answer here, using any?.
The 'casecmp' test is Arron's answer, using any? from my comment.
The 'downarray' test is farnoy's answer, re-creating a new downcased array for each of the 1,000 tests.
The 'downonce' test is farnoy's answer, but pre-creating the lookup array once only.
The 'set_once' test is creating a Set from the array of downcased strings, once before testing.
               user     system      total        real
regex     18.710000   0.020000  18.730000 ( 18.725266)
casecmp    5.160000   0.000000   5.160000 (  5.155496)
downarray 16.760000   0.030000  16.790000 ( 16.809063)
downonce   0.650000   0.000000   0.650000 (  0.643165)
set_once   0.040000   0.000000   0.040000 (  0.038955)
If you can create a single downcased copy of your array once to perform many lookups against, farnoy's answer is the best (assuming you must use an array). If you can create a Set, though, do that.
If you like, examine the benchmarking code.
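The original benchmarking code is not reproduced here. A minimal sketch of how such a comparison could be set up with Ruby's standard Benchmark module follows; the data is made up (a much smaller word list than the one described above), so the timings will not match the table.

```ruby
require 'benchmark'
require 'set'

# Made-up data, not the author's word list: 10,000 words and 100 lookups,
# of which the first 50 are present (ignoring case) and the rest are not.
array   = Array.new(10_000) { |i| "Word#{i}" }
needles = Array.new(100) { |i| "wORD#{i * 200}" }

# Built once, before any lookups.
downcased = array.map(&:downcase)
down_set  = Set.new(downcased)

Benchmark.bm(9) do |x|
  x.report('casecmp')  { needles.each { |n| array.any? { |s| s.casecmp(n) == 0 } } }
  x.report('downonce') { needles.each { |n| downcased.include?(n.downcase) } }
  x.report('set_once') { needles.each { |n| down_set.include?(n.downcase) } }
end
```

All three approaches should agree on which needles are found; only the time taken differs.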
Original Answer
I (originally said that I) would personally create a case-insensitive regex (for a string literal) and use that:
re = /\A#{Regexp.escape(str)}\z/i # Match exactly this string, no substrings
all = array.grep(re) # Find all matching strings…
any = array.any?{ |s| s =~ re } # …or see if any matching string is present
Using any? can be slightly faster than grep as it can exit the loop as soon as it finds a single match.
For an array, use:
array.map(&:downcase).include?(string.downcase)
Regexps are very slow and should be avoided.
You can use casecmp to do your comparison, ignoring case.
"abcdef".casecmp("abcde") #=> 1
"aBcDeF".casecmp("abcdef") #=> 0
"abcdef".casecmp("abcdefg") #=> -1
"abcdef".casecmp("ABCDEF") #=> 0
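If only equality matters, String#casecmp? (with a question mark, available from Ruby 2.4) returns true or false directly instead of -1, 0 or 1. A short sketch with a made-up word list:

```ruby
# Hypothetical data for illustration.
words = ["Apple", "BANANA", "cherry"]

same  = "aBcDeF".casecmp?("abcdef")                # true
found = words.any? { |w| w.casecmp?("banana") }    # true
missing = words.any? { |w| w.casecmp?("durian") }  # false
```

This avoids the == 0 comparison and reads closer to a case-insensitive include? check.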
class String
  def caseinclude?(x)
    # Compare downcased copies so the check ignores case
    downcase.include?(x.downcase)
  end
end
my_array.map!{|c| c.downcase.strip}
Here map! changes my_array in place, whereas map returns a new array.
To farnoy: in my case your example doesn't work for me. I'm actually looking to do this with a substring match against any element.
Here's my test case.
x = "<TD>", "<tr>", "<BODY>"
y = "td"
x.collect { |r| r.downcase }.include? y
=> false
x[0].include? y
=> false
x[0].downcase.include? y
=> true
Your case works with an exact case-insensitive match.
a = "TD", "tr", "BODY"
b = "td"
a.collect { |r| r.downcase }.include? b
=> true
I'm still experimenting with the other suggestions here.
---EDIT INSERT AFTER HERE---
I found the answer. Thanks to Drew Olsen
var1 = "<TD>", "<tr>","<BODY>"
=> ["<TD>", "<tr>", "<BODY>"]
var2 = "td"
=> "td"
var1.find_all{|item| item.downcase.include?(var2)}
=> ["<TD>"]
var1[0] = "<html>"
=> "<html>"
var1.find_all{|item| item.downcase.include?(var2)}
=> []
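An alternative sketch for the same case-insensitive substring search uses Array#grep with a case-insensitive pattern. The method name grep_substring_ci is made up, and Regexp.escape is added on the assumption that the fragment should be treated as literal text.

```ruby
# Hypothetical helper: elements containing the fragment, ignoring case.
def grep_substring_ci(items, fragment)
  items.grep(/#{Regexp.escape(fragment)}/i)
end

tags = ["<TD>", "<tr>", "<BODY>"]
hits = grep_substring_ci(tags, "td")
none = grep_substring_ci(tags, "html")
```

Here hits is ["<TD>"] and none is an empty array. To get a plain true or false instead, tags.any? { |t| t.downcase.include?("td") } does the same check without a regex.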
