Detect if string has substring from array of possibilities - ruby

Is there a way to check whether a string contains any substring in an array?
Say I have bad_words = ['broccoli', 'cabbage', 'kale'], and my_string = "My shopping list features pizza, gummy bears, kale, and vodka." I want to check whether my_string has any of the bad_words items within it without using a loop/iterator. Is this possible?
It seems like people use Array#index to solve this problem, but I'm not sure how because, in my tests, it only returns true if the entire argument matches an entire item from the array: bad_words.index "kale" returns an index but bad_words.index "no kale" returns nil. So it's no good for substrings.

my_string =~ /\b#{Regexp.union(bad_words)}\b/
#=> 46
will do.
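The difference between whole-word and plain substring matching can be sketched as follows. This is a minimal illustration, assuming Ruby 2.4 or later for `match?`; the helper names `contains_any_word?` and `contains_any_substring?` are hypothetical and not part of Ruby.

```ruby
bad_words = ['broccoli', 'cabbage', 'kale']

# Hypothetical helper: whole-word match. The \b anchors keep "kale"
# from matching inside a longer word such as "kaleidoscope".
def contains_any_word?(string, words)
  string.match?(/\b#{Regexp.union(words)}\b/)
end

# Hypothetical helper: plain substring match, no anchors.
def contains_any_substring?(string, words)
  string.match?(Regexp.union(words))
end

puts contains_any_word?("pizza, kale, and vodka", bad_words)       # true
puts contains_any_word?("a kaleidoscope of pizza", bad_words)      # false
puts contains_any_substring?("a kaleidoscope of pizza", bad_words) # true
```

Regexp.union escapes each string, so words containing regex metacharacters are matched literally.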

Related

How do I detect the presence of a substring in 1 string in another string?

Say I have a string "rubinassociatespa", what I would like to do is detect any substring of that string with 3 characters or more, in any other string.
For example, the following strings should be detected:
rubin
associates
spa
ass
rub
etc.
But what should NOT be detected are the following strings:
rob
cpa
dea
ru
or any other substring that does not appear in my original string, or is shorter than 3 characters.
Basically, I have a string and I am comparing many other strings against it and I only want to match the strings that comprise a substring of the original string.
I hope that's clear.
str = "rubinassociatespa"
arr = %w| rubin associates spa ass rub rob cpa dea ru |
#=> ["rubin", "associates", "spa", "ass", "rub", "rob", "cpa", "dea", "ru"]
Just use String#include?.
def substring?(str, s)
  s.size >= 3 && str.include?(s)
end
arr.each { |s| puts "#{s}: #{substring? str, s}" }
# rubin: true
# associates: true
# spa: true
# ass: true
# rub: true
# rob: false
# cpa: false
# dea: false
# ru: false
You can use match.
str = "rubinassociatespa"
test_str = "associates"
str.match(test_str) #=> #<MatchData "associates">
str.match(test_str).to_s #=> "associates"
test_str = 'rob'
str.match(test_str) #=> nil
So, if test_str is a substring of str, the match method returns a MatchData object whose to_s is test_str; otherwise, it returns nil.
if test_str.length >= 3 && str.match(test_str)
  # do stuff here.
end
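One caveat: String#match converts a String argument into a Regexp, so regex metacharacters in test_str are interpreted rather than matched literally. A minimal sketch of a literal check, assuming that is what you want; the helper name `literal_substring?` is hypothetical.

```ruby
# Hypothetical helper: true when candidate (3+ characters) occurs
# literally in str. Regexp.escape neutralises metacharacters.
def literal_substring?(str, candidate)
  candidate.length >= 3 && str.match?(Regexp.escape(candidate))
end

str = "rubinassociatespa"
p str.match("r.b")               # "." acts as a wildcard, so this matches "rub"
p literal_substring?(str, "r.b") # false: the dot is matched literally
p literal_substring?(str, "rub") # true
```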
First, you need a list of acceptable strings. Something like https://github.com/first20hours/google-10000-english would probably be useful.
Second, you want a data structure that allows for fast lookups to see if a word is valid. I would use a Bloom filter for this. This gem might be useful if you don't want to implement it on your own: https://github.com/igrigorik/bloomfilter-rb
Then you need to initialize the Bloom filter with all the words in the valid word list.
Then, for each substring in your string, do a lookup in the Bloom filter to see if it is in the valid word list. See this example for how to get all substrings: "What is the best way to split a string to get all the substrings by Ruby?"
If the Bloom filter returns true, you need to do a secondary check to confirm that the word is actually in the list, since a Bloom filter is a probabilistic data structure and can return false positives. You probably need a database to store the valid word list, so you can just do a database lookup to confirm that the word is valid.
I hope this gives you an idea on how to proceed.
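The steps above can be sketched roughly as follows. This is not the Bloom filter itself: a plain Set stands in for both the filter and the confirming lookup, and the three-word list is a hypothetical placeholder for a real word list loaded from a file or database.

```ruby
require 'set'

# Hypothetical stand-in for the valid word list.
VALID_WORDS = Set.new(%w[rubin associates spa])

# Every substring of str that is at least min_length characters long.
def substrings(str, min_length = 3)
  (0...str.length).flat_map do |start|
    (min_length..(str.length - start)).map { |len| str[start, len] }
  end
end

# Substrings of str that appear in the word list.
def valid_words_in(str, words = VALID_WORDS)
  substrings(str).select { |s| words.include?(s) }.uniq
end

p valid_words_in("rubinassociatespa") # ["rubin", "associates", "spa"]
```

A string of length n has on the order of n squared substrings, so a fast membership test matters once the input strings get long.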

Why is this Regex result unexpected

The regex in question is
/(<iframe.*?><\/iframe>)/
I am using this ruby regex to match sections of a string then creating an array of the results.
The string is
"<p><iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe></p>\n<p>#1<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=cabe5d3ba31da\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#2<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=b03d31e4b5663\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n<p>#3<br />\n<iframe src=\"https://www.cloudy.ec/embed.php?id=f63895add1aac\" allowfullscreen=\"\" frameborder=\"0\" height=\"420\" width=\"640\"></iframe></p>\n"
I am calling .match() on the regex like so
/(<iframe.*?><\/iframe>)/.match(entry.content).to_a
The result is a duplicate of the first match
["<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>", "<iframe src=\"http://www.dailymotion.com/embed/video/k18WBkRTMldXzB7JYW5?logo=0&info=0\" frameborder=\"0\" height=\"450\" width=\"580\"></iframe>"]
I used Rubular and I was able to get the Regex to work there http://rubular.com/r/CYF0vgQtrX
The result is a duplicate of the first match
Even though the docs for Regexp#match() do a horrible job of describing what match() does, it actually finds only the first match:
str = "abc"
md = /./.match(str)
p md.to_a
--output:--
["a"]
Regexp#match() returns a MatchData object when there is a match. A MatchData object contains the whole match and the match for each group. If you call to_a() on a MatchData object, the return value is an Array containing the whole match followed by whatever matched each group in the regex:
str = "abc"
md = /(.)(.)(.)/.match(str)
p md.to_a
--output:--
["abc", "a", "b", "c"]
Because you specified a group in your regex, one result is the whole match, and the other result is what matched your group.
[A regex] was the first approach I thought of. If this wasn't going to work, then I was going to use nokogiri
From now on, nokogiri should be your first thought...because:
If you have a programming problem, and you think, "I'll use a regex", now you have two problems.
You should use scan instead of match here.
entry.content.scan(/<iframe.*?><\/iframe>/)
Using /(<iframe.*?><\/iframe>)/ with scan will return a 2D array. The documentation says:
If the pattern contains groups, each individual result is itself an array containing one entry per group.
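A small sketch of the three behaviours on a shortened input. The `html` string and the helper name `iframes` are hypothetical, chosen only to keep the example short.

```ruby
# Hypothetical helper: every <iframe> element in html, as a flat array.
def iframes(html)
  html.scan(/<iframe.*?><\/iframe>/)
end

html = '<p><iframe src="a"></iframe></p><p><iframe src="b"></iframe></p>'

# match stops at the first hit; to_a is [whole match, group 1].
p(/(<iframe.*?><\/iframe>)/.match(html).to_a)
# scan without a group returns a flat array with every hit.
p iframes(html)
# scan with a group returns one sub-array per hit.
p html.scan(/(<iframe.*?><\/iframe>)/)
```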

ruby/regex getting the first letter of each word

I want to get the first letter of each word put together, making something like "I need help" turn into "Inh". I was thinking to trim everything off, then going from there, or grab each first letter right away.
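You could simply use split, map and join together here. Note that String#first comes from ActiveSupport (Rails); in plain Ruby, use chr instead.
string = 'I need help'
result = string.split.map(&:first).join # requires ActiveSupport; plain Ruby: map(&:chr)
puts result #=> "Inh"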
How about regular expressions? Using split here puts the focus on parts of the string that you don't need for this problem, and then requires another step to extract the first letter of each word (chr). That's why I think a regular expression is better for this case. Note that this will also work if you have a - or another special character in the string. And, of course, you can add .upcase at the end to get a proper acronym.
string = 'something - something and something else'
string.scan(/\b\w/).join
#=> "ssase"
Alternative solution using regex
string = 'I need help'
result = string.scan(/(\A\w|(?<=\s)\w)/).flatten.join
puts result
This basically says "look for either the first character of the string or any word character directly preceded by whitespace". Because the pattern contains a group, scan returns an array of arrays of matches, which is flattened (made into one array) and joined (made into a string).
string = 'I need help'
result = string.split.map(&:chr).join
puts result
http://ruby-doc.org/core-2.0/String.html#method-i-chr
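The split-based and scan-based approaches agree on simple input but differ once punctuation is involved. A minimal comparison, with hypothetical helper names:

```ruby
# Hypothetical helper: first character of each whitespace-separated token.
def initials_by_split(string)
  string.split.map(&:chr).join
end

# Hypothetical helper: first word character after every word boundary.
def initials_by_scan(string)
  string.scan(/\b\w/).join
end

p initials_by_split('I need help')       # "Inh"
p initials_by_scan('I need help')        # "Inh"
p initials_by_split('well-known - fact') # "w-f"
p initials_by_scan('well-known - fact')  # "wkf"
```

Which result is preferable depends on whether hyphenated words should contribute one letter or two.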

Ruby - How to check if a string contains all the words in an array?

I have an array of strings:
phrases = ["Have a good Thanksgiving", "Eat lots of food"]
I have another array of single words: words = ["eat", "food"]
I want to return the entries in the first array if the string contains all the words in the second array.
So, it should look something like this:
phrases.select{ |x| x.include_all?(words) }
Should I just create the include_all? function to iterate through each member of the words array and do the comparison, or is there any built-in methods I'm missing?
You're actually very close to the solution.
phrases.select do |phrase|
words.all?{ |word| phrase.include? word }
end
The all? method is on Enumerable, and returns true if the block evaluates to true for each item in the collection.
Depending on exactly what your definition of the phrase "including" the word is, you may want to define your own include_all? method, or a method on String to determine the match. The include? method is case-sensitive and doesn't care about word boundaries. If those aren't your requirements, you can use a Regexp in place of include? or define your own method to wrap that logic up.
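With the sample data, the include? version returns an empty array, because "Eat" is capitalized in the phrase while the word list has "eat". A minimal sketch of a case-insensitive, whole-word variant; the helper name `include_all_words?` is hypothetical.

```ruby
# Hypothetical helper: true when every word appears in phrase as a
# whole word, ignoring case.
def include_all_words?(phrase, words)
  words.all? { |word| phrase.match?(/\b#{Regexp.escape(word)}\b/i) }
end

phrases = ["Have a good Thanksgiving", "Eat lots of food"]
words = ["eat", "food"]

p phrases.select { |phrase| words.all? { |word| phrase.include?(word) } } # []
p phrases.select { |phrase| include_all_words?(phrase, words) }           # ["Eat lots of food"]
```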

What's the splat doing here?

match, text, number = *"foobar 123".match(/([A-z]*) ([0-9]*)/)
I know this is doing some kind of regular expression match but what role does the splat play here and is there a way to do this without the splat so it's less confusing?
The splat is decomposing the regex match results (a MatchData with three elements: the whole match, the letters, and the numbers) into three variables. So we end up with:
match = "foobar 123"
text = "foobar"
number = "123"
Without the splat, there'd only be the one result (the MatchData) so Ruby wouldn't know how to assign it to the three separate variables.
is there a way to do this without the splat so it's less confusing?
Since a,b = [c,d] is the same as a,b = *[c,d], and splat calls to_a on its operand when it's not an array, you could simply call to_a explicitly and not need the splat:
match, text, number = "foobar 123".match(/([A-z]*) ([0-9]*)/).to_a
Don't know whether that's less confusing, but it's splatless.
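Another splat-free option is named captures, sketched below. The method name `parse_label` is hypothetical, and the character class is written as [A-Za-z] here because [A-z] also matches the punctuation characters that sit between "Z" and "a" in ASCII.

```ruby
# Hypothetical helper: returns a MatchData with named groups, or nil.
def parse_label(input)
  input.match(/(?<text>[A-Za-z]*) (?<number>[0-9]*)/)
end

md = parse_label("foobar 123")
p md[0]        # "foobar 123" (the whole match)
p md[:text]    # "foobar"
p md[:number]  # "123"

# captures returns only the groups, without the whole match.
text, number = md.captures
p [text, number] # ["foobar", "123"]
```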
There's a good explanation in the documentation for MatchData:
Because to_a is called when expanding *variable, there's a useful assignment shortcut for extracting matched fields. This is slightly slower than accessing the fields directly (as an intermediate array is generated).
all,f1,f2,f3 = *(/(.)(.)(\d+)(\d)/.match("THX1138."))
all #=> "HX1138"
f1 #=> "H"
f2 #=> "X"
f3 #=> "113"
String#match returns a MatchData object, which contains the whole match and the text captured by each group of the regular expression. The splat operator expands this object into an array so each of those values can be assigned separately.
If you just run
"foobar 123".match(/([A-z]*) ([0-9]*)/)
in irb, you can see the MatchData object, with the matches collected.
MatchData is not an array, but it supports array-style indexing (including ranges), so you can in fact do this as well:
match, text, number = "foobar 123".match(/([A-z]*) ([0-9]*)/)[0..2]
See the MatchData documentation to learn more.
