Compare string against array and extract array elements present in ruby - ruby

I have the following string:
str = "This is a string"
What I want to do is compare it with this array:
a = ["this", "is", "something"]
The result should be an array with "this" and "is" because both are present in the array and in the given string. "something" is not present in the string so it shouldn't appear. How can I do this?

One way to do this:
str = "This is a string"
a = ["this","is","something"]
str.downcase.split & a
# => ["this", "is"]
This assumes the elements of a are always lowercase.
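If the array is not guaranteed to be lowercase, something like the following sketch normalises both sides before comparing. The method name `common_words` is hypothetical, and it assumes words are separated by whitespace with no punctuation attached.

```ruby
# Hypothetical helper: returns the elements of `words` that occur in `str`,
# ignoring case. Assumes whitespace-separated words without punctuation.
def common_words(str, words)
  present = str.downcase.split                      # lowercased words of the string
  words.select { |w| present.include?(w.downcase) } # keep original spelling of matches
end

common_words("This is a string", ["This", "IS", "something"])
# => ["This", "IS"]
```

Unlike `&`, this keeps the array elements in their original case.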

There are always many ways to do this sort of thing:
str = "this is the example string"
words_to_compare = ["dogs", "ducks", "seagulls", "the"]
words_to_compare.select{|word| word =~ Regexp.union(str.split) }
#=> ["the"]
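One caveat with this approach: `=~` against an unanchored union succeeds on substrings, so a word such as "thesis" would be selected because it contains "the". A minimal sketch of a whole-word variant, assuming exact matches are what is wanted (the sample word "thesis" and the variable names are only illustrative):

```ruby
str = "this is the example string"
words_to_compare = ["dogs", "thesis", "the"]

unanchored = Regexp.union(str.split)             # /this|is|the|example|string/
anchored   = /\A(?:#{unanchored.source})\z/      # must match the whole word

loose  = words_to_compare.select { |w| w =~ unanchored }
# => ["thesis", "the"]
strict = words_to_compare.select { |w| w =~ anchored }
# => ["the"]
```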

Your question has an XY problem smell to it. Usually when we want to find what words exist the next thing we want to know is how many times they exist. Frequency counts are all over the internet and Stack Overflow. This is a minor modification to such a thing:
str = "This is a string"
a = ["this", "is", "something"]
a_hash = a.each_with_object({}) { |i, h| h[i] = 0 } # => {"this"=>0, "is"=>0, "something"=>0}
That defined a_hash with the keys being the words to be counted.
str.downcase.split.each{ |k| a_hash[k] += 1 if a_hash.key?(k) }
a_hash # => {"this"=>1, "is"=>1, "something"=>0}
a_hash now contains the counts of the word occurrences. if a_hash.key?(k) is the main difference we'd see compared to a regular word-count as it's only allowing word-counts to occur for the words in a.
a_hash.keys.select{ |k| a_hash[k] > 0 } # => ["this", "is"]
It's easy to find the words that were in common because the counter is > 0.
This is a very common problem in text processing so it's good knowing how it works and how to bend it to your will.
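On Ruby 2.7 or later, the same counting can probably be written more compactly with `Enumerable#tally`. This is only a shorter variant of the steps above, not a different algorithm; the variable names `counts`, `wanted` and `present` are illustrative.

```ruby
str = "This is a string"
a = ["this", "is", "something"]

# Count every word in the string once.
counts = str.downcase.split.tally
# => {"this"=>1, "is"=>1, "a"=>1, "string"=>1}

# Keep only the words we care about, defaulting to 0 when absent.
wanted = a.to_h { |w| [w, counts.fetch(w, 0)] }
# => {"this"=>1, "is"=>1, "something"=>0}

present = wanted.select { |_, n| n > 0 }.keys
# => ["this", "is"]
```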

Related

Check whether a string contains all the characters of another string in Ruby

Let's say I have a string, like string= "aasmflathesorcerersnstonedksaottersapldrrysaahf". If you haven't noticed, you can find the phrase "harry potter and the sorcerers stone" in there (minus the space).
I need to check whether string contains all the characters of another string.
string.include?("sorcerer") #=> true
string.include?("harrypotterandthesorcerersstone") #=> false, even though it contains all the letters to spell harrypotterandthesorcerersstone
include? does not work on a shuffled string.
How can I check if a string contains all the characters of another string?
Sets and array intersection don't account for repeated chars, but a histogram / frequency counter does:
require 'facets'
s1 = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
s2 = "harrypotterandtheasorcerersstone"
freq1 = s1.chars.frequency
freq2 = s2.chars.frequency
freq2.all? { |char2, count2| freq1[char2] >= count2 }
#=> true
Write your own Array#frequency if you don't want the facets dependency.
class Array
  def frequency
    Hash.new(0).tap { |counts| each { |v| counts[v] += 1 } }
  end
end
I presume that if the string to be checked is "sorcerer", string must include, for example, three "r"'s. If so you could use the method Array#difference, which I've proposed be added to the Ruby core.
class Array
  def difference(other)
    h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
    reject { |e| h[e] > 0 && h[e] -= 1 }
  end
end
str = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
target = "sorcerer"
target.chars.difference(str.chars).empty?
#=> true
target = "harrypotterandtheasorcerersstone"
target.chars.difference(str.chars).empty?
#=> true
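Note that since Ruby 2.6 the core library has its own `Array#difference`, which behaves like `Array#-` and ignores how many times an element occurs, so the monkey patch above would replace it. A sketch that avoids reopening Array, using the hypothetical method name `remove_once`:

```ruby
# Hypothetical stand-alone version of the multiset difference above:
# drops one occurrence from `source` for each matching element of `other`.
def remove_once(source, other)
  remaining = other.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  source.reject { |e| remaining[e] > 0 && remaining[e] -= 1 }
end

str = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
remove_once("sorcerer".chars, str.chars).empty?   # => true
remove_once("zebra".chars, str.chars)             # => ["z", "b"]
```

The leftover array tells you which characters of the target could not be found.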
If the characters of target must not only be in str, but must be in the same order, we could write:
target = "sorcerer"
r = Regexp.new(target.chars.join(".*"))
#=> /s.*o.*r.*c.*e.*r.*e.*r/
str =~ r
#=> 2 (truthy)
(or !!(str =~ r) #=> true)
target = "harrypotterandtheasorcerersstone"
r = Regexp.new(target.chars.join(".*"))
#=> /h.*a.*r.*r.*y.* ... o.*n.*e/
str =~ r
#=> nil
A different albeit not necessarily better solution using sorted character arrays and sub-strings:
Given your two strings...
subject = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
search = "harrypotterandthesorcerersstone"
You can sort your subject string using .chars.sort.join...
subject = subject.chars.sort.join # => "aaaaaaacddeeeeeffhhkllmnnoooprrrrrrssssssstttty"
And then produce a list of substrings to search for:
search = search.chars.group_by(&:itself).values.map(&:join)
# => ["hh", "aa", "rrrrrr", "y", "p", "ooo", "tttt", "eeeee", "nn", "d", "sss", "c"]
Alternatively, you could produce the same set of substrings (in sorted order) from the original search string like this:
search = search.chars.sort.join.scan(/((.)\2*)/).map(&:first)
And then simply check whether every search sub-string appears within the sorted subject string:
search.all? { |c| subject[c] }
Create a 2 dimensional array out of your string letter bank, to associate the count of letters to each letter.
Create a 2 dimensional array out of the harry potter string in the same way.
Loop through both and do comparisons.
I have no experience in Ruby but this is how I would start to tackle it in the language I know most, which is Java.
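In Ruby, the steps above could be sketched roughly as follows. The method names `letter_counts` and `spellable?` are made up for illustration, and arrays of `[letter, count]` pairs stand in for the two-dimensional arrays described above.

```ruby
# Build an array of [letter, count] pairs, e.g. [["a", 2], ["b", 1]].
def letter_counts(str)
  str.chars.group_by(&:itself).map { |ch, group| [ch, group.size] }
end

# True if every letter of `phrase` occurs in `bank` at least as many times.
def spellable?(bank, phrase)
  bank_counts = letter_counts(bank)
  letter_counts(phrase).all? do |ch, needed|
    pair = bank_counts.assoc(ch)   # finds the [letter, count] row, or nil
    pair && pair[1] >= needed
  end
end

bank = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
spellable?(bank, "harrypotterandthesorcerersstone")  # => true
spellable?("abc", "aab")                             # => false
```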

Array of strings Group by first common letters [closed]

Is there any way to group an array of strings by their common leading letters?
For example:
array = [ 'hello', 'hello you', 'people', 'finally', 'finland' ]
So when I do
array.group_by{ |string| some_logic_with_string }
The result should be,
{
'hello' => ['hello', 'hello you'],
'people' => ['people'],
'fin' => ['finally', 'finland']
}
NOTE: Some test cases are ambiguous and their expectations conflict with other tests; you need to fix them.
I guess plain group_by may not work; further processing is needed.
I have come up with the code below, which seems to handle all the given test cases in a consistent manner.
I have left notes in the code to explain the logic. The only way to fully understand it is to inspect the value of h and follow the flow for a simple test case.
def group_by_common_chars(array)
  # We will iteratively group by as many time as there are characters
  # in a largest possible key, which is max length of all strings
  max_len = array.max_by {|i| i.size}.size
  # First group by first character.
  h = array.group_by{|i| i[0]}
  # Now iterate remaining (max_len - 1) times
  (1...max_len).each do |c|
    # Let's perform a group by next set of starting characters.
    t = h.map do |k,v|
      h1 = v.group_by {|i| i[0..c]}
    end.reduce(&:merge)
    # We need to merge the previously generated hash
    # with the hash generated in this iteration. Here things get tricky.
    # If previously, we had
    # {"a" => ["a"], "ab" => ["ab", "abc"]},
    # and now, we have
    # {"a"=>["a"], "ab"=>["ab"], "abc"=>["abc"]},
    # We need to merge the two hashes such that we have
    # {"a"=>["a"], "ab"=>["ab", "abc"], "abc"=>["abc"]}.
    # Note that `Hash#merge`'s block is called only for common keys, so, "abc"
    # will get merged, we can't do much about it now. We will process
    # it later in the loop
    h = h.merge(t) do |k, o, n|
      if (o.size != n.size)
        diff = [o,n].max - [o,n].min
        if diff.size == 1 && t.value?(diff)
          [o,n].max
        else
          [o,n].min
        end
      else
        o
      end
    end
  end
  # Sort by key length, smallest in the beginning.
  h = h.sort {|i,j| i.first.size <=> j.first.size }.to_h
  # Get rid of those key-value pairs, where value is single element array
  # and that single element is already part of another key-value pair, and
  # that value array has more than one element. This step will allow us
  # to get rid of key-value like "abc"=>["abc"] in the example discussed
  # above.
  h = h.tap do |h|
    keys = h.keys
    keys.each do |k|
      v = h[k]
      if (v.size == 1 &&
          h.key?(v.first) &&
          h.values.flatten.count(v.first) > 1) then
        h.delete(k)
      end
    end
  end
  # Get rid of those keys whose value array consist of only elements that
  # already part of some other key. Since, hash is ordered by key's string
  # size, this process allows us to get rid of those keys which are smaller
  # in length but consists of only elements that are present somewhere else
  # with a key of larger length. For example, it lets us to get rid of
  # "a"=>["aba", "abb", "aaa", "aab"] from a hash like
  # {"a"=>["aba", "abb", "aaa", "aab"], "ab"=>["aba", "abb"], "aa"=>["aaa", "aab"]}
  h.tap do |h|
    keys = h.keys
    keys.each do |k|
      values = h[k]
      other_values = h.values_at(*(h.keys-[k])).flatten
      already_present = values.all? do |v|
        other_values.include?(v)
      end
      h.delete(k) if already_present
    end
  end
end
Sample Run:
p group_by_common_chars ['hello', 'hello you', 'people', 'finally', 'finland']
#=> {"fin"=>["finally", "finland"], "hello"=>["hello", "hello you"], "people"=>["people"]}
p group_by_common_chars ['a', 'ab', 'abc']
#=> {"a"=>["a"], "ab"=>["ab", "abc"]}
p group_by_common_chars ['aba', 'abb', 'aaa', 'aab']
#=> {"ab"=>["aba", "abb"], "aa"=>["aaa", "aab"]}
p group_by_common_chars ["Why", "haven't", "you", "answered", "the", "above", "questions?", "Please", "do", "so."]
#=> {"a"=>["answered", "above"], "do"=>["do"], "Why"=>["Why"], "you"=>["you"], "so."=>["so."], "the"=>["the"], "Please"=>["Please"], "haven't"=>["haven't"], "questions?"=>["questions?"]}
Not sure if you can group by all common letters. But if you only want to group by the first letter, here it is:
array = [ 'hello', 'hello you', 'people', 'finally', 'finland' ]
result = {}
array.each { |st| result[st[0]] = result.fetch(st[0], []) + [st] }
pp result
# => {"h"=>["hello", "hello you"], "p"=>["people"], "f"=>["finally", "finland"]}
Now result contains your desired hash.
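For what it's worth, `group_by` can do the same first-letter grouping in one call. The sketch below wraps it in a hypothetical helper, `group_by_prefix`, with a fixed prefix length `n`; it does not solve the variable-length "longest common prefix" case from the question.

```ruby
# Hypothetical helper: group strings by their first n characters.
def group_by_prefix(array, n = 1)
  array.group_by { |s| s[0, n] }
end

array = ['hello', 'hello you', 'people', 'finally', 'finland']
group_by_prefix(array)
# => {"h"=>["hello", "hello you"], "p"=>["people"], "f"=>["finally", "finland"]}
group_by_prefix(array, 3)
# => {"hel"=>["hello", "hello you"], "peo"=>["people"], "fin"=>["finally", "finland"]}
```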
Hmm, you're trying to do something that's pretty custom. I can think of two classical approaches that sort of do what you want: 1) Stemming and 2) Levenshtein Distance.
With stemming you're finding the root word to a longer word. Here's a gem for it.
Levenshtein is a famous algorithm which calculates the difference between two strings. There is a gem for it that runs pretty fast due to a native C extension.
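To give an idea of what Levenshtein distance computes, here is a plain-Ruby sketch of the usual dynamic-programming formulation. It is not the gem's implementation (the method name `levenshtein` is just illustrative) and it will be slower than a native extension.

```ruby
# Edit distance: minimum number of single-character insertions,
# deletions and substitutions needed to turn `a` into `b`.
def levenshtein(a, b)
  prev = (0..b.length).to_a            # distances from "" to each prefix of b
  a.each_char.with_index(1) do |ca, i|
    curr = [i]                         # distance from a[0, i] to ""
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,            # delete ca
               curr[j - 1] + 1,        # insert cb
               prev[j - 1] + cost].min # substitute (or keep on a match)
    end
    prev = curr
  end
  prev.last
end

levenshtein("kitten", "sitting")  # => 3
levenshtein("flaw", "lawn")       # => 2
```

A small distance relative to the word length could then be used as a signal that two strings belong in the same group.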

Find if all letters in a string are unique

I need to know if all letters in a string are unique. For a string to be unique, a letter can only appear once. If all letters in a string are distinct, the string is unique. If one letter appears multiple times, the string is not unique.
"Cwm fjord veg balks nth pyx quiz."
# => All 26 letters are used only once. This is unique
"This is a string"
# => Not unique, i and s are used more than once
"two"
# => unique, each letter is shown only once
I tried writing a function that determines whether or not a string is unique.
def unique_characters(string)
for i in ('a'..'z')
if string.count(i) > 1
puts "This string is unique"
else
puts "This string is not unique"
end
end
unique_characters("String")
I receive the output
"This string is unique" 26 times.
Edit:
I would like to humbly apologize for including an incorrect example in my OP. I did some research, trying to find pangrams, and assumed that they would only contain 26 letters. I would also like to thank you guys for pointing out my error. After that, I went on Wikipedia to find a perfect pangram (I wrongly thought the others were perfect).
Here is the link for reference purposes
http://en.wikipedia.org/wiki/List_of_pangrams#Perfect_pangrams_in_English_.2826_letters.29
Once again, my apologies.
s = "The quick brown fox jumps over the lazy dog.".downcase
("a".."z").all?{|c| s.count(c) <= 1}
# => false
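The method in the question prints a verdict once per letter because the `puts` calls sit inside the loop (and the condition is inverted). Here is a sketch of how the `all?` idea above could be folded back into that method; the predicate-style name `unique_characters?` and returning a boolean instead of printing are my assumptions:

```ruby
# Returns true when no letter a-z occurs more than once (case-insensitive).
def unique_characters?(string)
  s = string.downcase
  ('a'..'z').all? { |c| s.count(c) <= 1 }
end

unique_characters?("Cwm fjord veg balks nth pyx quiz.")  # => true
unique_characters?("This is a string")                   # => false
unique_characters?("String")                             # => true
```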
Another way to do it is:
s = "The quick brown fox jumps over the lazy dog."
(s.downcase !~ /([a-z]).*\1/)
# => false
I would solve this in two steps: 1) extract the letters 2) check if there are duplicates:
letters = string.scan(/[a-z]/i) # use string.downcase.scan(/[a-z]/) to ignore case
letters.length == letters.uniq.length
Here is a method that does not convert the string to an array:
def dupless?(str)
  str.downcase.each_char.with_object('') { |c,s|
    c =~ /[a-z]/ && s.include?(c) ? (return false) : s << c }
  true
end
dupless?("Cwm fjord veg balks nth pyx quiz.") #=> true
dupless?("This is a string.") #=> false
dupless?("two") #=> true
dupless?("Two tubs") #=> false
If you want to actually keep track of the duplicate characters:
def is_unique?(string)
  # Remove whitespaces
  string = string.gsub(/\s+/, "")
  # Build a hash counting all occurrences of each character
  h = Hash.new { |hash, key| hash[key] = 0 }
  string.chars.each { |c| h[c] += 1 }
  # An array containing all the repeated characters
  res = h.keep_if {|k, c| c > 1}.keys
  if res.size == 0
    puts "All #{string.size} characters are used only once. This is unique"
  else
    puts "Not unique #{res.join(', ')} are used more than once"
  end
end
is_unique?("This is a string") # Not unique i, s are used more than once
is_unique?("two") # All 3 characters are used only once. This is unique
To check whether a string is unique and also uses every letter of the alphabet (a perfect pangram), you can try this:
string_input.downcase.gsub(/[^a-z]/, '').split("").sort.join('') == ('a' .. 'z').to_a.join('')
This returns true only if every letter in your string is unique and all 26 letters are present.
def has_uniq_letters?(str)
  letters = str.gsub(/[^A-Za-z]/, '').chars
  letters == letters.uniq
end
If this doesn't have to be case sensitive,
def has_uniq_letters?(str)
  letters = str.downcase.gsub(/[^a-z]/, '').chars
  letters == letters.uniq
end
In your example, you mentioned you wanted additional information about your string (list of unique characters, etc), so this example may also be useful to you.
# s = "Cwm fjord veg balks nth pyx quiz."
s = "This is a test string."
totals = Hash.new(0)
s.downcase.each_char { |c| totals[c] += 1 if ('a'..'z').cover?(c) }
duplicates, uniques = totals.partition { |k, v| v > 1 }
duplicates, uniques = Hash[duplicates], Hash[uniques]
# duplicates = {"t"=>4, "i"=>3, "s"=>4}
# uniques = {"h"=>1, "a"=>1, "e"=>1, "r"=>1, "n"=>1, "g"=>1}

Regular expression and String

With the expression below:
words = string.scan(/\b\S+\b/i)
I am trying to scan through the string with word boundaries and case insensitivity, so if I have:
string = "A ball a Ball"
then when I have this each block:
words.each { |word| result[word] += 1 }
I am anticipating something like:
{"a"=>2, "ball"=>2}
But instead what I get is:
{"A"=>1, "ball"=>1, "a"=>1, "Ball"=>1}
After this didn't work, I tried to create a new Regexp like:
Regexp.new(Regexp.escape(string), "i")
but then I do not know how to use this or move forward from here.
The regex matches words in case-insensitive mode, but it doesn't alter matched text in any way. So you will receive text in its original form in the block. Try casting strings to lower case when counting.
string = "A ball a Ball"
words = string.scan(/\b\S+\b/i) # => ["A", "ball", "a", "Ball"]
result = Hash.new(0)
words.each { |word| result[word.downcase] += 1 }
result # => {"a"=>2, "ball"=>2}
The regexp is fine; your problem is when you increment your counter using the hash. Hash keys are case sensitive, so you must normalize the case when incrementing (use downcase instead if you want lowercase keys, as in your expected output):
words.each { |word| result[word.upcase] += 1 }
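On Ruby 2.7+, the whole count can probably be collapsed into one expression by lowercasing the string first and using `Enumerable#tally`. The `i` flag is dropped in this sketch because `\S` and `\b` are not affected by case anyway.

```ruby
string = "A ball a Ball"

# Lowercase first, then split into words and count them.
result = string.downcase.scan(/\b\S+\b/).tally
# => {"a"=>2, "ball"=>2}
```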

How to make sure certain elements not get into arrays in Ruby

I have an array, let's say:
array1 = ["abc", "a", "wxyz", "ab",......]
How do I make sure that neither short strings such as "a" (any 1 character), "ab" (any 2 characters), or "abc" (any 3 characters), nor words like "that", "this", "what", etc., nor any foul words are saved in array1?
This removes elements with fewer than 4 characters and words like this, that, what from array1 (if I got it right):
array1.reject! do |el|
  el.length < 4 || ['this', 'that', 'what'].include?(el)
end
This changes array1 in place. If you use reject (without !), it returns the filtered result and leaves array1 unchanged.
You can reopen the Array class and add a new method that disallows certain words. Example:
class Array
  def add(ele)
    unless rejects.include?(ele)
      self.push ele
    end
  end

  def rejects
    ['this', 'that', 'what']
  end
end
arr = []
arr.add "one"
puts arr
arr.add "this"
puts arr
arr.add "aslam"
puts arr
Output would be:
one
one
one
aslam
And notice the word "this" was not added.
You could create a stop list. Using a hash for this is more efficient than an array, because hash lookup time is constant on average, whereas array lookup time is proportional to the number of elements in the array. If you are going to check for stop words a lot, I suggest using a hash that contains all the stop words. Using your code, you could do the following:
badwords_a = ["abc", "a", "wxyz", "ab"] # Your array of bad words
badwords_h = {} # Initialize an empty hash
badwords_a.each { |word| badwords_h[word] = nil } # Fill the hash
goodwords = []
words_to_process = ["abc", "a", "Foo", "Bar"] # A list of words you want to process
words_to_process.each do |word| # Process new words
  # Add the word only if it is not in the bad list
  goodwords << word unless badwords_h.key?(word)
end
puts goodwords.join(", ")
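Ruby's standard library also has `Set`, which gives the same hash-style lookups without the dummy `nil` values. A minimal sketch combining it with the length rule from the question; the constant `STOP_WORDS`, the method `acceptable?` and the sample words are all hypothetical:

```ruby
require 'set'

# Hypothetical stop list; extend it with whatever words you want to block.
STOP_WORDS = Set.new(%w[this that what])

# A word is kept if it has at least 4 characters and is not a stop word.
def acceptable?(word)
  word.length >= 4 && !STOP_WORDS.include?(word.downcase)
end

words = ["abc", "a", "wxyz", "ab", "This", "ruby"]
kept = words.select { |w| acceptable?(w) }
# => ["wxyz", "ruby"]
```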
