Regular expression and String - ruby

With the expression below:
words = string.scan(/\b\S+\b/i)
I am trying to scan through the string with word boundaries and case insensitivity, so if I have:
string = "A ball a Ball"
then when I have this each block:
words.each { |word| result[word] += 1 }
I am anticipating something like:
{"a"=>2, "ball"=>2}
But instead what I get is:
{"A"=>1, "ball"=>1, "a"=>1, "Ball"=>1}
After this didn't work, I tried to create a new Regexp like:
Regexp.new(Regexp.escape(string), "i")
but then I do not know how to use this or move forward from here.

The regex matches words in case-insensitive mode, but it doesn't alter the matched text in any way, so each word reaches the block in its original form. Try converting the strings to lower case when counting.
string = "A ball a Ball"
words = string.scan(/\b\S+\b/i) # => ["A", "ball", "a", "Ball"]
result = Hash.new(0)
words.each { |word| result[word.downcase] += 1 }
result # => {"a"=>2, "ball"=>2}

The regexp is fine; your problem is when you increment your counter using the hash. Hash keys are case sensitive, so you must normalize the case when incrementing (downcase gives the lowercase keys you expect):
words.each { |word| result[word.downcase] += 1 }
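For comparison, the whole count can be sketched as one chain, assuming Ruby 2.7 or later, where Enumerable#tally is available. The name counts is only illustrative, and the /i flag is dropped because \S contains no letters for it to affect.

```ruby
string = "A ball a Ball"

# Normalize the case of each matched word, then let tally build the
# frequency hash (word => number of occurrences).
counts = string.scan(/\b\S+\b/).map(&:downcase).tally

p counts
```

Downcasing before counting is what merges "A" with "a" and "Ball" with "ball"; the regex itself is unchanged apart from the flag.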

Related

Why does ruby's IO readlines method behave differently when followed by a filter

I'm building a little Wordle inspired project for fun and am gathering the words from my local dictionary. Originally I was doing this:
word_list = File.readlines("/usr/share/dict/words", chomp: true)
word_list.filter { |word| word.length == 5 }.map(&:upcase)
The first line takes absolutely ages. However when doing this:
word_list = File.readlines("/usr/share/dict/words", chomp: true).filter { |word| word.length == 5 }.map(&:upcase)
it completes in a matter of seconds. I can't work out how the filter block could be applied to the lines before they are assigned memory (which I assume is what makes the read slow). Clearly each method isn't being fully applied before the next is called, but that is how I thought method chaining worked.
Let's create a file.
File.write('t', "dog horse\npig porcupine\nowl zebra\n") #=> 34
then
a = File.readlines("t", chomp:true)
#=> ["dog horse", "pig porcupine", "owl zebra"]
so your block variable word holds a string of two words. That's obviously not what you want.
You could use IO::read to "gulp" the file into a string.
s = File.read("t")
#=> "dog horse\npig porcupine\nowl zebra\n"
then
a = s.scan(/\w+/)
#=> ["dog", "horse", "pig", "porcupine", "owl", "zebra"]
b = a.select { |word| word.size == 5 }
#=> ["horse", "zebra"]
c = b.map(&:upcase)
#=> ["HORSE", "ZEBRA"]
We could of course chain these operations:
File.read("t").scan(/\w+/).select { |word| word.size == 5 }.map(&:upcase)
#=> ["HORSE", "ZEBRA"]
scan(/\w+/) matches each string of word characters (letters, digits and underscores). To match only letters change that to scan(/[a-zA-Z]+/).
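A small sketch of the difference between the two patterns; the sample string is made up for illustration.

```ruby
sample = "snake_case item42 can't"

# \w+ keeps digits and underscores inside a word.
word_chars = sample.scan(/\w+/)

# [a-zA-Z]+ breaks words at anything that is not an ASCII letter.
letters_only = sample.scan(/[a-zA-Z]+/)

p word_chars
p letters_only
```

Neither pattern treats the apostrophe as part of a word, so "can't" is split in both cases.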
You could use IO::readlines, which reads lines into an array, by extracting the words from each line, keeping the ones that have 5 characters, and then adding those words, after upcasing, to an initially-empty array.
File.readlines('t').each_with_object([]) do |line, arr|
  line.scan(/\w+/)
      .select { |word| word.size == 5 }
      .map(&:upcase)
      .each { |word| arr << word }
end
#=> ["HORSE", "ZEBRA"]
You could add the optional parameter chomp: true to readlines' arguments, but there is no reason to do so because scan(/\w+/) never matches the trailing newline.
Better would be to use IO::foreach which, without a block, returns an enumerator that can be chained, avoiding the temporary array created by readlines.
File.foreach('t').with_object([]) do |line,arr|
line.scan(/\w+/)
.select { |word| word.size == 5 }
.map(&:upcase)
.each { |word| arr << word }
end
#=> ["HORSE", "ZEBRA"]
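The same idea can be sketched more compactly with flat_map, which gathers the words of every line into one array. The method name five_letter_words is hypothetical, and a temporary file stands in for the dictionary so the sketch is self-contained.

```ruby
require "tempfile"

# Reads the file line by line and returns the upcased five-letter words.
def five_letter_words(path)
  File.foreach(path)                          # enumerator over lines
      .flat_map { |line| line.scan(/\w+/) }   # all words, in file order
      .select { |word| word.size == 5 }
      .map(&:upcase)
end

# Tempfile.create with a block returns the block's value and removes the file.
result = Tempfile.create("words") do |file|
  file.write("dog horse\npig porcupine\nowl zebra\n")
  file.flush
  five_letter_words(file.path)
end

p result
```

Note that flat_map still builds an array of every word in the file before filtering; for a very large file, selecting inside the block per line keeps that intermediate array smaller.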

Find multiple longest common prefixes from list of string

I'm trying to find all possible prefixes from a list of strings. We can remove "/" or ":" from prefixes to make it more readable.
input = ["item1", "item2", "product1", "product2", "variant:123", "variant:789"]
Expected output
item
product
variant
The key here is to find your delimiter. It looks like your delimiters are digits, ":" and "/". So you should be able to map through the array and use the delimiters in a regex to return the prefix. You also have the option to check that the prefix exists in the array (so you TRULY know that it's a prefix), but I didn't include that in my answer.
input = ["item1", "item2", "product1", "product2", "variant:123", "variant:789"]
prefixes = input.map {|word| word.gsub(/:?\/?[0-9]/, '')}.uniq
=> ["item", "product", "variant"]
If you have more delimiters, you can append them to the regex pattern. Do a little reading about regex wildcards :-)
Hope this answers your question!
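One way to keep the delimiters in a single place is a character class, sketched here under the assumption that a prefix is everything before the first digit, ":" or "/". The constant name DELIMITERS is illustrative.

```ruby
input = ["item1", "item2", "product1", "product2", "variant:123", "variant:789"]

# Every character that ends a prefix; extend the class to add delimiters.
DELIMITERS = /[:\/\d]/

# split with a limit of 2 cuts each word at its first delimiter only.
prefixes = input.map { |word| word.split(DELIMITERS, 2).first }.uniq

p prefixes
```

Unlike the gsub version, this discards everything after the first delimiter, so a word such as "item1x" would yield "item" rather than "itemx".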
I assume the order of the prefixes that is returned is unimportant. Also, I have disregarded the treatment of the characters "/" and ":" because that is straightforward and a distraction to the central problem.
Let's first create a helper method whose sole argument is an array of words that begin with the same letter.
def longest_prefix(arr)
a0, *a = arr
return a0 if a0.size == 1 || arr.size == 1
n = (1..a0.size-1).find do |i|
c = a0[i]
a.any? { |w| w[i] != c }
end
n.nil? ? a0 : a0[0,n]
end
For example,
arr = ["dog1", "doggie", "dog2", "doggy"]
longest_prefix arr
#=> "dog"
We now merely need to group the words by their first letters, then map the resulting key-value pairs to the return value of the helper method when its argument equals the value of the key-value pair.
def prefixes(arr)
arr.group_by { |w| w[0] }.map { |_,a| longest_prefix(a) }
end
Suppose, for example,
arr = ["dog1", "eagles", "eagle", "doggie", "dog2", "catsup",
"cats", "elephant", "cat", "doggy", "caustic"]
Then
prefixes arr
#=> ["dog", "e", "ca"]
Note that
arr.group_by { |w| w[0] }
#=> { "d"=>["dog1", "doggie", "dog2", "doggy"],
# "e"=>["eagles", "eagle", "elephant"],
# "c"=>["catsup", "cats", "cat", "caustic"] }
See Enumerable#group_by.
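Applied to the input from the question, the grouping approach can be sketched as follows. This is a compact, self-contained variant rather than the exact methods above: the names common_prefix and readable_prefixes are hypothetical, and ":" and "/" are stripped at the end with String#delete.

```ruby
# Longest prefix shared by every word in the (non-empty) array.
def common_prefix(words)
  first, *rest = words
  # Index of the first position at which some other word differs.
  length = (0...first.size).find { |i| rest.any? { |w| w[i] != first[i] } }
  first[0, length || first.size]
end

# Group by first letter, take each group's common prefix, drop ":" and "/".
def readable_prefixes(words)
  words.group_by { |w| w[0] }
       .map { |_, group| common_prefix(group).delete(":/") }
end

input = ["item1", "item2", "product1", "product2", "variant:123", "variant:789"]
result = readable_prefixes(input)
p result
```

Grouping by first letter means two unrelated families that happen to start with the same letter collapse to a very short prefix, as "e" and "ca" did in the example above.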

Check whether a string contains all the characters of another string in Ruby

Let's say I have a string, like string= "aasmflathesorcerersnstonedksaottersapldrrysaahf". If you haven't noticed, you can find the phrase "harry potter and the sorcerers stone" in there (minus the space).
I need to check whether string contains all the characters of another string.
string.include? ("sorcerer") #=> true
string.include? ("harrypotterandtheasorcerersstone") #=> false, even though it contains all the letters to spell harrypotterandthesorcerersstone
include? does not work when the characters are shuffled.
How can I check if a string contains all the elements of another string?
Sets and array intersection don't account for repeated chars, but a histogram / frequency counter does:
require 'facets'
s1 = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
s2 = "harrypotterandtheasorcerersstone"
freq1 = s1.chars.frequency
freq2 = s2.chars.frequency
freq2.all? { |char2, count2| freq1[char2] >= count2 }
#=> true
Write your own Array#frequency if you don't want the facets dependency.
class Array
def frequency
Hash.new(0).tap { |counts| each { |v| counts[v] += 1 } }
end
end
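On Ruby 2.7 or later, the same frequency comparison can be sketched with the built-in Enumerable#tally, with no dependency and no monkey patch. The method name contains_all_chars? is hypothetical.

```ruby
# True when haystack has at least as many of each character as needle.
def contains_all_chars?(haystack, needle)
  available = haystack.each_char.tally
  needle.each_char.tally.all? do |char, count|
    # fetch with a default covers characters missing from haystack.
    available.fetch(char, 0) >= count
  end
end

s1 = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
p contains_all_chars?(s1, "harrypotterandthesorcerersstone")
p contains_all_chars?(s1, "zebra")
```

Using fetch(char, 0) avoids comparing nil with an integer when the needle contains a character the haystack lacks.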
I presume that if the string to be checked is "sorcerer", string must include, for example, three "r"'s. If so you could use the method Array#difference, which I've proposed be added to the Ruby core.
class Array
def difference(other)
h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
reject { |e| h[e] > 0 && h[e] -= 1 }
end
end
str = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
target = "sorcerer"
target.chars.difference(str.chars).empty?
#=> true
target = "harrypotterandtheasorcerersstone"
target.chars.difference(str.chars).empty?
#=> true
If the characters of target must not only be in str, but must be in the same order, we could write:
target = "sorcerer"
r = Regexp.new "#{ target.chars.join "\.*" }"
#=> /s.*o.*r.*c.*e.*r.*e.*r/
str =~ r
#=> 2 (truthy)
(or !!(str =~ r) #=> true)
target = "harrypotterandtheasorcerersstone"
r = Regexp.new "#{ target.chars.join "\.*" }"
#=> /h.*a.*r.*r.*y.* ... o.*n.*e/
str =~ r
#=> nil
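The in-order check can also be sketched without building a regex, which sidesteps escaping if target ever contains characters such as "." or "*". The method name subsequence? is hypothetical.

```ruby
# True when the characters of target appear in str in the same order,
# not necessarily adjacent to one another.
def subsequence?(str, target)
  position = 0
  target.each_char.all? do |char|
    found = str.index(char, position)   # search from the current position
    position = found + 1 if found
    found                               # nil ends the scan; 0 is truthy in Ruby
  end
end

str = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
p subsequence?(str, "sorcerer")
p subsequence?(str, "harrypotterandtheasorcerersstone")
```

Taking the earliest possible match for each character is sufficient here: if any in-order match exists, the greedy left-to-right scan finds one.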
A different albeit not necessarily better solution using sorted character arrays and sub-strings:
Given your two strings...
subject = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
search = "harrypotterandthesorcerersstone"
You can sort your subject string using .chars.sort.join...
subject = subject.chars.sort.join # => "aaaaaaacddeeeeeffhhkllmnnoooprrrrrrssssssstttty"
And then produce a list of substrings to search for:
search = search.chars.group_by(&:itself).values.map(&:join)
# => ["hh", "aa", "rrrrrr", "y", "p", "ooo", "tttt", "eeeee", "nn", "d", "sss", "c"]
You could alternatively produce the same set of substrings using this method
search = search.chars.sort.join.scan(/((.)\2*)/).map(&:first)
And then simply check whether every search sub-string appears within the sorted subject string:
search.all? { |c| subject[c] }
Create a 2 dimensional array out of your string letter bank, to associate the count of letters to each letter.
Create a 2 dimensional array out of the harry potter string in the same way.
Loop through both and do comparisons.
I have no experience in Ruby but this is how I would start to tackle it in the language I know most, which is Java.
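The steps described above might look like this in Ruby, as a rough sketch. The method names letter_counts and spellable? are hypothetical, and each "2 dimensional array" is an array of [letter, count] pairs.

```ruby
# Builds an array of [letter, count] pairs for the given text.
def letter_counts(text)
  text.chars.group_by(&:itself).map { |char, occurrences| [char, occurrences.size] }
end

# Compares the two pair arrays: every letter of phrase must be available
# in bank at least as many times as phrase needs it.
def spellable?(bank, phrase)
  bank_counts = letter_counts(bank)
  letter_counts(phrase).all? do |char, needed|
    pair = bank_counts.assoc(char)   # first pair whose letter equals char
    !pair.nil? && pair[1] >= needed
  end
end

bank = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
p spellable?(bank, "harrypotterandthesorcerersstone")
p spellable?(bank, "hhh")
```

A hash keyed by letter would make the lookup faster than assoc on an array of pairs, but the pair form mirrors the description more directly.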

Why does capitalizing the first letter of string elements alter an array?

The following code is intended to capitalize the first letter of each word in a string, and it works:
def capitalize_words(string)
words = string.split(" ")
idx = 0
while idx < words.length
word = words[idx]
word[0] = word[0].upcase
idx += 1
end
return words.join(" ")
end
capitalize_words("this is a sentence") # => "This Is A Sentence"
capitalize_words("mike bloomfield") # => "Mike Bloomfield"
I do not understand why it works. In the while loop, I did not set any element in the words array to anything new. I understand that it might work if I added the following line before the index iteration:
words[idx] = word
I would then be altering the elements of words. However, the code works even without that line.
yet in no place in the while loop that I am using to capitalize the first letter of each word do I actually set any of the elements in the "words" array to anything new.
You do, actually, right here:
word = words[idx]
word[0] = word[0].upcase # This changes words[idx][0]!
The upcase method does just that: it returns an upcased copy of a given string and leaves the receiver unchanged. For example:
'example'.upcase
# => "EXAMPLE"
'example'[0].upcase
# => "E"
The method String#[]= that you are using in:
word[0] = ...
is not variable assignment. It alters the content of the receiver string at the given index, retaining the identity of the string as an object. And since word is not a copy but is the original string taken from words, in turn, you are modifying words.
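The distinction between mutating a string and rebinding a variable can be checked directly. This is a minimal sketch; the variable names are illustrative.

```ruby
words = "mike bloomfield".split
word = words[0]

# word and words[0] refer to the very same String object.
same_object = word.equal?(words[0])

# String#[]= mutates that object in place, so the array sees the change.
word[0] = word[0].upcase
after_mutation = words.first

# Plain assignment only rebinds the local variable; the array is untouched.
word = "someone else"
after_rebinding = words.first

p same_object, after_mutation, after_rebinding
```

Had the loop used word = word.capitalize instead of word[0] = ..., the array would not have changed, because that form creates a new string and rebinds word to it.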
You're doing a lot of work that you don't have to:
def capitalize_words(string)
string.split.map{ |w|
[w[0].upcase, w[1..-1]].join # => "Foo", "Bar"
}.join(' ')
end
capitalize_words('foo bar')
# => "Foo Bar"
Breaking it down:
'foo'[0] # => "f"
'foo'[0].upcase # => "F"
'foo'[1..-1] # => "oo"
['F', 'oo'].join # => "Foo"
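A variant that leaves the original whitespace intact can be sketched with gsub and a block, assuming Ruby 2.6 or later for the endless range. The method name capitalize_each_word is hypothetical.

```ruby
# Upcases the first character of every run of non-whitespace characters.
# The input string is not modified; gsub returns a new string.
def capitalize_each_word(string)
  string.gsub(/\S+/) { |word| word[0].upcase + word[1..] }
end

p capitalize_each_word("this is a sentence")
p capitalize_each_word("don't  stop")
```

String#capitalize is deliberately avoided here because it also downcases the rest of each word, which would turn "McCartney" into "Mccartney".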

Compare string against array and extract array elements present in ruby

I have the following string:
str = "This is a string"
What I want to do is compare it with this array:
a = ["this", "is", "something"]
The result should be an array with "this" and "is" because both are present in the array and in the given string. "something" is not present in the string so it shouldn't appear. How can I do this?
One way to do this:
str = "This is a string"
a = ["this","is","something"]
str.downcase.split & a
# => ["this", "is"]
I am assuming the elements of array a will always be in lower case.
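If that assumption does not hold, or the string contains punctuation, both sides can be normalized before comparing. This is a sketch with made-up sample data; words_in_str and matches are illustrative names.

```ruby
str = "This is a string."
a = ["This", "IS", "something", "string"]

# scan(/\w+/) drops punctuation that split would leave attached to a word.
words_in_str = str.downcase.scan(/\w+/)

# Compare case-insensitively but return the elements of a as written.
matches = a.select { |word| words_in_str.include?(word.downcase) }

p matches
```

With a plain split, the last word would be "string." and would not match "string".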
There are always many ways to do this sort of thing:
str = "this is the example string"
words_to_compare = ["dogs", "ducks", "seagulls", "the"]
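# Anchor the pattern so that, e.g., "thesis" is not accepted just because it contains "the".
pattern = /\A#{Regexp.union(str.split)}\z/
words_to_compare.select { |word| word =~ pattern }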
#=> ["the"]
Your question has an XY problem smell to it. Usually when we want to find what words exist the next thing we want to know is how many times they exist. Frequency counts are all over the internet and Stack Overflow. This is a minor modification to such a thing:
str = "This is a string"
a = ["this", "is", "something"]
a_hash = a.each_with_object({}) { |i, h| h[i] = 0 } # => {"this"=>0, "is"=>0, "something"=>0}
That defined a_hash with the keys being the words to be counted.
str.downcase.split.each{ |k| a_hash[k] += 1 if a_hash.key?(k) }
a_hash # => {"this"=>1, "is"=>1, "something"=>0}
a_hash now contains the counts of the word occurrences. if a_hash.key?(k) is the main difference we'd see compared to a regular word-count as it's only allowing word-counts to occur for the words in a.
a_hash.keys.select{ |k| a_hash[k] > 0 } # => ["this", "is"]
It's easy to find the words that were in common because the counter is > 0.
This is a very common problem in text processing so it's good knowing how it works and how to bend it to your will.
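On Ruby 2.7 or later, the same restricted count can be sketched with Enumerable#tally. The names counts, restricted and present are illustrative.

```ruby
str = "This is a string"
a = ["this", "is", "something"]

# Count every word in the string once.
counts = str.downcase.split.tally

# Restrict the counts to the words of interest, defaulting to zero.
restricted = a.to_h { |word| [word, counts.fetch(word, 0)] }

# Words of a that occur at least once in the string.
present = a.select { |word| counts.key?(word) }

p restricted
p present
```

This counts all words first and filters afterwards, which is simpler to read but does slightly more work than skipping unwanted words during the count.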
