Given a search string and a result string (which is guaranteed to contain all letters of the search string, case-insensitive, in order), how can I most efficiently get an array of ranges representing the indices in the result string corresponding to the letters in the search string?
Desired output:
substrings( "word", "Microsoft Office Word 2007" )
#=> [ 17..20 ]
substrings( "word", "Network Setup Wizard" )
#=> [ 3..5, 19..19 ]
#=> [ 3..4, 18..19 ] # Alternative, acceptable, less-desirable output
substrings( "word", "Watch Network Daemon" )
#=> [ 0..0, 10..11, 14..14 ]
This is for an autocomplete search box. Here's a screenshot from a tool similar to Quicksilver that underlines letters as I'm looking to do. Note that--unlike my ideal output above--this screenshot does not prefer longer single matches.
Benchmark Results
Benchmarking the current working answers shows that @tokland's regex-based answer is nearly as fast as the StringScanner-based solutions I put forth, with less code:
              user     system      total        real
phrogz1   0.889000   0.062000   0.951000 (  0.944000)
phrogz2   0.920000   0.047000   0.967000 (  0.977000)
tokland   1.030000   0.000000   1.030000 (  1.035000)
Here is the benchmark test:
a=["Microsoft Office Word 2007","Network Setup Wizard","Watch Network Daemon"]
b=["FooBar","Foo Bar","For the Love of Big Cars"]
test = { a=>%w[ w wo wor word ], b=>%w[ f fo foo foobar fb fbr ] }
require 'benchmark'
Benchmark.bmbm do |x|
  %w[ phrogz1 phrogz2 tokland ].each{ |method|
    x.report(method){ test.each{ |words,terms|
      words.each{ |master| terms.each{ |term|
        2000.times{ send(method,term,master) }
      } }
    } }
  }
end
To have something to start with, how about that?
>> s = "word"
>> re = /#{s.chars.map{|c| "(#{Regexp.escape(c)})" }.join(".*?")}/i # /(w).*?(o).*?(r).*?(d)/i
>> match = "Watch Network Daemon".match(re)
=> #<MatchData "Watch Network D" 1:"W" 2:"o" 3:"r" 4:"D">
>> 1.upto(s.length).map { |idx| match.begin(idx) }
=> [0, 10, 11, 14]
And now you only have to build the ranges (if you really need them, I guess the individual indexes are also ok).
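To make that last step concrete, here is a minimal sketch that groups consecutive indexes into ranges with Enumerable#slice_when (Ruby 2.2 or later). The method name substrings_via_regex is made up for illustration, and returning an empty array when nothing matches is an assumption, not something the question specifies.

```ruby
# Hypothetical helper: regex match positions grouped into ranges.
def substrings_via_regex(search, result)
  # One capture group per search character, separated by lazy gaps.
  pattern = search.chars.map { |c| "(#{Regexp.escape(c)})" }.join(".*?")
  match = result.match(/#{pattern}/i)
  return [] unless match # assumption: no match yields an empty array

  # Index of each captured character in the result string.
  indices = (1..search.length).map { |idx| match.begin(idx) }

  # Start a new slice wherever two neighbouring indexes are not consecutive.
  indices.slice_when { |a, b| b != a + 1 }.map { |run| run.first..run.last }
end

p substrings_via_regex("word", "Watch Network Daemon") # prints [0..0, 10..11, 14..14]
```

Because the gaps are lazy, this picks the earliest possible position for each letter, so it does not necessarily prefer longer single matches.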
Ruby's Abbrev module is a good starting point. It breaks down a string into a hash consisting of the unique keys that can identify the full word:
require 'abbrev'
require 'pp'
abbr = Abbrev::abbrev(['ruby'])
>> {"rub"=>"ruby", "ru"=>"ruby", "r"=>"ruby", "ruby"=>"ruby"}
For every keypress you can do a lookup and see if there's a match. I'd filter out all keys shorter than a certain length, to reduce the size of the hash.
The keys will also give you a quick set of words to look up the subword matches in your original string.
For fast lookups to see if there's a substring match:
regexps = Regexp.union(
abbr.keys.sort.reverse.map{ |k|
Regexp.new(
Regexp.escape(k),
Regexp::IGNORECASE
)
}
)
Note that the patterns are escaped, so characters the user might type, such as ?, * or ., are treated as literals rather than as regex metacharacters.
The result looks like:
/(?i-mx:ruby)|(?i-mx:rub)|(?i-mx:ru)|(?i-mx:r)/
Regexp's match will return information about what was found.
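Because the union "ORs" the patterns, the alternatives are tried from left to right and the first one that matches wins. With the keys in ascending order that would always be the shortest abbreviation, which is why the keys are sorted in reverse above, so that the longest keys are tried first.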
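A small sketch of why the order of the alternatives matters. The key list is written out by hand here rather than generated, and union_of is a hypothetical helper name used only for this illustration.

```ruby
# Build a case-insensitive alternation from literal (escaped) keys,
# trying them in the order given.
def union_of(keys)
  Regexp.union(keys.map { |k| Regexp.new(Regexp.escape(k), Regexp::IGNORECASE) })
end

keys = %w[r ru rub ruby]

p union_of(keys).match("Ruby rocks")[0]              # shortest key first: "R"
p union_of(keys.sort.reverse).match("Ruby rocks")[0] # longest key first: "Ruby"
p union_of(["c++"]).match?("I like C++")             # "+" is a literal here: true
```

The last line shows the effect of Regexp.escape: without it, "c++" would not be a valid pattern.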
That should give you a good start on what you want to do.
EDIT: Here's some code to directly answer the question. We've been busy at work so it's taken a couple of days to get back to this:
require 'abbrev'
require 'pp'
abbr = Abbrev::abbrev(['ruby'])
regexps = Regexp.union( abbr.keys.sort.reverse.map{ |k| Regexp.new( Regexp.escape(k), Regexp::IGNORECASE ) } )
target_str ='Ruby rocks, rub-a-dub-dub, RU there?'
str_offset = 0
offsets = []
loop do
match_results = regexps.match(target_str, str_offset)
break if (match_results.nil?)
s, e = match_results.offset(0)
offsets << [s, e - s]
str_offset = 1 + s
end
pp offsets
>> [[0, 4], [5, 1], [12, 3], [27, 2], [33, 1]]
If you want ranges, remember that e is the offset just past the end of the match, so use an exclusive range: replace offsets << [s, e - s] with offsets << (s...e), which will return:
>> [0...4, 5...6, 12...15, 27...29, 33...34]
Here's a late entrant that's making a move as it nears the finish line.
code
def substrings( search_str, result_str )
  search_chars = search_str.downcase.chars
  next_char = search_chars.shift
  result_str.downcase.each_char.with_index.take_while.with_object([]) do |(c,i),a|
    if next_char == c
      (a.empty? || i != a.last.last+1) ? a << (i..i) : a[-1]=(a.last.first..i)
      next_char = search_chars.shift
    end
    next_char
  end
end
demo
substrings( "word", "Microsoft Office Word 2007" ) #=> [17..20]
substrings( "word", "Network Setup Wizard" ) #=> [3..5, 19..19]
substrings( "word", "Watch Network Daemon" ) #=> [0..0, 10..11, 14..14]
benchmark
              user     system      total        real
phrogz1   1.120000   0.000000   1.120000 (  1.123083)
cary      0.550000   0.000000   0.550000 (  0.550728)
I don't think there are any built-in methods that will really help with this. Probably the best way is to go through each letter in the word you're searching for and build up the ranges manually. Your next best option would probably be to build a regex like in @tokland's answer.
Here's my implementation:
require 'strscan'
def substrings( search, master )
  [].tap do |ranges|
    scan = StringScanner.new(master)
    init = nil
    last = nil
    prev = nil
    search.chars.map do |c|
      return nil unless scan.scan_until /#{c}/i
      last = scan.pos-1
      if !init || (last-prev) > 1
        ranges << (init..prev) if init
        init = last
      end
      prev = last
    end
    ranges << (init..last)
  end
end
And here's a shorter version using another utility method (also needed by @tokland's answer):
require 'strscan'
def substrings( search, master )
  s = StringScanner.new(master)
  search.chars.map do |c|
    return nil unless s.scan_until(/#{c}/i)
    s.pos - 1
  end.to_ranges
end
class Array
  def to_ranges
    return [] if empty?
    [].tap do |ranges|
      init,last = first
      each do |o|
        if last && o != last.succ
          ranges << (init..last)
          init = o
        end
        last = o
      end
      ranges << (init..last)
    end
  end
end
Task I want to solve:
Write a program that takes a string, will perform a transformation and return it.
For each of the letters of the parameter string switch it by the next one in alphabetical order.
'z' becomes 'a' and 'Z' becomes 'A'. Case remains unaffected.
def rotone(param_1)
  a = ""
  param_1.each_char do |x|
    if x.count("a-zA-Z") > 0
      a << x.succ
    else
      a << x
    end
  end
  a
end
And I get this:
Input: "AkjhZ zLKIJz , 23y "
Expected Return Value: "BlkiA aMLJKa , 23z "
Return Value: "BlkiAA aaMLJKaa , 23z "
When the iterator reaches 'z' or 'Z', succ produces two characters instead of one: z -> aa and Z -> AA.
input = "AkjhZ zLKIJz , 23y"
Code
p input.tr('a-yA-YzZ','b-zB-ZaA')
Output
"BlkiA aMLJKa , 23z"
Your problem is that String#succ (aka String#next) has been designed in a way that does not serve your purpose when the receiver is 'z' or 'Z':
'z'.succ #=> 'aa'
'Z'.succ #=> 'AA'
If you replaced a << x.succ with a << x.succ[0] you would obtain the desired result.
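As a minimal sketch of that one-character fix applied to the original loop (the name rotone_fixed is only for illustration, and String#match? assumes Ruby 2.4 or later):

```ruby
# Hypothetical corrected version of the question's method.
def rotone_fixed(text)
  result = +"" # unfrozen empty string to append to
  text.each_char do |ch|
    # succ turns "z" into "aa" and "Z" into "AA"; keeping only the first
    # character gives the wrap-around the task asks for.
    result << (ch.match?(/[a-zA-Z]/) ? ch.succ[0] : ch)
  end
  result
end

p rotone_fixed("AkjhZ zLKIJz , 23y ") # prints "BlkiA aMLJKa , 23z "
```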
You might consider writing that as follows.
def rotone(param_1)
param_1.gsub(/./m) { |c| c.match?(/[a-z]/i) ? c.succ[0] : c }
end
String#gsub's argument is a regular expression that matches every character (so every character is passed to gsub's block; see note 1).
See also String#match?. The regular expression /[a-z]/i matches every character that is one of the characters in the character class [a-z]. The option i makes the match case-independent, so uppercase letters are matched as well.
Here is an alternative way to write the method that employs two hashes defined as constants.
CODE = [*'a'..'z', *'A'..'Z'].each_with_object({}) do |c,h|
h[c] = c.succ[0]
end.tap { |h| h.default_proc = proc { |_h,k| k } }
#=> {"a"=>"b", "b"=>"c",..., "y"=>"z", "z"=>"a",
# "A"=>"B", "B"=>"C",..., "Y"=>"Z", "Z"=>"A"}
DECODE = CODE.invert.tap { |h| h.default_proc = proc { |_h,k| k } }
#=> {"b"=>"a", "c"=>"b", ..., "z"=>"y", "a"=>"z",
# "B"=>"A", "C"=>"B", ..., "Z"=>"Y", "A"=>"Z"}
For example,
CODE['e'] #=> "f"
CODE['Z'] #=> "A"
CODE['?'] #=> "?"
DECODE['f'] #=> "e"
DECODE['A'] #=> "Z"
DECODE['?'] #=> "?"
Let's try using gsub, CODE and DECODE with an example string.
str = "The quick brown dog Zelda jumped over the lazy fox Arnie"
rts = str.gsub(/./m, CODE)
#=> "Uif rvjdl cspxo eph Afmeb kvnqfe pwfs uif mbaz gpy Bsojf"
rts.gsub(/./m, DECODE)
#=> "The quick brown dog Zelda jumped over the lazy fox Arnie"
See Enumerable#each_with_object, Object#tap, Hash#default_proc=, Hash#invert and the form of String#gsub that takes a hash as its optional second argument.
Adding the default proc to the hash h causes h[k] to return k if h does not have a key k. Had CODE been defined without the default proc,
CODE = [*'a'..'z', *'A'..'Z'].each_with_object({}) { |c,h| h[c] = c.succ[0] }
#=> {"a"=>"b", "b"=>"c",..., "y"=>"z", "z"=>"a",
# "A"=>"B", "B"=>"C",..., "Y"=>"Z", "Z"=>"A"}
gsub would replace every character that is not a letter with an empty string, removing it:
rts = str.gsub(/./m, CODE)
#=> "UifrvjdlcspxoephAfmebkvnqfepwfsuifmbazgpyBsojf"
Without the default proc we would have to write
rts = str.gsub(/./m) { |s| CODE.fetch(s, s) }
#=> "Uif rvjdl cspxo eph Afmeb kvnqfe pwfs uif mbaz gpy Bsojf"
See Hash#fetch.
1. The regular expression /./ matches every character other than line terminators. Adding the option m (/./m) causes . to match line terminators as well.
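A short illustration of the difference described in the note, using a made-up two-line string:

```ruby
text = "ab\ncd"

p text.gsub(/./, "*")  # newline left alone: "**\n**"
p text.gsub(/./m, "*") # newline replaced too: "*****"
```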
I'm new to Ruby (and programming in general). I have a hash that is using data from an external file, and I'm trying to get the total number of values that are greater than 1500.
Actually, I need both the number of entries and the total value of purchase orders over 1500. The external file is just a column of order numbers and a column of prices. I'm sure there is a very simple solution, but like I said I'm a beginner and can't figure it out. Any help would be appreciated. Thanks.
Edit: Here is my code. It's just that last while loop that's causing all the problems. I know that's not the right way to go about it, but I just can't figure out what to do.
myhash={}
file=File.open("Purchase Orders.csv", "r")
while !file.eof
line=file.readline
key,value=line.chomp.split(",")
myhash[key]=value
end
total=0
entries=myhash.length
newtotal=0
myhash.each { |key,value|
total+=value.to_f
}
puts total
puts entries
while value.to_f>1500
myhash.each {|key,value| newtotal+=value.to_f}
end
puts newtotal
I will rewrite the code in an idiomatic Ruby way, in the hope that you'll examine it and pick up some hints.
prices = File.readlines("Purchase Orders.csv").map do |line|
  line.chomp.split(",").last.to_f
end # array of prices
total = prices.sum # sum of all prices
pricy = prices.select { |v| v > 1500 } # prices over 1500
pricy_sum = pricy.sum # total value of the expensive orders
pricy_count = pricy.length # number of expensive orders
puts "Total sum is: #{total}"
puts "Orders over 1500: #{pricy_count}, worth #{pricy_sum} in total"
It looks like you have your loops reversed. Also, do and end are usually preferred over curly braces for multi-line blocks, while curly braces are generally used for single-line blocks (as noted by @mudasobwa). Check out the Ruby style guide for more style pointers.
myhash.each do |key,value|
newtotal+=value.to_f if value.to_f > 1500
end
puts newtotal
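Since the question asks for both the number of orders over 1500 and their total value, here is a minimal sketch that returns the two together. The method name count_and_total_over and the sample hash are made up for illustration; the hash stands in for myhash, with string values as read from the file. Array#sum assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper: returns [number of orders over limit, their total].
def count_and_total_over(orders, limit = 1500)
  big = orders.values.map(&:to_f).select { |price| price > limit }
  [big.size, big.sum]
end

sample = { "1001" => "1600.50", "1002" => "900", "1003" => "2000" }
count, total = count_and_total_over(sample)
puts count # prints 2
puts total # prints 3600.5
```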
Code
def nbr_and_tot(fname)
  File.foreach(fname).with_object({ nbr_over: 0, tot_over: 0 }) do |line, h|
    n = line[/\d+/].to_i
    if n > 1500
      h[:nbr_over] += 1
      h[:tot_over] += n
    end
  end
end
Example
First let's create a file "temp":
str =<<-END
:cat, 1501
:dog, 1500
:pig, 2000
END
File.write("temp", str)
#=> 33
Confirm the file is correct:
puts File.read("temp")
prints
:cat, 1501
:dog, 1500
:pig, 2000
Now execute the method.
nbr_and_tot "temp"
#=> {:nbr_over=>2, :tot_over=>3501}
Explanation
First review, as necessary, IO::foreach (which reads the file line by line and returns an enumerator that is then chained to with_object; see note 1), Enumerator#with_object and String#[].
For the example,
fname = "temp"
e0 = File.foreach(fname)
#=> #<Enumerator: File:foreach("temp")>
We can see the values that will be generated by this enumerator (and passed to with_object) by converting it to an array:
e0.to_a
#=> [":cat, 1501\n", ":dog, 1500\n", ":pig, 2000\n"]
Continuing,
e1 = e0.with_object({ nbr_over: 0, tot_over: 0 })
#=> #<Enumerator: #<Enumerator: File:foreach("temp")>:with_object({:nbr_over=>0, :tot_over=>0})>
e1.to_a
#=> [[":cat, 1501\n", {:nbr_over=>0, :tot_over=>0}],
# [":dog, 1500\n", {:nbr_over=>0, :tot_over=>0}],
# [":pig, 2000\n", {:nbr_over=>0, :tot_over=>0}]]
The first element generated by e1 is passed to the block and the block variables are assigned values, using parallel assignment:
line, h = e1.next
#=> [":cat, 1501\n", {:nbr_over=>0, :tot_over=>0}]
line
#=> ":cat, 1501\n"
h #=> {:nbr_over=>0, :tot_over=>0}
and n is computed:
s = line[/\d+/]
#=> "1501"
n = s.to_i
#=> 1501
As n > 1500 #=> true, we perform the following operations:
h[:nbr_over] += 1
#=> 1
h[:tot_over] += n
#=> 1501
so now
h #=> {:nbr_over=>1, :tot_over=>1501}
Now the second element of e1 is passed to the block and the following steps are performed:
line, h = e1.next
#=> [":dog, 1500\n", {:nbr_over=>1, :tot_over=>1501}]
line
#=> ":dog, 1500\n"
h #=> {:nbr_over=>1, :tot_over=>1501}
n = line[/\d+/].to_i
#=> 1500
As n > 1500 #=> false, this line is skipped. The processing of the last element generated by e1 is similar to that for the first element.
1. File is a subclass of IO (File < IO #=> true), so IO class methods such as foreach are often invoked on the File class (File.foreach...).
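One caveat: line[/\d+/] takes the first run of digits on each line, which suits the sample file but would pick up the order number in a file laid out as order number, price. Below is a sketch, under that assumed layout, that reads the last comma-separated field as a float instead. The name over_threshold and the sample lines are hypothetical; the method accepts any enumerable of lines, such as File.foreach(fname).

```ruby
# Hypothetical variant: count and total of prices above a limit,
# taking the price from the last comma-separated field of each line.
def over_threshold(lines, limit = 1500)
  lines.each_with_object({ nbr_over: 0, tot_over: 0.0 }) do |line, h|
    price = line.chomp.split(",").last.to_f # nil.to_f is 0.0 for blank lines
    next unless price > limit
    h[:nbr_over] += 1
    h[:tot_over] += price
  end
end

result = over_threshold(["1001,1600.50\n", "1002,900\n", "1003,2000\n"])
puts result[:nbr_over] # prints 2
puts result[:tot_over] # prints 3600.5
```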
Let's say I have a string, like string= "aasmflathesorcerersnstonedksaottersapldrrysaahf". If you haven't noticed, you can find the phrase "harry potter and the sorcerers stone" in there (minus the space).
I need to check whether string contains all the letters of another string, regardless of their order.
string.include? ("sorcerer") #=> true
string.include? ("harrypotterandtheasorcerersstone") #=> false, even though it contains all the letters to spell harrypotterandthesorcerersstone
include? does not work on a shuffled string.
How can I check if a string contains all the elements of another string?
Sets and array intersection don't account for repeated chars, but a histogram / frequency counter does:
require 'facets'
s1 = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
s2 = "harrypotterandtheasorcerersstone"
freq1 = s1.chars.frequency
freq2 = s2.chars.frequency
freq2.all? { |char2, count2| freq1[char2] >= count2 }
#=> true
Write your own Array#frequency if you don't want the facets dependency.
class Array
def frequency
Hash.new(0).tap { |counts| each { |v| counts[v] += 1 } }
end
end
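On Ruby 2.7 or later, the built-in Enumerable#tally produces the same kind of count hash, so neither facets nor a monkey patch is needed. A minimal sketch, with contains_all_letters? as a hypothetical method name:

```ruby
# Hypothetical helper: true if bank has at least as many of each
# character as wanted does.
def contains_all_letters?(bank, wanted)
  have = bank.chars.tally # e.g. "aab".chars.tally gives a => 2, b => 1
  wanted.chars.tally.all? { |ch, n| have.fetch(ch, 0) >= n }
end

s1 = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
s2 = "harrypotterandtheasorcerersstone"
p contains_all_letters?(s1, s2)      # prints true
p contains_all_letters?("abc", "aab") # prints false, only one "a" available
```

fetch(ch, 0) is used so that a character missing from the bank counts as zero rather than nil.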
I presume that if the string to be checked is "sorcerer", string must include, for example, three "r"'s. If so you could use the method Array#difference, which I've proposed be added to the Ruby core.
class Array
def difference(other)
h = other.each_with_object(Hash.new(0)) { |e,h| h[e] += 1 }
reject { |e| h[e] > 0 && h[e] -= 1 }
end
end
str = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
target = "sorcerer"
target.chars.difference(str.chars).empty?
#=> true
target = "harrypotterandtheasorcerersstone"
target.chars.difference(str.chars).empty?
#=> true
If the characters of target must not only be in str, but must be in the same order, we could write:
target = "sorcerer"
r = Regexp.new "#{ target.chars.join "\.*" }"
#=> /s.*o.*r.*c.*e.*r.*e.*r/
str =~ r
#=> 2 (truthy)
(or !!(str =~ r) #=> true)
target = "harrypotterandtheasorcerersstone"
r = Regexp.new "#{ target.chars.join "\.*" }"
#=> /h.*a.*r.*r.*y.* ... o.*n.*e/
str =~ r
#=> nil
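The interpolated pattern above does not escape the target's characters, which is fine for plain letters. As an alternative sketch that avoids building a regex at all, the same in-order check can be done by walking the string with String#index; subsequence? is a hypothetical name.

```ruby
# Hypothetical helper: true if the characters of target appear in str
# in the same order (not necessarily adjacent).
def subsequence?(target, str)
  pos = 0
  target.each_char.all? do |ch|
    found = str.index(ch, pos) # search only to the right of the last hit
    pos = found + 1 if found
    found                      # nil stops all? early; any index is truthy
  end
end

str = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
p subsequence?("sorcerer", str)                         # prints true
p subsequence?("harrypotterandtheasorcerersstone", str) # prints false
```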
A different albeit not necessarily better solution using sorted character arrays and sub-strings:
Given your two strings...
subject = "aasmflathesorcerersnstonedksaottersapldrrysaahf"
search = "harrypotterandthesorcerersstone"
You can sort your subject string using .chars.sort.join...
subject = subject.chars.sort.join # => "aaaaaaacddeeeeeffhhkllmnnoooprrrrrrssssssstttty"
And then produce a list of substrings to search for:
search = search.chars.group_by(&:itself).values.map(&:join)
# => ["hh", "aa", "rrrrrr", "y", "p", "ooo", "tttt", "eeeee", "nn", "d", "sss", "c"]
You could alternatively produce the same set of substrings using this method
search = search.chars.sort.join.scan(/((.)\2*)/).map(&:first)
And then simply check whether every search sub-string appears within the sorted subject string:
search.all? { |c| subject[c] }
Create a 2 dimensional array out of your string letter bank, to associate the count of letters to each letter.
Create a 2 dimensional array out of the harry potter string in the same way.
Loop through both and do comparisons.
I have no experience in Ruby but this is how I would start to tackle it in the language I know most, which is Java.
My goal is to find the word with greatest number of repeated letters in a given string. For example, "aabcc ddeeteefef iijjfff" would return "ddeeteefef" because "e" is repeated five times in this word and that is more than all other repeating characters.
So far this is what I got, but it has many problems and is not complete:
def LetterCountI(str)
s = str.split(" ")
i = 0
result = []
t = s[i].scan(/((.)\2+)/).map(&:max)
u = t.max { |a, b| a.length <=> b.length }
return u.split(//).count
end
The code I have only finds consecutive patterns; if the pattern is interrupted (such as with "aabaaa", it counts a three times instead of five).
str.scan(/\w+/).max_by{ |w| w.chars.group_by(&:to_s).values.map(&:size).max }
scan(/\w+/): create an array of all sequences of 'word' characters
max_by{ … }: find the word that gives the largest value inside this block
chars: split the string into characters
group_by(&:to_s): create a hash mapping each character to an array of all its occurrences
values: just get all the arrays of the occurrences
map(&:size): convert each array to the number of characters in that array
max: find the largest count and use this as the result for max_by to examine
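The steps above can be traced on a single word from the question's example; the variable names are only for illustration:

```ruby
word   = "ddeeteefef"
groups = word.chars.group_by(&:to_s) # one array of occurrences per distinct character
sizes  = groups.values.map(&:size)   # occurrences of "d", "e", "t", "f" in that order
best   = sizes.max                   # the value max_by compares between words

p groups.keys # prints ["d", "e", "t", "f"]
p sizes       # prints [2, 5, 1, 2]
p best        # prints 5
```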
Edit: Written less compactly:
str.scan(/\w+/).max_by do |word|
word.chars
.group_by{ |char| char }
.map{ |char,array| array.size }
.max
end
Written less functionally and with less Ruby-isms (to make it look more like "other" languages):
words_by_most_repeated = []
str.split(" ").each do |word|
  count_by_char = {} # hash mapping character to count of occurrences
  word.chars.each do |char|
    count_by_char[ char ] = 0 unless count_by_char[ char ]
    count_by_char[ char ] += 1
  end
  maximum_count = 0
  count_by_char.each do |char,count|
    if count > maximum_count then
      maximum_count = count
    end
  end
  words_by_most_repeated[ maximum_count ] = word
end
most_repeated = words_by_most_repeated.last
I'd do it as below:
s = "aabcc ddeeteefef iijjfff"
# intermediate calculation that's happening in the final code
s.split(" ").map { |w| w.chars.max_by { |e| w.count(e) } }
# => ["a", "e", "f"] # getting the max count character from each word
s.split(" ").map { |w| w.count(w.chars.max_by { |e| w.count(e) }) }
# => [2, 5, 3] # getting the max count character's count from each word
# final code
s.split(" ").max_by { |w| w.count(w.chars.max_by { |e| w.count(e) }) }
# => "ddeeteefef"
update
each_with_object performs better than the group_by method.
require 'benchmark'
s = "aabcc ddeeteefef iijjfff"
def phrogz(s)
s.scan(/\w+/).max_by{ |word| word.chars.group_by(&:to_s).values.map(&:size).max }
end
def arup_v1(s)
max_string = s.split.max_by do |w|
h = w.chars.each_with_object(Hash.new(0)) do |e,hsh|
hsh[e] += 1
end
h.values.max
end
end
def arup_v2(s)
s.split.max_by { |w| w.count(w.chars.max_by { |e| w.count(e) }) }
end
n = 100_000
Benchmark.bm do |x|
x.report("Phrogz:") { n.times {|i| phrogz s } }
x.report("arup_v2:"){ n.times {|i| arup_v2 s } }
x.report("arup_v1:"){ n.times {|i| arup_v1 s } }
end
output
               user     system      total        real
Phrogz:    1.981000   0.000000   1.981000 (  1.979198)
arup_v2:   0.874000   0.000000   0.874000 (  0.878088)
arup_v1:   1.684000   0.000000   1.684000 (  1.685168)
Similar to sawa's answer:
"aabcc ddeeteefef iijjfff".split.max_by{|w| w.length - w.chars.uniq.length}
=> "ddeeteefef"
In Ruby 2.x, this works as-is because String#chars returns an array. In earlier versions of Ruby, String#chars returns an enumerator, so you need to add .to_a before applying uniq. I did my testing in Ruby 2.0 and overlooked this until Stephens pointed it out.
I believe this is valid, since the question was "greatest number of repeated letters in a given string" rather than greatest number of repeats for a single letter in a given string.
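The two readings of the question can pick different words. A minimal sketch of the difference, with hypothetical method names and a made-up input (Enumerable#tally assumes Ruby 2.7 or later):

```ruby
# Reading 1: most repeated letters overall (length minus distinct letters).
def most_total_repeats(str)
  str.split.max_by { |w| w.length - w.chars.uniq.length }
end

# Reading 2: most repeats of any single letter.
def most_single_letter_repeats(str)
  str.split.max_by { |w| w.chars.tally.values.max }
end

p most_total_repeats("aabbcc aaab")         # prints "aabbcc" (3 repeats vs 2)
p most_single_letter_repeats("aabbcc aaab") # prints "aaab" (three "a"s vs two of anything)
```

For the question's own example string both readings return "ddeeteefef".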
"aabcc ddeeteefef iijjfff"
.split.max_by{|w| w.chars.sort.chunk{|e| e}.map{|e| e.last.length}.max}
# => "ddeeteefef"
I'm wondering if there is a way to return the first letter of a word. For example, if you type in word("hey") it would return just the letter h. Or, if you wanted to, you could return the letter e on its own. I was considering using break or scan, but I can't seem to make them work.
Another method you can look at is chr, which returns the first character of a string:
>> 'hey'.chr # 'h'
You can also look at http://www.ruby-doc.org/core-1.9.3/String.html#method-i-slice to see how you can combine regexps and indexes to get a part of a string.
UPDATE: If you are on ruby 1.8, it's a bit hackish but
>> 'hey'[0] # 104
>> 'hey'[0..0] # 'h'
>> 'hey'.slice(0,1) # 'h'
Yes:
def first_letter(word)
word[0]
end
Or, if using Ruby 1.8:
def first_letter(word)
word.chars.first
end
Use the syntax str[index] to get a specific letter of a word (0 is first letter, 1 second, and so on).
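A few illustrative calls (Ruby 1.9 or later, where indexing a string returns a one-character string):

```ruby
word = "hey"

p word[0]  # first letter: "h"
p word[1]  # second letter: "e"
p word[-1] # negative indexes count from the end: "y"
p word[3]  # past the end of the string: nil
```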
This is a naive implementation, but you could use method_missing to create a DSL that'd allow you to query a word for letters at different positions:
def self.method_missing(method, *args)
  number_dictionary = {
    first: 1,
    second: 2,
    third: 3,
    fourth: 4,
    fifth: 5,
    sixth: 6,
    seventh: 7,
    eighth: 8,
    ninth: 9,
    tenth: 10
  }
  # Match names such as first_letter or third_letter.
  if method.to_s =~ /\A(.+)_letter\z/ && (number = number_dictionary[$1.to_sym])
    args[0][number - 1] # return the letter (puts would return nil)
  else
    super
  end
end
first_letter('hey') # => 'h'
second_letter('hey') # => 'e'
third_letter('hey') # => 'y'
Using your example - the word "hey":
h = "hey"
puts h[0]
This should print h.