hashes ruby merge - ruby

My text file contains a few lines, and I want to add each line to a hash with the first two words as the key and the third word as the value. The following code runs without errors, but the logic may be wrong: the last line is supposed to print all the keys of the hash, but nothing is printed. Please help.
def word_count(string)
  count = string.count(' ')
  return count
end
h = Hash.new
f = File.open('sheet.txt','r')
f.each_line do |line|
  count = word_count(line)
  if count == 3
    a = line.split
    h.merge(a[0]+a[1] => a[2])
  end
end
puts h.keys

Hash#merge doesn't modify the hash you call it on, it returns the merged Hash:
merge(other_hash) → new_hash
Returns a new hash containing the contents of other_hash and the contents of hsh. [...]
Note the Returns a new hash... part. When you say this:
h.merge(a[0]+a[1] => a[2])
You're merging the new values you built into a copy of h and then throwing away the merged hash; the end result is that h never gets anything added to it and ends up being empty after all your work.
You want to use merge! to modify the Hash:
h.merge!(a[0]+a[1] => a[2])
or keep using merge but save the return value:
h = h.merge(a[0]+a[1] => a[2])
or, since you're only adding a single value, just assign it:
h[a[0] + a[1]] = a[2]
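A minimal sketch of the difference between the three options. The `words` array is a made-up stand-in for `line.split` on one line of the file; it is not taken from the question's data.

```ruby
# `words` is a made-up stand-in for `line.split` on one line of the file.
words = %w[alpha beta gamma]

h = {}
merged = h.merge(words[0] + words[1] => words[2]) # builds and returns a new Hash
size_after_merge = h.size                          # h itself was not touched

h.merge!(words[0] + words[1] => words[2])          # merge! modifies h in place
size_after_merge_bang = h.size

g = {}
g[words[0] + words[1]] = words[2]                  # plain assignment, same result

puts size_after_merge      # 0
puts size_after_merge_bang # 1
puts h.keys                # alphabeta
```

For a single key, the plain assignment is usually the simplest of the three to read.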

If you want to add the first three words of each line to the hash regardless of how many words there are, you can drop the if count == 3 line. If you want to make sure that there are at least three words, keep a check, but note that word_count counts spaces rather than words: a line of three single-spaced words has a count of 2, so the test would be if count >= 2.
Also, mu is correct. You'll want h.merge!

Related

Given a string, how do I compare the characters to see if there are duplicates?

I'm trying to compare characters in a given string to see if there are duplicates, and if there are I want to remove the two characters to reduce the string to as small as possible, e.g. ("ttyyzx") would become ("zx").
I've tried converting the characters in an array and then using an #each_with_index to iterate over the characters.
arr = ("xxyz").split("")
arr.each_with_index do |idx1, idx2|
if idx1[idx2] == idx1[idx2 + 1]
p idx1[idx2]
p idx1[idx2 + 1]
end
end
At this point I just want to be able to print the next character in the array within the loop so I know I can move on to the next step, but no matter what code I use it will only print out the first character "x".
To only keep the unique characters (ggorlen's answer is "b"): count all characters, find only those that appear once. We rely on Ruby's Hash producing keys in insertion order.
def keep_unique_chars(str)
  str.each_char.
    with_object(Hash.new(0)) { |element, counts| counts[element] += 1 }.
    select { |_, count| count == 1 }.
    keys.
    join
end
To remove adjacent dupes only (ggorlen's answer is "aba"): a regular expression replacing adjacent repetitions is probably the go-to method.
def remove_adjacent_dupes(str)
  str.gsub(/(.)\1+/, '')
end
Without regular expressions, we can use slice_when to cut the array when the character changes, then drop the groups that are too long. One might think a flatten would be required before join, but join doesn't care:
def remove_adjacent_dupes_without_regexp(str)
  str.each_char.
    slice_when { |prev, curr| prev != curr }.
    select { |group| group.size == 1 }.
    join
end
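To make the difference between the two readings concrete, here is a small sketch that runs both approaches on the same input. The sample string "aabac" is an assumption chosen for illustration; it does not come from the question or from the other answers.

```ruby
# Keep only the characters that occur exactly once in the whole string.
def keep_unique_chars(str)
  str.each_char
     .with_object(Hash.new(0)) { |char, counts| counts[char] += 1 }
     .select { |_, count| count == 1 }
     .keys
     .join
end

# Remove every run of two or more identical adjacent characters.
def remove_adjacent_dupes(str)
  str.gsub(/(.)\1+/, '')
end

sample = "aabac" # hypothetical input: "a" occurs three times, only two adjacent

puts keep_unique_chars(sample)       # bc
puts remove_adjacent_dupes(sample)   # bac
puts keep_unique_chars("ttyyzx")     # zx
puts remove_adjacent_dupes("ttyyzx") # zx
```

On the question's own example both readings agree, so the example alone does not settle which behavior is wanted.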
While amadan's and user's solutions definitely solve the problem, I felt like writing a solution closer to the OP's attempt:
def clean(string)
  return string if string.length == 1
  array = string.split('')
  array.select.with_index do |value, index|
    array[index - 1] != value && array[index + 1] != value
  end.join
end
Here are a few examples:
puts clean("aaaaabccccdeeeeeefgggg")
#-> bdf
puts clean("m")
#-> m
puts clean("ttyyzx")
#-> zx
puts clean("aab")
#-> b
The method makes use of the fact that the characters are sorted (or at least grouped, so that identical characters are adjacent): if there are duplicates, they are immediately before or after the character being checked by the select method. The method is slower than the solutions posted above, but since the OP mentioned that he does not yet work with hashes, I thought this might be useful.
If speed is not an issue,
require 'set'
...
Set.new(("xxyz").split("")).to_a.join # => xyz
Making it a Set removes duplicates.
The OP does not want to remove duplicates and keep a single copy, but to remove completely every character that occurs more than once. So here is a new approach, again compact, but not fast:
"xxyz".split('').sort.join.gsub(/(.)\1+/,'')
The idea is to sort the letters so that identical letters end up next to each other. The regexp /(.)\1+/ matches a run of a repeated letter.

Bug in my Ruby counter

It is only counting once for each word. I want it to tell me how many times each word appears.
dictionary = ["to","do","to","do","to","do"]
string = "just do it to"
def machine(word,list)
  initialize = Hash.new
  swerve = word.downcase.split(" ")
  list.each do |i|
    counter = 0
    swerve.each do |j|
      if i.include? j
        counter += 1
      end
    end
    initialize[i]=counter
  end
  return initialize
end
machine(string,dictionary)
I assume that, for each word in string, you wish to determine the number of instances of that word in dictionary. If so, the first step is to create a counting hash.
dict_hash = dictionary.each_with_object(Hash.new(0)) { |word,h| h[word] += 1 }
#=> {"to"=>3, "do"=>3}
(I will explain this code later.)
Now split string on whitespace and create a hash whose keys are the words in string and whose values are the numbers of times that the value of word appears in dictionary.
string.split.each_with_object({}) { |word,h| h[word] = dict_hash.fetch(word, 0) }
#=> {"just"=>0, "do"=>3, "it"=>0, "to"=>3}
This of course assumes that each word in string is unique. If not, depending on the desired behavior, one possibility would be to use another counting hash.
string = "to just do it to"
string.split.each_with_object(Hash.new(0)) { |word,h|
h[word] += dict_hash.fetch(word, 0) }
#=> {"to"=>6, "just"=>0, "do"=>3, "it"=>0}
Now let me explain some of the constructs above.
I created two hashes with the form of the class method Hash::new that takes a parameter equal to the desired default value, which here is zero. What that means is that if
h = Hash.new(0)
and h does not have a key equal to the value word, then h[word] will return h's default value (and the hash h will not be changed). After creating the first hash that way, I wrote h[word] += 1. Ruby expands that to
h[word] = h[word] + 1
before she does any further processing. The first word in dictionary that is passed to the block is "to" (which is assigned to the block variable word). Since the hash h is initially empty (has no keys), h[word] on the right side of the above equality returns the default value of zero, giving us
h["to"] = h["to"] + 1
#=> = 0 + 1 => 1
Later, when word again equals "to" the default value is not used because h now has a key "to".
h["to"] = h["to"] + 1
#=> = 1 + 1 => 2
I used the well-worn method Enumerable#each_with_object. To a newbie this might seem complex. It isn't. The line
dict_hash = dictionary.each_with_object(Hash.new(0)) { |word,h| h[word] += 1 }
is effectively [1] the same as the following.
h = Hash.new(0)
dictionary.each { |word| h[word] += 1 }
dict_hash = h
In other words, the method allows one to write a single line that creates, constructs and returns the hash, rather than three lines that do the same.
Notice that I used the method Hash#fetch for retrieving values from the hash:
dict_hash.fetch(word, 0)
fetch's second argument (here 0) is returned if dict_hash does not have a key equal to the value of word. By contrast, dict_hash[word] returns nil in that case.
[1] The reason for "effectively" is that when using each_with_object, the variable h's scope is confined to the block, which is generally a good programming practice. Don't worry if you haven't learned about "scope" yet.
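The two lookup behaviors described above can be checked with a short sketch. The key "nope" and the variable names are made up for illustration.

```ruby
# A counting hash: missing keys read as 0, but reading does not add them.
counts = Hash.new(0)
counts["to"] += 1              # expands to counts["to"] = counts["to"] + 1
counts["to"] += 1
read_missing = counts["nope"]  # returns the default, 0
stored_after_read = counts.key?("nope")

# A plain hash has no default, so [] returns nil while fetch can supply one.
plain = { "to" => 2 }
bracket_lookup = plain["nope"]
fetch_lookup = plain.fetch("nope", 0)

puts counts["to"]          # 2
puts read_missing          # 0
puts stored_after_read     # false
puts bracket_lookup.nil?   # true
puts fetch_lookup          # 0
```

Note that dict_hash in the answer was itself created with Hash.new(0), so dict_hash[word] would also return 0 there; fetch makes the fallback explicit and keeps working if the hash is later replaced by a plain one.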
You can actually do this using Array#count rather easily:
def machine(word,list)
  word.downcase.split(' ').collect do |w|
    # for every word in `word`, count how many appearances in `list`
    [w, list.count { |l| l.include?(w) }]
  end.to_h
end
machine("just do it to", ["to","do","to","do","to","do"]) # => {"just"=>0, "do"=>3, "it"=>0, "to"=>3}
I think this is what you're looking for, but it seems like you're approaching this backwards.
Convert your string "string" into an array, remove duplicate values and iterate through each element, counting the number of matches in your array "dictionary". The enumerable method :count is useful here.
A good data structure to output here would be a hash, where we store the unique words in our string "string" as keys and the number of occurrences of these words in array "dictionary" as the values. Hashes allow one to store more information about the data in a collection than an array or string, so this fits here.
dictionary = [ "to","do","to","do","to","do" ]
string = "just do it to"
def group_by_matches( match_str, list_of_words )
  ## trim leading and trailing whitespace, split the string into an array of words, remove duplicates.
  to_match = match_str.strip.split.uniq
  groupings = {}
  ## for each word, count the number of times it appears *exactly* in the list_of_words array
  ## and store that count in the groupings hash
  to_match.each do | word |
    groupings[ word ] = list_of_words.count( word )
  end
  groupings
end
group_by_matches( string, dictionary ) #=> {"just"=>0, "do"=>3, "it"=>0, "to"=>3}
On a side note, you should consider using more descriptive variable and method names to help yourself and others follow what's going on.
This also seems like you have it backwards. Typically, you'd want to use the array to count the number of occurrences in the string. This seems to more closely fit a real-world application where you'd examine a sentence/string of data for matches from a list of predefined words.
Arrays are also useful because they're flexible collections of data, easily iterated through and mutated with enumerable methods. To work with the words in our string, as you can see, it's easiest to immediately convert it to an array of words.
There are many alternatives. If you want to shorten the method, you could replace the more verbose each loop with an each_with_object call, or with a map call, which returns a new array rather than the original object as each does. When using map.to_h, be careful: to_h works on an array of two-element arrays such as [["key1", "val1"], ["key2", "val2"]] but not on a flat array.
## each_with_object
def group_by_matches( match_str, list_of_words )
  to_match = match_str.strip.split.uniq
  to_match.
    each_with_object( {} ) { | word, groupings | groupings[ word ] = list_of_words.count( word ) }
end
## map
def group_by_matches( match_str, list_of_words )
  to_match = match_str.strip.split.uniq
  to_match.
    map { | word | [ word, list_of_words.count( word ) ] }.to_h
end
Gauge your method preferences depending on performance, readability, and reliability.
list.each do |i|
  counter = 0
  swerve.each do |j|
    if i.include? j
      counter += 1
needs to be changed to
swerve.each do |i|
  counter = 0
  list.each do |j|
    if i.include? j
      counter += 1
Your code iterates over the dictionary in the outer loop, so the hash ends up keyed by the dictionary words. Because hash keys are unique, each repeated dictionary word simply overwrites its earlier entry, and each one matches only a single word of the string. That is why every count comes out as 1.
If you want to know how many times each word of the string appears in the dictionary, swap the list.each and swerve.each loops as shown. The method then returns {"just"=>0, "do"=>3, "it"=>0, "to"=>3}.
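For reference, a complete runnable sketch of the method with the loops swapped. The variable names (sentence, counts, entry) are my own renames for readability, not the OP's.

```ruby
# Sketch of the OP's method with the two loops swapped.
# Variable names are renamed for readability; they are not the OP's.
def machine(sentence, list)
  counts = {}
  sentence.downcase.split(" ").each do |word|
    counter = 0
    list.each do |entry|
      # include? is a substring test, kept here to mirror the original code
      counter += 1 if word.include?(entry)
    end
    counts[word] = counter
  end
  counts
end

dictionary = ["to", "do", "to", "do", "to", "do"]
result = machine("just do it to", dictionary)
result.each { |word, count| puts "#{word}: #{count}" }
# just: 0
# do: 3
# it: 0
# to: 3
```

Because String#include? tests for a substring, a word such as "today" would also be counted as a match for "to"; compare with == if only exact matches should count.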

my_hash.keys == [], yet my_hash[key] gives a value?

I'm trying to demonstrate a situation where it's necessary to pass a block to Hash.new in order to set up default values for a given key when creating a hash of hashes.
To show what can go wrong, I've created the following code, which passes a single value as an argument to Hash.new. I expected all outer hash keys to wind up holding a reference to the same inner hash, causing the counts for the "piles" to get mixed together. And indeed, that does seem to have happened. But part_counts.each doesn't seem to find any keys/values to iterate over, and part_counts.keys returns an empty array. Only part_counts[0] and part_counts[1] successfully retrieve a value for me.
piles = [
  [:gear, :spring, :gear],
  [:axle, :gear, :spring],
]
# I do realize this should be:
# Hash.new {|h, k| h[k] = Hash.new(0)}
part_counts = Hash.new(Hash.new(0))
piles.each_with_index do |pile, pile_index|
  pile.each do |part|
    part_counts[pile_index][part] += 1
  end
end
p part_counts # => {}
p part_counts.keys # => []
# The next line prints no output
part_counts.each { |key, value| p key, value }
p part_counts[0] # => {:gear=>3, :spring=>2, :axle=>1}
For context, here is the corrected code that I intend to show after the "broken" code. The parts for each pile within part_counts are separated, as they should be. each and keys work as expected, as well.
# ...same pile initialization code as above...
part_counts = Hash.new {|h, k| h[k] = Hash.new(0)}
# ...same part counting code as above...
p part_counts # => {0=>{:gear=>2, :spring=>1}, 1=>{:axle=>1, :gear=>1, :spring=>1}}
p part_counts.keys # => [0, 1]
# The next line of code prints:
# 0
# {:gear=>2, :spring=>1}
# 1
# {:axle=>1, :gear=>1, :spring=>1}
part_counts.each { |key, value| p key, value }
p part_counts[0] # => {:gear=>2, :spring=>1}
But why don't each and keys work (at all) in the first sample?
We'll start by decomposing this a little bit:
part_counts = Hash.new(Hash.new(0))
That's the same as saying:
default_hash = { }
default_hash.default = 0
part_counts = { }
part_counts.default = default_hash
Later on, you're saying things like this:
part_counts[pile_index][part] += 1
That's the same as saying:
h = part_counts[pile_index]
h[part] += 1
You're not using the (correct) block form of the default value for your Hash so accessing the default value doesn't auto-vivify the key. That means that part_counts[pile_index] doesn't create a pile_index key in part_counts, it just gives you part_counts.default and you're really saying:
h = part_counts.default
h[part] += 1
You're not doing anything else to add keys to part_counts so it has no keys and:
part_counts.keys == [ ]
So why does part_counts[0] give us {:gear=>3, :spring=>2, :axle=>1}? part_counts doesn't have any keys and in particular doesn't have a 0 key so:
part_counts[0]
is the same as
part_counts.default
Up above where you're accessing part_counts[pile_index], you're really just getting a reference to the default. The Hash won't clone it, so you get the very same default object that the Hash will hand out next time. That means that:
part_counts[pile_index][part] += 1
is another way of saying:
part_counts.default[part] += 1
so you're actually just changing part_counts's default value in-place. Then when you call part_counts[0], you're accessing this modified default value and there's the {:gear=>3, :spring=>2, :axle=>1} that you accidentally built in your loop.
The value given to Hash.new is used as the default value, but this value is not inserted into the hash. So part_counts remains empty. You can get the default value by using part_counts[...] but this has no effect on the hash; it doesn't really contain the key.
When you call part_counts[pile_index][part] += 1, then part_counts[pile_index] returns the default value, and it's this value that is modified with the assignment, not part_counts.
You have something like:
outer = Hash.new({})
outer[1][2] = 3
p outer, outer[1]
which can also be written like:
inner = {}
outer = Hash.new(inner)
inner2 = outer[1] # inner2 refers to the same object as inner, outer is not modified
inner2[2] = 3 # same as inner[2] = 3
p outer, inner
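The following sketch puts the two forms side by side and checks object identity directly. The names shared and separate are made up for illustration.

```ruby
# Argument form: one default object, handed out for every missing key.
shared = Hash.new(Hash.new(0))
shared[0][:gear] += 1
shared[1][:gear] += 1

puts shared.keys.empty?               # true: no key was ever stored
puts shared[0].equal?(shared.default) # true: a lookup returns the default itself
puts shared.default[:gear]            # 2: both increments hit the same object

# Block form: the block stores a fresh inner hash under each missing key.
separate = Hash.new { |hash, key| hash[key] = Hash.new(0) }
separate[0][:gear] += 1
separate[1][:gear] += 1

puts separate.keys.length              # 2
puts separate[0].equal?(separate[1])   # false: each key has its own hash
puts separate[0][:gear]                # 1
```

The assignment hash[key] = ... inside the block is what makes the key appear in keys and each; a block that only returned Hash.new(0) without storing it would still leave the outer hash empty.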

Ruby's optimized implementation of Histogram/Aggregator

I'm about to write my own, but I was wondering if there are any gems/libs that I can use as an aggregator/histogram.
My goal is to sum up values based on a matching key:
["fish","2"]
["fish","40"]
["meat","56"]
["meat","1"]
This should sum up the values per unique key and return ["fish","42"] and ["meat","57"].
The files I have to aggregate are relatively large: about 4 GB text files made of TSV key/value pairs.
My goal is to avoid temporary files so as not to take up too much space on the machine, so I was wondering if something similar and already optimized exists. I found a gem on GitHub named 'histogram', but it does not really contain the functionality I need.
Thanks
You can use a Hash with a default value of 0 to do the counting, then in the end you could convert it to Array to yield the format you want, though I think you might just want to keep using the Hash instead.
data = [
  ["fish","2"],
  ["fish","40"],
  ["meat","56"],
  ["meat","1"]
]
hist = data.each_with_object(Hash.new(0)) do |(k,v), h|
  h[k] += v.to_i
end
hist # => {"fish"=>42, "meat"=>57}
hist.to_a # => [["fish", 42], ["meat", 57]]
# To get String values, "42" instead of 42, etc:
hist.map { |k,v| [k, v.to_s] } # => [["fish", "42"], ["meat", "57"]]
Since you stated you had to read the data from a file, here is the above when applied to a file. The input.txt file contents are as follows for this example:
fish,2
fish,40
meat,56
meat,1
Then, to create the same output as before by reading it line by line:
hist = File.open('input.txt') do |file|
  file.each_with_object(Hash.new(0)) do |line, h|
    key, value = line.split(',')
    h[key] += value.to_i
  end
end
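Since the question mentions tab-separated files of about 4 GB, here is a hedged sketch of the same idea for TSV input. It reads one line at a time, so only the running totals are held in memory. The aggregate method name is hypothetical, and a StringIO stands in for the real file so the example is self-contained.

```ruby
require 'stringio'

# Hypothetical helper: sums the integer in the second tab-separated column
# per key, reading the IO one line at a time.
def aggregate(io)
  io.each_line.with_object(Hash.new(0)) do |line, totals|
    key, value = line.chomp.split("\t", 2)
    next if key.nil? || value.nil? # skip blank or malformed lines
    totals[key] += value.to_i
  end
end

# A StringIO stands in for the real file here.
sample = StringIO.new("fish\t2\nfish\t40\nmeat\t56\nmeat\t1\n")
totals = aggregate(sample)
totals.each { |key, sum| puts "#{key}\t#{sum}" }
```

With a real file, something like File.open(path) { |f| aggregate(f) } should work the same way. Memory use then grows with the number of distinct keys rather than with the file size, and no temporary file is needed.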

Ruby regex into array of hashes but need to drop a key/val pair

I'm trying to parse a file containing a name followed by a hierarchy path. I want to take the named regex matches, turn them into Hash keys, and store the match as a hash. Each hash will get pushed to an array (so I'll end up with an array of hashes after parsing the entire file). This part of the code is working, except now I need to handle bad paths with duplicated hierarchy (top_* is always the top level). It appears that if I'm using named backreferences in Ruby I need to name all of the backreferences. I have gotten the match working in Rubular, but now I have the p1 backreference in my resultant hash.
Question: What's the easiest way to not include the p1 key/value pair in the hash? My method is used in other places so we can't assume that p1 always exists. Am I stuck with dropping each key/value pair in the array after calling the s_ary_to_hash method?
NOTE: I'm keeping this question to try and solve the specific issue of ignoring certain hash keys in my method. The regex issue is now in this ticket: Ruby regex - using optional named backreferences
UPDATE: Regex issue is solved, the hier is now always stored in the named 'hier' group. The only item remaining is to figure out how to drop the 'p1' key/value if it exists prior to creating the Hash.
Example file:
name1 top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
new12 top_ab12/hat[1]/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
tops top_bat/car[0]
ab123 top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
Expected output:
[{:name => "name1", :hier => "top_cat/mouse/dog/elephant/horse"},
{:name => "new12", :hier => "top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool"},
{:name => "tops", :hier => "top_bat/car[0]"},
{:name => "ab123", :hier => "top_2/top_1/top_3/top_4/dog"}]
Code snippet:
def s_ary_to_hash(ary, regex)
  retary = Array.new
  ary.each {|x| (retary << Hash[regex.match(x).names.map{|key| key.to_sym}.zip(regex.match(x).captures)]) if regex.match(x)}
  return retary
end
regex = %r{(?<name>\w+) (?<p1>[\w\/\[\]]+)?(?<hier>(\k<p1>.*)|((?<= ).*$))}
h_ary = s_ary_to_hash(File.readlines(filename), regex)
What about this regex?
^(?<name>\S+)\s+(?<p1>top_.+?)(?:\/(?<hier>\k<p1>(?:\[.+?\])?.+))?$
Demo
http://rubular.com/r/awEP9Mz1kB
Sample code
def s_ary_to_hash(ary, regex, mappings)
  retary = Array.new
  for item in ary
    tmp = regex.match(item)
    if tmp then
      hash = Hash.new
      retary.push(hash)
      mappings.each { |mapping|
        mapping.each { |key, groups|
          # use the first listed group that actually matched
          for group in groups
            if tmp[group] then
              hash[key] = tmp[group]
              break
            end
          end
        }
      }
    end
  end
  return retary
end
regex = %r{^(?<name>\S+)\s+(?<p1>top_.+?)(?:\/(?<hier>\k<p1>(?:\[.+?\])?.+))?$}
h_ary = s_ary_to_hash(
  File.readlines(filename),
  regex,
  [
    {:name => ['name']},
    {:hier => ['hier','p1']}
  ]
)
puts h_ary
Output
{:name=>"name1", :hier=>"top_cat/mouse/dog/elephant/horse\r"}
{:name=>"new12", :hier=>"top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool\r"}
{:name=>"tops", :hier=>"top_bat/car[0]"}
Discussion
Since Ruby 2.0.0 doesn't support branch reset, I have built a solution that adds some more power to the s_ary_to_hash function. It now accepts a third parameter indicating how to build the final array of hashes.
This third parameter is an array of hashes. Each hash in this array has one key (K) corresponding to the key in the final array of hashes. K is associated with an array containing the named group to use from the passed regex (second parameter of s_ary_to_hash function).
If a group equals nil, s_ary_to_hash skips it for the next group.
If all groups equal nil, K is not pushed on the final array of hashes.
Feel free to modify s_ary_to_hash if this isn't a desired behavior.
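If the only goal is to leave certain named groups out of the resulting hashes, a simpler variant is to filter the name/capture pairs before building each hash. The sketch below is an assumption-laden illustration: the exclude parameter is hypothetical, and the regex is a simplified stand-in, not the pattern from the question.

```ruby
# Hypothetical variant of s_ary_to_hash: `exclude` lists group names to drop.
def s_ary_to_hash(ary, regex, exclude = [])
  ary.each_with_object([]) do |line, result|
    match = regex.match(line)
    next unless match # skip lines that do not match at all
    pairs = match.names.zip(match.captures)
    pairs = pairs.reject { |name, _| exclude.include?(name) }
    result << pairs.map { |name, value| [name.to_sym, value] }.to_h
  end
end

# Simplified illustrative pattern, not the regex from the question.
regex = /(?<name>\w+) (?<p1>\w+)\/(?<hier>.*)/
lines = ["name1 top_cat/mouse/dog", "no match here!"]

with_p1 = s_ary_to_hash(lines, regex)
without_p1 = s_ary_to_hash(lines, regex, ['p1'])

puts with_p1.first.keys.join(', ')    # name, p1, hier
puts without_p1.first.keys.join(', ') # name, hier
```

Because exclude defaults to an empty array, existing callers that pass only two arguments keep their current behavior, and excluding a group name that the regex does not define is harmless.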
Edit: I've changed the method s_ary_to_hash to conform with what I now understand to be the criterion for excluding directories, namely, directory d is to be excluded if there is a downstream directory with the same name, or the same name followed by a non-negative integer in brackets. I've applied that to all directories, though I may have misunderstood the question; perhaps it should apply only to the first.
data =<<THE_END
name1 top_cat/mouse/dog/top_cat/mouse/dog/elephant/horse
new12 top_ab12/hat/top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool
tops top_bat/car[0]
ab123 top_2/top_1/top_3/top_4/top_2/top_1/top_3/top_4/dog
THE_END
text = data.split("\n")
def s_ary_to_hash(ary)
  ary.map do |s|
    name, _, downstream_path = s.partition(' ').map(&:strip)
    arr = []
    downstream_dirs = downstream_path.split('/')
    while downstream_dirs.any? do
      dir = downstream_dirs.shift
      # keep dir only if no later directory repeats it (with or without [n])
      arr << dir unless downstream_dirs.any? { |d|
        d == dir || d =~ /#{dir}\[\d+\]/ }
    end
    { name: name, hier: arr.join('/') }
  end
end
s_ary_to_hash(text)
# => [{:name=>"name1", :hier=>"top_cat/mouse/dog/elephant/horse"},
# {:name=>"new12", :hier=>"top_ab12/hat[1]/path0_top_ab12/top_ab12path1/cool"},
# {:name=>"tops", :hier=>"top_bat/car[0]"},
# {:name=>"ab123", :hier=>"top_2/top_1/top_3/top_4/dog"}]
The exclusion criterion is implemented in downstream_dirs.any? { |d| d == dir || d =~ /#{dir}\[\d+\]/ }, where dir is the directory that is being tested and downstream_dirs is an array of all the downstream directories. (When dir is the last directory, downstream_dirs is empty.) Localizing it in this way makes it easy to test and change the exclusion criterion. You could shorten this to a single regex and/or make it a method:
def exclude_dir?(dir, downstream_dirs)
  downstream_dirs.any? { |d| d == dir || d =~ /#{dir}\[\d+\]/ }
end
Here is a non regexp solution:
result = string.each_line.map do |line|
  name, path = line.split(' ')
  path = path.split('/')
  # keep everything from the last occurrence of the root directory onward
  last_occur_of_root = path.rindex(path.first)
  path = path[last_occur_of_root..-1]
  {name: name, hier: path.join('/')}
end

Resources