Searching through two multidimensional arrays and grouping together similar subarrays - ruby

I am trying to search through two multidimensional arrays to find any elements in common in a given subarray and then put the results in a third array where the entire subarrays with similar elements are grouped together (not just the similar elements).
The data is imported from two CSVs:
require 'csv'
array = CSV.read('primary_csv.csv')
#=> [["account_num", "account_name", "primary_phone", "second_phone", "status"],
#=> ["11111", "John Smith", "8675309", " ", "active"],
#=> ["11112", "Tina F.", "5551234", "5555678" , "disconnected"],
#=> ["11113", "Troy P.", "9874321", " ", "active"]]
# and so on...
second_array = CSV.read('customer_service.csv')
#=> [["date", "name", "agent", "call_length", "phone", "second_phone", "complaint"],
#=> ["3/1/15", "Mary ?", "Bob X", "5:00", "5551234", " ", "rude"],
#=> ["3/2/15", "Mrs. Smith", "Stew", "1:45", "9995678", "8675309" , "says shes not a customer"]]
# and so on...
If any number is present as an element in a subarray in both primary_csv.csv and customer_service.csv, I want that entire subarray (as opposed to just the common elements) put into a third array, results_array. The desired output based upon the above sample is:
results_array = [["11111", "John Smith", "8675309", " ", "active"],
["3/2/15", "Mrs. Smith", "Stew", "1:45", "9995678", "8675309" , "says shes not a customer"]] # and so on...
I then want to export the array into a new CSV, where each subarray is its own row of the CSV. I intend to iterate over each subarray by joining it with a , to make it comma delimited and then put the results into a new CSV:
rows = results_array.map { |j| j.join(",") }
File.open("results.csv", "w") { |f| f.puts rows }
#=> 11111,John Smith,8675309, ,active
#=> 3/2/15,Mrs. Smith,Stew,1:45,9995678,8675309,says shes not a customer
# and so on...
How can I achieve the desired output? I am aware that the final product will look messy because similar data (for example, phone number) will be in different columns. But I need to find a way to generally group the data together.

Suppose a1 and a2 are the two arrays (excluding header rows).
Code
def combine(a1, a2)
h2 = a2.each_with_index
.with_object(Hash.new { |h,k| h[k] = [] }) { |(arr,i),h|
arr.each { |e| es = e.strip; h[es] << i if number?(es) } }
a1.each_with_object([]) do |arr, b|
d = arr.each_with_object([]) do |str, d|
s = str.strip
d.concat(a2.values_at(*h2[s])) if number?(s) && h2.key?(s)
end
b << d.uniq.unshift(arr) if d.any?
end
end
def number?(str)
str =~ /^\d+$/
end
Example
Here is your example, modified somewhat:
a1 = [
["11111", "John Smith", "8675309", "", "active" ],
["11112", "Tina F.", "5551234", "5555678", "disconnected"],
["11113", "Troy P.", "9874321", "", "active" ]
]
a2 = [
["3/1/15", "Mary ?", "Bob X", "5:00", "5551234", "", "rude"],
["3/2/15", "Mrs. Smith", "Stew", "1:45", "9995678", "8675309", "surly"],
["3/7/15", "Cher", "Sonny", "7:45", "9874321", "8675309", "Hey Jude"]
]
combine(a1, a2)
#=> [[["11111", "John Smith", "8675309", "",
# "active"],
# ["3/2/15", "Mrs. Smith", "Stew", "1:45",
# "9995678", "8675309", "surly"],
# ["3/7/15", "Cher", "Sonny", "7:45",
# "9874321", "8675309", "Hey Jude"]
# ],
# [["11112", "Tina F.", "5551234", "5555678",
# "disconnected"],
# ["3/1/15", "Mary ?", "Bob X", "5:00",
# "5551234", "", "rude"]
# ],
# [["11113", "Troy P.", "9874321", "",
# "active"],
# ["3/7/15", "Cher", "Sonny", "7:45",
# "9874321", "8675309", "Hey Jude"]
# ]
# ]
Explanation
First, we define a helper:
def number?(str)
str =~ /^\d+$/
end
For example:
number?("8675309") #=> 0 ("truthy")
number?("3/1/15") #=> nil
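One caveat: in Ruby, ^ and $ anchor to line boundaries rather than to the start and end of the whole string, so a multi-line field that contains a line of digits also passes number?. If that matters for the data, a stricter check could look like the sketch below. The name strict_number? is hypothetical and not part of the answer above, and it assumes a "number" means a string made only of digits.

```ruby
# Hypothetical stricter variant of number?: \A and \z anchor to the whole
# string, whereas ^ and $ anchor to individual lines.
def strict_number?(str)
  str.match?(/\A\d+\z/)
end

strict_number?("8675309")   #=> true
strict_number?("3/1/15")    #=> false
strict_number?("123\nabc")  #=> false (the ^...$ version would accept this)
```

String#match? returns a plain true or false instead of a match offset, which is usually what a predicate method is expected to return.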
Now index a2 on the values that represent numbers:
h2 = a2.each_with_index
.with_object(Hash.new { |h,k| h[k] = [] }) { |(arr,i),h|
arr.each { |e| es = e.strip; h[es] << i if number?(es) } }
#=> {"5551234"=>[0], "9995678"=>[1], "8675309"=>[1, 2], "9874321"=>[2]}
This says, for example, that the "numeric" field "8675309" is contained in the elements at offsets 1 and 2 of a2 (i.e., for Mrs. Smith and Cher).
We can now simply run through the elements of a1 looking for matches.
The code:
arr.each_with_object([]) do |str, d|
s = str.strip
d.concat(a2.values_at(*h2[s])) if number?(s) && h2.key?(s)
end
steps through the elements of arr, assigning each to the block variable str. For example, if arr holds the first element of a1, str will in turn equal "11111", "John Smith", and so on. After s = str.strip, the condition says that if s is a string of digits and h2 has a matching key, the elements of a2 at the offsets given by h2[s] are appended to the (initially empty) array d.
After completing this loop we see if d contains any elements of a2:
b << d.uniq.unshift(arr) if d.any?
If it does, we remove duplicates, prepend the array with arr and save it to b.
Note that this allows one element of a2 to match multiple elements of a1.
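For the export step in the question, joining fields with "," by hand breaks as soon as a field itself contains a comma. A minimal sketch using the standard csv library is shown below. It assumes the nested shape returned by combine (an array of groups, each an array of rows); groups_to_csv is a hypothetical helper name, and the sample complaint text containing a comma is invented to show the quoting.

```ruby
require 'csv'

# Flatten one level so that each matched subarray becomes one CSV row,
# then let the CSV library handle delimiters and quoting.
def groups_to_csv(groups)
  CSV.generate do |csv|
    groups.flatten(1).each { |row| csv << row }
  end
end

# Sample shaped like the return value of combine(a1, a2).
# The complaint field with a comma is hypothetical.
groups = [
  [
    ["11112", "Tina F.", "5551234", "5555678", "disconnected"],
    ["3/1/15", "Mary ?", "Bob X", "5:00", "5551234", "n/a", "rude, abrupt"]
  ]
]

puts groups_to_csv(groups)
```

To write a file instead of printing, the returned string can be passed to File.write("results.csv", ...).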

Related

Ruby hash with multiple comma separated values to array of hashes with same keys

What is the most efficient and pretty way to map this:
{name:"cheese,test", uid:"1,2"}
to this:
[ {name:"cheese", uid:"1"}, {name:"test", uid:"2"} ]
It should work dynamically, for example with: { name:"cheese,test,third", uid:"1,2,3" } or {name:"cheese,test,third,fourth", uid:"1,2,3,4", age:"9,8,7,6" }
Finally I made this:
hash = {name:"cheese,test", uid:"1,2"}
results = []
length = hash.values.first.split(',').length
length.times do |i|
results << hash.map {|k,v| [k, v.split(',')[i]]}
end
results.map{|e| e.to_h}
It works, but I am not pleased with it; there has to be a cleaner and more idiomatic ('rubyist') way to do this.
def splithash(h)
# Transform each element in the Hash...
h.map do |k, v|
# ...by splitting the values on commas...
v.split(',').map do |vv|
# ...and turning these into individual { k => v } entries.
{ k => vv }
end
end.inject do |a,b|
# Then combine these by "zip" combining each list A to each list B...
a.zip(b)
# ...which will require a subsequent .flatten to eliminate nesting
# [ [ 1, 2 ], 3 ] -> [ 1, 2, 3 ]
end.map(&:flatten).map do |s|
# Then combine all of these { k => v } hashes into one containing
# all the keys with associated values.
s.inject(&:merge)
end
end
Which can be used like this:
splithash(name:"cheese,test", uid:"1,2", example:"a,b")
# => [{:name=>"cheese", :uid=>"1", :example=>"a"}, {:name=>"test", :uid=>"2", :example=>"b"}]
It looks a lot more convoluted at first glance, but this handles any number of keys.
I would likely use transpose and zip like so:
hash = {name:"cheese,test,third,fourth", uid:"1,2,3,4", age:"9,8,7,6" }
hash.values.map{|x| x.split(",")}.transpose.map{|v| hash.keys.zip(v).to_h}
#=> [{:name=>"cheese", :uid=>"1", :age=>"9"}, {:name=>"test", :uid=>"2", :age=>"8"}, {:name=>"third", :uid=>"3", :age=>"7"}, {:name=>"fourth", :uid=>"4", :age=>"6"}]
To break it down a bit (code slightly modified for operational clarity):
hash.values
#=> ["cheese,test,third,fourth", "1,2,3,4", "9,8,7,6"]
.map{|x| x.split(",")}
#=> [["cheese", "test", "third", "fourth"], ["1", "2", "3", "4"], ["9", "8", "7", "6"]]
.transpose
#=> [["cheese", "1", "9"], ["test", "2", "8"], ["third", "3", "7"], ["fourth", "4", "6"]]
.map do |v|
hash.keys #=> [[:name, :uid, :age], [:name, :uid, :age], [:name, :uid, :age], [:name, :uid, :age]]
.zip(v) #=> [[[:name, "cheese"], [:uid, "1"], [:age, "9"]], [[:name, "test"], [:uid, "2"], [:age, "8"]], [[:name, "third"], [:uid, "3"], [:age, "7"]], [[:name, "fourth"], [:uid, "4"], [:age, "6"]]]
.to_h #=> [{:name=>"cheese", :uid=>"1", :age=>"9"}, {:name=>"test", :uid=>"2", :age=>"8"}, {:name=>"third", :uid=>"3", :age=>"7"}, {:name=>"fourth", :uid=>"4", :age=>"6"}]
end
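One thing to keep in mind with the transpose approach is that Array#transpose raises an IndexError when the split lists have different lengths. If uneven input is possible, a sketch like the following pads the missing entries with nil instead. The name split_hash is hypothetical, and nil padding is an assumption about the desired behavior rather than something the question specifies.

```ruby
# Hypothetical helper: tolerates value lists of different lengths by
# filling the gaps with nil instead of raising.
def split_hash(hash)
  columns = hash.transform_values { |value| value.split(",") }
  row_count = columns.values.map(&:size).max || 0

  # Row i takes the i-th entry of every column (nil when a column is short).
  Array.new(row_count) do |i|
    columns.transform_values { |list| list[i] }
  end
end

rows = split_hash({ name: "cheese,test,third", uid: "1,2" })
p rows.size        # 3 rows, one per entry in the longest list
p rows.last[:uid]  # nil, because uid has only two entries
```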
Input
hash={name:"cheese,test,third,fourth", uid:"1,2,3,4", age:"9,8,7,6" }
Code
p hash
.transform_values { |v| v.split(',') }
.map { |k, v_arr| v_arr.map { |v| [k, v] }
}
.transpose
.map { |array| array.to_h }
Output
[{:name=>"cheese", :uid=>"1", :age=>"9"}, {:name=>"test", :uid=>"2", :age=>"8"}, {:name=>"third", :uid=>"3", :age=>"7"}, {:name=>"fourth", :uid=>"4", :age=>"6"}]
We are given
h = { name: "cheese,test", uid: "1,2" }
Here are two ways to create the desired array. Neither constructs arrays that are then converted to hashes.
#1
First compute
g = h.transform_values { |s| s.split(',') }
#=> {:name=>["cheese", "test"], :uid=>["1", "2"]}
then compute
g.first.last.size.times.map { |i| g.transform_values { |v| v[i] } }
#=> [{:name=>"cheese", :uid=>"1"}, {:name=>"test", :uid=>"2"}]
Note
a = g.first
#=> [:name, ["cheese", "test"]]
b = a.last
#=> ["cheese", "test"]
b.size
#=> 2
#2
This approach does not convert the values of the hash to arrays.
(h.first.last.count(',')+1).times.map do |i|
h.transform_values { |s| s[/(?:\w+,){#{i}}\K\w+/] }
end
#=> [{:name=>"cheese", :uid=>"1"}, {:name=>"test", :uid=>"2"}]
We have
a = h.first
#=> [:name, "cheese,test"]
s = a.last
#=> "cheese,test"
s.count(',')+1
#=> 2
We can express the regular expression in free-spacing mode to make it self-documenting.
/
(?: # begin a non-capture group
\w+, # match one or more word characters followed by a comma
) # end the non-capture group
{#{i}} # execute the preceding non-capture group i times
\K # discard all matches so far and reset the start of the match
\w+ # match one or more word characters
/x # invoke free-spacing regex definition mode

Loop through multi-dimensional array in ruby

This is the question I'm having trouble with.
"Loop through the multi-dimensional Array and print out the full information of even items in the Array (ie the 2nd and 4th array in your multidimensional array)". I'm tasked with outputting all the data in the even-numbered arrays, which are at indexes [1] and [3], so only the information from the arrays derrick and andrew should be printed.
kristopher = ["kris", "palos hills", "708-200", "green"]
derrick = ["D-Rock", "New York", "773-933", "green"]
willie = ["William", "Humbolt Park", "773-987", "Black"]
andrew = ["drew", "northside", "773-123","blue"]
friends = [kristopher, derrick, willie, andrew]
friends.each do |arr|
print arr [0..4]
end
#Output
["kris", "palos hills", "708-200", "green"]["D-Rock", "New York", "773-933", "green"]["William", "Humbolt Park", "773-987", "Black"]["drew", "northside", "773-123", "blue"]
Something like this:
kristopher = ["kris", "palos hills", "708-200", "green"]
derrick = ["D-Rock", "New York", "773-933", "green"]
willie = ["William", "Humbolt Park", "773-987", "Black"]
andrew = ["drew", "northside", "773-123","blue"]
friends = [kristopher, derrick, willie, andrew]
(1...friends.length).step(2).each do |friendIndex|
friend = friends[friendIndex]
print friend
end
You can check Enumerable#partition and Enumerable#each_with_index which are helpful for splitting the array by a condition on the index of elements. If you use Integer#even? you can make a partition between even and odd indexes (+ 1 in this case).
friends.partition.with_index { |_, i| (i + 1).even? }
#=> [[["D-Rock", "New York", "773-933", "green"], ["drew", "northside", "773-123", "blue"]], [["kris", "palos hills", "708-200", "green"], ["William", "Humbolt Park", "773-987", "Black"]]]
So, for your case, take the first element:
friends.partition.with_index { |_, i| (i + 1).even? }.first
Or you can go straight with Enumerable#select:
friends.select.with_index { |_, i| (i + 1).even? }
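Since the exercise asks to print the full information, the selection above can be combined with a print step. The sketch below is one way to do that; even_positioned is a hypothetical helper name, and printing one friend per line with comma-separated fields is an assumption about the expected format.

```ruby
# Returns the items at even positions when counting from 1,
# that is, the items at indexes 1, 3, 5, ...
def even_positioned(items)
  items.select.with_index { |_, index| index.odd? }
end

friends = [
  ["kris", "palos hills", "708-200", "green"],
  ["D-Rock", "New York", "773-933", "green"],
  ["William", "Humbolt Park", "773-987", "Black"],
  ["drew", "northside", "773-123", "blue"]
]

# Prints:
#   D-Rock, New York, 773-933, green
#   drew, northside, 773-123, blue
even_positioned(friends).each { |friend| puts friend.join(", ") }
```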

Ruby: Scanning strings for matching adjacent vowel groups

I am building a script to randomly generate words that sound like English. I have broken down a large number of English words into VCV groups.
...where the V's represent ALL the adjacent vowels in a word and the C represents ALL the adjacent consonants. For example, the English word "miniature" would become
"-mi", "inia", "iatu", and "ure". "school" would become "-schoo" and "ool".
These groups will be assembled together with other groups from other words with
the rule being that the complete set of adjacent ending vowels must match the
complete set of starting vowels for the attached group.
I have constructed a hash in the following structure:
pieces = {
:starters => { "-sma" => 243, "-roa" => 77, "-si" => 984, ...},
:middles => { "iatu" => 109, "inia" => 863, "aci" => 229, ...},
:enders => { "ar-" => 19, "ouid-" => 6, "ude" => 443, ...}
}
In order to construct generated words, a "starter" string would need to end with the same vowel grouping as the "middle" string. The same applies when connecting the "middle" string with the "ender" string. One possible result using the examples above would be "-sma" + "aba" + "ar-" to give "smabar". Another would be "-si" + "inia" + "iatu" + "ude" to give "siniatude".
My problem is that when I sample any two pieces, I don't know how to ensure that the ending V group of the first piece exactly matches the beginning V group of the second piece. For example, "utua" + "uailo" won't work together because "ua" is not the same as "uai". However, a successful pair would be "utua" + "uado" because "ua" = "ua".
def match(first, second)
end_of_first = first[/[aeiou]+$|[^aeiou]+$/]
start_of_second = second[/^[aeiou]+|^[^aeiou]+/]
end_of_first == start_of_second
end
match("utua", "uailo")
# => false
match("inia", "iatu")
# => true
EDIT: I apparently can't read, I thought you just want to match the group (whether vowel or consonant). If you restrict to vowel groups, it's simpler:
end_of_first = first[/[aeiou]+$/]
start_of_second = second[/^[aeiou]+/]
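Put together, the vowel-only check might be used to filter candidate pieces or to validate a whole chain, roughly as sketched below. The helper names vowel_match? and valid_chain? are hypothetical. The sketch also treats a piece with no vowel group at the joining edge as a non-match, which is an assumption; with the plain == comparison, two nil results would compare as equal.

```ruby
# Hypothetical helpers built on the vowel-only regexes, anchored with
# \A and \z so they apply to the whole string.
def vowel_match?(first, second)
  tail = first[/[aeiou]+\z/]   # trailing vowel group, or nil
  head = second[/\A[aeiou]+/]  # leading vowel group, or nil
  !tail.nil? && tail == head
end

# True when every adjacent pair of pieces shares its vowel group.
def valid_chain?(pieces)
  pieces.each_cons(2).all? { |first, second| vowel_match?(first, second) }
end

candidates = ["uailo", "uado", "iatu"]
candidates.select { |piece| vowel_match?("utua", piece) }
#=> ["uado"]

valid_chain?(["-si", "inia", "iatu", "ude"])  #=> true
valid_chain?(["utua", "uailo"])               #=> false
```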
Since you're already pre-processing the dictionary, I suggest doing a little more preprocessing to make generation simpler. I have two suggestions. First, for the starters and middles, separate each into a tuple (for which, in Ruby, we just use a two-element array) of the form (VC, V), so e.g. "inia" becomes ["in", "ia"]:
starters = [
[ "-sm", "a" ],
[ "-r", "oa" ],
[ "-s", "i" ],
# ...
]
We store the starters in an array since we just need to choose one at random, which we can do with Array#sample:
starter, middle1_key = starters.sample
puts starter # => "-sm"
puts middle1_key # => "a"
We want to be able to look up middles by their initial V groups, so we put those tuples in a Hash instead, with their initial V groups as keys:
middles = {
"ia" => [
[ "iat", "u" ],
[ "iabl", "e" ],
],
"i" => [
[ "in", "ia" ],
# ...
],
"a" => [
[ "ac", "i" ],
# ...
],
# ...
}
Since we stored the starter's final V group in middle1_key above, we can now use that as a key to get the array of middle tuples whose initial V group matches, and choose one at random as we did above:
possible_middles1 = middles[middle1_key]
middle1, middle2_key = possible_middles1.sample
puts middle1 # => "ac"
puts middle2_key # => "i"
Just for kicks, let's pick a second middle:
middle2, ender_key = middles[middle2_key].sample
puts middle2 # => "in"
puts ender_key # => "ia"
Our enders we don't need to store in tuples, since we won't be using any part of them to look anything up like we did with middles. We can just put them in a hash whose keys are the initial V groups and whose values are arrays of all of the enders with that initial V group:
enders = {
"a" => [ "ar-", ... ],
"oui" => [ "ouid-", ... ],
"u" => [ "ude-", ... ],
"ia" => [ "ial-", "iar-", ... ]
# ...
}
We stored the second middle's final V group in ender_key above, which we can use to get the array of matching enders:
possible_enders = enders[ender_key]
ender = possible_enders.sample
puts ender # => "iar-"
Now that we have four parts, we just put them together to form our word:
puts starter + middle1 + middle2 + ender
# => -smaciniar-
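The steps above can be collected into one method. The following is only a sketch under stated assumptions: the constants hold a tiny hypothetical data set in the shapes described above (the ender "id-" is invented so that every middle has a matching ender), generate_word is a hypothetical name, and it always uses exactly one middle.

```ruby
# Hypothetical sample data: starters and middles are [piece, final V group]
# pairs; middles and enders are keyed by their initial V group.
STARTERS = [["-sm", "a"], ["-s", "i"]].freeze
MIDDLES = {
  "a" => [["ac", "i"]],
  "i" => [["in", "ia"]]
}.freeze
ENDERS = {
  "i"  => ["id-"],          # invented for this example
  "ia" => ["ial-", "iar-"]
}.freeze

# Builds starter + middle + ender, following the stored vowel keys.
# fetch raises KeyError if a vowel group has no continuation.
def generate_word(rng = Random.new)
  starter, key = STARTERS.sample(random: rng)
  middle, key = MIDDLES.fetch(key).sample(random: rng)
  ender = ENDERS.fetch(key).sample(random: rng)
  (starter + middle + ender).delete("-")
end

puts generate_word
```

With this data the only possible results are "smacid", "sinial" and "siniar". Passing a seeded Random makes a run repeatable, which can help when testing.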
Edit
The data structures above omit the relative frequencies (I wrote the above before I had a chance to read your answer to my question about the numbers). Obviously it's trivial to also store the relative frequencies alongside the parts, but I don't know off the top of my head a fast way to then choose parts in a weighted fashion. Hopefully my answer is of some use to you regardless.
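One simple (not necessarily fast) way to choose in a weighted fashion is to pick a random number below the total weight and walk the entries until that number is used up. The sketch below assumes the weights are positive integers, as in the question's hashes; weighted_sample is a hypothetical name, and each draw is linear in the number of entries.

```ruby
# weights maps each piece to an integer frequency, for example the
# :starters hash from the question.
def weighted_sample(weights, rng = Random.new)
  target = rng.rand(weights.values.sum)  # integer in 0...total
  weights.each do |piece, weight|
    return piece if target < weight
    target -= weight                     # skip past this entry's share
  end
end

starters = { "-sma" => 243, "-roa" => 77, "-si" => 984 }
puts weighted_sample(starters)
```

If many draws are needed from a large hash, precomputing cumulative sums and using a binary search would likely be faster, but that is beyond this sketch.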
You can do that using the methods Enumerable#flat_map, String#partition, Enumerable#chunk and a few more familiar ones:
def combine(arr)
arr.flat_map { |s| s.partition /[^aeiou-]+/ }.
chunk { |s| s }.
map { |_, a| a.first }.
join.delete('-')
end
combine ["-sma", "aba", "ar-"] #=> "smabar"
combine ["-si", "inia", "iatu", "ude"] #=> "siniatude"
combine ["utua", "uailo", "orsua", "uav-"] #=> "utuauailorsuav"
To see how this works, let's look at the last example:
arr = ["utua", "uailo", "orsua", "uav-"]
a = arr.flat_map { |s| s.partition /[^aeiou-]+/ }
#=> ["u", "t", "ua", "uai", "l", "o", "o", "rs", "ua", "ua", "v", "-"]
enum = a.chunk { |s| s }
#=> #<Enumerator: #<Enumerator::Generator:0x007fdd14963888>:each>
We can see the elements of this enumerator by converting it to an array:
enum.to_a
#=> [["u", ["u"]], ["t", ["t"]], ["ua", ["ua"]], ["uai", ["uai"]],
# ["l", ["l"]], ["o", ["o", "o"]], ["rs", ["rs"]], ["ua", ["ua", "ua"]],
# ["v", ["v"]], ["-", ["-"]]]
b = enum.map { |_, a| a.first }
#=> ["u", "t", "ua", "uai", "l", "o", "rs", "ua", "v", "-"]
s = b.join
#=> "utuauailorsuav-"
s.delete('-')
#=> "utuauailorsuav"

Ruby idiom for sorting on two fields

I need the Ruby idiom for sorting on two fields. In Python if you sort a list of two-element tuples, it sorts based on the first element, and if two elements are equal then the sort is based on the second element.
One example is the following sorting code in Python (word sort from longest to shortest and consider second element to break ties) from http://www.pythonlearn.com/html-008/cfbook011.html
txt = 'but soft what light in yonder window breaks'
words = txt.split()
t = list()
for word in words:
t.append((len(word), word))
t.sort(reverse=True)
res = list()
for length, word in t:
res.append(word)
print res
What I came up with in Ruby is the following code that uses structs:
txt = 'but soft what light in yonder window breaks'
words = txt.split()
t = []
tuple = Struct.new(:len, :word)
for word in words
tpl = tuple.new
tpl.len = word.length
tpl.word = word
t << tpl
end
t = t.sort {|a, b| a[:len] == b[:len] ?
b[:word] <=> a[:word] : b[:len] <=> a[:len]
}
res = []
for x in t
res << x.word
end
puts res
I would like to know if there are better ways (less code) to achieve this stable sort.
I think you're overthinking this.
txt = 'but soft what light in yonder window breaks'
lengths_words = txt.split.map {|word| [ word.size, word ] }
# => [ [ 3, "but" ], [ 4, "soft" ], [ 4, "what" ], [ 5, "light" ], ... ]
sorted = lengths_words.sort
# => [ [ 2, "in" ], [ 3, "but" ], [ 4, "soft" ], [ 4, "what" ], ... ]
If you really want to use Struct, you can:
tuple = Struct.new(:length, :word)
tuples = txt.split.map {|word| tuple.new(word.size, word) }
# => [ #<struct length=3, word="but">, #<struct length=4, word="soft">, ... ]
sorted = tuples.sort_by {|tuple| [ tuple.length, tuple.word ] }
# => [ #<struct length=2, word="in">, #<struct length=3, word="but">, ... ]
This is equivalent:
sorted = tuples.sort {|tuple, other| tuple.length == other.length ?
tuple.word <=> other.word : tuple.length <=> other.length }
(Note that it's sort this time, not sort_by.)
...but since we're using a Struct we can make this nicer by defining our own comparison operator (<=>), which sort will invoke (the same works in any Ruby class):
tuple = Struct.new(:length, :word) do
def <=>(other)
[ length, word ] <=> [ other.length, other.word ]
end
end
tuples = txt.split.map {|word| tuple.new(word.size, word) }
tuples.sort
# => [ #<struct length=2, word="in">, #<struct length=3, word="but">, ... ]
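Once <=> is defined, including Comparable is a small extra step that also provides <, >, between? and similar methods. A minimal sketch follows; the constant name Entry is hypothetical and used only to keep the example self-contained.

```ruby
# Same comparison as above, with Comparable mixed in.
Entry = Struct.new(:length, :word) do
  include Comparable

  def <=>(other)
    [length, word] <=> [other.length, other.word]
  end
end

txt = 'but soft what light in yonder window breaks'
entries = txt.split.map { |word| Entry.new(word.size, word) }

entries.min.word                             #=> "in"
entries.max.word                             #=> "yonder"
Entry.new(4, "soft") < Entry.new(4, "what")  #=> true
```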
There are other options for more complex sorting. If you wanted to get longest words first, for example:
lengths_words = txt.split.map {|word| [ word.size, word ] }
sorted = lengths_words.sort_by {|length, word| [ -length, word ] }
# => [ [ 6, "breaks" ], [ 6, "window" ], [ 6, "yonder" ], [ 5, "light" ], ... ]
Or:
tuple = Struct.new(:length, :word) do
def <=>(other)
[ -length, word ] <=> [ -other.length, other.word ]
end
end
txt.split.map {|word| tuple.new(word.size, word) }.sort
# => [ #<struct length=6, word="breaks">, #<struct length=6, word="window">, #<struct length=6, word="yonder">, ... ]
As you can see, I'm relying a lot on Ruby's built-in ability to sort arrays based on their contents, but you can also "roll your own" if you prefer, which might perform better with many, many items. Here's a comparison method that's equivalent to your t.sort {|a, b| a[:len] == b[:len] ? ... } code (plus a bonus to_s method):
tuple = Struct.new(:length, :word) do
def <=>(other)
return word <=> other.word if length == other.length
length <=> other.length
end
def to_s
"#{word} (#{length})"
end
end
sorted = txt.split.map {|word| tuple.new(word.size, word) }.sort
puts sorted.join(", ")
# => in (2), but (3), soft (4), what (4), light (5), breaks (6), window (6), yonder (6)
Finally, a couple comments on your Ruby style:
You pretty much never see for in idiomatic Ruby code. each is the idiomatic way to do almost all iteration in Ruby, and "functional" methods like map, reduce and select are also common. Never for.
A great advantage of Struct is that you get accessor methods for each property, so you can do tuple.word instead of tuple[:word].
Methods with no arguments are called without parentheses: txt.split.map, not txt.split().map
Ruby makes this easy using Enumerable#sort_by together with Array#<=>.
def sort_on_two(arr, &proc)
arr.sort_by { |e| [proc[e], e] }.reverse
end
txt = 'but soft what light in yonder window breaks'
sort_on_two(txt.split) { |e| e.size }
#=> ["yonder", "window", "breaks", "light", "what", "soft", "but", "in"]
sort_on_two(txt.split) { |e| e.count('aeiou') }
#=> ["yonder", "window", "breaks", "what", "soft", "light", "in", "but"]
sort_on_two(txt.split) { |e| [e.count('aeiou'), e.size] }
#=> ["yonder", "window", "breaks", "light", "what", "soft", "but", "in"]
Note that in recent versions of Ruby, proc.call(e) can be written proc[e], proc.yield(e) or proc.(e).
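A quick way to see that these call forms are interchangeable is to run them side by side. The proc name size_of below is just an illustrative choice.

```ruby
size_of = proc { |word| word.size }

# Four equivalent ways to invoke the same proc.
results = [
  size_of.call("yonder"),
  size_of["yonder"],
  size_of.("yonder"),
  size_of.yield("yonder")
]

p results  # [6, 6, 6, 6]
```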
UPDATE: my first answer was wrong (this time!), thanks to @mu is too short's comment.
Your code is ok to sort on two criteria, but if you just want to achieve the same result, the best is to do the following:
txt.split.sort_by{|a| [a.size,a] }.reverse
#=> ["yonder", "window", "breaks", "light", "what", "soft", "but", "in"]
Arrays are compared element by element: the sizes are compared first, and only if they are equal (the comparison returns zero) are the words themselves compared.
If you really want to keep your data structure, it's same principle:
t.sort_by{ |a| [a[:len],a[:word]] }.reverse
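The two-step comparison described above is what Array#<=> does, and it can be checked directly on a few [size, word] pairs:

```ruby
by_size   = [3, "but"]  <=> [4, "soft"]  #=> -1, sizes differ, words are never compared
by_word   = [4, "soft"] <=> [4, "what"]  #=> -1, sizes tie, so the words decide
identical = [4, "what"] <=> [4, "what"]  #=> 0

p [by_size, by_word, identical]
```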

categorize by hash value

I have an array of hashes with values like:
by_person = [{ :person => "Jane Smith", :filenames => ["Report.pdf", "File2.pdf"]}, {:person => "John Doe", :filenames => ["Report.pdf"] }]
I would like to end up with another array of hashes (by_file) that has each unique value from the filenames key as a key in the by_file array:
by_file = [{ :filename => "Report.pdf", :people => ["Jane Smith", "John Doe"] }, { :filename => "File2.pdf", :people => ["Jane Smith"] }]
I have tried:
by_file = []
by_person.each do |person|
person[:filenames].each do |file|
unless by_file.include?(file)
# list people that are included in file
by_person_each_file = by_person.select{|person| person[:filenames].include?(file)}
by_person_each_file.each do |person|
by_file << {
:file => file,
:people => person[:person]
}
end
end
end
end
as well as:
by_file.map(&:to_a).reduce({}) {|h,(k,v)| (h[k] ||= []) << v; h}
Any feedback is appreciated, thanks!
Doesn't seem too tricky, but the way you're compiling it isn't very efficient:
by_person = [{ :person => "Jane Smith", :filenames => ["Report.pdf", "File2.pdf"]}, {:person => "John Doe", :filenames => ["Report.pdf"] }]
by_file = by_person.each_with_object({ }) do |entry, index|
entry[:filenames].each do |filename|
set = index[filename] ||= [ ]
set << entry[:person]
end
end.collect do |filename, people|
{
filename: filename,
people: people
}
end
puts by_file.inspect
# => [{:filename=>"Report.pdf", :people=>["Jane Smith", "John Doe"]}, {:filename=>"File2.pdf", :people=>["Jane Smith"]}]
This makes use of a hash to group the people by filename, essentially inverting your structure, and then converts that into the final format in a second pass. This is more efficient than working with the final format during compilation as that's not indexed and requires an expensive linear search to find the correct container to insert into.
An alternate method is to create a default hash constructor that makes the structure you're looking for:
by_file_hash = Hash.new do |h,k|
h[k] = {
filename: k,
people: [ ]
}
end
by_person.each do |entry|
entry[:filenames].each do |filename|
by_file_hash[filename][:people] << entry[:person]
end
end
by_file = by_file_hash.values
puts by_file.inspect
# => [{:filename=>"Report.pdf", :people=>["Jane Smith", "John Doe"]}, {:filename=>"File2.pdf", :people=>["Jane Smith"]}]
This may or may not be easier to understand.
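For comparison, the same inversion can also be written with Enumerable#flat_map and Enumerable#group_by: first flatten the data into [filename, person] pairs, then group the pairs by filename. This is only a sketch of a different technique from the two shown above, and files_to_people is a hypothetical name.

```ruby
# Hypothetical helper: flatten to [filename, person] pairs, then group.
def files_to_people(by_person)
  pairs = by_person.flat_map do |entry|
    entry[:filenames].map { |filename| [filename, entry[:person]] }
  end

  # group_by keeps keys in first-seen order, so files appear in the
  # order they are first encountered.
  pairs.group_by(&:first).map do |filename, grouped|
    { filename: filename, people: grouped.map(&:last) }
  end
end

by_person = [
  { person: "Jane Smith", filenames: ["Report.pdf", "File2.pdf"] },
  { person: "John Doe",   filenames: ["Report.pdf"] }
]

p files_to_people(by_person)
```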
This is one way to do it.
Code
def convert(by_person)
by_person.each_with_object({}) do |hf,hp|
hf[:filenames].each do |fname|
hp.update({ fname=>[hf[:person]] }) { |_,oh,nh| oh+nh }
end
end.map { |fname,people| { :filename => fname, :people=>people } }
end
Example
by_person = [{:person=>"Jane Smith", :filenames=>["Report.pdf", "File2.pdf"]},
{:person=>"John Doe", :filenames=>["Report.pdf"]}]
convert(by_person)
#=> [{:filename=>"Report.pdf", :people=>["Jane Smith", "John Doe"]},
# {:filename=>"File2.pdf", :people=>["Jane Smith"]}]
Explanation
For by_person in the example:
enum1 = by_person.each_with_object({})
#=> #<Enumerator: [{:person=>"Jane Smith", :filenames=>["Report.pdf", "File2.pdf"]},
#     {:person=>"John Doe", :filenames=>["Report.pdf"]}]:each_with_object({})>
Let's see what values the enumerator enum1 will pass into the block:
enum1.to_a
#=> [[{:person=>"Jane Smith", :filenames=>["Report.pdf", "File2.pdf"]}, {}],
# [{:person=>"John Doe", :filenames=>["Report.pdf"]}, {}]]
As will be shown below, the empty hash in the first element of the enumerator will no longer be empty when the second element is passed into the block.
The first element is assigned to the block variables as follows (I've indented to indicate the block level):
hf = {:person=>"Jane Smith", :filenames=>["Report.pdf", "File2.pdf"]}
hp = {}
enum2 = hf[:filenames].each
#=> #<Enumerator: ["Report.pdf", "File2.pdf"]:each>
enum2.to_a
#=> ["Report.pdf", "File2.pdf"]
"Report.pdf" is passed to the inner block, assigned to the block variable:
fname = "Report.pdf"
and
hp.update({ "Report.pdf"=>["Jane Smith"] }) { |_,oh,nh| oh+nh }
#=> {"Report.pdf"=>["Jane Smith"]}
is executed, returning the updated value of hp.
Here the block for Hash#update (aka Hash#merge!) is not consulted. It is only needed when the hash hp and the merging hash (here { fname=>["Jane Smith"] }) have one or more common keys. For each common key, the key and the corresponding values from the two hashes are passed to the block. This is elaborated below.
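One way to confirm when the block is consulted is to record the keys it is called with. The sketch below does that with a small recording proc; the names block_calls and combine_people are hypothetical and not part of the code being explained.

```ruby
block_calls = []

# Same merge rule as in the answer, but it also records each key
# for which the block is invoked.
combine_people = proc do |key, old_people, new_people|
  block_calls << key
  old_people + new_people
end

hp = {}
hp.update({ "Report.pdf" => ["Jane Smith"] }, &combine_people)
calls_after_first = block_calls.dup   # still empty: no common key yet

hp.update({ "Report.pdf" => ["John Doe"] }, &combine_people)
calls_after_second = block_calls.dup  # the block ran once, for "Report.pdf"

p calls_after_first
p calls_after_second
```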
Next, enum2 passes "File2.pdf" into the block and assigns it to the block variable:
fname = "File2.pdf"
and executes
hp.update({ "File2.pdf"=>["Jane Smith"] }) { |_,oh,nh| oh+nh }
#=> {"Report.pdf"=>["Jane Smith"], "File2.pdf"=>["Jane Smith"]}
which returns the updated value of hp. Again, update's block was not consulted. We're now finished with Jane, so enum1 next passes its second and last value into the block and assigns the block variables as follows:
hf = {:person=>"John Doe", :filenames=>["Report.pdf"]}
hp = {"Report.pdf"=>["Jane Smith"], "File2.pdf"=>["Jane Smith"]}
Note that hp has now been updated. We then have:
enum2 = hf[:filenames].each
#=> #<Enumerator: ["Report.pdf"]:each>
enum2.to_a
#=> ["Report.pdf"]
enum2 assigns
fname = "Report.pdf"
and executes:
hp.update({ "Report.pdf"=>["John Doe"] }) { |_,oh,nh| oh+nh }
#=> {"Report.pdf"=>["Jane Smith", "John Doe"], "File2.pdf"=>["Jane Smith"]}
In making this update, hp and the hash being merged both have the key "Report.pdf". The following values are therefore passed to the block variables |k,oh,nh|:
k = "Report.pdf"
oh = ["Jane Smith"]
nh = ["John Doe"]
We don't need the key, so I've replaced it with an underscore. The block returns
["Jane Smith"]+["John Doe"] #=> ["Jane Smith", "John Doe"]
which becomes the new value for the key "Report.pdf".
Before turning to the final step, I'd like to suggest that you consider stopping here. That is, rather than constructing an array of hashes, one for each file, just leave it as a hash with the files as keys and arrays of persons as the values:
{ "Report.pdf"=>["Jane Smith", "John Doe"], "File2.pdf"=>["Jane Smith"] }
The final step is straightforward:
hp.map { |fname,people| { :filename => fname, :people=>people } }
#=> [{ :filename=>"Report.pdf", :people=>["Jane Smith", "John Doe"] },
# { :filename=>"File2.pdf", :people=>["Jane Smith"] }]

Resources