Find duplicates in array of hashes on specific keys - ruby

I have an array of hashes (CSV rows, actually) and I need to find and keep all the rows that match two specific keys (user, section). Here is a sample of the data:
[
{ user: 1, role: "staff", section: 123 },
{ user: 2, role: "staff", section: 456 },
{ user: 3, role: "staff", section: 123 },
{ user: 1, role: "exec", section: 123 },
{ user: 2, role: "exec", section: 456 },
{ user: 3, role: "staff", section: 789 }
]
So what I would need to return is an array that contained only the rows where the same user/section combo appears more than once, like so:
[
{ user: 1, role: "staff", section: 123 },
{ user: 1, role: "exec", section: 123 },
{ user: 2, role: "staff", section: 456 },
{ user: 2, role: "exec", section: 456 }
]
The double loop solution I'm trying looks like this:
enrollments.each_with_index do |a, ai|
  enrollments.each_with_index do |b, bi|
    next if ai == bi
    duplicates << b if a[2] == b[2] && a[6] == b[6]
  end
end
but since the CSV is 145K rows it's taking forever.
How can I more efficiently get the output I need?

In terms of efficiency you might want to try this:
grouped = csv_arr.group_by { |row| [row[:user], row[:section]] }
filtered = grouped.values.select { |a| a.size > 1 }.flatten
The first statement groups the records by the :user and :section keys. The result is:
{[1, 123]=>[{:user=>1, :role=>"staff", :section=>123}, {:user=>1, :role=>"exec", :section=>123}],
[2, 456]=>[{:user=>2, :role=>"staff", :section=>456}, {:user=>2, :role=>"exec", :section=>456}],
[3, 123]=>[{:user=>3, :role=>"staff", :section=>123}],
[3, 789]=>[{:user=>3, :role=>"staff", :section=>789}]}
The second statement only selects the values of the groups with more than one member and then it flattens the result to give you:
[{:user=>1, :role=>"staff", :section=>123},
{:user=>1, :role=>"exec", :section=>123},
{:user=>2, :role=>"staff", :section=>456},
{:user=>2, :role=>"exec", :section=>456}]
This should be considerably faster than the nested loop, since it makes a single pass over the rows instead of comparing every pair. Memory-wise, I can't say what the effect will be on a large input; that depends on your machine, its resources, and the size of the file.
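A variant of the same idea, as a minimal sketch: count each [user, section] pair first, then select the rows whose pair occurs more than once. It stores one counter per pair instead of an array per group, and it keeps the rows in their original order. The names rows, counts and duplicates are illustrative, and the data is the sample from the question.

```ruby
rows = [
  { user: 1, role: "staff", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 3, role: "staff", section: 123 },
  { user: 1, role: "exec",  section: 123 },
  { user: 2, role: "exec",  section: 456 },
  { user: 3, role: "staff", section: 789 }
]

# Pass 1: count how many times each [user, section] pair occurs.
counts = Hash.new(0)
rows.each { |row| counts[row.values_at(:user, :section)] += 1 }

# Pass 2: keep every row whose pair occurs more than once.
duplicates = rows.select { |row| counts[row.values_at(:user, :section)] > 1 }
```

Both passes are linear in the number of rows, so this should scale to the 145K-row file in the same way the group_by version does.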

To do this check in memory you don't need a double loop; you can keep an array of unique values and check each new CSV line against it:
require 'csv'
found = []
unique_enrollments = []
CSV.foreach('/path/to/csv') do |row|
  # do whatever you're doing to parse this row into the hash you show in your question:
  # => { user: 1, role: "staff", section: 123 }
  # you might have to do `next if row.header_row?` if the first row is the header
  enrollment = parse_row_into_enrollment_hash(row)
  unique_tuple = [enrollment[:user], enrollment[:section]]
  unless found.include? unique_tuple
    found << unique_tuple
    unique_enrollments << enrollment
  end
end
Now you have unique_enrollments. With this approach you parse the CSV line by line, so you never hold the whole file in memory. Along the way you build a small array of [user, section] tuples for the uniqueness check, plus the array of unique rows.
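One possible refinement, sketched under assumptions: Array#include? scans the whole array on every call, so with many distinct pairs the check itself becomes slow. A Set gives hash-based lookups instead. The helper name first_per_user_and_section is hypothetical, and the rows are assumed to be already parsed into hashes like the ones in the question.

```ruby
require 'set'

# Hypothetical helper: keeps the first row seen for each [user, section] pair.
def first_per_user_and_section(rows)
  seen = Set.new
  # Set#add? returns nil when the element is already present,
  # so select keeps a row only the first time its pair shows up.
  rows.select { |row| seen.add?(row.values_at(:user, :section)) }
end

rows = [
  { user: 1, role: "staff", section: 123 },
  { user: 2, role: "staff", section: 456 },
  { user: 3, role: "staff", section: 123 },
  { user: 1, role: "exec",  section: 123 },
  { user: 2, role: "exec",  section: 456 },
  { user: 3, role: "staff", section: 789 }
]

unique_rows = first_per_user_and_section(rows)
```

With the sample data this keeps four rows, one per distinct pair, in the order they first appear.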
You can further optimize this by not saving the unique_enrollments in a big array but rather just building your Model and saving it to the db:
unless found.include? unique_tuple
  found << unique_tuple
  Enrollment.create enrollment
end
With the above tweak you save memory by not keeping a big array of enrollments. The drawback is that if something blows up partway through, you can't roll back. Had we kept the array of unique_enrollments instead, at the end you could do:
Enrollment.transaction do
  # unique_enrollments holds plain hashes, so create the records from them
  unique_enrollments.each { |enrollment| Enrollment.create!(enrollment) }
end
Now you can roll back if any of those saves blows up. Also, wrapping a bunch of db calls in a single transaction is much faster. I'd go with this approach.
Edit: Using the array of unique_enrollments you can iterate over these at the end and create a new CSV:
CSV.open('path/to/new/csv', 'w') do |csv|
  csv << ['user', 'role', 'section'] # write the header
  unique_enrollments.each do |enrollment|
    csv << enrollment.values # just the values, not the keys
  end
end

Related

compare different objects of array with same value in ruby

I have a use case where I have two different arrays of objects that share the same attributes, with different or identical values.
Ex:
x = [#<User _id: 32864efe, question: "comments", answer: "testing">]
y = [#<ActionController::Parameters {"question"=>"comments", "answer"=>"testing"} permitted: true>]
Now, when I try to get the difference between these two arrays, I expect it to be empty when the objects hold the same values; however, it returns the response below.
x - y => [#<User _id: 32864efe, question: "comments", answer: "testing">]
In some cases I may have more than one User object in x. In that case, it should return the difference.
Can you please suggest how to handle this? Any help would be appreciated.
Is this what you're looking for?
def compare_arrays(array1, array2)
  result = []
  # Iterate through each object in array1
  array1.each do |obj1|
    # Check if the object exists in array2
    matching_obj = array2.find { |obj2| obj1 == obj2 }
    # If no matching object was found, add the object to the result array
    result << obj1 unless matching_obj
  end
  result
end
This function iterates through each object in array1 and checks if there is a matching object in array2. If no matching object is found, the object is added to the result array.
You can use this function like this:
array1 = [{ id: 1, name: 'John' }, { id: 2, name: 'Jane' }]
array2 = [{ id: 1, name: 'John' }]
result = compare_arrays(array1, array2)
# result is [{ id: 2, name: 'Jane' }]
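Because the function above relies on ==, it only matches objects of the same kind. For the situation in the question, where one side holds model objects and the other holds parameter hashes, one option is to compare a chosen list of attributes instead. The following is a minimal sketch under assumptions: UserRecord is a hypothetical Struct standing in for the User model, plain string-keyed hashes stand in for the parameters, the second record is made up for illustration, and records_not_in is an illustrative name.

```ruby
require 'set'

# Hypothetical stand-in for the User model in the question.
UserRecord = Struct.new(:id, :question, :answer)

# Returns the records whose selected attributes match none of the param hashes.
def records_not_in(records, param_hashes, attrs)
  # Build the set of attribute tuples present on the params side.
  seen = param_hashes.map { |h| attrs.map { |a| h[a.to_s] } }.to_set
  # Drop every record whose tuple of the same attributes is in that set.
  records.reject { |r| seen.include?(attrs.map { |a| r.public_send(a) }) }
end

users = [
  UserRecord.new('32864efe', 'comments', 'testing'),
  UserRecord.new('a1b2c3d4', 'rating', '5') # made-up second record
]
params = [{ 'question' => 'comments', 'answer' => 'testing' }]

remaining = records_not_in(users, params, %i[question answer])
```

Here remaining holds only the second record, since the first one matches the params on both question and answer.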

How to access two elements within nested hashes within an array?

I have the following array with nested hashes:
pizza = [
  { flavor: "cheese", extras: { topping1: 1, topping2: 2, topping3: 3 } },
  { flavor: "buffalo chicken", extras: { topping1: 1, topping2: 2, topping3: 3 } }
]
I want to verify that I can get an order of "buffalo chicken" pizza with two toppings. I use the .map method to iterate through the array of hashes to verify that the "flavor" I want and the "extras" I want (2 toppings) are available. Bingo! The code I use works, returns true, and indeed these two elements are available. BUT, if I want to check whether the "buffalo chicken" flavor is available with 5 toppings, it should return false; instead, I get an error message that says:
Failure Error: expect(Party).not_to be_available(pizza, "buffalo chicken", :toppings5) to return false, got []
Here is my code:
def self.available?(pizza, flavor, extra)
  pizza.map { |x| x if x[:flavor] == flavor && x[:extra] == extra }
end
I'm trying to figure out why I get [] returned rather than false. Perhaps there is something I'm not understanding with the way .map is being used to iterate through my array of hashes? Without changing the structure of my array of hashes, could someone please help me understand?
You have several problems here:
Keys in a hash must be unique, so if the same toppings key appears more than once, only the last one is kept. For example, the hash { key: 1, key: 2, key: 3 } becomes { key: 3 }.
Avoid using hash as a variable name: every object already has a hash method, and a local variable with that name shadows it.
To find an element in an array of hashes, you can use the find method, e.g.:
>> h = [{ f: "cheese", extras: [1,2,3] }, { f: "buffalo", extras: [1,3] }]
>> h.find { |h| h[:f] == "cheese" && h[:extras].size > 2 }
=> {:f=>"cheese", :extras=>[1, 2, 3]}
There are a lot of methods for iterating over an array or hash. Read more about the Enumerable module, and check the documentation.
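As for why the method never returns false: map always returns an array, and any array (even an empty one) is truthy in Ruby. A predicate method usually wants any? instead, which returns true or false. Here is a minimal sketch, assuming "available" means the extras hash contains the requested key; that interpretation, and the use of key?, are assumptions rather than something stated in the question.

```ruby
pizzas = [
  { flavor: "cheese", extras: { topping1: 1, topping2: 2, topping3: 3 } },
  { flavor: "buffalo chicken", extras: { topping1: 1, topping2: 2, topping3: 3 } }
]

# True when some pizza has the given flavor and lists the given extra.
def available?(pizzas, flavor, extra)
  pizzas.any? { |pizza| pizza[:flavor] == flavor && pizza[:extras].key?(extra) }
end

two_toppings  = available?(pizzas, "buffalo chicken", :topping2)
five_toppings = available?(pizzas, "buffalo chicken", :topping5)
```

With the data above, two_toppings is true and five_toppings is false, without changing the structure of the array of hashes.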

Is there any way to check if hashes in an array contains similar key value pairs in ruby?

For example, I have
array = [ {name: 'robert', nationality: 'asian', age: 10},
{name: 'robert', nationality: 'asian', age: 5},
{name: 'sira', nationality: 'african', age: 15} ]
I want to get the result as
array = [ {name: 'robert', nationality: 'asian', age: 15},
{name: 'sira', nationality: 'african', age: 15} ]
since there are 2 Robert's with the same nationality.
Any help would be much appreciated.
I have tried Array.uniq! {|e| e[:name] && e[:nationality] } but I want to add both numbers in the two hashes which is 10 + 5
P.S: Array can have n number of hashes.
I would start with something like this:
array = [
{ name: 'robert', nationality: 'asian', age: 10 },
{ name: 'robert', nationality: 'asian', age: 5 },
{ name: 'sira', nationality: 'african', age: 15 }
]
array.group_by { |e| e.values_at(:name, :nationality) }
.map { |_, vs| vs.first.merge(age: vs.sum { |v| v[:age] }) }
#=> [
# {
# :name => "robert",
# :nationality => "asian",
# :age => 15
# }, {
# :name => "sira",
# :nationality => "african",
# :age => 15
# }
# ]
Let's take a look at what you want to accomplish and go from there. You have a list of some objects, and you want to merge certain objects together if they have the same nationality and name. So we have a key by which we will merge. Let's put that in programming terms.
key = proc { |x| [x[:name], x[:nationality]] }
We've defined a procedure which takes a hash and returns its "key" value. If this procedure returns the same value (according to eql?) for two hashes, then those two hashes need to be merged together. Now, what do we mean by "merge"? You want to add the ages together, so let's write a merge function.
merge = proc { |x, y| x.dup.tap { |x1| x1[:age] += y[:age] } }
If we have two values x and y such that key[x] and key[y] are the same, we want to merge them by making a copy of x and adding y's age to it. That's exactly what this procedure does. Now that we have our building blocks, we can write the algorithm.
We want to produce an array at the end, after merging using the key procedure we've written. Fortunately, Ruby has a handy function called each_with_object which will do something very nice for us. The method each_with_object will execute its block for each element of the array, passing in a predetermined value as the other argument. This will come in handy here.
result = array.each_with_object({}) do |x, hsh|
# ...
end.values
Since we're using keys and values to do the merge, the most efficient way to do this is going to be with a hash. Hence, we pass in an empty hash as the extra object, which we'll modify to accumulate the merge results. At the end, we don't care about the keys anymore, so we write .values to get just the objects themselves. Now for the final pieces.
if hsh.include? key[x]
  hsh[ key[x] ] = merge.call hsh[ key[x] ], x
else
  hsh[ key[x] ] = x
end
Let's break this down. If the hash already includes key[x], which is the key for the object x that we're looking at, then we want to merge x with the value that is currently at key[x]. This is where we add the ages together. This approach only works if the merge operation forms what mathematicians call a semigroup, which is a fancy way of saying that the operation is associative. You don't need to worry too much about that; addition is a very good example of an associative operation, so it works here.
Anyway, if the key doesn't exist in the hash, we want to put the current value in the hash at the key position. The resulting hash from merging is returned, and then we can get the values out of it to get the result you wanted.
key = proc { |x| [x[:name], x[:nationality]] }
merge = proc { |x, y| x.dup.tap { |x1| x1[:age] += y[:age] } }
result = array.each_with_object({}) do |x, hsh|
  if hsh.include? key[x]
    hsh[ key[x] ] = merge.call hsh[ key[x] ], x
  else
    hsh[ key[x] ] = x
  end
end.values
Now, my complexity theory is a bit rusty, but if Ruby implements its hash type efficiently (which I'm fairly certain it does), then this merge algorithm is O(n), which means it will take a linear amount of time to finish, given the problem size as input.
array.each_with_object(Hash.new(0)) { |g,h| h[[g[:name], g[:nationality]]] += g[:age] }.
  map { |(name, nationality),age| { name:name, nationality:nationality, age:age } }
  #=> [{ :name=>"robert", :nationality=>"asian", :age=>15 },
  #    { :name=>"sira", :nationality=>"african", :age=>15 }]
The two steps are as follows.
a = array.each_with_object(Hash.new(0)) { |g,h| h[[g[:name], g[:nationality]]] += g[:age] }
#=> { ["robert", "asian"]=>15, ["sira", "african"]=>15 }
This uses the class method Hash::new to create a hash with a default value of zero; inside the block, that hash is the block variable h. Once this hash has been obtained, it is a simple matter to construct the desired array of hashes:
a.map { |(name, nationality),age| { name:name, nationality:nationality, age:age } }
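To make the role of the default value concrete, here is a small sketch of how Hash.new(0) behaves when used as an accumulator. The variable name totals and the extra lookup key are illustrative only.

```ruby
totals = Hash.new(0)

# `+=` reads the current value (the default 0 for a missing key),
# adds to it, and stores the result under that key.
totals[["robert", "asian"]] += 10
totals[["robert", "asian"]] += 5
totals[["sira", "african"]] += 15

# Merely reading a missing key returns the default but does not store it.
missing = totals[["nobody", "nowhere"]]
stored  = totals.key?(["nobody", "nowhere"])
```

After these lines, totals maps ["robert", "asian"] and ["sira", "african"] to 15 each, missing is 0, and stored is false. Arrays work as keys here because two arrays with equal contents are eql? and hash to the same value.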

How to get the right csv format from hash in ruby

Hash to csv
hash :
{
  "employee" => [
    {
      "name" => "Claude",
      "lastname" => "David",
      "profile" => [
        {
          "age" => "43",
          "jobs" => [
            {
              "name" => "Ingeneer",
              "year" => "5"
            }
          ],
          "graduate" => [
            {
              "place" => "Oxford",
              "year" => "1990"
            }
          ],
          "kids" => [
            {
              "name" => "Viktor",
              "age" => "18"
            }
          ]
        }
      ]
    }
  ]
}
This is an example of a hash I am working on. As you can see, it has many levels of nested arrays.
My question is: how do I write it properly to a CSV file?
I tried this :
require 'csv'
column_names = hash['employee'].first.keys
s = CSV.generate do |csv|
  csv << column_names
  hash['employee'].each do |x|
    csv << x.values
  end
end
File.write('myCSV.csv', s)
but I only get name, lastname and profile as keys, when I want to capture all of them (age, jobs, name, year, graduate, place...).
Besides, how can I put one value per cell?
At the moment each employee[x] value ends up in a single cell. Is there a parameter I have missed?
PS: This may be a follow-up to this post
A valid CSV output has a fixed number of columns, but your hash has a variable number of values. The keys jobs, graduate and kids could all have multiple values.
If your only goal is to make a CSV output that can be read in Excel for example, you could enumerate your Hash, take the maximum number of key/value pairs per key, total it and then write your CSV output, filling the blank values with "".
There are plenty of examples here on Stack Overflow, search for "deep hash" to start with.
Your result would have a different number of columns with each Hash you provide it.
That's too much work if you ask me.
If you just want to present a readable result, your best and easiest option is to convert the Hash to YAML which is created for readability:
require 'yaml'
hash = {.....}
puts hash.to_yaml
---
employee:
- name: Claude
  lastname: David
  profile:
  - age: '43'
    jobs:
    - name: Ingeneer
      year: '5'
    graduate:
    - place: Oxford
      year: '1990'
    kids:
    - name: Viktor
      age: '18'
If you want to convert the hash to a CSV file or record, you'll need to get a 'flat' representation of your keys and values. Something like the following:
h = {
  a: 1,
  b: {
    c: 3,
    d: 4,
    e: {
      f: 5
    },
    g: 6
  }
}
# Walk keys depth-first so their order lines up with flat_values below.
def flat_keys(h)
  h.flat_map { |k, v| v.is_a?(Hash) ? flat_keys(v) : k }
end
flat_keys(h)
# [:a, :c, :d, :f, :g]
def flat_values(h)
  h.values.flat_map { |v| v.is_a?(Hash) ? flat_values(v) : v }
end
flat_values(h)
# [1, 3, 4, 5, 6]
Then you can apply that to create a CSV output.
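If the nested data also contains arrays, or repeats a key name at different levels (as name does under jobs and kids in the question), one way to keep the columns distinct is to build each header from the full path to the value. This is a minimal sketch: flatten_pairs is a hypothetical helper, and the dotted-path naming is just one possible convention.

```ruby
# Returns [path, value] pairs for every leaf, walking hashes and arrays.
# Array elements contribute their index to the path.
def flatten_pairs(value, prefix = nil)
  case value
  when Hash
    value.flat_map { |k, v| flatten_pairs(v, [prefix, k].compact.join('.')) }
  when Array
    value.each_with_index.flat_map { |v, i| flatten_pairs(v, [prefix, i].compact.join('.')) }
  else
    [[prefix, value]]
  end
end

h = { a: 1, b: { c: 3, d: 4, e: { f: 5 }, g: 6 } }

pairs = flatten_pairs(h)
headers, values = pairs.transpose
```

For the hash above, headers is ["a", "b.c", "b.d", "b.e.f", "b.g"] and values is [1, 3, 4, 5, 6]. Because each header is produced together with its value, the two lists stay aligned and can be written as a header row and a data row.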
It depends on how those fields are represented in the database.
For example, your jobs has a hash with name key and your kids also has a hash with name key, so you can't just 'flatten' them, because keys have to be unique.
jobs is probably another model (database table), so you probably would have to (depending on the database) write it separately, including things like the id of the related object and so on.
Are you sure you're not in over your head? Judging from your last question, you seem to treat CSVs as simple key-value pairs, omitting all the database representation and relations.

Rails: Grouping Model records, then manipulating values

I am grouping Model instances, by attribute, then manipulating the hash values.
Product.create(id: 1, name: "alpha", value: "apple")
Product.create(id: 2, name: "beta", value: "bongo")
...
We want the form: [["alpha", [["apple", 1]]], ["beta", [["bongo", 2]]], ...]
array = []
array1 = []
Product.all.group_by(&:name).each do |a|
  a[1].each do |b|
    array1 << [b.value, b.id]
  end
  array << [a[0], array1]
  array1 = []
end
Where a and b are iterator variables, array1 contains the ith a[1] values, and array contains the desired output structure.
This works, but is ugly. Can you accomplish this more cleanly?
array = Product.all.group_by(&:name).map { |name, products|
  [name, products.map { |product| [product.value, product.id] }]
}
I believe this is what you want, but I'm not 100% sure. Please try to use descriptive names; it takes a lot of effort to figure out what array1, b and similar non-identifying identifiers are. It is also nice if you post an example of the output structure.
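To try the group_by and map combination without a database, here is a self-contained sketch. ProductRow is a hypothetical Struct standing in for the Product model, and the third record is made up to show a group with more than one member.

```ruby
# Hypothetical stand-in for the Product model.
ProductRow = Struct.new(:id, :name, :value)

products = [
  ProductRow.new(1, "alpha", "apple"),
  ProductRow.new(2, "beta", "bongo"),
  ProductRow.new(3, "alpha", "avocado") # made-up extra record
]

# Group by name, then reduce each group to [value, id] pairs.
grouped = products.group_by(&:name).map do |name, items|
  [name, items.map { |item| [item.value, item.id] }]
end
```

With these three records, grouped is [["alpha", [["apple", 1], ["avocado", 3]]], ["beta", [["bongo", 2]]]], which has the same shape as the output of the original nested loop.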
