Optimise large CSV whilst using #group_by & each_with_object - ruby

I have a CSV file with 498,766 rows. The contents of the CSV are pulled remotely and stuffed into a tempfile. Once I have the tempfile I group by a specific column and then go through each of the objects and create a new hash.
report = ::RestClient::Request.execute(
url: report_url,
method: :get,
headers: {Authorization: basic_auth.to_s}
)
@file = ::Tempfile.new(["#{report_run.result.filename}", ".csv"])
@file.write(report.body.force_encoding("UTF-8"))
@file.rewind
time = Benchmark.realtime do
  ::CSV.foreach(@file, headers: true)
    .group_by { |fee| fee['charge_id'] }
    .each_with_object({}) { |key, hash| hash[key.first] = key.last.sum { |fee| fee['total_amount'].to_f}.round(2) }
end
Benchmarking the above, it takes about 52 seconds, which seems relatively long to me. Are there any further optimisations I can make here?
For added clarity the CSV I'm looking at contains columns: charge_id and total_amount. It is possible for there to be multiple rows with the same charge_id and as such I consolidate them and then sum the total value. A better representation of what the CSV rows would look like is something like:
#
# Note this is a dummy representation of CSV data that would come back from
# doing ::CSV.foreach(@file, headers: true)
#
csv_data = [
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Xt4Kqv3kyKfAnBz9ZJGJ', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Yu4Kqv3kyKfA7CnwoNEo', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79ZQ4Kqv3kyKfAYZMLs8tW', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Ze4Kqv3kyKfAmNbovTjO', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Zs4Kqv3kyKfA38s1yVmq', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Zy4Kqv3kyKfA99Arn1Lh', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79b04Kqv3kyKfA8uYHL0DY', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79bS4Kqv3kyKfAAWxowFGO', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79dS4Kqv3kyKfADejRhlbZ', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79gM4Kqv3kyKfA30s5NTAj', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79hc4Kqv3kyKfAxJWbu8Ny', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79j64Kqv3kyKfATjAI1JcC', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79jk4Kqv3kyKfAKYdakMAk', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79k64Kqv3kyKfAXmpONrNI', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79le4Kqv3kyKfAJMzltr6U', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79lu4Kqv3kyKfAdHG5Qw6r', total_amount: 10.0)
].group_by { |fee| fee['charge_id'] }.each_with_object({}) { |key, hash| hash[key.first] = key.last.sum { |fee| fee['total_amount'].to_f}.round(2) }
#=>
{"ch_1G79Pi4Kqv3kyKfABfXoXycx"=>70.0,
"ch_1G79Xt4Kqv3kyKfAnBz9ZJGJ"=>10.0,
"ch_1G79Yu4Kqv3kyKfA7CnwoNEo"=>10.0,
"ch_1G79ZQ4Kqv3kyKfAYZMLs8tW"=>10.0,
"ch_1G79Ze4Kqv3kyKfAmNbovTjO"=>10.0,
"ch_1G79Zs4Kqv3kyKfA38s1yVmq"=>10.0,
"ch_1G79Zy4Kqv3kyKfA99Arn1Lh"=>10.0,
"ch_1G79b04Kqv3kyKfA8uYHL0DY"=>10.0,
"ch_1G79bS4Kqv3kyKfAAWxowFGO"=>10.0,
"ch_1G79dS4Kqv3kyKfADejRhlbZ"=>10.0,
"ch_1G79gM4Kqv3kyKfA30s5NTAj"=>10.0,
"ch_1G79hc4Kqv3kyKfAxJWbu8Ny"=>10.0,
"ch_1G79j64Kqv3kyKfATjAI1JcC"=>10.0,
"ch_1G79jk4Kqv3kyKfAKYdakMAk"=>10.0,
"ch_1G79k64Kqv3kyKfAXmpONrNI"=>10.0,
"ch_1G79le4Kqv3kyKfAJMzltr6U"=>10.0,
"ch_1G79lu4Kqv3kyKfAdHG5Qw6r"=>10.0}

A more direct way to compute the desired hash from csv_data follows. Because it requires a single pass through the array, I expect it will speed things up but have not done a benchmark.
require 'ostruct'
csv_data.each_with_object(Hash.new(0)) do |os,h|
h[os[:charge_id]] += os[:total_amount]
end
#=> {"ch_1G79Pi4Kqv3kyKfABfXoXycx"=>70.0,
# "ch_1G79Xt4Kqv3kyKfAnBz9ZJGJ"=>10.0,
# "ch_1G79Yu4Kqv3kyKfA7CnwoNEo"=>10.0,
# "ch_1G79ZQ4Kqv3kyKfAYZMLs8tW"=>10.0,
# "ch_1G79Ze4Kqv3kyKfAmNbovTjO"=>10.0,
# "ch_1G79Zs4Kqv3kyKfA38s1yVmq"=>10.0,
# "ch_1G79Zy4Kqv3kyKfA99Arn1Lh"=>10.0,
# "ch_1G79b04Kqv3kyKfA8uYHL0DY"=>10.0,
# "ch_1G79bS4Kqv3kyKfAAWxowFGO"=>10.0,
# "ch_1G79dS4Kqv3kyKfADejRhlbZ"=>10.0,
# "ch_1G79gM4Kqv3kyKfA30s5NTAj"=>10.0,
# "ch_1G79hc4Kqv3kyKfAxJWbu8Ny"=>10.0,
# "ch_1G79j64Kqv3kyKfATjAI1JcC"=>10.0,
# "ch_1G79jk4Kqv3kyKfAKYdakMAk"=>10.0,
# "ch_1G79k64Kqv3kyKfAXmpONrNI"=>10.0,
# "ch_1G79le4Kqv3kyKfAJMzltr6U"=>10.0,
# "ch_1G79lu4Kqv3kyKfAdHG5Qw6r"=>10.0}
See the doc for the version of Hash::new that takes an argument called the default value.
If the data is received from a remote source a line at a time one could do the processing on the fly, while receiving the data, by writing something like the following.
CSV.foreach(@file, headers: true).
  with_object(Hash.new(0)) do |csv,h|
    # <your processing to produce `os`, a line of csv_data>
    h[os[:charge_id]] += os[:total_amount]
  end
If this could be done, it would have to be benchmarked to see if it actually improved performance.
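As a rough sketch of the on-the-fly approach, assuming the rows carry the charge_id and total_amount columns described in the question, each row can be added to the running totals as it is read. The sample data and the names sample and totals are made up for illustration, and a local temporary file stands in for the downloaded report:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample data standing in for the downloaded report.
sample = <<~ROWS
  charge_id,total_amount
  ch_a,10.0
  ch_a,2.5
  ch_b,4.25
ROWS

totals = Tempfile.create(['report', '.csv']) do |file|
  file.write(sample)
  file.flush

  # CSV.foreach without a block returns an enumerator, so rows are
  # consumed one at a time instead of being grouped into arrays first.
  CSV.foreach(file.path, headers: true).with_object(Hash.new(0)) do |row, h|
    h[row['charge_id']] += row['total_amount'].to_f
  end
end

p totals
```

Whether this is faster than group_by on the real file would still need to be measured.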
For readers unfamiliar with this form of Hash::new, suppose
h = Hash.new(0)
making h's default value zero. All that means is that if h does not have a key k, h[k] returns zero, which I'll write
h[k] #=> 0
Let's add a key-value pair: h[:dog] = 1. Then
h #=> { :dog=>1 }
and
h[:dog] #=> 1
Since h does not have a key :cat
h[:cat] #=> 0
Suppose now we write
h[:dog] += 1
That's the same as
h[:dog] = h[:dog] + 1
which equals
h[:dog] = 1 + 1 #=> 2
Similarly,
h[:cat] += 1
means
h[:cat] = h[:cat] + 1
= 0 + 1
= 1
because h[:cat] on the right (the method Hash#[], as contrasted with the method Hash#[]= on the left) returns zero. At this point
h #=> { :dog=>2, :cat=>1 }
When a hash is defined in this way it is sometimes called a counting hash. It's effectively the same as
h = {}
[1,3,1,2,2].each do |n|
h[n] = 0 unless h.key?(n)
h[n] += 1
end
h #=> {1=>2, 3=>1, 2=>2}
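The behaviour described above can be checked with a small sketch. The variable names are only illustrative; the point is that reading a missing key returns the default without storing anything, while += goes through Hash#[]= and does store the key:

```ruby
h = Hash.new(0)

missing_read = h[:cat]             # Hash#[] returns the default for a missing key...
stored_after_read = h.key?(:cat)   # ...but the key is not stored by the read.

h[:cat] += 1                       # Hash#[]= stores the key explicitly.
stored_after_write = h.key?(:cat)

# Hash#fetch ignores the default value and raises for a missing key.
fetch_error = begin
  h.fetch(:dog)
rescue KeyError => e
  e.class
end

p [missing_read, stored_after_read, stored_after_write, fetch_error, h]
```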

You're doing two passes through the data, one to do the grouping (group_by) and one to accumulate the sums. Here's an example showing a single pass that does both at once along with your original. I included benchmarking.
From my tests, the one-pass method is almost 100% faster. Your mileage may vary. Also, note that I removed header info when reading the data in my method. This further reduces processing overhead and memory manipulation.
require 'csv'
require 'benchmark'
filename = './data.csv'
def one_pass(filename)
file = File.open(filename, 'r')
csv = CSV.new(file)
headers = csv.shift # get rid of headers
results = Hash.new(0)
csv.each do |row|
charge_id, total_amount = row
results[charge_id] += total_amount.to_f
end
file.close
return results
end
def with_group_by(filename)
file = File.open(filename, 'r')
results = CSV.foreach(file, headers: true)
.group_by { |fee| fee['charge_id'] }
.each_with_object({}) { |key, hash| hash[key.first] = key.last.sum { |fee| fee['total_amount'].to_f}.round(2) }
file.close
return results
end
o_results = nil
g_results = nil
time = Benchmark.realtime do
o_results = one_pass filename
end
puts "one_pass: #{time}"
time = Benchmark.realtime do
g_results = with_group_by filename
end
puts "with_group_by: #{time}"
puts "o_results == g_results: #{o_results == g_results}"
My benchmarking results with a file that has 56k lines:
one_pass: 0.24479200004134327
with_group_by: 0.4725199999520555
o_results == g_results: true
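One difference worth noting: with_group_by rounds each sum to two decimals, while one_pass does not. The results matched on the data above, but with other amounts floating-point accumulation could make them differ. A minimal sketch of rounding once per key after the single pass, using made-up rows and names:

```ruby
# Hypothetical rows, already split into [charge_id, total_amount] pairs.
rows = [['ch_a', '0.1'], ['ch_a', '0.2'], ['ch_b', '10.0']]

raw = rows.each_with_object(Hash.new(0)) do |(charge_id, amount), sums|
  sums[charge_id] += amount.to_f
end

# Round once per key after the pass, matching the .round(2)
# applied in the group_by version.
rounded = raw.transform_values { |sum| sum.round(2) }

p raw['ch_a']      # the unrounded float sum of 0.1 and 0.2
p rounded['ch_a']  # the same sum rounded to two decimals
```

For monetary values where exactness matters, BigDecimal may be a better fit than Float, at some cost in speed.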

Related

How to sum arrays based on the first field

I have 3 arrays of hashes:
a = [{name: 'Identifier', value: 500}, {name: 'Identifier2', value: 50 }]
b = [{name: 'Identifier', value: 500}, {name: 'Identifier2', value: 50 }]
c = [{name: 'Identifier', value: 500}, {name: 'Identifier2', value: 50 }]
and I have to merge them into one, based on the name prop of each identifier, so the result will be:
d = [{name: 'Identifier', value: 1500 }, {name: 'Identifier2', value: 150}]
Is there a smart Ruby way of doing this, or do I have to create another hash where the keys are the identifiers and the values are the totals, and then transform it into an array?
Thank you.
When the values of a single key in a collection of hashes are to be totaled I usually begin by constructing a counting hash:
h = (a+b+c).each_with_object({}) do |g,h|
h[g[:name]] = (h[g[:name]] || 0) + g[:value]
end
#=> {"Identifier"=>1500, "Identifier2"=>150}
Note that if h does not have a key g[:name], h[g[:name]] #=> nil, so:
h[g[:name]] = (h[g[:name]] || 0) + g[:value]
= (nil || 0) + g[:value]
= 0 + g[:value]
= g[:value]
We may now easily obtain the desired result:
h.map { |(name,value)| { name: name, value: value } }
#=> [{:name=>"Identifier", :value=>1500},
# {:name=>"Identifier2", :value=>150}]
If desired these two expressions can be chained:
(a+b+c).each_with_object({}) do |g,h|
h[g[:name]] = (h[g[:name]] || 0) + g[:value]
end.map { |(name,value)| { name: name, value: value } }
#=> [{:name=>"Identifier", :value=>1500},
# {:name=>"Identifier2", :value=>150}]
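Sometimes you might see:
h[k1] = (h[k1] || 0) + g[k2]
written:
h[k1] = (h[k1] ||= 0) + g[k2]
which has the same effect. Note that (h[k1] ||= 0) + g[k2] on its own would only initialize h[k1] to 0 and return the sum without storing it.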
Another way to calculate h, which I would say is more "Ruby-like", is the following.
h = (a+b+c).each_with_object(Hash.new(0)) do |g,h|
h[g[:name]] += g[:value]
end
This creates the hash represented by the block variable h using the form of Hash::new that takes an argument called the default value:
h = Hash.new(0)
All this means is that if h does not have a key k, h[k] returns the default value, here 0. Note that
h[g[:name]] += g[:value]
expands to:
h[g[:name]] = h[g[:name]] + g[:value]
so if h does not have a key g[:name] this reduces to:
h[g[:name]] = 0 + g[:value]
If you were wondering why h[g[:name]] on the left of the equality was not replaced by 0, it is because that part of the expression employs the method Hash#[]=, whereas the method Hash#[] is used on the right. Hash::new with a default value only concerns Hash#[].
You can do everything in Ruby!
Here is a solution to your problem:
d = (a+b+c).group_by { |e| e[:name] }.map { |f| f[1][0].merge(value: f[1].sum { |g| g[:value] }) }
I encourage you to check the Array Ruby doc for more information: https://ruby-doc.org/core-2.7.0/Array.html
I am assuming that the order of identifiers in all arrays is the same. That is, {name: 'Identifier', value: ...} is always the first element in all 3 arrays, {name: 'Identifier2', value: ... } is always the second, etc. In this case, each_with_index gives a simple and clear solution:
d = []
a.each_with_index do |hash, idx|
d[idx] = {name: hash[:name], value: a[idx][:value] + b[idx][:value] + c[idx][:value] }
end
# Or a more clear version using map:
a.each_with_index do |hash, idx|
d[idx] = {name: hash[:name], value: [a, b, c].map { |h| h[idx][:value] }.sum }
end
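Under the same assumption (all three arrays list the identifiers in the same order and have the same length), the index bookkeeping can also be avoided with Array#zip. This is only a sketch of that variation, not a replacement for the answers that group by name:

```ruby
a = [{ name: 'Identifier', value: 500 }, { name: 'Identifier2', value: 50 }]
b = [{ name: 'Identifier', value: 500 }, { name: 'Identifier2', value: 50 }]
c = [{ name: 'Identifier', value: 500 }, { name: 'Identifier2', value: 50 }]

# zip pairs up the elements at the same position in a, b and c.
d = a.zip(b, c).map do |group|
  { name: group.first[:name], value: group.sum { |h| h[:value] } }
end

p d
```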
A couple different ways, avoiding any finicky array-indexing and the like, (also functionally, since you've added the tag):
grouped = (a + b + c).group_by { _1[:name] }
name_sums = grouped.transform_values { |hashes| hashes.map { _1[:value] }.sum }
name_vals = (a + b + c).map { Hash[*_1.values_at(:name, :value)] }
name_sums = name_vals.reduce { |l, r| l.merge(r) { |k, lval, rval| lval + rval } }
In either case, finish it off with:
name_sums.map { |name, value| { name: name, value: value } }

Consistent weighted mapping in Ruby

So I currently have the below method which randomly returns a string (of a known set of strings) based on a weighted probability (based on this):
def get_response(request)
responses = ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']
weights = [5, 5, 10, 10, 20, 50]
ps = weights.map { |w| (Float w) / weights.reduce(:+) }
# => [0.05, 0.05, 0.1, 0.1, 0.2, 0.5]
weighted_response_hash = responses.zip(ps).to_h
# => {"text1"=>0.05, "text2"=>0.05, "text3"=>0.1, "text4"=>0.1, "text5"=>0.2, "text6"=>0.5}
response = weighted_response_hash.max_by { |_, weight| rand ** (1.0 / weight) }.first
response
end
Now instead of a random weighted output, I want the output to be consistent based on an input string while keeping the weighted probability of the response. So for example, a call such as:
get_response("This is my request")
Should always produce the same output, while keeping the weighted probability of the output text.
I think Modulo can be used here in some way, hash mapping to the same result but I'm kinda lost.
What @maxpleaner was trying to say with srand is:
srand may be used to ensure repeatable sequences of pseudo-random numbers between different runs of the program.
So, if you seed the random generator, you will always get the same results back.
For example if you do
random = Random.new(request.hash)
response = weighted_response_hash.max_by { |_, weight| random.rand ** (1.0 / weight) }.first
you will always end up with the same response whenever you pass in the same request.
old code
3.times.collect { get_response('This is my Request') }
# => ["text6", "text1", "text6"]
3.times.collect { get_response('This is my Request 2') }
# => ["text6", "text4", "text5"]
new code, seeding the random
3.times.collect { get_response('This is my Request') }
# => ["text4", "text4", "text4"]
3.times.collect { get_response('This is my Request 2') }
# => ["text1", "text1", "text1"]
The output is still weighted, just now has some predictability:
randoms = 100.times.collect { |x| get_response("#{x}") }
randoms.group_by { |item| item }.collect { |key, values| [key, values.length / 100.0] }.sort_by(&:first)
# => [["text1", 0.03], ["text2", 0.03], ["text3", 0.08], ["text4", 0.11], ["text5", 0.27], ["text6", 0.48]]
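One caveat: the value returned by String#hash can differ between Ruby processes, so seeding with request.hash is only repeatable within a single run. If the mapping needs to survive restarts, the seed can be derived from a checksum of the string instead. The sketch below uses Zlib.crc32 for that; the choice of checksum and the names stable_response, RESPONSES and WEIGHTS are assumptions for illustration:

```ruby
require 'zlib'

RESPONSES = %w[text1 text2 text3 text4 text5 text6].freeze
WEIGHTS = [5, 5, 10, 10, 20, 50].freeze

def stable_response(request)
  # CRC32 depends only on the bytes of the string, so the seed is
  # the same in every process (unlike String#hash).
  rng = Random.new(Zlib.crc32(request))
  total = WEIGHTS.sum.to_f
  weighted = RESPONSES.zip(WEIGHTS.map { |w| w / total })

  # Same weighted selection as in the question, driven by the seeded generator.
  weighted.max_by do |_, weight|
    u = rng.rand
    u ** (1.0 / weight)
  end.first
end

p stable_response('This is my request')
p stable_response('This is my request')
```

Both calls print the same response, and the same input should keep mapping to that response across runs on a given Ruby version.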

Frequency of pairs in an array ruby

I have an array of pairs like this:
arr = [
{lat: 44.456, lng: 33.222},
{lat: 42.456, lng: 31.222},
{lat: 44.456, lng: 33.222},
{lat: 44.456, lng: 33.222},
{lat: 42.456, lng: 31.222}
]
There are some geographical coordinates of some places. I want to get an array with these coordinates grouped and sorted by frequency. The result should look like this:
[
{h: {lat: 44.456, lng: 33.222}, fr: 3},
{h: {lat: 42.456, lng: 31.222}, fr: 2},
]
How can I do this?
The standard ways of approaching this problem are to use Enumerable#group_by or a counting hash. As others have posted answers using the former, I'll go with the latter.
arr.each_with_object(Hash.new(0)) { |f,g| g[f] += 1 }.map { |k,v| { h: k, fr: v } }
#=> [{:h=>{:lat=>44.456, :lng=>33.222}, :fr=>3},
# {:h=>{:lat=>42.456, :lng=>31.222}, :fr=>2}]
First, count instances of the hashes:
counts = arr.each_with_object(Hash.new(0)) { |f,g| g[f] += 1 }
#=> {{:lat=>44.456, :lng=>33.222}=>3,
# {:lat=>42.456, :lng=>31.222}=>2}
Then construct the array of hashes:
counts.map { |k,v| { h: k, fr: v } }
#=> [{:h=>{:lat=>44.456, :lng=>33.222}, :fr=>3},
# {:h=>{:lat=>42.456, :lng=>31.222}, :fr=>2}]
g = Hash.new(0) creates an empty hash with a default value of zero. That means that if g does not have a key k, g[k] returns zero. (The hash is not altered.) g[k] += 1 is first expanded to g[k] = g[k] + 1. If g does not have a key k, g[k] on the right side returns zero, so the expression becomes:
g[k] = 1.
Alternatively, you could write:
counts = arr.each_with_object({}) { |f,g| g[f] = (g[f] ||= 0) + 1 }
If you want the elements (hashes) of the array returned to be in decreasing order of the value of :fr (here it's coincidental), tack on Enumerable#sort_by:
arr.each_with_object(Hash.new(0)) { |f,g| g[f] += 1 }.
map { |k,v| { h: k, fr: v } }.
sort_by { |h| -h[:fr] }
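On Ruby 2.7 or later, Enumerable#tally builds the same counting hash in one call. A short sketch of that variation, with result as an illustrative name:

```ruby
arr = [
  { lat: 44.456, lng: 33.222 },
  { lat: 42.456, lng: 31.222 },
  { lat: 44.456, lng: 33.222 },
  { lat: 44.456, lng: 33.222 },
  { lat: 42.456, lng: 31.222 }
]

# tally counts equal elements; the rest mirrors the map and sort_by steps above.
result = arr.tally
            .map { |coords, count| { h: coords, fr: count } }
            .sort_by { |entry| -entry[:fr] }

p result
```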
arr.group_by(&:itself).map{|k, v| {h: k, fr: v.length}}.sort_by{|h| h[:fr]}.reverse
# =>
# [
# {:h=>{:lat=>44.456, :lng=>33.222}, :fr=>3},
# {:h=>{:lat=>42.456, :lng=>31.222}, :fr=>2}
# ]
arr.group_by{|i| i.hash}.map{|k, v| {h: v[0], fr: v.size}}
#=> [{:h=>{:lat=>44.456, :lng=>33.222}, :fr=>3}, {:h=>{:lat=>42.456, :lng=>31.222}, :fr=>2}]

Ruby calculate percentages of array objects

I have an array:
array = ["one", "two", "two", "three"]
Now I need to create a separate array with the percentages of each of these items out of the total array.
Final result:
percentages_array = [".25",".50",".50",".25"]
I can do something like this:
percentage_array = []
one_count = array.grep(/one/).count
two_count = array.grep(/two/).count
array.each do |x|
if x == "one"
percentage_array << one_count.to_f / array.count.to_f
elsif x == "two"
....
end
end
But how can I write it a little more concise and dynamic?
I would use the group by function:
array = ["one", "two", "two", "three"]
percentages = Hash[array.group_by{|x|x}.map{|x, y| [x, 1.0*y.size/array.size]}]
p percentages #=> {"one"=>0.25, "two"=>0.5, "three"=>0.25}
final = array.map{|x| percentages[x]}
p final #=> [0.25, 0.5, 0.5, 0.25]
Alternative 2 without group_by:
array, result = ["one", "two", "two", "three"], Hash.new
array.uniq.each do |number|
result[number] = array.count(number)
end
p array.map{|x| 1.0*result[x]/array.size} #=> [0.25, 0.5, 0.5, 0.25]
You could do this, but you may find it more useful to just use the hash h:
array = ["one", "two", "two", "three"]
fac = 1.0/array.size
h = array.reduce(Hash.new(0)) {|h, e| h[e] += fac; h}
# => {"one"=>0.25, "two"=>0.5, "three"=>0.25}
array.map {|e| h[e]} # => [0.25, 0.5, 0.5, 0.25]
Edit: as @Victor suggested, the last two lines could be replaced with:
array.reduce(Hash.new(0)) {|h, e| h[e] += fac; h}.values_at(*array)
Thanks, Victor, a definite improvement (unless use of the hash is sufficient).
# Defined first so it exists when the loops below call it.
# Maps "two" to slot 1, "three" to slot 2 and anything else to slot 0.
def find_number(x)
  if x == "two"
    return 1
  elsif x == "three"
    return 2
  end
  return 0
end
percentage_array = []
percents = [0.0, 0.0, 0.0]
array.each do |x|
  number = find_number(x)
  percents[number] += 1.0 / array.length
end
array.each do |x|
  percentage_array.append(percents[find_number(x)].to_s)
end
Here is a generalized way to do this:
def percents(arr)
map = Hash.new(0)
arr.each { |val| map[val] += 1 }
arr.map { |val| (map[val]/arr.count.to_f).to_s }
end
p percents(["one", "two", "two", "three"]) # prints ["0.25", "0.5", "0.5", "0.25"]

Creating Hash of Hash from an Array of Array

I have an array:
values = [["branding", "color", "blue"],
["cust_info", "customer_code", "some_customer"],
["branding", "text", "custom text"]]
I am having trouble transforming it into a hash as follows:
{
"branding" => {"color"=>"blue", "text"=>"custom text"},
"cust_info" => {"customer_code"=>"some_customer"}
}
You can use default hash values to create something more legible than inject:
h = Hash.new {|hsh, key| hsh[key] = {}}
values.each {|a, b, c| h[a][b] = c}
Obviously, you should replace the h and a, b, c variables with your domain terms.
Bonus: If you find yourself needing to go N levels deep, check out autovivification:
fun = Hash.new { |h,k| h[k] = Hash.new(&h.default_proc) }
fun[:a][:b][:c][:d] = :e
# fun == {:a=>{:b=>{:c=>{:d=>:e}}}}
Or an overly-clever one-liner using each_with_object:
silly = values.each_with_object(Hash.new {|hsh, key| hsh[key] = {}}) {|(a, b, c), h| h[a][b] = c}
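One thing to keep in mind with the autovivification idiom above is that merely reading a missing path creates the intermediate hashes. A small sketch of that side effect, reusing the fun hash from the answer (the other names are illustrative):

```ruby
fun = Hash.new { |h, k| h[k] = Hash.new(&h.default_proc) }
fun[:a][:b] = 1

# Reading a missing path runs the default proc at each level,
# so empty hashes are stored along the way.
_ = fun[:x][:y]
created = fun.key?(:x)

# Hash#key? checks for a key without running the default proc.
probe = Hash.new { |h, k| h[k] = Hash.new(&h.default_proc) }
untouched = probe.key?(:x)

p fun
p [created, untouched, probe.empty?]
```

If stray empty entries would be a problem, check with key? before reading, or use the single-level default from the first snippet.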
Here is an example using Enumerable#inject:
values = [["branding", "color", "blue"],
["cust_info", "customer_code", "some_customer"],
["branding", "text", "custom text"]]
# r is the value we are "injecting" and v represents each
# value in turn from the enumerable; here we create
# a new hash which will be the result hash (res == r)
res = values.inject({}) do |r, v|
group, key, value = v # array decomposition
r[group] ||= {} # make sure group exists
r[group][key] = value # set key/value in group
r # return value for next iteration (same hash)
end
There are several different ways to write this; I think the above is relatively simple. See extracting from 2 dimensional array and creating a hash with array values for using a Hash (i.e. grouper) with "auto vivification".
Less elegant but easier to understand:
hash = {}
values.each do |value|
if hash[value[0]]
hash[value[0]][value[1]] = value[2]
else
hash[value[0]] = {value[1] => value[2]}
end
end
values.inject({}) { |m, (k1, k2, v)| m[k1] = { k2 => v }.merge m[k1] || {}; m }
