I have a CSV file with 498,766 rows. The contents of the CSV are pulled remotely and stuffed into a tempfile. Once I have the tempfile I group by a specific column and then go through each of the objects and create a new hash.
report = ::RestClient::Request.execute(
  url: report_url,
  method: :get,
  headers: { Authorization: basic_auth.to_s }
)

file = ::Tempfile.new(["#{report_run.result.filename}", ".csv"])
file.write(report.body.force_encoding("UTF-8"))
file.rewind

time = Benchmark.realtime do
  ::CSV.foreach(file, headers: true)
       .group_by { |fee| fee['charge_id'] }
       .each_with_object({}) { |key, hash| hash[key.first] = key.last.sum { |fee| fee['total_amount'].to_f }.round(2) }
end
Benchmarking the above, it takes about 52 seconds, which seems relatively long to me. Are there any further optimisations I can make here?
For added clarity, the CSV I'm looking at contains the columns charge_id and total_amount. It is possible for there to be multiple rows with the same charge_id, so I consolidate them and sum the total value. A better representation of what the CSV rows would look like is something like:
#
# Note this is a dummy representation of CSV data that would come back from
# doing ::CSV.foreach(file, headers: true)
#
csv_data = [
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Pi4Kqv3kyKfABfXoXycx', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Xt4Kqv3kyKfAnBz9ZJGJ', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Yu4Kqv3kyKfA7CnwoNEo', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79ZQ4Kqv3kyKfAYZMLs8tW', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Ze4Kqv3kyKfAmNbovTjO', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Zs4Kqv3kyKfA38s1yVmq', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79Zy4Kqv3kyKfA99Arn1Lh', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79b04Kqv3kyKfA8uYHL0DY', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79bS4Kqv3kyKfAAWxowFGO', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79dS4Kqv3kyKfADejRhlbZ', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79gM4Kqv3kyKfA30s5NTAj', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79hc4Kqv3kyKfAxJWbu8Ny', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79j64Kqv3kyKfATjAI1JcC', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79jk4Kqv3kyKfAKYdakMAk', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79k64Kqv3kyKfAXmpONrNI', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79le4Kqv3kyKfAJMzltr6U', total_amount: 10.0),
OpenStruct.new(charge_id: 'ch_1G79lu4Kqv3kyKfAdHG5Qw6r', total_amount: 10.0)
].group_by { |fee| fee['charge_id'] }.each_with_object({}) { |key, hash| hash[key.first] = key.last.sum { |fee| fee['total_amount'].to_f}.round(2) }
#=>
{"ch_1G79Pi4Kqv3kyKfABfXoXycx"=>70.0,
"ch_1G79Xt4Kqv3kyKfAnBz9ZJGJ"=>10.0,
"ch_1G79Yu4Kqv3kyKfA7CnwoNEo"=>10.0,
"ch_1G79ZQ4Kqv3kyKfAYZMLs8tW"=>10.0,
"ch_1G79Ze4Kqv3kyKfAmNbovTjO"=>10.0,
"ch_1G79Zs4Kqv3kyKfA38s1yVmq"=>10.0,
"ch_1G79Zy4Kqv3kyKfA99Arn1Lh"=>10.0,
"ch_1G79b04Kqv3kyKfA8uYHL0DY"=>10.0,
"ch_1G79bS4Kqv3kyKfAAWxowFGO"=>10.0,
"ch_1G79dS4Kqv3kyKfADejRhlbZ"=>10.0,
"ch_1G79gM4Kqv3kyKfA30s5NTAj"=>10.0,
"ch_1G79hc4Kqv3kyKfAxJWbu8Ny"=>10.0,
"ch_1G79j64Kqv3kyKfATjAI1JcC"=>10.0,
"ch_1G79jk4Kqv3kyKfAKYdakMAk"=>10.0,
"ch_1G79k64Kqv3kyKfAXmpONrNI"=>10.0,
"ch_1G79le4Kqv3kyKfAJMzltr6U"=>10.0,
"ch_1G79lu4Kqv3kyKfAdHG5Qw6r"=>10.0}
A more direct way to compute the desired hash from csv_data follows. Because it requires a single pass through the array, I expect it will speed things up, but I have not benchmarked it.
require 'ostruct'

csv_data.each_with_object(Hash.new(0)) do |os, h|
  h[os[:charge_id]] += os[:total_amount]
end
#=> {"ch_1G79Pi4Kqv3kyKfABfXoXycx"=>70.0,
# "ch_1G79Xt4Kqv3kyKfAnBz9ZJGJ"=>10.0,
# "ch_1G79Yu4Kqv3kyKfA7CnwoNEo"=>10.0,
# "ch_1G79ZQ4Kqv3kyKfAYZMLs8tW"=>10.0,
# "ch_1G79Ze4Kqv3kyKfAmNbovTjO"=>10.0,
# "ch_1G79Zs4Kqv3kyKfA38s1yVmq"=>10.0,
# "ch_1G79Zy4Kqv3kyKfA99Arn1Lh"=>10.0,
# "ch_1G79b04Kqv3kyKfA8uYHL0DY"=>10.0,
# "ch_1G79bS4Kqv3kyKfAAWxowFGO"=>10.0,
# "ch_1G79dS4Kqv3kyKfADejRhlbZ"=>10.0,
# "ch_1G79gM4Kqv3kyKfA30s5NTAj"=>10.0,
# "ch_1G79hc4Kqv3kyKfAxJWbu8Ny"=>10.0,
# "ch_1G79j64Kqv3kyKfATjAI1JcC"=>10.0,
# "ch_1G79jk4Kqv3kyKfAKYdakMAk"=>10.0,
# "ch_1G79k64Kqv3kyKfAXmpONrNI"=>10.0,
# "ch_1G79le4Kqv3kyKfAJMzltr6U"=>10.0,
# "ch_1G79lu4Kqv3kyKfAdHG5Qw6r"=>10.0}
See the doc for the form of Hash::new that takes an argument, called the default value.
If the data is received from a remote source a line at a time one could do the processing on the fly, while receiving the data, by writing something like the following.
CSV.foreach(file, headers: true).
  with_object(Hash.new(0)) do |csv, h|
    # <your processing to produce `os`, a line of csv_data>
    h[os[:charge_id]] += os[:total_amount]
  end
If this could be done, it would have to be benchmarked to see if it actually improved performance.
For readers unfamiliar with this form of Hash::new, suppose
h = Hash.new(0)
making h's default value zero. All that means is that if h does not have a key k, h[k] returns zero, which I'll write
h[k] #=> 0
Let's add a key-value pair: h[:dog] = 1. Then
h #=> { :dog=>1 }
and
h[:dog] #=> 1
Since h does not have a key :cat
h[:cat] #=> 0
Suppose now we write
h[:dog] += 1
That's the same as
h[:dog] = h[:dog] + 1
which equals
h[:dog] = 1 + 1 #=> 2
Similarly,
h[:cat] += 1
means
h[:cat] = h[:cat] + 1
= 0 + 1
= 1
because h[:cat] on the right (the method Hash#[], as contrasted with the method Hash#[]= on the left) returns zero. At this point
h #=> { :dog=>2, :cat=>1 }
When a hash is defined in this way it is sometimes called a counting hash. It's effectively the same as
h = {}
[1,3,1,2,2].each do |n|
  h[n] = 0 unless h.key?(n)
  h[n] += 1
end
h #=> {1=>2, 3=>1, 2=>2}
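As an aside, Ruby 2.7+ provides Enumerable#tally, which performs exactly this counting in a single call (a sketch, assuming Ruby 2.7 or later):

```ruby
# tally builds the same counting hash in one step
[1, 3, 1, 2, 2].tally
  #=> {1=>2, 3=>1, 2=>2}
```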
You're doing two passes through the data: one to do the grouping (group_by) and one to accumulate the sums. Here's an example showing a single pass that does both at once, alongside your original. I included benchmarking.
From my tests, the one-pass method is almost twice as fast. Your mileage may vary. Also, note that I skip the header row when reading the data in my method, which further reduces processing overhead and memory manipulation. (One small difference: one_pass does not round the sums to two decimal places as your original does; the comparison below happens to come out equal for my data, but you may want to round the totals to match exactly.)
require 'csv'
require 'benchmark'
filename = './data.csv'
def one_pass(filename)
  file = File.open(filename, 'r')
  csv = CSV.new(file)
  csv.shift # discard the header row
  results = Hash.new(0)
  csv.each do |row|
    charge_id, total_amount = row
    results[charge_id] += total_amount.to_f
  end
  file.close
  results
end
def with_group_by(filename)
  file = File.open(filename, 'r')
  results = CSV.foreach(file, headers: true)
               .group_by { |fee| fee['charge_id'] }
               .each_with_object({}) { |key, hash| hash[key.first] = key.last.sum { |fee| fee['total_amount'].to_f }.round(2) }
  file.close
  results
end
o_results = nil
g_results = nil

time = Benchmark.realtime do
  o_results = one_pass(filename)
end
puts "one_pass: #{time}"

time = Benchmark.realtime do
  g_results = with_group_by(filename)
end
puts "with_group_by: #{time}"

puts "o_results == g_results: #{o_results == g_results}"
My benchmarking results with a file that has 56k lines:
one_pass: 0.24479200004134327
with_group_by: 0.4725199999520555
o_results == g_results: true
I have 3 arrays of hashes:
a = [{name: 'Identifier', value: 500}, {name: 'Identifier2', value: 50 }]
b = [{name: 'Identifier', value: 500}, {name: 'Identifier2', value: 50 }]
c = [{name: 'Identifier', value: 500}, {name: 'Identifier2', value: 50 }]
and I have to merge them into one, based on the name prop of each identifier, so the result will be:
d = [{name: 'Identifier', value: 1500 }, {name: 'Identifier2', value: 150}]
Is there a smart Ruby way of doing this, or do I have to create another hash where the keys are the identifiers and the values are the summed values, and then transform it into an array?
Thank you.
When the values of a single key in a collection of hashes are to be totaled I usually begin by constructing a counting hash:
h = (a+b+c).each_with_object({}) do |g,h|
  h[g[:name]] = (h[g[:name]] || 0) + g[:value]
end
#=> {"Identifier"=>1500, "Identifier2"=>150}
Note that if h does not have a key g[:name], h[g[:name]] #=> nil, so:
h[g[:name]] = (h[g[:name]] || 0) + g[:value]
= (nil || 0) + g[:value]
= 0 + g[:value]
= g[:value]
We may now easily obtain the desired result:
h.map { |(name,value)| { name: name, value: value } }
#=> [{:name=>"Identifier", :value=>1500},
# {:name=>"Identifier2", :value=>150}]
If desired these two expressions can be chained:
(a+b+c).each_with_object({}) do |g,h|
  h[g[:name]] = (h[g[:name]] || 0) + g[:value]
end.map { |(name,value)| { name: name, value: value } }
#=> [{:name=>"Identifier", :value=>1500},
# {:name=>"Identifier2", :value=>150}]
Sometimes you might see:
h[k1] = (h[k1] || 0) + g[k2]
written:
h[k1] = (h[k1] ||= 0) + g[k2]
which comes to the same thing: ||= first initializes the missing value to zero, and the outer assignment then stores the sum.
Another way to calculate h, which I would say is more "Ruby-like", is the following.
h = (a+b+c).each_with_object(Hash.new(0)) do |g,h|
  h[g[:name]] += g[:value]
end
This creates the hash represented by the block variable h using the form of Hash::new that takes an argument called the default value:
h = Hash.new(0)
All this means is that if h does not have a key k, h[k] returns the default value, here 0. Note that
h[g[:name]] += g[:value]
expands to:
h[g[:name]] = h[g[:name]] + g[:value]
so if h does not have a key g[:name] this reduces to:
h[g[:name]] = 0 + g[:value]
If you were wondering why h[g[:name]] on the left of the equality was not replaced by 0, it is because that part of the expression employs the method Hash#[]=, whereas the method Hash#[] is used on the right. Hash::new with a default value only concerns Hash#[].
You can do everything in Ruby!
Here is a solution to your problem:
d = (a+b+c).group_by { |e| e[:name] }.map { |f| f[1][0].merge(value: f[1].sum { |g| g[:value] }) }
I encourage you to check the Array Ruby doc for more information: https://ruby-doc.org/core-2.7.0/Array.html
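With the a, b and c arrays from the question, the one-liner evaluates as follows (a quick check):

```ruby
a = [{ name: 'Identifier', value: 500 }, { name: 'Identifier2', value: 50 }]
b = [{ name: 'Identifier', value: 500 }, { name: 'Identifier2', value: 50 }]
c = [{ name: 'Identifier', value: 500 }, { name: 'Identifier2', value: 50 }]

# group the six hashes by :name, then merge each group's first hash
# with the summed :value for that group
d = (a + b + c).group_by { |e| e[:name] }
               .map { |f| f[1][0].merge(value: f[1].sum { |g| g[:value] }) }
  #=> [{:name=>"Identifier", :value=>1500}, {:name=>"Identifier2", :value=>150}]
```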
I am assuming that the order of identifiers in all arrays is the same. That is, {name: 'Identifier', value: ...} is always the first element in all 3 arrays, {name: 'Identifier2', value: ... } is always the second, etc. In this simple case, each_with_index is a simple and clear solution:
d = []
a.each_with_index do |hash, idx|
  d[idx] = { name: hash[:name], value: a[idx][:value] + b[idx][:value] + c[idx][:value] }
end

# Or, a little more clearly, summing the values with map:
a.each_with_index do |hash, idx|
  d[idx] = { name: hash[:name], value: [a, b, c].map { |h| h[idx][:value] }.sum }
end
A couple of different ways, avoiding any finicky array indexing and the like (also functional in style, since you've added the tag):
grouped = (a + b + c).group_by { _1[:name] }
name_sums = grouped.transform_values { |hashes| hashes.map { _1[:value] }.sum }
name_vals = (a + b + c).map { Hash[*_1.values_at(:name, :value)] }
name_sums = name_vals.reduce { |l, r| l.merge(r) { |k, lval, rval| lval + rval } }
in either case, finish it off with:
name_sums.map { |name, value| { name: name, value: value } }
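With the question's a, b and c, both pipelines end up at the same place. A sketch of the group_by/transform_values route (note the numbered block parameter _1 requires Ruby 2.7+):

```ruby
a = [{ name: 'Identifier', value: 500 }, { name: 'Identifier2', value: 50 }]
b = a.map(&:dup)   # same contents as the question's b and c
c = a.map(&:dup)

grouped = (a + b + c).group_by { _1[:name] }
name_sums = grouped.transform_values { |hashes| hashes.map { _1[:value] }.sum }
result = name_sums.map { |name, value| { name: name, value: value } }
  #=> [{:name=>"Identifier", :value=>1500}, {:name=>"Identifier2", :value=>150}]
```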
So I currently have the below method which randomly returns a string (of a known set of strings) based on a weighted probability (based on this):
def get_response(request)
  responses = ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']
  weights = [5, 5, 10, 10, 20, 50]
  ps = weights.map { |w| Float(w) / weights.reduce(:+) }
  # => [0.05, 0.05, 0.1, 0.1, 0.2, 0.5]
  weighted_response_hash = responses.zip(ps).to_h
  # => {"text1"=>0.05, "text2"=>0.05, "text3"=>0.1, "text4"=>0.1, "text5"=>0.2, "text6"=>0.5}
  weighted_response_hash.max_by { |_, weight| rand ** (1.0 / weight) }.first
end
Now instead of a random weighted output, I want the output to be consistent based on an input string while keeping the weighted probability of the response. So for example, a call such as:
get_response("This is my request")
Should always produce the same output, while keeping the weighted probability of the output text.
I think modulo can be used here in some way, hashing the input so it maps to the same result, but I'm kind of lost.
What #maxpleaner was trying to say with srand is
srand may be used to ensure repeatable sequences of pseudo-random numbers between different runs of the program.
So, if you seed the random generator, you will always get the same results back.
For example if you do
random = Random.new(request.hash)
response = weighted_response_hash.max_by { |_, weight| random.rand ** (1.0 / weight) }.first
you will always end up with the same response whenever you pass in the same request.
old code
3.times.collect { get_response('This is my Request') }
# => ["text6", "text1", "text6"]
3.times.collect { get_response('This is my Request 2') }
# => ["text6", "text4", "text5"]
new code, seeding the random
3.times.collect { get_response('This is my Request') }
# => ["text4", "text4", "text4"]
3.times.collect { get_response('This is my Request 2') }
# => ["text1", "text1", "text1"]
The output is still weighted, just now has some predictability:
randoms = 100.times.collect { |x| get_response("#{x}") }
randoms.group_by { |item| item }.collect { |key, values| [key, values.length / 100.0] }.sort_by(&:first)
# => [["text1", 0.03], ["text2", 0.03], ["text3", 0.08], ["text4", 0.11], ["text5", 0.27], ["text6", 0.48]]
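One caveat worth flagging: Ruby randomizes String#hash per process, so Random.new(request.hash) is only repeatable within a single run of the program. If the response must be stable across runs (or machines), you could derive the seed from a digest instead. A sketch; stable_seed is a hypothetical helper name, not part of the original code:

```ruby
require 'digest'

# Derive a process-independent integer seed from the request string.
def stable_seed(request)
  Digest::SHA256.hexdigest(request).to_i(16)
end

random = Random.new(stable_seed("This is my request"))
# random.rand now yields the same sequence on every run of the program
```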
I have an array of pairs like this:
arr = [
  {lat: 44.456, lng: 33.222},
  {lat: 42.456, lng: 31.222},
  {lat: 44.456, lng: 33.222},
  {lat: 44.456, lng: 33.222},
  {lat: 42.456, lng: 31.222}
]
There are some geographical coordinates of some places. I want to get an array with these coordinates grouped and sorted by frequency. The result should look like this:
[
{h: {lat: 44.456, lng: 33.222}, fr: 3},
{h: {lat: 42.456, lng: 31.222}, fr: 2},
]
How can I do this?
The standard ways of approaching this problem are to use Enumerable#group_by or a counting hash. As others have posted answers using the former, I'll go with the latter.
arr.each_with_object(Hash.new(0)) { |f,g| g[f] += 1 }.map { |k,v| { h: k, fr: v } }
#=> [{:h=>{:lat=>44.456, :lng=>33.222}, :fr=>3},
# {:h=>{:lat=>42.456, :lng=>31.222}, :fr=>2}]
First, count instances of the hashes:
counts = arr.each_with_object(Hash.new(0)) { |f,g| g[f] += 1 }
#=> {{:lat=>44.456, :lng=>33.222}=>3,
# {:lat=>42.456, :lng=>31.222}=>2}
Then construct the array of hashes:
counts.map { |k,v| { h: k, fr: v } }
#=> [{:h=>{:lat=>44.456, :lng=>33.222}, :fr=>3},
# {:h=>{:lat=>42.456, :lng=>31.222}, :fr=>2}]
g = Hash.new(0) creates an empty hash with a default value of zero. That means that if g does not have a key k, g[k] returns zero. (The hash is not altered.) g[k] += 1 is first expanded to g[k] = g[k] + 1. If g does not have a key k, g[k] on the right side returns zero, so the expression becomes:
g[k] = 1.
Alternatively, you could write:
counts = arr.each_with_object({}) { |f,g| g[f] = (g[f] || 0) + 1 }
If you want the elements (hashes) of the array returned to be in decreasing order of the value of :fr (here it's coincidental), tack on Enumerable#sort_by:
arr.each_with_object(Hash.new(0)) { |f,g| g[f] += 1 }.
map { |k,v| { h: k, fr: v } }.
sort_by { |h| -h[:fr] }
arr.group_by(&:itself).map { |k, v| { h: k, fr: v.length } }.sort_by { |h| h[:fr] }.reverse
# =>
# [
# {:h=>{:lat=>44.456, :lng=>33.222}, :fr=>3},
# {:h=>{:lat=>42.456, :lng=>31.222}, :fr=>2}
# ]
arr.group_by { |i| i.hash }.map { |k, v| { h: v[0], fr: v.size } }
#=> [{:h=>{:lat=>44.456, :lng=>33.222}, :fr=>3}, {:h=>{:lat=>42.456, :lng=>31.222}, :fr=>2}]
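For completeness, on Ruby 2.7+ Enumerable#tally collapses the counting step into one call (a sketch):

```ruby
arr = [
  {lat: 44.456, lng: 33.222},
  {lat: 42.456, lng: 31.222},
  {lat: 44.456, lng: 33.222},
  {lat: 44.456, lng: 33.222},
  {lat: 42.456, lng: 31.222}
]

# tally counts identical hashes, then we reshape and sort by frequency
result = arr.tally.map { |k, v| { h: k, fr: v } }.sort_by { |h| -h[:fr] }
  #=> [{:h=>{:lat=>44.456, :lng=>33.222}, :fr=>3},
  #    {:h=>{:lat=>42.456, :lng=>31.222}, :fr=>2}]
```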
I have an array:
array = ["one", "two", "two", "three"]
Now I need to create a separate array with the percentages of each of these items out of the total array.
Final result:
percentages_array = [".25", ".50", ".50", ".25"]
I can do something like this:
percentage_array = []
one_count = array.grep(/one/).count
two_count = array.grep(/two/).count

array.each do |x|
  if x == "one"
    percentage_array << one_count.to_f / array.count.to_f
  elsif x == "two"
    ....
  end
end
But how can I write it a little more concise and dynamic?
I would use the group_by function:
my_array = ["one", "two", "two", "three"]
percentages = Hash[my_array.group_by { |x| x }.map { |x, y| [x, 1.0 * y.size / my_array.size] }]
p percentages #=> {"one"=>0.25, "two"=>0.5, "three"=>0.25}
final = my_array.map { |x| percentages[x] }
p final #=> [0.25, 0.5, 0.5, 0.25]
Alternative 2, without group_by:
array, result = ["one", "two", "two", "three"], Hash.new
array.uniq.each do |number|
  result[number] = array.count(number)
end
p array.map { |x| 1.0 * result[x] / array.size } #=> [0.25, 0.5, 0.5, 0.25]
You could do this, but you may find it more useful to just use the hash h:
array = ["one", "two", "two", "three"]
fac = 1.0/array.size
h = array.reduce(Hash.new(0)) {|h, e| h[e] += fac; h}
# => {"one"=>0.25, "two"=>0.5, "three"=>0.25}
array.map {|e| h[e]} # => [0.25, 0.5, 0.5, 0.25]
Edit: as #Victor suggested, the last two lines could be replaced with:
array.reduce(Hash.new(0)) {|h, e| h[e] += fac; h}.values_at(*array)
Thanks, Victor, a definite improvement (unless use of the hash is sufficient).
# define find_number before it is used below
def find_number(x)
  if x == "two"
    return 1
  elsif x == "three"
    return 2
  end
  0
end

percentage_array = []
percents = [0.0, 0.0, 0.0]

array.each do |x|
  number = find_number(x)
  percents[number] += 1.0 / array.length
end

array.each do |x|
  percentage_array.append(percents[find_number(x)].to_s)
end
Here is a generalized way to do this:
def percents(arr)
  map = Hash.new(0)
  arr.each { |val| map[val] += 1 }
  arr.map { |val| (map[val] / arr.count.to_f).to_s }
end

p percents(["one", "two", "two", "three"]) # prints ["0.25", "0.5", "0.5", "0.25"]
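The generalized method above reads a little shorter with Enumerable#tally (Ruby 2.7+); a sketch:

```ruby
def percents(arr)
  counts = arr.tally  # e.g. {"one"=>1, "two"=>2, "three"=>1}
  arr.map { |val| (counts[val] / arr.count.to_f).to_s }
end

p percents(["one", "two", "two", "three"]) # prints ["0.25", "0.5", "0.5", "0.25"]
```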
I have an array:
values = [["branding", "color", "blue"],
["cust_info", "customer_code", "some_customer"],
["branding", "text", "custom text"]]
I am having trouble transforming it into a hash as follows:
{
"branding" => {"color"=>"blue", "text"=>"custom text"},
"cust_info" => {"customer_code"=>"some_customer"}
}
You can use default hash values to create something more legible than inject:
h = Hash.new { |hsh, key| hsh[key] = {} }
values.each { |a, b, c| h[a][b] = c }
Obviously, you should replace the h and a, b, c variables with your domain terms.
Bonus: If you find yourself needing to go N levels deep, check out autovivification:
fun = Hash.new { |h,k| h[k] = Hash.new(&h.default_proc) }
fun[:a][:b][:c][:d] = :e
# fun == {:a=>{:b=>{:c=>{:d=>:e}}}}
Or an overly-clever one-liner using each_with_object:
silly = values.each_with_object(Hash.new {|hsh, key| hsh[key] = {}}) {|(a, b, c), h| h[a][b] = c}
Here is an example using Enumerable#inject:
values = [["branding", "color", "blue"],
["cust_info", "customer_code", "some_customer"],
["branding", "text", "custom text"]]
# r is the value we are "injecting" and v represents each
# value in turn from the enumerable; here we create
# a new hash which will be the result hash (res == r)
res = values.inject({}) do |r, v|
  group, key, value = v   # array decomposition
  r[group] ||= {}         # make sure the group exists
  r[group][key] = value   # set key/value in the group
  r                       # return value for next iteration (same hash)
end
There are several different ways to write this; I think the above is relatively simple. See extracting from 2 dimensional array and creating a hash with array values for using a Hash (i.e. grouper) with "auto vivification".
Less elegant but easier to understand:
hash = {}
values.each do |value|
  if hash[value[0]]
    hash[value[0]][value[1]] = value[2]
  else
    hash[value[0]] = { value[1] => value[2] }
  end
end
values.inject({}) { |m, (k1, k2, v)| m[k1] = { k2 => v }.merge m[k1] || {}; m }
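An each_with_object variant of the same fold avoids rebuilding the inner hash with merge on every step (a sketch):

```ruby
values = [["branding", "color", "blue"],
          ["cust_info", "customer_code", "some_customer"],
          ["branding", "text", "custom text"]]

# (m[group] ||= {}) creates the inner hash on first sight of a group,
# then the key/value pair is written into it
result = values.each_with_object({}) { |(group, key, value), m| (m[group] ||= {})[key] = value }
  #=> {"branding"=>{"color"=>"blue", "text"=>"custom text"},
  #    "cust_info"=>{"customer_code"=>"some_customer"}}
```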