I'm generating load test results with JMeter, which outputs a nicely formatted CSV file, but now I need to do some number crunching with Ruby. An example of the beginning of the CSV file:
threadName,grpThreads,allThreads,URL,Latency,SampleCount,ErrorCount
Thread Group 1-1,1,1,urlXX,240,1,0
Thread Group 1-1,1,1,urlYY,463,1,0
Thread Group 1-2,1,1,urlXX,200,1,0
Thread Group 1-3,1,1,urlXX,212,1,0
Thread Group 1-2,1,1,urlYY,454,1,0
.
.
.
Thread Group 1-N,1,1,urlXX,210,1,0
Now, for the statistics I need to read the first line of each thread group, add up the Latency fields and divide by the number of thread groups to get an average latency. Then iterate to the second line of every thread group, and so forth.
I was thinking that maybe I would need to write temporary sorted CSV files for each thread group (the order in which the URLs are hit is always the same within a thread group) and then use those as input: add the first lines, do the math, add the second lines, and so on until there are no more lines.
But since the number of thread groups changes, I haven't been able to write Ruby code that can flex around that... any code examples would be really appreciated :)
[update] - Is this what you want, I wonder?
How about this - it's probably inefficient but does it do what you want?
lines = File.readlines("data.csv")
lines.shift # drop the header line

# Hash where the key is the group name and the value is a list of
# hashes with keys {:grp, :lat}
hash = lines.
  map { |l| # Turn every line into a hash of group name and its latency.
    fs = l.split(",")
    { :grp => fs[0], :lat => fs[4] }
  }.
  group_by { |o| o[:grp] }

# The largest number of lines in any group
max_lines = hash.values.map(&:size).max

# AVGS is a list of averages.
# AVGS[0] is the average latency for all the first lines,
# AVGS[1] is the average latency for all the second lines, etc.
AVGS =
  (0..(max_lines - 1)).map { |lno| # line number within each group
    total = # latencies for the lno'th line of every group...
      hash.map { |gname, l|
        if l[lno] then l[lno][:lat].to_i else 0 end
      }
    total.reduce { |a, b| a + b } / hash.size
  }

# So we have max_lines averages - one per line position within a group.
# You could do anything with this list of numbers... find the average again?
puts AVGS.inspect
Should return something like:
[217, 305]  # 217 = average of all first lines, 305 = average of all second lines
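If you also want a single overall number, you could average those per-line averages again; a minimal one-liner, reusing AVGS from above:
overall_avg = AVGS.reduce(:+) / AVGS.size # integer average across all line positions
puts overall_avg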
Summary
Looking at other questions that are somewhat in line with this hasn't helped, because I'm already opening the file line by line, so I'm not running out of memory on the large file. In fact, my memory usage is pretty low, but it is taking a really long time to create the smaller file so that I can search it and concatenate the other CSV into it.
Question
It has been 5 days and I'm not sure how far I have left to go, but it hasn't exited the foreach over the main file; there are 17.8 million records in the CSV file. Is there a faster way to handle this processing in Ruby? Anything I can do to macOS to optimize it? Any advice would be great.
# -------------------------------------------------------------------------------------
# USED TO GET ID NUMBERS OF THE SPECIFIC ITEMS THAT ARE NEEDED
# -------------------------------------------------------------------------------------
etas_title_file = './HathiTrust ETAS Titles.csv'
oclc_id_array = []
angies_csv = []
CSV.foreach(etas_title_file, 'r', headers: true, header_converters: :symbol) do |row|
  oclc_id_array << row[:oclc]
  angies_csv << row.to_h
end
oclc_id_array.uniq!
# -------------------------------------------------------------------------------------
# RUN ONCE IF DATABASE IS NOT POPULATED
# -------------------------------------------------------------------------------------
headers = %i[htid access rights ht_bib_key description source source_bib_num oclc_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]
remove_keys = %i[access rights description source source_bib_num isbn issn lccn title imprint rights_reason_code rights_timestamp us_gov_doc_flag rights_date_used pub_place lang bib_fmt collection_code content_provider_code responsible_entity_code digitization_agent_code access_profile_code author]
new_hathi_csv = []
processed_keys = []
CSV.foreach('./hathi_full_20200401.txt', 'r', headers: headers, col_sep: "\t", quote_char: "\0") do |row|
  next unless oclc_id_array.include? row[:oclc_num]
  next if processed_keys.include? row[:oclc_num]
  puts "#{row[:oclc_num]} included? #{oclc_id_array.include? row[:oclc_num]}"
  new_hathi_csv << row.to_h.except(*remove_keys)
  processed_keys << row[:oclc_num]
end
As far as I was able to determine, OCLC IDs are alphanumeric. This means we want to use a Hash to store these IDs. A Hash has a general lookup complexity of O(1), while your unsorted Array has a lookup complexity of O(n).
If you use an Array, your worst-case lookup is 18 million comparisons (to find a single element, Ruby has to go through all 18 million IDs), while with a Hash it will be one comparison. To put it simply: using a Hash will be millions of times faster than your current implementation.
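If you want to see the difference for yourself, here is a quick benchmark sketch; the IDs and counts are made up purely for illustration:
require 'benchmark'
require 'set'

# One million fake IDs; we look up 1,000 of them in each structure.
ids     = (1..1_000_000).map { |i| "oclc#{i}" }
lookups = ids.sample(1_000)

array = ids            # O(n) per include?
set   = Set.new(ids)   # O(1) per include?

Benchmark.bm(7) do |x|
  x.report('Array') { lookups.each { |id| array.include?(id) } }
  x.report('Set')   { lookups.each { |id| set.include?(id) } }
end
The Set column should come out dramatically faster than the Array scan.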
The pseudocode below will give you an idea how to proceed. We will use a Set, which is like a Hash, but handy when all you need to do is check for inclusion:
require 'set'

oclc_ids = Set.new
CSV.foreach(...) {
oclc_ids.add(row[:oclc]) # Add ID to Set
...
}
# No need to call unique on a Set.
# The elements in a Set are always unique.
processed_keys = Set.new
CSV.foreach(...) {
next unless oclc_ids.include?(row[:oclc_num]) # Extremely fast lookup
next if processed_keys.include?(row[:oclc_num]) # Extremely fast lookup
...
processed_keys.add(row[:oclc_num])
}
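Putting the two passes together with the question's file names and the headers/remove_keys arrays defined in the question, a minimal sketch (untested against the real data) might look like this:
require 'csv'
require 'set'

# Pass 1: collect the wanted OCLC IDs into a Set for O(1) membership tests.
oclc_ids = Set.new
CSV.foreach('./HathiTrust ETAS Titles.csv', headers: true, header_converters: :symbol) do |row|
  oclc_ids.add(row[:oclc])
end

# Pass 2: keep only the first occurrence of each wanted OCLC number.
processed_keys = Set.new
new_hathi_csv  = []
CSV.foreach('./hathi_full_20200401.txt', headers: headers, col_sep: "\t", quote_char: "\0") do |row|
  next unless oclc_ids.include?(row[:oclc_num])
  next if processed_keys.include?(row[:oclc_num])
  new_hathi_csv << row.to_h.except(*remove_keys)
  processed_keys.add(row[:oclc_num])
end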
I have a large CSV file with the following headers: "sku", "year", "color", "price", "discount", "inventory", "published_on", "rate", "demographic" and "tags".
I would like to perform various calculations for each contiguous group of rows having the same values for "sku", "year" and "color". I will refer to each such set of contiguous rows as a group of rows. For example, if the file looked like this:
sku,year,color,price,discount,...
100,2019,white,24.61,2.3,...
100,2019,white,29.11,2.1,...
100,2019,white,33.48,2.9,...
100,2019,black,58.12,1.3,...
200,2018,brown,44.15,3.1,...
200,2018,brown,53.07,3.2,...
100,2019,white,16.91,2.9,...
there would be four groups of rows: rows 1, 2 and 3 (after the header row), row 4 alone, rows 5 and 6, and row 7 alone. Notice that the last row is not included in the first group even though it has the same values for the first three fields. That is because it is not contiguous with the first group.
An example of a calculation that might be performed for each group of rows would be to determine the total inventory for the group. In general, the measure to be computed is some function of the values contained in all the rows of the group. The specific calculations for each group are not central to my question. Let us simply assume that each group of rows is passed to some method which returns the measure of interest.
I wish to return an array containing one element per group of rows, each element (perhaps an array or hash) containing the common values of "sku", "year" and "color" and the calculated measure of interest.
Because the file is large it must be read line-by-line, rather than gulping it into an array.
What's the best way to do this?
Enumerator#chunk is perfect for this.
CSV.foreach('path/to/csv', headers: true).
  chunk { |row| row.values_at('sku', 'year', 'color') }.
  each do |(sku, year, color), rows|
    # process `rows` with the current `[sku, year, color]` combination
  end
Obviously, that last each can be replaced by map or flat_map, as needed.
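For example, here is a sketch of the map variant, using a hypothetical total_inventory helper as the measure of interest:
require 'csv'

# Hypothetical measure: total inventory within one group of rows.
def total_inventory(rows)
  rows.sum { |row| row['inventory'].to_i }
end

results =
  CSV.foreach('path/to/csv', headers: true).
    chunk { |row| row.values_at('sku', 'year', 'color') }.
    map { |(sku, year, color), rows| { sku: sku, year: year, color: color, measure: total_inventory(rows) } }
results is then an array with one element per contiguous group, as requested in the question.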
Here is an example of how that might be done. I will read the CSV file line-by-line to minimize memory requirements.
Code
require 'csv'
def doit(fname, common_headers)
  CSV.foreach(fname, headers: true).
    slice_when { |csv1, csv2|
      csv1.values_at(*common_headers) != csv2.values_at(*common_headers) }.
    each_with_object({}) { |arr, h|
      h[arr.first.to_h.slice(*common_headers)] = calc(arr) }
end

def calc(arr)
  arr.sum { |csv| csv['price'].to_f }.fdiv(arr.size).round(2)
end
The method calc needs to be customized for the application. Here I am computing the average price for each contiguous group of records having the same values for "sku", "year" and "color".
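For instance, if the measure of interest were the total inventory mentioned in the question (assuming the file has an "inventory" column), calc could instead be written as:
def calc(arr)
  arr.sum { |csv| csv['inventory'].to_i }
end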
See CSV::foreach, Enumerable#slice_when, CSV::Row#values_at, CSV::Row#to_h and Hash#slice.
Example
Now let's construct a CSV file.
str = <<~END
sku,year,color,price
1,2015,red,22.41
1,2015,red,33.61
1,2015,red,12.15
1,2015,blue,36.18
2,2015,yellow,9.08
2,2015,yellow,13.71
END
fname = 't.csv'
File.write(fname, str)
#=> 129
The common headers must be given:
common_headers = ['sku', 'year', 'color']
The average prices are obtained by executing doit:
doit(fname, common_headers)
#=> {{"sku"=>"1", "year"=>"2015", "color"=>"red"}=>22.72,
# {"sku"=>"1", "year"=>"2015", "color"=>"blue"}=>36.18,
# {"sku"=>"2", "year"=>"2015", "color"=>"yellow"}=>11.4}
Note:
((22.41 + 33.61 + 12.15)/3).round(2)
#=> 22.72
((36.18)/1).round(2)
#=> 36.18
((9.08 + 13.71)/2).round(2)
#=> 11.4
The methods foreach and slice_when both return enumerators. Therefore, for each contiguous block of lines from the file having the same values for the keys in common_headers, memory is acquired, calculations are performed for those lines and then that memory is released (by Ruby). In addition, memory is needed to hold the hash that is returned at the end.
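As a quick sanity check on that claim, both calls return plain Enumerators, so nothing is read from the file until the chain is actually iterated:
enum = CSV.foreach(fname, headers: true)
enum.class                                        #=> Enumerator
enum.slice_when { |a, b| a['sku'] != b['sku'] }.class
                                                  #=> Enumerator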
Here is an example CSV file for this problem:
Jack,6
Sam,10
Milo,9
Jacqueline,7
Sam,5
Sam,8
Sam,10
Let's take the context to be the names and scores of a quiz these people took. We can see that Sam has taken this quiz 4 times, but I want to keep only X results per person (and they need to be the most recent entries). Let's assume we want no more than 3 results per person.
I realised it probably wouldn't be possible to keep only the 3 most recent results per person without some extra information. Here is the updated CSV file:
Jack,6,1793
Sam,10,2079
Milo,9,2132
Jacqueline,7,2590
Sam,5,2881
Sam,8,3001
Sam,10,3013
The third column is essentially the number of seconds since the "Epoch", which is a reference point for time. With this, I thought I could simply sort the file from lowest to highest on the epoch column and use set() to remove all but a certain number of duplicates in the name column, removing those persons' scores along with them.
In theory, this should leave me with the 3 most recent results per person but in practice, I have no idea how I could adapt the set() function to do this unless there is some alternative way. So my question is, what possible methods are there to achieve this?
You could use a defaultdict of lists, and each time you add an entry check the length of the list: if it has more than three items, pop the first one off (or do the check after cycling through the whole file). This assumes the file is in time sequence.
from collections import defaultdict
# looping over a csv file gives one row at a time
# so we will emulate that
raw_data = [
('Jack', '6'),
('Sam', '10'),
('Milo', '9'),
('Jacqueline', '7'),
('Sam', '5'),
('Sam', '8'),
('Sam', '10'),
]
# this will hold our information, and works by providing an empty
# list for any missing key
student_data = defaultdict(list)
for row in raw_data:  # note 1
    # separate the row into its component items, and convert
    # score from str to int
    name, score = row
    score = int(score)
    # get the current list for the student, or a brand-new list
    student = student_data[name]
    student.append(score)
    # after adding the score to the end, remove the first scores
    # until we have no more than three items in the list
    if len(student) > 3:
        student.pop(0)

# print the items for debugging
for item in student_data.items():
    print(item)
which results in:
('Milo', [9])
('Jack', [6])
('Sam', [5, 8, 10])
('Jacqueline', [7])
Note 1: to use an actual csv file you want code like this:
import csv

raw_file = open('some_file.csv')
csv_file = csv.reader(raw_file)
for row in csv_file:
    ...
To handle the timestamps, and as an alternative, you could use itertools.groupby:
from itertools import groupby, islice
from operator import itemgetter
raw_data = [
('Jack','6','1793'),
('Sam','10','2079'),
('Milo','9','2132'),
('Jacqueline','7','2590'),
('Sam','5','2881'),
('Sam','8','3001'),
('Sam','10','3013'),
]
# Sort by name in natural order, then by timestamp from highest to lowest
sorted_data = sorted(raw_data, key=lambda x: (x[0], -int(x[2])))
# Group by user
grouped = groupby(sorted_data, key=itemgetter(0))
# And keep only three most recent values for each user
most_recent = [(k, [v for _, v, _ in islice(grp, 3)]) for k, grp in grouped]
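With the sample data above, most_recent ends up holding one score each for Jack, Jacqueline and Milo, and Sam's three most recent scores (10, 8 and 5).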
I am building a tool to help me reverse engineer database files. I am targeting my tool towards fixed record length flat files.
What I know:
1) Each record has an index(ID).
2) Each record is separated by a delimiter.
3) Each record is fixed width.
4) Each column in each record is separated by at least one x00 byte.
5) The file header is at the beginning (I say this because the header does not contain the delimiter..)
Delimiters I have found in other files are ( xFAxFA, xFExFE, xFDxFD ), but this is somewhat irrelevant, considering that I may use the tool on a different database in the future. So I will need something that can pick out a 'pattern' regardless of how many bytes it is made of. Probably no more than 6 bytes? It would probably eat up too much data if it were more. But my experience doing this is limited.
So I guess my question is: how would I find UNKNOWN delimiters in a large file? I feel that, given 'what I know', I should be able to program something, I just don't know where to begin...
# Really loose pseudo code
def begin_some_how
  # THIS IS THE PART I NEED HELP WITH...
  # find all non-zero, non-ASCII sets of 2 or more bytes that repeat more than twice.
end

def check_possible_record_lengths
  possible_delimiter = begin_some_how
  # Test if any of the above are always the same number of bytes apart
  # from each other (except one instance, the header...)
  possible_records = file.split(possible_delimiter)
  rec_length_count = possible_records.map { |record| record.length }.uniq.count
  if rec_length_count == 2 # The header will most likely not be the same size.
    puts "Success! We found the fixed record delimiter: #{possible_delimiter}"
  else
    puts "Wrong delimiter found"
  end
end
possible = [",", "."]
result = [0, ""]

possible.each do |delimiter|
  sizes = file.split(delimiter).map { |record| record.size }
  next if sizes.size < 2

  # Average record size for this candidate delimiter; this should be the
  # record length if this is the right delimiter.
  average = sizes.inject(0.0) { |sum, x| sum + x } / sizes.size
  # Sum of squared deviations from that average.
  deviation = sizes.inject(0.0) { |sum, x| sum + (x - average)**2 }
  # Heuristic score: larger when the record sizes are more consistent.
  matching_value = average / (deviation**2)

  if matching_value > result[0]
    result[0] = matching_value
    result[1] = delimiter
  end
end
Take advantage of the fact that the records have a constant size. Take every possible delimiter and check how much each record deviates from the usual record length. If the header is small enough compared to the rest of the file, this should work.
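For the part the question marks as needing help (finding candidate delimiters in the first place), a rough sketch along these lines might work; the file name is hypothetical and it assumes a two-byte delimiter, per the examples in the question:
# Count how often each non-zero, non-ASCII two-byte sequence occurs.
# (Reads the whole file into memory; fine for moderately sized files.)
data   = File.binread('unknown.db') # hypothetical file name
counts = Hash.new(0)

data.bytes.each_cons(2) do |a, b|
  next if a.zero? || b.zero?     # skip the x00 column padding
  next if a < 0x80 && b < 0x80   # skip plain ASCII pairs
  counts[[a, b]] += 1
end

# Sequences that repeat more than twice are delimiter candidates, which
# could then be fed into the record-length check above.
candidates = counts.select { |_, n| n > 2 }.sort_by { |_, n| -n }
candidates.first(5).each { |(a, b), n| printf("%02X%02X occurs %d times\n", a, b, n) }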
My CSV contains about 60 million rows. The 10th column contains some alphanumeric entries, some of which repeat, that I want to convert into integers with a one-to-one mapping. That is, I don't want the same entry in Original.csv to have multiple corresponding integer values in Processed.csv. So, initially, I wrote the following code:
require 'csv'
udids = []
CSV.open('Processed.csv', "wb") do |csv|
  CSV.foreach('Original.csv', headers: true) do |row|
    unless udids.include?(row[9])
      udids << row[9]
    end
    udid = udids.index(row[9]) + 1
    array = [udid]
    csv << array
  end
end
But, the program was taking a lot of time, which I soon realized was because it had to check all the previous rows to make sure only the new values get assigned a new integer value, and the existing ones are not assigned any new value.
So, I thought of hashing them, because while exploring the web about this issue I learnt that hashing is somehow faster than sequential comparison (I have not read the details of how, but anyway...). So, I wrote the following code to hash them:
arrayUDID = []
arrayUser = []
arrayHash = []
array1 = []

f = File.open("Original.csv", "r")
f.each_line { |line|
  row = line.split(",")
  arrayUDID << row[9]
  arrayUser << row[9]
}

arrayUser = arrayUser.uniq
arrayHash = []
for i in 0..arrayUser.size - 1
  arrayHash << arrayUser[i]
  arrayHash << i
end

hash = Hash[arrayHash.each_slice(2).to_a]
array1 = hash.values_at(*arrayUDID)

logfile = File.new("Processed.csv", "w")
for i in 0..array1.size - 1
  logfile.print("#{array1[i]}\n")
end
logfile.close
But here again, I observed that the program was taking a lot of time, which I realized must be due to the hash array (or hash table) running out of memory.
So, can you kindly suggest any method that will work on my huge file in a reasonable amount of time? By reasonable, I mean within 10 hours, because I realize it's going to take some hours at least; it took about 5 hours to extract this dataset from an even bigger one. With the code above, it was not finishing even after 2 days of running. So, if you can suggest a method which can do the task by leaving the computer on overnight, that would be great. Thanks.
I think this should work:
require 'csv'

udids = {}
unique_count = 1
output_csv = CSV.open("Processed.csv", "w")

CSV.foreach("Original.csv").with_index do |row, i|
  output_csv << row and next if i == 0 # skip the first row (header info)

  val = row[9]
  if udids[val.to_sym]
    row[9] = udids[val.to_sym]
  else
    udids[val.to_sym] = unique_count
    row[9] = unique_count
    unique_count += 1
  end
  output_csv << row
end

output_csv.close
The performance depends heavily on how many duplicates there are (the more the better), but basically it keeps track of each value as a key in a hash, and checks to see if it has encountered that value yet or not. If so, it uses the corresponding value, and if not it increments a counter, stores that count as the new value for that key and continues.
I was able to process a 10 million line test CSV file in about 3 minutes.