I have a large CSV file with the following headers: "sku", "year", "color", "price", "discount", "inventory", "published_on", "rate", "demographic" and "tags".
I would like to perform various calculations for each contiguous group of rows having the same values for "sku", "year" and "color". I will refer to each such run of rows as a group of rows. For example, if the file looked like this:
sku,year,color,price,discount,...
100,2019,white,24.61,2.3,...
100,2019,white,29.11,2.1,...
100,2019,white,33.48,2.9,...
100,2019,black,58.12,1.3,...
200,2018,brown,44.15,3.1,...
200,2018,brown,53.07,3.2,...
100,2019,white,16.91,2.9,...
there would be four groups of rows: rows 1, 2 and 3 (after the header row), row 4 alone, rows 5 and 6, and row 7 alone. Notice that the last row is not included in the first group even though it has the same values for the first three fields. That is because it is not contiguous with the first group.
An example of a calculation that might be performed for each group of rows would be to determine the total inventory for the group. In general, the measure to be computed is some function of the values contained in all the rows of the group. The specific calculations for each group of rows are not central to my question. Let us simply assume that each group of rows is passed to some method which returns the measure of interest.
I wish to return an array containing one element per group of rows, each element (perhaps an array or hash) containing the common values of "sku", "year" and "color" and the calculated measure of interest.
Because the file is large it must be read line-by-line, rather than gulping it into an array.
What's the best way to do this?
Enumerator#chunk is perfect for this.
CSV.foreach('path/to/csv', headers: true).
  chunk { |row| row.values_at('sku', 'year', 'color') }.
  each do |(sku, year, color), rows|
    # process `rows` with the current `[sku, year, color]` combination
  end
Obviously, that last each can be replaced by map or flat_map, as needed.
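For instance, here is a minimal self-contained sketch of that pattern (the file name "inventory.csv" and its contents are made up for illustration), totalling the "inventory" column for each contiguous group:

```ruby
require 'csv'

# Hypothetical sample data; the last row repeats the first group's key
# but is NOT contiguous with it, so it forms its own group.
File.write('inventory.csv', <<~CSV)
  sku,year,color,inventory
  100,2019,white,3
  100,2019,white,5
  100,2019,black,2
  100,2019,white,4
CSV

# Sum the "inventory" column for each contiguous (sku, year, color) group
totals = CSV.foreach('inventory.csv', headers: true).
  chunk { |row| row.values_at('sku', 'year', 'color') }.
  map { |key, rows| key + [rows.sum { |row| row['inventory'].to_i }] }

totals
#=> [["100", "2019", "white", 8], ["100", "2019", "black", 2], ["100", "2019", "white", 4]]
```

Note that the non-contiguous white rows produce two separate entries, exactly as the question requires.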
Here is an example of how that might be done. I will read the CSV file line-by-line to minimize memory requirements.
Code
require 'csv'

def doit(fname, common_headers)
  CSV.foreach(fname, headers: true).
    slice_when { |csv1, csv2| csv1.values_at(*common_headers) !=
                              csv2.values_at(*common_headers) }.
    each_with_object({}) { |arr, h|
      h[arr.first.to_h.slice(*common_headers)] = calc(arr) }
end

def calc(arr)
  arr.sum { |csv| csv['price'].to_f }.fdiv(arr.size).round(2)
end
The method calc needs to be customized for the application. Here I am computing the average price for each contiguous group of records having the same values for "sku", "year" and "color".
See CSV::foreach, Enumerable#slice_when, CSV::Row#values_at, CSV::Row#to_h and Hash#slice.
Example
Now let's construct a CSV file.
str = <<~END
sku,year,color,price
1,2015,red,22.41
1,2015,red,33.61
1,2015,red,12.15
1,2015,blue,36.18
2,2015,yellow,9.08
2,2015,yellow,13.71
END
fname = 't.csv'
File.write(fname, str)
#=> 129
The common headers must be given:
common_headers = ['sku', 'year', 'color']
The average prices are obtained by executing doit:
doit(fname, common_headers)
#=> {{"sku"=>"1", "year"=>"2015", "color"=>"red"}=>22.72,
# {"sku"=>"1", "year"=>"2015", "color"=>"blue"}=>36.18,
# {"sku"=>"2", "year"=>"2015", "color"=>"yellow"}=>11.4}
Note:
((22.41 + 33.61 + 12.15)/3).round(2)
#=> 22.72
((36.18)/1).round(2)
#=> 36.18
((9.08 + 13.71)/2).round(2)
#=> 11.4
The methods foreach and slice_when both return enumerators. Therefore, for each contiguous block of lines from the file having the same values for the keys in common_headers, memory is acquired, calculations are performed for those lines, and then that memory is released by Ruby. In addition, memory is needed to hold the hash that is returned at the end.
I want to do a demo for some peers, and I want to force IRB to return values in only binary.
For example it currently returns the result in base 10 no matter which base the input is in:
0b1111
# => 15 #should return 1111 or 0b1111
0b0001 | 0b0011
# => 3 #should return 0011 or 0b0011
Is there a way to force this result? I want to demo bitwise operators, and it is much easier for them to understand if they see the bits flowing around rather than base-10 numbers being returned, which I would have to convert to base 2 afterwards.
Also I would like for all results to be in multiples of 4 bits. If possible with underscores or spaces separating the half byte groupings.
For example:
0b0001_0101
# => 0b0001_0101 #or 0b0001 0101, or 0001 0101, or 0001_0101
If a result does not need to be represented by 4 bits (for example 3, i.e. 11), pad it to 4, 8 or 16 bits in length, depending on the number.
If you write 0b0001.class you will find that it is Fixnum.
Writing puts 0b1000 shows 8: the base of the literal is not retained anywhere; the value is just an integer, which Ruby displays in decimal by default.
So as far as I'm aware, there isn't any way to make the object itself remember the base it was entered in.
If you want to control the way that Fixnum objects are displayed in IRB, you can implement a custom inspect method for the class:
class Fixnum
  def inspect
    unpadded_binary_string = to_s(2)
    binary_width =
      if unpadded_binary_string.length % 4 == 0
        unpadded_binary_string.length
      else
        ((unpadded_binary_string.length / 4) * 4) + 4
      end
    padded_binary_string = "%0#{binary_width}d" % unpadded_binary_string
    # join groups of 4 with an underscore
    padded_binary_string = padded_binary_string.scan(/.{4}/).join("_")
    "0b" + padded_binary_string
  end
end
results:
irb(main):007:0> 0b1000
=> 0b1000
irb(main):011:0> 99999999
=> 0b0101_1111_0101_1110_0000_1111_1111
The inspect method uses to_s(2), which produces the binary representation of the integer as a string. Leading zeroes are not part of the number, so they are never stored in the first place; that's why the inspect method needs to add zeroes back onto the front of the string manually.
There's no way I can think of to add the correct number of zeroes to the front of the string in a completely dynamic way.
What I'm doing here is calculating the minimum width (in a multiple of 4) that can contain the unpadded binary string. So if the unpadded length is 5 characters, the final width will be 8. If the unpadded length is 2, the final length is 4.
Instead of calculating it on the fly, you could alternatively set binary_width as an external variable that you change at runtime, then reference it from the inspect method.
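One caveat worth noting: on Ruby 2.4+, Fixnum is deprecated (and it was removed entirely in Ruby 3.2); small integers are plain Integer objects. The same trick, reopening Integer instead, might look like this condensed sketch of my own:

```ruby
# Same idea as above, but for modern Rubies where Fixnum no longer exists.
class Integer
  def inspect
    unpadded = to_s(2)
    width = ((unpadded.length + 3) / 4) * 4   # round up to a multiple of 4
    "0b" + unpadded.rjust(width, "0").scan(/.{4}/).join("_")
  end
end

8.inspect         #=> "0b1000"
99999999.inspect  #=> "0b0101_1111_0101_1110_0000_1111_1111"
```

(As with the Fixnum version, this doesn't handle negative numbers; to_s(2) would prefix them with a minus sign.)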
I have a long list made up of text like this
Email: example#example.com
Language Spoken: Sample
Points: 52600
Lifetime points: 100000
Country: US
Number: 1234
Gender: Male
Status: Activated
=============================================
I need a way of filtering this list so that only students with more than 52600 points get shown. I am currently looking at solutions for this; I thought maybe Excel would be a start, but I am not too sure and wanted input.
Here's a solution in Excel:
1) Copy Text into Column A
2) In B1 enter "1", then in B2 enter the formula: =IF(LEFT(A1,1)="=",B1+1,B1), then copy that down to the end.
(This splits the text into groups divided by the equal signs)
3) In C1 enter the formula: =IF(LEFT(A1,8)="Points: ",VALUE(RIGHT(A1,LEN(A1)-8)),0), then copy that down to the end.
(Basically this populates the points in column C)
4) In D1 enter the formula: =SUMIF(B:B,B1,C:C), then copy that down to the end.
(This sums the points in column C by the groupings in column B)
5) Finally put a filter on Column D, and filter by greater than or equal to the amount desired.
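If you'd rather script it, here is a rough Ruby sketch of the same idea. The file name "students.txt", the sample records and the 52600 threshold are all assumptions; it splits on the `=====` separator lines shown in the question and keeps records whose Points value exceeds the threshold:

```ruby
# Hypothetical sample file (two records, separator as in the question)
File.write("students.txt", <<~TXT)
  Email: a@example.com
  Points: 60000
  Status: Activated
  =============================================
  Email: b@example.com
  Points: 40000
  Status: Activated
  =============================================
TXT

# Split the file into records on the separator lines, then keep records
# whose "Points:" line exceeds the threshold. A record with no Points
# line yields nil, and nil.to_i is 0, so it is filtered out safely.
records = File.read("students.txt").split(/^=+\s*$/)
high_scorers = records.select { |rec| rec[/^Points:\s*(\d+)/, 1].to_i > 52600 }
puts high_scorers
```

The `^Points:` anchor deliberately does not match the "Lifetime points:" lines, since those start with "Lifetime".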
I am building a tool to help me reverse engineer database files. I am targeting my tool towards fixed record length flat files.
What I know:
1) Each record has an index(ID).
2) Each record is separated by a delimiter.
3) Each record is fixed width.
4) Each column in each record is separated by at least one x00 byte.
5) The file header is at the beginning (I say this because the header does not contain the delimiter).
Delimiters I have found in other files are: ( xFAxFA, xFExFE, xFDxFD ) But this is kind of irrelevant considering that I may use the tool on a different database in the future. So I will need something that will be able to pick out a 'pattern' despite how many bytes it is made of. Probably no more than 6 bytes? It would probably eat up too much data if it was more. But, my experience doing this is limited.
So I guess my question is: how would I find unknown delimiters in a large file? I feel that, given 'what I know', I should be able to program something; I just don't know where to begin...
# Really loose pseudo code
def begin_some_how
  # THIS IS THE PART I NEED HELP WITH...
  # find all non-zero non-ascii sets of 2 or more bytes that repeat more than twice.
end

def check_possible_record_lengths
  possible_delimiter = begin_some_how
  # test if any of the above are always the same number of bytes apart from
  # each other (except one instance, the header...)
  possible_records = file.split(possible_delimiter)
  rec_length_count = possible_records.map { |record| record.length }.uniq.count
  if rec_length_count == 2 # The header will most likely not be the same size.
    puts "Success! We found the fixed record delimiter: #{possible_delimiter}"
  else
    puts "Wrong delimiter found"
  end
end
possible = [",", "."]
result = [0, ""]
possible.each do |delimiter|
  sizes = file.split(delimiter).map { |record| record.size }
  next if sizes.size < 2
  average = sizes.inject(0.0) { |sum, x| sum + x }
  average /= sizes.size # this should be the record length if this is the right delimiter
  # note: inject needs the 0.0 seed here, otherwise the first size is used
  # as the starting sum instead of being squared like the others
  deviation = sizes.inject(0.0) { |sum, x| sum + (x - average)**2 }
  matching_value = average / (deviation**2)
  if matching_value > result[0]
    result[0] = matching_value
    result[1] = delimiter
  end
end
Take advantage of the fact that the records have a constant size. Take every possible delimiter and check how much each record deviates from the usual record length. If the header is small enough compared to the rest of the file, this should work.
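As for the missing begin_some_how step, one rough approach (my own sketch, under the assumption that delimiters are 2-byte sequences whose bytes are neither x00 nor printable ASCII) is to count every such byte pair and keep the ones that repeat often enough:

```ruby
# Count overlapping 2-byte sequences; keep candidates whose bytes are
# non-zero and outside printable ASCII, occurring at least `min_repeats` times.
def candidate_delimiters(data, min_repeats: 3)
  counts = Hash.new(0)
  data.bytes.each_cons(2) do |a, b|
    next if a.zero? || b.zero?                                 # rule out x00 column padding
    next if (0x20..0x7e).cover?(a) || (0x20..0x7e).cover?(b)   # rule out text
    counts[[a, b].pack("C*")] += 1
  end
  counts.select { |_, n| n >= min_repeats }.keys
end

data = "rec1\xFA\xFArec2\xFA\xFArec3\xFA\xFA".b
candidate_delimiters(data)  #=> ["\xFA\xFA"]
```

Each candidate this produces could then be fed into check_possible_record_lengths above; extending the idea to 3- to 6-byte sequences is just a matter of changing the each_cons window.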
I have a critical problem with the hash's size function. It is behaving irrationally.
Here is my hash:
"questionnaires"=>{"1"=>{"6"=>"8", "7"=>"12", "5"=>"19"}}
@questions = evt["questionnaires"]["1"] # not really "1", that's an id, but it doesn't matter here
@questions.each do |(key, question)|   # should be "6"=>"8", then "7"=>"12", etc.
  temp = question.size
And the results are 1, 2, 2, so something seems wrong. I am testing with size because sometimes I get a nested structure like the one below. What I don't understand is why
"6"=>"8" gives size 1, while "7"=>"12" and "5"=>"19" give size 2.
And with this hash
"questionnaires"=>{"3"=>{"8"=>{"16"=>"16", "18"=>"18"}}}
results are correct. Size = 2, like expected.
Any ideas ?
When you have (key, question) parameters like you do, they get filled by parallel assignment as the loop iterates through the hash. So, for example, on the first iteration key is "6" and question is "8"; on the second, key is "7" and question is "12".
And you are asking for question.size. But since question is just a String, question.size returns the length of the string. On the first iteration, the value "8" is 1 character long; on the second, "12" is 2 characters long. That's where the numbers you are getting come from.
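A minimal demonstration of what the loop actually sees, using the inner hash from the question:

```ruby
# The inner hash being iterated; each value is a String, not a collection.
questions = { "6" => "8", "7" => "12", "5" => "19" }

# size is String#size here, i.e. the character count of each value
sizes = questions.map { |_key, question| question.size }
sizes  #=> [1, 2, 2] (the lengths of "8", "12" and "19")
```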