Pull min and max value from CSV file - ruby

I have a CSV file like:
123,hat,19.99
321,cap,13.99
I have this code:
products_file = File.open('text.txt')
while !products_file.eof?
  line = products_file.gets.chomp
  puts line.inspect
  products[line[0].to_i] = [line[1], line[2].to_f]
end
products_file.close
which reads the file: while it's not at the end of the file, it reads each line. I don't need the line.inspect in there, but it stores each line as an array inside my products hash.
Now I want to pull the min and max value from the hash.
My code so far is:
read_file = File.open('text.txt', "r+").read
read_file.(?) |line|
products[ products.length] = gets.chomp.to_f
products.min_by { |x| x.size }
smallest = products
puts "Your highest priced product is #{smallest}"
Right now I don't have anything after read_file.(?) |line| so I get an error. I tried using min or max but neither worked.

Without using CSV
If I understand your question correctly, you don't have to use CSV class methods: just read the file (dropping any header row) into an array and determine the min and max as follows:
arr = ["123,hat,19.99", "321,cap,13.99",
"222,shoes,33.41", "255,shirt,19.95"]
arr.map { |s| s.split(',').last.to_f }.minmax
#=> [13.99, 33.41]
or
arr.map { |s| s[/\d+\.\d+$/].to_f }.minmax
#=> [13.99, 33.41]
If you want the associated records:
arr.minmax_by { |s| s.split(',').last.to_f }
#=> ["321,cap,13.99", "222,shoes,33.41"]
With CSV
If you wish to use CSV to read the file into an array:
arr = [["123", "hat", "19.99"],
["321", "cap", "13.99"],
["222", "shoes", "33.41"],
["255", "shirt", "19.95"]]
then
arr.map(&:last).minmax
# => ["13.99", "33.41"]
or
arr.minmax_by(&:last)
#=> [["321", "cap", "13.99"],
# ["222", "shoes", "33.41"]]
if you want the records. Note that in the CSV examples I didn't convert the last field to a float; comparing the prices as strings only works if they all have the same number of digits before and after the decimal point (as strings, "9.99" would compare as larger than "13.99"), so convert with to_f if that isn't guaranteed.
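For reference, a minimal sketch of building that array from the file with the CSV library (assuming text.txt has no header row):
require 'csv'
arr = CSV.read('text.txt')
#=> [["123", "hat", "19.99"], ["321", "cap", "13.99"], ...]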

You should use the built-in CSV class as such:
require 'csv'
data = CSV.read("text.txt")
data.sort!{ |row1, row2| row1[2].to_f <=> row2[2].to_f }
least_expensive = data.first
most_expensive = data.last
The Array#sort! method modifies data in place, so it stays sorted for later use. As you can see, the block sorts based on the values in each row at index 2 - in your case, the prices. Be careful about skipping the to_f conversion: strings only sort in the same order as the numbers they represent when every price has the same number of digits (as strings, "9.99" sorts after "13.99"). On the other hand, to_f stops working if you have leading non-digit characters (eg, $), in which case you'd want to strip those before converting.
Then you can grab the most and least expensive, or the 5 most expensive, or whatever, at your leisure.
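For example, once data is sorted, you might use it like this (just an illustration of working with the sorted array):
puts "Least expensive: #{least_expensive.inspect}"
puts "Most expensive: #{most_expensive.inspect}"
five_most_expensive = data.last(5)  # the five highest-priced rows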

Related

CSV iteration in Ruby, and grouping by column value to get last line of each group

I have a csv of transaction data, with columns like:
ID,Name,Transaction Value,Running Total,
5,mike,5,5,
5,mike,2,7,
20,bob,1,1,
20,bob,15,16,
1,jane,4,4,
etc...
I need to loop through every line and do something with the transaction value, and do something different when I get to the last line of each ID.
I currently do something like this:
total = ""
id = ""
idHold = ""
totalHold = ""
CSV.foreach(csvFile) do |row|
totalHold = total
idHold = id
id = row[0]
value = row[2]
total = row[3]
if id != idHold
# do stuff with the totalHold here
end
end
But this has a problem - it skips the last line. Also, something about it doesn't feel right. I feel like there should be a better way of detecting the last line of an 'ID'.
Is there a way of grouping the id's and then detecting the last item in the id group?
note: all id's are grouped together in the csv
Let's first construct a CSV file.
str =<<~END
ID,Name,Transaction Value,Running Total
5,mike,5,5
5,mike,2,7
20,bob,1,1
20,bob,15,16
1,jane,4,4
END
CSVFile = 't.csv'
File.write(CSVFile, str)
#=> 107
I will first create a method that takes two arguments: an instance of CSV::Row and a boolean that indicates whether that row is the last of its group (true if it is).
def process_row(row, is_last)
  puts "Do something with row #{row}"
  puts "last row: #{is_last}"
end
This method would of course be modified to perform whatever operations need be performed for each row.
Below are three ways to process the file. All three use the method CSV::foreach to read the file line-by-line. This method is called with two arguments, the file name and an options hash { headers: true, converters: :numeric } that indicates that the first line of the file is a header row and that strings representing numbers are to be converted to the appropriate numeric object. Here the values for "ID", "Transaction Value" and "Running Total" will be converted to integers.
Though it is not mentioned in the doc, when foreach is called without a block it returns an enumerator (in the same way that IO::foreach does).
We of course need:
require 'csv'
Chain foreach to Enumerable#chunk
I have chosen to use chunk, as opposed to Enumerable#group_by, because the lines of the file are already grouped by ID.
CSV.foreach(CSVFile, headers: true, converters: :numeric).
  chunk { |row| row['ID'] }.
  each do |_, (*arr, last_row)|
    arr.each { |row| process_row(row, false) }
    process_row(last_row, true)
  end
displays
Do something with row 5,mike,5,5
last row: false
Do something with row 5,mike,2,7
last row: true
Do something with row 20,bob,1,1
last row: false
Do something with row 20,bob,15,16
last row: true
Do something with row 1,jane,4,4
last row: true
Note that
enum = CSV.foreach(CSVFile, headers: true, converters: :numeric).
         chunk { |row| row['ID'] }.
         each
#=> #<Enumerator: #<Enumerator::Generator:0x00007ffd1a831070>:each>
Each element generated by this enumerator is passed to the block and the block variables are assigned values by a process called array decomposition:
_, (*arr, last_row) = enum.next
#=> [5, [#<CSV::Row "ID":5 "Name":"mike" "Transaction Value":5 "Running Total":5>,
#        #<CSV::Row "ID":5 "Name":"mike" "Transaction Value":2 "Running Total":7>]]
resulting in the following:
_ #=> 5
arr
#=> [#<CSV::Row "ID":5 "Name":"mike" "Transaction Value":5 "Running Total":5>]
last_row
#=> #<CSV::Row "ID":5 "Name":"mike" "Transaction Value":2 "Running Total":7>
See Enumerator#next.
I have followed the convention of using an underscore for block variables that are not used in the block calculation (to alert readers of your code). Note that an underscore is a valid block variable.1
Use Enumerable#slice_when in place of chunk
CSV.foreach(CSVFile, headers: true, converters: :numeric).
  slice_when { |row1, row2| row1['ID'] != row2['ID'] }.
  each do |*arr, last_row|
    arr.each { |row| process_row(row, false) }
    process_row(last_row, true)
  end
This displays the same information that is produced when chunk is used.
Use Kernel#loop to step through the enumerator CSV.foreach(CSVFile, headers:true)
enum = CSV.foreach(CSVFile, headers: true, converters: :numeric)
row = nil
loop do
  row = enum.next
  next_row = enum.peek
  process_row(row, row['ID'] != next_row['ID'])
end
process_row(row, true)
This displays the same information that is produced when chunk is used. See Enumerator#next and Enumerator#peek.
After enum.next returns the last CSV::Row object, enum.peek raises a StopIteration exception. As explained in its doc, loop handles that exception by breaking out of the loop. row must be initialized to an arbitrary value before entering the loop so that it is still visible after the loop terminates; at that point row contains the CSV::Row object for the last line of the file.
1 IRB uses the underscore for its own purposes, resulting in the block variable _ being assigned an erroneous value when the code above is run.
Yes, Ruby has got your back.
grouped = CSV.table('./test.csv').group_by { |r| r[:id] }
# Then process the rows of each group individually:
grouped.each do |id, rows|
  puts [id, rows.length]
end
Tip: You can access each row as a hash by using CSV.table
CSV.table('./test.csv').first[:name]
=> "mike"

Ruby Nokogiri parsing omit duplicates

I'm parsing XML files and wanting to omit duplicate values from being added to my Array. As it stands, the XML will looks like this:
<vulnerable-software-list>
<product>cpe:/a:octopus:octopus_deploy:3.0.0</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.1</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.2</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.3</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.4</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.5</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.6</product>
</vulnerable-software-list>
document.xpath("//entry[
number(substring(translate(last-modified-datetime,'-.T:',''), 1, 12)) > #{last_imported_at} and
cvss/base_metrics/access-vector = 'NETWORK'
]").each do |entry|
product = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':')[-2] }
effected_versions = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':').last }
puts product
end
However, because of the XML input, that's parsing quite a bit of duplicates, so I end up with an array like ['Redhat','Redhat','Redhat','Fedora']
I already have the effected_versions taken care of, since those values don't duplicate.
Is there a method of .map to only add unique values?
If you need to get an array of unique values, then just call uniq method to get the unique values:
product =
  entry.xpath('vulnerable-software-list/product').map do |product|
    product.content.split(':')[-2]
  end.uniq
There are many ways to do this:
input = ['Redhat','Redhat','Redhat','Fedora']
# approach 1
# self explanatory
result = input.uniq
# approach 2
# iterate through vals, and build a hash with the vals as keys
# since hashes cannot have duplicate keys, it provides a 'unique' check
result = input.each_with_object({}) { |val, memo| memo[val] = true }.keys
# approach 3
# Similar to the previous: we iterate through vals and add them to a Set
# (requires require 'set'). Adding a duplicate value to a set has no effect,
# and we can convert it to an array at the end.
result = input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a
If you're not familiar with each_with_object, it's very similar to reduce
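For comparison, a rough reduce version of approach 3 (just a sketch; reduce passes the block's return value forward, which is why the accumulator is returned by Set#add here):
result = input.reduce(Set.new) { |memo, val| memo.add(val) }.to_a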
Regarding performance, you can find some info if you search for it, for example What is the fastest way to make a uniq array?
From a quick test, I see these performing in increasing time in the order listed: uniq is about 5 times faster than the each_with_object hash approach, which in turn is roughly 25% faster than the Set approach, probably because uniq is implemented in C. I only tested with one arbitrary input, though, so it might not hold for all cases.
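If you want to measure this yourself, here is a minimal sketch using Ruby's Benchmark module (the input is made up, and the numbers will vary by input and Ruby version):
require 'benchmark'
require 'set'

input = Array.new(100_000) { ['Redhat', 'Fedora', 'Debian'].sample }

Benchmark.bm(18) do |x|
  x.report('uniq')             { input.uniq }
  x.report('each_with_object') { input.each_with_object({}) { |v, h| h[v] = true }.keys }
  x.report('set')              { input.each_with_object(Set.new) { |v, s| s.add(v) }.to_a }
end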

Why am i getting the error `stack level too deep (systemstackerror)` when hashing 2 columns in a CSV?

This is my code, which is supposed to hash the 2 columns in fotoFd.csv and then save the hashed columns in a separate file, T4Friendship.csv:
require "csv"
arrayUser=[]
arrayUserUnique=[]
arrayFriends=[]
fileLink = "fotoFd.csv"
f = File.open(fileLink, "r")
f.each_line { |line|
row = line.split(",");
arrayUser<<row[0]
arrayFriends<<row[1]
}
arrayUserUnique = arrayUser.uniq
arrayHash = []
for i in 0..arrayUser.size-1
arrayHash<<arrayUser[i]
arrayHash<<i
end
hash = Hash[arrayHash.each_slice(2).to_a]
array1 =hash.values_at *arrayUser
array2 =hash.values_at *arrayFriends
fileLink = "T4Friendship.csv"
for i in 0..array1.size-1
logfile = File.new(fileLink,"a")
logfile.print("#{array1[i]},#{array2[i]}\n")
logfile.close
end
The first column contains users, and the second column contains their friends. So, I want it to produce something like this in T4Friendship.csv:
1 2
1 4
1 10
1 35
2 1
2 8
2 11
3 28
3 31
...
The problem is caused by the splat expansion of a large array. The splat * can be used to expand an array as a parameter list. The parameters are passed on the stack. If there are too many parameters, you'll exhaust stack space and get the mentioned error.
Here's a quick demo of the problem in irb that tries to splat an array of one million elements when calling puts:
irb
irb(main):001:0> a = [0] * 1000000; nil # Use nil to suppress statement output
=> nil
irb(main):002:0> puts *a
SystemStackError: stack level too deep
from /usr/lib/ruby/1.9.1/irb/workspace.rb:80
Maybe IRB bug!
irb(main):003:0>
You seem to be processing large CSV files, and so your arrayUser array is quite large. Expanding the large array with the splat causes the problem on the line:
array1 =hash.values_at *arrayUser
You can avoid the splat by calling map on arrayUser, and converting each value in a block:
array1 = arrayUser.map{ |user| hash[user] }
Suggested Code
Your code appears to map names to unique ID numbers. The output appears to be the same format as the input, except with the names translated to ID numbers. You can do this without keeping any arrays around eating up memory, and just use a single hash built up during read, and used to translate the names to numbers on the fly. The code would look like this:
def convertCsvNamesToNums(inputFileName, outputFileName)
  # Create unique ID number hash
  # When an unknown key is looked up, it is added with a new unique ID number
  # Produces a 0-based index
  nameHash = Hash.new { |hash, key| hash[key] = hash.size }
  # Convert input CSV with names to output CSV with ID numbers
  File.open(inputFileName, "r") do |inputFile|
    File.open(outputFileName, 'w') do |outputFile|
      inputFile.each_line do |line|
        # Parse names from input CSV (chomp strips the newline so a name maps
        # to the same ID whether it appears in the user or the friend column)
        userName, friendName = line.chomp.split(",")
        # Map names to unique ID numbers
        userNum = nameHash[userName]
        friendNum = nameHash[friendName]
        # Write unique ID numbers to output CSV
        outputFile.puts "#{userNum}, #{friendNum}"
      end
    end
  end
end
convertCsvNamesToNums("fotoFd.csv", "T4Friendship.csv")
Note: This code assigns ID numbers to users and friends as they are encountered. Your previous code assigned ID numbers to users only, and then looked up the friends afterwards. The code I suggested ensures friends are assigned ID numbers even if they never appear in the user list. The numerical ordering will differ slightly from what you supplied, but I assume that is not important.
You can also shorten the body of the inner loop to:
# Parse names from input, map to ID numbers, and write to output
outputFile.puts line.chomp.split(",").map { |name| nameHash[name] }.join(',')
I thought I'd include this change separately for readability.
Updated Code
As per your request in the comments, here is code that gives priority to the user column for ID numbers. Only once the first column is completely processed will ID numbers be assigned to entries in the second column. It does this by first passing over the input once, adding the first column to the hash, and then passing over the input a second time to process it as before, using the pre-prepared hash from the first pass. New entries can still be added in the second pass in the case where the friend column contains a new entry that doesn't exist anywhere in the user column.
def convertCsvNamesToNums(inputFileName, outputFileName)
  # Create unique ID number hash
  # When an unknown key is looked up, it is added with a new unique ID number
  # Produces a 0-based index
  nameHash = Hash.new { |hash, key| hash[key] = hash.size }
  # Pass over the data once to give priority to the user column for ID numbers
  File.open(inputFileName, "r") do |inputFile|
    inputFile.each_line do |line|
      name, = line.split(",")  # Parse the user name from the line, ignore the rest
      nameHash[name]           # Add the name to the unique ID number hash (if it isn't already there)
    end
  end
  # Convert input CSV with names to output CSV with ID numbers
  File.open(inputFileName, "r") do |inputFile|
    File.open(outputFileName, 'w') do |outputFile|
      inputFile.each_line do |line|
        # Parse names from input, map to ID numbers, and write to output
        outputFile.puts line.chomp.split(",").map { |name| nameHash[name] }.join(',')
      end
    end
  end
end
convertCsvNamesToNums("fotoFd.csv", "T4Friendship.csv")

Ruby's optimized implementation of Histogram/Aggregator

I'm about to write my own, but I was wondering if there are any gems/libs that I can use as an aggregator/histogram.
My goal would be to sum up values based on a matching key:
["fish","2"]
["fish","40"]
["meat","56"]
["meat","1"]
This should sum up the values per unique key and return ["fish","42"] and ["meat","57"].
The files I have to aggregate are relatively large: about 4 GB text files made of TSV key/value pairs.
My goal is to avoid temporary files so as not to take up too much space on the machine, so I was wondering if something similar and already optimized exists. I have found a gem on GitHub named 'histogram', but it does not really contain the functionality I need.
Thanks
You can use a Hash with a default value of 0 to do the counting, then in the end you could convert it to Array to yield the format you want, though I think you might just want to keep using the Hash instead.
data = [
  ["fish", "2"],
  ["fish", "40"],
  ["meat", "56"],
  ["meat", "1"]
]
hist = data.each_with_object(Hash.new(0)) do |(k, v), h|
  h[k] += v.to_i
end
hist # => {"fish"=>42, "meat"=>57}
hist.to_a # => [["fish", 42], ["meat", 57]]
# To get String values, "42" instead of 42, etc:
hist.map { |k,v| [k, v.to_s] } # => [["fish", "42"], ["meat", "57"]]
Since you stated you had to read the data from a file, here is the above when applied to a file. The input.txt file contents are as follows for this example:
fish,2
fish,40
meat,56
meat,1
Then, to create the same output as before by reading it line by line:
file = File.open('input.txt')
hist = file.each_with_object(Hash.new(0)) do |line, h|
  key, value = line.split(',')
  h[key] += value.to_i
end
file.close
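Since the question mentions tab-separated files, the same approach works by splitting on tabs instead of commas (a sketch; input.tsv is a made-up file name, so adjust it and the delimiter to match your data):
hist = File.open('input.tsv') do |file|
  file.each_with_object(Hash.new(0)) do |line, h|
    key, value = line.chomp.split("\t")
    h[key] += value.to_i
  end
end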

Best way of Parsing 2 CSV files and printing the common values in a third file

I am new to Ruby, and I have been struggling with a problem that I suspect has a simple answer. I have two CSV files, one with two columns, and one with a single column. The single column is a subset of values that exist in one column of my first file. Example:
file1.csv:
abc,123
def,456
ghi,789
jkl,012
file2.csv:
def
jkl
All I need to do is look up the column 2 value in file1 for each value in file2 and output the results to a separate file. So in this case, my output file should consist of:
456
012
I’ve got it working this way:
pairs = IO.readlines("file1.csv").map { |columns| columns.split(',') }
f1 = []
pairs.each do |x| f1.push(x[0]) end
f2 = IO.readlines("file2.csv").map(&:chomp)
collection = {}
pairs.each do |x| collection[x[0]] = x[1] end
f = File.open("outputfile.txt", "w")
f2.each do |col1, col2| f.puts collection[col1] end
f.close
...but there has to be a better way. If anyone has a more elegant solution, I'd be very appreciative! (I should also note that I will eventually need to run this on files with millions of lines, so speed will be an issue.)
To be as memory efficient as possible, I'd suggest only reading the full file2 (which I gather would be the smaller of the two input files) into memory. I'm using a hash for fast lookups and to store the resulting values, so as you read through file1 you only store the values for those keys you need. You could go one step further and write the output file while reading file1.
require 'csv'
# Read file 2, the smaller file, and store its keys in the result Hash
result = {}
CSV.foreach("file2.csv") do |row|
  result[row[0]] = false
end
# Read file 1, the larger file, and look for keys in the result Hash to set values
CSV.foreach("file1.csv") do |row|
  result[row[0]] = row[1] if result.key? row[0]
end
# Write the results
File.open("outputfile.txt", "w") do |f|
  result.each do |key, value|
    f.puts value if value
  end
end
Tested with Ruby 1.9.3
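For the record, the "one step further" variant mentioned above might look like this (a sketch: keep only the file2 keys in a Set and write matches as file1 is streamed; the output order then follows file1 rather than file2):
require 'csv'
require 'set'

wanted = Set.new
CSV.foreach("file2.csv") { |row| wanted << row[0] }

File.open("outputfile.txt", "w") do |f|
  CSV.foreach("file1.csv") do |row|
    f.puts row[1] if wanted.include?(row[0])
  end
end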
Parsing For File 1
data_csv_file1 = File.read("file1.csv")
data_csv1 = CSV.parse(data_csv_file1, :headers => true)
Parsing For File 2
data_csv_file2 = File.read("file2.csv")
data_csv2 = CSV.parse(data_csv_file2, :headers => true)
Collection of names
names_from_sheet1 = data_csv1.collect {|data| data[0]} #returns an array of names
names_from_sheet2 = data_csv2.collect {|data| data[0]} #returns an array of names
common_names = names_from_sheet1 & names_from_sheet2 #array with common names
Collecting results to be printed
results = [] #this will store the values to be printed
data_csv1.each {|data| results << data[1] if common_names.include?(data[0]) }
Final output
f = File.open("outputfile.txt","w")
results.each {|result| f.puts result }
f.close