I'm sorting a CSV::Table object. I have a table with headers ("date", "amount", "source") and roughly 50 entries.
Input:
data = CSV.table('filename.csv', headers:true) # note headers are :date, :source, :amount
amounts = []
data[:amount].each {|i| amounts << i.to_f}
data.sort_by! {|row| row[:amount]}
# error - sort_by! is not a defined method (NoMethodError)
data = data.sort_by {|row| row[:amount]}
# sorted but data is now an array not CSV::Table. would like to retain access to headers
I want a bang method to sort the table in place by the "amount" column without losing the CSV::Table structure. Specifically, I want the result to be a CSV::Table, so that I still have access to the headers. Right now, I'm getting an Array, which is not what I want.
I'm sure there is an easier way to do this, especially with the CSV::Table class. Any help?
You can use CSV::Table.new(data) to convert the Array back to a CSV::Table object, if that is what you want.
sort_by is a method from the Enumerable module, and it always returns an Array when a block is given as an argument.
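For example, a minimal sketch (assuming the filename.csv from the question, with :date, :source and :amount columns):

require 'csv'

data = CSV.table('filename.csv')  # CSV.table implies headers: true and numeric conversion
sorted = CSV::Table.new(data.sort_by { |row| row[:amount] })
sorted.class    #=> CSV::Table
sorted.headers  #=> [:date, :source, :amount]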
Suppose you define the following string:
txt=<<-CSV_TXT
Item, Type, Amount, Date
gasoline, expense, 200.00, 2022-01-01
Food, expense, 25.66, 2021-12-24
Plates, expense, 333.03, 2021-04-24
Presents, expense, 1500.01, 2021-12-15
Pay check, income, 2000, 2021-12-07
Consulting, income, 300, 2021-12-16
CSV_TXT
# for giggles, using a multi character separator of ', '
Now create a CSV Table from that (this in the IRB...):
> require 'csv'
=> true
> options={:col_sep=>", ", :headers=>true, :return_headers=>true}
=> {:col_sep=>", ", :headers=>true, :return_headers=>true}
> data=CSV.parse(txt, **options)
=> #<CSV::Table mode:col_or_row row_count:7>
We now have a CSV::Table with a header defined. If you use CSV::Table, the header is not optional.
There are two ways (that I know of) you can now sort this table by the Date field and end up with a CSV::Table with the header unchanged. Neither is fully an 'in-place' solution.
The first creates a new CSV::Table after a round trip through an array of CSV::Row objects. The call to .sort_by creates that array of CSV::Rows for you, and you can index a CSV::Row for sorting purposes. Use the first row of the existing table as the header argument:
> data=CSV::Table.new([data[0]]+data[1..].sort_by{ |r| r[3] })
=> #<CSV::Table mode:col_or_row row_count:7>
The second is similar, but allows the header to be split off more easily by using .to_a to create an array. This also allows the individual rows to be filtered or otherwise processed further:
> data=CSV.parse(txt, **options).to_a
=>
[["Item", "Type", "Amount", "Date"],
...
> header=data.shift.to_csv(**options)
=> "Item, Type, Amount, Date\n"
Now you have data with the header split off. With that array, you can sort or filter or process at will, then put the rows back into an array of CSV strings. This is all done in place:
> data.sort_by!{|r| r[3]}.map!{|r| r.to_csv(**options)}
=>
["Plates, expense, 333.03, 2021-04-24\n",
"\"Pay check\", income, 2000, 2021-12-07\n",
"Presents, expense, 1500.01, 2021-12-15\n",
"Consulting, income, 300, 2021-12-16\n",
"Food, expense, 25.66, 2021-12-24\n",
"gasoline, expense, 200.00, 2022-01-01\n"]
(Note the field with "Pay check" is quoted. If any character from a multi-character :col_sep is in a field, Ruby will quote it...)
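For instance, a quick sketch of that quoting behavior (same :col_sep as above):
> ["Pay check", "income"].to_csv(col_sep: ", ")
=> "\"Pay check\", income\n"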
To print the first, just use puts data since Ruby knows how to print a CSV::Table; to print the second, you can do puts header,data.join("")
For the second, to rejoin the header and data into a new table, use parse with options again:
> data_new=CSV.parse(header+data.join(""), **options)
=> #<CSV::Table mode:col_or_row row_count:7>
I have a csv of transaction data, with columns like:
ID,Name,Transaction Value,Running Total,
5,mike,5,5,
5,mike,2,7,
20,bob,1,1,
20,bob,15,16,
1,jane,4,4,
etc...
I need to loop through every line and do something with the transaction value, and do something different when I get to the last line of each ID.
I currently do something like this:
total = ""
id = ""
idHold = ""
totalHold = ""

CSV.foreach(csvFile) do |row|
  totalHold = total
  idHold = id
  id = row[0]
  value = row[2]
  total = row[3]
  if id != idHold
    # do stuff with the totalHold here
  end
end
But this has a problem - it skips the last line. Also, something about it doesn't feel right. I feel like there should be a better way of detecting the last line of an 'ID'.
Is there a way of grouping the id's and then detecting the last item in the id group?
note: all id's are grouped together in the csv
Let's first construct a CSV file.
str =<<~END
ID,Name,Transaction Value,Running Total
5,mike,5,5
5,mike,2,7
20,bob,1,1
20,bob,15,16
1,jane,4,4
END
CSVFile = 't.csv'
File.write(CSVFile, str)
#=> 107
I will first create a method that takes two arguments: an instance of CSV::Row and a boolean that indicates whether that row is the last of its group (true if it is).
def process_row(row, is_last)
  puts "Do something with row #{row}"
  puts "last row: #{is_last}"
end
This method would of course be modified to perform whatever operations need be performed for each row.
Below are three ways to process the file. All three use the method CSV::foreach to read the file line-by-line. This method is called with two arguments: the file name and an options hash { headers: true, converters: :numeric } that indicates that the first line of the file is a header row and that strings representing numbers are to be converted to the appropriate numeric objects. Here the values for "ID", "Transaction Value" and "Running Total" will be converted to integers.
Though it is not mentioned in the doc, when foreach is called without a block it returns an enumerator (in the same way that IO::foreach does).
We of course need:
require 'csv'
Chain foreach to Enumerable#chunk
I have chosen to use chunk, as opposed to Enumerable#group_by, because the lines of the file are already grouped by ID.
CSV.foreach(CSVFile, headers: true, converters: :numeric).
  chunk { |row| row['ID'] }.
  each do |_, (*arr, last_row)|
    arr.each { |row| process_row(row, false) }
    process_row(last_row, true)
  end
displays
Do something with row 5,mike,5,5
last row: false
Do something with row 5,mike,2,7
last row: true
Do something with row 20,bob,1,1
last row: false
Do something with row 20,bob,15,16
last row: true
Do something with row 1,jane,4,4
last row: true
Note that
enum = CSV.foreach(CSVFile, headers: true, converters: :numeric).
  chunk { |row| row['ID'] }.
  each
#=> #<Enumerator: #<Enumerator::Generator:0x00007ffd1a831070>:each>
Each element generated by this enumerator is passed to the block and the block variables are assigned values by a process called array decomposition:
_, (*arr, last_row) = enum.next
#=> [5, [#<CSV::Row "ID":5 "Name":"mike" "Transaction Value":5 "Running Total":5>,
#        #<CSV::Row "ID":5 "Name":"mike" "Transaction Value":2 "Running Total":7>]]
resulting in the following:
_ #=> 5
arr
#=> [#<CSV::Row "ID":5 "Name":"mike" "Transaction Value":5 "Running Total":5>]
last_row
#=> #<CSV::Row "ID":5 "Name":"mike" "Transaction Value":2 "Running Total":7>
See Enumerator#next.
I have followed the convention of using an underscore for block variables that are not used in the block calculation (to alert readers of your code). Note that an underscore is a valid block variable.1
Use Enumerable#slice_when in place of chunk
CSV.foreach(CSVFile, headers: true, converters: :numeric).
  slice_when { |row1, row2| row1['ID'] != row2['ID'] }.
  each do |*arr, last_row|
    arr.each { |row| process_row(row, false) }
    process_row(last_row, true)
  end
This displays the same information that is produced when chunk is used.
Use Kernel#loop to step through the enumerator CSV.foreach(CSVFile, headers:true)
enum = CSV.foreach(CSVFile, headers: true, converters: :numeric)
row = nil
loop do
  row = enum.next
  next_row = enum.peek
  process_row(row, row['ID'] != next_row['ID'])
end
process_row(row, true)
This displays the same information that is produced when chunk is used. See Enumerator#next and Enumerator#peek.
After enum.next returns the last CSV::Row object enum.peek will generate a StopIteration exception. As explained in its doc, loop handles that exception by breaking out of the loop. row must be initialized to an arbitrary value before entering the loop so that row is visible after the loop terminates. At that time row will contain the CSV::Row object for the last line of the file.
1 IRB uses the underscore for its own purposes, resulting in the block variable _ being assigned an erroneous value when the code above is run.
Yes.. ruby has got your back.
grouped = CSV.table('./test.csv').group_by { |r| r[:id] }

# Then process the rows of each group individually:
grouped.map { |id, rows|
  puts [id, rows.length]
}
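If you also need to know when you reach the last row of each group (as in the question), one sketch building on the same group_by result is to compare each row's index against the group size:

grouped = CSV.table('./test.csv').group_by { |r| r[:id] }
grouped.each do |id, rows|
  rows.each_with_index do |row, i|
    is_last = (i == rows.length - 1)
    # process row here; is_last is true only for the final row of this id
  end
end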
Tip: You can access each row as a hash by using CSV.table
CSV.table('./test.csv').first[:name]
=> "mike"
I am brand new to Ruby and using it to try to read/write to csv. So far, I have a script that does the following:
Imports data from a CSV file, storing select columns as a separate array (I don't need data from every column)
Performs calculations on the data, stores the results in newly created arrays
Transposes the arrays to table rows, to be outputted to a csv
table = [Result1, Result2, Result3].transpose
Currently, I am able to output the table using the following:
CSV.open(resultsFile, "wb",
  :write_headers => true,
  :headers => ["Result1", "Result2", "Result3"]
) do |csv|
  table.each do |row|
    csv << row
  end
end
My question is, how can I add a conditional to only output rows where one of the results equals a certain text string. For example, if the value in result2 is equal to "Apple", I want the data in that row to be written to the csv file. If not, then skip that row.
I've tried placing if/else in a few different areas and have not had any success.
Thanks for any help
You could do something like below:
header = ["Result1", "Result2", "Result3"]
CSV.open(resultsFile, "wb", :write_headers=> true, :headers => header) do |csv|
table.each do |row|
csv << row if header.zip(row).to_h["Result2"] == "Apple"
end
end
zip merges two arrays, producing an array of arrays where each sub-array pairs the elements of the input arrays at the same index, and to_h converts an array of 2-element arrays into a hash. For example:
row = ["Orange", "Apple", "Guava"]
header = ["Result1", "Result2", "Result3"]
header.zip(row).to_h
=> {"Result1"=>"Orange", "Result2"=>"Apple", "Result3"=>"Guava"}
I have a CSV file like:
123,hat,19.99
321,cap,13.99
I have this code:
products = {}

products_file = File.open('text.txt')
while !products_file.eof?
  line = products_file.gets.chomp.split(',')  # split each line into its fields
  puts line.inspect
  products[line[0].to_i] = [line[1], line[2].to_f]
end
products_file.close
which is reading the file. While it's not at the end of the file, it reads each line. I don't need the line.inspect in there, but the loop stores each line's fields as an array inside of my products hash.
Now I want to pull the min and max value from the hash.
My code so far is:
read_file = File.open('text.txt', "r+").read
read_file.(?) |line|
products[ products.length] = gets.chomp.to_f
products.min_by { |x| x.size }
smallest = products
puts "Your highest priced product is #{smallest}"
Right now I don't have anything after read_file.(?) |line| so I get an error. I tried using min or max but neither worked.
Without using CSV
If I understand your question correctly, you don't have to use CSV class methods: just read the file (less the header) into an array and determine the min and max as follows:
arr = ["123,hat,19.99", "321,cap,13.99",
"222,shoes,33.41", "255,shirt,19.95"]
arr.map { |s| s.split(',').last.to_f }.minmax
#=> [13.99, 33.41]
or
arr.map { |s| s[/\d+\.\d+$/].to_f }.minmax
#=> [13.99, 33.41]
If you want the associated records:
arr.minmax_by { |s| s.split(',').last.to_f }
=> ["321,cap,13.99", "222,shoes,33.41"]
With CSV
If you wish to use CSV to read the file into an array:
arr = [["123", "hat", "19.99"],
["321", "cap", "13.99"],
["222", "shoes", "33.41"],
["255", "shirt", "19.95"]]
then
arr.map(&:last).minmax
# => ["13.99", "33.41"]
or
arr.minmax_by(&:last)
#=> [["321", "cap", "13.99"],
# ["222", "shoes", "33.41"]]
if you want the records. Note that in the CSV examples I didn't convert the last field to a float, assuming that all records have two decimal digits.
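If you did want numeric values here, converting first also works:

arr.map { |row| row.last.to_f }.minmax
#=> [13.99, 33.41]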
You should use the built-in CSV class as such:
require 'csv'
data = CSV.read("text.txt")
data.sort!{ |row1, row2| row1[2].to_f <=> row2[2].to_f }
least_expensive = data.first
most_expensive = data.last
The Array#sort! method modifies data in place, so it is sorted based on the condition in the block for later usage. As you can see, the block sorts based on the values in each row at index 2 - in your case, the prices. One caveat: comparing the prices as strings only matches numeric order when every price has the same number of digits before the decimal point, so converting with to_f is the safer default. On the other hand, to_f stops working if there are leading non-digit characters (e.g. $), in which case you would strip those characters before converting.
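For example, a quick illustration of the string-comparison pitfall:

["19.99", "9.99"].sort               #=> ["19.99", "9.99"]  (string order, since "1" < "9")
["19.99", "9.99"].sort_by(&:to_f)    #=> ["9.99", "19.99"]  (numeric order)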
Then you can grab the most and least expensive, or the 5 most expensive, or whatever, at your leisure.
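If you only need the two extremes, you could also skip the sort entirely with Enumerable#minmax_by (a sketch along the same lines):

require 'csv'

data = CSV.read("text.txt")
least_expensive, most_expensive = data.minmax_by { |row| row[2].to_f }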
I have been using Ruby for a while, but this is my first time doing anything with a database. I've been playing around with MongoDB for a bit and, at this point, I've begun to try and populate a simple database.
Here is my problem. I have a text file containing data in a particular format. When I read that file in, the data is stored in nested arrays like so:
dataFile = ["sectionName", ["key1", "value1"], ["key2", "value2", ["key3", ["value3A", "value3B"]]]
The format will always be that the first value of the array is a string and each subsequent value is an array. Each array is formatted in as a key/value pair. However, the value can be a string, an array of two strings, or a series of arrays that have their own key/value array pairs. I don't know any details about the data file before I read it in, just that it conforms to these rules.
Now, here is my problem. I want to read this into to a Mongo database preserving this basic structure. So, for instance, if I were to do this by hand, it would look like this:
newDB = mongo_client.db("newDB")
newCollection = newDB["dataFile1"]
doc = {"section_name" => "sectionName", "key1" => "value1", "key2" => "value2", "key3" => ["value3A", "value3B"]}
ID = newCollection.insert(doc)
I know there has to be an easy way to do this. So far, I've been trying various recursive functions to parse the data out, turn it into mongo commands and try to populate my database. But it just feels clunky, like there is a better way. Any insight into this problem would be appreciated.
The value that you gave for the variable dataFile isn't a valid array, because it is missing a closing square bracket.
If we make the definition of dataFile a valid line of Ruby code, the following code yields the hash that you described. It uses map.with_index to visit each element of the array and transforms it into a new array of key/value hashes. This transformed array of hashes is flattened and converted into a single hash using the inject method.
dataFile = ["sectionName", ["key1", "value1"], ["key2", "value2", ["key3", ["value3A", "value3B"]]]]
puts dataFile.map.with_index { |e, ix|
  case ix
  when 0
    { "section_name" => e }
  else
    list = []
    list.push({ e[0] => e[1] })
    if e.length > 2
      list.push(
        e[2..e.length - 1].map { |p|
          { p[0] => p[1] }
        }
      )
    end
    list
  end
}.flatten.inject({}) { |accum, e|
  key = e.keys.first
  accum[key] = e[key]
  accum
}.inspect
The output looks like:
{"section_name"=>"sectionName", "key1"=>"value1", "key2"=>"value2", "key3"=>["value3A", "value3B"]}
For input that looked like this:
["sectionName", ["key1", "value1"], ["key2", "value2", ["key3", ["value3A", "value3B"]], ["key4", ["value4A", "value4B"]]], ["key5", ["value5A", "value5B"]]]
We would see:
{"section_name"=>"sectionName", "key1"=>"value1", "key2"=>"value2", "key3"=>["value3A", "value3B"], "key4"=>["value4A", "value4B"], "key5"=>["value5A", "value5B"]}
Note the keys "key3" and "key4", which illustrate what I consider a series of arrays. If the structure had arrays of arrays of unknown depth, we would need a different implementation - perhaps one that keeps track of its position as the program walks through the arbitrarily nested arrays.
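For what it's worth, here is a minimal recursive sketch along those lines, assuming each pair has the shape [key, value, *nested_pairs] as in the examples above (the method names are made up for illustration):

def flatten_pairs(pairs, out = {})
  pairs.each do |key, value, *nested|
    out[key] = value                                  # record this key/value pair
    flatten_pairs(nested, out) unless nested.empty?   # recurse into any embedded pairs
  end
  out
end

def to_doc(data)
  name, *pairs = data
  flatten_pairs(pairs, { "section_name" => name })
end

to_doc(["sectionName", ["key1", "value1"], ["key2", "value2", ["key3", ["value3A", "value3B"]]]])
#=> {"section_name"=>"sectionName", "key1"=>"value1", "key2"=>"value2", "key3"=>["value3A", "value3B"]}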
In the following test, please find two solutions.
The first converts to a nested Hash, which is what I think you want, without flattening the input data.
The second stores the key-value pairs exactly as given in the input.
I've chosen to fix the missing closing square bracket in a way that preserves the key-value pairs.
The major message here is that while the top-level data structure for MongoDB is a document, mapped to a Ruby Hash
that by definition has key-value structure, the values can be any shape, including nested arrays or hashes.
So I hope that the test examples cover the range, showing that you can match storage in MongoDB to fit your needs.
test.rb
require 'mongo'
require 'test/unit'
require 'pp'
class MyTest < Test::Unit::TestCase
  def setup
    @coll = Mongo::MongoClient.new['test']['test']
    @coll.remove
    @dataFile = ["sectionName", ["key1", "value1"], ["key2", "value2"], ["key3", ["value3A", "value3B"]]]
    @key, *@value = @dataFile
  end

  test "nested array data as hash value" do
    input_doc = {@key => Hash[*@value.flatten(1)]}
    @coll.insert(input_doc)
    fetched_doc = @coll.find.first
    assert_equal(input_doc[@key], fetched_doc[@key])
    puts "#{name} fetched hash value doc:"
    pp fetched_doc
  end

  test "nested array data as array value" do
    input_doc = {@key => @value}
    @coll.insert(input_doc)
    fetched_doc = @coll.find.first
    assert_equal(input_doc[@key], fetched_doc[@key])
    puts "#{name} fetched array doc:"
    pp fetched_doc
  end
end
ruby test.rb
$ ruby test.rb
Loaded suite test
Started
test: nested array data as array value(MyTest) fetched array doc:
{"_id"=>BSON::ObjectId('5357d4ac7f11ba0678000001'),
"sectionName"=>
[["key1", "value1"], ["key2", "value2"], ["key3", ["value3A", "value3B"]]]}
.test: nested array data as hash value(MyTest) fetched hash value doc:
{"_id"=>BSON::ObjectId('5357d4ac7f11ba0678000002'),
"sectionName"=>
{"key1"=>"value1", "key2"=>"value2", "key3"=>["value3A", "value3B"]}}
.
Finished in 0.009493 seconds.
2 tests, 2 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed
210.68 tests/s, 210.68 assertions/s
Ruby's CSV class makes it pretty easy to iterate over each row:
CSV.foreach(file) { |row| puts row }
However, this always includes the header row, so I'll get as output:
header1, header2
foo, bar
baz, yak
I don't want the headers though. Now, when I call …
CSV.foreach(file, :headers => true)
I get this result:
#<CSV::Row:0x10112e510
#header_row = false,
attr_reader :row = [
[0] [
[0] "header1",
[1] "foo"
],
[1] [
[0] "header2",
[1] "bar"
]
]
>
Of course, because the documentation says:
This setting causes #shift to return rows as CSV::Row objects instead of Arrays
But, how can I skip the header row, returning the row as a simple array? I don't want the complicated CSV::Row object to be returned.
I definitely don't want to do this:
first = true

CSV.foreach(file) do |row|
  if first
    puts row
    first = false
  else
    # code for other rows
  end
end
Look at #shift from CSV Class:
The primary read method for wrapped Strings and IOs, a single row is pulled from the data source, parsed and returned as an Array of fields (if header rows are not used)
An Example:
require 'csv'

# CSV FILE
# name, surname, location
# Mark, Needham, Sydney
# David, Smith, London

def parse_csv_file_for_names(path_to_csv)
  names = []
  csv_contents = CSV.read(path_to_csv)
  csv_contents.shift
  csv_contents.each do |row|
    names << row[0]
  end
  return names
end
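Called on a file like the one sketched in the comment above (the file name is hypothetical):

parse_csv_file_for_names('people.csv')
#=> ["Mark", "David"]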
You might want to consider CSV.parse(csv_file, { :headers => false }) and passing a block, as mentioned here
A cool way to ignore the headers is to read it as an array and ignore the first row:
data = CSV.read("dataset.csv")[1 .. -1]
# => [["first_row", "with data"],
["second_row", "and more data"],
...
["last_row", "finally"]]
The problem with the :headers => false approach is that CSV won't try to read the first row as a header, but will consider it part of the data. So, basically, you have a useless first row.
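Another option, for what it's worth: keep :headers => true, but convert each CSV::Row back to a plain array with CSV::Row#fields (a minimal sketch):

require 'csv'

CSV.foreach(file, headers: true) do |row|
  fields = row.fields  # a plain Array of the row's values; the header row is consumed automatically
  puts fields.inspect
end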