I am trying to take an existing CSV file, add a fourth column to it, and then iterate through the second and third columns to create the fourth column's values. Using Ruby, I've created hashes where the headers are the keys and the column values are the hash values (e.g. "id" => "1", "new_fruit" => "apple").
My practice CSV file looks like this: [image: practice CSV file]
My goal is to create a fourth column, "brand_new" (which I was able to do), and then add values to it by concatenating the values from the second and third columns (which I am stuck on). At the moment I just have a placeholder value (x) for the fourth column's values so I could see whether adding the fourth column to the hash actually worked: [image: results with x = 1]
Here is my code:
require 'csv'
def self.import
table = []
CSV.foreach(File.path("practice.csv"), headers: true) do |row|
table.each do |row|
row["brand_new"] = full_name
end
table << row.to_h
end
table
end
def full_name
x = 1
return x
end
# Add another col, row by row:
import.each do |row|
row["brand_new"] = full_name
end
puts import
Any suggestions or guidance would be much appreciated. Thank you.
I simplified your code a bit. I read the whole file first, then iterate over the content that was read.
Note: Change col_sep to comma or delete it to use the default if needed.
require "csv"
def self.import
  table = CSV.read("practice.csv", headers: true, col_sep: ";")
  table.each do |row|
    row["brand_new"] = "#{row["old_fruit"]} #{row["new_fruit"]}"
  end
  puts table
end
I use the read method to read the CSV file content. With headers: true it returns a table whose rows let you access cell values directly by column name.
The assignment inside the each block shows how to concatenate the column values as a string:
"#{row["old_fruit"]} #{row["new_fruit"]}"
Refer to this old SO post and the CSV Ruby docs to learn more about working with CSV files.
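For anyone who wants to try the idea without a file on disk, here is a minimal sketch that does the same thing on an in-memory string. The sample rows are made up, and I am assuming the column names old_fruit and new_fruit from the answer above and a comma separator.

```ruby
require "csv"

# Hypothetical sample data; only the column names come from the thread.
data = <<~END_CSV
  id,old_fruit,new_fruit
  1,apple,banana
  2,pear,kiwi
END_CSV

table = CSV.parse(data, headers: true)

# Assigning to a header that does not exist yet appends a new field to the row.
table.each do |row|
  row["brand_new"] = "#{row["old_fruit"]} #{row["new_fruit"]}"
end

# to_csv renders the header line followed by every row, including the new column.
puts table.to_csv
```

If the result needs to go back to disk, something like File.write("practice_out.csv", table.to_csv) should do it (the output file name is only an example).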
I have a csv of transaction data, with columns like:
ID,Name,Transaction Value,Running Total,
5,mike,5,5,
5,mike,2,7,
20,bob,1,1,
20,bob,15,16,
1,jane,4,4,
etc...
I need to loop through every line and do something with the transaction value, and do something different when I get to the last line of each ID.
I currently do something like this:
total = ""
id = ""
idHold = ""
totalHold = ""
CSV.foreach(csvFile) do |row|
totalHold = total
idHold = id
id = row[0]
value = row[2]
total = row[3]
if id != idHold
# do stuff with the totalHold here
end
end
But this has a problem - it skips the last line. Also, something about it doesn't feel right. I feel like there should be a better way of detecting the last line of an 'ID'.
Is there a way of grouping the id's and then detecting the last item in the id group?
Note: all rows with the same ID are grouped together in the CSV.
Let's first construct a CSV file.
str =<<~END
ID,Name,Transaction Value,Running Total
5,mike,5,5
5,mike,2,7
20,bob,1,1
20,bob,15,16
1,jane,4,4
END
CSVFile = 't.csv'
File.write(CSVFile, str)
#=> 107
I will first create a method that takes two arguments: an instance of CSV::Row and a boolean to indicate whether the CSV row is the last of the group (true if it is).
def process_row(row, is_last)
puts "Do something with row #{row}"
puts "last row: #{is_last}"
end
This method would of course be modified to perform whatever operations need be performed for each row.
Below are three ways to process the file. All three use the method CSV::foreach to read the file line-by-line. This method is called with two arguments, the file name and an options hash { headers: true, converters: :numeric } that indicates that the first line of the file is a header row and that strings representing numbers are to be converted to the appropriate numeric object. Here values for "ID", "Transaction Value" and "Running Total" will be converted to integers.
Though it is not mentioned in the doc, when foreach is called without a block it returns an enumerator (in the same way that IO::foreach does).
We of course need:
require 'csv'
Chain foreach to Enumerable#chunk
I have chosen to use chunk, as opposed to Enumerable#group_by, because the lines of the file are already grouped by ID.
CSV.foreach(CSVFile, headers:true, converters: :numeric).
chunk { |row| row['ID'] }.
each do |_,(*arr, last_row)|
arr.each { |row| process_row(row, false) }
process_row(last_row, true)
end
displays
Do something with row 5,mike,5,5
last row: false
Do something with row 5,mike,2,7
last row: true
Do something with row 20,bob,1,1
last row: false
Do something with row 20,bob,15,16
last row: true
Do something with row 1,jane,4,4
last row: true
Note that
enum = CSV.foreach(CSVFile, headers:true, converters: :numeric).
chunk { |row| row['ID'] }.
each
#=> #<Enumerator: #<Enumerator::Generator:0x00007ffd1a831070>:each>
Each element generated by this enumerator is passed to the block and the block variables are assigned values by a process called array decomposition:
_,(*arr,last_row) = enum.next
#=> [5, [#<CSV::Row "ID":5 "Name":"mike" "Transaction Value":5 "Running Total ":5>,
# #<CSV::Row "ID":5 "Name":"mike" "Transaction Value":2 "Running Total ":7>]]
resulting in the following:
_ #=> 5
arr
#=> [#<CSV::Row "ID":5 "Name":"mike" "Transaction Value":5 "Running Total ":5>]
last_row
#=> #<CSV::Row "ID":5 "Name":"mike" "Transaction Value":2 "Running Total ":7>
See Enumerator#next.
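To see the decomposition in isolation, here is a small sketch with plain arrays standing in for the CSV::Row objects (the strings are placeholders, not real rows). It also shows what happens when a group has only one element.

```ruby
# A pair shaped like the elements generated by chunk: [key, [members...]].
pair = [5, ["row a", "row b", "row c"]]

# The parentheses decompose the inner array; the splat collects
# everything except the final element.
_id, (*arr, last_row) = pair
p arr       # all members but the last
p last_row  # the final member

# With a single-member group the splat variable is simply empty.
_id2, (*arr2, last_row2) = [1, ["only row"]]
p arr2
p last_row2
```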
I have followed the convention of using an underscore for block variables that are not used in the block calculation (to alert readers of your code). Note that an underscore is a valid block variable (see footnote 1).
Use Enumerable#slice_when in place of chunk
CSV.foreach(CSVFile, headers:true, converters: :numeric).
slice_when { |row1,row2| row1['ID'] != row2['ID'] }.
each do |*arr, last_row|
arr.each { |row| process_row(row, false) }
process_row(last_row, true)
end
This displays the same information that is produced when chunk is used.
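As a quick comparison of the two methods, here is a sketch on plain [id, running_total] arrays (made-up stand-ins for the CSV rows) that picks the last element of each group both ways.

```ruby
# Rows already sorted/grouped by id, as in the question.
rows = [[5, 5], [5, 2], [20, 1], [20, 15], [1, 4]]

# chunk yields [key, group] pairs; take the last member of each group.
last_by_chunk = rows.chunk { |id, _total| id }.map { |_id, group| group.last }

# slice_when yields the groups themselves, split where the id changes.
last_by_slice = rows.slice_when { |a, b| a[0] != b[0] }.map(&:last)

p last_by_chunk
p last_by_slice
```

Both should give the same three rows, one per id; slice_when just skips the key that chunk carries along.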
Use Kernel#loop to step through the enumerator CSV.foreach(CSVFile, headers:true)
enum = CSV.foreach(CSVFile, headers:true, converters: :numeric)
row = nil
loop do
row = enum.next
next_row = enum.peek
process_row(row, row['ID'] != next_row['ID'])
end
process_row(row, true)
This displays the same information that is produced when chunk is used. See Enumerator#next and Enumerator#peek.
After enum.next returns the last CSV::Row object enum.peek will generate a StopIteration exception. As explained in its doc, loop handles that exception by breaking out of the loop. row must be initialized to an arbitrary value before entering the loop so that row is visible after the loop terminates. At that time row will contain the CSV::Row object for the last line of the file.
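The same next/peek pattern can be tried on an ordinary array enumerator. This is only a sketch with made-up [id, label] pairs in place of CSV rows, collecting the results instead of calling process_row.

```ruby
enum = [[5, "a"], [5, "b"], [20, "c"]].each

results = []
row = nil # initialized outside the block so it survives the loop

loop do
  row = enum.next
  # peek raises StopIteration once row is the final element;
  # Kernel#loop rescues it and ends the loop quietly.
  next_row = enum.peek
  results << [row, row[0] != next_row[0]]
end

# The final element never reached the push above, so handle it here.
results << [row, true]

p results
```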
1 IRB uses the underscore for its own purposes, resulting in the block variable _ being assigned an erroneous value when the code above is run.
Yes, Ruby has got your back.
grouped = CSV.table('./test.csv').group_by { |r| r[:id] }
# Then process the rows of each group individually:
grouped.each do |id, rows|
  puts [id, rows.length]
end
Tip: You can access each row as a hash by using CSV.table
CSV.table('./test.csv').first[:name]
=> "mike"
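To get at the last row of each ID with this approach, something like the following sketch might work. It parses an in-memory string using the options that CSV.table applies (headers: true, converters: :numeric, header_converters: :symbol), so no file is needed; the sample data is copied from the question without the trailing commas.

```ruby
require "csv"

data = <<~END_CSV
  ID,Name,Transaction Value,Running Total
  5,mike,5,5
  5,mike,2,7
  20,bob,1,1
  20,bob,15,16
  1,jane,4,4
END_CSV

# Same options CSV.table uses, applied to a string instead of a file.
table = CSV.parse(data, headers: true,
                        converters: :numeric,
                        header_converters: :symbol)

# group_by keeps rows in file order, so rows.last is the final row per id.
final_totals = table.group_by { |r| r[:id] }
                    .transform_values { |rows| rows.last[:running_total] }

p final_totals
```

Note that group_by reads the whole table into memory, whereas the chunk-based answer works through the file as a stream.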
I have the following piece of code to process a CSV:
line_count = 0
CSV.foreach("./matrix1.csv") do |row|
  puts "Row is: "
  print row
  line_count += 1
end
It successfully finds the number of lines in the CSV. However, how can I find the number of CSV elements in one line (row)?
For example, I have the following CSV
1,2,3,4,5
How can I see that the number of elements is 5?
If each line contains the same number of elements, then:
CSV.open('test.csv', 'r') { |csv| puts csv.first.length }
If not, then count the elements of each line:
CSV.foreach('test.csv', 'r') { |row| puts row.length }
row.size
worked, as suggested by @Stefan.
But a more robust approach, which returns the length of the longest row, is:
CSV.foreach('test.csv','r').max_by(&:length).length
Another way is to get the size of the current line:
CSV.foreach('test.csv', 'r') { |row| puts row.length }
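The suggestions above can be checked without a file by parsing strings directly. A small sketch, where the ragged second example is made up to show rows of different lengths:

```ruby
require "csv"

# A single line parses to an array of fields, so its length is the element count.
single = CSV.parse_line("1,2,3,4,5")
puts single.length

# Rows of differing lengths: count each one, then take the widest.
ragged = CSV.parse("1,2,3\n4,5\n6,7,8,9\n")
lengths = ragged.map(&:length)
widest = lengths.max
p lengths
puts widest
```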
I have two CSV files, one is 3.5GB and the other one is 100MB. The larger file contains two columns that I need to add to two other columns from the other file to make a third CSV file.
Both files contain a postcode column which is how I'm trying to match the rows from them. However, as the files are quite large, the operation is slow. I tried looking for matches in two ways but they were both slow:
CSV.foreach('ukpostcodes.csv') do |row|
CSV.foreach('pricepaid.csv') do |item|
if row[1] == item[3]
puts "match"
end
end
end
and:
firstFile = CSV.read('pricepaid.csv')
secondFile = CSV.read('ukpostcodes.csv')
post_codes = Array.new
lat_longs = Array.new
firstFile.each do |row|
post_codes << row[3]
end
secondFile.each do |row|
lat_longs << row[1]
end
post_codes.each do |row|
lat_longs.each do |item|
if row == item
puts "Match"
end
end
end
Is there a more efficient way of handling this task as the CSV files are large in size?
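Both attempts compare every row of one file against every row of the other. One common alternative, sketched below, is to load the smaller file into a Hash keyed by postcode and then stream the larger file once, doing a constant-time lookup per row. This is only a sketch under assumptions: the column positions (1 and 3) come from the question, but which file is the smaller one, the helper names index_by_postcode and each_match, and the tiny sample rows are all my own inventions.

```ruby
require "csv"
require "tmpdir"

# Build a postcode => row lookup from the smaller file in one pass.
# (Hypothetical helper; if a postcode repeats, the later row wins.)
def index_by_postcode(path, column)
  CSV.foreach(path).each_with_object({}) do |row, index|
    index[row[column]] = row
  end
end

# Stream the larger file row by row and yield every matching pair.
def each_match(large_path, large_column, index)
  CSV.foreach(large_path) do |item|
    row = index[item[large_column]]
    yield row, item if row
  end
end

# Tiny made-up files so the sketch runs on its own.
dir = Dir.mktmpdir
postcodes = File.join(dir, "ukpostcodes.csv")
prices = File.join(dir, "pricepaid.csv")
File.write(postcodes, "1,AB1 0AA,57.10,-2.24\n2,AB1 0AB,57.11,-2.25\n")
File.write(prices, "tx1,100000,2020-01-01,AB1 0AB\ntx2,250000,2020-02-01,ZZ9 9ZZ\n")

index = index_by_postcode(postcodes, 1)
matches = []
each_match(prices, 3, index) { |row, item| matches << [item[0], row[2], row[3]] }
p matches
```

Only the smaller file has to fit in memory this way, and each file is read exactly once instead of once per row of the other.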
Ruby's CSV class makes it pretty easy to iterate over each row:
CSV.foreach(file) { |row| puts row }
However, this always includes the header row, so I'll get as output:
header1, header2
foo, bar
baz, yak
I don't want the headers though. Now, when I call …
CSV.foreach(file, :headers => true)
I get this result:
#<CSV::Row:0x10112e510
#header_row = false,
attr_reader :row = [
[0] [
[0] "header1",
[1] "foo"
],
[1] [
[0] "header2",
[1] "bar"
]
]
>
Of course, because the documentation says:
This setting causes #shift to return rows as CSV::Row objects instead of Arrays
But, how can I skip the header row, returning the row as a simple array? I don't want the complicated CSV::Row object to be returned.
I definitely don't want to do this:
first = true
CSV.foreach(file) do |row|
if first
puts row
first = false
else
# code for other rows
end
end
Look at #shift from CSV Class:
The primary read method for wrapped Strings and IOs, a single row is pulled from the data source, parsed and returned as an Array of fields (if header rows are not used)
An Example:
require 'csv'
# CSV FILE
# name, surname, location
# Mark, Needham, Sydney
# David, Smith, London
def parse_csv_file_for_names(path_to_csv)
names = []
csv_contents = CSV.read(path_to_csv)
csv_contents.shift
csv_contents.each do |row|
names << row[0]
end
return names
end
You might want to consider CSV.parse(csv_file, { :headers => false }) and passing a block, as mentioned here
A cool way to ignore the headers is to read it as an array and ignore the first row:
data = CSV.read("dataset.csv")[1 .. -1]
# => [["first_row", "with data"],
#     ["second_row", "and more data"],
#     ...
#     ["last_row", "finally"]]
The problem with the :headers => false approach is that CSV won't try to read the first row as a header, but will consider it part of the data. So, basically, you have a useless first row.
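For completeness, here is a sketch of two ways to end up with plain arrays and no header row, using an in-memory string with the sample data from the question (spaces after the commas removed). The first lets CSV consume the header and then converts each CSV::Row back with fields; the second parses everything as arrays and drops the first row.

```ruby
require "csv"

data = "header1,header2\nfoo,bar\nbaz,yak\n"

# Option 1: let CSV treat the first line as headers, then take each
# row's values as a plain array.
via_fields = CSV.parse(data, headers: true).map(&:fields)

# Option 2: parse without headers and discard the first row.
via_drop = CSV.parse(data).drop(1)

p via_fields
p via_drop
```

With a file, CSV.foreach(file, headers: true) { |row| row.fields } should behave like option 1 while still reading row by row.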