How to dump a 2D array directly into a CSV file? - ruby

I have this 2D array:
arr = [[1,2],[3,4]]
I usually do:
CSV.open(file, 'w') do |csv|
  arr.each do |row|
    csv << row
  end
end
Is there any easier or direct way of doing it other than adding row by row?

Assuming that your array is just numbers (no strings that potentially have commas in them) then:
File.open(file, 'w') { |f| f << arr.map { |row| row.join(',') }.join("\n") }
One enormous string blatted to disk, without involving the CSV library.
Alternatively, using the CSV library to correctly escape each row:
require 'csv'
# #to_csv automatically appends '\n', so we don't need it in #join
File.open(file,'w'){ |f| f << arr.map(&:to_csv).join }
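On Ruby 1.9.3 or later, File.write shortens this to a true one-liner, since it handles the open and close for you:
File.write(file, arr.map(&:to_csv).join)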
If you have to do this often and the code bothers you, you could monkeypatch it in:
class CSV
  def self.dump_array(array, path, mode = "wb", opts = {})
    open(path, mode, opts) { |csv| array.each { |row| csv << row } }
  end
end
CSV.dump_array(arr,file)

Extending the answer above by @Phrogz: if you are using the CSV library and need to change the default delimiter:
File.open(file, 'w') { |f| f << arr.map { |x| x.to_csv(col_sep: '|') }.join }

Related

How to process CSV data from two files using a matching column

I have two CSV files, one is 3.5GB and the other one is 100MB. The larger file contains two columns that I need to add to two other columns from the other file to make a third CSV file.
Both files contain a postcode column which is how I'm trying to match the rows from them. However, as the files are quite large, the operation is slow. I tried looking for matches in two ways but they were both slow:
CSV.foreach('ukpostcodes.csv') do |row|
  CSV.foreach('pricepaid.csv') do |item|
    if row[1] == item[3]
      puts "match"
    end
  end
end
and:
firstFile = CSV.read('pricepaid.csv')
secondFile = CSV.read('ukpostcodes.csv')
post_codes = Array.new
lat_longs = Array.new
firstFile.each do |row|
  post_codes << row[3]
end
secondFile.each do |row|
  lat_longs << row[1]
end
post_codes.each do |row|
  lat_longs.each do |item|
    if row == item
      puts "Match"
    end
  end
end
Is there a more efficient way of handling this task as the CSV files are large in size?
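The usual fix is to trade memory for time: read the smaller file once into a Set, then stream the large file a single time and probe the set. A sketch using the column positions from the snippets above (postcode in row[1] of ukpostcodes.csv and item[3] of pricepaid.csv):
require 'csv'
require 'set'

# One pass over the smaller file to build a constant-time lookup table
postcodes = Set.new
CSV.foreach('ukpostcodes.csv') { |row| postcodes << row[1] }

# One pass over the 3.5GB file; no nested rescanning
CSV.foreach('pricepaid.csv') do |item|
  puts "match" if postcodes.include?(item[3])
end
Each file is read exactly once, only the smaller file's postcodes are held in memory, and each lookup is a hash probe instead of a linear scan.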

Dir.glob to get all csv and xls files in folder

folder_to_analyze = ARGV.first
folder_path = File.join(Dir.pwd, folder_to_analyze)
unless File.directory?(folder_path)
  puts "Error: #{folder_path} is not a valid folder."
  exit
end
def get_csv_file_paths(path)
  files = []
  Dir.glob(path + '/**/*.csv').each do |f|
    files << f
  end
  return files
end
def get_xlsx_file_path(path)
  files = []
  Dir.glob(path + '/**/*.xls').each do |f|
    files << f
  end
  return files
end
files_to_process = []
files_to_process << get_csv_file_paths(folder_path)
files_to_process << get_xlsx_file_path(folder_path)
puts files_to_process[1].length # Not what I want, I want:
# puts files_to_process.length
I'm trying to make a simple script in Ruby that allows me to call it from the command line, like ruby counter.rb mailing_list1 and it goes to the folder and counts all .csv and .xls files.
I intend to operate on each file, getting a row count, etc.
Currently the files_to_process array is actually an array of arrays - I don't want that. I want a single array containing both the .csv and the .xls files.
Since I don't know how to yield from the Dir.glob call, I added them to an array and returned that.
How can I accomplish this using a single array?
Just stick the file extensions together into one group:
Dir[path + "/**/*.{csv,xls}"]
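With that, both helper methods collapse into a single call; a sketch using the names from the question:
files_to_process = Dir[File.join(folder_path, '**', '*.{csv,xls}')]
puts files_to_process.length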
Well, yielding is simple. Just yield.
def get_csv_file_paths(path)
  Dir.glob(path + '/**/*.csv').each do |f|
    yield f
  end
end
def get_xlsx_file_path(path)
  Dir.glob(path + '/**/*.xls').each do |f|
    yield f
  end
end
files_to_process = []
get_csv_file_paths(folder_path) {|f| files_to_process << f }
get_xlsx_file_path(folder_path) {|f| files_to_process << f }
puts files_to_process.length
Every method in Ruby can be passed a block, and the yield keyword sends data to that block. If the block is optional, yield is usually guarded with block_given?:
yield f if block_given?
Update
The code can be further simplified by passing your block directly to glob.each:
def get_csv_file_paths(path, &block)
  Dir.glob(path + '/**/*.csv').each(&block)
end
def get_xlsx_file_path(path, &block)
  Dir.glob(path + '/**/*.xls').each(&block)
end
This block-to-proc conversion is a somewhat more advanced topic, though.
def get_folder_paths(root_path)
  Dir.glob(File.join(root_path, '**', '*.csv')) + Dir.glob(File.join(root_path, '**', '*.xls'))
end
folder_path = File.join(Dir.pwd, ARGV.first || '')
raise "#{folder_path} is not a valid folder" unless File.directory?(folder_path)
puts get_folder_paths(folder_path).length
The get_folder_paths method returns an array of CSV and XLS file names. Building such an array may not be what you really want, especially if there are a lot of files. If you did not need the file count first, the block form of Dir.glob, which yields each match as it is found, would be more appropriate.
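For example, a sketch of the block form (process_file is a hypothetical placeholder for whatever per-file work you do, such as counting rows):
Dir.glob(File.join(folder_path, '**', '*.{csv,xls}')) do |f|
  process_file(f) # hypothetical per-file work
end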

How do I merge several CSV files horizontally?

I've got about 50 CSV files that need to be merged together horizontally into one CSV.
The headers can be ignored. A little bit simplified the files look like this:
File 1:
1,2,4,5,6
4,5,68,7,4,2
1,2
1,2,3
File 2:
1,2,4
4,5,6,4
3,4,5
3,4,5
The output should look like this:
1,2,4,5,6,1,2,4
4,5,68,7,4,2,4,5,6,4
1,2,3,4,5
3,4,5
1,2,3
The order of merging the files is also not important. I know how to merge them vertically, but I have no clue how to merge horizontally.
I thought about something like this with a nested array, but it does not work and I don't know why. It seems like the data array does not accept the line array.
#!/usr/bin/env ruby
require 'csv'
data = Array.new
filecount = 1
linecount = 1
CSV.open("output.csv", "wb") do |output|
  Dir.glob('*.csv').each do |each|
    next if each == 'output.csv'
    file = CSV.read(each)
    file.each do |line|
      data[filecount][linecount] = line # data[filecount] is nil here, so this raises NoMethodError
      linecount = linecount + 1
    end
    filecount = filecount + 1
  end
end
puts data
I prepared a small script that solves your problem, and added some comments for better explanation.
The main idea is to read the input line by line, so you do not have to hold everything in memory.
#!/usr/bin/env ruby
require 'csv'
# map "treats" each element of the array with the block
files = Dir.glob('csv/*.csv').map { |file| CSV.open file, 'r' }
CSV.open("output.csv", "wb") do |out|
  loop do
    # shift returns the next line of each file
    # compact removes the nils from files that are already exhausted
    line = files.map { |file| file.shift }.compact
    # remove entry if the file had no row
    line.reject! { |e| e.empty? }
    # break the endless loop if there is no input left to handle
    break if line.empty?
    out << line.flatten
  end
end
files.each(&:close) # release the input handles opened above

Best way of Parsing 2 CSV files and printing the common values in a third file

I am new to Ruby, and I have been struggling with a problem that I suspect has a simple answer. I have two CSV files, one with two columns, and one with a single column. The single column is a subset of values that exist in one column of my first file. Example:
file1.csv:
abc,123
def,456
ghi,789
jkl,012
file2.csv:
def
jkl
All I need to do is look up the column 2 value in file1 for each value in file2 and output the results to a separate file. So in this case, my output file should consist of:
456
012
I’ve got it working this way:
pairs = IO.readlines("file1.csv").map { |columns| columns.split(',') }
f1 = []
pairs.each { |x| f1.push(x[0]) }
f2 = IO.readlines("file2.csv").map(&:chomp)
collection = {}
pairs.each { |x| collection[x[0]] = x[1] }
f = File.open("outputfile.txt", "w")
f2.each { |col1| f.puts collection[col1] }
f.close
...but there has to be a better way. If anyone has a more elegant solution, I'd be very appreciative! (I should also note that I will eventually need to run this on files with millions of lines, so speed will be an issue.)
To be as memory efficient as possible, I'd suggest only reading the full file2 (which I gather would be the smaller of the two input files) into memory. I'm using a hash for fast lookups and to store the resulting values, so as you read through file1 you only store the values for the keys you need. You could go one step further and write the output file while streaming file1, skipping the final pass over the hash.
require 'csv'
# Read file2, the smaller file, and store its keys in the result Hash
result = {}
CSV.foreach("file2.csv") do |row|
  result[row[0]] = false
end
# Read file1, the larger file, and fill in values for keys already in the Hash
CSV.foreach("file1.csv") do |row|
  result[row[0]] = row[1] if result.key? row[0]
end
# Write the results
File.open("outputfile.txt", "w") do |f|
  result.each do |key, value|
    f.puts value if value
  end
end
Tested with Ruby 1.9.3
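That "one step further" variant, writing the output while streaming the larger file, could look like this (a sketch; note the matches then come out in file1's order):
require 'csv'
require 'set'
# Keys from the smaller file; a Set gives constant-time membership tests
keys = Set.new
CSV.foreach("file2.csv") { |row| keys << row[0] }
# Stream the larger file once and write matches as they are found
File.open("outputfile.txt", "w") do |out|
  CSV.foreach("file1.csv") do |row|
    out.puts row[1] if keys.include?(row[0])
  end
end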
Parsing file 1:
require 'csv'
data_csv_file1 = File.read("file1.csv")
data_csv1 = CSV.parse(data_csv_file1) # the sample files have no header row
Parsing file 2:
data_csv_file2 = File.read("file2.csv")
data_csv2 = CSV.parse(data_csv_file2)
Collecting the names:
names_from_sheet1 = data_csv1.collect { |data| data[0] } # returns an array of names
names_from_sheet2 = data_csv2.collect { |data| data[0] } # returns an array of names
common_names = names_from_sheet1 & names_from_sheet2 # array with the common names
Collecting the results to be printed:
results = [] # this will store the values to be printed
data_csv1.each { |data| results << data[1] if common_names.include?(data[0]) }
Final output:
f = File.open("outputfile.txt", "w")
results.each { |result| f.puts result }
f.close

Append an array of hashes to a CSV in Ruby 1.8

How do I append an array of hashes to a CSV in Ruby 1.8? There is FasterCSV for Ruby 1.9, but how do I do it in 1.8?
This is what I have tried; hasharray is an array whose elements are hashes.
CSV.open("data.csv", "wb") { |csv|
hasharray.each{ |oput|
oput.to_a.each {|elem| csv << elem}
}
}
This writes all the data to the CSV, but each key/value pair ends up on its own row instead of side by side.
When iterating over a hash, you want to use two arguments in the block, one for the key, the other for the value. For a single hash oput out of hasharray, consider:
oput.each { |k, v| puts "#{k},#{v}" }
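To get the values side by side, write one CSV row per hash, taking the values in a fixed key order. FasterCSV is in fact the Ruby 1.8 gem (it became the built-in CSV in 1.9), so a sketch along these lines should work on 1.8, assuming every hash in hasharray shares the same keys:
require 'rubygems'
require 'fastercsv' # gem install fastercsv
keys = hasharray.first.keys # fix one column order for every row
FasterCSV.open("data.csv", "a") do |csv| # "a" appends rather than overwrites
  hasharray.each { |oput| csv << oput.values_at(*keys) }
end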
