How to process CSV data from two files using a matching column - ruby

I have two CSV files: one is 3.5 GB and the other is 100 MB. The larger file contains two columns that I need to combine with two columns from the smaller file to produce a third CSV file.
Both files contain a postcode column, which is what I'm using to match rows between them. However, because the files are so large, the operation is slow. I tried looking for matches in two ways, but both were slow:
CSV.foreach('ukpostcodes.csv') do |row|
  CSV.foreach('pricepaid.csv') do |item|
    if row[1] == item[3]
      puts "match"
    end
  end
end
and:
firstFile = CSV.read('pricepaid.csv')
secondFile = CSV.read('ukpostcodes.csv')

post_codes = Array.new
lat_longs = Array.new

firstFile.each do |row|
  post_codes << row[3]
end

secondFile.each do |row|
  lat_longs << row[1]
end

post_codes.each do |row|
  lat_longs.each do |item|
    if row == item
      puts "Match"
    end
  end
end
Is there a more efficient way of handling this task, given how large the CSV files are?
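For what it's worth, the usual way to avoid the nested loop (which makes roughly rows1 × rows2 comparisons) is to index the smaller file in a Hash keyed on postcode, then stream the larger file once and look each row up. A minimal sketch along those lines; the column positions, the extra columns carried over, and the output file name are assumptions based on the snippets above, not taken from the real data:
require 'csv'

# Index the smaller file once: postcode => the columns needed from it.
# Column positions are assumptions (postcode in column 1 of ukpostcodes.csv,
# column 3 of pricepaid.csv, as in the snippets above); adjust to your layout.
lookup = {}
CSV.foreach('ukpostcodes.csv') do |row|
  lookup[row[1]] = [row[2], row[3]]   # e.g. latitude and longitude
end

# Stream the larger file a single time and write matches straight out,
# so only the smaller file has to be held in memory.
CSV.open('combined.csv', 'w') do |csv|
  CSV.foreach('pricepaid.csv') do |row|
    extra = lookup[row[3]]
    csv << [row[0], row[1], *extra] if extra
  end
end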

Related

Fetching second row from csv file in Ruby [duplicate]

actual_row = File.open(file_name[0], 'r')
first_row_data = []

CSV.foreach(actual_row) do |row|
  first_row_data << row[1]
end

puts first_row_data
With this I am trying to fetch the second row of the CSV, but it prints the second column instead.
The foreach method returns an enumerator if no block is given, which allows you to use methods such as drop from Enumerable:
# outputs all rows after the first
CSV.foreach('test.csv').drop(1).each { |row| puts row.inspect }
To limit to just one row, we can then take:
# outputs only the second row
CSV.foreach('test.csv').drop(1).take(1).each { |row| puts row.inspect }
But, we're still parsing the entire file and just discarding most of it. Luckily, we can add lazy into the mix:
# outputs only the second row, parsing only the first 2 rows of the file
CSV.foreach('test.csv').lazy.drop(1).take(1).each { |row| puts row.inspect }
But if the first row is a header row, don't forget you can tell CSV about it:
# outputs only the second row, as a CSV::Row, only parses 2 rows
CSV.foreach('test.csv', headers: true).take(1).each { |row| puts row.inspect }
As an aside (in case I did this wrong), it looks like the shift method is what CSV is using for parsing the rows, so I just added:
class CSV
  alias :orig_shift :shift
  def shift
    $stdout.puts "shifting row"
    orig_shift
  end
end
and ran with a sample csv to see how many times "shifting row" was output for each of the examples.
If you'd like the entire row, you should change
row[1]
to just
row
row[1] grabs the second column's value from the row; each column value is stored sequentially in the row array. You can see this directly in your console if you print:
puts row.inspect
If you want just the second row, you can try something like this:
actual_row = File.open(file_name[0], 'r')
first_row_data = []

CSV.foreach(actual_row) do |row|
  # $. holds the number of the last line read, so it is 2 on the second row
  if $. == 2
    first_row_data << row
  end
end

puts first_row_data
You can learn more about $. and similar variables here: https://docs.ruby-lang.org/en/2.4.0/globals_rdoc.html
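If you'd rather not rely on the $. global, which is tied to line reads on the underlying IO rather than to parsed CSV rows, a small sketch of an alternative that counts rows explicitly (reusing the test.csv file name from the answer above):
require 'csv'

second_row = nil

# Track the row index ourselves instead of relying on $..
CSV.foreach('test.csv').each_with_index do |row, index|
  if index == 1          # indices start at 0, so 1 is the second row
    second_row = row
    break
  end
end

puts second_row.inspect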

How to create a new CSV row of data per X amount of strings in an array

I'm trying to create a spreadsheet from an array.
# Loop through each .olpOffer (product listing) and gather content from various elements
parse_page.css('.olpOffer').each do |a|
  if a.css('.olpSellerName img').empty?
    seller = a.css('.olpSellerName').text.strip
  else
    seller = a.css('.olpSellerName img').attr('alt').value
  end
  offer_price = a.css('.olpOfferPrice').text.strip
  prime = a.css('.supersaver').text.strip
  shipping_info = a.css('.olpShippingInfo').text.strip.squeeze(" ").gsub(/\n/, '')
  condition = a.css('.olpCondition').text.strip
  fba = "FBA" unless a.css('.olpBadge').empty?

  # Push data from each product listing into array
  arr.push(seller, offer_price, prime, shipping_info, condition, fba)
end

# Need to make each product listing's data begin in a new row [HELP!!]
CSV.open("file.csv", "wb") do |csv|
  csv << ["Seller", "Price", "Prime", "Shipping", "Condition", "FBA"]
end
I need to reset the row that the array is writing to after the "FBA" column so that I don't end up with one huge row of data in row 2.
I can't figure out how to correlate each string to a specific column header. Should I not use an array?
I figured it out. I needed the array that I was feeding into my csv to create a new row after every 7 strings in the array. Here's how I did it:
# arr = an array that has some given amount of strings, always divisible by 7
rows = arr.each_slice(7)

CSV.open("#{file_name}", "ab") do |csv|
  csv << [title, asin]
  rows.each do |row|
    csv << row
  end
end
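As an aside, the slicing step can be avoided by collecting one sub-array per listing while scraping and appending those arrays directly. A minimal sketch with made-up listing data standing in for the scraped values (in the question these come from the Nokogiri loop above):
require 'csv'

# Hypothetical listing data; each inner array is one product listing.
listings = [
  ["Seller A", "$9.99",  "Prime", "Free shipping", "New",  "FBA"],
  ["Seller B", "$12.50", "",      "+ $3.99",       "Used", nil]
]

CSV.open("file.csv", "wb") do |csv|
  csv << ["Seller", "Price", "Prime", "Shipping", "Condition", "FBA"]
  listings.each { |row| csv << row }   # one array per listing => one row each
end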

Count the number of elements in a row of a CSV in Ruby

I have the following piece of code to process a CSV:
CSV.foreach("./matrix1.csv") do |row|
puts "Row is: "
print row
line_count += 1
end
And it successfully found the number of lines in the CSV. However, how can I find the number of elements in one line (row)?
For example, I have the following CSV
1,2,3,4,5
How can I see that the number of elements is 5?
If each line contains the same number of elements, then:
CSV.open('test.csv', 'r') { |csv| puts csv.first.length }
If not, then count for each line:
CSV.foreach('test.csv', 'r') { |row| puts row.length }
row.size
worked as suggested by #Stefan
But a less error-prone approach is:
CSV.foreach('test.csv','r').max_by(&:length).length
Another way is to get the size of the current line:
CSV.foreach('test.csv', 'r') { |row| puts row.length }

Best way of Parsing 2 CSV files and printing the common values in a third file

I am new to Ruby, and I have been struggling with a problem that I suspect has a simple answer. I have two CSV files, one with two columns, and one with a single column. The single column is a subset of values that exist in one column of my first file. Example:
file1.csv:
abc,123
def,456
ghi,789
jkl,012
file2.csv:
def
jkl
All I need to do is look up the column 2 value in file1 for each value in file2 and output the results to a separate file. So in this case, my output file should consist of:
456
012
I’ve got it working this way:
pairs = IO.readlines("file1.csv").map { |columns| columns.split(',') }

f1 = []
pairs.each do |x| f1.push(x[0]) end

f2 = IO.readlines("file2.csv").map(&:chomp)

collection = {}
pairs.each do |x| collection[x[0]] = x[1] end

f = File.open("outputfile.txt", "w")
f2.each do |col1, col2| f.puts collection[col1] end
f.close
...but there has to be a better way. If anyone has a more elegant solution, I'd be very appreciative! (I should also note that I will eventually need to run this on files with millions of lines, so speed will be an issue.)
To be as memory efficient as possible, I'd suggest only reading the full file2 (which I gather would be the smaller of the two input files) into memory. I'm using a hash for fast lookups and to store the resulting values, so as you read through file1 you only store the values for those keys you need. You could go one step further and write the outputfile while reading file2.
require 'csv'

# Read file 2, the smaller file, and store keys in result Hash
result = {}
CSV.foreach("file2.csv") do |row|
  result[row[0]] = false
end

# Read file 1, the larger file, and look for keys in result Hash to set values
CSV.foreach("file1.csv") do |row|
  result[row[0]] = row[1] if result.key? row[0]
end

# Write the results
File.open("outputfile.txt", "w") do |f|
  result.each do |key, value|
    f.puts value if value
  end
end
Tested with Ruby 1.9.3
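One detail worth noting: Ruby hashes (1.9 and later) preserve insertion order, so the output lines come out in the same order as the keys appear in file2.csv, which matches the expected output above.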
Parsing For File 1
data_csv_file1 = File.read("file1.csv")
data_csv1 = CSV.parse(data_csv_file1, :headers => true)
Parsing For File 2
data_csv_file2 = File.read("file2.csv")
data_csv2 = CSV.parse(data_csv_file2, :headers => true)
Collection of names
names_from_sheet1 = data_csv1.collect {|data| data[0]} #returns an array of names
names_from_sheet2 = data_csv2.collect {|data| data[0]} #returns an array of names
common_names = names_from_sheet1 & names_from_sheet2 #array with common names
Collecting results to be printed
results = [] #this will store the values to be printed
data_csv1.each {|data| results << data[1] if common_names.include?(data[0]) }
Final output
f = File.open("outputfile.txt","w")
results.each {|result| f.puts result }
f.close
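Note that this second approach reads both files fully into memory with CSV.parse and then does an Array#include? lookup for every row of file1, so for the files with millions of lines mentioned in the question, the streaming Hash-based answer above should scale considerably better.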

How to dump a 2D array directly into a CSV file?

I have this 2D array:
arr = [[1,2],[3,4]]
I usually do:
CSV.open(file, 'w') do |csv|
  arr.each do |row|
    csv << row
  end
end
Is there any easier or direct way of doing it other than adding row by row?
Assuming that your array is just numbers (no strings that potentially have commas in them) then:
File.open(file,'w'){ |f| f << arr.map{ |row| row.join(',') }.join("\n") }
One enormous string blatted to disk, without involving the CSV library.
Alternatively, using the CSV library to correctly escape each row:
require 'csv'
# #to_csv automatically appends '\n', so we don't need it in #join
File.open(file,'w'){ |f| f << arr.map(&:to_csv).join }
If you have to do this often and the code bothers you, you could monkeypatch it in:
class CSV
  def self.dump_array(array, path, mode="wb", opts={})
    open(path, mode, **opts){ |csv| array.each{ |row| csv << row } }
  end
end

CSV.dump_array(arr, file)
Extending @Phrogz's answer above, using the csv library but changing the default delimiter:
File.open(file,'w'){ |f| f << arr.map{|x| x.to_csv(col_sep: '|')}.join }
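If you'd rather write the file in one call without joining the rows yourself, CSV.generate builds the whole document as a single string; a small sketch (the output file name is assumed):
require 'csv'

arr = [[1, 2], [3, 4]]

# CSV.generate yields a CSV object backed by a String and returns that String,
# so the whole document can be written with a single File.write call.
File.write('out.csv', CSV.generate { |csv| arr.each { |row| csv << row } })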
