Compressing using Bzip2 on-the-fly to a file? - ruby

There is a program that generates huge CSV files. For example:
require 'csv'

arr = (0..10).to_a
CSV.open("foo.csv", "wb") do |csv|
  (2**16).times { csv << arr }
end
It will generate a big file, so I want the output compressed on the fly: instead of writing an uncompressed CSV file (foo.csv), write a bzip2-compressed one (foo.csv.bz2).
I have an example from the "ruby-bzip2" gem:
writer = Bzip2::Writer.new File.open('file')
writer << 'data1'
writer.close
I am not sure how to compose the Bzip2 writer with the CSV one.

You can also construct a CSV object with an IO or something sufficiently like an IO, such as a Bzip2::Writer.
For example
require 'csv'
require 'bzip2'

arr = (0..10).to_a # arr as in the question

File.open('file.bz2', 'wb') do |f|
  writer = Bzip2::Writer.new f
  CSV(writer) do |csv|
    (2**16).times { csv << arr }
  end
  writer.close
end
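One detail worth keeping is the explicit writer.close: bzip2 compresses in blocks, so the final block and the stream trailer are only flushed when the writer is closed. Dropping that line would leave a truncated archive.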

Maybe it would be more flexible to write the CSV data to stdout:
# csv.rb
require 'csv'

$stdout.sync = true

arr = (0..10).to_a
(2**16).times do
  puts arr.to_csv
end
... and pipe the output to bzip2:
$ ruby csv.rb | bzip2 > foo.csv.bz2
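If you would rather keep everything in one script, the same pipe can be driven from inside Ruby with IO.popen. A minimal sketch, assuming the bzip2 binary is on the PATH:

require 'csv'

arr = (0..10).to_a

# Write CSV rows into bzip2's stdin; the block form closes the pipe,
# which lets bzip2 flush its output and exit.
IO.popen('bzip2 > foo.csv.bz2', 'w') do |pipe|
  (2**16).times { pipe.write(arr.to_csv) }
end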

Related

Ruby CSV - Write on same row without overwriting?

I'm using
CSV.open(filename, "w") do |csv|
to create and write to a CSV file in one .rb file, and now I need to open and edit it in a second .rb file. Right now I'm using CSV.open(filename, "a") do |csv|, but that creates new rows rather than appending the new content to the end of the existing rows.
If I use CSV.open(filename, "w") do |csv| the second time, it overwrites the first rows.
edit:
# Create export CSV
final_export_csv = "filepath_final.csv"
# Create filename for CSV file
imported_csv_filename = "imported_file.csv"

CSV.open(final_export_csv, "w", headers: ["several", "headers"] + [:new_header], write_headers: true) do |final_csv|
  # Read existing CSV file
  CSV.foreach(imported_csv_filename) do |old_csv_row|
    # Read a row, add the new column, write it to the new row
    CSV.open(denominator_csv_filename, "r+") do |new_csv_col|
      # gathering some data code
      data = { passed.in }
      # Write data
      new_csv_col << [passedin[:data]]
      old_csv_row[:new_header] = passedin[:data]
      final_csv << old_csv_row
    end
  end
end
As tadman comments, you can't actually edit a file in place. Well, you can, but all the lines would have to remain the same length, and you're not doing that.
Instead, read a row, modify it, and write it to a new CSV. Then replace the old file with the new one. Be careful to avoid slurping the entire CSV into memory; CSV files can get quite large.
require 'csv'
require 'tempfile'
require 'fileutils'

csv_file = "test.csv"

# Write the new file to a tempfile to avoid polluting the directory.
temp = Tempfile.new

# Read the header line.
old_csv = CSV.open(csv_file, "r", headers: true, return_headers: true)
old_csv.readline

# Open the new CSV with the existing headers plus a new one.
new_csv = CSV.open(
  temp, "w",
  headers: old_csv.headers + [:new],
  write_headers: true
)

# Read a row, add the new column, write it to the new CSV.
old_csv.each do |row|
  row[:new] = 42
  new_csv << row
end

old_csv.close
new_csv.close

# Replace the old CSV with the new one.
FileUtils.move(temp.path, csv_file)
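Writing to a Tempfile and then moving it over the original has a useful side effect: the original file stays intact until the new one is fully written, so a crash mid-run can't leave you with a half-written CSV.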

Generate CSV from Ruby results

I currently have this script that generates usernames from a given CSV. Rather than printing these results to the console, how can I write a new CSV with these results?
This is the script I currently have; it runs with no errors. I am assuming that if I write a new CSV inside the do |row| block, it is going to create x amount of new files, which I do not want.
require 'csv'

CSV.foreach('data.csv', :headers => true) do |row|
  id = row['id']
  fn = row['first_name']
  ln = row['last_name']
  p fn[0] + ln + id[3, 8]
end
Just wrap the reading loop in the CSV file you're writing:
CSV.open("path/to/file.csv", "wb") do |csv|
CSV.foreach('data.csv', :headers => true) do |row|
id = row['id']
fn = row['first_name']
ln = row['last_name']
csv << [fn[0], ln, id[3,8]]
# or, to output it as a single column:
# csv << ["#{fn[0]}#{ln}#{id[3,8]}"]
end
end
See the Ruby CSV documentation on writing CSV to a file.
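If the generated file should also carry a header row, here is a minimal sketch; the usernames.csv file name and the username header are my assumptions, not from the question:

require 'csv'

# write_headers: true emits the headers array before the first row
CSV.open('usernames.csv', 'wb', headers: ['username'], write_headers: true) do |csv|
  CSV.foreach('data.csv', headers: true) do |row|
    csv << ["#{row['first_name'][0]}#{row['last_name']}#{row['id'][3, 8]}"]
  end
end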

How to map and edit a CSV file with Ruby

Is there a way to edit a CSV file using the map method in Ruby? I know I can open a file using:
CSV.open("file.csv", "a+")
and add content to it, but I have to edit some specific lines.
The foreach method is only useful to read a file (correct me if I'm wrong).
I checked the Ruby CSV documentation but I can't find any useful info.
My CSV file has less than 1500 lines so I don't mind reading all the lines.
Another answer, using each_with_index:
rows_array = CSV.read('sample.csv')
desired_indices = [3, 4, 5] # these are the rows you would like to modify

rows_array.each_with_index do |row, index|
  if desired_indices.include?(index)
    # modify over here
    row[target_column] = 'modification'
  end
end

# now update the file
CSV.open('sample3.csv', 'wb') { |csv| rows_array.each { |row| csv << row } }
You can also use each.with_index {} instead of each_with_index {}.
Is there a way to edit a CSV file using the map method in Ruby?
Yes:
rows = CSV.open('sample.csv')
rows_array = rows.to_a
or
rows_array = CSV.read('sample.csv')
desired_indices = [3, 4, 5] # these are the rows you would like to modify

edited_rows = rows_array.each_with_index.map do |row, index|
  if desired_indices.include?(index)
    # modify over here (or simply return the row)
    row[3] = 'shiva'
    # store the index with each edited row to keep track of the rows
    [index, row]
  end
end.compact

# update the main rows_array with the updated data
edited_rows.each { |row| rows_array[row[0]] = row[1] }

# now update the file
CSV.open('sample2.csv', 'wb') { |csv| rows_array.each { |row| csv << row } }
This is a little messier, isn't it? I suggest using each_with_index without map for this; see my other answer.
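For completeness, a minimal map-based sketch that avoids the index bookkeeping; the file names and the modified column follow the answer above:

require 'csv'

rows_array = CSV.read('sample.csv')
desired_indices = [3, 4, 5]

# map returns every row, modified in place when its index qualifies
updated = rows_array.each_with_index.map do |row, index|
  row[3] = 'shiva' if desired_indices.include?(index)
  row
end

CSV.open('sample2.csv', 'wb') { |csv| updated.each { |row| csv << row } }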
Here is a little script I wrote as an example of how to read CSV data, do something to the data, and then write the edited text out to a new file:
read_write_csv.rb:
#!/usr/bin/env ruby
require 'csv'

src_dir = "/home/user/Desktop/csvfile/FL_insurance_sample.csv"
dst_dir = "/home/user/Desktop/csvfile/FL_insurance_sample_out.csv"

puts "Reading data from: #{src_dir}"
puts "Writing data to: #{dst_dir}"

# create the new file
csv_out = File.open(dst_dir, 'wb')

# read from the existing file
CSV.foreach(src_dir, :headers => false) do |row|
  # then you can do this:
  # newrow = row.each_with_index { |rowcontent, row_num| puts "#{rowcontent} #{row_num}" }

  # OR array to hash .. just saying .. maybe a hash of arrays:
  # h = Hash[*row]
  # csv_out << h

  # OR use map:
  # newrow = row.map(&:capitalize)
  # csv_out << newrow

  # OR use each (remember to add an `end`):
  # newrow.each do |k, v|
  #   puts "#{k} is #{v}"
  # end

  # Lastly, write the edited/regexed data etc. back out to the file:
  # csv_out << newrow
end

# close the file
csv_out.close
The output file has the desired data:
USER#USER-SVE1411EGXB:~/Desktop/csvfile$ ls
FL_insurance_sample.csv FL_insurance_sample_out.csv read_write_csv.rb
The input file data looked like this:
policyID,statecode,county,eq_site_limit,hu_site_limit,fl_site_limit,fr_site_limit,tiv_2011,tiv_2012,eq_site_deductible,hu_site_deductible,fl_site_deductible,fr_site_deductible,point_latitude,point_longitude,line,construction,point_granularity
119736,FL,CLAY COUNTY,498960,498960,498960,498960,498960,792148.9,0,9979.2,0,0,30.102261,-81.711777,Residential,Masonry,1
448094,FL,CLAY COUNTY,1322376.3,1322376.3,1322376.3,1322376.3,1322376.3,1438163.57,0,0,0,0,30.063936,-81.707664,Residential,Masonry,3
206893,FL,CLAY COUNTY,190724.4,190724.4,190724.4,190724.4,190724.4,192476.78,0,0,0,0,30.089579,-81.700455,Residential,Wood,1
333743,FL,CLAY COUNTY,0,79520.76,0,0,79520.76,86854.48,0,0,0,0,30.063236,-81.707703,Residential,Wood,3
172534,FL,CLAY COUNTY,0,254281.5,0,254281.5,254281.5,246144.49,0,0,0,0,30.060614,-81.702675,Residential,Wood,1

Compressing using LZMA on-the-fly to a file?

This code compresses CSV data on the fly to a file using a Bzip2 writer:
File.open('file.bz2', 'wb') do |f|
  writer = Bzip2::Writer.new f
  CSV(writer) do |csv|
    (2**16).times { csv << arr }
  end
  writer.close
end
I want to do the same using the LZMA algorithm. The ruby-lzma gem could be useful, but it exposes only one method: compressed = LZMA.compress('data to compress').
Question:
Is there a way to do a similar compression using lzma?
Use ruby-xz, which has a much better interface to liblzma (via FFI).
The library has an XZ::StreamWriter class; check the ruby-xz docs.
However, the CSV constructor does not accept an XZ::StreamWriter, so you need to change the code to use CSV.generate_line. I was able to run this, and it does generate the file on the fly:
require 'xz'
require 'csv'

arr = ['one', 'two', 'three']

File.open('file.xz', 'wb') do |f|
  XZ::StreamWriter.new(f) do |writer|
    (2**16).times { writer << CSV.generate_line(arr) }
    writer.finish
  end
end
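If it's acceptable to let ruby-xz manage the file itself, XZ::StreamWriter.open looks like a slightly shorter route. A sketch, assuming the block form finishes the stream for you:

require 'xz'
require 'csv'

arr = ['one', 'two', 'three']

# StreamWriter.open opens the file itself; the block form is assumed
# to finish the xz stream and close the file when the block returns.
XZ::StreamWriter.open('file.xz') do |writer|
  (2**16).times { writer << CSV.generate_line(arr) }
end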

Split output data using CSV in Ruby 1.9

I have a csv file that has 7000+ records that I process/manipulate and export to a new csv file. I have no issues doing that and everything works as expected.
I would like to change the process to where it breaks the output into multiple files. So instead of writing all 7000+ rows to the new csv file it would write the first 1000 rows to newexport1.csv and the next 1000 rows to newexport2.csv until it reaches the end of the data.
Is there an easy way to do this with CSV in Ruby 1.9?
My current write method:
CSV.open("#{PATH_TO_EXPORT_FILE}/newexport.csv", "w+", :col_sep => '|', :headers => true) do |f|
export_rows.each do |row|
f << row
The short answer is "no". You'll want to adjust your current code to split up the set and then dump each subset to a different file. This ought to be pretty close:
export_rows.each_slice(1000).with_index do |rows, idx|
  CSV.open("#{PATH_TO_EXPORT_FILE}/newexport-#{idx}.csv", "w+", :col_sep => '|', :headers => true) do |f|
    rows.each { |row| f << row }
  end
end
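If export_rows itself is too large to hold in memory, here is a hedged sketch that streams straight from the source file instead; the export.csv source name is an assumption:

require 'csv'

out = nil
CSV.foreach('export.csv', col_sep: '|', headers: true).with_index do |row, i|
  if (i % 1000).zero?
    out&.close
    # start a new chunk file every 1000 rows
    out = CSV.open("newexport#{i / 1000 + 1}.csv", 'w', col_sep: '|')
  end
  # note: the source header row is consumed by headers: true and is
  # not re-written to the chunk files in this sketch
  out << row
end
out&.close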
Yes, there is.
The CSV library is built into Ruby 1.9; check the CSV documentation.
To read:
CSV.foreach("path/to/file.csv") do |row|
# manipulate the content
end
To write:
CSV.open("path/to/file.csv", "wb") do |csv|
csv << ["row", "of", "CSV", "data"]
csv << ["another", "row"]
# something else
end
I think you'll need to nest one inside the other.
FasterCSV has been the standard CSV library since Ruby 1.9; you can find a lot of example code in its examples folder:
https://github.com/JEG2/faster_csv/tree/master/examples
For the example code to work, you should change:
require "faster_csv"
to
require "csv"
