dealing with large CSV files (20G) in ruby - ruby

I am working on little problem and would have some advice on how to solve it:
Given a csv file with an unknown number of columns and rows, output a list of columns with values and the number of times each value was repeated. without using any library.
if the file is small this shouldn't be a problem, but when it is a few Gigs, i get NoMemoryError: failed to allocate memory. is there a way to create a hash and read from the disk instead of loading the file to Memory? you can do that in perl with tied Hashes
EDIT: will IO#foreach load the file into memory? how about File.open(filename).each?

Read the file one line at a time, discarding each line as you go:
open("big.csv") do |csv|
csv.each_line do |line|
values = line.split(",")
# process the values
end
end
Using this method, you should never run out of memory.

Do you read the whole file at once? Reading it on a per-line basis, i.e. using ruby -pe, ruby -ne or $stdin.each should reduce the memory usage by garbage collecting lines which were processed.
data = {}
$stdin.each do |line|
# Process line, store results in the data hash.
end
Save it as script.rb and pipe the huge CSV file into this script's standard input:
ruby script.rb < data.csv
If you don't feel like reading from the standard input we'll need a small change.
data = {}
File.open("data.csv").each do |line|
# Process line, store results in the data hash.
end

For future reference, in such cases you want to use CSV.foreach('big_file.csv', headers: true) do |row|
This will read the file line by line from the IO object with minimal memory footprint (should be below 1MB regardless of file size).

Related

Generating CSV/Excel file for 100 Million of records in Ruby on rails?

Requirement is like
We get the huge dataset from the database( > 1 billion records) and need to export it to the csv file or excel.
Currently implementation use CSV class CSV.generate
CSV.generate(headers: true) do |csv|
csv << header
#obj.find_each do |c|
arr = [c.id,c.name,soon]
csv << array
end
end
and sending the output to
Zip::File.open(file, Zip::File::CREATE) do |zip|
zip.get_output_stream("test.#{#format}") { |f| f.puts(convert_to_csv) }
end
All this operation is done other delayed jobs
This works good when record is < 20,000
But when rows starts growing it gets some memory issues.
What i was thinking is to chunk the record to pieces say 1 million rows into 50 files (1million/20000)(csv1.csv,csv2.csv,csv3.csv,csv4.csv,csv5.csv) and then concat them into single file or zip all files together(faster way)
Can any one give me idea how can I start on it.
Taking a look at the source for CSV.generate gives me the impression that the csv data is kept in memory while the contents are being accumulated. That seems like a good target for optimization, especially if you see that memory is scaling linearly with the data set. Since your data is pretty simple, could you skip CSV and go directly to File instead? You'd have a bit more control about when data was flushed out to disk.
File.open("my.csv") do |file|
file.puts '"ID","Name","Soon"'
#obj.find_each do |c|
file.puts "\"#{c.id}\",\"#{c.name}\",\"#{c.soon}\""
# flush if necessary
end
end
You'd need to write to disk and then zip the results later with this approach.
Write to the CSV in chunks and find_in_batches and pluck. Something like:
Model.pluck(:id, :name, ...).find_in_batches(10_000) do |ary|
CSV.open("tmp.csv", "ab") do |csv|
csv << ary.map{|a| a.join ','}.join("\n")
end
end
It would depend whether
arr = [c.id,c.name,soon]
needs to be calculated in Ruby, or you would be able to rewrite it in SQL.
If you have to keep it in Ruby, you can try to avoid the ActiveRecord overhead and use a raw query instead. You'd probably have to implement the chunk wise processing by yourself
Otherwise, you could check out some database native tool for CSV export. E.g., for MySQL that would be something like SELECT INTO OUTFILE or mysql

How to read a large file into a string

I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(#mat) # serialize object - OK
File.open('mat_save', 'w') {|f| f.write(mat_dump)} # write String to file - OK
# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads like 5%
#mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried to change the arguments for load but didn't find anything yet that doesn't cause an error.
How can I load the entire file into memory? If I could read the file chunk by chunk, then loop to store it in the String and then deserialize, it would work too. The file has basically one big line so I can't even say I'll read it line by line, the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.
Marshal is a binary format, so you need to read and write in binary mode. The easiest way is to use IO.binread/write.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
#mat = Marshal.load(mat_dump)
Remember that Marshaling is Ruby version dependent. It's only compatible under specific circumstances with other Ruby versions. So keep that in mind:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.

How to read a large text file line-by-line and append this stream to a file line-by-line in Ruby?

Let's say I want to combine several massive files into one and then uniq! the one (THAT alone might take a hot second)
It's my understanding that File.readlines() loads ALL the lines into memory. Is there a way to read it line by line, sort of like how node.js pipe() system works?
One of the great things about Ruby is that you can do file IO in a block:
File.open("test.txt", "r").each_line do |row|
puts row
end # file closed here
so things get cleaned up automatically. Maybe it doesn't matter on a little script but it's always nice to know you can get it for free.
you aren't operating on the entire file contents at once, and you don't need to store the entirety of each line either if you use readline.
file = File.open("sample.txt", 'r')
while !file.eof?
line = file.readline
puts line
end
Large files are best read by streaming methods like each_line as shown in the other answer or with foreach which opens the file and reads line by line. So if the process doesn't request to have the whole file in memory you should use the streaming methods. While using streaming the required memory won't increase even if the file size increases opposing to non-streaming methods like readlines.
File.foreach("name.txt") { |line| puts line }
uniq! is defined on Array, so you'll have to read the files into an Array anyway. You cannot process the file line-by-line because you don't want to process a file, you want to process an Array, and an Array is a strict in-memory data structure.

Choose starting row for CSV.foreach or similar method? Don't want to load file into memory

Edit (I adjusted the title): I am currently using CSV.foreach but that starts at the first row. I'd like to start reading a file at an arbitrary line without loading the file into memory. CSV.foreach works well for retrieving data at the beginning of a file but not for data I need towards the end of a file.
This answer is similar to what I am looking to do but it loads the entire file into memory; which is what I don't want to do.
I have a 10gb file and the key column is sorted in ascending order:
# example 10gb file rows
key,state,name
1,NY,Jessica
1,NY,Frank
1,NY,Matt
2,NM,Jesse
2,NM,Saul
2,NM,Walt
etc..
I find the line I want to start with this way ...
file = File.expand_path('~/path/10gb_file.csv')
File.open(file, 'rb').each do |line|
if line[/^2,/]
puts "#{$.}: #{line}" # 5: 2,NM,Jesse
row_number = $. # 5
break
end
end
... and I'd like to take row_number and do something like this but not load the 10gb file into memory:
CSV.foreach(file, headers: true).drop(row_number) { |row| "..load data..." }
Lastly, I'm currently handling it like the next snippet; It works fine when the rows are towards the front of the file but not when they're near the end.
CSV.foreach(file, headers: true) do |row|
next if row['key'].to_i < row_number.to_i
break if row['key'].to_i > row_number.to_i
"..load data..."
end
I am trying to use CSV.foreach but I'm open to suggestions. An alternative approach I am considering but does not seem to be efficient for numbers towards the middle of a file:
Use IO or File and read the file line by line
Get the header row and build the hash manually
Read the file from the bottom for numbers near the max key value
I think you have the right idea. Since you've said you're not worried about fields spanning multiple lines, you can seek to a certain line in the file using IO methods and start parsing there. Here's how you might do it:
begin
file = File.open(FILENAME)
# Get the headers from the first line
headers = CSV.parse_line(file.gets)
# Seek in the file until we find a matching line
match = "2,"
while line = file.gets
break if line.start_with?(match)
end
# Rewind the cursor to the beginning of the line
file.seek(-line.size, IO::SEEK_CUR)
csv = CSV.new(file, headers: headers)
# ...do whatever you want...
ensure
# Don't forget the close the file
file.close
end
The result of the above is that csv will be a CSV object whose first row is the row that starts with 2,.
I benchmarked this with an 8MB (170k rows) CSV file (from Lahman's Baseball Database) and found that it was much, much faster than using CSV.foreach alone. For a record in the middle of the file it was about 110x faster, and for a record toward the end about 66x faster. If you want, you can take a look at the benchmark here: https://gist.github.com/jrunning/229f8c2348fee4ba1d88d0dffa58edb7
Obviously 8MB is nothing like 10GB, so regardless this is going to take you a long time. But I'm pretty sure this will be quite a bit faster for you while also accomplishing your goal of not reading all of the data into the file at once.
Foreach will do everything you need. It streams, so it works well with big files.
CSV.foreach('~/path/10gb_file.csv') do |line|
# Only one line will be read into memory at a time.
line
end
Fastest way to skip data that we’re not interested in is to use read to advance through a portion of the file.
File.open("/path/10gb_file.csv") do |f|
f.seek(107) # skip 107 bytes eg. one line. (constant time)
f.read(50) # read first 50 on second line
end

Ruby File.read vs. File.gets

If I want to append the contents of a src file into the end of a dest file in Ruby, is it better to use:
while line = src.gets do
or
while buffer = src.read( 1024 )
I have seen both used and was wondering when should I use each method and why?
One is for reading "lines", one is for reading n bytes.
While byte buffering might be faster, a lot of that may disappear into the OS which likely does buffering anyway. IMO it has more to do with the context of the read--do you want lines, or are you just shuffling chunks of data around?
That said, a performance test in your specific environment may be helpful when deciding.
You have a number of options when reading a file that are tailored to different situations.
Read in the file line-by-line, but only store one line at a time:
while (line = file.gets) do
# ...
end
Read in all lines of a file at once:
file.readlines.each do |line|
# ...
end
Read the file in as a series of blocks:
while (data = file.read(block_size))
# ...
end
Read in the whole file at once:
data = file.read
It really depends on what kind of data you're working with. Generally read is better suited towards binary files, or those where you want it as one big string. gets and readlines are similar, but readlines is more convenient if you're confident the file will fit in memory. Don't do this on multi-gigabyte log files or you'll be in for a world of hurt as your system starts swapping. Use gets for situations like that.
gets will read until the end of the line based on a separator
read will read n bytes at a time
It all depends on what you are trying to read.
It may be more efficient to use read if your src file has unpredictable line lengths.

Resources