Editing a CSV file in place, row by row - ruby

I have a long CSV file with two columns of numbers:
1,2
2,5
7,3
etc...
I would like to add a third column equal to the sum of the first two:
1,2,3
2,5,7
7,3,10
The following code solves the problem, but it makes a copy of the input file with the third column appended. Instead, I would like to operate on the input file line by line, writing the third column to each line as I go along. If the process errored out partway through for some reason, the answers for the first part of the file would already be saved and would not need to be recalculated.
I can't come up with a good way to do this using Ruby's CSV class. Here's my current solution with the copied file:
require 'csv'
# Build a small sample input file
CSV.open("big_file.csv", "w") do |csv|
  csv << %w{1 2}
  csv << %w{2 5}
  csv << %w{3 8}
end
# I'm creating a copy of big_file.csv here
# I'd rather edit it in place
CSV.open("copy_with_extra_column.csv", "w") do |csv|
  CSV.foreach("big_file.csv") do |row|
    # Fields are read as strings, so convert them before adding
    row << (row[0].to_i + row[1].to_i)
    csv << row
  end
end

To put this another way: there is no way, at the fundamental file level, to "insert" the sum into the file. In your example:
1,2
2,5
7,3
If we ignore the whole notion of a "CSV" file (which is really just a concept layered on top of a plain text file), then to "insert" the text ,3 at the end of the first line, we need to do all of these things:
Move the "\n" after the 2, and all the following text, two positions later in the file (leaving some junk in its place).
Overwrite the junk with ",3".
Then you would repeat this process for each additional row.
This is obviously very inefficient. In simple terms, the CSV file format is not designed for efficient insertion of data.
Your two options are:
Load the file into memory (e.g. as an array of lines), operate on it there, and then write it all back out over the existing file. Assuming your file only grows, this will work fine, but you'll need to be willing to allocate enough memory to read and operate on the whole file.
Write to a temporary file as you work through the data, and then move the temporary file in place of the original when you're done.
Updating the file "in place" is not practical.

A file is like one long string, for example:
1,2\n2,5
However, unlike a string, you can only overwrite characters in a file. In the example above, there are 7 characters. You can overwrite any of those characters with any characters you choose. So for instance, if you put the sum of the numbers at position 0 and position 2 into position 3, the result is:
1,232,5
That's probably not what you want, because it looks like the first two numbers are 1 and 232 and their sum is 5. However, that is all you can do when editing a file in place: you can only overwrite characters with other characters.
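The overwrite-only behaviour is easy to reproduce. This is a minimal sketch; the helper name overwrite_at and the file name overwrite_demo.txt are made up for illustration.

```ruby
# Overwrites bytes starting at `position`; nothing after them is shifted.
# `overwrite_at` is a hypothetical helper name.
def overwrite_at(path, position, text)
  File.open(path, "r+") do |f|
    f.seek(position) # move the cursor without reading anything
    f.write(text)    # replaces the bytes that were already there
  end
  File.read(path)
end

File.write("overwrite_demo.txt", "1,2\n2,5")
# Position 3 holds the "\n", so the two lines get fused together.
puts overwrite_at("overwrite_demo.txt", 3, "3") # prints 1,232,5
File.delete("overwrite_demo.txt")
```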
For a large file, you can read in one line, then write the altered line to a new file. When you are done, you can delete the original file, and then you can rename the new file to the old file name. You can use the Tempfile class to avoid name clashes for the new file name.
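Here is a rough sketch of that temp-file approach applied to the CSV from the question. It assumes the first two columns hold integers; append_sum_column is a hypothetical helper name, not part of the CSV library.

```ruby
require "csv"
require "tempfile"
require "fileutils"

# Appends the sum of the first two columns to every row of the CSV at
# `path`. Rows are streamed into a temporary file, which then replaces
# the original. `append_sum_column` is a hypothetical helper name.
def append_sum_column(path)
  # Create the temp file next to the original so the final move stays
  # on the same filesystem.
  temp = Tempfile.new(["sum", ".csv"], File.dirname(path))
  begin
    CSV.open(temp.path, "w") do |out|
      CSV.foreach(path) do |row|
        out << row + [row[0].to_i + row[1].to_i]
      end
    end
    temp.close
    FileUtils.mv(temp.path, path)
  ensure
    temp.close! # removes the temp file if it is still there
  end
end
```

Calling append_sum_column("big_file.csv") would leave the original untouched until every row has been written. One thing to be aware of: the replacement file keeps the temp file's permissions, which Tempfile sets to owner-only.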

Instead of CSV.open(), try CSV.read(). For example, it's obviously a little ugly, but:
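require 'csv'
# CSV.read loads every row into memory as an array of arrays
big_csv_file = CSV.read("big_file.csv")
# Fields are strings, so convert them before adding
big_csv_file[0] << (big_csv_file[0][0].to_i + big_csv_file[0][1].to_i)
CSV.open("copy_with_extra_column.csv", "w") do |csv|
  big_csv_file.each do |row|
    csv << row
  end
end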
If you need the file on disk to always reflect the latest results, both the alteration and the write will need to happen inside a loop.

Related

Ruby modify file instead of creating new file

Say I have the following Ruby code which, given a hash of insert positions, reads a file and creates a new file with extra text inserted at those positions:
insertpos = {14=>25,16=>25}
File.open('file.old', 'r') do |oldfile|
  File.open('file.new', 'w') do |newfile|
    oldfile.each_with_index do |line, linenum|
      inserthere = insertpos[linenum]
      line.insert(inserthere, "foo") unless inserthere.nil?
      newfile.write(line)
    end
  end
end
Now, instead of creating that new file, I would like to modify this original (old) file. Can someone give me a hint on how to modify the code? Thanks!
At a very fundamental level, this is an extremely difficult thing to do, in any language, on any operating system. Envision a file as a contiguous series of bytes on disk (this is a very simplistic scenario, but it serves to illustrate the point). You want to insert some bytes in the middle of the file. Where do you put those bytes? There's no place to put them! You would have to basically "shift" the existing bytes after the insertion point "down" by the number of bytes you want to insert. If you're inserting multiple sections into an existing file, you would have to do this multiple times! It will be extremely slow, and you will run a high risk of corrupting your data if something goes awry.
You can, however, overwrite existing bytes and/or append to the end of the file. Most Unix utilities give the appearance of modifying files by creating new files and swapping them with the old. Some more sophisticated schemes, such as those used by databases, allow inserts in the middle of files by (1) reserving space for such operations when the data is first written, (2) allowing non-contiguous blocks of data within the file through indexing and other techniques, and/or (3) copy-on-write schemes where a new version of the data is written to the end of the file and the old version is invalidated by overwriting an indicator of some kind. You most likely don't want to go through all this trouble for your simple use case!
Anyway, you've already found the best way to do what you're trying to do. The only thing you're missing is a FileUtils.mv('file.new', 'file.old') at the very end to replace the old file with the new. Please let me know in the comments if I can help explain this any further.
(Of course, you can read the entire file into memory, make your changes, and overwrite the old file with the updated contents, but I don't believe that's what you're asking here.)
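Put together, the write-then-swap version might look roughly like this. It is only a sketch: insert_into_file is a hypothetical helper name, and writing to "<path>.new" is an assumption standing in for the file.old/file.new pair in the question.

```ruby
require "fileutils"

# Inserts `text` into selected lines of the file at `path`.
# `insertpos` maps a zero-based line number to a character offset
# within that line. `insert_into_file` is a hypothetical helper name.
def insert_into_file(path, insertpos, text)
  new_path = "#{path}.new"
  File.open(path, "r") do |oldfile|
    File.open(new_path, "w") do |newfile|
      oldfile.each_line.with_index do |line, linenum|
        offset = insertpos[linenum]
        line.insert(offset, text) if offset
        newfile.write(line)
      end
    end
  end
  # Replace the original only after the new file is fully written.
  FileUtils.mv(new_path, path)
end
```

With the hash from the question this would be called as insert_into_file('file.old', {14=>25, 16=>25}, "foo").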
Here's something that hopefully solves your purpose:
# 'source' param is a string, the entire source text
# 'lines' param is an array, a list of line numbers to insert after
# 'new' param is a string, the text to add
def insert(source, lines, new)
  results = []
  source.split("\n").each_with_index do |line, idx|
    if lines.include?(idx)
      results << (line + new)
    else
      results << line
    end
  end
  results.join("\n")
end
File.open("foo", "w") do |f|
  10.times do |i|
    f.write("#{i}\n")
  end
end
puts "initial text: \n\n"
txt = File.read("foo")
puts txt
puts "\n\n after inserting at lines 1,3, and 5: \n\n"
result = insert(txt, [1,3,5], "\nfoo")
puts result
Running this shows:
initial text:
0
1
2
3
4
5
6
7
8
9
after inserting at lines 1,3, and 5:
0
1
foo
2
3
foo
4
5
foo
6
7
8
9
If it's a relatively simple operation, you can do it with a Ruby one-liner, like this:
ruby -i -lpe '$_.reverse!' thefile.txt
(found e.g. at https://gist.github.com/KL-7/1590797).

How to read a large text file line-by-line and append this stream to a file line-by-line in Ruby?

Let's say I want to combine several massive files into one and then uniq! the one (THAT alone might take a hot second)
It's my understanding that File.readlines() loads ALL the lines into memory. Is there a way to read it line by line, sort of like how node.js pipe() system works?
One of the great things about Ruby is that you can do file IO in a block:
File.open("test.txt", "r") do |file|
  file.each_line do |row|
    puts row
  end
end # file closed here
so things get cleaned up automatically. Maybe it doesn't matter on a little script but it's always nice to know you can get it for free.
You aren't operating on the entire file contents at once, and if you use readline you only hold one line in memory at a time.
File.open("sample.txt", "r") do |file|
  until file.eof?
    line = file.readline
    puts line
  end
end
Large files are best read with streaming methods such as each_line, as shown in the other answer, or with foreach, which opens the file and reads it line by line. So unless the process needs the whole file in memory, you should use the streaming methods. With streaming, the required memory does not grow as the file grows, as opposed to non-streaming methods like readlines.
File.foreach("name.txt") { |line| puts line }
uniq! is defined on Array, so you'll have to read the files into an Array anyway. You cannot process the file line-by-line because you don't want to process a file, you want to process an Array, and an Array is a strict in-memory data structure.
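If the goal is only to drop duplicate lines, one possible workaround is to stream the inputs and remember which lines have already been written in a Set. This is a sketch rather than a drop-in replacement for uniq!: merge_unique_lines is a hypothetical helper name, and memory still grows with the number of distinct lines, just not with the total size of the files.

```ruby
require "set"

# Streams every line of the files in `inputs` into `output`, skipping
# lines that were already written. Hypothetical helper for illustration.
def merge_unique_lines(inputs, output)
  seen = Set.new
  File.open(output, "w") do |out|
    inputs.each do |input|
      File.foreach(input) do |line|
        line = line.chomp
        # Set#add? returns nil when the line was already present.
        out.puts(line) if seen.add?(line)
      end
    end
  end
end
```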

Choose starting row for CSV.foreach or similar method? Don't want to load file into memory

Edit (I adjusted the title): I am currently using CSV.foreach but that starts at the first row. I'd like to start reading a file at an arbitrary line without loading the file into memory. CSV.foreach works well for retrieving data at the beginning of a file but not for data I need towards the end of a file.
This answer is similar to what I am looking to do but it loads the entire file into memory; which is what I don't want to do.
I have a 10gb file and the key column is sorted in ascending order:
# example 10gb file rows
key,state,name
1,NY,Jessica
1,NY,Frank
1,NY,Matt
2,NM,Jesse
2,NM,Saul
2,NM,Walt
etc..
I find the line I want to start with this way ...
file = File.expand_path('~/path/10gb_file.csv')
row_number = nil # declared here so it is still visible after the block
File.open(file, 'rb').each do |line|
  if line[/^2,/]
    puts "#{$.}: #{line}" # 5: 2,NM,Jesse
    row_number = $. # 5
    break
  end
end
... and I'd like to take row_number and do something like this but not load the 10gb file into memory:
CSV.foreach(file, headers: true).drop(row_number) { |row| "..load data..." }
Lastly, I'm currently handling it like the next snippet. It works fine when the rows are towards the front of the file but not when they're near the end.
CSV.foreach(file, headers: true) do |row|
  next if row['key'].to_i < row_number.to_i
  break if row['key'].to_i > row_number.to_i
  "..load data..."
end
I am trying to use CSV.foreach but I'm open to suggestions. An alternative approach I am considering, though it does not seem efficient for keys towards the middle of the file:
Use IO or File and read the file line by line
Get the header row and build the hash manually
Read the file from the bottom for numbers near the max key value
I think you have the right idea. Since you've said you're not worried about fields spanning multiple lines, you can seek to a certain line in the file using IO methods and start parsing there. Here's how you might do it:
require 'csv'
file = File.open(FILENAME)
begin
  # Get the headers from the first line
  headers = CSV.parse_line(file.gets)
  # Read forward until we find a matching line
  match = "2,"
  while line = file.gets
    break if line.start_with?(match)
  end
  raise "no line starts with #{match}" if line.nil?
  # Rewind the cursor to the beginning of the matching line
  # (bytesize rather than size, because seek counts bytes)
  file.seek(-line.bytesize, IO::SEEK_CUR)
  csv = CSV.new(file, headers: headers)
  # ...do whatever you want...
ensure
  # Don't forget to close the file
  file.close
end
The result of the above is that csv will be a CSV object whose first row is the row that starts with 2,.
I benchmarked this with an 8MB (170k rows) CSV file (from Lahman's Baseball Database) and found that it was much, much faster than using CSV.foreach alone. For a record in the middle of the file it was about 110x faster, and for a record toward the end about 66x faster. If you want, you can take a look at the benchmark here: https://gist.github.com/jrunning/229f8c2348fee4ba1d88d0dffa58edb7
Obviously 8MB is nothing like 10GB, so regardless this is going to take you a long time. But I'm pretty sure this will be quite a bit faster for you while also accomplishing your goal of not reading all of the data into memory at once.
CSV.foreach will do everything you need. It streams, so it works well with big files.
# Ruby does not expand "~" on its own, hence File.expand_path
CSV.foreach(File.expand_path('~/path/10gb_file.csv')) do |line|
  # Only one line will be read into memory at a time.
  line
end
The fastest way to skip data that we’re not interested in is to use seek to jump past a portion of the file.
File.open("/path/10gb_file.csv") do |f|
  f.seek(107) # skip 107 bytes, e.g. one line (constant time)
  f.read(50) # read the first 50 bytes of the second line
end
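Since seek works on byte offsets, one way to combine it with CSV parsing is to record the offset of the first interesting line once (with pos) and jump straight back to it on later runs. The following is a sketch under the assumption that no field contains an embedded newline; first_offset_for and rows_from are hypothetical helper names.

```ruby
require "csv"

# Returns the byte offset of the first data line starting with `prefix`,
# or nil if no line matches. Hypothetical helper for illustration.
def first_offset_for(path, prefix)
  File.open(path, "r") do |f|
    f.gets # skip the header line
    loop do
      offset = f.pos # offset of the line about to be read
      line = f.gets
      return nil if line.nil?
      return offset if line.start_with?(prefix)
    end
  end
end

# Reads the header line, jumps to `offset`, and parses up to `limit` rows.
def rows_from(path, offset, limit)
  File.open(path, "r") do |f|
    headers = CSV.parse_line(f.gets)
    f.seek(offset) # constant-time jump; the skipped bytes are not read
    CSV.new(f, headers: headers).first(limit).map(&:to_h)
  end
end
```

Finding the offset still scans the file once, but it could be stored and reused as long as the file does not change.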

Deleting contents of file after a specific line in ruby

Probably a simple question, but I need to delete the contents of a file after a specific line number. I want to keep the first 5 lines, for example, and delete the rest of the contents of the file. I have been searching for a while and can't find a way to do this. I am an iOS developer, so Ruby is not a language I am very familiar with.
That is called truncate. The truncate method needs the byte position after which everything gets cut off, and the File#pos method delivers just that:
File.open("test.csv", "r+") do |f|
  f.each_line.take(5)
  f.truncate( f.pos )
end
The "r+" mode from File.open is read and write, without truncating existing files to zero size, like "w+" would.
The block form of File.open ensures that the file is closed when the block ends.
I'm not aware of any methods to delete from a file so my first thought was to read the file and then write back to it. Something like this:
path = '/path/to/thefile'
start_line = 0
end_line = 4
File.write(path, File.readlines(path)[start_line..end_line].join)
File.readlines reads the file and returns an array of strings, where each element is one line of the file. You can then use the subscript operator with a range for the lines you want.
This isn't going to be very memory efficient for large files, so you may want to optimise if that's something you'll be doing.

How to import a column of a CSV file into a Ruby array?

My goal is to import one column of a CSV file into a Ruby array. This is for a self-contained Ruby script, not an application. I'll just be running the script in Terminal and getting an output.
I'm having trouble finding the best way to import the file and finding the best way to dynamically insert the name of the file into that line of code. The filename will be different each time, and will be passed in by the user. I'm using $stdin.gets.chomp to ask the user for the filename, and setting it equal to file_name.
Can someone help me with this? Here's what I have for this part of the script:
require 'csv'
zip_array = CSV.read("path/to/file_name.csv")
and I need to be able to insert the proper file path above. Is this correct? And how do I get that path name in there? Maybe I'll need to totally re-structure my script, but any suggestions on how to do this?
There are two questions here, I think. The first is about getting user input from the command line. The usual way to do this is with ARGV. In your program you could do file_name = ARGV[0] so a user could type ruby your_program.rb path/to/file_name.csv on the command line.
The next is about reading CSVs. Using CSV.read will take the whole CSV, not just a single column. If you want to choose one column of many, you are likely better off doing:
zip_array = []
CSV.foreach(file_name) { |row| zip_array << row[whichever_column] }
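Combining both parts, a small script could look something like this. It is a sketch: column_values and column_index are hypothetical names, and the column index has to be chosen to match your file.

```ruby
require "csv"

# Collects a single column of a CSV file, reading one row at a time.
# `column_values` and `column_index` are hypothetical names.
def column_values(path, column_index)
  values = []
  CSV.foreach(path) { |row| values << row[column_index] }
  values
end

# In a script you would take the path from the command line, e.g.
#   zip_array = column_values(ARGV[0], 0)
# and run it as: ruby your_program.rb path/to/file_name.csv
```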
Okay, first problem:
a) The file name will be different on each run (I'm supposing it will always be a CSV file, right?)
You can solve this problem by creating a folder, say input_data, alongside your Ruby script. Then do:
Dir.glob('input_data/*.csv')
This will produce an array of ALL files inside that folder that end with .csv. If we assume there will be only 1 file at a time in that folder (with a different name), we can do:
file_name = Dir.glob('input_data/*.csv')[0]
This way you'll dynamically get the file path, no matter what the file is named. If the csv file is inside the same directory as your Ruby script, you can just do:
Dir.glob('*.csv')[0]
Now, for importing only 1 column into a Ruby array (let's suppose it's the first column):
require 'csv'
array = []
CSV.foreach(file_name) do |csv_row|
  array << csv_row[0] # [0] for the first column, [1] for the second etc.
end
What if your CSV file has headers? Suppose your column name is 'Total'. You can do:
require 'csv'
array = []
CSV.foreach(file_name, headers: true) do |csv_row|
  array << csv_row['Total']
end
Now it doesn't matter if your column is the 1st column, the 3rd etc, as long as it has a header named 'Total', Ruby will find it.
CSV.foreach reads your file line-by-line and is good for big files. CSV.read will read it at once but using it you can make your code more concise:
array = CSV.read(file_name, headers: true).map do |csv_row|
  csv_row['Total']
end
Hope this helped.
First, you need to assign the returned value from $stdin.gets.chomp to a variable:
foo = $stdin.gets.chomp
Which will assign the entered input to foo.
You don't need to use $stdin though, as gets will use the standard input channel by default:
foo = gets.chomp
At that point use the variable as your read parameter:
zip_array = CSV.read(foo)
That's all basic coding and covered in any intro book for a language.
