Ruby File.read vs. File.gets

If I want to append the contents of a src file into the end of a dest file in Ruby, is it better to use:
while line = src.gets do
or
while buffer = src.read( 1024 )
I have seen both used and was wondering when I should use each method, and why?

One is for reading "lines", one is for reading n bytes.
While byte buffering might be faster, a lot of that may disappear into the OS which likely does buffering anyway. IMO it has more to do with the context of the read--do you want lines, or are you just shuffling chunks of data around?
That said, a performance test in your specific environment may be helpful when deciding.
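If you do want to measure it, a quick sketch using the Benchmark module might look like this (the file name is a placeholder; adjust to your environment):

require "benchmark"

Benchmark.bm(10) do |x|
  x.report("gets") do
    File.open("src.txt") { |f| while f.gets; end }
  end
  x.report("read 1024") do
    File.open("src.txt") { |f| while f.read(1024); end }
  end
end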

You have a number of options when reading a file that are tailored to different situations.
Read in the file line-by-line, but only store one line at a time:
while (line = file.gets) do
# ...
end
Read in all lines of a file at once:
file.readlines.each do |line|
# ...
end
Read the file in as a series of blocks:
while (data = file.read(block_size))
# ...
end
Read in the whole file at once:
data = file.read
It really depends on what kind of data you're working with. Generally read is better suited towards binary files, or those where you want it as one big string. gets and readlines are similar, but readlines is more convenient if you're confident the file will fit in memory. Don't do this on multi-gigabyte log files or you'll be in for a world of hurt as your system starts swapping. Use gets for situations like that.
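For the appending scenario in the question, the block-read pattern might look like the following sketch; the file names are placeholders and the 1024-byte block size is arbitrary:

File.open("src.txt", "rb") do |src|
  File.open("dest.txt", "ab") do |dest|
    # read src in fixed-size chunks and append each chunk to dest
    while (buffer = src.read(1024))
      dest.write(buffer)
    end
  end
end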

gets will read until the end of the line based on a separator
read will read n bytes at a time
It all depends on what you are trying to read.
It may be more efficient to use read if your src file has unpredictable line lengths.
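If all you need is to copy the bytes across, the standard library can also do the chunking for you; a minimal sketch (same placeholder file names as above):

# IO.copy_stream buffers internally, so neither file is loaded whole
File.open("dest.txt", "ab") do |dest|
  IO.copy_stream("src.txt", dest)
end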

Related

How to read a large text file line-by-line and append this stream to a file line-by-line in Ruby?

Let's say I want to combine several massive files into one and then uniq! the one (THAT alone might take a hot second)
It's my understanding that File.readlines() loads ALL the lines into memory. Is there a way to read it line by line, sort of like how node.js pipe() system works?
One of the great things about Ruby is that you can do file IO in a block:
File.open("test.txt", "r").each_line do |row|
puts row
end # file closed here
so things get cleaned up automatically. Maybe it doesn't matter on a little script but it's always nice to know you can get it for free.
This way you aren't operating on the entire file contents at once, and you don't need to keep every line around after you've processed it if you use readline:
file = File.open("sample.txt", "r")
while !file.eof?
  line = file.readline
  puts line
end
file.close
Large files are best read with streaming methods such as each_line, as shown in the other answer, or with foreach, which opens the file and reads it line by line. If the process doesn't need the whole file in memory at once, use the streaming methods: the memory required stays roughly constant as the file grows, unlike with non-streaming methods such as readlines.
File.foreach("name.txt") { |line| puts line }
uniq! is defined on Array, so you'll have to read the files into an Array anyway. You cannot process the file line-by-line because you don't want to process a file, you want to process an Array, and an Array is a strict in-memory data structure.
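That said, if it is acceptable to hold only the unique lines in memory (rather than every line of every input file), a Set-based sketch can stream the inputs; the file names here are placeholders:

require "set"

seen = Set.new
File.open("combined.txt", "w") do |out|
  ["a.log", "b.log"].each do |path|
    File.foreach(path) do |line|
      # Set#add? returns nil when the line was already seen
      out.write(line) if seen.add?(line)
    end
  end
end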

Deleting contents of file after a specific line in ruby

Probably a simple question, but I need to delete the contents of a file after a specific line number. So I want to keep the first, e.g., 5 lines and delete the rest of the contents of the file. I have been searching for a while and can't find a way to do this; I am an iOS developer, so Ruby is not a language I am very familiar with.
That is called truncate. The truncate method needs the byte position after which everything gets cut off - and the File.pos method delivers just that:
File.open("test.csv", "r+") do |f|
f.each_line.take(5)
f.truncate( f.pos )
end
The "r+" mode from File.open is read and write, without truncating existing files to zero size, like "w+" would.
The block form of File.open ensures that the file is closed when the block ends.
I'm not aware of any methods to delete from a file so my first thought was to read the file and then write back to it. Something like this:
path = '/path/to/thefile'
start_line = 0
end_line = 4
File.write(path, File.readlines(path)[start_line..end_line].join)
File.readlines reads the file and returns an array of strings, where each element is one line of the file. You can then use the subscript operator with a range for the lines you want.
This isn't going to be very memory efficient for large files, so you may want to optimise if that's something you'll be doing.
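If memory becomes a concern, one possible variant (a sketch; the temporary file name is made up) streams just the lines you want to keep into a new file and then swaps it into place:

require "fileutils"

path = "/path/to/thefile"
keep = 5

File.open("#{path}.tmp", "w") do |out|
  File.foreach(path).with_index do |line, i|
    break if i >= keep
    out.write(line)
  end
end
FileUtils.mv("#{path}.tmp", path)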

Easier way to search through large files in Ruby?

I'm writing a simple log sniffer that will search logs for specific errors that are indicative of issues with the software I support. It allows the user to specify the path to the log and specify how many days back they'd like to search.
If users have log roll over turned off, the log files can sometimes get quite large. Currently I'm doing the following (though not done with it yet):
File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|
    if line.match(/\d+-\d+-\d+/)
      etc...
The line.match obviously looks for the date format we use in the logs, and the rest of the logic will be below. However, is there a better way to search through the file without .each_line? If not, I'm totally fine with that. I just wanted to make sure I'm using the best resources available to me.
Thanks
fgrep as a standalone tool, or called via system('fgrep ...'), may be a faster solution.
file.readlines might be better in speed, but it's a time-space tradeoff
look at this little research - the last approaches seem to be rather fast.
Here are some coding hints...
Instead of:
File.open(@log_file, "r") do |file_handle|
  file_handle.each do |line|
use:
File.foreach(@log_file) do |line|
  next unless line[/\A\d+-\d+-\d+/]
foreach simplifies opening and looping over the file.
next unless... makes a tight loop skipping every line that does NOT start with your target string. The less you do before figuring out whether you have a good line, the faster your code will run.
Using an anchor at the start of your pattern, like \A, gives the regex engine a major hint about where to look in the line, and allows it to bail out very quickly if the line doesn't match. Also, using line[/\A\d+-\d+-\d+/] is a bit more concise.
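Putting the two hints together into one runnable loop (assuming @log_file holds the path, as in the question, and the log dates look like 2013-05-04):

File.foreach(@log_file) do |line|
  # skip anything that doesn't start with a date
  next unless line[/\A\d+-\d+-\d+/]
  # the real matching/extraction logic goes here
  puts line
end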
If your log file is sorted by date, then you can avoid having to search through the entire file by doing a binary search. In this case you'd:
Open the file like you are doing
Use seek to fast-forward to the middle of the file.
Check if the date at the beginning of the line is earlier or later than the date you are looking for.
Continue splitting the file in halves until you find what you need.
I do, however, think your file needs to be very large for the above to make sense.
Edit
Here is some code which shows the basic idea. It finds a line containing the search date, not necessarily the first one. This can be fixed either with more binary searches, or by doing a linear search back from the last midpoint which did not contain the date. There also isn't a termination condition in case the date is not in the file. These small additions are left as an exercise for the reader :-)
require 'date'

def bin_fsearch(search_date, file)
  f = File.open file
  search = {min: 0, max: f.size}
  while true
    # go to file midpoint
    f.seek((search[:max] + search[:min]) / 2)
    # read in until EOL
    f.gets
    # record the actual mid-point we are using
    pos = f.pos
    # read in next line
    line = f.gets
    # get date from line
    line_date = Date.parse(line)
    if line_date < search_date
      search[:min] = f.pos
    elsif line_date > search_date
      search[:max] = pos
    else
      f.seek pos
      return
    end
  end
end
bin_fsearch(Date.new(2013, 5, 4), '/var/log/system.log')
Try this; it will read one line at a time, should be pretty fast, and takes less memory.
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
Another, faster option is to read the whole file into one array. It would be fast but will take a LOT of memory.
File.readlines(file).each do |line|
  # do stuff with each line
end
Further, if you need the fastest approach with the least amount of memory, try grep, which is specifically tuned for searching through large files, so it should be fast and memory-friendly:
`grep -e regex bigfile`.split(/\n/).each do |line|
  # ... (called on each matching line) ...
end
Faster than line-by-line is reading the file in chunks:
File.open('file.txt') do |f|
  # read the file in roughly 10 KB chunks instead of line by line
  while (buff = f.read(10240))
    # ...
  end
end
But since you are using a regexp to match dates, you might get incomplete lines at the chunk boundaries. You will have to deal with that in your logic.
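One common way to deal with that is to carry the trailing partial line over into the next chunk; a rough sketch (the file name and chunk size are placeholders):

File.open("file.txt") do |f|
  carry = ""
  while (chunk = f.read(10_240))
    chunk = carry + chunk
    lines = chunk.split("\n", -1)
    carry = lines.pop  # the last element may be an incomplete line
    lines.each do |line|
      # match your date regexp against complete lines only
    end
  end
  # carry still holds the final line if the file doesn't end with "\n"
end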
Also, if performance is that important, consider writing a really simple C extension.
If the log file can get huge, and that is your concern, then maybe you can consider saving the errors in a database. Then you will get a faster response.

dealing with large CSV files (20G) in ruby

I am working on a little problem and would like some advice on how to solve it:
Given a CSV file with an unknown number of columns and rows, output a list of columns with values and the number of times each value was repeated, without using any library.
If the file is small this shouldn't be a problem, but when it is a few gigs, I get NoMemoryError: failed to allocate memory. Is there a way to create a hash and read from the disk instead of loading the file into memory? You can do that in Perl with tied hashes.
EDIT: will IO#foreach load the file into memory? How about File.open(filename).each?
Read the file one line at a time, discarding each line as you go:
open("big.csv") do |csv|
csv.each_line do |line|
values = line.split(",")
# process the values
end
end
Using this method, you should never run out of memory.
Do you read the whole file at once? Reading it on a per-line basis, i.e. using ruby -pe, ruby -ne or $stdin.each should reduce the memory usage by garbage collecting lines which were processed.
data = {}
$stdin.each do |line|
# Process line, store results in the data hash.
end
Save it as script.rb and pipe the huge CSV file into this script's standard input:
ruby script.rb < data.csv
If you don't feel like reading from standard input, we'll need a small change:
data = {}
File.open("data.csv").each do |line|
# Process line, store results in the data hash.
end
For future reference: in such cases you want to use CSV.foreach('big_file.csv', headers: true) do |row| ... end.
This will read the file line by line from the IO object with a minimal memory footprint (it should stay below 1 MB regardless of file size).
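For the actual counting task in the question, a sketch built on CSV.foreach might look like this (the column names come from the file's header row; counts is a nested tally hash):

require "csv"

# counts[column][value] => how many rows had that value in that column
counts = Hash.new { |h, k| h[k] = Hash.new(0) }

CSV.foreach("big_file.csv", headers: true) do |row|
  row.each do |column, value|
    counts[column][value] += 1
  end
end

counts.each do |column, values|
  puts column
  values.each { |value, n| puts "  #{value}: #{n}" }
end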

Fastest way to skip lines while parsing files in Ruby?

I tried searching for this, but couldn't find much. It seems like something that's probably been asked before (many times?), so I apologize if that's the case.
I was wondering what the fastest way to parse certain parts of a file in Ruby would be. For example, suppose I know the information I want for a particular function is between lines 500 and 600 of, say, a 1000-line file. (Obviously this kind of question is geared toward much larger files; I'm just using those smaller numbers for the sake of example.) Since I know it won't be in the first half, is there a quick way of disregarding that information?
Currently I'm using something along the lines of:
while buffer = file_in.gets and file_in.lineno < 600
  next unless file_in.lineno > 500
  if buffer.chomp!.include? some_string
    do_func_whatever
  end
end
It works, but I just can't help but think it could work better.
I'm very new to Ruby and am interested in learning new ways of doing things in it.
file.lines.drop(500).take(100) # will get you lines 501-600
Generally, you can't avoid reading the file from the start up to the line you are interested in, as each line can be of a different length. The one thing you can avoid, though, is loading the whole file into a big array. Just read line by line, counting, and discard lines until you reach what you're looking for. Pretty much like your own example. You can just make it more Rubyish.
PS. the Tin Man's comment made me do some experimenting. While I didn't find any reason why drop would load the whole file, there is indeed a problem: drop returns the rest of the file in an array. Here's a way this could be avoided:
file.lines.select.with_index{|l,i| (501..600) === i}
PS2: Doh, the above code, while not making a huge array, iterates through the whole file, even the lines after 600. :( Here's a third version:
enum = file.lines
500.times{enum.next} # skip 500
enum.take(100) # take the next 100
or, if you prefer FP:
file.lines.tap{|enum| 500.times{enum.next}}.take(100)
Anyway, the good point of this monologue is that you can learn multiple ways to iterate a file. ;)
I don't know if there is an equivalent way of doing this for lines, but you can use seek or the offset argument on an IO object to "skip" bytes.
See IO#seek, or see IO#open for information on the offset argument.
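A sketch of the byte-skipping idea, which only pays off when every line has the same, known length (the width and file name here are made up):

line_width = 80  # hypothetical fixed record length, newline included

File.open("file_in.txt") do |f|
  f.seek(500 * line_width)  # jump straight past the first 500 lines
  100.times do
    line = f.gets
    break unless line
    # lines 501..600 end up here
  end
end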
Sounds like rio might be of help here. It provides you with a lines() method.
You can use IO#readlines, which returns an array with all the lines:
IO.readlines(file_in)[500..600].each do |line|
#line is each line in the file (including the last \n)
#stuff
end
or
f = File.new(file_in)
f.readlines[500..600].each do |line|
#line is each line in the file (including the last \n)
#stuff
end
