Read a file in chunks in Ruby

I need to read a file in MB chunks. Is there a cleaner way to do this in Ruby:
FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024
size = File.size(FILENAME)
open(FILENAME, "rb") do |io|
  read = 0
  while read < size
    left = (size - read)
    cur = left < MEGABYTE ? left : MEGABYTE
    data = io.read(cur)
    read += data.size
    puts "READ #{cur} bytes" # yield data
  end
end

Adapted from the Ruby Cookbook, page 204:
FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024

class File
  def each_chunk(chunk_size = MEGABYTE)
    yield read(chunk_size) until eof?
  end
end

open(FILENAME, "rb") do |f|
  f.each_chunk { |chunk| puts chunk }
end
Disclaimer: I'm a ruby newbie and haven't tested this.

Alternatively, if you don't want to monkeypatch File:
until my_file.eof?
  do_something_with( my_file.read( bytes ) )
end
For example, streaming an uploaded tempfile into a new file:
# tempfile is a File instance
File.open( new_file, 'wb' ) do |f|
  # Read in small 65k chunks to limit memory usage
  f.write(tempfile.read(2**16)) until tempfile.eof?
end

You can use IO#each(sep, limit) with sep set to nil, so the limit caps each read and you get fixed-size chunks instead of lines, for example:
chunk_size = 1024
File.open('/path/to/file.txt') do |file|
  file.each(nil, chunk_size) do |chunk|
    puts chunk
  end
end

If you check out the Ruby docs:
http://ruby-doc.org/core-2.2.2/IO.html
there's a line that goes like this:
IO.foreach("testfile") {|x| print "GOT ", x }
The only caveat: since this process can read the temp file faster than the stream generating it, IMO a small delay should be thrown in.
IO.foreach("/tmp/streamfile") {|line|
  ParseLine.parse(line)
  sleep 0.3 # pause, as this process will discontinue if it doesn't allow some buffering
}

https://ruby-doc.org/core-3.0.2/IO.html#method-i-read gives an example of iterating over fixed-length records with read(length):
# iterate over fixed length records
open("fixed-record-file") do |f|
  while record = f.read(256)
    # ...
  end
end
If length is a positive integer, read tries to read length bytes without any conversion (binary mode). It returns nil if an EOF is encountered before anything can be read. Fewer than length bytes are returned if an EOF is encountered during the read. In the case of an integer length, the resulting string is always in ASCII-8BIT encoding.
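A minimal sketch of those semantics (hypothetical example.bin; the final chunk may be short, and a read after EOF returns nil):
File.open("example.bin", "rb") do |f|
  while (record = f.read(256))
    # record is ASCII-8BIT; the last record may be shorter than 256 bytes
    puts "read #{record.bytesize} bytes (#{record.encoding})"
  end
  f.read(256) #=> nil once EOF has been reached
end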

FILENAME="d:/tmp/file.bin"
class File
MEGABYTE = 1024*1024
def each_chunk(chunk_size=MEGABYTE)
yield self.read(chunk_size) until self.eof?
end
end
open(FILENAME, "rb") do |f|
f.each_chunk {|chunk| puts chunk }
end
It works, mbarkhau. I just moved the constant definition to the File class and added a couple of "self"s for clarity's sake.

Related

Ruby: how to read an mp4 file into chunks

I want to be able to read an mp4 file in chunks of 1mb.
I've tried opening the file with the following APIs:
video_file = File.open(@video_filename, 'rb')
video_file = IO.binread(@video_filename)
The problem is, with IO.binread, video_file is a String afterwards, and I cannot use read to get chunks of the file.
chunk = video_file.read(4*1024*1024)
What is the right interface/tools to use in Ruby to open this file and read it N bytes at a time?
I suppose I would do (using a block so the handle is closed for us):
chnk_size = 4*1024*1024
File.open(fn, 'rb') do |f|
  until f.eof?
    chnk = f.read(chnk_size)
    # process the chnk
  end
end
Try something like this:
FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024

class File
  def each_chunk(chunk_size = MEGABYTE)
    yield read(chunk_size) until eof?
  end
end

open(FILENAME, "rb") do |f|
  f.each_chunk { |chunk| puts chunk }
end

Errno::EINVAL Invalid argument # io_fread

I'm attempting to download a ~2GB file and write it to a file locally, but I'm running into this issue.
Here's the applicable code:
File.open(local_file, "wb") do |tempfile|
  puts "Downloading the backup..."
  pbar = nil
  open(backup_url,
       :read_timeout => nil,
       :content_length_proc => lambda do |content_length|
         if content_length&.positive?
           pbar = ProgressBar.create(:total => content_length)
         end
       end,
       :progress_proc => ->(size) { pbar&.progress = size }) do |retrieved|
    begin
      tempfile.binmode
      tempfile << retrieved.read
      tempfile.close
    rescue Exception => e
      binding.pry
    end
  end
end
Read your file in chunks.
The line causing the issue is here:
tempfile << retrieved.read
This reads the entire contents into memory before writing it to the tempfile. If the content is small, this isn't a big deal, but if this content is quite large (how large depends on the system, configuration, OS and available resources), this can cause an Errno::EINVAL error, like Invalid argument # io_fread and Invalid argument # io_write.
To work around this, read the content in chunks and write each chunk to the tempfile. Something like this:
tempfile.write( retrieved.read( 1024 ) ) until retrieved.eof?
This will get chunks of 1024 bytes and write each chunk to the tempfile until retrieved reaches the end of the file (i.e. .eof?).
If retrieved.read doesn't take a size parameter, you may need to convert retrieved into a StringIO, like this:
retrieved_io = StringIO.new( retrieved )
tempfile.write( retrieved_io.read( 1024 ) ) until retrieved_io.eof?
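Putting it together, a sketch of the chunked download, assuming open-uri (URI.open is the Ruby >= 2.5 spelling; older code uses Kernel#open as in the question, and open-uri's yielded object supports read(length)):
require 'open-uri'

File.open(local_file, "wb") do |tempfile|
  tempfile.binmode
  URI.open(backup_url, :read_timeout => nil) do |retrieved|
    # stream the body in 64 KiB chunks instead of slurping ~2GB into memory
    while (chunk = retrieved.read(64 * 1024))
      tempfile.write(chunk)
    end
  end
end
IO.copy_stream(retrieved, tempfile) would perform the same loop in optimized code.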

'Failed to allocate memory' error with large array

I am trying to import a large text file (approximately 2 million rows of numbers, at 260MB) into an array, make edits to the array, and then write the results to a new text file, by writing:
file_data = File.readlines("massive_file.txt")
file_data = file_data.map!(&:strip)
file_data.each do |s|
  s.gsub!(/,.*\z/, "")
end
File.open("smaller_file.txt", 'w') do |f|
  f.write(file_data.map(&:strip).uniq.join("\n"))
end
However, I have received the error failed to allocate memory (NoMemoryError). How can I allocate more memory to complete the task? Or, ideally, is there another method I can use where I can avoid having to re-allocate memory?
You can read the file line by line:
require 'set'
require 'digest/md5'

file_data = File.new('massive_file.txt', 'r')
file_output = File.new('smaller_file.txt', 'w')
unique_lines_set = Set.new

while (line = file_data.gets)
  line.strip!
  line.gsub!(/,.*\z/, "")
  # Check if the line is unique
  line_hash = Digest::MD5.hexdigest(line)
  if not unique_lines_set.include? line_hash
    # It is unique, so add its hash to the set
    unique_lines_set.add(line_hash)
    # Write the line to the output file
    file_output.puts(line)
  end
end

file_data.close
file_output.close
You can try reading and writing one line at a time:
new_file = File.open('smaller_file.txt', 'w')
File.open('massive_file.txt', 'r') do |file|
  file.each_line do |line|
    new_file.puts line.strip.gsub(/,.*\z/, "")
  end
end
new_file.close
The only thing still pending is finding the duplicated lines.
Alternatively, you can read the file in chunks, which should be faster than reading it line by line:
FILENAME = "massive_file.txt"
MEGABYTE = 1024*1024

class File
  def each_chunk(chunk_size = MEGABYTE) # or n*MEGABYTE
    yield read(chunk_size) until eof?
  end
end

filedata = ""
open(FILENAME, "rb") do |f|
  f.each_chunk do |chunk|
    # $ anchors before each newline, so every line in the chunk is trimmed
    # (the original /,.*\z/ would only touch the chunk's last line)
    chunk.gsub!(/,.*$/, "")
    filedata += chunk
  end
end
ref: https://stackoverflow.com/a/1682400/3035830
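One caveat with the chunked version: a chunk boundary can split a line in half. A hedged sketch that carries the partial last line over into the next chunk, reusing FILENAME and each_chunk from above:
leftover = ""
filedata = ""
open(FILENAME, "rb") do |f|
  f.each_chunk do |chunk|
    chunk = leftover + chunk
    # hold back the trailing partial line until the next chunk arrives
    leftover = chunk.slice!(/[^\n]*\z/)
    chunk.gsub!(/,.*$/, "")
    filedata += chunk
  end
end
filedata += leftover.gsub(/,.*\z/, "") # flush the final unterminated line, if any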

Ruby | binread : failed to allocate memory (NoMemoryError)

Running the following code fails with binread: failed to allocate memory (NoMemoryError):
require 'fileutils'
require 'zlib'

Dir.foreach(FileUtils.pwd()) do |f|
  if f.end_with?('log')
    File.open(f) do |file|
      if File.size(f) > MAX_FILE_SIZE
        puts f
        puts file.ctime
        puts file.mtime
        # zipping the file
        orig = f
        Zlib::GzipWriter.open('arch_log.gz') do |gz|
          gz.mtime = File.mtime(orig)
          gz.orig_name = orig
          gz.write IO.binread(orig)
          puts "File has been archived"
        end
        # deleting the file
        begin
          File.delete(f)
          puts "File has been deleted"
        rescue Exception => e
          puts "File #{f} can not be deleted"
          puts "Error: #{e.message}"
          puts "======= Please remove file manually =========="
        end
      end
    end
  end
end
Also, the files are pretty heavy, more than 1GB. Any help would be appreciated.
If the files you are reading are > 1GB, you have to have at least that much memory free, because IO.binread is going to slurp the whole file in.
You'd be better off loading a known amount and looping over the input until it's completely read, reading and writing in chunks.
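For instance, a sketch of the archiving step done in chunks (same Zlib::GzipWriter API as the question; the 1 MB buffer size is an arbitrary choice):
require 'zlib'

Zlib::GzipWriter.open('arch_log.gz') do |gz|
  gz.mtime = File.mtime(orig)
  gz.orig_name = orig
  File.open(orig, 'rb') do |io|
    # write 1 MB at a time instead of IO.binread slurping the whole file
    while (chunk = io.read(1024 * 1024))
      gz.write(chunk)
    end
  end
end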
From the docs:
IO.binread(name, [length [, offset]]) -> string

Opens the file, optionally seeks to the given offset, then returns length bytes (defaulting to the rest of the file). binread ensures the file is closed before returning. The open mode would be "rb:ASCII-8BIT".

IO.binread("testfile")         #=> "This is line one\nThis is line two\nThis is line three\nAnd so on...\n"
IO.binread("testfile", 20)     #=> "This is line one\nThi"
IO.binread("testfile", 20, 10) #=> "ne one\nThis is line "

How can I copy the contents of one file to another using Ruby's file methods?

I want to copy the contents of one file to another using Ruby's file methods.
How can I do it using a simple Ruby program using file methods?
There is a very handy method for this, IO.copy_stream; see the output of ri IO.copy_stream.
Example usage:
File.open('src.txt', 'w') do |f|
  f.puts 'Some text'
end

IO.copy_stream('src.txt', 'dest.txt')
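IO.copy_stream also accepts an optional copy length and source offset, so partial copies need no manual loop; a small illustrative sketch:
# copy only the first megabyte of src.txt, starting at offset 0
IO.copy_stream('src.txt', 'partial.txt', 1024 * 1024, 0)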
For those that are interested, here's a variation of the IO.copy_stream, File#open + block answer(s) (written against ruby 2.2.x, 3 years too late).
require 'tempfile'

copy = Tempfile.new
File.open(file, 'rb') do |input_stream|
  File.open(copy, 'wb') do |output_stream|
    IO.copy_stream(input_stream, output_stream)
  end
end
As a precaution, I would recommend using a buffer unless you can guarantee the whole file always fits into memory:
File.open("source", "rb") do |input|
  File.open("target", "wb") do |output|
    while buff = input.read(4096)
      output.write(buff)
    end
  end
end
Here is my implementation:
class File
  def self.copy(source, target)
    File.open(source, 'rb') do |infile|
      File.open(target, 'wb') do |outfile2|
        while buffer = infile.read(4096)
          outfile2 << buffer
        end
      end
    end
  end
end
Usage:
File.copy sourcepath, targetpath
Here is a simple way of doing that using Ruby file operation methods:
source_file, destination_file = ARGV
script = $0

input = File.open(source_file)
data_to_copy = input.read() # gather the data using the read() method
puts "The source file is #{data_to_copy.length} bytes long"

output = File.open(destination_file, 'w')
output.write(data_to_copy) # write out the data using the write() method
puts "File has been copied"

output.close()
input.close()
You can also use File.exist? to check whether the file exists; it returns true if it does. (File.exists? is a deprecated alias that was removed in Ruby 3.2.)
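For example, guarding the script above before reading (source_file as assigned from ARGV there):
abort "No such file: #{source_file}" unless File.exist?(source_file)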
Here's a fast and concise way to do it.
# Open first file, read it, store it, then close it
input = File.open(ARGV[0]) {|f| f.read() }
# Open second file, write to it, then close it
output = File.open(ARGV[1], 'w') {|f| f.write(input) }
An example of running this would be:
$ ruby this_script.rb from_file.txt to_file.txt
This runs this_script.rb and takes in two arguments through the command line: the first one in our case is from_file.txt (the text being copied from) and the second is to_file.txt (the text being copied to).
You can also use File.binread and File.binwrite if you wish to hold onto the file contents for a bit. (Other answers use an instant copy_stream instead.)
If the contents are something other than plain text, such as images, basic File.read and File.write can corrupt the data (text mode may translate line endings and apply encodings), so use the binary variants:
require 'tempfile'

temp_image = Tempfile.new('image.jpg')
actual_img = IO.binread('image.jpg')
IO.binwrite(temp_image, actual_img)
Source: binread,
binwrite.
