I'm attempting to download a ~2GB file and write it to a local file, but I'm running into an Errno::EINVAL error.
Here's the applicable code:
require 'open-uri'
require 'ruby-progressbar'
require 'pry'

File.open(local_file, "wb") do |tempfile|
  puts "Downloading the backup..."
  pbar = nil
  open(backup_url,
       :read_timeout => nil,
       :content_length_proc => lambda do |content_length|
         if content_length&.positive?
           pbar = ProgressBar.create(:total => content_length)
         end
       end,
       :progress_proc => ->(size) { pbar&.progress = size }) do |retrieved|
    begin
      tempfile.binmode
      tempfile << retrieved.read
      tempfile.close
    rescue Exception => e
      binding.pry
    end
  end
end
Read your file in chunks.
The line causing the issue is here:
tempfile << retrieved.read
This reads the entire contents into memory before writing it to the tempfile. If the content is small, that isn't a big deal, but if the content is quite large (how large depends on the system, configuration, OS, and available resources), this can cause an Errno::EINVAL error such as Invalid argument @ io_fread or Invalid argument @ io_write.
To work around this, read the content in chunks and write each chunk to the tempfile. Something like this:
tempfile.write( retrieved.read( 1024 ) ) until retrieved.eof?
This will get chunks of 1024 bytes and write each chunk to the tempfile until retrieved reaches the end of the file (i.e. .eof?).
If retrieved is a String rather than an IO (so its .read doesn't take a size parameter), wrap it in a StringIO first, like this:
retrievedIO = StringIO.new( retrieved )
tempfile.write( retrievedIO.read( 1024 ) ) until retrievedIO.eof?
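Putting the pieces together, a minimal sketch of the fix under the same assumptions as the question (open-uri plus the ruby-progressbar gem, with backup_url and local_file defined elsewhere; on Ruby 3.0+ you would call URI.open instead of open). The chunked writes avoid building the whole body as a single Ruby string:

require 'open-uri'
require 'ruby-progressbar'

File.open(local_file, 'wb') do |tempfile|
  pbar = nil
  open(backup_url,
       :read_timeout => nil,
       :content_length_proc => ->(length) { pbar = ProgressBar.create(:total => length) if length&.positive? },
       :progress_proc => ->(size) { pbar&.progress = size }) do |retrieved|
    # copy in 64 KiB chunks instead of slurping the whole response at once
    tempfile.write(retrieved.read(64 * 1024)) until retrieved.eof?
  end
end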
Related
I have a function which generates some data. Right now, this data is written to a file. Once the file is complete, I upload it via HTTParty:
require 'httparty'

url = "..."

def generate_data(file)
  file << "First line of data\n"
  sleep 1
  file << "Second line of data\n"
  sleep 1
  file << "Third line of data\n"
end

File.open('payload.txt', 'w+') do |file|
  generate_data(file)
  file.rewind
  HTTParty.post(url, body: {file: file})
end
As it happens, generate_data takes a while, so I would like to speed up the script and avoid writing to disk by interleaving the generation of the data with the upload. How could I do this using HTTParty?
I was looking for something like StringIO that could be used as a fixed-size FIFO buffer: the generate_data function writes to it (and blocks when the buffer is full) while the HTTParty.post call reads from it (and blocks when the buffer is empty). However, I failed to find anything like that.
You need to use streaming
HTTParty.put(
  'http://localhost:3000/train',
  body_stream: StringIO.new('foo')
)
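If you want to avoid the intermediate file entirely, one possible sketch is to use IO.pipe as the fixed-size FIFO buffer described in the question: a producer thread writes into the pipe (blocking when the OS pipe buffer is full) while HTTParty streams the read end as the request body. This assumes the endpoint accepts chunked transfer encoding; url and generate_data are taken from the question:

require 'httparty'

url = "..."

def generate_data(io)
  io << "First line of data\n"
  sleep 1
  io << "Second line of data\n"
  sleep 1
  io << "Third line of data\n"
end

reader, writer = IO.pipe

# Generate the data in a background thread; the pipe's kernel buffer
# acts as the FIFO: writes block when it is full, reads block when empty.
producer = Thread.new do
  begin
    generate_data(writer)
  ensure
    writer.close # signals EOF to the reading side
  end
end

HTTParty.post(
  url,
  body_stream: reader,
  headers: { 'Transfer-Encoding' => 'chunked' }
)
producer.join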
I want to download and process, line by line, a CSV file that is on an SFTP server.
If I use download! or sftp.file.open, it buffers the whole file in memory, which I want to avoid.
Here is my source code:
require 'net/sftp'
require 'csv'

sftp = Net::SFTP.start(@sftp_details['server_ip'], @sftp_details['server_username'], :password => decoded_pswd)
if sftp
  begin
    sftp.dir.foreach(@sftp_details['server_folder_path']) do |entry|
      print_memory_usage do
        print_time_spent do
          if entry.file? && entry.name.end_with?("csv")
            batch_size_cnt = 0
            sftp.file.open("#{@sftp_details['server_folder_path']}/#{entry.name}") do |file|
              header = file.gets
              header = header.force_encoding(header.encoding).encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
              csv_data = ''
              while line = file.gets
                batch_size_cnt += 1
                csv_data.concat(line.force_encoding(line.encoding).encode('UTF-8', invalid: :replace, undef: :replace, replace: ''))
                if batch_size_cnt == 1000 || file.eof?
                  CSV.parse(csv_data, {headers: header, write_headers: true}) do |row|
                    row.delete(nil)
                    entities << row.to_hash
                  end
                  csv_data, batch_size_cnt = '', 0
                  courses.delete_if(&:blank?)
                  # DO PROCESSING PART
                  entities = []
                end
              end if header
            end
            sftp.rename("#{@sftp_details['server_folder_path']}/#{entry.name}", "#{@sftp_details['processed_file_path']}/#{entry.name}")
          end
        end
      end
    end
  end
end
Can someone please help? Thanks
You need to add some kind of buffer to be able to read chunks and then write them all together. I think it would be wise to split parsing and downloading in your script. Focus on one thing at a time:
Your original line:
...
sftp.file.open("#{@sftp_details['server_folder_path']}/#{entry.name}") do |file|
...
If you check the source of the download! method (don't forget the bang!), you can see it accepts an IO such as a StringIO, a stub which you can easily adjust. The default read size of 32,000 bytes is usually sufficient; you can change it if you want (see the example).
Replace it with one of the following (works only with single files):
The StringIO usage:
...
io = StringIO.new
sftp.download!("#{@sftp_details['server_folder_path']}/#{entry.name}", io, :read_size => 16000)
OR you can just download the file to disk:
...
file = File.open("/your_local_path/#{entry.name}", 'wb')
sftp.download!("#{@sftp_details['server_folder_path']}/#{entry.name}", file, :read_size => 16000)
...
From the docs, you can use the :read_size option:
:read_size - the maximum number of bytes to read at a time from the
source. Increasing this value might improve throughput. It defaults to
32,000 bytes.
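As a concrete variation on the second option, one possible sketch stages each CSV on local disk and then streams it row by row with CSV.foreach, so neither the remote file nor the parsed rows are held in memory all at once. Here sftp, entry and @sftp_details are taken from the question's code, and the temp-file name and read size are arbitrary choices:

require 'net/sftp'
require 'csv'
require 'tempfile'

remote_path = "#{@sftp_details['server_folder_path']}/#{entry.name}"

Tempfile.create(['download', '.csv']) do |local|
  # download! streams the remote file into the IO in :read_size chunks,
  # so the whole file is never built up in memory
  sftp.download!(remote_path, local, :read_size => 16_000)
  local.flush

  # CSV.foreach reads the local copy one row at a time
  CSV.foreach(local.path, headers: true) do |row|
    # DO PROCESSING PART
  end
end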
I am trying to create a Tempfile and write some text into it, but I get this strange behavior in the console:
t = Tempfile.new("test_temp") # => #<File:/tmp/test_temp20130805-28300-1u5g9dv-0>
t << "Test data" # => #<File:/tmp/test_temp20130805-28300-1u5g9dv-0>
t.write("test data") # => 9
IO.read t.path # => ""
I also tried cat /tmp/test_temp20130805-28300-1u5g9dv-0 but the file is empty.
Am I missing anything? Or what's the proper way to write to Tempfile?
FYI I'm using ruby 1.8.7
You're going to want to close the temp file after writing to it. Just add a t.close to the end. I bet the file has buffered output.
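A minimal sketch of that close-then-read approach (Tempfile#close does not unlink the file by default, so the path stays readable afterwards):

require 'tempfile'

t = Tempfile.new('test_temp')
t << 'Test data'
t.close                  # flushes the buffered output to disk
IO.read(t.path)          # => "Test data"
t.unlink                 # remove the temp file when finished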
Try this: run t.rewind before the read.
require 'tempfile'
t = Tempfile.new("test_temp")
t << "Test data"
t.write("test data") # => 9
IO.read t.path # => ""
t.rewind
IO.read t.path # => "Test datatest data"
close or rewind will actually write the content out to the file. You may also want to delete it after use:
file = Tempfile.new('test_temp')
begin
  file.write <<~FILE
    Test data
    test data
  FILE
  file.close
  puts IO.read(file.path) #=> Test data\ntest data\n
ensure
  file.delete
end
It's worth mentioning that calling .rewind first is a must; otherwise any subsequent .read call on the same handle will just return an empty value.
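For completeness, a small sketch of reading back through the same handle, which is where the missing .rewind usually bites:

require 'tempfile'

t = Tempfile.new('test_temp')
t.write('Test data')

t.read     # => "" (the position is still at the end of what was just written)
t.rewind   # flushes the buffer and moves the position back to 0
t.read     # => "Test data"

t.close
t.unlink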
I'm running the following code:
require 'fileutils'
require 'zlib'

Dir.foreach(FileUtils.pwd()) do |f|
  if f.end_with?('log')
    File.open(f) do |file|
      if File.size(f) > MAX_FILE_SIZE
        puts f
        puts file.ctime
        puts file.mtime
        # zipping the file
        orig = f
        Zlib::GzipWriter.open('arch_log.gz') do |gz|
          gz.mtime = File.mtime(orig)
          gz.orig_name = orig
          gz.write IO.binread(orig)
          puts "File has been archived"
        end
        # deleting the file
        begin
          File.delete(f)
          puts "File has been deleted"
        rescue Exception => e
          puts "File #{f} can not be deleted"
          puts " Error #{e.message}"
          puts "======= Please remove file manually =========="
        end
      end
    end
  end
end
Also, the files are pretty heavy, more than 1 GB each. Any help would be appreciated.
If the files you are reading are > 1 GB, you have to have at least that much memory free, because IO.binread is going to slurp the whole file in.
You'd be better off loading a known amount and looping over the input until it's completely read, reading and writing in chunks (see the sketch after the docs excerpt below).
From the docs:
IO.binread(name, [length [, offset]] ) -> string
------------------------------------------------------------------------------
Opens the file, optionally seeks to the given offset, then returns
length bytes (defaulting to the rest of the file). binread ensures
the file is closed before returning. The open mode would be "rb:ASCII-8BIT".
IO.binread("testfile") #=> "This is line one\nThis is line two\nThis is line three\nAnd so on...\n"
IO.binread("testfile", 20) #=> "This is line one\nThi"
IO.binread("testfile", 20, 10) #=> "ne one\nThis is line "
I need to read a file in MB-sized chunks. Is there a cleaner way to do this in Ruby than the following?
FILENAME="d:\\tmp\\file.bin"
MEGABYTE = 1024*1024
size = File.size(FILENAME)
open(FILENAME, "rb") do |io|
read = 0
while read < size
left = (size - read)
cur = left < MEGABYTE ? left : MEGABYTE
data = io.read(cur)
read += data.size
puts "READ #{cur} bytes" #yield data
end
end
Adapted from the Ruby Cookbook page 204:
FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024
class File
def each_chunk(chunk_size = MEGABYTE)
yield read(chunk_size) until eof?
end
end
open(FILENAME, "rb") do |f|
f.each_chunk { |chunk| puts chunk }
end
Disclaimer: I'm a ruby newbie and haven't tested this.
Alternatively, if you don't want to monkeypatch File:
until my_file.eof?
  do_something_with( my_file.read( bytes ) )
end
For example, streaming an uploaded tempfile into a new file:
# tempfile is a File instance
File.open( new_file, 'wb' ) do |f|
  # Read in small 65k chunks to limit memory usage
  f.write(tempfile.read(2**16)) until tempfile.eof?
end
You can use IO#each(sep, limit) with sep set to nil, so that it yields fixed-size chunks instead of lines, for example:
chunk_size = 1024
File.open('/path/to/file.txt') do |file|
  file.each(nil, chunk_size) do |chunk|
    puts chunk
  end
end
If you check out the ruby docs:
http://ruby-doc.org/core-2.2.2/IO.html
there's a line that goes like this:
IO.foreach("testfile") {|x| print "GOT ", x }
The only caveat: since this process can read the temp file faster than the generating stream writes it, IMO a small delay should be thrown in.
IO.foreach("/tmp/streamfile") {|line|
ParseLine.parse(line)
sleep 0.3 #pause as this process will discontine if it doesn't allow some buffering
}
https://ruby-doc.org/core-3.0.2/IO.html#method-i-read gives an example of iterating over fixed length records with read(length):
# iterate over fixed length records
open("fixed-record-file") do |f|
  while record = f.read(256)
    # ...
  end
end
If length is a positive integer, read tries to read length bytes without any conversion (binary mode). It returns nil if an EOF is encountered before anything can be read. Fewer than length bytes are returned if an EOF is encountered during the read. In the case of an integer length, the resulting string is always in ASCII-8BIT encoding.
FILENAME="d:/tmp/file.bin"
class File
MEGABYTE = 1024*1024
def each_chunk(chunk_size=MEGABYTE)
yield self.read(chunk_size) until self.eof?
end
end
open(FILENAME, "rb") do |f|
f.each_chunk {|chunk| puts chunk }
end
It works, mbarkhau. I just moved the constant definition to the File class and added a couple of "self"s for clarity's sake.