How to efficiently split a file into arbitrary byte ranges in Ruby

I need to upload a big file to a third party service.
This third party service gives me a list of urls and byteranges:
requests = [
  {url: "https://.../part1", from: 0, to: 20_000_000},
  {url: "https://.../part2", from: 20_000_001, to: 40_000_000},
  {url: "https://.../part3", from: 40_000_001, to: 54_184_279}
]
I'm using the httpx gem to upload the data; its :body option can receive an IO or an Enumerable object.
I would like to split and upload the chunks efficiently, which is why I want to avoid writing chunks to disk and also avoid loading the entire file into memory. I suppose the best option would be some kind of "lazy Enumerable", but I don't know how to write the part function that would return this IO or Enumerable object.
file = File.open("bigFile", "rb")

results = requests.map do |request|
  Thread.start { HTTPX.post(request[:url], body: part(file, request[:from], request[:to])) }
end.map(&:value)

def part(file, from, to)
  # ???
end

The easiest way to generate an enumerator for each "byterange" would be to let the part function handle the opening of the file:
def part(filepath, from, to = nil, chunk_size = 4096, &block)
  return to_enum(__method__, filepath, from, to, chunk_size) unless block_given?
  size = File.size(filepath)
  to = size - 1 unless to and to >= from and to < size
  io = File.open(filepath, "rb")
  io.seek(from, IO::SEEK_SET)
  while io.pos <= to
    size = (io.pos + chunk_size <= to) ? chunk_size : 1 + to - io.pos
    chunk = io.read(size)
    yield chunk
  end
ensure
  io.close if io
end
Warning: double-check the chunk size calculation against your own ranges; to is treated here as an inclusive offset.
Note: You may want to improve this function to ensure that you always read a full physical HDD block (or a multiple of it), as it will greatly speed up the IO. You'll have a misalignment whenever from is not a multiple of the physical block size.
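To experiment with that alignment idea, here is a minimal sketch (the 4096-byte physical block size and the helper name are assumptions for illustration): shrink only the first read so that every later read of chunk_size bytes starts on a block boundary, and keep chunk_size itself a multiple of the block size.
# Sketch: assume 4096-byte physical blocks. Only the first read is shortened,
# so all subsequent reads of `chunk_size` (itself a multiple of BLOCK_SIZE)
# start block-aligned.
BLOCK_SIZE = 4096

def first_read_size(from, chunk_size)
  misalignment = from % BLOCK_SIZE
  misalignment.zero? ? chunk_size : BLOCK_SIZE - misalignment
end

first_read_size(20_000_001, 4096) #=> 767; the next read then starts at
                                  #   20_000_768, a multiple of 4096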
The part function now returns an Enumerator when called without a block:
part("bigFile", 0, 1300, 512)
#=> #<Enumerator: main:part("bigFile", 0, 1300, 512)>
And of course you can call it directly with a block:
part("bigFile", 0, 1300, 512) do |chunk|
  puts chunk.inspect
end
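Tying this back to the question, one way to wire the enumerator into httpx might look like the sketch below. Whether the :body option streams an Enumerable chunk-by-chunk is an assumption based on the question's description; note that each thread gets its own enumerator, which opens and closes its own file handle, so there is no seeking on a shared File object.
# Sketch, assuming `requests` is the list of url/from/to hashes from the question
# and that HTTPX accepts any Enumerable as :body.
responses = requests.map do |request|
  Thread.start do
    HTTPX.post(request[:url], body: part("bigFile", request[:from], request[:to]))
  end
end.map(&:value)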

IO.read("bigFile", 1000, 2000)
will read 1000 bytes, starting at offset 2000. Ruby starts counting at zero, so I think
IO.read("bigFile", 20_000_000, 0)          # followed by
IO.read("bigFile", 20_000_000, 20_000_000) # not 20_000_001
would be correct. Without bookkeeping:
f = File.open("bigFile")
partname = "part0"
until f.eof? do
  partname = partname.succ
  chunk = f.read(20_000_000)
  # do something with chunk and partname
end
f.close
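If loading each range entirely into memory is acceptable (roughly 20 MB per part in the question), the part helper could also be a one-liner built on IO.read. This is a sketch that trades the streaming behaviour of the enumerator approach for simplicity, treating to as an inclusive offset as in the question's ranges.
# Sketch: IO.read(path, length, offset) returns the requested range as one String.
def part(path, from, to)
  IO.read(path, to - from + 1, from)
end

part("bigFile", 0, 20_000_000).bytesize          #=> 20_000_001
part("bigFile", 20_000_001, 40_000_000).bytesize #=> 20_000_000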

Related

repeatedly read Ruby IO until X bytes have been read, Y seconds have elapsed, or EOF, whichever comes first

I want to forward logs from an IO pipe to an API. Ideally, there would be no more than e.g. 10 seconds of latency (so humans watching the log don't get impatient).
A naive way to accomplish this would be to use IO.each_byte and send each byte to the API as soon as it becomes available, but the overhead of processing a request per byte causes additional latency.
IO#each(limit) also gets close to what I want, but if the limit is 50 kB and after 10 seconds, only 20 kB has been read, I want to go ahead and send that 20 kB without waiting for more. How can I apply both a time and size limit simultaneously?
A naïve approach would be to use the IO#each_byte enumerator.
A contrived, untested example:
enum = io.each_byte
now = Time.now
res = while Time.now - now < 20 do
  begin
    send_byte enum.next
  rescue StopIteration
    # no more data
    break :closed
  end
end
puts "NO MORE DATA" if res == :closed
Here's what I ended up with. Simpler solutions still appreciated!
require 'stringio'

def read_chunks(io, byte_interval: 200 * 1024, time_interval: 5)
  buffer = last = nil
  reset = lambda do
    buffer = ''
    last = Time.now
  end
  reset.call
  mutex = Mutex.new
  cv = ConditionVariable.new
  [
    lambda do
      IO.select [io]
      mutex.synchronize do
        begin
          chunk = io.readpartial byte_interval
          buffer.concat chunk
        rescue EOFError
          raise StopIteration
        ensure
          cv.signal
        end
      end
    end,
    lambda do
      mutex.synchronize do
        until io.eof? || Time.now > (last + time_interval) || buffer.length > byte_interval
          cv.wait mutex, time_interval
        end
        unless buffer.empty?
          buffer_io = StringIO.new buffer
          yield buffer_io.read byte_interval until buffer_io.eof?
          reset.call
        end
        raise StopIteration if io.eof?
      end
    end,
  ].map do |function|
    Thread.new { loop { function.call } }
  end.each(&:join)
end
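For context, a hypothetical usage sketch; the command and send_to_api are placeholders, and io should be a pipe or socket, since the method relies on IO.select and readpartial:
IO.popen("./produce_logs.sh") do |io|   # placeholder command
  read_chunks(io, byte_interval: 200 * 1024, time_interval: 5) do |chunk|
    send_to_api(chunk)                  # placeholder for the real API call
  end
end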

How to check if Dir size changes during Watir wait.until

I have a method that waits for a chrome download to start, using Watir. However, I'd like to simplify and respecify this to the point where it simply checks if the directory size increases. I'm assuming this is going to require me to save the directory's size at the beginning of the block, and then wait for the Dir size to be equal to that number + 1.
def wait_for_download
  dl_dir = Dir["#{Dir.pwd}/downloads/*"].to_s
  Watir::Wait.until { !dl_dir.include?(".crdownload") }
end
These are just a couple of functions you can add to your initializers (or wherever fits).
def get_file_size_in_mb(path)
  File.size(path).to_f / (1024 * 1024)
end

def find_all_files_inside(folder_path)
  Dir.glob("#{folder_path}/**/*")
end

def calculate_size_of_folder_contents(folder_path)
  mb = 0.0
  find_all_files_inside(folder_path).each do |fn|
    mb += get_file_size_in_mb(fn)
  end # ^ could have used `inject` here
  mb
end

def wait_until_folder_size_changes(folder_path, seconds = 2)
  loop do
    size0 = calculate_size_of_folder_contents(folder_path)
    sleep seconds
    size1 = calculate_size_of_folder_contents(folder_path)
    break if (size1 - size0) > 0
  end
end
Haven't tested, but it seems functionally sound.
You could also easily monkey-patch this into Watir itself.
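Combining those helpers with the Watir::Wait approach from the question could look like this sketch (the folder path and method name are assumptions):
# Sketch: snapshot the folder size, then wait until it grows.
def wait_for_download_to_start(folder_path = "#{Dir.pwd}/downloads")
  initial_size = calculate_size_of_folder_contents(folder_path)
  Watir::Wait.until { calculate_size_of_folder_contents(folder_path) > initial_size }
end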

How can I process huge JSON files as streams in Ruby, without consuming all memory?

I'm having trouble processing a huge JSON file in Ruby. What I'm looking for is a way to process it entry-by-entry without keeping too much data in memory.
I thought the yajl-ruby gem would do the work, but it consumes all my memory. I've also looked at the Yajl::FFI and JSON::Stream gems, but there it is clearly stated:
For larger documents we can use an IO object to stream it into the
parser. We still need room for the parsed object, but the document
itself is never fully read into memory.
Here's what I've done with Yajl:
file_stream = File.open(file, "r")
json = Yajl::Parser.parse(file_stream)
json.each do |entry|
  entry.do_something
end
file_stream.close
The memory usage keeps getting higher until the process is killed.
I don't see why Yajl keeps processed entries in memory. Can I somehow free them, or did I just misunderstand the capabilities of the Yajl parser?
If it cannot be done using Yajl: is there a way to do this in Ruby via any library?
Problem
json = Yajl::Parser.parse(file_stream)
When you invoke Yajl::Parser like this, the entire stream is loaded into memory to create your data structure. Don't do that.
Solution
Yajl provides Parser#parse_chunk, Parser#on_parse_complete, and other related methods that enable you to trigger parsing events on a stream without requiring that the whole IO stream be parsed at once. The README contains an example of how to use chunking instead.
The example given in the README is:
Or lets say you didn't have access to the IO object that contained JSON data, but instead only had access to chunks of it at a time. No problem!
(Assume we're in an EventMachine::Connection instance)
def post_init
  @parser = Yajl::Parser.new(:symbolize_keys => true)
end

def object_parsed(obj)
  puts "Sometimes one pays most for the things one gets for nothing. - Albert Einstein"
  puts obj.inspect
end

def connection_completed
  # once a full JSON object has been parsed from the stream
  # object_parsed will be called, and passed the constructed object
  @parser.on_parse_complete = method(:object_parsed)
end

def receive_data(data)
  # continue passing chunks
  @parser << data
end
Or if you don't need to stream it, it'll just return the built object from the parse when it's done. NOTE: if there are going to be multiple JSON strings in the input, you must specify a block or callback as this is how yajl-ruby will hand you (the caller) each object as it's parsed off the input.
obj = Yajl::Parser.parse(str_or_io)
One way or another, you have to parse only a subset of your JSON data at a time. Otherwise, you are simply instantiating a giant Hash in memory, which is exactly the behavior you describe.
Without knowing what your data looks like and how your JSON objects are composed, it isn't possible to give a more detailed explanation than that; as a result, your mileage may vary. However, this should at least get you pointed in the right direction.
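Adapting the README's EventMachine example to a plain file is a small step. Here is a sketch using only the calls shown above (on_parse_complete and <<); file and do_something are placeholders from the question, and note that this only yields entry-by-entry if the input is a stream of concatenated top-level JSON objects rather than one giant array, per the NOTE quoted above.
require 'yajl'

# Sketch: feed the parser small chunks from a File instead of an EventMachine
# connection; each completed top-level JSON object is handed to the callback,
# so the whole document never has to sit in memory at once.
parser = Yajl::Parser.new(:symbolize_keys => true)
parser.on_parse_complete = lambda do |entry|
  entry.do_something # placeholder from the question
end

File.open(file, "r") do |io|
  parser << io.read(8192) until io.eof?
end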
Both @CodeGnome's and @A. Rager's answers helped me understand the solution.
I ended up creating the gem json-streamer that offers a generic approach and spares the need to manually define callbacks for every scenario.
Your solutions seem to be json-stream and yajl-ffi. There are examples for both that are pretty similar (they're from the same author):
def post_init
  @parser = Yajl::FFI::Parser.new
  @parser.start_document { puts "start document" }
  @parser.end_document   { puts "end document" }
  @parser.start_object   { puts "start object" }
  @parser.end_object     { puts "end object" }
  @parser.start_array    { puts "start array" }
  @parser.end_array      { puts "end array" }
  @parser.key   { |k| puts "key: #{k}" }
  @parser.value { |v| puts "value: #{v}" }
end

def receive_data(data)
  begin
    @parser << data
  rescue Yajl::FFI::ParserError => e
    close_connection
  end
end
There, he sets up the callbacks for possible data events that the stream parser can experience.
Given a json document that looks like:
{
  "1": {
    "name": "fred",
    "color": "red",
    "dead": true
  },
  "2": {
    "name": "tony",
    "color": "six",
    "dead": true
  },
  ...
  "n": {
    "name": "erik",
    "color": "black",
    "dead": false
  }
}
One could stream parse it with yajl-ffi something like this:
def parse_dudes file_io, chunk_size
  parser = Yajl::FFI::Parser.new
  object_nesting_level = 0
  current_row = {}
  current_key = nil

  parser.start_object { object_nesting_level += 1 }
  parser.end_object do
    if object_nesting_level.eql? 2
      yield current_row # here we yield the fully collected record to the passed block
      current_row = {}
    end
    object_nesting_level -= 1
  end

  parser.key do |k|
    if object_nesting_level.eql? 2
      current_key = k
    elsif object_nesting_level.eql? 1
      current_row["id"] = k
    end
  end
  parser.value { |v| current_row[current_key] = v }

  file_io.each(chunk_size) { |chunk| parser << chunk }
end
File.open('dudes.json') do |f|
  parse_dudes f, 1024 do |dude|
    pp dude
  end
end

Split a complex file into a hash

I am running a command line program, called Primer 3. It takes an input file and returns data to standard output. I am trying to write a Ruby script which will accept that input, and put the entries into a hash.
The results returned are below. I would like to split the data on the '=' sign, so that the hash would look something like this:
{:SEQUENCE_ID => "example", :SEQUENCE_TEMPLATE => "GTAGTCAGTAGACNAT..etc", :SEQUENCE_TARGET => "37,21" etc }
I would also like to lower case the keys, ie:
{:sequence_id => "example", :sequence_template => "GTAGTCAGTAGACNAT..etc", :sequence_target => "37,21" etc }
This is my current script:
#!/usr/bin/ruby
puts 'Primer 3 hash'
primer3 = {}
while line = gets do
  name, height = line.split(/\=/)
  primer3[name] = height.to_i
end
puts primer3
It is returning this:
Primer 3 hash
{"SEQUENCE_ID"=>0, "SEQUENCE_TEMPLATE"=>0, "SEQUENCE_TARGET"=>37, "PRIMER_TASK"=>0, "PRIMER_PICK_LEFT_PRIMER"=>1, "PRIMER_PICK_INTERNAL_OLIGO"=>1, "PRIMER_PICK_RIGHT_PRIMER"=>1, "PRIMER_OPT_SIZE"=>18, "PRIMER_MIN_SIZE"=>15, "PRIMER_MAX_SIZE"=>21, "PRIMER_MAX_NS_ACCEPTED"=>1, "PRIMER_PRODUCT_SIZE_RANGE"=>75, "P3_FILE_FLAG"=>1, "SEQUENCE_INTERNAL_EXCLUDED_REGION"=>37, "PRIMER_EXPLAIN_FLAG"=>1, "PRIMER_THERMODYNAMIC_PARAMETERS_PATH"=>0, "PRIMER_LEFT_EXPLAIN"=>0, "PRIMER_RIGHT_EXPLAIN"=>0, "PRIMER_INTERNAL_EXPLAIN"=>0, "PRIMER_PAIR_EXPLAIN"=>0, "PRIMER_LEFT_NUM_RETURNED"=>0, "PRIMER_RIGHT_NUM_RETURNED"=>0, "PRIMER_INTERNAL_NUM_RETURNED"=>0, "PRIMER_PAIR_NUM_RETURNED"=>0, ""=>0}
Data source
SEQUENCE_ID=example
SEQUENCE_TEMPLATE=GTAGTCAGTAGACNATGACNACTGACGATGCAGACNACACACACACACACAGCACACAGGTATTAGTGGGCCATTCGATCCCGACCCAAATCGATAGCTACGATGACG
SEQUENCE_TARGET=37,21
PRIMER_TASK=pick_detection_primers
PRIMER_PICK_LEFT_PRIMER=1
PRIMER_PICK_INTERNAL_OLIGO=1
PRIMER_PICK_RIGHT_PRIMER=1
PRIMER_OPT_SIZE=18
PRIMER_MIN_SIZE=15
PRIMER_MAX_SIZE=21
PRIMER_MAX_NS_ACCEPTED=1
PRIMER_PRODUCT_SIZE_RANGE=75-100
P3_FILE_FLAG=1
SEQUENCE_INTERNAL_EXCLUDED_REGION=37,21
PRIMER_EXPLAIN_FLAG=1
PRIMER_THERMODYNAMIC_PARAMETERS_PATH=/usr/local/Cellar/primer3/2.3.4/bin/primer3_config/
PRIMER_LEFT_EXPLAIN=considered 65, too many Ns 17, low tm 48, ok 0
PRIMER_RIGHT_EXPLAIN=considered 228, low tm 159, high tm 12, high hairpin stability 22, ok 35
PRIMER_INTERNAL_EXPLAIN=considered 0, ok 0
PRIMER_PAIR_EXPLAIN=considered 0, ok 0
PRIMER_LEFT_NUM_RETURNED=0
PRIMER_RIGHT_NUM_RETURNED=0
PRIMER_INTERNAL_NUM_RETURNED=0
PRIMER_PAIR_NUM_RETURNED=0
=
$ primer3_core < example2 | ruby /Users/sean/Dropbox/bin/rb/read_primer3.rb
#!/usr/bin/ruby
puts 'Primer 3 hash'
primer3 = {}
while line = gets do
  key, value = line.split(/=/, 2)
  primer3[key.downcase.to_sym] = value.chomp
end
puts primer3
For fun, here are a couple of purely-functional solutions. Both assume that you've already pulled your data from the file, e.g.
my_data = ARGF.read # read the file passed on the command line
This one feels sort of gross, but it is a (long) one-liner :)
hash = Hash[ my_data.lines.map{ |line|
  line.chomp.split('=',2).map.with_index{ |s,i| i==0 ? s.downcase.to_sym : s }
} ]
This one is two lines, but feels cleaner than using with_index:
keys,values = my_data.lines.map{ |line| line.chomp.split('=',2) }.transpose
hash = Hash[ keys.map(&:downcase).map(&:to_sym).zip(values) ]
Both of these are likely less efficient and certainly more memory-intensive than your already-accepted answer; iterating the lines and slowly mutating your hash is the best way to go. These non-mutating variations are just a mental exercise.
Your final answer should use ARGF to allow filenames on the command line or via STDIN. I would write it like so:
#!/usr/bin/ruby
module Primer3
  def self.parse( file )
    {}.tap do |primer3|
      # Process one line at a time, without reading it all into memory first
      file.each_line do |line|
        key, value = line.chomp.split('=', 2)
        primer3[key.downcase.to_sym] = value
      end
    end
  end
end
Primer3.parse( ARGF ) if __FILE__==$0
This way you can either call the file from the command line, with or without STDIN, or you can require this file and use the module function it defines in other code.
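For example, a hypothetical caller (the file name is invented; the expected values come from the data source above):
require_relative "read_primer3"

primer3 = File.open("primer3_output.txt") { |f| Primer3.parse(f) }
puts primer3[:sequence_id]     # prints "example"
puts primer3[:sequence_target] # prints "37,21"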
OK, I have it (almost). The only problem is that it adds a \n at the end of each value.
puts 'Primer 3 hash'
primer3 = {}
while line = gets do
  key, value = line.split(/\=/)
  puts key
  puts value
  primer3[key.downcase] = value
end
puts primer3
{"sequence_id"=>"example\n", "sequence_template"=>"GTAGTCAGTAGACNATGACNACTGACGATGCAGACNACACACACACACACAGCACACAGGTATTAGTGGGCCATTCGATCCCGACCCAAATCGATAGCTACGATGACG\n", "sequence_target"=>"37,21\n", "primer_task"=>"pick_detection_primers\n", "primer_pick_left_primer"=>"1\n", "primer_pick_internal_oligo"=>"1\n", "primer_pick_right_primer"=>"1\n", "primer_opt_size"=>"18\n", "primer_min_size"=>"15\n", "primer_max_size"=>"21\n", "primer_max_ns_accepted"=>"1\n", "primer_product_size_range"=>"75-100\n", "p3_file_flag"=>"1\n", "sequence_internal_excluded_region"=>"37,21\n", "primer_explain_flag"=>"1\n", "primer_thermodynamic_parameters_path"=>"/usr/local/Cellar/primer3/2.3.4/bin/primer3_config/\n", "primer_left_explain"=>"considered 65, too many Ns 17, low tm 48, ok 0\n", "primer_right_explain"=>"considered 228, low tm 159, high tm 12, high hairpin stability 22, ok 35\n", "primer_internal_explain"=>"considered 0, ok 0\n", "primer_pair_explain"=>"considered 0, ok 0\n", "primer_left_num_returned"=>"0\n", "primer_right_num_returned"=>"0\n", "primer_internal_num_returned"=>"0\n", "primer_pair_num_returned"=>"0\n", ""=>"\n"}
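The trailing \n comes from gets; chomping each line before splitting, as the accepted answer above does, removes it:
while line = gets do
  key, value = line.chomp.split(/\=/, 2)
  primer3[key.downcase] = value
end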

Read a file in chunks in Ruby

I need to read a file in MB-sized chunks. Is there a cleaner way to do this in Ruby than the following?
FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024
size = File.size(FILENAME)
open(FILENAME, "rb") do |io|
  read = 0
  while read < size
    left = (size - read)
    cur = left < MEGABYTE ? left : MEGABYTE
    data = io.read(cur)
    read += data.size
    puts "READ #{cur} bytes" # yield data
  end
end
Adapted from the Ruby Cookbook page 204:
FILENAME = "d:\\tmp\\file.bin"
MEGABYTE = 1024 * 1024

class File
  def each_chunk(chunk_size = MEGABYTE)
    yield read(chunk_size) until eof?
  end
end

open(FILENAME, "rb") do |f|
  f.each_chunk { |chunk| puts chunk }
end
Disclaimer: I'm a ruby newbie and haven't tested this.
Alternatively, if you don't want to monkeypatch File:
until my_file.eof?
  do_something_with( my_file.read( bytes ) )
end
For example, streaming an uploaded tempfile into a new file:
# tempfile is a File instance
File.open( new_file, 'wb' ) do |f|
  # Read in small 65k chunks to limit memory usage
  f.write(tempfile.read(2**16)) until tempfile.eof?
end
You can use IO#each(sep, limit) with sep set to nil, for example:
chunk_size = 1024
File.open('/path/to/file.txt').each(nil, chunk_size) do |chunk|
  puts chunk
end
If you check out the ruby docs:
http://ruby-doc.org/core-2.2.2/IO.html
there's a line that goes like this:
IO.foreach("testfile") {|x| print "GOT ", x }
The only caveat: since this process can read the temp file faster than the stream that generates it, IMO a small delay should be thrown in.
IO.foreach("/tmp/streamfile") { |line|
  ParseLine.parse(line)
  sleep 0.3 # pause, as this process will discontinue if it doesn't allow some buffering
}
https://ruby-doc.org/core-3.0.2/IO.html#method-i-read gives an example of iterating over fixed length records with read(length):
# iterate over fixed length records
open("fixed-record-file") do |f|
  while record = f.read(256)
    # ...
  end
end
If length is a positive integer, read tries to read length bytes without any conversion (binary mode). It returns nil if an EOF is encountered before anything can be read. Fewer than length bytes are returned if an EOF is encountered during the read. In the case of an integer length, the resulting string is always in ASCII-8BIT encoding.
FILENAME = "d:/tmp/file.bin"

class File
  MEGABYTE = 1024 * 1024

  def each_chunk(chunk_size = MEGABYTE)
    yield self.read(chunk_size) until self.eof?
  end
end

open(FILENAME, "rb") do |f|
  f.each_chunk { |chunk| puts chunk }
end
It works, mbarkhau. I just moved the constant definition to the File class and added a couple of "self"s for clarity's sake.
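If you would rather not reopen the File class at all, the same chunking can be wrapped in a plain method that returns an Enumerator when called without a block; this is a sketch, not part of the answers above:
# Sketch: yield MB-sized chunks lazily without monkey-patching File.
def each_chunk(io, chunk_size = 1024 * 1024)
  return to_enum(__method__, io, chunk_size) unless block_given?
  yield io.read(chunk_size) until io.eof?
end

File.open("d:/tmp/file.bin", "rb") do |f|
  each_chunk(f) { |chunk| puts chunk.bytesize }
end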
