SHA256 value for large binaries when using Ruby - ruby

I want the SHA256 values for some large binary files. Is this possible using Ruby? Digest::SHA2 does not seem to be usable as expects strings and not files.

Is the file on disk? - I would just use a call to shell using another utility program and avoid reading it into the Ruby runtime memory at all, e.g. using Open3 + sha256sum
Is the file coming over IO (like the network)? - Use a streaming read on the IO object and compute the SHA in chunks
e.g. something like this:
sha = Digest::SHA2.new
File.open(file_path) do |f|
while chunk = f.read(256) # only load 256 bytes at a time
sha << chunk
end
end
sha.hexdigest # returns what you want

Related

How to read a large file into a string

I'm trying to save and load the states of Matrices (using Matrix) during the execution of my program with the functions dump and load from Marshal. I can serialize the matrix and get a ~275 KB file, but when I try to load it back as a string to deserialize it into an object, Ruby gives me only the beginning of it.
# when I want to save
mat_dump = Marshal.dump(#mat) # serialize object - OK
File.open('mat_save', 'w') {|f| f.write(mat_dump)} # write String to file - OK
# somewhere else in the code
mat_dump = File.read('mat_save') # read String from file - only reads like 5%
#mat = Marshal.load(mat_dump) # deserialize object - "ArgumentError: marshal data too short"
I tried to change the arguments for load but didn't find anything yet that doesn't cause an error.
How can I load the entire file into memory? If I could read the file chunk by chunk, then loop to store it in the String and then deserialize, it would work too. The file has basically one big line so I can't even say I'll read it line by line, the problem stays the same.
I saw some questions about the topic:
"Ruby serialize array and deserialize back"
"What's a reasonable way to read an entire text file as a single string?"
"How to read whole file in Ruby?"
but none of them seem to have the answers I'm looking for.
Marshal is a binary format, so you need to read and write in binary mode. The easiest way is to use IO.binread/write.
...
IO.binwrite('mat_save', mat_dump)
...
mat_dump = IO.binread('mat_save')
#mat = Marshal.load(mat_dump)
Remember that Marshaling is Ruby version dependent. It's only compatible under specific circumstances with other Ruby versions. So keep that in mind:
In normal use, marshaling can only load data written with the same major version number and an equal or lower minor version number.

Compressing large string in ruby

I have a web application(ruby on rails) that sends some YAML as the value of a hidden input field.
Now I want to reduce the size of the text that is sent across to the browser. What is the most efficient form of lossless compression that would send across minimal data? I'm ok to incur additional cost of compression and decompression at the server side.
You could use the zlib implementation in the ruby core to in/de-flate data:
require "zlib"
data = "some long yaml string" * 100
compressed_data = Zlib::Deflate.deflate(data)
#=> "x\x9C+\xCE\xCFMU\xC8\xC9\xCFKW\xA8L\xCC\xCDQ(.)\xCA\xCCK/\x1E\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15\x1C\x15D\x15\x04\x00\xB3G%\xA6"
You should base64-encode the compressed data to make it printable:
require 'base64'
encoded_data = Base64.encode64 compressed_data
#=> "eJwrzs9NVcjJz0tXqEzMzVEoLinKzEsvHhUcFRwVHBUcFRwVHBUcFUQVBACz\nRyWm\n"
Later, on the client-side, you might use pako (a zlib port to javascript) to get your data back. This answer probably helps you with implementing the JS part.
To give you an idea on how effective this is, here are the sizes of the example strings:
data.size # 2100
compressed_data.size # 48
encoded_data.size # 66
Same thing goes vice-versa when compressing on the client and inflating on the server.
Zlib::Inflate.inflate(Base64.decode64(encoded_data))
#=> "some long yaml stringsome long yaml str ... (shortened, as the string is long :)
Disclaimer:
The ruby zlib implementation should be compatible with the pako implementation. But I have not tried it.
The numbers about string sizes are a little cheated. Zlib is really effective here, because the string repeats a lot. Real life data usually does not repeat as much.
If you are working on a Rails application, you can also use the ActiveSupport::Gzip wrapper that allows compression/decompression of strings with gzip.
compressed_log = ActiveSupport::Gzip.compress('large string')
=> "\x1F\x8B\b\x00yq5c\x00\x03..."
original_log = ActiveSupport::Gzip.decompress(compressed_log)
=> "large string"
Behind the scenes, the compress method uses the Zlib::GzipWriter class which writes gzipped files. Similarly, the decompress method uses Zlib::GzipReader class which reads a gzipped file.

dealing with large CSV files (20G) in ruby

I am working on little problem and would have some advice on how to solve it:
Given a csv file with an unknown number of columns and rows, output a list of columns with values and the number of times each value was repeated. without using any library.
if the file is small this shouldn't be a problem, but when it is a few Gigs, i get NoMemoryError: failed to allocate memory. is there a way to create a hash and read from the disk instead of loading the file to Memory? you can do that in perl with tied Hashes
EDIT: will IO#foreach load the file into memory? how about File.open(filename).each?
Read the file one line at a time, discarding each line as you go:
open("big.csv") do |csv|
csv.each_line do |line|
values = line.split(",")
# process the values
end
end
Using this method, you should never run out of memory.
Do you read the whole file at once? Reading it on a per-line basis, i.e. using ruby -pe, ruby -ne or $stdin.each should reduce the memory usage by garbage collecting lines which were processed.
data = {}
$stdin.each do |line|
# Process line, store results in the data hash.
end
Save it as script.rb and pipe the huge CSV file into this script's standard input:
ruby script.rb < data.csv
If you don't feel like reading from the standard input we'll need a small change.
data = {}
File.open("data.csv").each do |line|
# Process line, store results in the data hash.
end
For future reference, in such cases you want to use CSV.foreach('big_file.csv', headers: true) do |row|
This will read the file line by line from the IO object with minimal memory footprint (should be below 1MB regardless of file size).

How do I wrap ruby IO with a sliding window filter

I'm using an opaque API in some ruby code which takes a File/IO as a parameter. I want to be able to pass it an IO object that only gives access to a given range of data in the real IO object.
For example, I have a 8GB file, and I want to give the api an IO object that has a 1GB range within the middle of my real file.
real_file = File.new('my-big-file')
offset = 1 * 2**30 # start 1 GB into it
length = 1 * 2**30 # end 1 GB after start
filter = IOFilter.new(real_file, offset, length)
# The api only sees the 1GB of data in the middle
opaque_api(filter)
The filter_io project looks like it would be the easiest to adapt to do this, but doesn't seem to support this use case directly.
I think you would have to write it yourself, as it seems like a rather specific thing: you would have to implement all (or, a subset that you need) of IO's methods using a chunk of the opened file as a data source. An example of the "speciality" would be writing to such stream - you would have to take care not to cross the boundary of the segment given, i.e. constantly keeping track of your current position in the big file. Doesn't seem like a trivial job, and I don't see any shortcuts that could help you there.
Perhaps you can find some OS-based solution, e.g. making a loopback device out of the part of the large file (see man losetup and particularly -o and --sizelimit options, for example).
Variant 2:
If you are ok with keeping the contents of the window in memory all the time, you may wrap StringIO like this (just a sketch, not tested):
def sliding_io filename, offset, length
File.open(filename, 'r+') do |f|
# read the window into a buffer
f.seek(offset)
buf = f.read(length)
# wrap a buffer into StringIO and pass it given block
StringIO.open(buf) do |buf_io|
yield(buf_io)
end
# write altered buffer back to the big file
f.seek(offset)
f.write(buf[0,length])
end
end
And use it as you would use block variant of IO#open.
I believe the IO object has the functionality you are looking for. I've used it before for MD5 hash summing similarly sized files.
incr_digest = Digest::MD5.new()
file = File.open(filename, 'rb') do |io|
while chunk = io.read(50000)
incr_digest << chunk
end
end
This was the block I used, where I was passing the chunk to the MD5 Digest object.
http://www.ruby-doc.org/core/classes/IO.html#M000918

StringScanner scanning IO instead of a string

I've got a parser written using ruby's standard StringScanner. It would be nice if I could use it on streaming files. Is there an equivalent to StringScanner that doesn't require me to load the whole string into memory?
You might have to rework your parser a bit, but you can feed lines from a file to a scanner like this:
File.open('filepath.txt', 'r') do |file|
scanner = StringScanner.new(file.readline)
until file.eof?
scanner.scan(/whatever/)
scanner << file.readline
end
end
StringScanner was intended for that, to load a big string and going back and forth with an internal pointer, if you make it a stream, then the references get lost, you can not use unscan, check_until, pre_match, post_match,
well you can, but for that you need to buffer all the previous input.
If you are concerned about the buffer size then just load by chunk of data, and use a simple regexp or a gem called Parser.
The simplest way is to read a fix size of data.
# iterate over fixed length records
open("fixed-record-file") do |f|
while record = f.read(1024)
# parse here the record using regexp or parser
end
end
[Updated]
Even with this loop you can use StringSanner, you just need to update the string with each new chunk of data:
string=(str)
Changes the string being scanned to str and resets the scanner.
Returns str
There is StringIO.
Sorry misread you question. Take a look at this seems to have streaming options

Resources