I need to calculate MD5 digests of very large files (>1 TB) for deduplication purposes. For example, I have a 10 GiB file and want to simultaneously calculate the MD5 of the whole file and of each of the ten sequential 1 GiB chunks.
I cannot use the form:
Digest::MD5.hexdigest(IO.read("file_10GiB.txt"))
because Ruby first reads the whole file into memory before calculating the MD5, so I quickly run out of memory.
So instead I read it in 1 MiB chunks with the following code.
require "digest"
fd = File.open("file_10GiB.txt")
digest_1GiB = Digest::MD5.new
digest_10GiB = Digest::MD5.new
10.times do
1024.times do
data_1MiB = fd.read(2**20) # Each MiB is read only once so
digest_1GiB << data_1MiB # there is no duplicate IO operations.
digest_10GiB << data_1MiB # It is then sent to both digests.
end
puts digest_1GiB.hexdigest
digest_1GiB.reset
end
puts digest_10GiB.hexdigest
That takes around 40 seconds. The results are similar using openssl instead of digest.
If I comment out digest_1GiB << data_1MiB or digest_10GiB << data_1MiB, it unsurprisingly goes twice as fast (20 seconds). The bash command md5sum file_10GiB.txt also takes around 20 seconds, so that is consistent.
Clearly both MD5s are being calculated in the same thread and on the same core, so I thought I could use some multithreading. I'm using Ruby MRI 2.2.1, which doesn't have truly parallel threads, but with subprocesses I can calculate the MD5s on several cores simultaneously:
fd = File.open("file_10GiB.txt", "rb")
IO.popen("md5sum", "r+") do |md5sum_10GiB|
  10.times do
    IO.popen("md5sum", "r+") do |md5sum_1GiB|
      1024.times do
        data_1MiB = fd.read(2**20)    # Each MiB is read only once, so
        md5sum_1GiB << data_1MiB      # there are no duplicate IO operations.
        md5sum_10GiB << data_1MiB     # It is then piped to both md5sum processes.
      end
      md5sum_1GiB.close_write
      puts md5sum_1GiB.gets
    end
  end
  md5sum_10GiB.close_write
  puts md5sum_10GiB.gets
end
But this takes 120 seconds, three times slower. Why would that be?
Strangely, if I comment out md5sum_1GiB << data_1MiB or md5sum_10GiB << data_1MiB it doesn't take 60 seconds as expected, but 40, which is still half of the theoretical speed.
The results are similar using Open3.popen2 instead of IO.popen, and the same goes for openssl md5 instead of md5sum.
I have confirmed the speed differences of these pieces of code with significantly larger files to make sure they aren't just measurement noise; the proportions stay the same.
I have very fast IO storage with about 2.5 GiB/s sequential read, so I don't think the storage is the limiting factor here.
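For completeness, here is a sketch of one variant worth trying: feed each md5sum pipe from its own writer thread via a bounded SizedQueue, so that a blocking write to one pipe cannot stall the other. This is untested; the queue sizes and names are arbitrary.

require "thread"   # Queue/SizedQueue on MRI 2.2

fd = File.open("file_10GiB.txt", "rb")

IO.popen("md5sum", "r+") do |md5sum_10GiB|
  q_10GiB = SizedQueue.new(4)             # holds at most 4 MiB of pending data
  writer_10GiB = Thread.new do
    while (data = q_10GiB.pop)            # nil signals end of input
      md5sum_10GiB << data
    end
    md5sum_10GiB.close_write
  end

  10.times do
    IO.popen("md5sum", "r+") do |md5sum_1GiB|
      q_1GiB = SizedQueue.new(4)
      writer_1GiB = Thread.new do
        while (data = q_1GiB.pop)
          md5sum_1GiB << data
        end
        md5sum_1GiB.close_write
      end

      1024.times do
        data_1MiB = fd.read(2**20)        # single read, handed to both writers
        q_1GiB  << data_1MiB
        q_10GiB << data_1MiB
      end

      q_1GiB << nil                       # end of this 1 GiB chunk
      writer_1GiB.join
      puts md5sum_1GiB.gets
    end
  end

  q_10GiB << nil                          # end of file
  writer_10GiB.join
  puts md5sum_10GiB.gets
end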
Related
Short version:
How to read from STDIN (or a file) char by char while maintaining high performance in Ruby? (The problem is probably not Ruby specific.)
Long version:
While learning Ruby I'm designing a little utility that has to read piped text data, find and collect the numbers in it, and do some processing.
cat huge_text_file.txt | program.rb
input > 123123sdas234sdsd5a ...
output > 123123, 234, 5, ...
The text input might be huge (gigabytes) and it might not contain newlines or whitespace (any non-digit char is a separator), so I went with char-by-char reading (though I had my concerns about the performance), and it turns out this is incredibly slow.
Simply reading char by char with no processing on a 900 KB input file takes around 7 seconds!
while c = STDIN.read(1)
end
If I input data with newlines and read line by line, the same file is read about 100 times faster.
while s = STDIN.gets
end
It seems like reading from a pipe with STDIN.read(1) doesn't involve any buffering, and every time a read happens the hard drive is hit - but shouldn't that be cached by the OS?
Doesn't STDIN.gets read char by char internally until it encounters '\n'?
Using C, I would probably read the data in chunks, though I would have to deal with numbers being split across buffer boundaries; that doesn't look like an elegant solution for Ruby. So what is the proper way of doing this?
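For reference, the chunked version I have in mind would look roughly like this (a sketch only; the chunk size is arbitrary and handle_number stands in for whatever processing is done per number). A run of digits that ends exactly at a chunk boundary is carried over to the next chunk, so numbers are never split:

CHUNK_SIZE = 64 * 1024

def handle_number(number)
  # placeholder for the real processing
  puts number
end

carry = ""
while (chunk = STDIN.read(CHUNK_SIZE))
  buffer = carry + chunk
  if (tail = buffer.match(/\d+\z/))       # did the chunk end in the middle of a number?
    carry  = tail[0]                      # keep the trailing digits for the next round
    buffer = tail.pre_match
  else
    carry = ""
  end
  buffer.scan(/\d+/) { |number| handle_number(number) }
end
handle_number(carry) unless carry.empty?  # last number, if the input ended with digits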
P.S. Timing reading the same file in Python:
f = open("huge_text_file.txt")
for line in f:
    line
f.close()
Running time is 0.01 sec.
f = open("huge_text_file.txt")
c = f.read(1)
while c:
    c = f.read(1)
f.close()
Running time is 0.17 sec.
Thanks!
This script reads the IO object word by word, and executes the block every time 1000 words have been found or the end of the file has been reached.
No more than 1000 words will be kept in memory at the same time. Note that using " " as separator means that "words" might contain newlines.
This script uses IO#each with a separator (a space in this case, to get an Enumerator of words), lazy to avoid doing any operation on the whole file content at once, and each_slice to get an array of batch_size words.
batch_size = 1000
STDIN.each(" ").lazy.each_slice(batch_size) do |batch|
# batch is an Array of batch_size words
end
Instead of using cat and |, you could also read the file directly:
batch_size = 1000
File.open('huge_text_file.txt').each(" ").lazy.each_slice(batch_size) do |batch|
  # batch is an Array of batch_size words
end
With this code, no number will be split, no extra buffering logic is needed, it should be much faster than reading the file char by char, and it will use much less memory than reading the whole file into a String.
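If the end goal from the question is to collect the numbers, each batch can then be scanned in place, for example (assuming any non-digit character inside a word acts as a separator):

batch_size = 1000
STDIN.each(" ").lazy.each_slice(batch_size) do |batch|
  numbers = batch.flat_map { |word| word.scan(/\d+/) }
  puts numbers.join(", ")
end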
I'm working on a class to download videos from a URL.
I want to stream these videos instead of downloading them all at once, so my program uses less RAM.
The method is the following:
require 'httpclient'

def get_file(url, max_segment_size)
  http_client = HTTPClient.new
  segment = nil
  http_client.get_content(url) do |chunk|
    segment.nil? ? segment = chunk : segment << chunk
    if segment.size >= max_segment_size
      # send part to S3
      send_part(segment)
      segment = nil
    end
  end
  # send last part
  send_part(segment) if segment
end
However, the program still uses a lot of RAM. For example, streaming a 30 MB file makes the process consume 150 MB. Compared to downloading the whole file at once, it uses about the same amount of RAM. (I tried using net/http with the read_body method. Same results.)
My understanding was that setting segment = nil should free up the memory that the variable was using.
Is this expected to happen? Is there a way to manually free up this space on ruby?
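One way to check whether this is reclaimable garbage rather than a leak is to force a GC pass after each part is sent and watch the heap statistics; note that even after objects are collected, MRI does not necessarily hand freed pages back to the OS, so the process RSS can stay high. A diagnostic sketch only (same method as above, with hypothetical instrumentation added):

require 'httpclient'

def get_file(url, max_segment_size)
  http_client = HTTPClient.new
  segment = nil
  http_client.get_content(url) do |chunk|
    segment.nil? ? segment = chunk : segment << chunk
    if segment.size >= max_segment_size
      send_part(segment)
      segment = nil
      GC.start                 # ask Ruby to collect the dropped segment now
      p GC.stat                # heap_live_slots etc. should stay roughly flat if nothing leaks
    end
  end
  send_part(segment) if segment
end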
I have code that is running in a manner similar to the sample code below. There are two threads that loop at certain time intervals. The first thread sets a flag, and depending on the value of this flag the second thread prints out a result. My question is: in a situation like this, where only one thread changes the value of the shared resource (@flag) and the second thread only reads its value but never changes it, is a mutex lock required? Any explanations?
class Sample
  def initialize
    @flag = ""
    @wait_interval1 = 20
    @wait_interval2 = 5
  end

  def thread1(x)
    Thread.start do
      loop do
        if x.is_a?(String)
          @flag = 0
        else
          @flag = 1
          sleep @wait_interval1
        end
      end
    end
  end

  def thread2(y)
    Thread.start do
      loop do
        if @flag == 0
          if y.start_with?("a")
            puts "yes"
          else
            puts "no"
          end
        end
      end
    end
  end
end
As a general rule, the mutex lock is required (or better yet, a read/write lock so multiple reads can run in parallel and the exclusive lock's only needed when changing the value).
It's possible to avoid needing the lock if you can guarantee that the underlying accesses (both the read and the write) are atomic, i.e. that they happen as one uninterruptible action, so it's not possible for two of them to overlap. On modern multi-core and multi-processor hardware that's difficult to guarantee, and once you add in virtualization and semi-interpreted languages like Ruby it's all but impossible.
Don't be fooled into thinking that being 99.999% certain there won't be an overlap is enough. That just means you can expect an error due to the lack of locking once every 100,000 iterations, which translates to several times a second for your code and probably at least once every couple of seconds for the kind of code you'd see in a real application. That's why it's advisable to follow the general rule and not worry about when it's safe to break it until you've exhausted every other option for getting acceptable performance and shown through profiling that acquiring/releasing that lock is the bottleneck.
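A minimal sketch of what the locked version might look like for the sample class above (reads and writes of @flag both go through the same Mutex; the sleeps just preserve the looping-at-intervals behaviour described in the question):

class Sample
  def initialize
    @flag = ""
    @lock = Mutex.new
    @wait_interval1 = 20
    @wait_interval2 = 5
  end

  def thread1(x)
    Thread.start do
      loop do
        @lock.synchronize { @flag = x.is_a?(String) ? 0 : 1 }
        sleep @wait_interval1
      end
    end
  end

  def thread2(y)
    Thread.start do
      loop do
        flag = @lock.synchronize { @flag }   # take a consistent snapshot
        puts(y.start_with?("a") ? "yes" : "no") if flag == 0
        sleep @wait_interval2
      end
    end
  end
end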
I'm trying to create a large array containing about 64000 objects. The objects are truncated SHA256 digests of files.
The files are in 256 subdirectories (named 00 - ff) each containing about 256 files (varies slightly for each). Each file size is between around 1.5KB and 2KB.
The code looks like this:
require 'digest'
require 'cfpropertylist'
A = Array.new
Dir.glob('files/**') do |dir|
  puts "Processing dir #{dir}"
  Dir.glob("#{dir}/*.bin") do |file|
    sha256 = Digest::SHA256.file file
    A.push(CFPropertyList::Blob.new(sha256.digest[0..7]))
  end
end
plist = A.to_plist({:plist_format => CFPropertyList::List::FORMAT_XML, :formatted => true})
File.write('hashes.plist', plist)
If I process 16 directories (replacing 'files/**' with 'files/0*' in the above), the time it takes on my machine is 0m0.340s.
But if I try to process all of them, the processing speed drops drastically after about 34 directories have been processed.
This is on the latest OS X, using the stock ruby.
The machine is a mid-2011 iMac with 12GB memory and 3.4 GHz Intel Core i7.
The limiting factor does not seem to be the array size: if I remove the SHA256 processing and just store the filenames instead, there is no slowdown.
Is there anything I can do better or to track the issue? I don't have another OS or machine available at the moment to test if this is an OS X or machine specific thing.
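One cheap way to narrow it down is to time each directory individually and watch where the numbers start to climb; a sketch that only wraps the existing loop body in Benchmark.realtime:

require 'digest'
require 'benchmark'
require 'cfpropertylist'

hashes = []
Dir.glob('files/**') do |dir|
  elapsed = Benchmark.realtime do
    Dir.glob("#{dir}/*.bin") do |file|
      sha256 = Digest::SHA256.file(file)
      hashes.push(CFPropertyList::Blob.new(sha256.digest[0..7]))
    end
  end
  puts format('%-12s %.3fs', dir, elapsed)
end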
This turned out to be a disk/FS caching issue. After running the script to completion and rerunning it, the slowdown mostly disappeared. Also, running on another computer with an SSD didn't show a slowdown.
I need to transfer a file via sockets:
# sender
require 'socket'

SIZE = 1024 * 1024 * 10

TCPSocket.open('127.0.0.1', 12345) do |socket|
  File.open('c:/input.pdf', 'rb') do |file|
    while chunk = file.read(SIZE)
      socket.write(chunk)
    end
  end
end
# receiver
require 'socket'
require 'benchmark'

SIZE = 1024 * 1024 * 10

server = TCPServer.new("127.0.0.1", 12345)
puts "Server listening..."
client = server.accept

time = Benchmark.realtime do
  File.open('c:/output.pdf', 'wb') do |file|    # binary mode, matching the sender
    while chunk = client.read(SIZE)
      file.write(chunk)
    end
  end
end

file_size = File.size('c:/output.pdf') / 1024 / 1024
puts "Time elapsed: #{time}. Transferred #{file_size} MB. Transfer per second: #{file_size / time} MB" and exit
Using Ruby 1.9 I get a transfer rate of ~16 MB/s (~22 MB/s using 1.8) when transferring an 80 MB PDF file from/to localhost. I'm new to socket programming, but that seems pretty slow compared to just using FileUtils.cp. Is there anything I'm doing wrong?
Well, even with localhost, you still have to go through some of the TCP stack, introducing inevitable delays from packet fragmentation and rebuilding. It probably doesn't go out on the wire, where you'd be limited to a 100 megabit (~12.5 MB/s) or gigabit (~125 MB/s) theoretical maximum.
None of that overhead exists for raw disk-to-disk file copying. Keep in mind that even SATA 1 gave you 1.5 gigabits/sec, and I'd be surprised if you were still running anything that old. On top of that, your OS will undoubtedly be caching a lot of the file data, which isn't possible when sending over the TCP stack.
16 MB per second doesn't sound too bad to me.
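As a point of comparison, Ruby's built-in IO.copy_stream moves the data in large internal chunks (and can use sendfile where the platform supports it), so the per-chunk Ruby overhead drops out. A sketch of both ends, untested, using the same addresses and paths as the question:

# sender
require 'socket'

TCPSocket.open('127.0.0.1', 12345) do |socket|
  IO.copy_stream('c:/input.pdf', socket)
end

# receiver
require 'socket'

server = TCPServer.new('127.0.0.1', 12345)
client = server.accept
File.open('c:/output.pdf', 'wb') do |file|
  IO.copy_stream(client, file)
end
client.close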
I know this question is old, but why can't you compress before you send, then decompress on the receiving end?
require 'zlib'

def compress(input)
  Zlib::Deflate.deflate(input)
end

def decompress(input)
  Zlib::Inflate.inflate(input)
end
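For a whole-file transfer this could be wired in roughly like so (a sketch reusing the socket and client objects from the question's code; for chunked streaming, each compressed segment would have to be framed, e.g. length-prefixed, because deflate output can't be split at arbitrary read boundaries):

# sender
socket.write(compress(File.binread('c:/input.pdf')))
socket.close_write

# receiver
compressed = client.read          # read until the sender closes its side
File.binwrite('c:/output.pdf', decompress(compressed))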
(Shameless plug) AFT (https://github.com/wlib/aft) already does what you're building.