Calculate hash from huge stream - ruby

I have to calculate a hash from various streams (StringIO, File, chunked http responses...), and the sources are pretty big (around 100MB - 1GB). For example, I have the following code
require 'digest'
sha = Digest::SHA256.new
stream = StringIO.new("test\nfoo\nbar\nhello world")
# this could also be a File.open('my_file.txt')
# or a chunked http response
while content = stream.read(2)
sha.update content
end
puts sha.to_s
This works so far, but I was wondering how the sha.update method works. Does it store a copy from the overall String in its instance, so that the whole content is hold in memory?
This could lead to some serious memory issues, when loading 1GB of data into RAM (and doing this on multiple processes on the same machine)

Related

Incrementally writing Parquet dataset from Python

I am writing a larger than RAM data out from my Python application - basically dumping data from SQLAlchemy to Parque. My solution was inspired by this question. Even though increasing the batch size as hinted here I am facing the issues:
RAM usage grows heavily
The writer starts to slow down after a while (write throughput speed drops more than 5x)
My assumption is that this is because the ParquetWriter metadata management becomes expensive when the number of rows increase. I am thinking that I should switch to datasets that would allow the writer to close the file in the middle of processing flush out the metadata.
My question is
Is there an example for writing incremental datasets with Python and Parquet
Are my assumptions correct or incorrect and using datasets would help to maintain the writer throughput?
My distilled code:
writer = pq.ParquetWriter(
fname,
Candle.to_pyarrow_schema(small_candles),
compression='snappy',
allow_truncated_timestamps=True,
version='2.0', # Highest available schema
data_page_version='2.0', # Highest available schema
) as writer:
def writeout():
nonlocal data
duration = time.time() - stats["started"]
throughout = stats["candles_processed"] / duration
logger.info("Writing Parquet table for candle %s, throughput is %s", "{:,}".format(stats["candles_processed"]), throughout)
writer.write_table(
pa.Table.from_pydict(
data,
writer.schema
)
)
data = dict.fromkeys(data.keys(), [])
process = psutil.Process(os.getpid())
logger.info("Flushed %s writer, the memory usage is %s", bucket, process.memory_info())
# Use massive yield_per() or otherwise we are leaking memory
for item in query.yield_per(100_000):
frame = construct_frame(row_type, item)
for key, value in frame.items():
data[key].append(value)
stats["candles_processed"] += 1
# Do regular checkopoints to avoid out of memory
# and to log the progress to the console
# For fine tuning Parquet writer see
# https://issues.apache.org/jira/browse/ARROW-10052
if stats["candles_processed"] % 100_000 == 0:
writeout()
In this case, the reason was the incorrect use of Python lists and dicts as a working buffer, as pointed out by #0x26res.
After making sure the dictionary of lists is cleared correctly, the memory consumption issues become negligible.

Is it possible to decrease Ruby memory usage when streaming files?

I'm working on a class to download videos from an url.
I want to stream these videos instead of downloading them at once, so my program uses less RAM.
The function is the following
def get_file(url, max_segment_size)
http_client = HTTPClient.new
segment = nil
http_client.get_content(url) do |chunk|
segment.nil? ? segment = chunk : segment << chunk
if segment.size >= max_segment_size
# send part to s3
send_part(segment)
segment = nil
end
end
# send last part
send_part(segment) if segment
end
However, the program still uses a lot of RAM. For example, streaming a file of 30MB makes the process consume 150MB. Comparing to downloading the whole file at once, it uses about the same amount of ram. (I tried using net/http with the read_body method. Same results)
My understanding was that setting segment = nil should free up the space on the memory that the variable was using.
Is this expected to happen? Is there a way to manually free up this space on ruby?

Manipulate Audio File with Ruby

Streaming mp3 and ogg files on internal server and authentication.
Trying to stream files to HTML5 player but running into problems with Chrome Seek To function.
Have all my headers setup, but how can I open a Binary File, Seek to position and only send data from that point to the end?
i.e. Given an mp3 file that is 119132474 long,
And A request comes in asking for the new start point of the file be at 21012274
How can I send a new binary file with only information from 21012274 to 119132474
Here is something similar to what I want to do but in Node.js http://www.extrawurst.org/blog11/2012/06/streaming-media-in-nodejs/
------- UPDATE 02/15/2014 --------
I installed Redis and used Redis as a temp cache server of Binary data. Then used Redis's GETRANGE. See http://redis.io/commands/getrange
You can open file in binary mode and use the methods from IO module to read bytes. For example:
file_size = File.size('filename')
File.open('filename', 'rb') do |file| # read in binary mode
file.seek(position)
file.read(file_size - position) # return all bytes until the end
end
There is another that should work, but I didn't test on streaming. The method is 'binread' which is simpler than the first one:
File.binread('filename', start_pos, offset)
It should work!

A simple TCP messaging protocol?

I want to send messages between Ruby processes via TCP without using ending chars that could restrict the potential message content. That rules out the naïve socket.puts/gets approach.
Is there a basic TCP message implementation somewhere in the standard libs?.
(I'd like to avoid Drb to keep everything simple.)
It seems like there is no canonical, reusable solution.
So here's a basic implementation for the archives:
module Messaging
# Assumes 'msg' is single-byte encoded
# and not larger than 4,3 GB ((2**(4*8)-1) bytes)
def dispatch(msg)
write([msg.length].pack('N') + msg)
end
def receive
if (message_size = read(4)) # sizeof (N)
message_size = message_size.unpack('N')[0]
read(message_size)
end
end
end
# usage
message_hub = TCPSocket.new('localhost', 1234).extend(Messaging)
The usual way to send strings in that situation is to send an integer (encoded however you like) for the size of the string, followed by that many bytes. You can save space but still allow arbitrary sizes by using a UTF-8-like scheme for that integer.

Ruby Memory Management

I have been using Ruby for a while now and I find, for bigger projects, it can take up a fair amount of memory. What are some best practices for reducing memory usage in Ruby?
Please, let each answer have one "best practice" and let the community vote it up.
When working with huge arrays of ActiveRecord objects be very careful... When processing those objects in a loop if on each iteration you are loading their related objects using ActiveRecord's has_many, belongs_to, etc. - the memory usage grows a lot because each object that belongs to an array grows...
The following technique helped us a lot (simplified example):
students.each do |student|
cloned_student = student.clone
...
cloned_student.books.detect {...}
ca_teachers = cloned_student.teachers.detect {|teacher| teacher.address.state == 'CA'}
ca_teachers.blah_blah
...
# Not sure if the following is necessary, but we have it just in case...
cloned_student = nil
end
In the code above "cloned_student" is the object that grows, but since it is "nullified" at the end of each iteration this is not a problem for huge array of students. If we didn't do "clone", the loop variable "student" would have grown, but since it belongs to an array - the memory used by it is never released as long as array object exists.
Different approach works too:
students.each do |student|
loop_student = Student.find(student.id) # just re-find the record into local variable.
...
loop_student.books.detect {...}
ca_teachers = loop_student.teachers.detect {|teacher| teacher.address.state == 'CA'}
ca_teachers.blah_blah
...
end
In our production environment we had a background process that failed to finish once because 8Gb of RAM wasn't enough for it. After this small change it uses less than 1Gb to process the same amount of data...
Don't abuse symbols.
Each time you create a symbol, ruby puts an entry in it's symbol table. The symbol table is a global hash which never gets emptied.
This is not technically a memory leak, but it behaves like one. Symbols don't take up much memory so you don't need to be too paranoid, but it pays to be aware of this.
A general guideline: If you've actually typed the symbol in code, it's fine (you only have a finite amount of code after all), but don't call to_sym on dynamically generated or user-input strings, as this opens the door to a potentially ever-increasing number
Don't do this:
def method(x)
x.split( doesn't matter what the args are )
end
or this:
def method(x)
x.gsub( doesn't matter what the args are )
end
Both will permanently leak memory in ruby 1.8.5 and 1.8.6. (not sure about 1.8.7 as I haven't tried it, but I really hope it's fixed.) The workaround is stupid and involves creating a local variable. You don't have to use the local, just create one...
Things like this are why I have lots of love for the ruby language, but no respect for MRI
Beware of C extensions which allocate large chunks of memory themselves.
As an example, when you load an image using RMagick, the entire bitmap gets loaded into memory inside the ruby process. This may be 30 meg or so depending on the size of the image.
However, most of this memory has been allocated by RMagick itself. All ruby knows about is a wrapper object, which is tiny(1).
Ruby only thinks it's holding onto a tiny amount of memory, so it won't bother running the GC. In actual fact it's holding onto 30 meg.
If you loop over a say 10 images, you can run yourself out of memory really fast.
The preferred solution is to manually tell the C library to clean up the memory itself - RMagick has a destroy! method which does this. If your library doesn't however, you may need to forcibly run the GC yourself, even though this is generally discouraged.
(1): Ruby C extensions have callbacks which will get run when the ruby runtime decides to free them, so the memory will eventually be successfully freed at some point, just perhaps not soon enough.
Measure and detect which parts of your code are creating objects that cause memory usage to go up. Improve and modify your code then measure again. Sometimes, you're using gems or libraries that use up a lot of memory and creating a lot of objects as well.
There are many tools out there such as busy-administrator that allow you to check the memory size of objects (including those inside hashes and arrays).
$ gem install busy-administrator
Example # 1: MemorySize.of
require 'busy-administrator'
data = BusyAdministrator::ExampleGenerator.generate_string_with_specified_memory_size(10.mebibytes)
puts BusyAdministrator::MemorySize.of(data)
# => 10 MiB
Example # 2: MemoryUtils.profile
Code
require 'busy-administrator'
results = BusyAdministrator::MemoryUtils.profile(gc_enabled: false) do |analyzer|
BusyAdministrator::ExampleGenerator.generate_string_with_specified_memory_size(10.mebibytes)
end
BusyAdministrator::Display.debug(results)
Output:
{
memory_usage:
{
before: 12 MiB
after: 22 MiB
diff: 10 MiB
}
total_time: 0.406452
gc:
{
count: 0
enabled: false
}
specific:
{
}
object_count: 151
general:
{
String: 10 MiB
Hash: 8 KiB
BusyAdministrator::MemorySize: 0 Bytes
Process::Status: 0 Bytes
IO: 432 Bytes
Array: 326 KiB
Proc: 72 Bytes
RubyVM::Env: 96 Bytes
Time: 176 Bytes
Enumerator: 80 Bytes
}
}
You can also try ruby-prof and memory_profiler. It is better if you test and experiment different versions of your code so you can measure the memory usage and performance of each version. This will allow you to check if your optimization really worked or not. You usually use these tools in development / testing mode and turn them off in production.

Resources