Is it possible to decrease Ruby memory usage when streaming files? - ruby

I'm working on a class to download videos from a URL.
I want to stream these videos instead of downloading them all at once, so my program uses less RAM.
The function is the following:
def get_file(url, max_segment_size)
  http_client = HTTPClient.new
  segment = nil
  http_client.get_content(url) do |chunk|
    segment.nil? ? segment = chunk : segment << chunk
    if segment.size >= max_segment_size
      # send part to s3
      send_part(segment)
      segment = nil
    end
  end
  # send last part
  send_part(segment) if segment
end
However, the program still uses a lot of RAM. For example, streaming a 30MB file makes the process consume 150MB. Compared to downloading the whole file at once, it uses about the same amount of RAM. (I tried using net/http with the read_body method; same results.)
My understanding was that setting segment = nil should free up the memory that the variable was using.
Is this expected behavior? Is there a way to manually free up this memory in Ruby?
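For reference, here is a minimal sketch of one way to keep the working set small: reuse a single buffer String instead of letting a new one be allocated per segment, and optionally nudge the GC after each part is sent. send_part is the method from the question; the buffer-reuse approach and the explicit GC.start are assumptions on my part, not a confirmed fix.

def get_file(url, max_segment_size)
  http_client = HTTPClient.new
  buffer = String.new(capacity: max_segment_size) # pre-size the buffer once (Ruby 2.4+)

  http_client.get_content(url) do |chunk|
    buffer << chunk
    if buffer.size >= max_segment_size
      send_part(buffer) # assumes send_part is done with the string before it returns
      buffer.clear      # empties the string but keeps its allocation for reuse
      GC.start          # optional: ask Ruby to collect now instead of waiting
    end
  end

  send_part(buffer) unless buffer.empty?
end

Note that even after a successful GC, the Ruby process often does not return freed pages to the operating system right away, so the RSS reported by the OS can stay high even though the objects themselves are gone.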

Related

Incrementally writing Parquet dataset from Python

I am writing larger-than-RAM data out from my Python application - basically dumping data from SQLAlchemy to Parquet. My solution was inspired by this question. Even after increasing the batch size as hinted here, I am facing these issues:
RAM usage grows heavily
The writer starts to slow down after a while (write throughput drops more than 5x)
My assumption is that this is because the ParquetWriter metadata management becomes expensive as the number of rows increases. I am thinking that I should switch to datasets, which would allow the writer to close the file in the middle of processing and flush out the metadata.
My questions are:
Is there an example for writing incremental datasets with Python and Parquet?
Are my assumptions correct or incorrect, and would using datasets help to maintain the writer throughput?
My distilled code:
writer = pq.ParquetWriter(
    fname,
    Candle.to_pyarrow_schema(small_candles),
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',            # Highest available schema
    data_page_version='2.0',  # Highest available schema
)

def writeout():
    nonlocal data
    duration = time.time() - stats["started"]
    throughput = stats["candles_processed"] / duration
    logger.info("Writing Parquet table for candle %s, throughput is %s",
                "{:,}".format(stats["candles_processed"]), throughput)
    writer.write_table(
        pa.Table.from_pydict(
            data,
            writer.schema
        )
    )
    # Note: dict.fromkeys() with a mutable default makes every key share the
    # same list object - this turned out to be the cause (see below)
    data = dict.fromkeys(data.keys(), [])
    process = psutil.Process(os.getpid())
    logger.info("Flushed %s writer, the memory usage is %s", bucket, process.memory_info())

# Use a massive yield_per() or otherwise we are leaking memory
for item in query.yield_per(100_000):
    frame = construct_frame(row_type, item)
    for key, value in frame.items():
        data[key].append(value)

    stats["candles_processed"] += 1

    # Do regular checkpoints to avoid out of memory
    # and to log the progress to the console
    # For fine tuning the Parquet writer see
    # https://issues.apache.org/jira/browse/ARROW-10052
    if stats["candles_processed"] % 100_000 == 0:
        writeout()
In this case, the reason was the incorrect use of Python lists and dicts as a working buffer, as pointed out by 0x26res.
After making sure the dictionary of lists is cleared correctly, the memory consumption issues became negligible.

Memory leak after algorithm in Rails?

I wrote an algorithm inspired by the merge part of the merge sort.
def self.merge(arr)
  if arr.length == 1
    return arr
  end

  groups = []
  (0...-(-arr.length/2)).each do |i|
    groups << []
    if !arr[2*i+1].nil?
      arr[2*i].each do |cal1|
        arr[2*i+1].each do |cal2|
          mergecal = func(cal1, cal2)
          if mergecal
            groups[i] << mergecal
          else
            mergecal = nil
          end
        end
      end
    else
      groups[i] = arr[2*i]
    end
  end

  arr = nil
  return merge(groups)
end
After the page using this algorithm is rendered, Task Manager reports around 500MB of RAM usage. Then, by refreshing the same page again, memory usage reaches 1GB. I tried adding GC.start(full_mark: true) to the controller just after the function call, but nothing seems to have changed. I'm not sure whether the memory leak has to do with my code or with Ruby itself.
Ruby garbage collection doesn't immediately reduce the amount of memory your Ruby program has allocated. Memory allocation is expensive, so even if the objects you create are collected right away by the GC, the memory is only slowly released back to the OS. If you think this function has a memory leak, you should try running it in a non-Rails process where you have more control over object lifecycles. You can use GC.stat to get information about the number of live and free objects before and after you run the GC. It's also worth reading up on how Ruby GC works; I like this article.
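As a rough illustration of that suggestion, here is a minimal sketch of comparing GC.stat counters around a call to the function; how you invoke merge and the calendars variable are placeholders for your actual code:

before = GC.stat(:heap_live_slots)

result = merge(calendars) # stands in for however you call the class method
result = nil              # drop the only reference to the result

GC.start(full_mark: true, immediate_sweep: true)
after = GC.stat(:heap_live_slots)

puts "live slots before: #{before}, after GC: #{after}"
# If `after` drops back toward `before`, the intermediate arrays were collected
# and any remaining RSS is just memory the process has not yet returned to the OS.
# If it stays high, something is still holding references to them.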

Calculate hash from huge stream

I have to calculate a hash from various streams (StringIO, File, chunked HTTP responses...), and the sources are pretty big (around 100MB - 1GB). For example, I have the following code:
require 'digest'
require 'stringio'

sha = Digest::SHA256.new
stream = StringIO.new("test\nfoo\nbar\nhello world")
# this could also be a File.open('my_file.txt')
# or a chunked http response

while content = stream.read(2)
  sha.update content
end

puts sha.to_s
This works so far, but I was wondering how the sha.update method works. Does it store a copy of the overall String in its instance, so that the whole content is held in memory?
That could lead to some serious memory issues when loading 1GB of data into RAM (and doing this in multiple processes on the same machine).
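For what it's worth, a minimal sketch of hashing a large file in fixed-size chunks; big_file.bin and the 1MB chunk size are arbitrary choices of mine. Digest's update only folds each chunk into a small fixed-size internal state, so peak memory stays near the chunk size regardless of how big the input is:

require 'digest'

sha = Digest::SHA256.new
File.open('big_file.bin', 'rb') do |file|
  while chunk = file.read(1024 * 1024) # read 1 MB at a time
    sha.update(chunk)
  end
end
puts sha.hexdigest

For local files, Digest::SHA256.file('big_file.bin').hexdigest does the same chunked reading for you.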

Manipulate Audio File with Ruby

Streaming mp3 and ogg files from an internal server with authentication.
Trying to stream files to an HTML5 player but running into problems with Chrome's seek function.
I have all my headers set up, but how can I open a binary file, seek to a position, and only send data from that point to the end?
i.e. given an mp3 file that is 119132474 bytes long,
and a request comes in asking for the new start point of the file to be at byte 21012274,
how can I send back only the bytes from 21012274 to 119132474?
Here is something similar to what I want to do but in Node.js http://www.extrawurst.org/blog11/2012/06/streaming-media-in-nodejs/
------- UPDATE 02/15/2014 --------
I installed Redis and used Redis as a temp cache server of Binary data. Then used Redis's GETRANGE. See http://redis.io/commands/getrange
You can open the file in binary mode and use the methods from the IO module to read bytes. For example:
file_size = File.size('filename')
File.open('filename', 'rb') do |file| # read in binary mode
  file.seek(position)
  file.read(file_size - position)     # return all bytes from that point to the end
end
There is another method that should work, although I didn't test it with streaming: binread, which is simpler than the first approach. Note that its arguments are the number of bytes to read followed by the byte offset to start from:
File.binread('filename', file_size - start_pos, start_pos)
It should work!
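Building on that, here is a minimal sketch of sending the tail of the file in fixed-size chunks instead of one big read, so that a 100MB tail never sits in memory all at once. output is a stand-in for whatever you write the response body to (a socket, a Rack streaming body, etc.) and is my assumption, not part of the original answer:

CHUNK_SIZE = 64 * 1024 # 64 KB per read

File.open('filename', 'rb') do |file|
  file.seek(start_pos)
  while chunk = file.read(CHUNK_SIZE)
    output.write(chunk) # push each piece to the client as soon as it is read
  end
end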

Ruby script 'Killed'

I am running a Ruby script from the command line. The script downloads a file (15 MB), unzips it, parses it as JSON and then populates a MySQL db with it.
When I run it, I get a simple 'Killed' message back. What's going on? How can I find out what the problem is?
I am using it on an EC2 micro instance.
Thanks
Here's the script
require 'open-uri'
require 'zlib'
require 'json'
require_relative '../db/db.rb'

dl = open('........')
ex = Zlib::GzipReader.new dl
json = JSON.parse ex.read

events = json['resultsPage']['results']['event']
puts "starting to parse #{events.count} event(s)..."

created = 0
updated = 0

events[1..10].each do |event|
  performances = event['performance']
  performances.each do |performance|
    ar_show = Show.find_or_initialize_by_songkick_id performance['id']
    ar_show.artist_name = performance['displayName']
    ar_show.new_record? ? created += 1 : updated += 1
    ar_show.save!
  end
end

Import.create :updated => updated, :new => created
puts "complete. new: #{created} - updated: #{updated}"
You are almost certainly running out of memory, as a micro instance doesn't have much memory or swap space available. I've had this happen with Perl programs. Dynamic languages can use a lot of memory when processing large chunks of data.
The best way to test this theory is to spin up a small or large instance for under an hour (so you won't pay much for it) and try the script there. If it runs, you know that a micro instance is too small for your program to run on.
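One way to confirm that theory before paying for a bigger instance is to log the script's resident memory at each stage. Here is a minimal sketch, assuming Linux's /proc filesystem, meant to be dropped into the script above (it reuses the script's own requires and URL):

# crude helper: read the resident set size of the current process from /proc
def log_rss(label)
  rss_kb = File.read('/proc/self/status')[/VmRSS:\s+(\d+)/, 1].to_i
  puts "#{label}: #{rss_kb / 1024} MB resident"
end

log_rss('start')
ex = Zlib::GzipReader.new(open('........')) # same URL as in the script
raw = ex.read
log_rss('after download and unzip')
json = JSON.parse(raw)
log_rss('after JSON.parse')

If one of these steps pushes the number close to the instance's total memory, that step is what gets the process killed; the kernel's OOM killer usually also leaves a note you can see with dmesg.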
