Class owning a thread doesn't get garbage collected
I have a Ruby service that needs to stream an object from S3 to somewhere else. The files get large and I don't want to store them on disk, so I wrote a simple class that buffers parts of the object so that other parts of the code can use it as an IO object.
Essentially it looks like this (the full code is available in the Gist linked below):
class S3ObjectStream
  attr_reader :s3_client, :object, :size

  def initialize(bucket, key, part_size: 5 * 1024 * 1024, num_threads: 5)
    @s3_client = Aws::S3::Client.new(...)
    @object = Aws::S3::Object.new(bucket_name: bucket, key:, client: @s3_client)
    @io = StringIO.new
    @size = @object.content_length
    initialize_parts
    start_parts
    ObjectSpace.define_finalizer(self,
      self.class.method(:finalize).to_proc)
  end

  def self.finalize(id)
    puts "S3ObjectStream #{id} dying"
  end

  def read(size, out_buf=nil)
    # Simplified, checks if more mem needed from parts
    get_mem if need_more
    @io.read(size, out_buf)
  end

  def need_more
    # check byte ranges
  end

  def get_mem
    # Simplified...
    part = @parts.shift
    @io.rewind
    @io << part.data
    start_next_part
  end

  def initialize_parts
    @parts = []
    # Determine # of parts required
    # Create instances of them
    nparts.each do
      part = DataPart.new(...)
      @parts.push(part)
    end
  end

  def start_parts
    # Start downloading parts concurrently by num of threads or total parts
    # These vars are set in initialize_parts, not shown in simplified code
    num_to_start = [@num_parts, @num_threads].min
    @parts.each_with_index do |part, i|
      break if i == num_to_start
      part.start
    end
  end

  def start_next_part
    @parts.each do |part|
      next if part.started?
      part.start
      break
    end
  end
end
class DataPart
  def initialize(s3_object, start_byte, end_byte)
    @s3_object = s3_object
    @start_byte = start_byte
    @end_byte = end_byte
    @range = "bytes=#{@start_byte}-#{@end_byte}"
    ObjectSpace.define_finalizer(self,
      self.class.method(:finalize).to_proc)
  end

  def self.finalize(id)
    puts "DataPart #{id} dying"
  end

  def start
    @thread = Thread.new do
      @part_data = @s3_object.get(range: @range).body.read
      nil # Don't want the thread to hold onto the string as Thread.value
    end
  end

  def data
    @thread.join
    @part_data
  end
end
The issue we're running into is the DataPart objects don't seem to be cleaned up by the garbage collection.
My understanding is once the DataPart goes out of scope in get_mem (shifted off the array, then leaves scope of the method), it should be unreachable and marked for cleaning.
Initially we were running into memory issues (graphs below) where the whole file was being held in memory. Adding the nil to the DataPart thread in start reduced the memory usage, but we were still seeing the objects stay around forever.
Here is a graph of the memory usage of this script
Adding destructor prints to the objects showed all the DataPart objects that were created weren't destroyed until the program exited even when the S3ObjectStreams that owned those objects and the arrays of them were being destroyed as expected.
gist showing test code and logs of objects being destroyed
When we remove the thread from start and do the part downloading in serial, the DataPart objects get destroyed as expected during runtime GC runs. But this obviously adds a ton of time to the whole process.
Graph of the memory usage after removing the thread
My question is, what would cause these DataParts to stick around with the inclusion of a thread? Is there a circular dependency here that I'm not understanding between the thread objects and the owning DataParts?
Rather than some objects not being garbage collected, I'd rather assume that your StringIO object in @io just gets larger on each read, since you append the data there in S3ObjectStream#get_mem.
As StringIO is basically just a normal String with a different interface to work like an IO object, what happens here is that you are just increasing the size of the underlying string, without ever releasing the read data again. Please be aware that with a StringIO object, just reading data from it will not delete previously read data from the String; you can always call rewind on it to read everything from the beginning again.
To avoid this, you should probably try to get rid of the @io object altogether and just use a simple String object. In get_mem, you can then append data to this string. In read, you can use String#byteslice to get up to size bytes of data and remove this read data from the buffer. That way, your buffer will not grow unbounded.
This can look like this:
class S3ObjectStream
  def initialize(bucket, key, part_size: 5 * 1024 * 1024, num_threads: 5)
    # ...
    # a mutable string in binary encoding
    @buffer = +"".force_encoding(Encoding::BINARY)
  end

  def get_mem
    part = @parts.shift
    @buffer << part.data
  end

  def read(size, out_buf = nil)
    # Simplified, checks if more mem needed from parts
    get_mem if need_more
    data = @buffer.byteslice(0, size)
    # Drop the bytes that were just read so the buffer does not keep growing
    @buffer = @buffer.byteslice(data.bytesize..-1)
    if out_buf
      out_buf.replace data
      out_buf
    else
      data
    end
  end
end
The out_buf is more or less useless in this implementation though and probably doesn't help in any way. But likely, it doesn't hurt either.
Note that neither this construct nor your previous StringIO object is thread-safe. If you are thus appending to and/or reading from the @buffer from multiple concurrent threads, you need to add appropriate mutexes.
In addition to the @io issue, it also appears from your simplified code that you start fetching all parts in parallel, each in its own thread. Thus, each DataPart object holds its read data in memory in the @part_data variable. As you initialize all DataPart objects for your data in parallel at the start, your memory will grow to contain all parts anyway. The construction of taking one data part at a time from the @parts array and appending its data to a buffer is thus rather pointless.
Instead, you probably have to only get a few DataParts (or one at a time) as they are consumed and continue creating/fetching additional DataParts as you read the data.
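As a rough illustration of that last point, here is a minimal sketch that keeps only a small window of parts in flight and starts a new download each time a part is consumed. The class name WindowedFetcher, the window parameter and the fetch_part block are made up for this example; in the real code the block would wrap the ranged S3 GET.

```ruby
# Hypothetical sketch: at most `window` parts are downloading or waiting
# to be consumed at any time.
class WindowedFetcher
  # fetch_part is any block that takes a part index and returns its bytes.
  def initialize(num_parts, window:, &fetch_part)
    @num_parts = num_parts
    @fetch_part = fetch_part
    @next_index = 0
    @in_flight = [] # threads for parts that were started but not yet consumed
    window.times { start_next }
  end

  # Returns the data of the next part in order, or nil once all parts
  # have been consumed.
  def next_part
    thread = @in_flight.shift
    return nil unless thread

    data = thread.value # waits for this download to finish
    start_next          # refill the window only after a part was consumed
    data
  end

  private

  def start_next
    return if @next_index >= @num_parts

    index = @next_index
    @next_index += 1
    @in_flight << Thread.new { @fetch_part.call(index) }
  end
end

fetcher = WindowedFetcher.new(5, window: 2) { |i| "part#{i}" }
parts = []
while (data = fetcher.next_part)
  parts << data
end
```

With this shape, memory use is bounded by roughly window times the part size plus whatever the consumer still holds, rather than by the size of the whole object.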
Related
I'm looking to implement a data structure with the following characteristics:
Operations
Push: Add an element to the front of the list.
Read: Read all elements in the list
Behavior
Fixed-size: The list should not grow beyond a specified threshold, and it should automatically truncate from the end (oldest item) if that threshold is exceeded. This does not need to be strictly enforced, but the list should eventually be truncated once it passes the threshold.
Concurrency-safe: The structure should safely accommodate multiple parallel pushers and readers
Non-blocking: This is the real problem. I'd like to use an implementation without locks. Many threads should be able to push/read simultaneously if possible. A less-desirable, but acceptable option would be an implementation that has locks, but minimizes contention between multiple pushers/readers. I'm familiar with reader-writer locks, but those assume infrequent writes, which is not my use-case.
Optional but nice-to-have
Write-read consistency: If a single thread pushes to the structure, a read immediately following should contain the written item. This would be nice, but I'm wondering whether excluding this requirement could make the above requirements easier to implement.
I'm mostly a novice in concurrent data structures. Does an example of such a data structure exist? Ring buffers are interesting, but I don't think they can be non-blocking. Linked-lists are promising, but the concurrency-safe, non-blocking requirements complicate the implementation considerably.
I have found some good papers on implementing non-blocking linked lists using atomic CAS (compare-and-swap) operations, but the fixed-size requirement throws a bit of a wrench into those. Maybe that idea can be adapted to a fixed-size list?
For what it's worth, I'm interested in using this in Ruby. I understand that MRI has the global-interpreter-lock, which makes this a bit useless for MRI, but other Ruby runtimes could take advantage of this, and I'm thinking of it as a learning exercise to grow my concurrent programming skills.
Analysis
This question might be a better fit on Software Engineering, rather than here on Stack Overflow, as it seems to be more of a design question. That said, I suggest using thread-safe arrays, or delegating resource contention to an MVCC database if you can't redesign your application to avoid a singular shared object altogether.
Recommendations
You can implement a thread-safe list or simulate a circular buffer using Concurrent::Array with the #unshift and #pop methods. You can also choose to externalize locking to something like a database, where Ruby's GIL is largely irrelevant to the underlying queue or locking mechanisms. However, to the best of my knowledge, there's no way to create a truly lockless concurrent access object in Ruby, although implementing your own multiversion concurrency control might come close.
The low-hanging fruit is probably to externalize your reads and writes to an MVCC-capable database such as PostgreSQL. If you can't or won't do that, you may need to accept the trade-offs inherent in the ACID properties and performance characteristics of your application and data structures. In particular, the use of a single shared data structure is a design decision you should perhaps re-evaluate if you can.
Before you start down that path, just make sure that you have a real performance problem to solve. While there are certainly cases where locks add noticeable overhead, many real-world applications are sufficiently performant even with Ruby's GIL in the mix. Your mileage may certainly vary.
You might consider creating a class such as the following. I don't consider this to be complete. Moreover, I have not considered non-blocking issues, which is a broad topic that is not specific to this class.
class TruncatedList
  attr_reader :max_size

  def initialize(max_size=Float::INFINITY)
    @max_size = max_size
    @list = []
  end

  # Add one element to the front, dropping the oldest element if full
  def >>(obj)
    @list.pop if @list.size == @max_size
    @list.unshift(obj)
  end

  def unshift(*arr)
    arr.each do |obj|
      @list.pop if @list.size == @max_size
      @list.unshift(obj)
    end
  end

  # Add one element to the end, unless the list is already full
  def <<(obj)
    if @list.size < @max_size
      @list << obj
    else
      @list
    end
  end

  def push(*arr)
    arr.each do |obj|
      break(@list) if @list.size == @max_size
      @list << obj
    end
  end

  def shift(n=1)
    return @list if @list.empty?
    case n
    when 0
      nil
    when 1
      @list.shift
    else
      @list.shift([n, @list.size].min)
    end
  end

  def pop(n=1)
    return nil if @list.empty?
    case n
    when 0
      nil
    when 1
      @list.pop
    else
      @list.pop([n, @list.size].min)
    end
  end

  def inspect
    @list.to_s
  end
  # The alias must come after inspect is defined to pick up this version
  alias to_s inspect

  def to_a
    @list
  end

  def size
    @list.size
  end
end
Here is an example of how this class might be used.
t = TruncatedList.new(6)
#=> #<TruncatedList:0x00007fe2db0512a0 @max_size=6, @list=[]>
t.inspect
#=> "[]"
t.to_a
#=> []
t >> 1
#=> [1]
t.inspect
#=> "[1]"
t.unshift(2,3)
#=> [2, 3]
t.inspect
#=> "[3, 2, 1]"
t.unshift(4,5,6,7,8)
#=> [4, 5, 6, 7, 8]
t.inspect
#=> "[8, 7, 6, 5, 4, 3]"
t.to_a
#=> [8, 7, 6, 5, 4, 3]
I think I came up with an interesting solution that meets the requirements.
Theory
We use a linked-list as the foundation, but add thread-safety and truncation on top.
Pushes
During pushes, we accomplish thread-safety by using a compare-and-set operation. The push will succeed only if another thread has not already pushed to the last-known list head. If the push fails, we simply retry until it succeeds.
Truncation
When the first node is pushed, we designate it as the "prune node". As items get pushed to the list, that node is pushed further down, but we maintain a reference to it. When the list reaches capacity, we break the link on the "prune node" to allow the following nodes to be garbage collected. Then we set the newest node as the "prune node". This way, the list size never exceeds "capacity * 2".
Reads
Because it's a linked-list without arbitrary insertions, we get mostly consistent reads because the list nodes will never be rearranged. We dereference the head when we start reading the list. We never read more elements than the configured capacity. If the list is truncated during a read, it's possible that we might not read enough nodes (this could be mitigated by saving the prune node when starting enumeration so that pruned nodes could still be read while the enumerator is active).
Thoughts
I'm pretty happy about the truncation mechanism, but it seems likely that a Mutex-based solution would perform just-as-well or even better than the CAS solution. It likely depends on how heavily contested the push operation is, and would need to be benchmarked.
require 'concurrent-ruby'

class SizedList
  attr_reader :capacity

  class Node
    attr_reader :value
    attr_reader :nxt

    def initialize(value, nxt = nil)
      @value = value
      @nxt = Concurrent::AtomicReference.new(nxt)
      @count = Concurrent::AtomicFixnum.new(0)
    end

    def increment
      @count.increment
    end
  end

  def initialize(capacity)
    @capacity = capacity
    @head = Node.new(nil)
    @prune_node = Concurrent::AtomicReference.new
  end

  def push(element)
    succeeded = false
    node = nil
    # Maybe should just use a mutex for this write instead of CAS
    # It needs to be benchmarked
    until succeeded
      first = @head.nxt.get
      node = Node.new(element, first)
      succeeded = @head.nxt.compare_and_set(first, node)
    end
    # Every N nodes where N=@capacity is designated as the "prune node"
    # Once we push N times, we drop all the nodes after the prune node by setting
    # its nxt value to nil.
    # Then we set the first node as the new prune node
    @prune_node.compare_and_set(nil, node) if @prune_node.get.nil?
    prune_node = @prune_node.get
    count = prune_node.increment
    if count >= @capacity
      if @prune_node.compare_and_set(prune_node, node)
        prune_node.nxt.set(nil)
      end
    end
    nil
  end

  def each(&block)
    enum = Enumerator.new do |yielder|
      current = @head
      # Here we just iterate through the list, but limit the results to @capacity
      @capacity.times do
        current = current.nxt.get
        break if current == nil
        yielder.yield(current.value)
      end
    end
    block ? enum.each(&block) : enum
  end
end
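For comparison, the Mutex-based alternative mentioned under "Thoughts" could look roughly like the sketch below. It is not benchmarked against the CAS version; the class name LockedSizedList and its internals are assumptions made for this example, and it only uses Ruby's built-in Mutex and Array.

```ruby
# Hypothetical Mutex-based counterpart to SizedList, for comparison only.
class LockedSizedList
  include Enumerable

  attr_reader :capacity

  def initialize(capacity)
    @capacity = capacity
    @items = []
    @lock = Mutex.new
  end

  # Adds an element to the front and drops the oldest ones beyond capacity.
  def push(element)
    @lock.synchronize do
      @items.unshift(element)
      @items.pop while @items.size > @capacity
    end
    nil
  end

  # Iterates over a snapshot, so the lock is not held while the block runs.
  def each(&block)
    snapshot = @lock.synchronize { @items.dup }
    snapshot.each(&block)
  end
end

list = LockedSizedList.new(2)
list.push(1)
list.push(2)
list.push(3)
newest_first = list.to_a # => [3, 2]
```

Unlike the CAS version, this one truncates strictly on every push and gives write-read consistency, at the cost of serializing all pushes and snapshot copies behind one lock.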
I'm looking for some StringIO-similar class, that allows me to write and read concurrently from different parts of my program.
From one part of the program I want to write (append) characters to the buffer, from another part I want to read them.
The problem with StringIO is the following:
buffer = StringIO.new
buffer.write "Foobar" # Write to the buffer
buffer.rewind # Move the pointer to beginning
buffer.getc #=> F
buffer.getc #=> o
buffer.write("something") # Write more to the buffer
buffer.string #=> Fosomething
buffer.getc #=> nil
buffer.pos #=> 11
Whenever I write to the buffer, it is written at the current position. Afterwards, the position is moved to just past the last written character.
What I need is a StringBuffer with two separate positions for reading and writing, instead of only one. Does something like this exist in Ruby or do I have to do it on my own?
You should consider using a Queue. If you do not need thread safety, then a simple array might be fine too.
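A minimal sketch of the Queue approach, assuming the writer and reader run in different threads: writes always go to the tail and reads always come from the head, so the two positions never interfere. The :done end-of-stream marker is a convention invented for this example, not part of Queue.

```ruby
# Queue (Thread::Queue) ships with Ruby and is thread-safe
buffer = Queue.new

writer = Thread.new do
  "Foobar".each_char { |c| buffer << c }
  buffer << :done # hypothetical end-of-stream marker
end

read = +""
# pop blocks until a character is available
while (c = buffer.pop) != :done
  read << c
end
writer.join

puts read # prints "Foobar"
```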
If your program is single-threaded, try coroutines, aka blocks.
def do_stuff
  yield rand(100)
end
100.times do
  do_stuff { |response| puts response }
end
I have an array to which I keep adding blocks of code at different points of time. When a particular event occurs, an iterator iterates through this array and yields the blocks one after the other.
Many of these blocks are the same and I want to avoid executing duplicate blocks.
This is sample code:
@after_event_hooks = []
def add_after_event_hook(&block)
  @after_event_hooks << block
end
Something like @after_event_hooks.uniq or @after_event_hooks |= block doesn't work.
Is there a way to compare blocks or check their uniqueness?
The blocks can not be checked for uniqueness since that will mean to check whether they represent the same functions, something that is not possible and has been researched in computer science for a long time.
You can probably use a function similar to the one discussed in "Ruby block to string instead of executing", which takes a block and returns a string representation of the code in the block, and then compare the strings you receive.
I am not sure whether this is fast enough to be worth it compared to simply executing the blocks multiple times. It also has the downside that the code must be exactly the same; even one variable with a different name will break it.
As @hakcho has said, it is not trivial to compare blocks. A simple solution might be to have the API ask for named hooks, so you can compare the names:
@after_event_hooks = {}
def add_after_event_hook(name, &block)
  @after_event_hooks[name] = block
end
def after_event_hooks
  @after_event_hooks.values
end
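To make the effect of named hooks concrete, here is a small sketch in which the same name is registered twice. The EventSource class and its fire method are hypothetical wrappers added so the example is self-contained; only the hash-by-name idea comes from the answer above.

```ruby
class EventSource # hypothetical host class for the hooks
  def initialize
    @after_event_hooks = {}
  end

  def add_after_event_hook(name, &block)
    @after_event_hooks[name] = block
  end

  # Calls every registered hook once and collects the return values
  def fire
    @after_event_hooks.values.map(&:call)
  end
end

source = EventSource.new
source.add_after_event_hook(:log) { "logged" }
source.add_after_event_hook(:log) { "logged" } # same name: replaces the earlier block
source.add_after_event_hook(:notify) { "notified" }
results = source.fire # => ["logged", "notified"]
```

The trade-off is that callers have to pick names consistently; two different blocks registered under the same name silently replace each other.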
Maybe use something like this:
class AfterEvents
  attr_reader :hooks

  def initialize
    @hooks = {}
  end

  def method_missing(hook_sym, &block)
    @hooks[hook_sym] = block
  end
end
Here is a sample:
events = AfterEvents.new
events.foo { puts "Event Foo" }
events.bar { puts "Event Bar" }
# test
process = {:first => [:foo], :sec => [:bar], :all => [:foo, :bar]}
process.each { |event_sym, event_codes|
  puts "Processing event #{event_sym}"
  event_codes.each { |code| events.hooks[code].call }
}
# results:
# Processing event first
# Event Foo
# Processing event sec
# Event Bar
# Processing event all
# Event Foo
# Event Bar
I've written some code in ruby to process items in an array via a threadpool. In the process, I've preallocated a results array which is the same size as the passed-in array. Within the threadpool, I'm assigning items in the preallocated array, but the indexes of those items are guaranteed to be unique. With that in mind, do I need to surround the assignment with a Mutex#synchronize?
Example:
SIZE = 1000000000
def collect_via_threadpool(items, pool_count = 10)
  processed_items = Array.new(items.count, nil)
  index = -1
  length = items.length
  mutex = Mutex.new
  items_mutex = Mutex.new
  [pool_count, length, 50].min.times.collect do
    Thread.start do
      while (i = mutex.synchronize{index = index + 1}) < length do
        processed_items[i] = yield(items[i])
        # ^ do I need to synchronize around this? `processed_items` is preallocated
      end
    end
  end.each(&:join)
  processed_items
end
items = collect_via_threadpool(SIZE.times.to_a, 100) do |item|
  item.to_s
end
raise unless items.size == SIZE
items.each_with_index do |item, index|
  raise unless item.to_i == index
end
puts 'success'
(This test code takes a long time to run, but appears to print 'success' every time.)
It seems like I would want to surround the Array#[]= with Mutex#synchronize just to be safe, but my question is:
Within Ruby's specification is this code defined as safe?
Nothing in Ruby is specified to be thread safe other than Mutex (and thus anything derived from it). If you want to know if your specific code is thread safe, you'll need to look at how your implementation handles threads and arrays.
For MRI, calling Array.new(n, nil) does actually allocate memory for the entire array, so if your threads are guaranteed to not share indices your code will work. It's as safe as having multiple threads operate on distinct variables without a mutex.
However for other implementations, Array.new(n, nil) might not allocate a whole array, and assigning to indices later could involve reallocations and memory copies, which could break catastrophically.
So while your code may work (in MRI at least), don't rely on it. While we're on the topic, Ruby's threads aren't even specified to actually run in parallel. So if you're trying to avoid mutexes because you think you might see some performance boost, maybe you should rethink your approach.
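If you prefer not to depend on implementation details, one option is to serialize only the store into the results array while leaving the actual work outside the lock. The following is a minimal sketch under that assumption; the method name collect_with_lock and the second mutex are made up for this example.

```ruby
def collect_with_lock(items, pool_count = 4)
  results = Array.new(items.length)
  index = -1
  index_lock = Mutex.new
  results_lock = Mutex.new

  [pool_count, items.length].min.times.map do
    Thread.new do
      while (i = index_lock.synchronize { index += 1 }) < items.length
        value = yield(items[i])                         # the work runs outside the lock
        results_lock.synchronize { results[i] = value } # only the store is serialized
      end
    end
  end.each(&:join)

  results
end

squares = collect_with_lock((1..10).to_a) { |n| n * n }
```

Because the lock is held only for a single array assignment, the added contention should be small compared with any non-trivial block, though that is worth measuring for your workload.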
I'm new to Ruby and I'm just playing around with ideas. What I would like to do is remove the @continent data from the country_array I have created. I've done a good number of searches and can find quite a bit of info on removing elements in their entirety, but I can't find how to specifically remove the @continent data. Please keep any answers fairly simple as I'm new; any help is much appreciated.
class World
  include Enumerable
  include Comparable
  attr_accessor :continent

  def <=> (sorted)
    @length = other.continent
  end

  def initialize(country, continent)
    @country = country
    @continent = continent
  end
end
a = World.new("Spain", "Europe")
b = World.new("India", "Asia")
c = World.new("Argentina", "South America")
d = World.new("Japan", "Asia")
country_array = [a, b, c, d]
puts country_array.inspect
[#<World:0x100169148 @continent="Europe", @country="Spain">,
#<World:0x1001690d0 @continent="Asia", @country="India">,
#<World:0x100169058 @continent="South America", @country="Argentina">,
#<World:0x100168fe0 @continent="Asia", @country="Japan">]
You can use remove_instance_variable. However, since it's a private method, you'll need to reopen your class and add a new method to do this:
class World
  def remove_continent
    remove_instance_variable(:@continent)
  end
end
Then you can do this:
country_array.each { |item| item.remove_continent }
# => [#<World:0x7f5e41e07d00 @country="Spain">,
#<World:0x7f5e41e01450 @country="India">,
#<World:0x7f5e41df5100 @country="Argentina">,
#<World:0x7f5e41dedd10 @country="Japan">]
The following example will set @continent to nil for the first World object in your array:
country_array[0].continent = nil
irb(main):035:0> country_array[0]
=> #<World:0xb7dd5e84 @continent=nil, @country="Spain">
But it doesn't really remove the continent variable since it's part of your World object.
Have you worked much with object-oriented programming? Is your World example from a book or tutorial somewhere? I would suggest some changes to how your World is structured. A World could have an array of Continents, and each Continent could have an array of Countries.
Names have meaning and variable names should reflect what they truly are. The country_array variable could be renamed to world_array since it is an array of World objects.
99% of the time I would recommend against removing an instance variable, because it's extra code for no extra benefit.
When you're writing code, generally you're trying to solve a real-world problem. With the instance variable, some questions to ask are:
What real world concept am I trying to model with the various states the variable can be in?
What am I going to do with the values stored in the variable?
If you're just trying to blank out the continent value stored in a World object, you can set @continent to nil as dustmachine says. This will work fine for 99% of the cases. (Accessing a removed instance variable will just return nil anyway.)
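The difference between the two approaches is only visible through reflection, as this small sketch shows. The Country class and the forget_continent method are hypothetical stand-ins for the World class from the question.

```ruby
class Country # hypothetical stand-in for the World class
  attr_accessor :continent

  def initialize(name, continent)
    @name = name
    @continent = continent
  end

  def forget_continent
    remove_instance_variable(:@continent)
  end
end

spain = Country.new("Spain", "Europe")

spain.continent = nil
after_nil = spain.instance_variable_defined?(:@continent)    # => true, the variable still exists

spain.forget_continent
after_remove = spain.instance_variable_defined?(:@continent) # => false, the variable is gone
value_after_remove = spain.continent                         # => nil, same as after assigning nil
```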
The only possible case (I can think of) when removing the instance variable could be useful is when you're caching a value that may be nil. For example:
class Player
  def score(force_reload = false)
    if force_reload && defined?(@score)
      # purge cached value (removing a variable that was never set raises NameError)
      remove_instance_variable(:@score)
    end
    # Calling 'defined?' on an instance variable will return nil if the variable
    # has never been set, or has been removed via force_reload.
    if not defined? @score
      # Set cached value.
      # Next time around, we'll just return the @score without recalculating.
      @score = get_score_via_expensive_calculation()
    end
    return @score
  end

  private

  def get_score_via_expensive_calculation
    if play_count.zero?
      return nil
    else
      # expensive calculation here
      return result
    end
  end
end
Since nil is a meaningful value for @score, we can't use nil to indicate that the value hasn't been cached yet. So we use the undefined state to tell us whether we need to recalculate the cached value. So there are 3 states for @score:
nil (means user has not played any games)
number (means user played at least once but did not accrue any points)
undefined (means we haven't fetched the calculated score for the Player object yet).
Now it's true that you could use another value that's not a number instead of the undefined state (a symbol like :unset for example), but this is just a contrived example to demonstrate the idea. There are cases when your variable may hold an object of unknown type.