Ruby StringIO for concurrent reading and writing

I'm looking for a StringIO-like class that allows me to write and read concurrently from different parts of my program.
From one part of the program I want to write (append) characters to the buffer, from another part I want to read them.
The problem with StringIO is the following:
buffer = StringIO.new
buffer.write "Foobar" # Write to the buffer
buffer.rewind # Move the pointer to beginning
buffer.getc #=> F
buffer.getc #=> o
buffer.write("something") # Write more to the buffer
buffer.string #=> Fosomething
buffer.getc #=> nil
buffer.pos #=> 11
Whenever I write to the buffer, it is written at the current position, and afterwards the position is moved past the last written character.
What I need is a string buffer with two separate positions for reading and writing, instead of only one. Does something like this exist in Ruby, or do I have to build it on my own?

You should consider using a Queue. If you do not need thread safety, then a simple array might be fine too.
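For example, a minimal producer/consumer sketch with Queue; the :eof marker is just an illustrative convention for signalling end-of-stream, not part of the Queue API:
buffer = Queue.new

writer = Thread.new do
  "Foobar".each_char { |c| buffer << c }
  buffer << :eof # hypothetical end-of-stream marker
end

reader = Thread.new do
  while (c = buffer.pop) != :eof
    print c # pop blocks until the writer has pushed something
  end
end

[writer, reader].each(&:join)
# prints "Foobar"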

If your program is single-threaded, a plain block may already be enough:
def do_stuff
  yield rand(100)
end

100.times do
  do_stuff { |response| puts response }
end
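If you truly need coroutine-style interleaving without threads, a Fiber (Ruby's actual coroutine class) can play the same producer/consumer game; a minimal sketch:
producer = Fiber.new do
  "Foobar".each_char { |c| Fiber.yield(c) }
  nil # returning nil signals exhaustion to the consumer below
end

while (c = producer.resume)
  print c # each resume runs the producer up to its next Fiber.yield
end
# prints "Foobar"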


Class owning a thread doesn't get garbage collected
I have a Ruby service that needs to stream an object from S3 to somewhere else. The files get large and I don't want to store them on disk, so I wrote a simple class that buffers parts of the object so it can be used as an IO object by other parts of the code.
Essentially it looks like this; the full code is available in the Gist linked below:
class S3ObjectStream
  attr_reader :s3_client, :object, :size

  def initialize(bucket, key, part_size: 5 * 1024 * 1024, num_threads: 5)
    @s3_client = Aws::S3::Client.new(...)
    @object = Aws::S3::Object.new(bucket_name: bucket, key:, client: @s3_client)
    @io = StringIO.new
    @size = @object.content_length
    initialize_parts
    start_parts
    ObjectSpace.define_finalizer(self, self.class.method(:finalize).to_proc)
  end

  def self.finalize(id)
    puts "S3ObjectStream #{id} dying"
  end

  def read(size, out_buf=nil)
    # Simplified, checks if more mem needed from parts
    get_mem if need_more
    @io.read(size, out_buf)
  end

  def need_more
    # check byte ranges
  end

  def get_mem
    # Simplified...
    part = @parts.shift
    @io.rewind
    @io << part.data
    start_next_part
  end

  def initialize_parts
    @parts = []
    # Determine # of parts required
    # Create instances of them
    @num_parts.times do
      part = DataPart.new(...)
      @parts.push(part)
    end
  end

  def start_parts
    # Start downloading parts concurrently by num of threads or total parts
    # These vars are set in initialize_parts, not shown in simplified code
    num_to_start = [@num_parts, @num_threads].min
    @parts.each_with_index do |part, i|
      break if i == num_to_start
      part.start
    end
  end

  def start_next_part
    @parts.each do |part|
      next if part.started?
      part.start
      break
    end
  end
end

class DataPart
  def initialize(s3_object, start_byte, end_byte)
    @s3_object = s3_object
    @start_byte = start_byte
    @end_byte = end_byte
    @range = "bytes=#{@start_byte}-#{@end_byte}"
    ObjectSpace.define_finalizer(self, self.class.method(:finalize).to_proc)
  end

  def self.finalize(id)
    puts "DataPart #{id} dying"
  end

  def start
    @thread = Thread.new do
      @part_data = @s3_object.get(range: @range).body.read
      nil # Don't want the thread to hold onto the string as Thread#value
    end
  end

  def data
    @thread.join
    @part_data
  end
end
The issue we're running into is that the DataPart objects don't seem to be cleaned up by the garbage collector.
My understanding is once the DataPart goes out of scope in get_mem (shifted off the array, then leaves scope of the method), it should be unreachable and marked for cleaning.
Initially we were running into memory issues (graphs below) where the whole file was being held in memory. Adding the nil to the DataPart thread in start reduced the memory usage, but we were still seeing the objects stay around forever.
Here is a graph of the memory usage of this script
Adding finalizer prints to the objects showed that all the DataPart objects that were created weren't destroyed until the program exited, even though the S3ObjectStreams that owned them, and the arrays holding them, were being destroyed as expected.
gist showing test code and logs of objects being destroyed
When we remove the thread from start and do the part downloading in serial, the DataPart objects get destroyed as expected during runtime GC runs. But this obviously adds a ton of time to the whole process.
Graph of the memory usage after removing the thread
My question is, what would cause these DataParts to stick around with the inclusion of a thread? Is there a circular dependency here that I'm not understanding between the thread objects and the owning DataParts?
Rather than some objects not being garbage collected, I'd rather assume that your StringIO object in @io just gets larger on each read, since you append the data there in S3ObjectStream#get_mem.
As StringIO is basically just a normal String with a different interface to work like an IO object, what happens here is that you are just increasing the size of the underlying string, without ever releasing the read data again. Please be aware that with a StringIO object, just reading data from it will not delete previously read data from the String; you can always call rewind on it to read everything from the beginning again.
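A quick sketch of that behaviour:
require 'stringio'

io = StringIO.new
io << ("x" * 1024)
io.rewind
io.read(1024)      # consume everything
io.string.bytesize # => 1024 -- the read data is still held by the underlying string
io.rewind          # and it can all be read again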
To avoid this, you should probably try to get rid of the @io object altogether and just use a simple String object as a buffer. In get_mem, you can then append data to this string. In read, you can use String#slice! to take up to size bytes off the front of the buffer, so that already-read data is actually removed. That way, your buffer will not grow unbounded.
This could look like the following:
class S3ObjectStream
  def initialize(bucket, key, part_size: 5 * 1024 * 1024, num_threads: 5)
    # ...
    # a mutable string in binary encoding
    @buffer = +"".force_encoding(Encoding::BINARY)
  end

  def get_mem
    part = @parts.shift
    @buffer << part.data
  end

  def read(size, out_buf = nil)
    # Simplified, checks if more mem needed from parts
    get_mem if need_more

    # @buffer is binary, so the character-based slice! works byte-wise here;
    # it returns up to size bytes and removes them from the buffer
    data = @buffer.slice!(0, size)
    if out_buf
      out_buf.replace data
      out_buf
    else
      data
    end
  end
end
The out_buf is more or less useless in this implementation, though, and probably doesn't help in any way. But it likely doesn't hurt either.
Note that neither this construct nor your previous StringIO object is thread-safe. If you are thus appending to and/or reading from the @buffer from multiple concurrent threads, you need to add appropriate mutexes.
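A minimal sketch of what that could look like, assuming the @buffer String from above (@buffer_mutex is a name I made up):
# in initialize:
@buffer = +"".force_encoding(Encoding::BINARY)
@buffer_mutex = Mutex.new

# writer side:
def get_mem
  part = @parts.shift
  @buffer_mutex.synchronize { @buffer << part.data }
end

# reader side:
def read(size, out_buf = nil)
  get_mem if need_more
  data = @buffer_mutex.synchronize { @buffer.slice!(0, size) }
  out_buf ? out_buf.replace(data) : data
end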
In addition to the @io issue, it also appears from your simplified code that you are starting to fetch all parts in parallel, each in its own thread. Thus, each DataPart object holds its read data in memory in the @part_data variable. As you initialize all DataPart objects for your data in parallel at the start, your memory will grow to contain all parts anyway. The construction of shifting a DataPart off the @parts array and appending its data to a buffer is thus rather pointless.
Instead, you probably have to fetch only a few DataParts (or one at a time) as they are consumed, and continue creating/fetching additional DataParts as you read the data.
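A rough sketch of that idea: create and start DataParts lazily instead of all upfront (byte_range_for and the bookkeeping variables are made up for illustration):
def initialize_parts
  @parts = []
  @next_part_index = 0
  @num_threads.times { enqueue_next_part }
end

def enqueue_next_part
  return if @next_part_index >= @num_parts
  start_byte, end_byte = byte_range_for(@next_part_index) # hypothetical helper
  part = DataPart.new(@object, start_byte, end_byte)
  part.start
  @parts.push(part)
  @next_part_index += 1
end

def get_mem
  part = @parts.shift
  @buffer << part.data
  enqueue_next_part # replace the consumed part with the next one
end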

Make puts thread-safe

I have a multithreaded program that prints to the console in hundreds of places. Unfortunately, instead of
Line 2
Line 1
Line 3
I get
Line2Line1
Line3
I am trying to make puts thread safe.
In Python (which I don't think has this problem, but suppose it did), I'd do
old_print = print
print_mutex = threading.Lock()

def print(*args, **kwargs):
    print_mutex.acquire()
    try:
        old_print(*args, **kwargs)
    finally:
        print_mutex.release()
I'm trying this in Ruby,
old_puts = puts
puts_mutex = Mutex.new

def puts(*args)
  puts_mutex.synchronize {
    old_puts(*args)
  }
end
But this doesn't work: "undefined method old_puts"
How can I make puts thread-safe (i.e. not print partial lines)?
alias old_puts puts
or, in a more modern way:
module MyKernel
  PutsMutex = Mutex.new

  def puts(*)
    PutsMutex.synchronize { super }
  end
end

module Kernel
  prepend MyKernel
end
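With Kernel prepended like this, no call sites have to change; every plain puts now goes through the mutex:
threads = 10.times.map do |i|
  Thread.new { puts "Line #{i}" } # whole lines, never interleaved
end
threads.each(&:join)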
The reason for this behaviour is that puts internally calls the underlying write function twice: once for the actual value to be written, and once for the newline. (Explained in Ruby's puts is not atomic)
Here's a hack to make puts call write exactly once: Append \n to the string you're writing. Here's what this looks like in my code:
# Threadsafe `puts` that outputs text and newline atomically
def safe_puts(msg)
  puts msg + "\n"
end
puts internally checks whether the object being written has a newline at the end, and only calls write again if that isn't true. Since we've changed the input to end with a newline, puts ends up making only one call to write.

Puts arrays in file using ruby

This is a part of my file:
project(':facebook-android-sdk-3-6-0').projectDir = new File('facebook-android-sdk-3-6-0/facebook-android-sdk-3.6.0/facebook')
project(':Forecast-master').projectDir = new File('forecast-master/Forecast-master/Forecast')
project(':headerListView').projectDir = new File('headerlistview/headerListView')
project(':library-sliding-menu').projectDir = new File('library-sliding-menu/library-sliding-menu')
I need to extract the names of the libs. This is my ruby function:
def GetArray
  out_file = File.new("./out.txt", "w")
  File.foreach("./file.txt") do |line|
    l = line.scan(/project\(\'\:(.*)\'\).projectDir/)
    File.open(out_file, "w") do |f|
      l.each do |ch|
        f.write("#{ch}\n")
      end
    end
    puts "#{l} "
  end
end
My function returns this:
[]
[["CoverFlowLibrary"]]
[["Android-RSS-Reader-Library-master"]]
[["library"]]
[["facebook-android-sdk-3-6-0"]]
[["Forecast-master"]]
My problem is that out_file ends up empty. How can I write to a file? Beyond that, I only need to get the names of the libs from the file.
Meditate on this:
"project(':facebook-android-sdk-3-6-0').projectDir'".scan(/project\(\'\:(.*)\'\).projectDir/)
# => [["facebook-android-sdk-3-6-0"]]
When scan sees the capturing (...), it will create a sub-array. That's not what you want. The knee-jerk reaction is to flatten the resulting array of arrays but that's really just a band-aid on the code because you chose the wrong method.
Instead consider this:
"project(':facebook-android-sdk-3-6-0').projectDir'"[/':([^']+)'/, 1]
# => "facebook-android-sdk-3-6-0"
This is using String's [] method to apply a regular expression with a capture and return that captured text. No sub-arrays are created.
scan is powerful and definitely has its place, but not for this sort of "find one thing" parsing.
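scan is the right tool when you want every match in a string, e.g.:
"a1 b2 c3".scan(/\d+/) # => ["1", "2", "3"]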
Regarding your code, I'd do something like this untested code:
def get_array
  File.open('./out.txt', 'w') do |out_file|
    File.foreach('./file.txt') do |line|
      l = line[/':([^']+)'/, 1]
      out_file.puts l
      puts l
    end
  end
end
Methods in Ruby are NOT camelCase, they're snake_case. Constants, like classes, start with a capital letter and are CamelCase. Don't go all Java on us, especially if you want to write code for a living. So GetArray should be get_array. Also, don't start methods with "get_", and don't call it array; Use to_a to be idiomatic.
When building a regular expression start simple and do your best to keep it simple. It's a maintainability thing and helps to reduce insanity. /':([^']+)'/ is a lot easier to read and understand, and accomplishes the same as your much-too-complex pattern. Regular expression engines are greedy and lazy and want to do as little work as possible, which is sometimes totally evil, but once you understand what they're doing it's possible to write very small/succinct patterns to accomplish big things.
Breaking it down, it basically says: find the first ':, then capture text until the next ', which is what you're looking for. project( can be ignored, as can ).projectDir.
And actually,
/':([^']+)'/
could really be written
/:([^']+)'/
but I felt generous and looked for the leading ' too.
The problem is that you're opening the file twice: once in:
out_file = File.new("./out.txt", "w")
and then once for each line:
File.open(out_file, "w") do |f| ...
Try this instead:
def GetArray
  File.open("./out.txt", "w") do |f|
    File.foreach("./file.txt") do |line|
      l = line.scan(/project\(\'\:(.*)\'\).projectDir/)
      l.each do |ch|
        f.write("#{ch}\n")
      end # l.each
    end # File.foreach
  end # File.open
end # def GetArray

Read files into variables, using Dir and arrays

For an assignment, I'm using the Dir.glob method to read a series of famous speech files, and then perform some basic speech analytics on each one (number of words, number of sentences, etc). I'm able to read the files, but have not figured out how to read each file into a variable, so that I may operate on the variables later.
What I've got is:
Dir.glob('/students/~pathname/public_html/speeches/*.txt').each do |speech|
  # code to process the speech.
  lines = File.readlines(speech)
  puts lines
end
This prints all the speeches out onto the page as one huge block of text. Can anyone offer some ideas as to why?
What I'd like to do, within that code block, is to read each file into a variable, and then perform operations on each variable such as:
Dir.glob('/students/~pathname/public_html/speeches/*.txt').each do |speech|
  # code to process the speech.
  lines = File.readlines(speech)
  text = lines.join
  line_count = lines.size
  sentence_count = text.split(/\.|\?|!/).length
  paragraph_count = text.split(/\n\n/).length
  puts "#{line_count} lines"
  puts "#{sentence_count} sentences"
  puts "#{paragraph_count} paragraphs"
end
Any advice or insight would be hugely appreciated! Thanks!
Regarding your first question:
File.readlines converts the file into an array of Strings, and what you then see is the behaviour of puts with an array of Strings as the argument.
Try puts lines.inspect if you would rather see the data as an array.
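For instance:
lines = ["first\n", "second\n"]
puts lines         # prints each element on its own line
puts lines.inspect # prints ["first\n", "second\n"]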
Also: Have a look at the Ruby console irb in case you have not done so already. It is very useful for trying out the kinds of things you are asking about.
Here's what wound up working:
speeches = []
Dir.glob('/PATH TO DIRECTORY/speeches/*.txt').each do |speech|
  # code to process the speech.
  f = File.readlines(speech)
  speeches << f
end
def process_file(file_name)
  # count the lines
  line_count = file_name.size
  return line_count
end
process_file(speeches[0])

Functionally find mapping of first value that passes a test

In Ruby, I have an array of simple values (possible encodings):
encodings = %w[ utf-8 iso-8859-1 macroman ]
I want to keep reading a file from disk until the results are valid. I could do this:
good = encodings.find{ |enc| IO.read(file, "r:#{enc}").valid_encoding? }
contents = IO.read(file, "r:#{good}")
...but of course this is dumb, since it reads the file twice for the good encoding. I could program it in gross procedural style like so:
contents = nil
encodings.each do |enc|
  if (s = IO.read(file, "r:#{enc}")).valid_encoding?
    contents = s
    break
  end
end
But I want a functional solution. I could do it functionally like so:
contents = encodings.map{|e| IO.read(f, "r:#{e}")}.find{|s| s.valid_encoding? }
…but of course that keeps reading the file for every encoding, even if the first was already valid.
Is there a simple pattern that is functional, but does not keep reading the file after the first success is found?
If you sprinkle a lazy in there, map will only consume those elements of the array that are used by find - i.e. once find stops, map stops as well. So this will do what you want:
possible_reads = encodings.lazy.map {|e| IO.read(f, "r:#{e}")}
contents = possible_reads.find {|s| s.valid_encoding? }
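You can convince yourself of the short-circuiting with a side effect (illustrative only):
reads = %w[a b c].lazy.map { |x| puts "reading #{x}"; x }
reads.find { |x| x == "b" }
# prints "reading a" and "reading b"; "c" is never read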
Hopping on sepp2k's answer: If you can't use 2.0, lazy enums can be easily implemented in 1.9:
class Enumerator
  def lazy_find
    self.class.new do |yielder|
      self.each do |element|
        if yield(element)
          yielder.yield(element)
          break
        end
      end
    end
  end
end
a = (1..100).to_enum
p a.lazy_find { |i| i.even? }.first
# => 2
You want to use the break statement:
contents = encodings.each do |e|
  s = IO.read(f, "r:#{e}")
  s.valid_encoding? and break s
end
The best I can come up with is with our good friend inject:
contents = encodings.inject(nil) do |s, enc|
  s || ((c = IO.read(f, "r:#{enc}")).valid_encoding? && c)
end
This is still sub-optimal because it continues to loop through encodings after finding a match, though it doesn't do anything with them, so it's a minor ugliness. Most of the ugliness comes from...well, the code itself. :/
