Why is this unused string not garbage collected? - ruby

Why do unused_variable_2 and unused_variable_3 get garbage collected, but not unused_variable_1?
# leaky_boat.rb
require "memprof"

class Boat
  def initialize(string)
    unused_variable_1 = string[0...100]
    puts unused_variable_1.object_id
    @string = string
    puts @string.object_id
  end
end

class Rocket
  def initialize(string)
    unused_variable_2 = string.dup
    puts unused_variable_2.object_id
    unused_variable_3 = String.new(string)
    puts unused_variable_3.object_id
    @string = string
    puts @string.object_id
  end
end

Memprof.start
text = "a" * 100
object_id_message = "Object ids of unused_variable_1, @string, unused_variable_2, unused_variable_3, and another @string"
before_gc_message = "Before GC"
after_gc_message = "After GC"
puts object_id_message
boat = Boat.new(text)
rocket = Rocket.new(text)
puts before_gc_message
Memprof.stats
ObjectSpace.garbage_collect
puts after_gc_message
Memprof.stats
Memprof.stop
Running the program:
$ uname -a
Linux [redacted] 3.2.0-25-generic #40-Ubuntu SMP Wed May 23 20:30:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
$ ruby --version # Have to use Ruby 1.8 - memprof doesn't work on 1.9
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]
$ ruby -rubygems leaky_boat.rb
Object ids of unused_variable_1, @string, unused_variable_2, unused_variable_3, and another @string
70178323299180
70178323299320
70178323299100
70178323299060
70178323299320
Before GC
2 leaky_boat.rb:6:String
2 leaky_boat.rb:26:String
1 leaky_boat.rb:9:String
1 leaky_boat.rb:7:String
1 leaky_boat.rb:32:Rocket
1 leaky_boat.rb:31:Boat
1 leaky_boat.rb:29:String
1 leaky_boat.rb:28:String
1 leaky_boat.rb:27:String
1 leaky_boat.rb:20:String
1 leaky_boat.rb:18:String
1 leaky_boat.rb:17:String
1 leaky_boat.rb:16:String
1 leaky_boat.rb:15:String
After GC
1 leaky_boat.rb:6:String
1 leaky_boat.rb:32:Rocket
1 leaky_boat.rb:31:Boat
1 leaky_boat.rb:29:String
1 leaky_boat.rb:28:String
1 leaky_boat.rb:27:String
1 leaky_boat.rb:26:String

This happens because the String implementation in your version of Ruby has a special case in its substring code that saves a memory allocation: when the substring is a tail of the source string and is too long to be stored inside the base object structure, the new string shares the source string's buffer instead of copying it.
If you trace the code, you will see that the range subscript string[0...100] goes through this clause in rb_str_substr. The new string is allocated via str_new3, which allocates a new object struct (hence the differing object_id) but sets the string value's ptr field to point into the source object's extended storage, and sets the ELTS_SHARED flag to indicate that the new object shares storage with another object.
In your code you take this new substring object and assign it to the instance variable @string, which is still a live reference when you run garbage collection. Since there is a live reference to the allocated storage of the original string, it can't be collected.
In Ruby trunk, this optimization to share storage on compatible tail substrings appears to still exist.
The two other variables, unused_variable_2 and unused_variable_3, don't have this extended storage sharing issue because they're set via mechanisms that ensure distinct storage, so they get garbage collected as expected when their references pass out of scope.
String#dup runs rb_str_replace (via the initialize_copy binding), which replaces the contents of the new string with a copy of the contents of the source string and ensures that the storage is not shared.
String.new(source_str) runs through rb_str_init, which similarly ensures distinct storage by calling rb_str_replace on the supplied initial value.
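As a rough illustration of the part of this that is observable from Ruby code (the buffer sharing itself is an interpreter detail, varies between Ruby versions, and cannot be seen directly), the sketch below checks that all three expressions produce distinct objects with equal contents, and that appending to the substring leaves the source untouched. The helper name distinct_copies is made up for this example.

```ruby
# Build the three kinds of copy discussed above.
# distinct_copies is a hypothetical helper, not part of any library.
def distinct_copies(source)
  {
    substring: source[0...source.length], # range subscript, like string[0...100]
    duplicate: source.dup,
    rebuilt:   String.new(source)
  }
end

source = "a" * 100
copies = distinct_copies(source)

copies.each do |label, copy|
  puts "#{label}: same content=#{copy == source}, same object=#{copy.equal?(source)}"
end

# Appending to a copy never changes the source, whether or not the two
# strings shared a buffer up to this point.
copies[:substring] << "b"
puts source.length             # 100
puts copies[:substring].length # 101
```

Every line of the report should show equal content but a different object, which matches the differing object_id values in the question.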

Related

Why does ghostscript hang when passing a procedure to a procedure to another procedure to invoke when using the same argument name?

I am passing a postscript procedure (c) as an argument on the stack to a procedure (a), which then passes this as an argument on the stack to another procedure (b) that invokes the procedure (c).
When I use a dictionary to localize variables and use the same name for the argument in a and b, ghostscript hangs (in the below code, these are procedures a1 and a2 and they both use proc). When I make them different names (in the below code these are procedures a and b which use Aproc and Bproc, respectively), it runs correctly.
Note that I am aware that I can just use the stack to pass down the original argument and avoid the whole exch def step, but in my real world code, I want to capture the stack argument to use locally.
Below is a minimal working example of the issue:
%!PS
/c{
  (C START) =
  (C output) =
  (C END) =
}bind def
/b{
  (B START) =
  1 dict begin
    /Bproc exch def
    Bproc
  end
  (B END) =
}bind def
/a{
  (A START) =
  1 dict begin
    /Aproc exch def
    {Aproc} b
  end
  (A END) =
}bind def
/b2{
  (B2 START) =
  1 dict begin
    /proc exch def
    proc
  end
  (B2 END) =
}bind def
/a2{
  (A2 START) =
  1 dict begin
    /proc exch def
    {proc} b2
  end
  (A2 END) =
}bind def
{c} a
% {c} a2
Here is the output with the above code (different argument names, no issues):
$ gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=test2.pdf test2.ps
GPL Ghostscript 9.54.0 (2021-03-30)
Copyright (C) 2021 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
A START
B START
C START
C output
C END
B END
A END
$
Change the last 2 lines of the code to the following (comment out the first, uncomment the second):
...
% {c} a
{c} a2
Now ghostscript just hangs:
$ gs -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=test2.pdf test2.ps
GPL Ghostscript 9.54.0 (2021-03-30)
Copyright (C) 2021 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
C-c C-c: *** Interrupt
$
Environment: OpenSuse:
$ uname -a
Linux localhost.localdomain 5.18.4-1-default #1 SMP PREEMPT_DYNAMIC Wed Jun 15 06:00:33 UTC 2022 (ed6345d) x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/os-release
NAME="openSUSE Tumbleweed"
# VERSION="20220619"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20220619"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20220619"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
DOCUMENTATION_URL="https://en.opensuse.org/Portal:Tumbleweed"
LOGO="distributor-logo-Tumbleweed"
I think you are being tripped up by early and late name binding, which is a subtle point in PostScript and often traps newcomers to the language.
Because of late name binding you can change the definition of a key/value pair during the course of execution, which can have unexpected effects. This is broadly similar to self-modifying code.
See section 3.12 Early Name Binding on page 117 of the 3rd Edition PostScript Language Reference Manual for details, but the basic point is right in the first paragraph:
"Normally, when the PostScript language scanner encounters an executable name in the program being scanned, it simply produces an executable name object; it does not look up the value of the name. It looks up the name only when the name object is executed by the interpreter. The lookup occurs in the dictionaries that are on the dictionary stack at the time of execution."
Also you aren't quite (I think) doing what you think you are with the executable array tokens '{' and '}'. That isn't putting the definition on the stack, exactly, it is defining an executable array on the stack. Not quite the same thing. In particular it means that the content of the array (between { and }) isn't being executed.
If you wanted to put the original function on the stack you would do '/proc load'. For example if you wanted to define a function to be the same as showpage you might do:
/my_showpage /showpage load def
Not:
/my_showpage {showpage} def
So looking at your failing case.
In function a2 you start by creating and opening a new dictionary. Said dictionary is only present on the dictionary stack, so if you ever pop it off it will go out of scope and be garbage collected (I'm sure you know this of course).
In that dictionary you create a key '/proc' with the associated value of an executable array. The array contains {c}.
Then you create a new executable array on the stack {proc}
You then execute function b2 with that executable array on the stack.
b2 starts another new dictionary, again on the stack, and then defines a key/value pair '/proc' with the associated value {proc}.
You then execute proc. That executes the array, the only thing in it is a reference to 'proc', so that is looked up starting in the current dictionary.
Oh look! We have a definition in the current dictionary, it's an executable array. So we push that array and execute it. The only thing in it is a reference to 'proc', so we look that up in the current dictionary, and we have one, so we execute that array. Round and round and round we go.....
You can avoid this by simply changing:
/a2{
  (A2 START) =
  1 dict begin
    /proc exch def
    {proc} b2
  end
  (A2 END) =
}bind def
To:
/a2{
  (A2 START) =
  1 dict begin
    /proc exch def
    /proc load b2
  end
  (A2 END) =
}bind def
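For readers who find the lookup loop easier to follow outside PostScript, here is a toy model of it in Ruby. It is only a sketch under simplifying assumptions: a procedure is an Array of Symbols, a lambda stands in for an operator, and ToyInterp, call_through and the depth limit are all invented for this example. It is not how Ghostscript is implemented.

```ruby
# Toy model of late name binding on a dictionary stack (illustration only).
class ToyInterp
  class LoopDetected < StandardError; end

  attr_reader :output

  def initialize(max_depth = 50)
    @dict_stack = [{}] # the bottom entry plays the role of userdict
    @output = []
    @max_depth = max_depth
  end

  def define(name, value)
    @dict_stack.last[name] = value
  end

  def begin_dict
    @dict_stack.push({})
  end

  def end_dict
    @dict_stack.pop
  end

  # Late binding: the name is resolved when it is executed, searching the
  # dictionary stack from the top down.
  def lookup(name)
    dict = @dict_stack.reverse.find { |d| d.key?(name) }
    raise KeyError, "undefined: #{name}" unless dict
    dict[name]
  end

  def run(obj, depth = 0)
    raise LoopDetected, "gave up after #{depth} nested executions" if depth > @max_depth
    case obj
    when Symbol then run(lookup(obj), depth + 1)
    when Array  then obj.each { |item| run(item, depth + 1) }
    when Proc   then obj.call
    end
  end
end

# Mirrors a/b (or a2/b2): the outer level stores {c} under outer_name, the
# inner level stores {outer_name} under inner_name and then executes it.
def call_through(interp, outer_name, inner_name)
  interp.begin_dict
  interp.define(outer_name, [:c])
  interp.begin_dict
  interp.define(inner_name, [outer_name])
  interp.run(inner_name)
ensure
  interp.end_dict
  interp.end_dict
end

interp = ToyInterp.new
interp.define(:c, -> { interp.output << "C output" })

call_through(interp, :Aproc, :Bproc)
p interp.output # ["C output"]

begin
  call_through(interp, :proc, :proc)
rescue ToyInterp::LoopDetected => e
  puts "same name on both levels: #{e.message}"
end
```

With distinct names the inner lookup falls through to the outer dictionary and reaches c. With the same name, the innermost dictionary always answers first, so the procedure keeps finding itself, which is the hang seen in Ghostscript (the model just stops at an arbitrary depth instead).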

Ruby Zlib compression gives different outputs for the same input

I have this Ruby method for compressing a string:
def compress_data(data)
  output = StringIO.new
  gz = Zlib::GzipWriter.new(output)
  gz.write(data)
  gz.close
  compressed_data = output.string
  compressed_data
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte array for the compressed outputs and compare them.
The output is Different when I run the following:
input = "hello world"
output1 = (compress_data input).bytes.to_a
sleep 1
output2 = (compress_data input).bytes.to_a
if output1 == output2
  puts 'Same'
else
  puts 'Different'
end
The output is Same when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The compression time is stored in the header. You can use the mtime method to set the time to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero will disable the timestamp. I tried it, and it does not work. I also looked at the source code, and it appears this functionality is missing. Seems like a bug. So you have to set it to something other than 0 (this is reportedly going to be fixed in future releases).
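Putting Option 1 together, a minimal sketch of the full method could look like this. The method name and the fixed_mtime parameter are made up for the example; the only library calls used are Zlib::GzipWriter#mtime=, #write and #close.

```ruby
require "zlib"
require "stringio"

# Gzip-compress data with a pinned header timestamp, so that equal inputs
# give byte-identical output. fixed_mtime is a hypothetical parameter.
def compress_data_deterministic(data, fixed_mtime = 1)
  output = StringIO.new("".b) # binary buffer for the compressed bytes
  gz = Zlib::GzipWriter.new(output)
  gz.mtime = fixed_mtime      # has to be set before the first write
  gz.write(data)
  gz.close
  output.string
end

first  = compress_data_deterministic("hello world")
second = compress_data_deterministic("hello world")
puts first == second ? "Same" : "Different"
```

The timestamp lives in bytes 4 to 7 of the gzip header (little-endian), so first.byteslice(4, 4).unpack1("V") gives back the value that was set.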
Option 2 - skip the header:
Another option is to just skip the header when checking for similar data. The header is 10 bytes long, so to only check the data:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
String.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.
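A small sketch of Option 2, assuming the header is exactly the fixed 10 bytes (true when no file name, comment or extra field is written, which is the case for the method in the question). The helper names and the constant are invented for this example.

```ruby
require "zlib"
require "stringio"

# Assumption: no optional gzip header fields, so the header is 10 bytes.
GZIP_FIXED_HEADER_SIZE = 10

# Helper for the demonstration: gzip data with an explicit header timestamp.
def gzip_with_mtime(data, mtime)
  io = StringIO.new("".b)
  gz = Zlib::GzipWriter.new(io)
  gz.mtime = mtime
  gz.write(data)
  gz.close
  io.string
end

# Compare two gzip strings while ignoring the fixed-size header.
def same_gzip_payload?(a, b)
  a.byteslice(GZIP_FIXED_HEADER_SIZE..-1) == b.byteslice(GZIP_FIXED_HEADER_SIZE..-1)
end

first  = gzip_with_mtime("hello world", 1)
second = gzip_with_mtime("hello world", 2)

puts first == second                   # false: the headers differ
puts same_gzip_payload?(first, second) # true: the compressed data is identical
```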

Getting memsize of shared Array space

tl;dr
require 'objspace'
ObjectSpace.memsize_of([0] * 1_000_000)
#=> 8000040
ObjectSpace.memsize_of(Array.new([0] * 1_000_000))
#=> 40
Where did it go?
Longer version
A whole bunch of stuff inside Array seems to have a concept of a "shared array" where the data block gets moved to a shared heap space. I'm aware that memsize_of makes it clear that it may be incomplete, but is there a (good?) way to analyze the allocation of these shared array blocks? They don't seem to be "objects" from the point of view of ObjectSpace.each_object. For the purposes of this memory profiler it would be nice to at least be able to track the overall size of the shared array heap space even if I can't trace it back to specific objects.
Fortunately rb_ary_memsize is a public function, so with a small hack you can do it:
#include <ruby.h>
#include <assert.h>

/* private macros from array.c */
#define ARY_OWNS_HEAP_P(a) (!FL_TEST((a), ELTS_SHARED|RARRAY_EMBED_FLAG))
#define ARY_SHARED_P(ary) \
    (assert(!FL_TEST((ary), ELTS_SHARED) || !FL_TEST((ary), RARRAY_EMBED_FLAG)), \
     FL_TEST((ary),ELTS_SHARED)!=0)

RUBY_FUNC_EXPORTED size_t
rb_ary_memsize(VALUE ary)
{
    if (ARY_OWNS_HEAP_P(ary)) {
        return RARRAY(ary)->as.heap.aux.capa * sizeof(VALUE);
    }
    /* -------8<------8<------- */
    else if (ARY_SHARED_P(ary)){
        /* if it is a shared array, calculate size using length of shared root */
        return RARRAY_LEN(RARRAY(ary)->as.heap.aux.shared) * sizeof(VALUE);
    }
    /* ------->8------>8------- */
    else {
        return 0;
    }
}
Compile it into a shared object:
gcc $(ruby -rrbconfig \
-e'puts RbConfig::CONFIG.values_at("rubyhdrdir","rubyarchhdrdir").map{|d| " -I#{d}"}.join') \
-Wall -fpic -shared -o ary_memsize_hack.so ary_memsize_hack.c
And load it into the process, replacing the original function:
LD_PRELOAD="$(pwd)/ary_memsize_hack.so" ruby -robjspace \
-e 'p ObjectSpace.memsize_of([0] * 1_000_000);
p ObjectSpace.memsize_of(Array.new([0] * 1_000_000))'
It will produce the desired output:
8000040
8000040
UPDATE:
The rb_ary_memsize function, which is in charge of estimating array size, only does so for arrays that own their heap (i.e. not shared and not embedded), and returns zero otherwise. In general this makes sense: if you add up the sizes of all arrays in the application, the numbers should eventually match, whereas with my patch the contents of shared arrays would be counted multiple times. I guess the main problem is the way the wrapping array is constructed on the Ruby side: the reference to the inner array is essentially lost, so it is not reachable by application code and looks uncountable. My patch only demonstrates how to reach the root of the shared array to expose its size; I don't think it should be integrated upstream in any way. A similar problem exists with embedded arrays, for which Ruby also returns 0 as the size, which is not what the application expects to see:
require 'objspace'
puts ObjectSpace.dump([1])
#=> {"address":"0x000008033f9bd8", "type":"ARRAY", "class":"0x000008029f9a98", "length":1,
# "embedded":true, "memsize":40, "flags":{"wb_protected":true}}
puts ObjectSpace.dump([1, 2])
#=> {"address":"0x000008033f9b38", "type":"ARRAY", "class":"0x000008029f9a98", "length":2,
# "embedded":true, "memsize":40, "flags":{"wb_protected":true}}
puts ObjectSpace.dump([1, 2, 3])
#=> {"address":"0x000008033f9ac0", "type":"ARRAY", "class":"0x000008029f9a98", "length":3,
# "embedded":true, "memsize":40, "flags":{"wb_protected":true}}
puts ObjectSpace.dump([1, 2, 3, 4])
#=> {"address":"0x000008033f9a48", "type":"ARRAY", "class":"0x000008029f9a98", "length":4,
# "memsize":72, "flags":{"wb_protected":true}}
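Without patching the interpreter, one way to at least see which arrays are embedded or shared is to parse the ObjectSpace.dump output. The sketch below assumes the dump omits the "embedded" and "shared" keys when the flag is not set (as in the output above) and treats a missing key as false. The helper name is made up, and the exact values depend on the Ruby version.

```ruby
require "objspace"
require "json"

# Return the storage-related fields that ObjectSpace.dump reports for an
# array. array_storage_info is a hypothetical helper for this example.
def array_storage_info(ary)
  info = JSON.parse(ObjectSpace.dump(ary))
  {
    length:   info["length"],
    memsize:  info["memsize"],
    embedded: info.fetch("embedded", false), # key is absent unless the flag is set
    shared:   info.fetch("shared", false)    # same assumption as above
  }
end

p array_storage_info([1, 2, 3])
p array_storage_info(Array.new([0] * 1_000_000))
```

This does not recover the size of the shared root, but it does let a profiler flag the arrays whose memsize is known to exclude the element buffer.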

Improving an algorithm for substring search when reading ZIP files

So I have a ZIP reader library, and I read ZIP files by first figuring out where the EOCD record is (the standard way "from the tail"). I have to look for a pattern that is roughly this:
4byte_magic_number, fixed_n_bytes, 2_bytes_of_comment_size, comment
The bytesize of comment is provided in the 2_bytes_of_comment_size. Just scanning for the magic number is insufficient, because I eager-read a substantial portion at the tail of the file - basically the maximum size the ZIP EOCD record can be, and then look for this pattern in there.
So far, I came up with this
def locate_eocd_signature(in_str)
  # We have to scan from the _very_ tail. We read the very minimum size
  # the EOCD record can have (up to and including the comment size), using
  # a sliding window. Once our end offset matches the comment size we found our
  # EOCD marker.
  eocd_signature_int = 0x06054b50
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  end_location = minimum_record_size * -1
  loop do
    # If the window is nil, we have rolled off the start of the string, nothing to do here.
    # We use negative values because if we used positive slice indices
    # we would have to detect the rollover ourselves
    break unless window = in_str[end_location, minimum_record_size]
    window_location = in_str.bytesize + end_location
    unpacked = window.unpack(unpack_pattern)
    # If we found the signature, pick up the comment size, and check if the size of the window
    # plus that comment size is where we are in the string. If we are - bingo.
    if unpacked[0] == eocd_signature_int && comment_size = unpacked[-1]
      assumed_eocd_location = in_str.bytesize - comment_size - minimum_record_size
      # if the comment size is where we should be at - we found our EOCD
      return assumed_eocd_location if assumed_eocd_location == window_location
    end
    end_location -= 1 # Shift the window back, by one byte, and try again.
  end
end
but it just screams ugly at me. Is there a better way to do something like this? Is there a pack specifier that says "all the bytes in binary until the end of the string" that I do not know of? Then I could tack that onto the end of the pack specifier, for example... A bit at a loss here.
In the end I opted for the following optimization. First, I made a method for finding all the indices of a given substring in a string - there is no stdlib builtin for this.
def all_indices_of_substr_in_str(of_substring, in_string)
  last_i = 0
  found_at_indices = []
  while last_i = in_string.index(of_substring, last_i)
    found_at_indices << last_i
    last_i += of_substring.bytesize
  end
  found_at_indices
end
Then, we use it to "latch" onto the offsets in our buffer where our signature was found.
def locate_eocd_signature(in_str)
  eocd_signature = 0x06054b50
  eocd_signature_str = [eocd_signature].pack('V')
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  str_size = in_str.bytesize
  indices = all_indices_of_substr_in_str(eocd_signature_str, in_str)
  indices.each do |check_at|
    maybe_record = in_str[check_at..str_size]
    # If the record is smaller than the minimum - we will never recover anything
    break if maybe_record.bytesize < minimum_record_size
    # Now we check if the record ends with the combination
    # of the comment size and an arbitrary byte string of that size.
    # If it does - we found our match
    *_unused, comment_size = maybe_record.unpack(unpack_pattern)
    if (maybe_record.bytesize - minimum_record_size) == comment_size
      return check_at # Found the EOCD marker location
    end
  end
  # If we haven't caught anything, return nil deliberately instead of returning the last statement
  nil
end
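For comparison, here is a sketch of a variant that keeps the "from the tail" direction of the original question by using String#rindex, so the candidate closest to the end of the buffer is checked first. This is not the code I ended up with, just an alternative under the same assumptions (22-byte minimum record, 2-byte little-endian comment size at offset 20). The method and constant names are made up.

```ruby
EOCD_SIGNATURE = [0x06054b50].pack("V").freeze
EOCD_MIN_SIZE = 22
COMMENT_SIZE_OFFSET = 20 # the comment size is the last 2 bytes of the fixed part

# Hypothetical variant: walk signature candidates backwards from the tail.
def locate_eocd_from_tail(buf)
  buf = buf.b # binary copy, so that all indices are byte offsets
  pos = buf.bytesize - EOCD_MIN_SIZE
  while pos >= 0 && (pos = buf.rindex(EOCD_SIGNATURE, pos))
    comment_size = buf.byteslice(pos + COMMENT_SIZE_OFFSET, 2).unpack1("v")
    # A real EOCD record ends exactly at the end of the buffer.
    return pos if pos + EOCD_MIN_SIZE + comment_size == buf.bytesize
    pos -= 1 # not a match, keep searching before this candidate
  end
  nil
end

record = [0x06054b50, 0, 0, 1, 1, 10, 0, 5].pack("VvvvvVVv") + "hello"
p locate_eocd_from_tail(("x" * 30) + record) # 30
```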

Match Multiple Patterns in a String and Return Matches as Hash

I'm working with some log files, trying to extract pieces of data.
Here's an example of a file which, for the purposes of testing, I'm loading into a variable named sample. NOTE: The column layout of the log files is not guaranteed to be consistent from one file to the next.
sample = "test script result
Load for five secs: 70%/50%; one minute: 53%; five minutes: 49%
Time source is NTP, 23:25:12.829 UTC Wed Jun 11 2014
D
MAC Address IP Address MAC RxPwr Timing I
State (dBmv) Offset P
0000.955c.5a50 192.168.0.1 online(pt) 0.00 5522 N
338c.4f90.2794 10.10.0.1 online(pt) 0.00 3661 N
990a.cb24.71dc 127.0.0.1 online(pt) -0.50 4645 N
778c.4fc8.7307 192.168.1.1 online(pt) 0.00 3960 N
"
Right now, I'm just looking for IPv4 and MAC addresses; eventually the search will need to include more patterns. To accomplish this, I'm using two regular expressions and passing them to Regexp.union:
patterns = Regexp.union(/(?<mac_address>\h{4}\.\h{4}\.\h{4})/, /(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/)
As you can see, I'm using named groups to identify the matches.
The result I'm trying to achieve is a Hash. The key should equal the capture group name, and the value should equal what was matched by the regular expression.
Example:
{"mac_address"=>"0000.955c.5a50", "ip_address"=>"192.168.0.1"}
{"mac_address"=>"338c.4f90.2794", "ip_address"=>"10.10.0.1"}
{"mac_address"=>"990a.cb24.71dc", "ip_address"=>"127.0.0.1"}
{"mac_address"=>"778c.4fc8.7307", "ip_address"=>"192.168.1.1"}
Here's what I've come up with so far:
sample.split(/\r?\n/).each do |line|
  hashes = []
  line.split(/\s+/).each do |val|
    match = val.match(patterns)
    if match
      hashes << Hash[match.names.zip(match.captures)].delete_if { |k,v| v.nil? }
    end
  end
  results = hashes.reduce({}) { |r,h| h.each {|k,v| r[k] = v}; r }
  puts results if results.length > 0
end
I feel like there should be a more "elegant" way to do this. My chief concern, though, is performance.
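One possible direction, offered only as a sketch: keep the patterns in a list, match each one against the whole line, and merge the named captures with MatchData#named_captures. It assumes at most one match per pattern per line and a Ruby new enough for filter_map (2.7+). The names PATTERNS and extract_fields are made up, and I have not measured how it compares in speed.

```ruby
# Hypothetical pattern list; more patterns can be appended later.
PATTERNS = [
  /(?<mac_address>\h{4}\.\h{4}\.\h{4})/,
  /(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/
].freeze

# Return one hash per line that matched at least one pattern.
def extract_fields(text, patterns = PATTERNS)
  text.each_line.filter_map do |line|
    row = patterns.each_with_object({}) do |pattern, acc|
      match = pattern.match(line)
      acc.merge!(match.named_captures) if match
    end
    row unless row.empty?
  end
end

log = <<~LOG
  MAC Address IP Address MAC RxPwr Timing I
  0000.955c.5a50 192.168.0.1 online(pt) 0.00 5522 N
  338c.4f90.2794 10.10.0.1 online(pt) 0.00 3661 N
LOG

extract_fields(log).each { |row| puts row }
```

Because each pattern has its own named group, there are no nil captures to strip out and no per-token splitting of the line.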

Resources