Is Ruby garbage collection affected by intermediate variables?

Does creating intermediate variables cause the garbage collector to do more work?
That is, is there any difference between:
output = :asdf.to_s.upcase
and
str = :asdf.to_s
output = str.upcase
? (Assume str is never referenced again.)

It would be a trivial amount of extra work when marking objects still referenced, assuming both str and output were still in scope (i.e. the binding where they exist was still active) when the GC mark phase began. Both variables would start a mark on the same string. I don't know, but suspect that when marking objects as still viable, if Ruby comes across an item already marked, it will probably stop recursing and go to its next item at the same level. In this case the String is a single object without child objects to mark further, so it's one quick call to rb_gc_mark repeated for each reference to the String - one case where it is marked, and another case where Ruby notes it has already been marked and stops recursing.
If neither variable were in any active binding when the GC mark phase began, there is no extra work: the String would not get marked (no work), and the sweep phase would delete it just once (the same work no matter how many references were active before).
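One way to check this empirically is to count allocations for both forms. The sketch below assumes CRuby (MRI), where GC.stat(:total_allocated_objects) is available; the helper name allocations_for is made up for illustration. Each block is run once before measuring so that one-time interpreter bookkeeping is not counted.

```ruby
# Counts objects allocated while the block runs (CRuby only).
def allocations_for
  yield # warm-up run, so one-time interpreter bookkeeping is not counted
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# The chained form from the question.
def chained_allocations
  allocations_for { :asdf.to_s.upcase }
end

# The form with an intermediate variable.
def stepwise_allocations
  allocations_for do
    str = :asdf.to_s
    str.upcase
  end
end

puts "chained:  #{chained_allocations}"
puts "stepwise: #{stepwise_allocations}"
```

On CRuby both forms should report the same count, since the local variable is only a slot in the stack frame and not an object of its own.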

Related

Garbage collection in Ruby 2, what happens to ignored method's returned values?

Let's say we have a Ruby class like this:
class MyClass
def my_method(p)
# do some cool stuff with a huge amount of objects
my_objects = ...
return my_objects
end
end
And somewhere else in the application there's a function that calls MyClass's my_method, pretty much like this:
def my_func
#doing some stuff ..
MyClass.new.my_method(some_param)
#doing other stuff ..
end
What happens to the list of objects, is it eligible for garbage collection? Is it possible to know roughly when it's going to be collected?
Is there a way to "mark" the list as eligible for GC? Maybe like this:
def my_func
#doing some stuff ..
objects = MyClass.new.my_method(some_param)
objects = nil #does this make any difference?
#doing other stuff ..
end
The GC destroys all objects that are no longer referenced by your code. By setting objects to nil you change what the variable refers to, so the returned objects become eligible for collection, but exactly the same thing happens with the first version of the code. The real question is why you need this object to be GC'ed at a precise moment; it shouldn't affect your code at all.
If you really want better control over garbage collection, look at the GC module: http://www.ruby-doc.org/core-1.9.3/GC.html. Note that you can call GC.start, which forces the GC to run at that moment (even if there is nothing to collect).
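A minimal sketch of forcing a collection, assuming CRuby. The body of my_method and the helper gc_runs_during are placeholders, since the question elides what the method actually builds. GC.count is used only to confirm that a collection ran; it does not tell you which objects were reclaimed.

```ruby
class MyClass
  # Placeholder body: builds and returns a large array of strings.
  def my_method(count)
    Array.new(count) { |i| "object #{i}" }
  end
end

def my_func
  MyClass.new.my_method(100_000) # return value is ignored
  nil
end

# Returns how many times the GC ran while the block executed.
def gc_runs_during
  runs_before = GC.count
  yield
  GC.count - runs_before
end

runs = gc_runs_during do
  my_func
  GC.start # request a full collection now
end
puts "GC ran #{runs} time(s)"
```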
Items returned from the function are eligible for being collected once nothing else points to them.
So, if you ignore the return value and nothing else remembers those objects, then yes, they can be GC'ed.
If you store the result in the objects variable, then the returned values will be 'pinned' (**) as long as the objects variable still remembers them (***). When you nil that variable, they are released and become pending for collection. Nilling that variable may speed up their collection, but does not necessarily have to. (*)
UNLESS anything else still remembers them. If, between objects = f() and objects = nil, you read the values from the objects variable and pass them to other functions or methods, and those happen to store the objects, then of course that pins them too. Nilling will then help a bit in releasing the resources, but will not cause any immediate collection. (*)
(*) In general, in environments with GC, you never actually know when the GC will run and what it will collect. You just know that objects that were forgotten by everyone will eventually be removed automatically. Nothing more. Theoretically, the GC may choose not to run at all if your machine has terabytes of free memory.
(**) In some environments (like .NET) "pinning" is a precise term. Here I said it like that just to help you imagine how it works. I do not mean real pinning of memory blocks for communication with lower-level libraries, etc.
(***) When an object A remembers object B, which remembers object C, and B becomes forgotten, and only B (and no one else) remembers C, then both B and C are GC'ed. So you don't have to nil the objects. If the thing that contains the objects variable at some point becomes 'forgotten', then the "outer thing", "objects" and the "returned items" will all be GC'ed. At least they should be, if the GC implementation is OK.
This leaves one more thing to say: I am not talking about the GC in Ruby 2.0 specifically. All I've said applies to garbage collectors in general, including Java, .NET, ObjC (with GC) and others. If you need to know precisely what happens in Ruby 2.0 and the gory details of its GC implementation, ask directly about that :)
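The A, B, C chain from footnote (***) can be illustrated with a toy model of a mark phase. This is only a sketch of reachability in plain Ruby, not how any real collector is implemented; the mark method and the REFERENCES table are made up for illustration.

```ruby
require "set"

# Toy model of a mark phase: `references` maps each object name to the
# names it refers to; `roots` are the names held by live variables.
def mark(roots, references)
  marked = Set.new
  pending = roots.dup
  until pending.empty?
    name = pending.pop
    next if marked.include?(name) # already marked: do not visit it again
    marked << name
    pending.concat(references.fetch(name, []))
  end
  marked
end

REFERENCES = { "A" => ["B"], "B" => ["C"], "C" => [] }.freeze

# A still remembers B, so everything is reachable.
p mark(["A"], REFERENCES).to_a.sort
# A has forgotten B, so neither B nor C gets marked.
p mark(["A"], REFERENCES.merge("A" => [])).to_a.sort
```

Anything left unmarked (B and C in the second call) is what a sweep phase would reclaim.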

Deallocate memory previously allocated to a variable (using create)

I'm reading the Gforth manual on memory allocation / deallocation, and this is something I cannot understand. Suppose I allocated a chunk of memory to hold four integers like this:
create foo 1 , 2 , 3 , 4 ,
Then, maybe I allocated more memory and perhaps deallocated some too, and now I want to deallocate foo. How do I do that? Doing foo free and foo 4 cells free results in an error.
One option is to use forget foo but that will 'deallocate' everything that you have defined since you defined foo, and worse than that Gforth doesn't implement it. In Gforth you have to use a 'marker', but this also will revert everything that happened after the marker.
For example (I'll show what you would get entering this into a Gforth interpreter, including the interpreter's responses (denoted by double asterisks)):
marker -unfoo **ok**
create foo 1 , 2 , 3 , 4 , **ok**
\ A test word to get the first thing in foo (1) back
: test foo @ . ; **ok**
test **1 ok**
-unfoo **ok**
foo
**:8: Undefined word
>>>foo<<<
Backtrace:
$7FAA4EB4 throw
$7FAB1628 no.extensions
$7FAA502C interpreter-notfound1**
test
**:8: Undefined word
>>>test<<<
Backtrace:
$7FAA4EB4 throw
$7FAB1628 no.extensions
$7FAA502C interpreter-notfound1**
The example is meant to illustrate that foo and test are both gone after you execute -unfoo.
How this actually works is probably by moving the address that the interpreter treats as the last thing added to the dictionary. -unfoo moves this back to before the address at which foo was added, which is equivalent to freeing the memory used by foo.
Here is another reference for this: Starting Forth, which is pretty excellent for picking up Forth in general.
In response to a comment on this answer:
This question is quite similar and this answer is pretty helpful. This is probably the most relevant part of the Gforth documentation.
The links above explain Forth versions of malloc(), free() and resize().
So in answer to your original question, you can use free but the memory that you free has to have been allocated by allocate or resize.
create adds an item to the dictionary, so it is not exactly what you want if you are going to want the memory back. My understanding, which may be incorrect, is that you wouldn't normally remove things from the dictionary during the course of normal execution.
The best way to store a string depends on what you want to do with it. If you don't need it to exist for the lifetime of the programme, you can just use s" by itself, as this returns an address and a length.
In general, I would say that using create is quite a good idea, but it does have limitations. If the string changes you will have to create a new dictionary entry for it. If you can set an upper bound on the string length, then once you have created a word you can go back and overwrite the memory that has been allotted for it.
This is another answer that I gave that gives an example of defining a string word.
So in summary, if you really do need to be able to deallocate the memory, use the heap words that Gforth provides (I think that they are in the Forth standard but I don't know if all Forths implement them). If you don't, you can use the dictionary as per your question.
The CREATE ALLOT and VARIABLE words consume dictionary space (look it up in the ISO 93 standard.)
Traditionally you can
FORGET aap
but that removes aap and every definition made after aap, which is totally different from free().
In complicated Forths like gforth this simple mechanism no longer works. It amounted to truncating the linked list of definitions and resetting an allocation pointer (HERE/DP).
In gforth you are obliged to use MARKER. After
MARKER aap
you can execute aap to remove aap and all words defined later.
MARKER is cumbersome and it is much easier to restart your Forth.
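To make the "truncate and reset the pointer" idea concrete, here is a toy model written in Ruby rather than Forth. It is not how Gforth is implemented; ToyDictionary and its methods are invented for illustration, and the model only tracks word names, not the memory behind them.

```ruby
# Toy model only: a dictionary as an append-only list of word names.
class ToyDictionary
  def initialize
    @entries = []   # word names, oldest first
    @markers = {}   # marker name => its index in @entries
  end

  # Models CREATE: the new word goes at the end of the dictionary.
  def create(name)
    @entries << name
  end

  # Models MARKER: the marker is itself a word, and remembers where it sits.
  def marker(name)
    @markers[name] = @entries.length
    @entries << name
  end

  # Executing a marker drops the marker and everything defined after it.
  def run_marker(name)
    index = @markers.fetch(name)
    @entries = @entries.first(index)
    @markers.delete_if { |_, recorded| recorded >= index }
  end

  def words
    @entries.dup
  end
end

dict = ToyDictionary.new
dict.create("before")
dict.marker("-unfoo")
dict.create("foo")
dict.create("test")
dict.run_marker("-unfoo")
p dict.words # only "before" is left
```

The point of the model is that there is no way to remove foo alone: everything after the marker goes with it, which is why this differs from free().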

The scope is confusing

I am confused with the scope of a variable inside a block. This works:
f = 'new'
[1,2,3].each do |n| puts f * n end
#=> new newnew newnewnew
But this doesn't:
[1,2,3].each do |n|
a ||=[]
a << n
end
a
#=> a does not exist!
Why is this? And please put some resource on this topic for me.
What's confusing?
In the first snippet f is created and then the each block is executed, which can see things outside itself (called enclosing scope). So it can see f.
In the second snippet you create a inside the block, and so its scope is that block. Outside the block, a doesn't exist.
When you refer to a name (a, for instance) ruby will go from the current scope outwards, looking in all enclosing scopes for the name. If it finds it in one of the enclosing scopes, it uses the value that name is associated with. If not, it goes back to the most local scope and creates the name there. Subsequent name lookups will yield the value tied to that name.
When a block ends, the names that were in that scope are lost (the values aren't lost, just the names; values are lost when the garbage collector sees that no more names (or anything) refer to that value, and the gc collects the value to reuse its memory).
If visualization is your thing, I find it helpful to think of scopes as a staircase, and at the beginning of a program, you are standing on the top step (1). Every time a block is entered, you step down one step. You can see everything on the current step, and everything on steps above the one you're on, but nothing on the steps below. When you refer to a variable name, you look around on the step you're on to find it. When you see it, you use that value. If you don't see it, you look to the next step above the one you're on. If you see it, you use that value. You do this over and over till you've looked at the very top step but don't see that name. If this happens, you create the name on the step you are standing on (and give it a value, if you're looking it up for an assignment). The next time you look for that name, you'll see it on the step you're standing on, and use it there.
When a block ends, you step up one stair step. Because you can't see any names on the steps below, all the names on the step you were previously on are lost.
If that helps you, think of it that way. If not, don't.
(1) Actually you're on the second step because you're not in global scope, but to use names from global scope, you have to use a $ at the beginning of the name. So in the staircase example, if the name you are looking for has a $ at the beginning, you look directly at the top step. If not, you don't look that far. However, this is kind of wrong, since all the stairs in the program would share the same top step, which is weird to think about.
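A short sketch of both cases from the question. The method names are made up for illustration, and the snippets are wrapped in methods only so that each one gets a fresh local scope.

```ruby
# The variable exists before the block, so the block can see and change it.
def collect_in_outer_variable
  a = []
  [1, 2, 3].each { |n| a << n }
  a
end

# The variable is first assigned inside the block, so it is local to the
# block (a fresh one on every iteration) and is gone once the block ends.
def variable_created_in_block
  [1, 2, 3].each do |n|
    b ||= []
    b << n
  end
  defined?(b) # nil: there is no b in this scope
end

p collect_in_outer_variable # prints [1, 2, 3]
p variable_created_in_block # prints nil
```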
map works much better:
a = [1,2,3].map do |n|
n
end
No need to declare a outside of the block.
It's simple: a variable defined inside a block is not visible outside it (if it were, we'd say the variable had leaked, and as the word suggests, this would be bad):
>> lambda { x = 1 }.call
=> 1
>> x
NameError: undefined local variable or method `x' for main:Object

Caching of data in Mathematica

there is a very time-consuming operation which generates a dataset in my package. I would like to save this dataset and let the package rebuild it only when I manually delete the cached file. Here is my approach as part of the package:
myDataset = Module[{fname, data},
fname = "cached-data.mx";
If[FileExistsQ[fname],
Get[fname],
data = Evaluate[timeConsumingOperation[]];
Put[data, fname];
data]
];
timeConsumingOperation[]:=Module[{},
(* lot of work here *)
{"data"}
];
However, instead of writing the long data set to the file, the Put command only writes one line: "timeConsumingOperation[]", even if I wrap it with Evaluate as above. (To be true, this behaviour is not consistent, sometimes the dataset is written, sometimes not.)
How do you cache your data?
Another caching technique I use very often, especially when you might not want to insert the precomputed form in e.g. a package, is to memoize the expensive evaluation(s), such that it is computed on first use but then cached for subsequent evaluations. This is readily accomplished with SetDelayed and Set in concert:
f[arg1_, arg2_] := f[arg1, arg2] = someExpensiveThing[arg1, arg2]
Note that SetDelayed (:=) and Set (=) group to the right, so the implied order of evaluation is the following, and you don't actually need the parens:
f[arg1_, arg2_] := ( f[arg1, arg2] = someExpensiveThing[arg1, arg2])
Thus, the first time you evaluate f[1,2], the evaluation-delayed RHS is evaluated, so the resulting value is computed and stored with Set as a DownValue of f for the arguments 1 and 2.
@rcollyer is also right in that you don't need to use empty brackets if you have no arguments, you could just as easily write:
g := g = someExpensiveThing[...]
There's no harm in using them, though.
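For readers more at home in another language, the same compute-once-then-store idea can be sketched in Ruby. This is an analogue of the pattern, not Mathematica code; MemoizedThing and its placeholder computation are invented for illustration.

```ruby
# Memoizes a two-argument computation: the block given to Hash.new runs
# only for argument pairs that have not been seen before.
class MemoizedThing
  attr_reader :calls # how many times the expensive part actually ran

  def initialize
    @calls = 0
    @cache = Hash.new do |hash, args|
      @calls += 1
      hash[args] = args.sum # placeholder for the expensive computation
    end
  end

  def call(arg1, arg2)
    @cache[[arg1, arg2]]
  end
end

f = MemoizedThing.new
f.call(1, 2) # computed and stored
f.call(1, 2) # served from the cache
puts "expensive part ran #{f.calls} time(s)"
```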
In the past, whenever I've had trouble with things evaluating it is usually when I have not correctly matched the pattern required by the function. For instance,
f[x_Integers]:= x
which won't match anything. Instead, I meant
f[x_Integer]:=x
In your case, though, you have no pattern to match: timeConsumingOperation[].
Your problem is more likely related to when timeConsumingOperation is defined relative to myDataset. In the code you've posted above, timeConsumingOperation is defined after myDataset. So, on the first run (or immediately after you've cleared the global variables) you would get exactly the result you're describing, because timeConsumingOperation is not yet defined when the code for myDataset is run.
Now, SetDelayed (:=) automatically causes the variable to be recalculated whenever it is used, and since you do not require any parameters to be passed, the square brackets are not necessary. The important point here is that timeConsumingOperation can be declared, as written, prior to myDataset because SetDelayed will cause it not to be executed until it is used.
All told, your caching methodology looks exactly how I would go about it.
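The same "rebuild only when the cache file is missing" approach, sketched in Ruby as an analogue rather than in Mathematica. The method name cached_dataset is hypothetical, and Marshal is just one possible serialization choice.

```ruby
# Returns the value stored at `path` if the file exists; otherwise runs the
# block, stores its result with Marshal, and returns it.
def cached_dataset(path)
  return Marshal.load(File.binread(path)) if File.exist?(path)

  data = yield
  File.binwrite(path, Marshal.dump(data))
  data
end
```

A call such as cached_dataset("cached-data.bin") { time_consuming_operation } would then recompute only after the file has been deleted; time_consuming_operation stands for whatever builds the dataset. As in the answer above, the expensive operation has to be defined before this call is evaluated.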

Garbage collection with Ruby C Extension

I am working my way through Ferret (Ruby port of Lucene) code to solve
a bug. Ferret code is mainly a C extension to Ruby. I am running into
some issues with the garbage collector. I managed to fix it, but I
don't completely understand my fix =) I am hoping someone with deeper
knowledge of Ruby and C extension (this is my 3rd day with Ruby) can
elaborate. Thanks.
Here is the situation:
Somewhere in Ferret C code, I am returning a "Token" to Ruby land.
The code looks like
static VALUE get_token (...)
{
...
RToken *token = ALLOC(RToken);
token->text = rb_str_new2("some text");
return Data_Wrap_Struct(..., &frt_token_mark, &frt_token_free, token);
}
frt_token_mark calls rb_gc_mark(token->text) and frt_token_free
just frees the token with free(token)
In Ruby, this code correlates to the following:
token = @input.next
Basically, @input is set to some object, calling the next method on it
triggers the get_token C call, which returns a token object.
In Ruby land, I then do something like w = token.text.scan('\w+')
When I run this code inside a while 1 loop (to isolate my problem), at
some point (roughly when my ruby process mem footprint goes to 256MB,
probably some GC threshold), Ruby dies with errors like
scan method called on terminated object
Or just core dumps. My guess was that token.text was garbage collected.
I don't know enough about Ruby C extension to know what happens with
Data_Wrap_Struct returned objects. Seems to me the assignment in Ruby
land, token =, should create a reference to it.
My "work-around"/"fix" is to create a Ruby instance variable in the
object referred to by @input, and store the token text in there, to
get an extra reference to it. So the C code looks like
RToken *token = ALLOC(RToken);
token->text = rb_str_new2(tk->text);
/* added code: prevent garbage collection */
rb_ivar_set(input, id_curtoken, token->text);
return Data_Wrap_Struct(cToken, &frt_token_mark, &frt_token_free, token);
So now I've created a "curtoken" instance variable on the input object, and
saved a copy of the text there... I've taken care to remove/delete
this reference in the free callback of the class for @input.
With this code, it works in that I no longer get the terminated object
error.
The fix seems to make sense to me -- it keeps an extra ref in curtoken
to the token.text string so an instance of token.text won't be removed
until the next time @input.next is called (at which time a different
token.text replaces the old value in curtoken).
My question is: why did it not work before? Shouldn't
Data_Wrap_Struct return an object that, when assigned in Ruby land,
has a valid reference and is not removed by Ruby?
Thanks.
When the Ruby garbage collector is invoked, it has a mark phase and a sweep phase. The mark phase marks every object that is still reachable, which includes:
all objects referenced by a Ruby stack frame (e.g. local variables),
all globally accessible objects (e.g. referred to by a constant or global variable) and their children/referents, and
all objects referred to by a reference on the C stack, as well as those objects' children/referents,
as well as a number of other objects that are not important to this discussion. The sweep phase then destroys any objects that are not accessible (i.e. those that were not marked).
Data_Wrap_Struct returns a reference to an object. As long as that reference is available to ruby code (e.g. stored in a local variable) or is on the stack (referred to by a local C variable), the object should not be swept.
From what you've posted, it looks like token->text is getting garbage collected. But why is it getting collected? It must not be getting marked. Is the Token object itself getting marked? If it is, then token->text should be getting marked. Try setting a breakpoint or printing a message in the token's mark function to see.
If the token is not getting marked, then the next step is to figure out why. If it is getting marked, then the next step is to figure out why the string returned by the text() method is getting swept (maybe it's not the same object that is getting marked).
Also, are you sure that it is the token's text member that is causing the exception? Looking at:
http://github.com/dbalmain/ferret/blob/master/ruby/ext/r_analysis.c
I see that the token and the token stream both have text() methods. The TokenStream struct doesn't hold a reference to its text object (it can't, as it's a C struct with no knowledge of ruby). Thus, the Ruby object wrapping the C struct needs to hold the reference (and this is being done with rb_ivar_set).
The RToken struct shouldn't need to do this, because it marks its text member in its mark function.
One more thing: you may be able to reproduce this bug by calling GC.start explicitly in your loop rather than having to allocate so many objects that the garbage collector kicks in. This won't fix the problem but might make diagnosis simpler.
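A minimal sketch of such a reproduction loop, assuming CRuby. The helper name with_forced_gc is made up, and the body of the block is a stand-in, since the real failing code needs a Ferret token stream.

```ruby
# Runs the block `iterations` times, forcing a full collection after each
# run, so that a missing mark shows up early instead of at some threshold.
def with_forced_gc(iterations)
  iterations.times do |i|
    yield i
    GC.start
  end
end

with_forced_gc(100) do
  # Stand-in for the failing code from the question, for example:
  #   token = @input.next
  #   token.text.scan(/\w+/)
  "some text".scan(/\w+/)
end
```

CRuby also has GC.stress = true, which forces a collection at every opportunity. It is much slower, but it can make this kind of bug fail almost immediately.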
Perhaps mark it as volatile:
http://www.justskins.com/forums/chasing-a-garbage-collection-bug-98766.html
Maybe your compiler is keeping its reference in a register instead of on the stack... There is some way mentioned, I think in README.EXT, to force an object never to be GC'ed, but the question still remains as to why it's being collected early.
