Garbage collection with Ruby C Extension - ruby

I am working my way through Ferret (Ruby port of Lucene) code to solve
a bug. Ferret code is mainly a C extension to Ruby. I am running into
some issues with the garbage collector. I managed to fix it, but I
don't completely understand my fix =) I am hoping someone with deeper
knowledge of Ruby and C extension (this is my 3rd day with Ruby) can
elaborate. Thanks.
Here is the situation:
Some where in Ferret C code, I am returning a "Token" to Ruby land.
The code looks like
static VALUE get_token (...)
{
...
RToken *token = ALLOC(RToken);
token->text = rb_str_new2("some text");
return Data_Wrap_Struct(..., &frt_token_mark, &frt_token_free, token);
}
frt_token_mark calls rb_gc_mark(token->text) and frt_token_free
just frees the token with free(token)
In Ruby, this code correlates to the following:
token = #input.next
Basically, #input is set to some object, calling the next method on it
triggers the get_token C call, which returns a token object.
In Ruby land, I then do something like w = token.text.scan('\w+')
When I run this code inside a while 1 loop (to isolate my problem), at
some point (roughly when my ruby process mem footprint goes to 256MB,
probably some GC threshold), Ruby dies with errors like
scan method called on terminated object
Or just core dumps. My guess was that token.text was garbage collected.
I don't know enough about Ruby C extension to know what happens with
Data_Wrap_Struct returned objects. Seems to me the assignment in Ruby
land, token =, should create a reference to it.
My "work-around"/"fix" is to create a Ruby instance variable in the
object referred to by #input, and stores the token text in there, to
get an extra reference to it. So the C code looks like
RToken *token = ALLOC(RToken);
token->text = rb_str_new2(tk->text);
/* added code: prevent garbage collection */
rb_ivar_set(input, id_curtoken, token->text);
return Data_Wrap_Struct(cToken, &frt_token_mark, &frt_token_free, token);
So now I've created a "curtoken" in the input instance variable, and
saved a copy of the text there... I've taken care to remove/delete
this reference in the free callback of the class for #input.
With this code, it works in that I no longer get the terminated object
error.
The fix seems to make sense to me -- it keeps an extra ref in curtoken
to the token.text string so an instance of token.text won't be removed
until the next time #input.next is called (at which time a different
token.text replaces the old value in curtoken).
My question is: why did it not work before? Shouldn't
Data_Wrap_Structure return an object that, when assigned in Ruby land,
has a valid reference and not be removed by Ruby?
Thanks.

When the Ruby garbage collector is invoked, it has a mark phase and a sweep phase. The mark phase marks all objects in the system by marking:
all objects referenced by a ruby stack frame (e.g. local variables)
all globally accessible objects (e.g. referred to by a constant or global variable) and their children/referents, and
all objects referred to by a reference on the stack, as well as those objects' children/referents.
as well as a number of other objects that are not important to this discussion. The sweep phase then destroys any objects that are not accessible (i.e. those that were not marked).
Data_Wrap_Struct returns a reference to an object. As long as that reference is available to ruby code (e.g. stored in a local variable) or is on the stack (referred to by a local C variable), the object should not be swept.
It's looks like from what you've posted that token->text is getting garbage collected. But why is it getting collected? It must not be getting marked. Is the Token object itself getting marked? If it is, then token->text should be getting marked. Try setting a breakpoint or printing a message in the token's mark function to see.
If the token is not getting marked, then the next step is to figure out why. If it is getting marked, then the next step is to figure out why the string returned by the text() method is getting swept (maybe it's not the same object that is getting marked).
Also, are you sure that it is the token's text member that is causing the exception? Looking at:
http://github.com/dbalmain/ferret/blob/master/ruby/ext/r_analysis.c
I see that the token and the token stream both have text() methods. The TokenStream struct doesn't hold a reference to its text object (it can't, as it's a C struct with no knowledge of ruby). Thus, the Ruby object wrapping the C struct needs to hold the reference (and this is being done with rb_ivar_set).
The RToken struct shouldn't need to do this, because it marks its text member in its mark function.
One more thing: you may be able to reproduce this bug by calling GC.start explicitly in your loop rather than having to allocate so many objects that the garbage collector kicks in. This won't fix the problem but might make diagnosis simpler.

perhaps mark as volatile:
http://www.justskins.com/forums/chasing-a-garbage-collection-bug-98766.html
maybe your compile is keeping its reference in a registry instead of the stack...there is some way mentioned I think in README.EXT to force an object to never be GC'ed, but...the question still remains as to why it's being collected early...

Related

Python 3 psutil, extracting pid and name values?

For Python 3.5 on Windows 7 64-bit and the psutil 5 library.
I'm confused as to how to properly access the name and pid information provided within each class 'psutil.Process' item returned by psutil.process_iter().
The following code returns a class 'generator' object:
import psutil
curProcesses = psutil.process_iter()
A simple loop outputs each class 'psutil.Process' object contained within that generator:
for eachProcess in curProcesses:
print(eachProcess)
OUTPUT:
psutil.Process(pid=0, name='System Idle Process')
psutil.Process(pid=4, name='System')
... and so on ...
Now here's where I'm getting confused.
IF I modify the previous loop as follows, THEN I'll get an integer pid and a string name.
for eachProcess in curProcesses:
# Observe the two different forms of access.
print(eachProcess.pid)
print(eachProcess.name())
OUTPUT:
0
System Idle Process
4
System
... and so on ...
The resulting integer and string are exactly what I want. However, after several experiments I can only get them IF:
eachProcess.pid is NOT followed by parentheses ala eachProcess.pid. (Adding parentheses produces a TypeError: 'int' object is not callable exception.)
eachProcess.name is followed by parentheses ala eachProcess.name(). (Removing the parentheses returns a bound method Process.name instead of the name as a string.)
Why do the two keyword-looking arguments pid and name behave differently? (I suspect I'm about to learn something very useful about Python 3 generator objects...)
There is not much to it really: pid is a read-only attribute (created with the #property decorator), and name() is a method, both of the Process class. Methods need parens to be called in order to return data, while attributes are accessed without them. This bit of documentation might be helpful. Also, if you find it helpful, you can see the difference in implementation between name() and pid.
As far as why pid is a read-only attribute of Process, while a method name() needs to be called in order to get the process name, I am not sure. In fact, pid appears to be the only read-only attrubute on the process class, while all other information about the process instance is retrieved through method calls. Personally it seems inconsistent/non-standard to do it this way, but I assume that there is a good reason for this choice. I assume the main reason is so the PID cannot be changed accidentally since it is a crucial component. If the PID were a method instead of a read-only attribute, then eachProcess.pid = 123 in your loop would change the method to the int 123, while the way it currently is, this attempted reassignment will raise an error instead, so that the PID is protected in a sense, while eachProcess.name = 'foo' will probably go by without raising an error.
Also, note that while they may look like keyword arguments in the way they appear in the string representations of Process class instances, name() and pid are not keyword arguments (although pid can be passed as a keyword argument when creating a Process instance, but that is not what is happening here).
I made pid a class attribute / property mainly for consistency with subprocess.Popen.pid and multiprocessing.Process.pid stdlib modules (also threading.Thread.ident comes to mind).
Also, it's something which does not require any calculation (contrarily from name(), cmdline() etc.) and it never changes, so something read-only made more sense to me at the time.
It's a property rather than a mere attribute just to error out in case the user tries to change/set it.

Type mismatch when calling a function in qtp

I am using QTP 11.5 for automating a web application.I am trying to call an action in qtp through driverscript as below:
RFSTestPath = "D:\vf74\D Drive\RFS Automation\"
LoadAndRunAction RFStestPath & LogInApplication,"Action1",oneIteration
Inside the LogInApplication(Action1) am calling a login function as:
Call fncLogInApplication(strURL,strUsesrName,strPasssword)
Definition of fncLogInApplication is written in fncLogInApplication.vbs
When I associate the fncLogInApplication.vbs file to driverscript, am able to execute my code without any errors. But when I de-associate .vbs file from driverscript and associate it to LogInApplication test am getting "Type mismatch: 'fncLogInApplication'"
Can anyone help me in the association please. I want fncLogInApplication to be executed when I associate to LogInApplication not to the main driverscript.
Please comment back if you require any more info
There is only one set of associated libraries that is active at any one time: That is always the outermost test's one.
This means if test A calls test B, test B will be executed with the libraries loaded based upon test A´s associated libraries list, not B's.
This also means that if B depends on a library, and B associated this library, but is called from test A (which does not associated this library), then B will fail to call (locate) the function since the associated libraries of B are never loaded (only those from A are). (As would A, naturally.).
If you are still interested: "Type mismatch" is QTPs (or VBScript´s) poor way of telling you: "The function called is not known, so I bet you instead meant an array variable dereference, and the variable you specified is equal to empty, so it is not an array, and thus cannot be dereferenced as an array variable, which is what I call a 'type mismatch'."
This reasoning is valid, considering the syntax tree of VB/VBScript: Function calls and array variable dereferences cannot be formally differentiated. Syntactically, they are very similar, or identical in most cases. So be prepared to handle "Type mismatch" like the "Unknown function referenced" message that VB/VBScript never display when creating VBScript code.
You can, however, load the library you want in test B´s code (for example, using LoadFunctionLibrary), but this still allows A to call functions from that library once B loaded it and returned from A´s call. This, and all the possible variations of this procedure, however, have side-effects to aspects like debugging, forward references and visibility of global variables, so I would recommend against it.
Additional notes:
There is no good reason to use CALL. Just call the sub or function.
If you call a function and use the result it returns, you must include the arguments in parantheses.
If you call a sub (or a function, and don´t use the result it returns), you must not include the arguments in parantheses. If the sub or function accepts only one argument, it might look like you are allowed to put it in parantheses, but this is not true. In this case, the argument is simply treated like a term in parantheses.
The argument "bracketing" aspects just listed can create very nasty bugs, especially if the argument is byRef, also due (but not limited) to the fact that VBScripts unfortunately allows you to pass values for a byRef argument (where a variable parameter is expected), so it is generally a good idea to put paranthesis only where it belongs (i.e. where absolutely needed).

Is Ruby garbage collection affected by intermediate variables?

Does creating intermediate variables cause the garbage collector to do more work?
That is, is there any difference between:
output = :asdf.to_s.upcase
and
str = :asdf.to_s
output = str.upcase
? (Assume str is never referenced again.)
It would be a trivial amount of extra work when marking objects still referenced, assuming both str and output were still in scope (i.e. the binding where they exist was still active) when the GC mark phase began. Both variables would start a mark on the same string. I don't know, but suspect that when marking objects as still viable, if Ruby comes across an item already marked, it will probably stop recursing and go to its next item at the same level. In this case the String is a single object without child objects to mark further, so it's one quick call to rb_gc_mark repeated for each reference to the String - one case where it is marked, and another case where Ruby notes it has already been marked and stops recursing.
If neither variable were in any active binding when GC mark phase began, it is no extra work, the String referenced would not get marked (no work) and the sweep phase would delete it just once (same work no matter how many references were active before).

Garbage collection in Ruby 2, what happens to ignored method's returned values?

Let's say we have a Ruby class like this:
class MyClass
def my_method(p)
# do some cool stuff with a huge amount of objects
my_objects = ...
return my_objects
end
end
And somewhere else in the application there's a function that calls MyClass's my_method, pretty much like this:
def my_func
#doing some stuff ..
MyClass.my_method(some_param)
#doing other stuff ..
end
What happens to the list of objects, is it eligible for garbage collection? Is it possible to know roughly when it's going to be collected?
Is there a way to "mark" the list as eligible for GC? Maybe like this:
def my_func
#doing some stuff ..
objects = MyClass.my_method(some_param)
objects = nil #does this make any difference?
#doing other stuff ..
end
GC destroys all objects which are not being referenced by your code. By setting objects to nil, you change reference of the variable, hence objects will be GCed, but exactly same thing is going to happen if you go with the first code. The real question is - why do you need for this object to be GB at precise moment - it shouldn't affect your code at all.
If you really want to have better control over garbage collection you can look at GC class: http://www.ruby-doc.org/core-1.9.3/GC.html. Note that you can rerun GC.start, which will force GC to run at that precise moment (even if there is nothing to collect).
Items returned from the function are eligible for being collected once nothing else points to them.
So, if you ignore the return value and really nothing more remembers those objects, than yes, thay can be GC'ed.
So, if you store the result in objects variable, then the returned values will be 'pinned'**) as long as the objects variable still remembers them***). When you nil that variable, they will be released and pending for collection. Nilling that variable may speed up their collection, but does not necessarily have to. *)
UNLESS anything other still remembers them. If between the objects=f() and objects=nil you read the values from objects variable and pass them to other functions/methods, and if they happen to store those objects, then of course it will pin them too, and "nilling" will help a bit in releasing the resources but not cause any immediate collection.*)
(*) In general, in environments with GC, you never actually know when the GC will run and what will it collect. You just know that objects that were forgotten by everyone will eventually be automatically removed. Nothing more. Theoreticaly, GC may choose to not run at all if your machine has terabytes of free memory.
(**) in some environments (like .Net) "pinning" is a precise term. Here I said it like that just to help you imagine how it works. I do not mean real pinning of memory blocks for communication with lower-level libraries, etc.
(***) When where's an object A remembers object B which remembers object C, and if the "B" becomes forgotten and if only B (and noone else) rememebers the C, then both B and C are GC'ed. So, you don't have to nil the objects. If the thing that contains objects variable at some point becomes 'forgotten', then both the "outer thing", and "objects" and the "returned items" will be GC'ed. At least should be, if GC implementation is OK.
This leaves one more thing to say: I do not say about GC in Ruby 2.0. All I've said was about garbage collectors in general. It applies also to Java, .Net, ObjC (with GC) and others. If you need to know precisely what happens in Ruby 2.0 and what are the gory details of GC implementation - ask directly about that :)

Initialize member variables in a method and not the constructor

I have a public method which uses a variable (only in the scope of the public method) I pass as a parameter we will call A, this method calls a private method multiple times which also requires the parameter.
At present I am passing the parameter every time but it looks weird, is it bad practice to make this member variable of the class or would the uncertainty about whether it is initialized out way the advantages of not having to pass it?
Simplified pseudo code:
public_method(parameter a)
do something with a
private_method(string_a, a)
private_method(string_b, a)
private_method(string_c, a)
private_method(String, parameter a)
do something with String and a
Additional information: parameter a is a read only map with over 100 entries and in reality I will be calling private_method about 50 times
I had this same problem myself.
I implemented it differently in 3 different contexts to see hands-on what are result using 3 different strategies, see below.
Note that I am type of programmer that makes many changes to the code always trying to improve it. Thus I settle only for the code that is amenable to changes, readbale, would you call this "flexible" code. I settle only for very clear code.
After experimentation, I came to these results:
Passing a as parameter is perfectly OK if you have one or two - short number - of such values. Passing in parmeters has very good visibility, clarity, clear passing lines, well visible lifetime (initialization points, destruction points), amenable to changes, easy to track.
If number of such values begin to grow to >= 5-6 values, I swithc to approach #3 below.
Passing values through class members -- did not do good to clarity of my code, eventually I got rid of it. It makes for less clear code. Code becomes muddled. I did not like it. It had no advantages.
As alternative to (1) and (2), I adopted Inner class approach, in cases when amount of such values is > 5 (which makes for too long argument list).
I pack those values into small Inner class and pass such object by reference as argument to all internal members.
Public function of a class usually creates an object of Inner class (I call is Impl or Ctx or Args) and passes it down to private functions.
This combines clarity of arg passing with brevity. It's perfect.
Good luck
Edit
Consider preparing array of strings and using a loop rather than writing 50 almost-identical calls. Something like char *strings[] = {...} (C/C++).
This really depends on your use case. Does 'a' represent a state that your application/object care about? Then you might want to make it a member of your object. Evaluate the big picture, think about maintenance, extensibility when designing structures.
If your parameter a is a of a class of your own, you might consider making the private_method a public method for the variable a.
Otherwise, I do not think this looks weird. If you only need a in just 1 function, making it a private variable of your class would be silly (at least to me). However, if you'd need it like 20 times I would do so :P Or even better, just make 'a' an object of your own that has that certain function you need.
A method should ideally not pass more than 7 parameters. Using the number of parameters more than 6-7 usually indicates a problem with the design (do the 7 parameters represent an object of a nested class?).
As for your question, if you want to make the parameter private only for the sake of passing between private methods without the parameter having anything to do with the current state of the object (or some information about the object), then it is not recommended that you do so.
From a performance point of view (memory consumption), reference parameters can be passed around as method parameters without any significant impact on the memory consumption as they are passed by reference rather than by value (i.e. a copy of the data is not created). For small number of parameters that can be grouped together you can use a struct. For example, if the parameters represent x and y coordinates of a point, then pass them in a single Point structure.
Bottomline
Ask yourself this question, does the parameter that you are making as a members represent any information (data) about the object? (data can be state or unique identification information). If the answer to his question is a clear no, then do not include the parameter as a member of the class.
More information
Limit number of parameters per method?
Parameter passing in C#

Resources