Context and motivation
This is a question about releasing the GIL in CPython while still working with Python objects, but in a limited way.
For performance reasons, I need to jump into GIL-less mode while maintaining read-only access to existing Python objects. The objects are guaranteed to exist throughout, and no other thread mutates them (no deletes, appends, etc.). In other words, they are "const" for all practical purposes: no reference counting is needed and no dynamic allocations take place.
Question
Assume we have a native Python object on input, such as a list, dict or int. We can work with this object as a PyObject * using the C API, as normal.
Q: Is it safe to release the GIL and then call read-only ("const") functions on this object? For example, call PyList_GET_ITEM(lst, 0), PyInt_AS_LONG(i) or PyDict_GetItem(dct, key)?
If not, why not?
More generally, what is the CPython contract regarding (lack of) object mutation and dynamic allocations, and the need for synchronization via the GIL?
What I tried
I created a Python list of integers, then released the GIL, iterated over the list elements using the standard PyList_Size and PyList_GET_ITEM, summed the integer values inside, and re-acquired the GIL. All worked fine and the results came out correct. There was no need to incref/decref anything, since no Python objects were created, deleted or mutated.
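For reference, here is a minimal sketch of the experiment, assuming it runs inside an extension function that already holds the GIL on entry (Python 3 names, so PyLong_AsLong stands in for the PyInt_AS_LONG mentioned above). Whether this pattern is actually sanctioned by CPython is exactly what the question asks.

    #include <Python.h>

    /* Hypothetical helper: sums a list of ints with the GIL released.
     * Assumes the caller holds the GIL on entry and that no other thread
     * mutates lst while we read it. */
    static long sum_list_without_gil(PyObject *lst)
    {
        long total = 0;
        Py_ssize_t n = PyList_Size(lst);    /* GIL still held here */

        Py_BEGIN_ALLOW_THREADS              /* release the GIL */
        for (Py_ssize_t i = 0; i < n; i++) {
            /* PyList_GET_ITEM reads the item slot directly and returns a
             * borrowed reference; no refcount is touched. */
            PyObject *item = PyList_GET_ITEM(lst, i);
            total += PyLong_AsLong(item);   /* read-only access */
        }
        Py_END_ALLOW_THREADS                /* re-acquire the GIL */

        return total;
    }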
But this experiment is no proof, of course; there may be side effects or limitations.
What I'm NOT asking
whether I can release the GIL and work with Python objects
what is the GIL
where is the CPython documentation
what will some non-CPython implementation do
Related
I can see that related questions have been asked before:
In some languages, there's a specific way to force garbage collection. For example in R, we can call gc() and it will free up memory that was previously used to store objects that have since been removed.
Is there any way to do this in Ruby?
In case it's relevant, I'm running a very long loop and I think it's slowly accumulating a little memory use each iteration, so I would like to force garbage collection every 100th iteration or so just to be sure, e.g. (pseudocode) if index % 100 == 0 then gc(). Also note that I intend to use this in a Rails app, although I don't think that's relevant (since garbage collection would be entirely a Ruby feature, nothing to do with Rails).
No, there is no way to do it in Ruby.
There is a method called GC::start, and the documentation even says:
Initiates garbage collection, even if manually disabled.
But that is not true. GC::start is simply an advisory from your code to the runtime that it would be safe for your application to run a garbage collection now. But that is only a suggestion. The runtime is free to ignore this suggestion.
The majority of programming languages with automatic memory management do not give the programmer control over the garbage collector.
If Ruby had a method to force a garbage collection, then it would be impossible to implement Ruby on the JVM, and neither JRuby nor TruffleRuby could exist; it would be impossible to implement Ruby on .NET, and IronRuby couldn't exist; it would be impossible to implement Ruby on ECMAScript, and Opal couldn't exist; and it would be impossible to implement Ruby using existing high-performance garbage collectors, and RubyOMR couldn't exist.
Since it is generally desirable to give implementors freedom to implement optimizations and make the language faster, languages are very cautious about specifying features that so drastically restrict what an implementor can do.
I am quite surprised that R has such a feature, especially since it means it is impossible to build high-performance implementations like FastR in a way that is compliant with the language specification. FastR can be more than 35× faster than GNU R, so it is obvious why it is desirable for something like FastR to exist. But one of the ways FastR achieves this is by using a third-party high-performance garbage-collected runtime (either GraalVM or the JVM) that does not allow control over garbage collection, and thus FastR can never be a compliant R implementation.
Interestingly, the documentation of gc() has this to say:
[T]he primary purpose of calling gc is for the report on memory usage.
This is done via GC.start.
This page has been quite confusing for me.
It says:
Memory management in newLISP does not rely on a garbage collection algorithm. Memory is not marked or reference-counted. Instead, a decision whether to delete a newly created memory object is made right after the memory object is created.
newLISP follows a one reference only (ORO) rule. Every memory object not referenced by a symbol is obsolete once newLISP reaches a higher evaluation level during expression evaluation. Objects in newLISP (excluding symbols and contexts) are passed by value copy to other user-defined functions. As a result, each newLISP object only requires one reference.
Further down, I see:
All lists, arrays and strings are passed in and out of built-in functions by reference.
I can't make sense of these two.
How can newLISP "not rely on a garbage collection algorithm", and yet pass things by reference?
For example, what would it do in the case of circular references?!
Is it even possible for a LISP to not use garbage collection, without making performance go down the drain? (I assume you could always pass things by value, or you could always perform a full-heap scan whenever you think it might be necessary, but then it seems to me like that would insanely hurt your performance.)
If so, how would it deal with circular references? If not, what do they mean?
Perhaps reading http://www.newlisp.org/ExpressionEvaluation.html helps in understanding the http://www.newlisp.org/MemoryManagement.html paper better. Regarding circular references: they do not exist in newLISP; there is no way to create them. The performance question is addressed in a subchapter of that memory management paper and here: http://www.newlisp.org/benchmarks/
Maybe working and experimenting with newLISP, i.e. trying to create a circular reference, will clear up most of the questions.
R has pass-by-value semantics, which minimizes accidental side effects (a good thing). However, when code is organized into many functions/methods for reusability/readability/maintainability, and when that code needs to push large data structures such as big data frames through a series of transformations/operations, the pass-by-value semantics leads to a lot of copying of data around and much heap thrashing (a bad thing). For example, a data frame that takes 50 MB on the heap and is passed as a function parameter will be copied at a minimum as many times as the function call depth, and the heap usage at the bottom of the call stack will be N*50 MB. If the functions return a transformed/modified data frame from deep in the call chain, then the copying goes up by another N.
The SO question What is the best way to avoid passing a data frame around? touches this topic but is phrased in a way that avoids directly asking the pass-by-reference question and the winning answer basically says, "yes, pass-by-value is how R works". That's not actually 100% accurate. R environments enable pass-by-reference semantics and OO frameworks such as proto use this capability extensively. For example, when a proto object is passed as a function argument, while its "magic wrapper" is passed by value, to the R developer the semantics are pass-by-reference.
It seems that passing a big data frame by reference would be a common problem and I'm wondering how others have approached it and whether there are any libraries that enable this. In my searching I have not discovered one.
If nothing is available, my approach would be to create a proto object that wraps a data frame. I would appreciate pointers about the syntactic sugar that should be added to this object to make it useful, e.g., overloading the $ and [[ operators, as well as any gotchas I should look out for. I'm not an R expert.
Bonus points for a type-agnostic pass-by-reference solution that integrates nicely with R, though my needs are exclusively with data frames.
The premise of the question is (partly) incorrect. R works as pass-by-promise, and there is repeated copying in the manner you outline only when further assignments and alterations to the data frame are made as the promise is passed on. So the number of copies will not be N*size where N is the stack depth, but rather N*size where N is the number of levels at which assignments are made. You are correct, however, that environments can be useful. I see on following the link that you have already found the 'proto' package. There is also the relatively recent introduction of "reference classes", sometimes referred to as "R5"; the name follows on from S3 (the original class system of S, copied in R) and S4 (the more recent class system that largely supports Bioconductor package development).
Here is a link to an example by Steve Lianoglou (in a thread discussing the merits of reference classes) of embedding an environment inside an S4 object to avoid the copying costs:
https://stat.ethz.ch/pipermail/r-help/2011-September/289987.html
Matthew Dowle's 'data.table' package creates a new class of data object whose access semantics using "[" differ from those of regular R data.frames, and which really works as pass-by-reference. It has superior speed of access and processing. It can also fall back on data.frame semantics, since such objects now inherit the 'data.frame' class.
You may also want to investigate Hesterberg's dataframe package.
My team is working on an MMO server in Ruby, and we opted to start moving computationally intensive operations into a C extension. As part of that effort, we moved the actual data storage into C (using Data_Get_Struct and all that). So, for example, each Ruby "Zone" object has an associated "ZoneKernel::Zone" C struct where the actual binary data is stored.
Basically, I'm wondering if this is a terrible idea or not. I'm not super familiar with the Ruby internals, but it seems like the data should be fine as long as the parent Zone stays in memory on the Ruby side (thus preventing garbage collection of the C data).
One caveat is that we've been getting semi-regular "Stack Consistency Errors" that crash our server. This seems like potentially a related memory issue (instead of just your garden-variety segfault); if anyone has any knowledge of what that might be, I would appreciate that as well!
As stated in the documentation for the Data_Wrap_Struct(klass, mark, free, ptr) function:
The free argument is the function to free the pointer allocation. If this is -1, the pointer will be just freed.
These mark / free functions are invoked during GC execution.
Your wrapped native structure will be automatically freed when its corresponding Ruby object is finalized. Until that happens, your data will not be freed unless you do so manually.
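For illustration, here is a minimal sketch of that pattern using the classic Data_Wrap_Struct / Data_Get_Struct API, with a hypothetical Zone struct standing in for the real ZoneKernel data:

    #include <ruby.h>

    /* Hypothetical zone data; the real ZoneKernel::Zone fields would go here. */
    typedef struct {
        int width;
        int height;
    } Zone;

    /* Called by the GC when the wrapping Ruby object is finalized. */
    static void zone_free(void *ptr)
    {
        xfree(ptr);
    }

    static VALUE zone_alloc(VALUE klass)
    {
        Zone *zone = ALLOC(Zone);
        zone->width = 0;
        zone->height = 0;
        /* No VALUEs are stored inside the struct, so the mark callback can be
         * NULL; if the struct held Ruby objects, a mark function would have to
         * rb_gc_mark() each of them to keep them alive. */
        return Data_Wrap_Struct(klass, NULL, zone_free, zone);
    }

    static VALUE zone_width(VALUE self)
    {
        Zone *zone;
        Data_Get_Struct(self, Zone, zone);   /* recover the C pointer */
        return INT2NUM(zone->width);
    }

As long as the Ruby object returned by zone_alloc remains reachable, the GC will not call zone_free, which matches the behaviour described above.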
Writing C extensions doesn't guarantee a performance boost, but it almost always increases the complexity of your code. Profile your server in order to measure your performance gains, and develop your Zone class in pure Ruby if viable.
In general I like to keep any data that could change outside of my source. Loading it from YAML or a database means you can tweak the data to your heart's content without needing to recompile. Obviously, if your compile time and load times are fast then it's not as big an issue, but it's still a good idea to separate the two.
I favor YAML, because it's a standard format, so you could access the same file from any number of languages. You could load it directly into the C side, or into the Ruby side, depending on what seems faster/smarter.
Does anyone know of a GC algorithm which utilises type information to allow incremental collection, optimised collection, parallel collection, or some other nice feature?
By type information, I mean real semantics. Let me give an example: suppose we have an OO style class with methods to maintain a list which hide the representation. When the object becomes unreachable, the collector can just run down the list deleting all the nodes. It knows they're all unreachable now, because of encapsulation. It also knows there's no need to do a general scan of the nodes for pointers, because it knows all the nodes are the same type.
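To make that concrete, here is a rough C sketch of the encapsulated-list case (hypothetical types); the bulk-free below is what a type-directed collector would ideally derive on its own from the fact that the owner holds the only references to the nodes:

    #include <stdlib.h>

    typedef struct Node {
        long value;
        struct Node *next;
    } Node;

    typedef struct {
        Node *head;    /* by encapsulation, the only reference to the nodes */
    } IntList;

    /* When the owning list becomes unreachable, every node is unreachable
     * too, so we can run down the list freeing nodes without scanning any
     * of them for pointers: they are all known to be plain Nodes. */
    static void intlist_destroy(IntList *list)
    {
        Node *node = list->head;
        while (node != NULL) {
            Node *next = node->next;
            free(node);
            node = next;
        }
        free(list);
    }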
Obviously, this is a special case and is easily handled with destructors in C++. The real question is whether there is a way to analyse the types used in a program and direct the collector to use the resulting information to its advantage. I guess you'd call this a type-directed garbage collector.
The idea of at least exploiting containers for garbage collection in some way is not new, though in Java, you cannot generally assume that a container holds the only reference to objects within it, so your approach will not work in that context.
Here are a couple of references. One is for leak detection, and the other (from my research group) is about improving cache locality.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4814126
http://www.cs.umass.edu/~emery/pubs/06-06.pdf
You might want to visit Richard Jones's extensive garbage collection bibliography for more references, or ask the folks on gc-list.
I don't think it has anything to do with a specific algorithm.
When the GC computes the graph of object relationships, the information that a Collection object is solely responsible for the elements of its list is implicitly present in the graph, provided the compiler was good enough to extract it.
Whatever GC algorithm is chosen, the result depends more on whether the compiler/runtime extracts that information in the first place.
Also, I would avoid C and C++ with GC. Because of pointer arithmetic, aliasing and the possibility of pointing into the interior of an object (a reference to a data member or into an array), it is incredibly hard to perform accurate garbage collection in these languages. They were not crafted for it.