Synthesize to smart pointers with Boost Spirit X3

I need to parse a complex AST, and it would be impossible to allocate this AST on heap memory; the AST nodes must also support polymorphism. One solution would be to allocate the AST nodes using smart pointers.
To simplify the question: how would I synthesize the following struct (as a std::unique_ptr<GiantIntegerStruct> giantIntegerStruct) with Boost Spirit X3, for example?
struct GiantIntegerStruct {
    std::vector<std::unique_ptr<int>> manyInts;
};
My tentative solution is to use semantic actions. Is there an alternative?

You can use semantic actions, or you can define traits for your custom types. However, see Semantic actions runs multiple times in boost::spirit parsing (especially the two links there) - basically, consider not doing that.
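For what it's worth, here is a minimal semantic-action sketch (my names, and assuming the input is just a whitespace-separated list of integers):

#include <boost/spirit/home/x3.hpp>
#include <memory>
#include <string>
#include <vector>

namespace x3 = boost::spirit::x3;

struct GiantIntegerStruct {
    std::vector<std::unique_ptr<int>> manyInts;
};

int main() {
    std::string input = "1 2 3";
    GiantIntegerStruct gis;

    // Wrap each parsed int in a unique_ptr by hand, bypassing
    // automatic attribute propagation.
    auto push_int = [&gis](auto& ctx) {
        gis.manyInts.push_back(std::make_unique<int>(x3::_attr(ctx)));
    };

    auto first = input.begin();
    bool ok = x3::phrase_parse(first, input.end(),
                               *x3::int_[push_int], x3::space);
    // Caveat from the linked answer: on backtracking grammars the
    // action may fire for input that is later rejected.
    return (ok && first == input.end()) ? 0 : 1;
}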
As for "I need to parse a complex AST, and it would be impossible to allocate this AST on heap memory" - this somewhat confusing statement leads me to the logical conclusion that you merely need to allocate from a shared memory segment instead.
In the good old spirit of the Rule Of Zero, you could make a value-wrapper that does the allocation using whatever method you prefer and still enjoy automatic attribute propagation with "value semantics" (the wrappers will serve as mere "handles" for the actual objects in shared memory).
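A bare-bones sketch of such a wrapper (names hypothetical; std::shared_ptr stands in here for whatever shared-memory allocation you actually prefer):

#include <memory>
#include <utility>

// Rule-of-Zero value wrapper: copies and moves are handled entirely by
// the shared_ptr member, so Spirit's automatic attribute propagation
// sees plain value semantics while the payload lives wherever the
// allocator put it.
template <typename T>
class Handle {
    std::shared_ptr<T> impl_;
public:
    Handle() : impl_(std::make_shared<T>()) {}
    explicit Handle(T value) : impl_(std::make_shared<T>(std::move(value))) {}
    T&       get()       { return *impl_; }
    const T& get() const { return *impl_; }
};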
If you need any help getting this set up, feel free to post a new question.

Related

When to use references versus types versus boxes and slices versus vectors as arguments and return types?

I've been working with Rust for the past few days to build a new library (related to abstract algebra), and I'm struggling with some of the best practices of the language. For example, I implemented a longest-common-subsequence function taking &[&T] for the sequences. I figured this was Rust convention, as it avoids copying the data (T, which may not be easily copyable, or may be big). When changing my algorithm to work with the simpler &[T], which I needed elsewhere in my code, I was forced to add the Copy type constraint, since it needed to copy the T's and not just a reference.
So my higher-level question is: what are the best-practices for passing data between threads and structures in long-running processes, such as a server that responds to queries requiring big data crunching? Any specificity at all would be extremely helpful as I've found very little. Do you generally want to pass parameters by reference? Do you generally want to avoid returning references as I read in the Rust book? Is it better to work with &[&T] or &[T] or Vec<T> or Vec<&T>, and why? Is it better to return a Box<T> or a T? I realize the word "better" here is considerably ill-defined, but hope you'll understand my meaning -- what pitfalls should I consider when defining functions and structures to avoid realizing my stupidity later and having to refactor everything?
Perhaps another way to put it is, what "algorithm" should my brain follow to determine where I should use references vs. boxes vs. plain types, as well as slices vs. arrays vs. vectors? I hesitate to start using references and Box<T> returns everywhere, as I think that'd get me a sort of "Java in Rust" effect, and that's not what I'm going for!

Does newLISP use garbage collection?

This page has been quite confusing for me.
It says:
Memory management in newLISP does not rely on a garbage collection algorithm. Memory is not marked or reference-counted. Instead, a decision whether to delete a newly created memory object is made right after the memory object is created.
newLISP follows a one reference only (ORO) rule. Every memory object not referenced by a symbol is obsolete once newLISP reaches a higher evaluation level during expression evaluation. Objects in newLISP (excluding symbols and contexts) are passed by value copy to other user-defined functions. As a result, each newLISP object only requires one reference.
Further down, I see:
All lists, arrays and strings are passed in and out of built-in functions by reference.
I can't make sense of these two.
How can newLISP "not rely on a garbage collection algorithm", and yet pass things by reference?
For example, what would it do in the case of circular references?!
Is it even possible for a LISP to not use garbage collection without making performance go down the drain? (I assume you could always pass things by value, or always perform a full-heap scan whenever you think it might be necessary, but it seems to me that would hurt performance immensely.)
If so, how would it deal with circular references? If not, what do they mean?
Perhaps reading http://www.newlisp.org/ExpressionEvaluation.html helps in understanding the http://www.newlisp.org/MemoryManagement.html paper. Regarding circular references: they do not exist in newLISP; there is no way to create them. The performance question is addressed in a subchapter of that memory-management paper and here: http://www.newlisp.org/benchmarks/
Maybe working and experimenting with newLISP (i.e., trying to create a circular reference) will clear up most of the questions.

Static call graph generation for the Linux kernel

I'm looking for a tool to statically generate a call graph of the Linux kernel (for a given kernel configuration). The generated call graph should be "complete", in the sense that all calls are included, including potential indirect ones, which in the case of the Linux kernel we can assume are made only through function pointers.
For instance, this could be done by analyzing the function pointer types: this approach would lead to superfluous edges in the graph, but that's ok for me.
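To make that concrete, here is a tiny made-up fragment: a purely type-based analysis would give the indirect call below edges to every function whose signature matches handler_t, whether or not it is ever actually assigned:

typedef int (*handler_t)(int);

int real_handler(int x) { return x + 1; }
int unrelated(int x)    { return x * 2; }  // same signature, never stored in a handler_t

int dispatch(handler_t h, int v) {
    return h(v);  // type-based analysis adds call edges to BOTH functions above
}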
ncc seems to implement this idea; however, I didn't succeed in making it work on the 3.0 kernel. Any other suggestions?
I'm guessing this approach could also lead to missing edges in cases where function pointer casts are used, so I'd also be interested in knowing whether this is likely in the Linux kernel.
As a side note, there seem to be other tools able to do semantic analysis of the source to infer potential pointer values, but AFAICT, none of them are designed to be used on a project such as the Linux kernel.
Any help would be much appreciated.
We've done global points-to analysis (with indirect function pointers) and full call graph construction of monolithic C systems of 26 million lines (18,000 compilation units).
We did it using our DMS Software Reengineering Toolkit, its C Front End, and its associated flow-analysis machinery. The points-to analysis machinery (like the other analyses) is conservative; yes, you get some bogus points-to edges and therefore call edges as a consequence. These are pretty hard to avoid.
You can help such analyzers by providing certain crucial facts about key functions, and by harnessing knowledge such as "embedded systems [and OSes] tend not to have cycles in the call graph", which means you can eliminate some of these. Of course, you have to allow for exceptions; my moral: "in big systems, everything happens."
The particular problem included dynamically loaded(!) C modules using a special loading scheme specific to this particular software, but that just added to the problem.
Casts on function pointers shouldn't lose edges; a conservative analysis should simply assume that the cast pointer matches any function in the system whose signature corresponds to the cast result. More problematic are casts which produce sort-of-compatible signatures; if you cast a function pointer to void* foo(uint) when the actual function being called accepts an int, the points-to analysis will necessarily, conservatively, choose the wrong functions. You can't blame the analyzer for that; the cast lies in that case. Yes, we saw this kind of trash in the 26-million-line system.
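A contrived sketch of such a "lying" cast (hypothetical names):

int takes_int(int x) { return x; }            // the real callee

typedef void* (*wrong_sig_t)(unsigned);
// The cast compiles, but the declared signature no longer matches the
// callee, so signature-based matching will pick the wrong candidates.
wrong_sig_t p = (wrong_sig_t)&takes_int;      // undefined behavior if called through p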
This is certainly the right scale for analyzing Linux (which I think is a mere 8 million lines or so :-). But we haven't tried it specifically on Linux.
Setting up this tool is complicated because you have to capture all the details of the compilations themselves, in particular the configuration of the Linux kernel you want to analyze. So you pretty much have to intercept the compiler calls to get the command-line switches, etc.

Why don't Haskell compilers facilitate deterministic memory management?

With the wealth of type information available, why can't Haskell runtimes avoid running GC to clean up? It should be possible to figure out all usages and insert appropriate alloc/release calls in the compiled code, right? This would avoid the overhead of a runtime GC.
It is sensible to ask whether functional programming languages can do less GC by tracking usage. Although the general problem of whether some data can safely be discarded is undecidable (because of conditional branching), it's surely plausible to work harder statically and find more opportunities for direct deallocation.
It's worth paying attention to the work of Martin Hofmann and the team on the Mobile Resource Guarantees project, who made type-directed memory (de/re)allocation a major theme. The thing that makes their stuff work, though, is something Haskell doesn't have in its type system --- linearity. If you know that a function's input data are secret from the rest of the computation, you can reallocate the memory they occupy. The MRG stuff is particularly nice because it manages a realistic exchange rate between deallocation for one type and allocation for another which turns into good old-fashioned pointer-mangling underneath a purely functional exterior. In fact, lots of lovely parsimonious mutation algorithms (e.g. pointer-reversing traversal, overwrite-the-tail-pointer construction, etc) can be made to look purely functional (and checked for nasty bugs) using these techniques.
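As a very loose analogy in imperative terms (mine, not anything the MRG work actually does): when the caller relinquishes ownership of the input, its storage can be reused in place instead of allocating afresh:

#include <vector>

// If the input is "secret" from the rest of the computation (the caller
// passed it by value and kept no other reference), the function can
// mutate the buffer in place instead of allocating a result.
std::vector<int> doubled(std::vector<int> xs) {
    for (int& x : xs) x *= 2;   // reuse the unique owner's storage
    return xs;                  // moved out; no extra allocation or copy
}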
In effect, the linear typing of resources gives a conservative but mechanically checkable approximation to the kind of usage analysis that might well help to reduce GC. Interesting questions then include how to mix this treatment cleanly (deliberate adverb choice) with the usual persistent deal. It seems to me that quite a lot of intermediate data structures have an initial single-threaded phase in recursive computation, before being either shared or dropped when the computation finishes. It may be possible to reduce the garbage generated by such processes.
TL;DR There are good typed approaches to usage analysis which cut GC, but Haskell has the wrong sort of type information just now to be particularly useful for this purpose.
Region-based memory management is what programmers in C and C++ often end up programming by hand: allocate a chunk of memory ("region", "arena", etc.), allocate the individual data in it, use them, and eventually deallocate the whole chunk when you know none of the individual data are needed any more. Work in the 90s by Tofte, Aiken, and others (incl. yours truly, with our respective colleagues) has shown that it is possible to statically infer region allocation and deallocation points automatically ("region inference") in such a way as to guarantee that chunks are never freed too early and, in practice, early enough to avoid holding too much memory for long after the last data in a chunk was needed.
The ML Kit for Regions, for example, is a full Standard ML compiler based on region inference. In its final version it is combined with intra-region garbage collection: if the static inference shows there is a long-living region, garbage collection is used inside it. You get to have your cake and eat it, too: you have garbage collection for long-living data, but a lot of data is managed like a stack, even though it would ordinarily end up on the heap.
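A toy sketch of the hand-rolled arena described above (alignment and thread safety deliberately ignored):

#include <cstddef>
#include <memory>
#include <vector>

// Toy region/arena: hand out pieces of a big chunk, then free
// everything at once when the arena dies. No per-object deallocation,
// and no destructor calls for the allocated objects.
class Arena {
    static constexpr std::size_t kChunkSize = 4096;
    std::vector<std::unique_ptr<std::byte[]>> chunks_;
    std::size_t used_ = kChunkSize;  // forces a chunk on first allocate
public:
    void* allocate(std::size_t n) {
        if (used_ + n > kChunkSize) {
            chunks_.push_back(std::make_unique<std::byte[]>(
                n > kChunkSize ? n : kChunkSize));
            used_ = 0;
        }
        void* p = chunks_.back().get() + used_;
        used_ += n;
        return p;
    }
    // Implicit destructor frees every chunk: the whole region dies together.
};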
Consider the following pseudo-code:
func a = if some_computation a
         then a
         else 0
To know whether a is garbage after calling func a, the compiler has to be able to know the result of some_computation a. If it could do that in the general case (which requires solving the halting problem), then there'd be no need to emit code for this function at all, let alone garbage collect it. Type information is not sufficient.
It's not easy to determine object lifetimes under lazy evaluation. The JHC compiler does have (had?) region memory management, which tries to release memory by deallocating it when its lifetime is over.
I'm also curious exactly what you mean by deterministic memory management.
Type information mostly has to do with compile time, whereas memory management is a runtime thing, so I don't think they are related to each other.

Gc using type information

Does anyone know of a GC algorithm which utilises type information to allow incremental collection, optimised collection, parallel collection, or some other nice feature?
By type information, I mean real semantics. Let me give an example: suppose we have an OO-style class with methods that maintain a list and hide the representation. When the object becomes unreachable, the collector can just run down the list deleting all the nodes. It knows they're all unreachable now, because of encapsulation. It also knows there's no need to do a general scan of the nodes for pointers, because all the nodes are the same type.
Obviously, this is a special case, and easily handled with destructors in C++. The real question is whether there is a way to analyse the types used in a program and direct the collector to use the resulting information to advantage. I guess you'd call this a type-directed garbage collector.
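For reference, the C++ destructor version of that special case might look something like this (a toy list, names made up):

#include <memory>
#include <utility>

// Encapsulated ownership: the list is the sole owner of its nodes, so
// tearing it down is a plain linear walk with no general pointer scan.
struct Node {
    int value;
    std::unique_ptr<Node> next;  // unique ownership of the rest of the chain
};

class IntList {
    std::unique_ptr<Node> head_;
public:
    void push(int v) {
        head_ = std::unique_ptr<Node>(new Node{v, std::move(head_)});
    }
    ~IntList() {  // iterative teardown avoids deep recursion on long lists
        while (head_) head_ = std::move(head_->next);
    }
};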
The idea of at least exploiting containers for garbage collection in some way is not new, though in Java, you cannot generally assume that a container holds the only reference to objects within it, so your approach will not work in that context.
Here are a couple of references. One is for leak detection, and the other (from my research group) is about improving cache locality.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4814126
http://www.cs.umass.edu/~emery/pubs/06-06.pdf
You might want to visit Richard Jones's extensive garbage collection bibliography for more references, or ask the folks on gc-list.
I don't think it has anything to do with a specific algorithm.
When the GC computes the object-relationship graph, the information that a Collection object is solely responsible for the elements of its list is implicitly present in the graph, provided the compiler was good enough to extract it.
Whatever GC algorithm is chosen, the benefit depends more on how the compiler/runtime extracts this information.
Also, I would avoid C and C++ with GC. Because of pointer arithmetic, aliasing, and the possibility of pointers into the interior of an object (a reference to a data member or an element of an array), it's incredibly hard to perform accurate garbage collection in these languages. They were not crafted for it.
