Clojure: What is the underlying data structure of a lazy sequence?

As stated above, what is the underlying data structure of a lazy sequence?
Is it a list? If it is, what kind of list is it? Where can I find references about this?

The data structure is a clojure.lang.LazySeq, defined here. The lazy-seq macro creates one.
As you can see, a LazySeq is essentially a linked list that starts life with a thunk (zero-parameter function) member fn. When the sequence is realized, fn is used to generate the data members s or sv and is then nulled out. I can't quite figure out how s and sv relate to one another.


Julia: Abstract types vs Union of types

I am quite new to Julia, and I still have some doubts about which style is better for certain things. For instance, I'm unsure about the performance and style differences between using abstract types and defining Unions.
An example: Let's imagine we want to implement several types of units (Mob, Knight, ...) which should share most (if not all) of their attributes and most (if not all) of their methods.
I see two options to provide structure. First, one could declare an abstract type AbstractUnit from which the other types derive, and then create methods for the abstract type. It would look something like this:
abstract type AbstractUnit end

mutable struct Knight <: AbstractUnit
    id::Int
    [...]
end

mutable struct Peasant <: AbstractUnit
    id::Int
    [...]
end

id(u::T) where T <: AbstractUnit = u.id
[...]
Alternatively, one could define a union of types and create methods for the union. It would look something like this:
mutable struct Knight
    id::Int
    [...]
end

mutable struct Peasant
    id::Int
    [...]
end

const Unit = Union{Knight,Peasant}

id(u::Unit) = u.id
[...]
I understand some of the conceptual differences between these two approaches, and I think the first one is more extensible. However, I have a lot of doubts in terms of performance. For instance, how bad would it be to create arrays of AbstractUnit vs. arrays of the union type, in terms of memory allocation at runtime?
Thanks!
Absolutely use AbstractUnit instead of Unit in your methods' argument type annotations. Unions are abstract types too, actually, but as you pointed out, you can't add new types to one. In either case, the method is compiled to a specialization for each concrete type like Knight or Peasant, so your methods' performance won't differ.
As for Arrays' element type parameters, there is the isbits Union optimization, but as the name suggests it only works if all the types in your Union are isbitstypes (no pointers, immutable). Your structs are mutable so that's already not applicable. See, memory access is faster when instances are directly stored in Arrays, and the element type parameter (T in Vector{T}) must be a concrete isbitstype to allow this. When the element type parameter is abstract or mutable, generally Arrays only directly store pointers to the actual instances because multiple or mutable concrete types can have unknown and varying memory size. If the abstract type is an isbits Union though, instances could be stored directly in the Array: enough memory is allocated per element to contain the largest concrete type in the Union as well as a tag byte for each element specifying its type. A byte only has 256 values, so presumably this only works for Unions of at most 256 concrete types.
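To make the contrast concrete, here is a minimal sketch under the assumption that the structs are rewritten as plain immutable structs (the KnightBits/PeasantBits/UnitBits names are hypothetical, not from the question):

struct KnightBits
    id::Int
end

struct PeasantBits
    id::Int
end

const UnitBits = Union{KnightBits,PeasantBits}

isbitstype(KnightBits)   # true: immutable, no pointer fields
isbitstype(PeasantBits)  # true

# Elements of this Vector can be stored inline (value plus a one-byte type tag),
# whereas a Vector of mutable structs, or of an abstract element type, stores pointers.
v = UnitBits[KnightBits(1), PeasantBits(2)]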
Another possible optimization from using a Vector{Unit} over a Vector{AbstractUnit} is Union-splitting of a type instability. I'm really not going to be able to make an example method that explains it better than the linked blog, so I'll just give the short version. When Julia's compiler fails to infer the type of a variable in your method at all (::Any annotations in @code_warntype), inner method calls involving the variable must do type checks and dispatch (selecting the specialization) at run time, which can cost significant time. However, if Julia's compiler can infer the variable to be a Union of a few concrete types (in practice, at most 4), conditional branches for each type can be used to eliminate most type checks and do dispatch at compile time. Vector{AbstractUnit} can contain any number of types <: AbstractUnit, so the compiler cannot use Union-splitting. Vector{Unit}, however, lets the compiler know the elements must be either Knight or Peasant, which allows Union-splitting.
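A rough sketch of where Union-splitting can kick in, reusing the Knight/Peasant/Unit/id names from the question (total_id and total_id_dyn are hypothetical helpers, and in practice you would pick only one of the two struct definitions above):

# Element type is Union{Knight,Peasant}: the compiler can emit a branch for each
# of the two concrete types and resolve the call to id statically in each branch.
function total_id(units::Vector{Unit})
    s = 0
    for u in units
        s += id(u)
    end
    return s
end

# Element type is abstract with an open-ended set of subtypes: the call to id
# has to be dispatched at run time for each element.
function total_id_dyn(units::Vector{AbstractUnit})
    s = 0
    for u in units
        s += id(u)
    end
    return s
end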
P.S. This is often a source of confusion for beginners, but while Unit is an abstract type, Vector{Unit} is a concrete type, just one with an abstract type parameter. After all, it can have instances: Arrays that directly store pointers to Knight or Peasant instances.
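A quick way to check that in the REPL (assuming the Unit definition from the question):

isconcretetype(Unit)          # false: a Union cannot be instantiated directly
isconcretetype(Vector{Unit})  # true: a concrete container type whose element type parameter is abstract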

What is the purpose/use of a map in XPath 3.1?

I understand the need for an array type in XPath 3.1, as arrays are fundamental to JSON. And yes, I understand you can create a literal map() in an XPath query.
But is there a way XML or JSON can be structured such that a query against the underlying document would naturally return a map? Or does it exist solely for cases where it is useful to convert results into a map and then operate on that?
Probably the main use cases I've seen for maps are
(a) to capture the result of parsing JSON input, when the input data is in JSON
(b) to construct a structure that can be serialized as JSON, when JSON output is required.
(c) to provide complex input parameters to functions (like the fn:transform() or fn:serialize() functions)
(d) to capture multiple results or compound results from functions, e.g. a function that computes both the min and max of a sequence. If maps had been available at the time, they could have been used to get the namespace context of an element much more elegantly than the in-scope-prefixes/namespace-uri-for-prefix mechanism.
(e) a map whose entries are functions can be used like an object in OO languages, to achieve polymorphism -- especially useful in XQuery which lacks XSLT's template rule despatch mechanism. The fn:random-number-generator() function design illustrates the idea.
(f) a map can act as a simple struct for compound values, e.g. complex numbers. (It could have been used for date/time/duration/QName if available, or for the error information available in a catch clause)
"is there a way [..] JSON can be structured where a query would naturally return a map?": anything in JSON being an "object"
https://www.json.org/json-en.html: "An object is an unordered set of
name/value pairs. An object begins with {left brace and ends with
}right brace")
maps (pun intended) to an XDM map.
So in JSON both arrays and objects are fundamental and in the XDM you can represent a JSON array as an XDM array and a JSON object as an XDM map.

Compiler/Interpreter design: should built in methods have their own Node or should a lookup table be used?

I'm designing an interpreter which uses recursive descent, and I've got to the point where I'm starting to implement built-in methods.
One example of a method I'm implementing is the print() method, which outputs to the console, just like Python's print() and Java's System.out.println().
However, it has come to my attention that there are multiple ways of implementing these built-in methods. I'm sure there are many more, but I have determined two viable ways of achieving this, and I'm trying to establish which one is best practice. For context, below are the layers I'm using within my interpreter, which is loosely based on https://www.geeksforgeeks.org/introduction-of-compiler-design/ and other tutorials I've come across.
Lexer
Parser
Semantic Analyser
Interpreter / Code Generator
1. Creating an AST Node for each individual built-in method.
This method entails programming the parser to generate a node for each individual method. This means that a unique node will exist for each method. For example:
The parser generates a Print node when a TPRINT token is returned by the lexer.
print : TPRINT TLPAREN expr TRPAREN {$$ = new Print($3);}
;
And this is what the Print class looks like:
class Print : public Node {
public:
    virtual VariableValue visit_Semantic(SemanticAnalyzer* analyzer) override;
    virtual VariableValue visit_Interpreter(Interpreter* interpreter) override;

    Node* value;

    Print(Node* cvalue) {
        value = cvalue;
    }
};
From there I define the visit_Semantic and visit_Interpreter methods, and visit them using recursion from the top node.
I can think of a couple of advantages/disadvantages to using this method:
Advantages
When the code generator walks the tree and visits the Print node's visit_Interpreter method, it can execute the response straight away, as the behaviour is programmed directly into its visit methods.
Disadvantages
I will have to write a lot of copy-paste code: I will have to create a node for each separate method and define its parser grammar.
2. Creating a generic AST Node for method calls and then using a lookup table to determine which method is being called.
This involves creating one generic node MethodCall, plus grammar to determine that a method has been called, with a unique identifier (such as a string) for the method it refers to. Then, when the MethodCall's visit_Interpreter or visit_Semantic method is called, it looks up in a table which code to execute.
methcall : TIDENTIFIER TLPAREN call_params TRPAREN {$$ = new MethodCall($1->c_str(), $3);}
;
MethodCall Node. Here the unique identifier is std::string methodName:
class MethodCall : public Node {
public:
    virtual VariableValue visit_Semantic(SemanticAnalyzer* analyzer) override;
    virtual VariableValue visit_Interpreter(Interpreter* interpreter) override;

    std::string methodName;
    ExprList *params;

    MethodCall(std::string cmethodName, ExprList *cparams) {
        params = cparams;
        methodName = cmethodName;
    }
};
Advantages:
One generic grammar rule/node for all method calls. This makes it more readable.
Disadvantages:
At some point the unique identifier std::string methodName will have to be compared against a lookup table to determine a response for it. This is not as efficient as programming a response directly into a node's visit methods.
Which practice is the best way of handling methods within a compiler/interpreter? Are there different practices altogether which are better, or are there other disadvantages/advantages I am missing?
I'm fairly new to compiler/interpreter design so please ask me to clarify if I have got some terminology wrong.
As far as I see it, you have to split things up into methods somewhere. The question is: do you want to implement this as part of the parser definition (solution 1), or do you want to implement it on the C++ side (solution 2)?
Personally, I would prefer to keep the parser definition simple and move this logic to the C++ side, that is solution 2.
As for the runtime cost of solution 2, I wouldn't worry too much about it. In the end, it depends on how frequently the method is called and how many identifiers you have. Having just a few identifiers is different from comparing, say, hundreds of strings in an "else if" manner.
You could first implement it the simple, straightforward way, i.e. match the string identifiers in an "else if" manner, and see if you run into runtime issues.
If you experience runtime issues, you could use a hash function. The "hardcore" way would be to implement an optimal hash function yourself and check hash function optimality offline, since you know the set of string identifiers. But for your application that would probably be overkill or for academic purposes, and I would recommend just using unordered_map from the STL (which uses hashing under the hood, see also How std::unordered_map is implemented) to map your string identifiers to index numbers, so that you can implement your jump table with an efficient switch operation on those index numbers.
You should most definitely use a table lookup. It makes things much easier for you. Also, think about the functions users define! Then you'll definitely need a table.

Understanding immutable composite types with fields of mutable types in Julia

Initial note: I'm working in Julia, but this question probably applies to many languages.
Setup: I have a composite type as follows:
type MyType
    x::Vector{String}
end
I write some methods to act on MyType. For example, I write a method that allows me to insert a new element in x, e.g. function insert!(d::MyType, itemToInsert::String).
Question: Should MyType be mutable or immutable?
My understanding: I've read the Julia docs on this, as well as more general (and highly upvoted) questions on Stack Overflow (e.g. here or here), but I still don't really have a good handle on what it means to be mutable/immutable from a practical perspective (especially for the case of an immutable composite type containing a mutable array of immutable types!)
Nonetheless, here is my attempt: If MyType is immutable, then it means that the field x must always point to the same object. That object itself (a vector of Strings) is mutable, so it is perfectly okay for me to insert new elements into it. What I am not allowed to do is try to alter MyType so that the field x points to an entirely different object. For example, given an instance d of MyType, methods that do the following are okay:
d.x[1] = "NewValue"
push!(d.x, "NewElementToAdd")
But methods that do the following are not okay:
d.x = ["a", "different", "string", "array"]
Is this right? Also, is the idea that the objects an immutable type's fields are locked to are those created within the constructor?
Final Point: I apologise if this appears to duplicate other questions on SO. As stated, I have looked through them and wasn't able to get the understanding that I was after.
So here is something mind bending to consider (at least to me):
julia> immutable Foo
           data::Vector{Float64}
       end

julia> x = Foo([1.0, 2.0, 4.0])
Foo([1.0,2.0,4.0])

julia> append!(x.data, x.data); pointer(x.data)
Ptr{Float64} @0x00007ffbc3332018

julia> append!(x.data, x.data); pointer(x.data)
Ptr{Float64} @0x00007ffbc296ac28

julia> append!(x.data, x.data); pointer(x.data)
Ptr{Float64} @0x00007ffbc34809d8
So the data address is actually changing as the vector grows and needs to be reallocated! But - you can't change data yourself, as you point out.
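For completeness, here is the same point as a minimal sketch in current Julia syntax (struct declares an immutable type by default; immutable was the pre-1.0 keyword used above):

struct Foo
    data::Vector{Float64}
end

x = Foo([1.0, 2.0, 4.0])
push!(x.data, 8.0)      # fine: mutates the Vector that data refers to
x.data[1] = -1.0        # fine: same reason
# x.data = Float64[]    # ERROR: setfield!: immutable struct of type Foo cannot be changed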
I'm not sure there is a 100% right answer here, really. I primarily use immutable for simple types like the Complex example in the docs in some performance-critical situations, and I do it for "defensive programming" reasons, e.g. the code has no need to write to the fields of this type, so I make it an error to do so. They are a good choice IMO whenever the type is a sort of extension of a number, e.g. Complex, RGBColor, and I use them in place of tuples, as a kind of named tuple (tuples don't seem to perform well with Julia right now anyway, whereas immutable types perform excellently).

C# Performance of LINQ vs. foreach iterator block

1) Do these generate the same byte code?
2) If not, is there any gain in using one over the other in certain circumstances?
// LINQ select statement
return from item in collection
       select item.Property;

// foreach in an iterator block
foreach (var item in collection)
    yield return item.Property;
They don't generate the same code, but they boil down to the same thing: you get an object implementing IEnumerable<typeof(Property)>. The difference is that LINQ provides the iterator from its library (in this case most likely using WhereSelectArrayIterator or WhereSelectListIterator), whereas in the second example you yourself write an iterator block that dissects a collection. An iterator block method is always, by way of compiler magic, compiled into a separate class implementing IEnumerable<typeof(yield)>, which you don't see but instantiate implicitly when you call the iterator block method.
Performance-wise, #1 should be slightly (but just slightly) faster for indexable collections, because when you loop through the resulting IEnumerable you go directly from your foreach into collection retrieval in an optimized LINQ iterator. In example #2 you go from foreach into your iterator block's foreach and from there into collection retrieval, and your performance depends mostly on how smart the compiler is at optimizing the yield logic. In any case, I would imagine that for any complex collection mechanism the cost of retrieval dwarfs this difference.
IMHO, I would always go with #1, if nothing else it saves me from having to write a separate method just for iterating.
No they don't generate the same byte code. The first one returns a pre-existing class in the framework. The second one returns a compiler generated state machine that returns the items from the collection. That state machine is very similar to the class that exists in the framework already.
I doubt there's much performance difference between the two. Both are doing very similar things in the end.
