Binding using std::bind vs lambdas. How expensive are they? - c++11

I was playing with bind and I was thinking, are lambdas as expensive as function pointers?
What I mean is, as I understand lambdas, they are syntactic sugar for functors and bind is similar. However, if you do this:
#include<functional>
#include<iostream>
void fn2(int a, int b)
{
std::cout << a << ", " << b << std::endl;
}
void fn1(int a, int b)
{
//auto bound = std::bind(fn2, a, b);
//static auto bound = std::bind(fn2, a, b);
//auto bound = [&]{ fn2(a, b); };
static auto bound = [&]{ fn2(a, b); };
bound();
}
int main()
{
fn1(3, 4);
fn1(1, 2);
return 0;
}
Now, if I use the 1st option, auto bound = std::bind(fn2, a, b);, I get an output of "3, 4" followed by "1, 2"; with the 2nd I get "3, 4" followed by "3, 4". With the 3rd and 4th I get the same output as the 1st.
Now I get why the 1st and 2nd work that way, they are getting initialised at the beginning of the function call (the static one, only the 1st time it is called). However, 3 and 4 seem to have compiler magic going on where the generated functors are not really creating references to the enclosing scope's variables, but are actually latching on to the symbols whether or not it is initialised only the first time or every time.
Can someone clarify what is actually happening here?
Edit: What I was missing is using static auto bound = std::bind(fn2, std::ref(a), std::ref(b)); to have it work as the 4th option.
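A small sketch of the difference noted in the edit above: std::bind copies its bound arguments unless they are wrapped in std::ref. The helper add and the two demo functions are hypothetical names used only for illustration.

```cpp
#include <functional>

// Hypothetical helper used only for this illustration.
int add(int a, int b) { return a + b; }

// std::bind stores copies of its bound arguments by default.
int bind_by_copy() {
    int a = 1, b = 2;
    auto bound = std::bind(add, a, b);  // a and b are copied here
    a = 10;                             // the copy inside bound is unaffected
    return bound();                     // computes 1 + 2
}

// std::ref makes std::bind store references to the arguments instead.
int bind_by_ref() {
    int a = 1, b = 2;
    auto bound = std::bind(add, std::ref(a), std::ref(b));
    a = 10;                             // visible through the stored reference
    return bound();                     // computes 10 + 2
}
```

Note that a static bound object holding std::ref to function parameters keeps references to the first call's parameters, which no longer exist on the second call, so it has the same lifetime problem as a static [&] lambda.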

You have this code:
static auto bound = [&]{ fn2(a, b); };
The initialization happens only the first time you invoke this function, because the variable is static. So in fact the lambda is created only once. The compiler creates a closure when you write a lambda, so the references to a and b from the first call to fn1 were captured. That's very risky: it leads to dangling references. I'm surprised it didn't crash, since you are making a closure over function parameters passed by value, which are local variables.
I recommend this excellent article about lambdas: http://www.cprogramming.com/c++11/c++11-lambda-closures.html .

As a general rule, only use [&] lambdas when your closure is going to go away by the end of the current scope.
If it is going to outlast the current scope, and you need by-reference, explicitly capture the things you are going to capture, or create local pointers to the things you are going to capture and capture them by-value.
In your case, your static lambda code is full of undefined behavior, as you [&] capture a and b in the first call, then use it in the second call.
In theory, the compiler could rewrite your code to capture a and b by value instead of by reference, then call that every time, because the only difference between that implementation and the one you wrote occurs when the behavior is undefined, and the result will be much faster.
It could do a more efficient job by ignoring your static completely, as the entire state of your static object is undefined after you leave scope the first time you call, and the construction has no visible side effects.
To fix your problem with the lambdas, use [=] or [a,b] to introduce the lambda, and it will capture the a and b by value. I prefer to capture state explicitly on lambdas when I expect the lambda to persist longer than the current block.
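As a minimal sketch of the by-value fix, the functions below (hypothetical names, returning a sum instead of printing so the result is easy to check) contrast a by-value capture, a by-reference capture used within its own scope, and a static by-value lambda that is well-defined across calls.

```cpp
// Captures copies of a and b; later changes to a are not seen.
int sum_by_value(int a, int b) {
    auto f = [a, b] { return a + b; };
    a = 100;
    return f();  // uses the original a
}

// Captures references; safe here because f does not outlive a and b.
int sum_by_reference(int a, int b) {
    auto f = [&a, &b] { return a + b; };
    a = 100;
    return f();  // uses the modified a
}

// The static closure is initialized once, on the first call, with copies of
// that call's arguments. Later calls reuse those copies, so nothing dangles.
int first_call_sum(int a, int b) {
    static auto bound = [a, b] { return a + b; };
    return bound();
}
```

With the static by-value version, every call returns the sum from the first call, which matches the "3, 4" then "3, 4" behaviour of the static std::bind variant in the question.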


Halide external function call from generator

I want to implement a simple image processing routine quite similar to Auto Levels, so I need to precalculate thresholds, build a LUT, and then do histogram stretching/normalization by applying the LUT.
But my question is not about the algorithm; it is about using externally defined functions, because I need a couple of while loops for the LUT calculation and I think an extern function is a good fit for that.
I tried following the examples from the Halide sources and checked this question too.
I use AOT compilation, currently testing on PC (Win x64) and aiming for ARM in the future, and have the following generator code:
Var x("x"), y("y");
Func make_a_root{ "make_a_root" };
Buffer<bitType> Lut{256, "lut"};
make_a_root(x, y) = inputY(x, y);
ExternFuncArgument arg = make_a_root;
Func g;
g.define_extern("generateAutoLevelsLut", { arg }, UInt(8), 2, Halide::NameMangling::CPlusPlus);
g.compute_root();
inputY is declared as Input<Buffer<uint8_t>> inputY{ "input_y", 2 };
First I just want to get the call to run, so the function body does nothing but print (can I define the function in the same cpp file as the generator?):
int generateAutoLevelsLut(halide_buffer_t * input, halide_buffer_t * out)
{
printf("\nextern call\n");
return 0;
}
I tried default mangling with extern "C" too.
I never managed to get the printed message, though, so my question is: why is this happening? Am I misunderstanding some syntax, or is there a problem with calling an extern function from generator code?
EDIT:
I added a use of the extern result, like 'out(x,y) = g(x,y)' (the result actually has to be used!), and the call started happening. Now I'm struggling with host == NULL and digging into the bounds inference stuff.
EDIT 2:
I added basic bounds inference checks, and now it does not crash. The next problem I have is: is it possible to call an external function without it directly influencing the output?
Let me make concrete what I mean.
The generator code looks like the following:
Buffer<bitType> lut{256, "lut"};
args[0] = inputY;
args[1] = lut;
g.define_extern("generateAutoLevelsLut", args, { UInt(8) }, 2, Halide::NameMangling::C);
outputY(x, y) = g(x, y); // Call line
g.compute_root();
outputY.compute_root();
The extern function code fills the second input 'lut' with a dummy LUT:
Halide::Runtime::Buffer<uint16_t> im2Buffer(*input2);
Mat im2Mat(Size(im2Buffer.width(), im2Buffer.height()), CV_8U, im2Buffer.data(), im2Buffer.stride(1));
for (int i = 0; i < 256; i++)
im2Mat.at<uchar>(i) = i;
And if I comment out the 'Call line' in the generator, the call to the extern function is optimized away entirely.
I want to do something like:
Func lutRoot;
lutRoot(x) = lut(x); // to convert from Buffer
outputY(x, y) = autoLevelsPrecalcLut(inputY, lutRoot)(x, y);
Here lut would be implicitly passed into the extern function and filled there. But it doesn't work, and neither do other variants that ignore the modification of 'output'... or maybe this whole approach is wrong?
Any suggestions? Thanks
EDIT 3:
I solved the task without extern calls, replacing the while loops with an argmin and RDom combo, but the original question about extern still stands.
That should work (or fail with a linker error if it wasn't going to). It's possible the Halide pipeline doesn't think it needs to call your extern function. E.g. does something use the result?
Alternatively, try stderr instead, just in case it's an output stream buffering issue. That extern function definition is likely to cause Halide to error out (because it doesn't reply to the bounds inference query), and erroring out calls abort by default, which would swallow things printed to stdout.

Pybind11: Follow up to binding a function with std::initializer_list

I know that there is a similar question here: Binding a function with std::initializer_list argument using pybind11 but because I cannot comment (not enough reputation) I ask my question here: Do the results from the above-linked question also apply to constructors: I.e. if I have a constructor which takes std::initializer_list<T> is there no way to bind it?
There's no simple way to bind it, at least. Basically, as mentioned in the other post (and my original response in the pybind11 issue tracker), we can't dynamically construct a std::initializer_list: it's a compile-time construct. Constructor vs method vs function doesn't matter here: we cannot convert a set of dynamic arguments into the compile-time initializer_list construct.
But let me give you a way that you could, partially, wrap it if you're really stuck with a C++ design that requires it. You first have to decide how many arguments you're going to support. For example, let's say you want to support 1, 2, or 3 int arguments passed via initializer_list<int> in the bound constructor for a MyType. You could write:
#include <pybind11/stl.h> // needed for the std::vector<int> argument conversion
py::class_<MyType>(m, "MyType")
.def(py::init([](int a) { return new MyType({ a }); }))
.def(py::init([](int a, int b) { return new MyType({ a, b }); }))
.def(py::init([](int a, int b, int c) { return new MyType({ a, b, c }); }))
.def(py::init([](std::vector<int> v) {
// each supported size needs its own compiled braced-init expression
if (v.size() == 1) return new MyType({ v[0] });
else if (v.size() == 2) return new MyType({ v[0], v[1] });
else if (v.size() == 3) return new MyType({ v[0], v[1], v[2] });
else throw std::runtime_error("Wrong number of ints for a MyType");
}));
where the first three overloads take integer values as arguments and the last one takes a list. (There's no reason you'd have to use both approaches--I'm just doing it for the sake of example).
Both of these are rather gross, and don't scale well, but they exhibit the fundamental issue: each size of an initializer_list needs to be compiled from a different piece of C++ code. And that's why pybind11 can't support it: we'd have to compile different versions of the conversion code for each possible initializer_list argument length--and so either the binary size explodes for any number of arguments that might be used, or there's an arbitrary argument size cut-off beyond which you start getting a fatal error. Neither of those are nice options.
Edit: As for your question specifically about constructors: there's no difference here. The issue is that we can't convert arguments into the required type, and argument conversion is identical whether for a constructor, method, or function.
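To illustrate the underlying limitation without any pybind11 machinery, here is a plain C++ sketch. MyType here is a hypothetical stand-in with an initializer_list constructor, and make_my_type is an assumed helper name. The point it shows is that each supported length needs its own braced-init expression in the source.

```cpp
#include <cstddef>
#include <initializer_list>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a class that can only be built from an
// std::initializer_list<int>.
struct MyType {
    std::vector<int> values;
    MyType(std::initializer_list<int> il) : values(il) {}
};

// Converts a runtime-sized vector into a MyType. Because the length of an
// initializer_list is fixed by the braced list written in the source, every
// supported size needs its own branch.
MyType make_my_type(const std::vector<int>& v) {
    switch (v.size()) {
        case 1: return MyType{v[0]};
        case 2: return MyType{v[0], v[1]};
        case 3: return MyType{v[0], v[1], v[2]};
        default:
            throw std::runtime_error("Wrong number of ints for a MyType");
    }
}
```

Supporting a fourth element means adding a fourth case, which is the scaling problem described above.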

Pass a vector starting from index i by reference

I am writing a function in C++
int maxsubarray(vector<int>&nums)
say I have a vector
v={1,2,3,4,5}
I want to pass
{3,4,5}
to the function, i.e. pass the vector starting from index 2. In C I know I can call maxsubarray(v+2)
but in C++ it doesn't work. Of course I can modify the function by adding a start index parameter to make it work. I just want to know: can I do it without modifying my original function?
THX
You will have to create a temporary vector with the part you want to pass:
std::vector<int> v = {1,2,3,4,5};
std::vector<int> v2(v.begin() + 2, v.end());
maxsubarray(v2);
The obvious solution is to make a new vector and pass that one instead. I definitely do not recommend that. The most idiomatic way is to make your function take iterators:
template<typename It>
typename It::value_type maxsubarray(It begin, It end) { ... }
and then use it like this:
std::vector<int> nums(...);
auto max = maxsubarray(begin(nums) + 2, end(nums));
Anything else involving copies is just inefficient and unnecessary.
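A possible sketch of the iterator-based version follows. The question does not show the body of maxsubarray, so this assumes it computes the maximum contiguous subarray sum (Kadane's algorithm) and uses the hypothetical name max_subarray_sum. It also assumes the range is non-empty.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Maximum sum over all contiguous, non-empty subranges of [first, last).
// Precondition (assumed): first != last.
template <typename It>
typename std::iterator_traits<It>::value_type max_subarray_sum(It first, It last) {
    using T = typename std::iterator_traits<It>::value_type;
    T best = *first;     // best sum seen so far
    T current = *first;  // best sum of a subrange ending at the current element
    for (++first; first != last; ++first) {
        current = std::max<T>(*first, current + *first);
        best = std::max(best, current);
    }
    return best;
}
```

Calling max_subarray_sum(v.begin() + 2, v.end()) then works on the tail of the vector without copying it, and the same template also accepts raw pointers.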
Not without constructing another vector.
You can either build a new vector and pass it by reference to the function (though this might not be ideal from a performance point of view, since you generally pass by reference to avoid unnecessary copies) or use pointers:
//copy the vector
std::vector<int> copy(v.begin()+2, v.end());
maxsubarray(copy);
//pass a pointer to the given element (this changes the signature, and the
//function then has no way to know how many elements follow)
int maxsubarray(int * nums);
maxsubarray(&v[2]);
You could try calling it with a temporary:
int myMax = maxsubarray(vector<int>(v.begin() + 2, v.end()));
That might require changing the function signature to
int maxsubarray(const vector<int> &nums);
since (I think) temporaries can't bind to non-const references, but that change should be preferred here if maxsubarray won't modify nums.
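A short sketch of the temporary-vector approach with a const reference parameter. Here sum_of is a hypothetical stand-in for maxsubarray (its real body is not shown in the question), and sum_from_index_2 assumes the input has at least two elements.

```cpp
#include <numeric>
#include <vector>

// Hypothetical stand-in for the original function. Taking a const reference
// lets it accept temporaries as well as named vectors.
int sum_of(const std::vector<int>& nums) {
    return std::accumulate(nums.begin(), nums.end(), 0);
}

// Builds a temporary copy of the tail starting at index 2 and passes it on.
// Precondition (assumed): v.size() >= 2.
int sum_from_index_2(const std::vector<int>& v) {
    return sum_of(std::vector<int>(v.begin() + 2, v.end()));
}
```

With a non-const std::vector<int>& parameter the call inside sum_from_index_2 would not compile in standard C++, because a temporary cannot bind to a non-const lvalue reference.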

Data structure for storing variables in an interpreted language

I am designing my own experimental scripting language for the purpose of embedding it in my bigger application.
Almost everything I wanted to do was programmed smoothly, but the "simple" act of storing variables in memory appeared to be the hardest part here. I don't know how to store them to allow all type checking, global variables and special flags on them. First look at a sample code:
a = 1
b = 2
someFunction()
print(a) --> This should read the global variable and print `1`
a = 3 --> Now `a` should become a local variable of this function
and the global `a` remain unchanged
x = 4 --> `x` should always be local of this function
end
I call the "locality" of variables their levels so variables in nested blocks have a higher level. In the above code, a and b are level 1 variables. Local variables of someFunction will have level 2. The first line of the function should read the global variable a (level 1) but the second line should create a variable again called a but with level 2 that shadows the global a from that point onwards. The third line should create the variable x with level 2. How to store and keep track of all these in memory?
What I tried so far:
Method 1: Storing maps of variable=>value in array of levels:
variables
{
level=1 //global variables
{
a => 1,
b => 2
},
level=2 //function variables
{
a => 3,
x => 4
}
}
But that will make variable look-up really slow since one has to search all the levels for a given variable.
Method 2: Storing the (variable, level) pairs as keys of a map:
variables
{
(a, 1) => 1, //global
(b, 1) => 2, //global
(a, 2) => 3, //function
(x, 2) => 4 //function
}
This has the same problem as before since we have to try the pair (variable, level) with all possible levels for a given variable.
What method should I use for optimal memory usage and fastest access time?
Additional notes:
I know how variables are managed on the stack and heap in other "real" languages, but I find it tricky to do this in an interpreted language. "This can't be how Lua and Python do it," I always think. Correct me if I'm wrong. I'm trying to store the variables in maps and internal C++ structures.
And finally, this is how I represent a variable. Do you think it's too big, and could there be more memory-efficient representations? (I've also tried putting the "Level" in as a member here, but it had the same problem as the other two.)
struct Member
{
uchar type; //0=num, 1=str, 2=function, 3=array, etc
uchar flags; //0x80 = read-only, 0x40 = write-only, etc
union {
long double value_num;
char* value_str;
int value_func;
//etc
};
};
An easy thing to do, similar to your array, is to maintain a stack of maps. Each map contains the bindings for that scope. To bind a variable, add it to the top map; to look up a variable, start at the top of the stack and stop when you reach a map that contains a binding for that variable. Search takes a little bit, but starting from the top/end you only have to search until you find it — in most cases, this search will not be very long.
You can also make the stack implicit by encapsulating this logic in an Environment class that has local bindings and an inherited environment used for resolving unknown variables. Need to go into a new scope? Create a new environment with the current environment as its base, use it, then discard it when the scope is finished. The root/global environment can just have a null inherited environment. This is what I would probably do.
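A minimal sketch of such an Environment class, assuming values are plain doubles for brevity (a real interpreter would store something like the Member struct from the question). The method names define and lookup are illustrative choices.

```cpp
#include <string>
#include <unordered_map>

class Environment {
public:
    // A null parent marks the root (global) environment.
    explicit Environment(const Environment* parent = nullptr) : parent_(parent) {}

    // Binds a name in this scope only, shadowing any outer binding.
    void define(const std::string& name, double value) { locals_[name] = value; }

    // Searches this scope first, then each enclosing scope in turn.
    // Returns nullptr if the name is not bound anywhere.
    const double* lookup(const std::string& name) const {
        for (const Environment* env = this; env != nullptr; env = env->parent_) {
            auto it = env->locals_.find(name);
            if (it != env->locals_.end()) return &it->second;
        }
        return nullptr;
    }

private:
    std::unordered_map<std::string, double> locals_;
    const Environment* parent_;
};
```

For the sample script in the question, the function body would get an Environment whose parent is the global one: reading a finds the global value until the function defines its own a, after which the local binding shadows it and the global stays unchanged.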
It's worth noting that if, inside a function, you don't have access to any variables from the caller function, it lowers the number of levels you need to look at. For example:
variable a;
function one() {
variable b;
// in this function, we can see the global a, local b
two();
}
function two() {
// in this function, we can see the global a, local c
// we cannot see the local b of our caller
variable c;
while (true) {
variable d;
// here we can see local d, local c, global a
}
}
The idea being that function boundaries limit the visibility of variables, with the global scope being "special".
That being said, you can consider removing the specialness of global variables, while allowing the code to specify that it wants access to non-local variables:
variable a;
function one() {
global a; // or upvar #0 a;
variable b;
// in this function, we can see the global a, local b
two();
}
function two() {
// in this function, we can see the local c
// and the local b of our caller
// (since we specifically say we want access to "b" one level up)
upvar 1 b;
variable c;
}
It looks complicated at first, but it's really easy to understand once you get used to it (upvar is a construct from the Tcl programming language). What it gives you is access to variables in your caller's scope, but it avoids some of the costly lookup involved by requiring that you specify exactly where that variable comes from (with 1 being one level up the call stack, 2 being two levels up, and #0 being "special" in saying "the uppermost call stack, the global").
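One way the level-based access could be modelled in a C++ interpreter is sketched below. This is an assumption-laden illustration, not how Tcl implements upvar: frames hold plain doubles, and the names CallStack, up, and global are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using Frame = std::unordered_map<std::string, double>;

class CallStack {
public:
    CallStack() : frames_(1) {}  // frames_[0] is the global frame

    void push() { frames_.emplace_back(); }  // entering a function
    void pop() { frames_.pop_back(); }       // leaving a function

    Frame& current() { return frames_.back(); }
    Frame& global() { return frames_.front(); }  // like "upvar #0"

    // Frame `levels_up` call levels above the current one (0 = current).
    // at() throws std::out_of_range if the level does not exist.
    Frame& up(std::size_t levels_up) {
        return frames_.at(frames_.size() - 1 - levels_up);
    }

private:
    std::vector<Frame> frames_;
};
```

Because the code names the level explicitly, a variable access is a single map lookup in one known frame instead of a walk through every enclosing scope. The returned references should be used right away, since a later push() can reallocate the underlying vector.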

How to achieve "optimal" operator overload-resolution in arithmetic expressions with rvalues?

first of all, I apologize for the overly verbose question. I couldn't think of any other way to accurately summarize my problem... Now on to the actual question:
I'm currently experimenting with C++0x rvalue references... The following code produces unwanted behavior:
#include <iostream>
#include <utility>
struct Vector4
{
float x, y, z, w;
inline Vector4 operator + (const Vector4& other) const
{
Vector4 r;
std::cout << "constructing new temporary to store result"
<< std::endl;
r.x = x + other.x;
r.y = y + other.y;
r.z = z + other.z;
r.w = w + other.w;
return r;
}
Vector4&& operator + (Vector4&& other) const
{
std::cout << "reusing temporary 2nd operand to store result"
<< std::endl;
other.x += x;
other.y += y;
other.z += z;
other.w += w;
return std::move(other);
}
friend inline Vector4&& operator + (Vector4&& v1, const Vector4& v2)
{
std::cout << "reusing temporary 1st operand to store result"
<< std::endl;
v1.x += v2.x;
v1.y += v2.y;
v1.z += v2.z;
v1.w += v2.w;
return std::move(v1);
}
};
int main (void)
{
Vector4 r,
v1 = {1.0f, 1.0f, 1.0f, 1.0f},
v2 = {2.0f, 2.0f, 2.0f, 2.0f},
v3 = {3.0f, 3.0f, 3.0f, 3.0f},
v4 = {4.0f, 4.0f, 4.0f, 4.0f},
v5 = {5.0f, 5.0f, 5.0f, 5.0f};
///////////////////////////
// RELEVANT LINE HERE!!! //
///////////////////////////
r = v1 + v2 + (v3 + v4) + v5;
return 0;
}
results in the output
constructing new temporary to store result
constructing new temporary to store result
reusing temporary 1st operand to store result
reusing temporary 1st operand to store result
while I had hoped for something like
constructing new temporary to store result
reusing temporary 1st operand to store result
reusing temporary 2nd operand to store result
reusing temporary 2nd operand to store result
After trying to re-enact what the compiler was doing (I'm using MinGW G++ 4.5.2 with option -std=c++0x in case it matters), it actually seems quite logical. The standard says that arithmetic operations of equal precedence are evaluated/grouped left-to-right (why I assumed right-to-left I don't know, I guess it's more intuitive to me). So what happened here is that the compiler evaluated the sub-expression (v3 + v4) first (since it's in parentheses?), and then began matching the operations in the expression left-to-right against the operator overloads, resulting in a call to Vector4 operator + (const Vector4& other) for the sub-expression v1 + v2. If I want to avoid the unnecessary temporary, I'd have to make sure that no more than one lvalue operand appears to the immediate left of any parenthesized sub-expression, which is counter-intuitive to anyone using this "library" and innocently expecting optimal performance (as in minimizing the creation of temporaries).
(I'm aware that there's ambiguity in my code regarding operator + (Vector4&& v1, const Vector4& v2) and operator + (Vector4&& other) when (v3 + v4) is to be added to the result of v1 + v2, resulting in a warning. But it's harmless in my case and I don't want to add yet another overload for two rvalue reference operands - anyone know if there's a way to disable this warning in gcc?)
Long story short, my question boils down to: Is there any way or pattern (preferably compiler-independent) this vector class could be rewritten to enable arbitrary use of parentheses in expressions that still results in the "optimal" choice of operator overloads (optimal in terms of "performance", i.e. maximizing the binding to rvalue references)? Perhaps I'm asking for too much though and it's impossible... if so, then that's fine too. I just want to make sure I'm not missing anything.
Many thanks in advance
Addendum
First thanks to the quick responses I got, within minutes (!) - I really should have started posting here sooner...
It's becoming tedious replying in the comments, so I think a clarification of my intent with this class design is in order. Maybe you can point me to a fundamental conceptual flaw in my thought process if there is one.
You may notice that I don't hold any resources in the class like heap memory. Its members are only scalar types even. At first sight this makes it a suspect candidate for move-semantics based optimizations (see also this question that actually helped me a great deal grasping the concepts behind rvalue references).
However, since the classes this one is supposed to be a prototype for will be used in a performance-critical context (a 3D engine to be precise), I want to optimize every little thing possible. Low-complexity algorithms and maths-related techniques like look-up tables should of course make up the bulk of the optimizations as anything else would simply be addressing the symptoms and not eradicating the real reason for bad performance. I am well aware of that.
With that out of the way, my intent here is to optimize algebraic expressions with vectors and matrices that are essentially plain-old-data structs without pointers to data in them (mainly due to the performance drawbacks you get with data on the heap [having to dereference additional pointers, cache considerations etc.]).
I don't care about move-assignment or construction, I just don't want more temporaries being created during the evaluation of a complicated algebraic expression than absolutely necessary (usually just one or two, e.g. a matrix and a vector).
Those are my thoughts that might be erroneous. If they are, please correct me:
1. To achieve this without relying on RVO, return-by-reference is necessary (again: keep in mind I don't have remote resources, only scalar data members).
2. Returning by reference makes the function-call expression an lvalue, implying the returned object is not a temporary, which is bad, but returning by rvalue reference makes the function-call expression an xvalue (see 3.10.1), which is okay in the context of my approach (see 4).
3. Returning by reference is dangerous, because of the possibly short lifetime of objects, but:
4. temporaries are guaranteed to live until the end of the evaluation of the expression they were created in, therefore:
5. making it safe to return by reference from those operators that take at least one rvalue-reference as their argument, if the object referenced by this rvalue reference argument is the one being returned by reference. Therefore:
6. Any arbitrary expression that only employs binary operators can be evaluated by creating only one temporary when not more than one PoD-like type is involved, and the binary operations don't require a temporary by nature (like matrix multiplication).
(Another reason to return by rvalue-reference is because it behaves like returning by value in terms of rvalue-ness of the function-call expression; and it's required for the operator/function-call expression to be an rvalue in order to bind to subsequent calls to operators that take rvalue references. As stated in (2), calls to functions that return by reference are lvalues, and would therefore bind to operators with the signature T operator+(const T&, const T&), resulting in the creation of an unnecessary temporary)
I could achieve the desired performance by using a C-style approach of functions like add(Vector4 *result, Vector4 *v1, Vector4 *v2), but come on, we're living in the 21st century...
In summary, my goal is to create a vector class that achieves the same performance as the C approach while using overloaded operators. If that in itself is impossible, then I guess it can't be helped. But I'd appreciate it if someone could explain to me why my approach is doomed to fail (the left-to-right operator evaluation issue that was the initial reason for my post aside, of course).
As a matter of fact, I've been using the "real" vector class this one is a simplification of for a while without any crashes or corrupted memory so far. And in fact, I never actually return local objects as references, so there shouldn't be any problems. I dare say what I'm doing is standard-compliant.
Any help on the original issue would of course be appreciated as well!
many thanks for all the patience again
You should not return an rvalue reference, you should return a value. In addition, you should not specify both a member and a free operator+. I'm amazed that even compiled.
Edit:
r = v1 + v2 + (v3 + v4) + v5;
How could you possibly only have one temporary value when you're performing two sub-computations? That's just impossible. You can't re-write the Standard and change this.
You will just have to trust your users to do something not completely stupid, like write the above line of code, and expect to have just one temporary.
I recommend modeling your code after the basic_string operator+() found in chapter 21 of N3225.
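A rough sketch of what that could look like for the Vector4 from the question: four free operator+ overloads covering the lvalue/rvalue combinations, each returning by value, in the style of the std::basic_string overload set. The operator+= helper is an addition made for this illustration.

```cpp
#include <utility>

struct Vector4 {
    float x, y, z, w;

    Vector4& operator+=(const Vector4& o) {
        x += o.x; y += o.y; z += o.z; w += o.w;
        return *this;
    }
};

// Both operands are lvalues: a fresh result object is needed.
inline Vector4 operator+(const Vector4& a, const Vector4& b) {
    Vector4 r = a;
    r += b;
    return r;
}

// Left operand is a temporary: accumulate into it, then return by value.
inline Vector4 operator+(Vector4&& a, const Vector4& b) {
    a += b;
    return std::move(a);
}

// Right operand is a temporary: accumulate into it instead.
inline Vector4 operator+(const Vector4& a, Vector4&& b) {
    b += a;
    return std::move(b);
}

// Both operands are temporaries: reuse the left one. This overload also
// removes the ambiguity mentioned in the question.
inline Vector4 operator+(Vector4&& a, Vector4&& b) {
    a += b;
    return std::move(a);
}
```

For v1 + v2 + (v3 + v4) + v5 this still performs two lvalue-plus-lvalue additions, as the answer points out. Returning by value means no reference can outlive the temporary it refers to, and for a small struct of four floats the copy on return is cheap.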
