Consider the below function,
public static int foo(int x){
return x + 5;
}
Now, let us call it,
int in = /*Input taken from the user*/;
int x = foo(10); // ... (1)
int y = foo(in); // ... (2)
Here, can the compiler change
int x = foo(10); // ... (1)
to
int x = 15; // ... (1)
by evaluating the function call during compile time since the input to the function is available during compile time ?
I understand this is not possible during the call marked (2) because the input is available only during run time.
I do not want to know a way of doing it in any specific language. I would like to know why this can or can not be a feature of a compiler itself.
C++ does have a method for this:
Have a read up on the 'constexpr' keyword in C++11, it allows compile time evaluation of functions.
They have a limitation: the function must be a return statement (not multiple lines of code), but can call other constexpr functions (C++14 does not have this limitation AFAIK).
static constexpr int foo(int x){
return x + 5;
}
EDIT:
Why a compiler might not evaluate a function (just my guess):
It might not be appropriate to remove a function by evaluating it without being told.
The function could be used in different compilation units, and with static/dynamic inputs: thus evaluating it in some circumstances and adding a call in other places.
This use would provide inconsistent execution times (especially on a deterministic platform like AVR) where timing may be important, or at least need to be predictable.
Also interrupts (and how the compiler interacts with them) may come into play here.
EDIT:
constexpr is actually stronger -- it requires that the compiler do this. The compiler is free to fold away functions without constexpr, but the programmer can't rely on it doing so.
Can you give an example in the case where the user would have benefited from this but the compiler chose not to do it ?
inline functions may, or may not resolve to constant expressions which could be optimized into the end result.
However, a constexpr guarantees it. An inline function cannot be used as a compile time constant whereas constexpr can allow you to formulate compile time functions and more so, objects.
A basic example where constexpr makes a guarantee that inline cannot.
constexpr int foo( int a, int b, int c ){
return a+b+c;
}
int array[ foo(1, 2, 3) ];
And the same as a simple object.
struct Foo{
constexpr Foo( int a, int b, int c ) : val(a+b+c){}
int val;
};
constexpr Foo foo( 1,2,4 );
int array[ foo.val ];
Unless foo.val is a compile time constant, the code above will not compile.
Even as just a function, an inline function has no guarantee. And the linker can also do inlining over multiple compilation units, after the syntax has been compiled (array bounds checked for integer constants).
This is kind of like meta-programming, but without the templates. Of course these examples do not do the topic justice, however very complex solutions would benefit from the ability to use objects and functional programming to achieve a result.
Yes, evaluation can happen during compile time. This comes under the heading of constant folding and function inlining, both of which are common optimizations for optimizing compilers.
Many languages do not have strong distinction between "compile time" and "run time", but the general rule is that the language defines an "execution model" which defines the behavior of any particular program with any particular input (or specifies that it is undefined). The compiler must produce an executable that can read any input and produce the corresponding output as defined by the execution model. What happens inside the executable doesn't matter -- as long as the externally viewed behavior is correct.
Here "input", "output" and "behavior" includes all possible interactions with the environment that are defined in the execution model, including timing effects.
Related
I want to implement simple image processing routine quite similar to Auto Levels, so need to precalculate thresholds, make LUT and then make histogram stretching/normalization applying LUT.
But my question is not about algorithm side, it is about using extern defined functions, because i need a couple of while cycles for LUT calculation and i think using extern is good for it.
I tried following examples from Halide sources and checked this question too
I use AOT compilation currently testing on PC(winx64), aiming for arm in future, and have the following generator code:
Var x("x"), y("y");
Func make_a_root{ "make_a_root" };
Buffer<bitType> Lut{256, "lut"};
make_a_root(x, y) = inputY(x, y);
ExternFuncArgument arg = make_a_root;
Func g;
g.define_extern("generateAutoLevelsLut", { arg }, UInt(8), 2, Halide::NameMangling::CPlusPlus);
g.compute_root();
inputY has Input<Buffer<uint8_t>> inputY{ "input_y", 2 }; type
First i just want to make it run the call, so function body makes nothing but print (can i define function in same cpp file as generator?)
int generateAutoLevelsLut(halide_buffer_t * input, halide_buffer_t * out)
{
printf("\nextern call\n");
return 0;
}
I tried default mangling with extern "C" too.
Never succeeded getting print message though, so my question is, why this happenin. Is it just misunderstanding on some syntax or are there any problem with calling extern function from generator code?
EDIT:
Added usage of extern like 'out(x,y) = g(x,y)' (lvalue should be actually used!) , and it started to make a call. Now struggling with host == NULL. Digging into bounds inference stuff.
EDIT 2:
I added basic bounds inference checks, now it does not crash.. The next problem i have now, is: Is it possible to make call to external function, without actually influencing output result in direct manner?
Let me concretise what i mean.
The generator code looks like following:
Buffer<bitType> lut{256, "lut"};
args[0] = inputY;
args[1] = lut;
g.define_extern("generateAutoLevelsLut", args, { UInt(8) }, 2, Halide::NameMangling::C);
outputY(x, y) = g(x, y); // Call line
g.compute_root();
outputY.compute_root();
Extern functon code fills second input 'lut' with some dummy LUT:
Halide::Runtime::Buffer<uint16_t> im2Buffer(*input2);
Mat im2Mat(Size(im2Buffer.width(), im2Buffer.height()), CVC_8U, im2Buffer.data(), im2Buffer.stride(1));
for (int i = 0; i < 256; i++)
im2Mat.at<uchar>(i) = i;
And if i comment the 'Call line' in generator, it optimizes away the call to extern at all.
I want to make something like:
Func lutRoot;
lutRoot(x) = lut(x); // to convert from Buffer
outputY(x, y) = autoLevelsPrecalcLut(inputY, lutRoot)(x, y);
And here lut is implicitly passed into extern and filled there. But it doesn't work, as well as other variants which ignore the modification of 'output'... or maybe this whole approach is wrong?
Any suggestions? Thanks
EDIT 3:
Solved task avoiding extern calls, replacing while cycles with argmin and RDom combo, but original question about extern still remains
That should work (or fail with a linker error if it wasn't going to). It's possible the Halide pipeline doesn't think it needs to call your extern function. E.g. does something use the result?
Alternatively, try stderr instead, just in case it's an output stream buffering issue. That extern function definition is likely to cause Halide to error out (because it doesn't reply to the bounds inference query), and erroring out calls abort by default, which would swallow things printed to stdout.
In the program below, function dec uses scanf to read an arbitrary input from the user.
dec is called from main and depending on the input it returns 1 or 0 and accordingly an operation will be performed. However, the value analysis indicates that y is always 0, even after the call to scanf. Why is that?
Note: the comments below apply to versions earlier than Frama-C 15 (Phosphorus, 20170501); in Frama-C 15, the Variadic plugin is enabled by default (and its short name is now -variadic).
Solution
Enable Variadic (-va) before running the value analysis (-val), it will eliminate the warning and the program will behave as expected.
Detailed explanation
Strictly speaking, Frama-C itself (the kernel) only does the parsing; it's up to the plug-ins themselves (e.g. Value/EVA) to evaluate the program.
From your description, I believe you must be using Value/EVA to analyze a program. I do not know exactly which version you are using, so I'll describe the behavior with Frama-C Silicon.
One limitation of ACSL (the specification language used by Frama-C) is that it is not currently possible to specify contracts for variadic functions such as scanf. Therefore, the specifications shipped with the Frama-C standard library are insufficient. You can notice this in the following program:
#include <stdio.h>
int d;
int main() {
scanf("%d", &d);
Frama_C_show_each(d);
return 0;
}
Running frama-c -val file.c will output, among other things:
...
[value] using specification for function scanf
FRAMAC_SHARE/libc/stdio.h:150:[value] warning: no \from part for clause 'assigns *__fc_stdin;' of function scanf
[value] Done for function scanf
[value] Called Frama_C_show_each({0})
...
That warning means that the specification is incorrect, which explains the odd behavior.
The solution in this case is to use the Variadic plug-in (-va, or -va-help for more details), which will specialize variadic calls and add specifications to them, thus avoiding the warning and behaving as expected. Here's the resulting code (-print) after running the Variadic plug-in on the example above:
$ frama-c -va file.c -print
[... lots of definitions from stdio.h ...]
/*# requires valid_read_string(format);
requires \valid(param0);
ensures \initialized(param0);
assigns \result, *__fc_stdin, *param0;
assigns \result
\from (indirect: *__fc_stdin), (indirect: *(format + (0 ..)));
assigns *__fc_stdin
\from (indirect: *__fc_stdin), (indirect: *(format + (0 ..)));
assigns *param0
\from (indirect: *__fc_stdin), (indirect: *(format + (0 ..)));
*/
int scanf_0(char const *format, int *param0);
int main(void)
{
int __retres;
scanf_0("%d",& d);
Frama_C_show_each(d);
__retres = 0;
return __retres;
}
In this example, scanf was specialized to scanf_0, with a proper ACSL annotation. Running EVA on this program will not emit any warnings and produce the expected output:
# frama-c -va file.c -val
...
[value] Done for function scanf_0
[value] Called Frama_C_show_each([-2147483648..2147483647])
...
Note: the GUI in Frama-C 14 (Silicon) does not allow the Variadic plug-in to be enabled (even after ticking it in the Analyses panel), so you must use the command-line in this case to obtain the expected result and avoid the warning. Starting from Frama-C 15 (Phosphorus, to be released in 2017), this won't be necessary: Variadic will be enabled by default and so your example would work from the start.
I compiled the following code on VS2013 (using "Release" mode optimization) and was dismayed to find the assembly of std::swap(v1,v2) was not the same as std::swap(v3,v4).
#include <vector>
#include <iterator>
#include <algorithm>
template <class T>
class WRAPPED_VEC
{
public:
typedef T value_type;
void push_back(T value) { m_vec.push_back(value); }
WRAPPED_VEC() = default;
WRAPPED_VEC(WRAPPED_VEC&& other) : m_vec(std::move(other.m_vec)) {}
WRAPPED_VEC& operator =(WRAPPED_VEC&& other)
{
m_vec = std::move(other.m_vec);
return *this;
}
private:
std::vector<T> m_vec;
};
int main (int, char *[])
{
WRAPPED_VEC<int> v1, v2;
std::generate_n(std::back_inserter(v1), 10, std::rand);
std::generate_n(std::back_inserter(v2), 10, std::rand);
std::swap(v1, v2);
std::vector<int> v3, v4;
std::generate_n(std::back_inserter(v3), 10, std::rand);
std::generate_n(std::back_inserter(v4), 10, std::rand);
std::swap(v3, v4);
return 0;
}
The std::swap(v3, v4) statement turns into "perfect" assembly. How can I achieve the same efficiency for std::swap(v1, v2)?
There are a couple of points to be made here.
1. If you don't know for absolutely certain that your way of calling swap is equivalent to the "correct" way of calling swap, you should always use the "correct" way:
using std::swap;
swap(v1, v2);
2. A really convenient way to look at the assembly for something like calling swap is to put the call by itself in a test function. That makes it easy to isolate the assembly:
void
test1(WRAPPED_VEC<int>& v1, WRAPPED_VEC<int>& v2)
{
using std::swap;
swap(v1, v2);
}
void
test2(std::vector<int>& v1, std::vector<int>& v2)
{
using std::swap;
swap(v1, v2);
}
As it stands, test1 will call std::swap which looks something like:
template <class T>
inline
swap(T& x, T& y) noexcept(is_nothrow_move_constructible<T>::value &&
is_nothrow_move_assignable<T>::value)
{
T t(std::move(x));
x = std::move(y);
y = std::move(t);
}
And this is fast. It will use WRAPPED_VEC's move constructor and move assignment operator.
However vector swap is even faster: It swaps the vector's 3 pointers, and if std::allocator_traits<std::vector<T>::allocator_type>::propagate_on_container_swap::value is true (and it is not), also swaps the allocators. If it is false (and it is), and if the two allocators are equal (and they are), then everything is ok. Otherwise Undefined Behavior happens.
To make test1 identical to test2 performance-wise you need:
friend
void
swap(WRAPPED_VEC<int>& v1, WRAPPED_VEC<int>& v2)
{
using std::swap;
swap(v1.m_vec, v2.m_vec);
}
One interesting thing to point out:
In your case, where you are always using std::allocator<T>, the friend function is always a win. However if your code allowed other allocators, possibly those with state, which might compare unequal, and which might have propagate_on_container_swap::value false (as std::allocator<T> does), then these two implementations of swap for WRAPPED_VEC diverge somewhat:
1. If you rely on std::swap, then you take a performance hit, but you will never have the possibility to get into undefined behavior. Move construction on vector is always well-defined and O(1). Move assignment on vector is always well-defined and can be either O(1) or O(N), and either noexcept(true) or noexcept(false).
If propagate_on_container_move_assignment::value is false, and if the two allocators involved in a move assignment are unequal, vector move assignment will become O(N) and noexcept(false). Thus a swap using vector move assignment will inherit these characteristics. However, no matter what, the behavior is always well-defined.
2. If you overload swap for WRAPPED_VEC, thus relying on the swap overload for vector, then you expose yourself to the possibility of undefined behavior if the allocators compare unequal and have propagate_on_container_swap::value equal to false. But you pick up a potential performance win.
As always, there are engineering tradeoffs to be made. This post is meant to alert you to the nature of those tradeoffs.
PS: The following comment is purely stylistic. All capital names for class types are generally considered poor style. It is tradition that all capital names are reserved for macros.
The reason for this is that std::swap does have an optimized overload for type std::vector<T> (see right click -> go to definition). To make this code work fast for your wrapper, follow instructions found on cppreference.com about std::swap:
std::swap may be specialized in namespace std for user-defined types,
but such specializations are not found by ADL (the namespace std is
not the associated namespace for the user-defined type). The expected
way to make a user-defined type swappable is to provide a non-member
function swap in the same namespace as the type: see Swappable for
details.
I was playing with bind and I was thinking, are lambdas as expensive as function pointers?
What I mean is, as I understand lambdas, they are syntactic sugar for functors and bind is similar. However, if you do this:
#include<functional>
#include<iostream>
void fn2(int a, int b)
{
std::cout << a << ", " << b << std::endl;
}
void fn1(int a, int b)
{
//auto bound = std::bind(fn2, a, b);
//static auto bound = std::bind(fn2, a, b);
//auto bound = [&]{ fn2(a, b); };
static auto bound = [&]{ fn2(a, b); };
bound();
}
int main()
{
fn1(3, 4);
fn1(1, 2);
return 0;
}
Now, if I were to use the 1st auto bound = std::bind(fn2, a, b);, I get an output of 3, 4
1, 2, the 2nd I get 3, 4
3, 4. The 3rd and 4th I get output like the 1st.
Now I get why the 1st and 2nd work that way, they are getting initialised at the beginning of the function call (the static one, only the 1st time it is called). However, 3 and 4 seem to have compiler magic going on where the generated functors are not really creating references to the enclosing scope's variables, but are actually latching on to the symbols whether or not it is initialised only the first time or every time.
Can someone clarify what is actually happening here?
Edit: What I was missing is using static auto bound = std::bind(fn2, std::ref(a), std::ref(b)); to have it work as the 4th option.
You have this code:
static auto bound = [&]{ fn2(a, b); };
Assignment is done only first time you are invoking this function because it's static. So in fact it's called only once. Compiler creates closure when you are making lambdas, so references to a and b from first call to fn1 was captured. It's very risky. It may lead to dangling references. I'm surprised it didn't crashed since you are making closure from function parameters passed by value - to local variables.
I recommend this excellent article about lambdas: http://www.cprogramming.com/c++11/c++11-lambda-closures.html .
As a general rule, only use [&] lambdas when your closure is going to go away by the end of the current scope.
If it is going to outlast the current scope, and you need by-reference, explicitly capture the things you are going to capture, or create local pointers to the things you are going to capture and capture them by-value.
In your case, your static lambda code is full of undefined behavior, as you [&] capture a and b in the first call, then use it in the second call.
In theory, the compiler could rewrite your code to capture a and b by value instead of by reference, then call that every time, because the only difference between that implementation and the one you wrote occurs when the behavior is undefined, and the result will be much faster.
It could do a more efficient job by ignoring your static completely, as the entire state of your static object is undefined after you leave scope the first time you call, and the construction has no visible side effects.
To fix your problem with the lambdas, use [=] or [a,b] to introduce the lambda, and it will capture the a and b by value. I prefer to capture state explicitly on lambdas when I expect the lambda to persist longer than the current block.
first of all, I apologize for the overly verbose question. I couldn't think of any other way to accurately summarize my problem... Now on to the actual question:
I'm currently experimenting with C++0x rvalue references... The following code produces unwanted behavior:
#include <iostream>
#include <utility>
struct Vector4
{
float x, y, z, w;
inline Vector4 operator + (const Vector4& other) const
{
Vector4 r;
std::cout << "constructing new temporary to store result"
<< std::endl;
r.x = x + other.x;
r.y = y + other.y;
r.z = z + other.z;
r.w = w + other.w;
return r;
}
Vector4&& operator + (Vector4&& other) const
{
std::cout << "reusing temporary 2nd operand to store result"
<< std::endl;
other.x += x;
other.y += y;
other.z += z;
other.w += w;
return std::move(other);
}
friend inline Vector4&& operator + (Vector4&& v1, const Vector4& v2)
{
std::cout << "reusing temporary 1st operand to store result"
<< std::endl;
v1.x += v2.x;
v1.y += v2.y;
v1.z += v2.z;
v1.w += v2.w;
return std::move(v1);
}
};
int main (void)
{
Vector4 r,
v1 = {1.0f, 1.0f, 1.0f, 1.0f},
v2 = {2.0f, 2.0f, 2.0f, 2.0f},
v3 = {3.0f, 3.0f, 3.0f, 3.0f},
v4 = {4.0f, 4.0f, 4.0f, 4.0f},
v5 = {5.0f, 5.0f, 5.0f, 5.0f};
///////////////////////////
// RELEVANT LINE HERE!!! //
///////////////////////////
r = v1 + v2 + (v3 + v4) + v5;
return 0;
}
results in the output
constructing new temporary to store result
constructing new temporary to store result
reusing temporary 1st operand to store result
reusing temporary 1st operand to store result
while I had hoped for something like
constructing new temporary to store result
reusing temporary 1st operand to store result
reusing temporary 2nd operand to store result
reusing temporary 2nd operand to store result
After trying to re-enact what the compiler was doing (I'm using MinGW G++ 4.5.2 with option -std=c++0x in case it matters), it actually seems quite logical. The standard says that arithmetic operations of equal precedence are evaluated/grouped left-to-right (why I assumed right-to-left I don't know, I guess it's more intuitive to me). So what happened here is that the compiler evaluated the sub-expression (v3 + v4) first (since it's in parentheses?), and then began matching the operations in the expression left-to-right against the operator overloads, resulting in a call to Vector4 operator + (const Vector4& other) for the sub-expression v1 + v2. If I want to avoid the unnecessary temporary, I'd have to make sure that no more than one lvalue operand appears to the immediate left of any parenthesized sub-expression, which is counter-intuitive to anyone using this "library" and innocently expecting optimal performance (as in minimizing the creation of temporaries).
(I'm aware that there's ambiguity in my code regarding operator + (Vector4&& v1, const Vector4& v2) and operator + (Vector4&& other) when (v3 + v4) is to be added to the result of v1 + v2, resulting in a warning. But it's harmless in my case and I don't want to add yet another overload for two rvalue reference operands - anyone know if there's a way to disable this warning in gcc?)
Long story short, my question boils down to: Is there any way or pattern (preferably compiler-independent) this vector class could be rewritten to enable arbitrary use of parentheses in expressions that still results in the "optimal" choice of operator overloads (optimal in terms of "performance", i.e. maximizing the binding to rvalue references)? Perhaps I'm asking for too much though and it's impossible... if so, then that's fine too. I just want to make sure I'm not missing anything.
Many thanks in advance
Addendum
First thanks to the quick responses I got, within minutes (!) - I really should have started posting here sooner...
It's becoming tedious replying in the comments, so I think a clarification of my intent with this class design is in order. Maybe you can point me to a fundamental conceptual flaw in my thought process if there is one.
You may notice that I don't hold any resources in the class like heap memory. Its members are only scalar types even. At first sight this makes it a suspect candidate for move-semantics based optimizations (see also this question that actually helped me a great deal grasping the concepts behind rvalue references).
However, since the classes this one is supposed to be a prototype for will be used in a performance-critical context (a 3D engine to be precise), I want to optimize every little thing possible. Low-complexity algorithms and maths-related techniques like look-up tables should of course make up the bulk of the optimizations as anything else would simply be addressing the symptoms and not eradicating the real reason for bad performance. I am well aware of that.
With that out of the way, my intent here is to optimize algebraic expressions with vectors and matrices that are essentially plain-old-data structs without pointers to data in them (mainly due to the performance drawbacks you get with data on the heap [having to dereference additional pointers, cache considerations etc.]).
I don't care about move-assignment or construction, I just don't want more temporaries being created during the evaluation of a complicated algebraic expression than absolutely necessary (usually just one or two, e.g. a matrix and a vector).
Those are my thoughts that might be erroneous. If they are, please correct me:
To achieve this without relying on RVO, return-by-reference is necessary (again: keep in mind I don't have remote resources, only scalar data members).
Returning by reference makes the function-call expression an lvalue, implying the returned object is not a temporary, which is bad, but returning by rvalue reference makes the function-call expression an xvalue (see 3.10.1), which is okay in the context of my approach (see 4)
Returning by reference is dangerous, because of the possibly short lifetime of objects, but:
temporaries are guaranteed to live until the end of the evaluation of the expression they were created in, therefore:
making it safe to return by reference from those operators that take at least one rvalue-reference as their argument, if the object referenced by this rvalue reference argument is the one being returned by reference. Therefore:
Any arbitrary expression that only employs binary operators can be evaluated by creating only one temporary when not more than one PoD-like type is involved, and the binary operations don't require a temporary by nature (like matrix multiplication)
(Another reason to return by rvalue-reference is because it behaves like returning by value in terms of rvalue-ness of the function-call expression; and it's required for the operator/function-call expression to be an rvalue in order to bind to subsequent calls to operators that take rvalue references. As stated in (2), calls to functions that return by reference are lvalues, and would therefore bind to operators with the signature T operator+(const T&, const T&), resulting in the creation of an unnecessary temporary)
I could achieve the desired performance by using a C-style approach of functions like add(Vector4 *result, Vector4 *v1, Vector4 *v2), but come on, we're living in the 21st century...
In summary, my goal is creating a vector class that achieves the same performance as the C-approach using overloaded operators. If that in itself is impossible, than I guess it can't be helped. But I'd appreciate if someone could explain to me why my approach is doomed to fail (the left-to-right operator evaluation issue that was the initial reason for my post aside, of course).
As a matter of fact, I've been using the "real" vector class this one is a simplification of for a while without any crashes or corrupted memory so far. And in fact, I never actually return local objects as references, so there shouldn't be any problems. I dare say what I'm doing is standard-compliant.
Any help on the original issue would of course be appreciated as well!
many thanks for all the patience again
You should not return an rvalue reference, you should return a value. In addition, you should not specify both a member and a free operator+. I'm amazed that even compiled.
Edit:
r = v1 + v2 + (v3 + v4) + v5;
How could you possibly only have one temporary value when you're performing two sub-computations? That's just impossible. You can't re-write the Standard and change this.
You will just have to trust your users to do something not completely stupid, like write the above line of code, and expect to have just one temporary.
I recommend modeling your code after the basic_string operator+() found in chapter 21 of N3225.