I wrote a simple program:
constexpr int strlen_c(char const* s)
{
    return *s ? 1 + strlen_c(s + 1) : 0;
}

int main()
{
    return strlen_c("hello world");
}
I expected that the compiler optimizes the function and evaluates its result in compile time. But actually the generated machine code evaluates the result in a loop:
    mov edx, offset aHelloWorld ; "hello world"
loc_408D00:
    add edx, 1
    mov eax, edx
    sub eax, offset aHelloWorld ; "hello world"
    cmp byte ptr [edx], 0
    jnz short loc_408D00
    leave
    retn
The program is compiled with g++ 5.3 with the flags -std=c++11 -Ofast -O2. I get the same result with Visual Studio 2013 and with g++ 4.9.
Question: why couldn't the compiler optimize the given code?
A constexpr function is not necessarily evaluated at compile time. However, it must be evaluated at compile time if it is used in a context that requires a constant expression. So the following will work regardless of compiler optimizations:
int main()
{
    constexpr auto len = strlen_c("hello world");
    return len;
}
Following is the assembly generated for the above code:
main:
    mov eax, 11
    ret
I have trouble compiling assembly code (NASM).
On Linux (elf32) it builds fine with g++, but when I try to build it with i686-w64-mingw32-g++ (for Win32) it fails.
My build.sh script:
#!/bin/bash
nasm -fwin32 wct.asm
i686-w64-mingw32-g++ -m32 -O2 -Wall -fno-exceptions -ffloat-store -ffast-math -fno-rounding-math -fno-signaling-nans -fcx-limited-range -fno-math-errno -funsafe-math-optimizations -fassociative-math -freciprocal-math -ffinite-math-only -fno-signed-zeros -fno-trapping-math -frounding-math -fsingle-precision-constant -fcx-fortran-rules -fno-rtti -mfpmath=387 -mfancy-math-387 -fno-ident -fmerge-all-constants -mpreferred-stack-boundary=2 -falign-functions=1 -falign-jumps=1 -falign-loops=1 -fno-unroll-loops -fno-math-errno -s main.cpp wct.obj -o wct.exe
strip --strip-unneeded wct.exe
There is assembly code:
[bits 32]
section .text
global wct
wct:
    mov esi, [esp+4]
    mov edi, esi
    mov ecx, [esp+8]
@L:
    lodsw
    sub ax, 04141h
    cmp al, 0Fh
    jne @F
    dec al
    jmp @E
@F:
    cmp al, 0Eh
    jne @E
    inc al
@E:
    mov bx, ax
    shr bx, 8
    cmp bl, 0Fh
    jne @@F
    dec bl
    jmp @@E
@@F:
    cmp bl, 0Eh
    jne @@E
    inc bl
@@E:
    shl al, 4
    add ax, bx
    stosb
    loop @L
    ret
main.cpp:
#include <fstream>
using namespace std;
extern "C" int wct(char* buff, int N);
#define N 1024*1024
char buff[N];
ifstream in;
ofstream out;
int size;
int main(int argc, char* argv[]) {
    if (argc == 1) return 0;
    in.open(argv[1], ios_base::in | ios_base::binary);
    if (argc >= 3)
        out.open(argv[2], ios_base::out | ios_base::binary);
    if (in.is_open())
    {
        while (!in.eof())
        {
            in.read((char *)&buff, sizeof buff);
            size = in.gcount()/2;
            wct((char *)&buff, size);
            if (out.is_open())
                out.write((char *)&buff, size);
            else
            {
                out.close();
            }
        }
    }
    in.close();
    out.close();
    return 0;
}
I am obviously doing something wrong, because I always get the same error when using the build.sh script:
/tmp/cc3SD7dA.o:main.cpp:(.text.startup+0x90): undefined reference to `wct'
collect2: error: ld returned 1 exit status
How can I fix that?
On Windows the GCC compiler expects a leading underscore in external symbols. So change all wct in the asm file to _wct.
If you want to test the program on both Windows and Linux, you can "globalize" two consecutive labels, wct and _wct:
...
global wct
global _wct
...
wct:
_wct:
...
Linux gets wct without the underscore and Windows gets it with one.
BTW: the assembly procedure is a C function and has to follow the CDECL calling convention. It may freely change EAX, ECX, and EDX (caller-saved). The other registers (EBX, ESI, EDI, EBP) must be returned unchanged; if the function needs to use them, it has to save and restore them (callee-saved):
wct:
_wct:
    push esi ; sp += 4
    push edi ; sp += 4
    push ebx ; sp += 4
    ; ======
    ; sp += 12
    mov esi, [esp+16]
    mov edi, esi
    mov ecx, [esp+20]
    ...
    pop ebx
    pop edi
    pop esi
    ret
I am new to 64-bit assembly coding, so I tried some simple programs.
C program:
#include <stdio.h>
extern double bla();
double x = 0;
int main() {
    x = bla();
    printf(" %f", x);
    return 0;
}
Assembly:
section .data
section .text
global bla
bla:
    mov rax, 10
    movq xmm0, rax
    ret
The result was always 0.0 instead of 10.0.
But when I do it without an immediate, it works fine:
#include <stdio.h>
extern double bla(double y);
double x = 0;
double a = 10;
int main() {
    x = bla(a);
    printf("add returned %f", x);
    return 0;
}
section .data
section .text
global bla
bla:
    movq rax, xmm0
    movq xmm0, rbx ; xmm0 = 0 now
    movq xmm0, rax ; xmm0 = 10 now
    ret
Do I need a different instruction to load an immediate into a 64-bit register?
The problem here was that the OP was trying to move 10 into a floating-point register with the following code:
mov rax,10
movq xmm0,rax
That cannot work, since movq into xmm0 assumes that the bit-pattern of the source is already in floating-point format - and of course it isn't: it's an integer.
Michael Petch's suggestion was to use the (NASM) assembler's floating-point converter as follows:
mov rax,__float64__(10.0)
movq xmm0,rax
That then produces the expected output.
I am using a templated class in c++ and I am planning on ensuring compatibility with doubles and mpfr floats. The only division that occurs in the program is division by 2. The behavior for doubles and mpfr floats for division by 2 should be different because, in mpfr, I have direct access to the exponent.
Question: What do you suggest to result in the most efficient compiled code?
I expect that a run-time solution checking the type of the templated variable would be inefficient.
Boost's mpfr wrapper does not seem useful because it doesn't seem to use the mpfr_div_2ui command and would, instead, divide by the mpfr float with a value of 2. I expect this to be slower than directly changing the exponent.
I could use an overloaded function to handle the two cases, mpfr floats and doubles.
I could use some user-set #define flag that the user would need to set to use mpfr data types.
Are there any other suggestions?
I'd check whether number<mpfr_floatXXX> doesn't already detect the optimization.
Boost's mpfr wrapper does not seem useful because it doesn't seem to use the mpfr_div_2ui command and would, instead, divide by the mpfr float with a value of 2. I expect this to be slower than directly changing the exponent.
That expectation is not well founded. Simply check:
#include <boost/multiprecision/mpfr.hpp>

int main() {
    using namespace boost::multiprecision;
    mpfr_float_50 n("787878787878");
    n /= 2;
}
Compiles into
mov rax, QWORD PTR fs:40
mov QWORD PTR [rsp+232], rax
xor eax, eax
lea rdi, [rsp+16]
call mpfr_init2
cmp QWORD PTR [rsp+40], 0
xor ecx, ecx
mov edx, 10
lea rdi, [rsp+16]
call mpfr_set_str
test eax, eax
cmp QWORD PTR [rsp+40], 0
lea rsi, [rsp+16]
xor ecx, ecx
mov edx, 2
mov rdi, rsi
call mpfr_div_ui
So, it isn't nearly as bad as you made it seem.
Implementation
Here is my non-generic implementation:
mp::mpfr_float_50 div_2ui(mp::mpfr_float_50 const& f, unsigned i) {
    mp::mpfr_float_50 r;
    ::mpfr_div_2ui(
        r.backend().data(),
        f.backend().data(),
        i,
        MPFR_RNDN);
    return r;
}
A generic implementation would look like:
template <typename T, typename Enable = void> struct is_mpfr : boost::mpl::false_ {};

template <unsigned digits10, mp::mpfr_allocation_type AllocationType, mp::expression_template_option ET>
struct is_mpfr<
        mp::number<mp::mpfr_float_backend<digits10, AllocationType>, ET >
    > : boost::mpl::true_
{};

template <typename T>
T div_2ui_impl(T f, unsigned i, boost::mpl::false_) {
    while (i--)
        f /= 2;
    return f;
}

template <typename Mpfr>
Mpfr div_2ui_impl(Mpfr f, unsigned i, boost::mpl::true_) {
    std::cout << "-- optimized --";
    Mpfr r;
    ::mpfr_div_2ui(r.backend().data(), f.backend().data(), i, MPFR_RNDN);
    return r;
}

template <typename T>
T div_2ui(T const& f, unsigned i) {
    return div_2ui_impl(f, i, is_mpfr<T>{});
}
Live On Coliru
template <typename T>
void test() {
    T n("787878787878");
    n = arith::div_2ui(n, 1);
    std::cout << __FUNCTION__ << ": " << n << "\n";
}

int main() {
    std::cout << std::fixed;
    test<mp::mpfr_float_50>();
    test<mp::mpfr_float_100>();
    test<mp::cpp_int>();
    test<mp::cpp_dec_float_100>();
    test<mp::number<mp::gmp_int> >();
    test<mp::mpf_float_1000>();
}
Prints
-- optimized --test: 393939393939.000000
-- optimized --test: 393939393939.000000
test: 393939393939
test: 393939393939.000000
test: 393939393939
test: 393939393939.000000
Consider the following
while (true)
{
    if (x > 5)
        // Run function A
    else
        // Run function B
}
If x is always less than 5, does the Visual Studio compiler do any optimization, i.e. never check whether x is larger than 5 and always run function B?
It depends on whether or not the compiler "knows" that x will always be less than 5.
Yes, nearly all modern compilers are capable of removing the branch. But the compiler needs to be able to prove that the branch will always go one direction.
Here's an example that can be optimized:
int x = 1;
if (x > 5)
    printf("Hello\n");
else
    printf("World\n");
The disassembly is:
sub rsp, 40 ; 00000028H
lea rcx, OFFSET FLAT:??_C@_06DKJADKFF@World?6?$AA@
call QWORD PTR __imp_printf
x = 1 is provably less than 5. So the compiler is able to remove the branch.
But in this example, even if you always input less than 5, the compiler doesn't know that. It must assume any input.
int x;
cin >> x;
if (x > 5)
    printf("Hello\n");
else
    printf("World\n");
The disassembly is:
    cmp DWORD PTR x$[rsp], 5
    lea rcx, OFFSET FLAT:??_C@_06NJBIDDBG@Hello?6?$AA@
    jg SHORT $LN5@main
    lea rcx, OFFSET FLAT:??_C@_06DKJADKFF@World?6?$AA@
$LN5@main:
    call QWORD PTR __imp_printf
The branch stays. But note that it actually hoisted the function call out of the branch. So it really optimized the code down to something like this:
const char *str = "Hello\n";
if (!(x > 5))
    str = "World\n";
printf(str);
If I have the following C++ code to compare two 128-bit unsigned integers, with inline amd-64 asm:
struct uint128_t {
    uint64_t lo, hi;
};

inline bool operator< (const uint128_t &a, const uint128_t &b)
{
    uint64_t temp;
    bool result;
    __asm__(
        "cmpq %3, %2;"
        "sbbq %4, %1;"
        "setc %0;"
        : // outputs:
        /*0*/ "=r,1,2"(result),
        /*1*/ "=r,r,r"(temp)
        : // inputs:
        /*2*/ "r,r,r"(a.lo),
        /*3*/ "emr,emr,emr"(b.lo),
        /*4*/ "emr,emr,emr"(b.hi),
        "1"(a.hi));
    return result;
}
Then it will be inlined quite efficiently, but with one flaw. The return value is done through the "interface" of a general register with a value of 0 or 1. This adds two or three unnecessary extra instructions and detracts from a compare operation that would otherwise be fully optimized. The generated code will look something like this:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
setc al
movzx eax, al
test eax, eax
jnz is_lessthan
If I use "sbb %0,%0" with an "int" return value instead of "setc %0" with a "bool" return value, there are still two extra instructions:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
sbb eax, eax
test eax, eax
jnz is_lessthan
What I want is this:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
jc is_lessthan
GCC extended inline asm is wonderful, otherwise. But I want it to be just as good as an intrinsic function would be, in every way. I want to be able to directly return a boolean value in the form of the state of a CPU flag or flags, without having to "render" it into a general register.
Is this possible, or would GCC (and the Intel C++ compiler, which also allows this form of inline asm to be used) have to be modified or even refactored to make it possible?
Also, while I'm at it — is there any other way my formulation of the compare operator could be improved?
Here we are almost 7 years later, and YES, gcc finally added support for "outputting flags" (added in 6.1.0, released ~April 2016). The detailed docs are in the GCC manual (Extended Asm, "Flag Output Operands"), but in short, it looks like this:
/* Test if bit 0 is set in 'value' */
char a;
asm("bt $0, %1"
    : "=@ccc" (a)
    : "r" (value) );
if (a)
    blah;
To understand =@ccc: the output constraint (which requires =) is @cc followed by the condition code to use (in this case c, referring to the carry flag).
Ok, this may not be an issue for your specific case anymore (since gcc now supports comparing 128bit data types directly), but (currently) 1,326 people have viewed this question. Apparently there's some interest in this feature.
Now I personally favor the school of thought that says don't use inline asm at all. But if you must, yes you can (now) 'output' flags.
FWIW.
I don't know a way to do this. You may or may not consider this an improvement:
inline bool operator< (const uint128_t &a, const uint128_t &b)
{
    register uint64_t temp = a.hi;
    __asm__(
        "cmpq %2, %1;"
        "sbbq $0, %0;"
        : // outputs:
        /*0*/ "=r"(temp)
        : // inputs:
        /*1*/ "r"(a.lo),
        /*2*/ "mr"(b.lo),
        "0"(temp));
    return temp < b.hi;
}
It produces something like:
mov rdx, [r14]
mov rax, [r14+8]
cmp rdx, [r15]
sbb rax, 0
cmp rax, [r15+8]
jc is_lessthan