Templates and division - performance

I am using a templated class in C++ and I am planning on ensuring compatibility with both doubles and MPFR floats. The only division that occurs in the program is division by 2. The behavior for doubles and MPFR floats should differ because, in MPFR, I have direct access to the exponent.
Question: What do you suggest to result in the most efficient compiled code?
I expect that a run-time solution checking the type of the templated variable would be inefficient.
Boost's mpfr wrapper does not seem useful because it doesn't seem to use the mpfr_div_2ui command and would, instead, divide by the mpfr float with a value of 2. I expect this to be slower than directly changing the exponent.
I could use an overloaded function to deal with the two cases of mpfr floats and doubles.
I could use some #define flag that the user would need to set in order to use mpfr data types.
Are there any other suggestions?

I'd check whether number<mpfr_floatXXX> doesn't already detect the optimization.
Boost's mpfr wrapper does not seem useful because it doesn't seem to use the mpfr_div_2ui command and would, instead, divide by the mpfr float with a value of 2. I expect this to be slower than directly changing the exponent.
That expectation is not well founded. Simply check:
#include <boost/multiprecision/mpfr.hpp>

int main() {
    using namespace boost::multiprecision;
    mpfr_float_50 n("787878787878");
    n /= 2;
}
Compiles into
mov rax, QWORD PTR fs:40
mov QWORD PTR [rsp+232], rax
xor eax, eax
lea rdi, [rsp+16]
call mpfr_init2
cmp QWORD PTR [rsp+40], 0
xor ecx, ecx
mov edx, 10
lea rdi, [rsp+16]
call mpfr_set_str
test eax, eax
cmp QWORD PTR [rsp+40], 0
lea rsi, [rsp+16]
xor ecx, ecx
mov edx, 2
mov rdi, rsi
call mpfr_div_ui
So, it isn't nearly as bad as you made it seem.
Implementation
Here is my non-generic implementation:
mp::mpfr_float_50 div_2ui(mp::mpfr_float_50 const& f, unsigned i) {
    mp::mpfr_float_50 r;
    ::mpfr_div_2ui(
        r.backend().data(),
        f.backend().data(),
        i,
        MPFR_RNDN);
    return r;
}
A generic implementation would look like:
template <typename T, typename Enable = void> struct is_mpfr : boost::mpl::false_ {};

template <unsigned digits10, mp::mpfr_allocation_type AllocationType, mp::expression_template_option ET>
struct is_mpfr<
        mp::number<mp::mpfr_float_backend<digits10, AllocationType>, ET>
    > : boost::mpl::true_
{};

template <typename T>
T div_2ui_impl(T f, unsigned i, boost::mpl::false_) {
    while (i--)
        f /= 2;
    return f;
}

template <typename Mpfr>
Mpfr div_2ui_impl(Mpfr f, unsigned i, boost::mpl::true_) {
    std::cout << "-- optimized --";
    Mpfr r;
    ::mpfr_div_2ui(r.backend().data(), f.backend().data(), i, MPFR_RNDN);
    return r;
}

template <typename T>
T div_2ui(T const& f, unsigned i) {
    return div_2ui_impl(f, i, is_mpfr<T>{});
}
Live On Coliru
template <typename T>
void test() {
    T n("787878787878");
    n = arith::div_2ui(n, 1);
    std::cout << __FUNCTION__ << ": " << n << "\n";
}

int main() {
    std::cout << std::fixed;
    test<mp::mpfr_float_50>();
    test<mp::mpfr_float_100>();
    test<mp::cpp_int>();
    test<mp::cpp_dec_float_100>();
    test<mp::number<mp::gmp_int>>();
    test<mp::mpf_float_1000>();
}
Prints
-- optimized --test: 393939393939.000000
-- optimized --test: 393939393939.000000
test: 393939393939
test: 393939393939.000000
test: 393939393939
test: 393939393939.000000

Related

How do I enable /SAFESEH with assembly code / SEH performance

I've developed a little program that tests the performance of 32-bit Windows structured exception handling. To keep the overhead minimal in contrast to the rest, I wrote the code generating and filtering the exception in assembly.
This is the C++-code:
#include <Windows.h>
#include <iostream>
using namespace std;

bool __fastcall getPointerFaultSafe( void *volatile *from, void **to );

int main()
{
    auto getThreadTimes = []( LONGLONG &kt, LONGLONG &ut )
    {
        union
        {
            FILETIME ft;
            LONGLONG ll;
        } creationTime, exitTime, kernelTime, userTime;
        GetThreadTimes( GetCurrentThread(), &creationTime.ft, &exitTime.ft, &kernelTime.ft, &userTime.ft );
        kt = kernelTime.ll;
        ut = userTime.ll;
    };
    LONGLONG ktStart, utStart;
    getThreadTimes( ktStart, utStart );
    size_t const COUNT = 100'000;
    void *pv;
    for( size_t c = COUNT; c; --c )
        getPointerFaultSafe( nullptr, &pv );
    LONGLONG ktEnd, utEnd;
    getThreadTimes( ktEnd, utEnd );
    double ktNsPerException = (ktEnd - ktStart) * 100.0 / COUNT,
           utNsPerException = (utEnd - utStart) * 100.0 / COUNT;
    cout << "kernel-time per exception: " << ktNsPerException << "ns" << endl;
    cout << "user-time per exception: " << utNsPerException << "ns" << endl;
    return 0;
}
This is the assembly-code:
        .686P
        PUBLIC ?getPointerFaultSafe@@YI_NPCRAXPAPAX@Z
        PUBLIC sehHandler
        .SAFESEH sehHandler
sehHandler PROTO

_DATA SEGMENT
byebyeOffset dd 0
_DATA ENDS

exc_ctx_eax = 0b0h
exc_ctx_eip = 0b8h

_TEXT SEGMENT

?getPointerFaultSafe@@YI_NPCRAXPAPAX@Z PROC
        ASSUME ds:_DATA
        push OFFSET sehHandler
        push dword ptr fs:0
        mov dword ptr fs:0, esp
        mov byebyeOffset, OFFSET byebye - OFFSET mightfail
        mov al, 1
mightfail:
        mov ecx, dword ptr [ecx]
        mov dword ptr [edx], ecx
byebye:
        mov edx, dword ptr [esp]
        mov dword ptr fs:0, edx
        add esp, 8
        ret 0
?getPointerFaultSafe@@YI_NPCRAXPAPAX@Z ENDP

sehHandler PROC
        mov eax, dword ptr [esp + 12]
        mov dword ptr [eax + exc_ctx_eax], 0
        mov edx, byebyeOffset
        add [eax + exc_ctx_eip], edx
        mov eax, 0
        ret 0
sehHandler ENDP

_TEXT ENDS
END
How do I make the asm module of my program /SAFESEH-compatible?
Why does this program consume so much userland CPU time? The library code called by the operating system once the exception begins to be handled should only have to save all the registers in the CONTEXT structure, fill the EXCEPTION_RECORD structure, and call the topmost exception filter, which in this case shifts execution two instructions further; when it returns, it will in my case restore all the registers and continue execution according to what I returned in EAX. All of that shouldn't take so long that almost 1/3 of the CPU time is spent in userland. That's about 2.3ms, i.e. when my old Ryzen 1800X is boosting on one core at 4GHz, about 5,200 clock cycles.
I'm using the byebyeOffset variable in my code to carry the distance between the unsafe instruction that might generate an access violation and the safe code afterwards. I'm initializing this variable before the unsafe instruction. It would be nice to have this offset statically, as an immediate, at the point where I add it to EIP in the exception filter sehHandler; but the offsets are scoped to getPointerFaultSafe. Of course, storing the offset and fetching it from the variable takes a negligible part of the overall computation time, but it would be nicer to have a clean solution.

C function access to R0 to R12 registers

I need to write a C function that returns the value of a specific hardware register, for example R0.
I am unclear from the GHS documentation how this is done with the macros provided by the GHS Compiler.
uint32_t readRegRx(void)
{
    uint32_t x = 0U;
    __asm("MOV Rx, ??");
    return x;
}
What is the syntax in the GHS compiler for referencing a local variable as an argument to an inline assembly instruction?
I've seen this in the GHS documentation:
asm int mulword(a,b)
{
%con a %con b
    mov r0,a
    mov r1,b
    mul r0,r1,r0
%con a %reg b
    mov r0,a
    mul r0,b,r0
%reg a %con b
    mov r0,b
    mul r0,a,r0
%reg a %reg b
    mul r0,a,b
%con a %mem b
    mov r0,a
    ldr r1,b
    mul r0,r1,r0
%mem a %con b
    ldr r0,a
    mov r1,b
    mul r0,r1,r0
%mem a %mem b
    ldr r0,a
    ldr r1,b
    mul r0,r1,r0
%error
}
But this isn't exactly what I want, I think. The example from the documentation above describes a function taking arguments, with the return value implicitly in R0.
In my case, what I want is a plain C function, with embedded inline assembly, that reads a register (R-anything) and stores the value in a local variable of the function.
I received information from GHS support addressing how to read hardware registers (Rn) within C functions, analogous to the extended GNU ARM features. The following applies to the GHS compiler, not the GNU compiler:
"For asm macro purposes (use within an asm macro), it's probably best to use the enhanced GNU asm syntax, and you'll want to turn on "Accept GNU asm statements" (--gnu_asm)."
int bar(void)
{
    int a;
    __asm__("mov %0, r0" : "=r"(a)); // copy r0 into a; replace r0 to suit
    // Stuff
    return a;
}
Alternative method:
asm void readR0(r0Val)
{
%reg r0Val
    mov r0Val,r0
}

void foo(void)
{
    register int regValReg = 0;
    readR0(regValReg);
    // Stuff
}

Why does __get_cpuid return all zeros for leaf=4?

I want to write a simple program which calls __get_cpuid to get the cache information:
#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
    int leaf = atoi(argv[1]);
    uint32_t eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(leaf, &eax, &ebx, &ecx, &edx))
    {
        printf("leaf=%d, eax=0x%x, ebx=0x%x, ecx=0x%x, edx=0x%x\n",
               leaf, eax, ebx, ecx, edx);
    }
    return 0;
}
First, I pass leaf as 2:
$ ./a.out 2
leaf=2, eax=0x76035a01, ebx=0xf0b2ff, ecx=0x0, edx=0xca0000
Since there is 0xff in ebx, it means I can get cache info from leaf=4 (refer here):
$ ./a.out 4
leaf=4, eax=0x0, ebx=0x0, ecx=0x0, edx=0x0
But this time, all return values are 0. Why can't I get valid information from __get_cpuid?
Looking at the linked reference for EAX=4, we see that ECX needs to be set to the "cache level to query (e.g. 0=L1D, 1=L2, or 0=L1D, 1=L1I, 2=L2)".
I couldn't quickly find any documentation on __get_cpuid, but a search did turn up the source code, where I noticed that you need to call __get_cpuid_count to have ECX set before the call to cpuid (otherwise you'll get whatever was left in ECX, and meaningless answers - mostly 0s, it seems).

constexpr keyword doesn't affect code generation

I wrote a simple program:
constexpr int strlen_c(char const* s)
{
    return *s ? 1 + strlen_c(s + 1) : 0;
}

int main()
{
    return strlen_c("hello world");
}
I expected that the compiler optimizes the function and evaluates its result in compile time. But actually the generated machine code evaluates the result in a loop:
mov edx, offset aHelloWorld ; "hello world"
loc_408D00:
add edx, 1
mov eax, edx
sub eax, offset aHelloWorld ; "hello world"
cmp byte ptr [edx], 0
jnz short loc_408D00
leave
retn
The program is compiled with g++ 5.3 with flags -std=c++11 -Ofast -O2. I obtain the same result with Visual Studio 2013 and g++ 4.9.
Question: why couldn't the compiler optimize the given code?
A constexpr function is not necessarily evaluated at compile time. However, it must be evaluated at compile time if used in a constexpr context. So the following will work regardless of compiler optimizations:
int main()
{
    constexpr auto len = strlen_c("hello world");
    return len;
}
The following is the assembly generated for the above code:
main:
    mov eax, 11
    ret
Demo

In GCC-style extended inline asm, is it possible to output a "virtualized" boolean value, e.g. the carry flag?

If I have the following C++ code to compare two 128-bit unsigned integers, with inline amd-64 asm:
struct uint128_t {
    uint64_t lo, hi;
};

inline bool operator< (const uint128_t &a, const uint128_t &b)
{
    uint64_t temp;
    bool result;
    __asm__(
        "cmpq %3, %2;"
        "sbbq %4, %1;"
        "setc %0;"
        : // outputs:
        /*0*/ "=r,1,2"(result),
        /*1*/ "=r,r,r"(temp)
        : // inputs:
        /*2*/ "r,r,r"(a.lo),
        /*3*/ "emr,emr,emr"(b.lo),
        /*4*/ "emr,emr,emr"(b.hi),
        "1"(a.hi));
    return result;
}
Then it will be inlined quite efficiently, but with one flaw. The return value is done through the "interface" of a general register with a value of 0 or 1. This adds two or three unnecessary extra instructions and detracts from a compare operation that would otherwise be fully optimized. The generated code will look something like this:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
setc al
movzx eax, al
test eax, eax
jnz is_lessthan
If I use "sbb %0,%0" with an "int" return value instead of "setc %0" with a "bool" return value, there's still two extra instructions:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
sbb eax, eax
test eax, eax
jnz is_lessthan
What I want is this:
mov r10, [r14]
mov r11, [r14+8]
cmp r10, [r15]
sbb r11, [r15+8]
jc is_lessthan
GCC extended inline asm is wonderful, otherwise. But I want it to be just as good as an intrinsic function would be, in every way. I want to be able to directly return a boolean value in the form of the state of a CPU flag or flags, without having to "render" it into a general register.
Is this possible, or would GCC (and the Intel C++ compiler, which also allows this form of inline asm to be used) have to be modified or even refactored to make it possible?
Also, while I'm at it — is there any other way my formulation of the compare operator could be improved?
Here we are almost 7 years later, and YES, gcc finally added support for "outputting flags" (added in 6.1.0, released ~April 2016). The detailed docs are here, but in short, it looks like this:
/* Test if bit 0 is set in 'value' */
char a;
asm("bt $0, %1"
    : "=@ccc" (a)
    : "r" (value) );
if (a)
    blah;
To understand =@ccc: the output constraint (which requires =) is @cc followed by the condition code to test (in this case c, to reference the carry flag).
Ok, this may not be an issue for your specific case anymore (since gcc now supports comparing 128-bit data types directly), but (currently) 1,326 people have viewed this question. Apparently there's some interest in this feature.
Now I personally favor the school of thought that says don't use inline asm at all. But if you must, yes you can (now) 'output' flags.
FWIW.
I don't know a way to do this. You may or may not consider this an improvement:
inline bool operator< (const uint128_t &a, const uint128_t &b)
{
    register uint64_t temp = a.hi;
    __asm__(
        "cmpq %2, %1;"
        "sbbq $0, %0;"
        : // outputs:
        /*0*/ "=r"(temp)
        : // inputs:
        /*1*/ "r"(a.lo),
        /*2*/ "mr"(b.lo),
        "0"(temp));
    return temp < b.hi;
}
It produces something like:
mov rdx, [r14]
mov rax, [r14+8]
cmp rdx, [r15]
sbb rax, 0
cmp rax, [r15+8]
jc is_lessthan
