sm level: 1.3 vs 2.0 performance

My code doesn't depend on the sm level; I can build it with sm_10 if I want. But when I built it with 1.3 instead of 2.0, as I had done before, I got a 1.25x speedup with no code changes!
sm20 -> 35 ms
sm13 -> 25 ms
After those results, I tried toggling every option in project settings -> CUDA settings, and I think I found the setting responsible for that speed:
sm13 with "no fast math generation" (fm = fast math from here on): 25 ms
sm13 with fm: 25 ms
sm20 without fm: 35 ms
sm20 with fm: 25 ms (the same result)
Why is this so? Does sm13 force the use of hardware (approximate) math while sm20 does not? Or is it only a coincidence, and the higher sm level simply performs worse when running programs built for a lower sm level?

In addition to compiling in release mode, as pointed out by @Robert Crovella, you should also consider that when you target sm_13 the compiler is able to simplify some of the floating-point math. sm_20 and later support precise division, precise square root, and denormals by default.
You can try disabling these features with the command-line options -ftz=true -prec-div=false -prec-sqrt=false. See the Best Practices Guide for more information.
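For example (a hedged sketch: the architecture flag and file names here are illustrative, not from the post), the switches can be passed directly on the nvcc command line:
nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false -o kernel kernel.cu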

Related

Can I make my compiler use fast-math on a per-function basis?

Suppose I have
template <bool UsesFastMath> void foo(float* data, size_t length);
and I want to compile one instantiation with -ffast-math (--use_fast_math for nvcc), and the other instantiation without it.
This can be achieved by instantiating each of the variants in a separate translation unit, and compiling each of them with a different command-line - with and without the switch.
My question is whether it's possible to indicate to popular compilers (*) to apply or not apply -ffast-math for individual functions - so that I'll be able to have my instantiations in the same translation unit.
Notes:
If the answer is "no", bonus points for explaining why not.
This is not the same question as this one, which is about turning fast-math on and off at runtime. I'm much more modest...
(*) By popular compilers I mean any of gcc, clang, msvc, icc, or nvcc (for GPU kernel code), about which you have that information.
In GCC you can declare functions like the following:
__attribute__((optimize("-ffast-math")))
double
myfunc(double val)
{
    return val / 2;
}
This is a GCC-only feature.
See a working example here -> https://gcc.gnu.org/ml/gcc/2009-10/msg00385.html
Note that GCC does not seem to verify the optimize() arguments, so typos like "-ffast-match" will be silently ignored.
As of CUDA 7.5 (the latest version I am familiar with, although CUDA 8.0 is currently shipping), nvcc does not support function attributes that allow programmers to apply specific compiler optimizations on a per-function basis.
Since optimization configurations set via command line switches apply to the entire compilation unit, one possible approach is to use as many different compilation units as there are different optimization configurations, as already noted in the question; source code may be shared and #include-ed from a common file.
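As a concrete illustration (a hedged sketch: the file names, the kernel body, and the explicit instantiations are assumptions, not from the question), the shared kernel can live in one header that two translation units instantiate under different nvcc switches:

// ---- foo_kernel.cuh: shared implementation, #include-ed by both .cu files ----
#pragma once
#include <cstddef>

template <bool UsesFastMath>
__global__ void foo(float* data, std::size_t length)
{
    std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length)
        data[i] = sinf(data[i]) / (data[i] + 1.0f);  // code generation depends on --use_fast_math
}

// ---- foo_fast.cu: built with "nvcc -c --use_fast_math foo_fast.cu" ----
// #include "foo_kernel.cuh"
// template __global__ void foo<true>(float*, std::size_t);

// ---- foo_precise.cu: built with "nvcc -c foo_precise.cu" ----
// #include "foo_kernel.cuh"
// template __global__ void foo<false>(float*, std::size_t);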
With nvcc, the command line switch --use_fast_math basically controls three areas of functionality:
Flush-to-zero mode is enabled (that is, denormal support is disabled)
Single-precision reciprocal, division, and square root are switched to approximate versions
Certain standard math functions are replaced by equivalent, lower-precision intrinsics
You can apply some of these changes with per-operation granularity by using appropriate intrinsics, others by using PTX inline assembly.
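A hedged sketch of what that per-operation control can look like (the kernel and variable names are illustrative, not from the post; the intrinsics themselves are documented in the CUDA math API):

__global__ void mixed_math(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];

    // Approximate (fast-math style) operations, chosen per call site:
    float fast_div = __fdividef(x, 3.0f);   // approximate single-precision division
    float fast_sin = __sinf(x);             // low-precision hardware sine

    // IEEE-rounded operations, even when --use_fast_math is enabled:
    float prec_div  = __fdiv_rn(x, 3.0f);   // round-to-nearest division
    float prec_sqrt = __fsqrt_rn(x);        // round-to-nearest square root

    // Per-operation flush-to-zero via PTX inline assembly (the .ftz qualifier):
    float ftz_prod;
    asm("mul.ftz.f32 %0, %1, %2;" : "=f"(ftz_prod) : "f"(x), "f"(x));

    out[i] = fast_div + fast_sin + prec_div + prec_sqrt + ftz_prod;
}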

Getting system time in VxWorks

Is there any way to get the system time in VxWorks besides tickGet() and tickAnnounce()? I want to measure the time between task switches of a specified task, but I think the precision of tickGet() is not good enough: the two tickGet() values at the beginning and the end of the hook function registered with taskSwitchHookAdd() are always the same!
If you are looking to time task switches, I would assume you need a timer with at least microsecond (us) resolution.
Usually, timers/clocks this fine-grained are only provided by the platform you are running on. If you are working on an embedded system, you can try reading through the manuals for your board support package (if there is one) to see if there are any functions provided to access the various timers on the board.
A more low-level solution would be to figure out which processor is running on your system and then write some simple assembly code to poll the processor's internal time base register (TBR). This might require a bit of research on the processor you are running on, but it can be done fairly easily.
If you are running on a PPC based processor, you can use the code below to read the TBR:
loop: mftbu rx #load most significant half from TBU
mftbl ry #load least significant half from TBL
mftbu rz #load from TBU again
cmpw rz,rx #see if 'old' = 'new'
bne loop #repeat if two values read from TBU are unequal
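For completeness, here is a hedged sketch of the same read expressed as GCC inline assembly in C, combining the two halves into one 64-bit value (the function name and register constraints are assumptions, not from the original answer):

#include <stdint.h>

static uint64_t read_timebase(void)
{
    uint32_t hi, lo, hi2;

    do {
        __asm__ volatile("mftbu %0" : "=r"(hi));   /* most significant half (TBU)  */
        __asm__ volatile("mftb  %0" : "=r"(lo));   /* least significant half (TBL) */
        __asm__ volatile("mftbu %0" : "=r"(hi2));  /* read TBU again               */
    } while (hi != hi2);                           /* retry if TBL rolled over     */

    return ((uint64_t)hi << 32) | lo;
}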
On an x86 based processor, you might consider using the RDTSC assembly instruction to read the Time Stamp Counter (TSC). On vxWorks, pentiumALib has some library functions (pentiumTscGet64() and pentiumTscGet32()) that will make reading the TSC easier using C.
source: http://www-inteng.fnal.gov/Integrated_Eng/GoodwinDocs/pdf/Sys%20docs/PowerPC/PowerPC%20Elapsed%20Time.pdf
Good luck!
It depends on what platform you are on, but if it is x86 then you can use:
pentiumTscGet64();
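A hedged sketch of how that might be used to time a code section (the header name and the exact pentiumTscGet64() prototype are assumptions based on the pentiumALib documentation; check your VxWorks version):

#include <vxWorks.h>
#include <stdio.h>
#include <pentiumLib.h>   /* assumed header for the pentiumALib routines */

void timeSection(void)
{
    long long start, end;

    pentiumTscGet64(&start);
    /* ... code to be measured, e.g. the body of your task-switch hook ... */
    pentiumTscGet64(&end);

    /* Divide by the CPU clock frequency (in Hz) to convert ticks to seconds. */
    printf("elapsed TSC ticks: %lld\n", end - start);
}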

feature request: an atomicAdd() function included in gwan.h

In the G-WAN KV options, KV_INCR_KEY will use the 1st field as the primary key.
That means an atomic increment function is already built into the G-WAN core to make this primary index work.
It would be good to expose this function for use by servlets, i.e. to include it in gwan.h.
That way, ANSI C newbies like me could benefit from it.
There was ample discussion about this on the old G-WAN forum, and people were invited to share their experiences with atomic operations in order to build a rich list of documented functions, platform by platform.
Atomic operations are not portable because they address the CPU directly. This means that the code for Intel x86 (32-bit) and AMD64/Intel 64 (64-bit) is different, and each platform (ARM, Power7, Cell, Motorola, etc.) has its own atomic instruction set.
Such a list was not published in the gwan.h file so far because the basic operations are easy to find (the GCC compiler offers several atomic intrinsics as C extensions), while the more sophisticated operations are less obvious (they need asm skills) and people will build them as they need them - for very specific uses in their code.
Software Engineering is always a balance between what can be made available at the lowest possible cost to entry (like the G-WAN KV store, which uses a small number of functions) and how it actually works (which is far less simple to follow).
So, beyond the obvious (incr/decr, set/get), to learn more about atomic operations, use Google, find CPU instruction sets manuals, and arm yourself with courage!
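To make the basic GCC route concrete, here is a hedged sketch of the atomic intrinsics mentioned above (the counter and the operations shown are illustrative; none of this is part of gwan.h):

#include <stdio.h>

static volatile long counter = 0;

int main(void)
{
    /* atomic increment and decrement; each returns the new value */
    long after_incr = __sync_add_and_fetch(&counter, 1);
    long after_decr = __sync_sub_and_fetch(&counter, 1);

    /* atomic compare-and-swap: set counter to 42 only if it is still 0 */
    if (__sync_bool_compare_and_swap(&counter, 0, 42))
        printf("CAS succeeded, counter = %ld\n", counter);

    printf("after incr: %ld, after decr: %ld\n", after_incr, after_decr);
    return 0;
}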
Thanks for Gil's helpful guidance.
Now I can do it by myself.
I changed the code in persistence.c as below.
First, I changed the definition of val in data to volatile.
//data[0]->val++;
//xbuf_xcat(reply, "Value: %d", data[0]->val);
int new_count, loops = 50000000, time1, time2, time;
time1 = getus();
for(int i = 0; i < loops; i++)
{
    new_count = __sync_add_and_fetch(&data[0]->val, 1);
}
time2 = getus();
time = loops / (time2 - time1);  // increments per microsecond
time = time * 1000;              // increments per millisecond
xbuf_xcat(reply, "Value: %d, time: %d incr_ops/msec", new_count, time);
I got 52,000 incr_operations/msec with my old E2180 CPU.
So, with the GCC compiler, I can do it by myself.
Thanks again.

Switching from debug to release configuration having no effect on performance?

I have tested a couple of benchmarking snippets in Delphi, like this one:
uses
  ..., Diagnostics;

procedure TForm2.Button1Click(Sender: TObject);
var
  i, elapsed: integer;
  stopwatch: TStopwatch;
  ff: textfile;
begin
  if FileExists('c:\bench.txt') then
    DeleteFile('c:\bench.txt');
  stopwatch := TStopwatch.Create;
  stopwatch.Reset;
  stopwatch.Start;
  AssignFile(ff, 'c:\bench.txt');
  Rewrite(ff);
  for I := 1 to 999000 do
    write(ff, 'Delphi programmers are ladies men :D');
  CloseFile(ff);
  stopwatch.Stop;
  elapsed := stopwatch.ElapsedMilliseconds;
  ShowMessage(IntToStr(elapsed));
end;
It does not matter whether I run/compile under the debug or release configuration; the result is around 900.
When I switch from debug to release in Visual Studio (for both C++ and C#), my programs become MAGICALLY faster. I am using Delphi 2010, and I activate the release configuration from the Project Manager, as well as Project -> Configuration Manager, and even Project -> Options -> Delphi Compiler, but with no effect. Why?
If it matters: I am using Windows XP, and I have 1 GB of RAM and an Intel Core2 CPU.
Did you check how the configurations differ? Even if they have names like RELEASE or DEBUG, they are fully configurable. You can even configure them the other way round.
The code you are timing is mostly I/O related, so make sure that I/O checking is turned off in the RELEASE configuration.
Delphi still creates fast code even when debugged ;)
In addition to what Uwe said, make sure you do a "Build" after switching the configuration. Doing a simple compile or running the app will not recompile all units with the new settings.
Like the other commenters, I also wouldn't expect too much of a difference between the two configurations given the benchmark used. The real bottleneck is the I/O and that will very likely outbalance any performance differences between DEBUG and RELEASE.
Finally, debugging in Delphi just isn't that much slower than Release builds. Heck, I sometimes run Outlook in the debugger for most of the day (I'm developing Outlook addins) without noticing any perceivable performance difference.
That's a bad test case, I think. All you do is write to a file, which means most of the time is spent in Windows code, not in your Delphi code, and hence the compiler settings won't significantly affect total execution time.
There's nothing in the bulk of your code:
for I := 1 to 999000 do
write(ff,'Delphi programmers are ladies men :D');
that requires strenuous checks. Your choices are:
Range checking
Overflow checking
I/O checking
Of those three, only I/O checking will apply, and that is probably the equivalent of adding:
for I := 1 to 999000 do
begin
  Write(ff, 'Delphi programmers are ladies men :D');
  if IOResult <> 0 then
    raise EInOutError.Create('That''s what your mom told me, in bed.');
end;
And the CMP and JNE CPU instructions involved are not very complicated. They're dwarfed by the cost of writing to the hard drive.
It runs just as fast because it is fast.

Is there a performance difference between inc(i) and i := i + 1 in Delphi?

I have a procedure with a lot of
i := i +1;
in it and I think
inc(i);
looks a lot better. Is there a performance difference or does the function call just get inlined by the compiler? I know this probably doesn't matter at all to my app, I'm just curious.
EDIT: I did some measuring of the performance and found the difference to be very small, in fact as small as 5.1222741794670901427682121946224e-8! So it really doesn't matter. The optimization options didn't change the outcome much either. Thanks for all the tips and suggestions!
There is a huge difference if Overflow Checking is turned on. Basically Inc does not do overflow checking. Do as was suggested and use the disassembly window to see the difference when you have those compiler options turned on (it is different for each).
If those options are turned off, then there is no difference. Rule of thumb, use Inc when you don't care about a range checking failure (since you won't get an exception!).
Modern compilers optimize the code.
inc(i) and i := i + 1 are pretty much the same.
Use whichever you prefer.
Edit: As Jim McKeeth pointed out, with Overflow Checking there is a difference: Inc does not do range checking.
It all depends on the type of "i". In Delphi, one normally declares loop-variables as "i: Integer", but it could as well be "i: PChar" which resolves to PAnsiChar on everything below Delphi 2009 and FPC (I'm guessing here), and to PWideChar on Delphi 2009 and Delphi.NET (also guessing).
Since Delphi 2009 can do pointer-math, Inc(i) can also be done on typed-pointers (if they are defined with POINTER_MATH turned on).
For example:
type
  PSomeRecord = ^RSomeRecord;
  RSomeRecord = record
    Value1: Integer;
    Value2: Double;
  end;

var
  i: PSomeRecord;

procedure Test;
begin
  Inc(i); // This line increases i by SizeOf(RSomeRecord) bytes, thanks to POINTER_MATH!
end;
As the other answers already said, it's relatively easy to see what the compiler made of your code by opening:
Views > Debug Windows > CPU Windows > Disassembly
Note, that compiler options like OPTIMIZATION, OVERFLOW_CHECKS and RANGE_CHECKS might influence the final result, so you should take care to have the settings according to your preference.
A tip on this: in every unit, $INCLUDE a file that sets the compiler options; this way, you won't lose settings when your .bdsproj or .dproj is somehow damaged. (Look at the source code of the JCL for a good example of this.)
You can verify it in the CPU window while debugging. The generated CPU instructions are the same for both cases.
I agree that Inc(I) looks better, although this may be subjective.
Correction: I just found this in the documentation for Inc:
"On some platforms, Inc may generate optimized code, especially useful in tight loops."
So it's probably advisable to stick to Inc.
You could always write both pieces of code (in separate procedures), put a breakpoint in the code and compare the assembler in the CPU window.
In general, I'd use Inc(i) wherever it's obviously being used only as a loop counter/index of some sort, and + 1 wherever the 1 would make the code easier to maintain (i.e., it might conceivably change to another integer in the future) or just more readable from an algorithm/spec point of view.
"On some platforms, Inc may generate optimized code, especially useful in tight loops."
For an optimizing compiler such as Delphi's, it doesn't matter. That note is about old compilers (e.g. Turbo Pascal).
