Switching from debug to release configuration has no effect on performance?

I have tested a couple of benchmarking snippets on Delphi like this one:
uses
  ..., Diagnostics;

procedure TForm2.Button1Click(Sender: TObject);
var
  i, elapsed: integer;
  stopwatch: TStopwatch;
  ff: textfile;
begin
  if FileExists('c:\bench.txt') then
    DeleteFile('c:\bench.txt');
  stopwatch := TStopwatch.Create;
  stopwatch.Reset;
  stopwatch.Start;
  AssignFile(ff, 'c:\bench.txt');
  Rewrite(ff);
  for i := 1 to 999000 do
    write(ff, 'Delphi programmers are ladies men :D');
  CloseFile(ff);
  stopwatch.Stop;
  elapsed := stopwatch.ElapsedMilliseconds;
  ShowMessage(IntToStr(elapsed));
end;
It does not matter whether I run/compile under the debug or the release configuration; the result is around 900 either way.
When I switch from debug to release in Visual Studio (for both C++ and C#) my programs become MAGICALLY faster. I am using Delphi 2010, and I activate the release configuration from the Project Manager, as well as from Project -> Configuration Manager and even Project -> Options -> Delphi Compiler, but with no effect. Why?
If it matters: I am using Windows XP, I have 1 GB of RAM and an Intel Core 2 CPU.

Did you check how the configurations differ? Even if they have names like RELEASE or DEBUG, they are fully configurable. You could even configure them the other way round.
The code you are timing is mostly I/O bound, so make sure that the I/O checks are turned off in the RELEASE configuration.
Delphi still creates fast code even when debugged ;)

In addition to what Uwe said, make sure you do a "Build" after switching the configuration. Doing a simple compile or running the app will not recompile all units with the new settings.
Like the other commenters, I also wouldn't expect much of a difference between the two configurations given the benchmark used. The real bottleneck is the I/O, and that will very likely outweigh any performance differences between DEBUG and RELEASE.
Finally, debugging in Delphi just isn't that much slower than a release build. Heck, I sometimes run Outlook in the debugger for most of the day (I'm developing Outlook add-ins) without noticing any perceptible performance difference.

That's a bad test case, I think. All you do is write to a file, which means most of the time is spent in Windows code, not in your Delphi code, so the compiler settings won't significantly affect total execution time.

There's nothing in the bulk of your code:
for I := 1 to 999000 do
  write(ff, 'Delphi programmers are ladies men :D');
that requires strenuous checks. Your choices are:
Range checking
Overflow checking
I/O checking
Of those three, only I/O checking will apply, and that is probably the equivalent of adding:
for I := 1 to 999000 do
begin
  hresult := Write(ff, 'Delphi programmers are ladies men :D');
  if hresult < 0 then
    raise EIOException.Create('That''s what your mom told me, in bed.');
end;
And the CMP and JNE CPU instructions involved are not very complicated. They're dwarfed by the cost of writing to the hard drive.
It runs just as fast because it is fast.

Related

Did WH_CALLWNDPROC hook performance dramatically decrease in Win10 (compared to Win7)?

We are in the process of upgrading our workstations to Win10 from Win7. While investigating reports of performance degradation, I came to the conclusion it was caused by a WH_CALLWNDPROC hook installed by a third party.
I came to this conclusion based on the result of the following test application (Done in Delphi 10 Seattle)
procedure TForm3.Button1Click(Sender: TObject);
var
  I: Integer;
  SW: TStopWatch;
begin
  SW := TStopWatch.StartNew;
  for I := 0 to 1000000 do
  begin
    if Combobox1.ItemIndex > 0 then
      Exit;
  end;
  SW.Stop;
  ShowMessage(SW.ElapsedMilliseconds.ToString);
end;
(For those unfamiliar with Delphi, TStopwatch uses QueryPerformanceFrequency/QueryPerformanceCounter APIs to get elapsed time)
The execution time for this method is
Win10 : 1485 msecs
Win7 : 4996 msecs
(Note: both machines are on wildly different hardware and can't really be compared to each other.)
Now, if I add a hook before executing the same code
function MySystemWndProcHook(Code: Integer; wParam: WPARAM; lParam: LPARAM): LRESULT; stdcall;
begin
  Result := CallNextHookEx(FHook, Code, wParam, lParam);
end;

procedure TForm3.FormCreate(Sender: TObject);
begin
  FHook := SetWindowsHookEx(WH_CALLWNDPROC, @MySystemWndProcHook, 0, GetCurrentThreadId);
end;
The execution time now becomes :
Win10 : 19552 msecs (about 1300% longer)
Win7 : 8682 msecs (about 75% longer)
Now, as I mentioned, both workstations are on different hardware, but I don't believe that alone could explain the difference. The Win10 machine has an i7 CPU while the Win7 machine has an i3. If anything, I'd expect the i3 to take the bigger hit (less cache, fewer resources...).
So, did WH_CALLWNDPROC hooks get that much slower since Win7?
A quick Google search didn't seem to reveal any other report of this issue. Can anybody reproduce my results?
If it can't be reproduced, does anyone have any idea what settings/conflicting application could be causing this? (I already tried disabling Windows Defender real-time scanning and it didn't affect performance.)
EDIT : This was tested under Win10 1803 64 bits. The test application itself was 32 bits.
EDIT2: The same application compiled as 64 bits gives the following execution times (without hook / with hook):
Win10: 780 msecs / 10201 msecs.
Win7: 6419 msecs / 9201 msecs.
EDIT3: Interestingly enough, when running the application (32 bits) as admin:
Win10: 12693 msecs / 18028 msecs
Also (on yet another workstation), running as a different user makes a difference:
Win10 (1809) / "standard user": 9430 / 17440 msecs
Win10 (1809) / System: 5220 / 10160 msecs (started remotely through PsExec)
EDIT4: If run as admin, the application will run faster from a USB key than from a hard disk. (Note: so far, I have only tested on systems with a single drive. At this point, I wouldn't exclude that only the OS drive is slower.)
EDIT5 : I found out quite a few more things about this situation.
First, running "As Admin" (Win10) causes a WH_CALLWNDPROCRET hook to be installed in the application. I haven't found where it is coming from (the OS, Delphi's framework, another app?). It is definitely not there when simply running the app.
The performance hit doesn't seem to be so much on the hook itself, but on its effect on SendMessage.
We are in contact with Microsoft's support, they have reproduced similar results (on a 100k loop instead of 1m) :
Windows 7 - Without hook 0.018396 seconds.
Windows 10 - Without hook 0.025102 seconds.
Windows 7 - With hook 0.167941 seconds.
Windows 10 - With hook 1.105929 seconds.
(Investigation still on-going so still no conclusions thus far)
Those results also suggest many of our workstations perform far worse than they should even when there are no hooks involved.
So, WH_CALLWNDPROC and WH_CALLWNDPROCRET hooks do degrade performance quite a bit. And quite a bit more so in Win10 than it did in Win7.
Some of the performance hit is coming from the mitigation code for Spectre and Meltdown. Early reports from Microsoft suggest the rest is apparently from lock contention in the window manager (win32k*.sys).
As for the weird result I've got in my investigation :
Running "As Admin" caused an additional hook to be installed in my application, which explains the massive slowdown I witnessed.
Many of the tests I did were on test machines accessed through a remote admin tool... which happens to install global WH_CALLWNDPROC/WH_CALLWNDPROCRET hooks itself, which made my test results flawed. Running locally "fixed" the results. It took me a while to find out about it, since my application is 32 bits and the hooks were 64 bits, so my application wasn't notified of them (but still incurred the performance hit).
2020-02-04: I just received an update from Microsoft. Their engineers identified a few issues that contribute to the performance degradation. The current estimate for a Windows Insider version containing fixes is 2020H1 to early 2020H2.

Delphi REST Mac memory leak

I am currently looking for a way around an apparent memory leak in the Mac implementation of the REST client. The code to generate the memory leak is the following (running XE8, update 1):
program mac_REST_leak_test;

{$APPTYPE CONSOLE}
{$R *.res}

uses
  System.SysUtils, REST.Client, REST.Types, IPPeerClient;

var
  request: TRestRequest;
  ii, iMax: integer;

begin
  iMax := 1;
  for ii := 0 to iMax do
  begin
    request := TRestRequest.Create(nil);
    // Fake Online REST API for Testing and Prototyping
    request.Client := TRestClient.Create('http://jsonplaceholder.typicode.com/');
    request.Method := rmPOST;
    request.Execute();
    request.Client.Free();
    request.Free();
  end;
end.
This is the smallest block of code that demonstrates the leak. Essentially, I have a synching service that makes REST requests every so often.
When I run this on Windows, using madExcept, no leaks are found. Examining the running process in Process Monitor shows no increase in the amount of memory being used.
When run on a Mac, however, Activity Monitor shows the memory allocated to the app continuing to rise. Further, when run using Instruments, there appear to be leaks involving several URL and HTTP classes on the Mac.
Does anybody know how to resolve this leak?
(As an aside, it would be really helpful to know exactly where the leak is coming from on the Mac, but the only Delphi class listed is TMethodImplementationIntercept. I'm led to believe that this is because Delphi doesn't generate a dSYM file for the Mac. If anybody knows a way around that, that would be awesome too!)
UPDATE
By varying iMax from 1 to 10 and comparing the FastMM4 output, it appears that the leak is in the class Macapi.ObjectiveC.TConvObjID.XForm. The 10 iteration output contains 9 more leaks with this as a stack trace compared to the 1 iteration. I have reported this to Embarcadero as RSP-12242.
Yes, FastMM4 has OSX leak reporting support in the latest SVN revision. Unfortunately, the "global" leaks from a simple empty Delphi FMX application make it difficult to analyse the memory logfile. A few leaks have been fixed in XE10, but some objects in the MacApi.ObjectiveC bridge still generate leaks. I have reported that in Quality Central & Quality Portal (QC & QP). So it's difficult to use FastMM4 for leak finding.
Please separate Delphi object leaks from Objective-C leaks; the latter you can find with Instruments.

Getting system time in VxWorks

Is there any way to get the system time in VxWorks besides tickGet() and tickAnnounce()? I want to measure the time between the task switches of a specified task, but I think the precision of tickGet() is not good enough, because the two tickGet() values at the beginning and the end of the taskSwitchHookAdd function are always the same!
If you are looking to try and time task switches, I would assume you need a timer at least at the microsecond (us) level.
Usually, timers/clocks this fine-grained are only provided by the platform you are running on. If you are working on an embedded system, you can try reading through the manuals for your board support package (if there is one) to see if there are any functions provided to access the various timers on the board.
A more low level solution would be to figure out the processor that is running on your system and then write some simple assembly code to poll the processor's internal timebase register (TBR). This might require a bit of research on the processor you are running on, but could be easily done.
If you are running on a PPC based processor, you can use the code below to read the TBR:
loop: mftbu rx      # load most significant half from TBU
      mftbl ry      # load least significant half from TBL
      mftbu rz      # load from TBU again
      cmpw  rz,rx   # see if 'old' = 'new'
      bne   loop    # repeat if the two values read from TBU are unequal
On an x86 based processor, you might consider using the RDTSC assembly instruction to read the Time Stamp Counter (TSC). On vxWorks, pentiumALib has some library functions (pentiumTscGet64() and pentiumTscGet32()) that will make reading the TSC easier using C.
source: http://www-inteng.fnal.gov/Integrated_Eng/GoodwinDocs/pdf/Sys%20docs/PowerPC/PowerPC%20Elapsed%20Time.pdf
Good luck!
It depends on what platform you are on, but if it is x86 then you can use:
pentiumTscGet64();

sm-level : 1.3 vs 2.0 performance

My code doesn't depend on the SM level. I can build it with sm10 if I want. But when I built it with 1.3 instead of 2.0, as I had done before, I got a 1.25x speedup with no code changes!
sm20 -> 35ms
sm13 -> 25ms
After those gorgeous results, I tried checking/unchecking every option in Project Settings -> CUDA Settings -> All :) I think I found the setting that caused that awesome speed:
sm13 without fast math generation (hereafter fm) = 25 ms
sm13 with fm = 25 ms
sm20 without fm = 35 ms
sm20 with fm = 25 ms (the same result)
Why is this so? Does sm13 force the use of hardware maths while sm20 does not? Or is it only a coincidence, and the later SM level simply performs worse for programs targeting a lower SM level?
In addition to compiling in release mode, as pointed out by @Robert Crovella, you should also consider that when you target sm_13 the compiler is able to simplify some of the floating-point maths. sm_20 and later support precise division, precise square root, and denormals by default.
You can try disabling these features with the command line options -ftz=true -prec-div=false -prec-sqrt=false. See the best practices guide for more information.

Is there a performance difference between inc(i) and i := i + 1 in Delphi?

I have a procedure with a lot of
i := i +1;
in it and I think
inc(i);
looks a lot better. Is there a performance difference or does the function call just get inlined by the compiler? I know this probably doesn't matter at all to my app, I'm just curious.
EDIT: I did some gauging of the performance and found the difference to be very small, in fact as small as 5.1222741794670901427682121946224e-8! So it really doesn't matter. And optimization options really didn't change the outcome much. Thanks for all tips and suggestions!
There is a huge difference if Overflow Checking is turned on. Basically Inc does not do overflow checking. Do as was suggested and use the disassembly window to see the difference when you have those compiler options turned on (it is different for each).
If those options are turned off, then there is no difference. Rule of thumb: use Inc when you don't care about an overflow failure (since you won't get an exception!).
Modern compilers optimize the code.
inc(i) and i:= i+1; are pretty much the same.
Use whichever you prefer.
Edit: As Jim McKeeth corrected, with Overflow Checking there is a difference: Inc does not do overflow checking.
It all depends on the type of "i". In Delphi, one normally declares loop-variables as "i: Integer", but it could as well be "i: PChar" which resolves to PAnsiChar on everything below Delphi 2009 and FPC (I'm guessing here), and to PWideChar on Delphi 2009 and Delphi.NET (also guessing).
Since Delphi 2009 can do pointer-math, Inc(i) can also be done on typed-pointers (if they are defined with POINTER_MATH turned on).
For example:
type
  PSomeRecord = ^RSomeRecord;
  RSomeRecord = record
    Value1: Integer;
    Value2: Double;
  end;

var
  i: PSomeRecord;

procedure Test;
begin
  Inc(i); // This line advances i by SizeOf(RSomeRecord) bytes, thanks to POINTER_MATH!
end;
As the other answers have already said, it's relatively easy to see what the compiler made of your code by opening up:
Views > Debug Windows > CPU Windows > Disassembly
Note that compiler options like OPTIMIZATION, OVERFLOW_CHECKS and RANGE_CHECKS may influence the final result, so you should take care to have the settings match your preference.
A tip on this: in every unit, $INCLUDE a file that sets the compiler options; this way, you won't lose your settings when your .bdsproj or .dproj is somehow damaged. (Look at the source code of the JCL for a good example of this.)
You can verify it in the CPU window while debugging. The generated CPU instructions are the same for both cases.
I agree Inc(I); looks better although this may be subjective.
Correction: I just found this in the documentation for Inc:
"On some platforms, Inc may generate
optimized code, especially useful in
tight loops."
So it's probably advisable to stick to Inc.
You could always write both pieces of code (in separate procedures), put a breakpoint in the code and compare the assembler in the CPU window.
In general, I'd use inc(i) wherever it's obviously being used only as a loop counter/index of some sort, and + 1 wherever the 1 could conceivably change to another integer in the future, or where it's just more readable from an algorithm/spec point of view.
"On some platforms, Inc may generate optimized code, especially useful in tight loops."
An optimizing compiler such as Delphi's doesn't care; that note is about old compilers (e.g. Turbo Pascal).
