Delphi REST client memory leak on macOS

I am currently looking for a way around an apparent memory leak in the Mac implementation of the REST client. The code to generate the memory leak is the following (running XE8, update 1):
program mac_REST_leak_test;

{$APPTYPE CONSOLE}
{$R *.res}

uses
  System.SysUtils, REST.Client, REST.Types, IPPeerClient;

var
  request: TRestRequest;
  ii, iMax: integer;

begin
  iMax := 1;
  for ii := 0 to iMax do
  begin
    request := TRestRequest.Create(nil);
    // Fake Online REST API for Testing and Prototyping
    request.Client := TRestClient.Create('http://jsonplaceholder.typicode.com/');
    request.Method := rmPOST;
    request.Execute();
    request.Client.Free();
    request.Free();
  end;
end.
This is the smallest block of code that demonstrates the leak. Essentially, I have a syncing service that makes REST requests every so often.
When I run this on Windows, using MadExcept, no leaks are found. Examining the running process in Process Monitor shows no increase in the amount of memory being used.
When run on a Mac, however, Activity Monitor shows the memory allocated to the app continuing to rise. Further, when run using Instruments, there appear to be leaks involving several URL and HTTP classes on the Mac.
Does anybody know how to resolve this leak?
(As an aside, it would be really helpful to know exactly where the leak is coming from on the Mac, but the only Delphi class listed is TMethodImplementationIntercept. I'm led to believe that this is because Delphi doesn't generate a dSYM file for the Mac. If anybody knows a way around that, that would be awesome too!)
UPDATE
By varying iMax from 1 to 10 and comparing the FastMM4 output, it appears that the leak is in Macapi.ObjectiveC.TConvObjID.XForm. The 10-iteration output contains nine more leaks with this stack trace than the 1-iteration output. I have reported this to Embarcadero as RSP-12242.

Yes, FastMM4 has OSX leak reporting support in the latest SVN revision. Unfortunately, the "global" leaks from a simple empty Delphi FMX application make it difficult to analyse the memory log file. A few leaks have been fixed in XE10, but some objects in the MacApi.ObjectiveC bridge still generate leaks. I have reported that in Quality Central & Quality Portal (QC & QP). So it's difficult to use FastMM4 for leak finding.
Please separate Delphi object leaks from Objective-C leaks; the latter you can find with Instruments.
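For reference, the FastMM4 comparison mentioned in the update boils down to something like this (a sketch, assuming the SVN version of FastMM4 with leak reporting to a log file enabled in FastMM4Options.inc, e.g. EnableMemoryLeakReporting and LogMemoryLeakDetailToFile):

program mac_REST_leak_test;
{$APPTYPE CONSOLE}

uses
  FastMM4,  // must stay the very first unit so every allocation is tracked
  System.SysUtils, REST.Client, REST.Types, IPPeerClient;

// ... same body as the sample above; run once with iMax := 1 and once with
// iMax := 10, then diff the two leak logs FastMM4 writes on shutdown. Any
// stack trace that appears nine more times in the second log scales with the
// number of requests - in this case Macapi.ObjectiveC.TConvObjID.XForm.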

Related

Did WH_CALLWNDPROC hook performance dramatically decrease with Win10 (compared to Win7)?

We are in the process of upgrading our workstations to Win10 from Win7. While investigating reports of performance degradation, I came to the conclusion it was caused by a WH_CALLWNDPROC hook installed by a third party.
I came to this conclusion based on the result of the following test application (done in Delphi 10 Seattle):
procedure TForm3.Button1Click(Sender: TObject);
var
  I: Integer;
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  for I := 0 to 1000000 do
  begin
    if Combobox1.ItemIndex > 0 then
      Exit;
  end;
  SW.Stop;
  ShowMessage(SW.ElapsedMilliseconds.ToString);
end;
(For those unfamiliar with Delphi, TStopwatch uses QueryPerformanceFrequency/QueryPerformanceCounter APIs to get elapsed time)
The execution time for this method is
Win10 : 1485 msecs
Win7 : 4996 msecs
(Note: both machines are on wildly different hardware and can't really be compared to each other.)
Now, if I add a hook before executing the same code
// FHook is assumed to be declared at unit scope: var FHook: HHOOK;
function MySystemWndProcHook(Code: Integer; wParam: WPARAM; lParam: LPARAM): LRESULT; stdcall;
begin
  Result := CallNextHookEx(FHook, Code, wParam, lParam);
end;

procedure TForm3.FormCreate(Sender: TObject);
begin
  FHook := SetWindowsHookEx(WH_CALLWNDPROC, @MySystemWndProcHook, 0, GetCurrentThreadId);
end;
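(The snippet above omits the hook handle's cleanup; a minimal sketch of the missing piece, assuming FHook is the unit-scope HHOOK variable used above:)

procedure TForm3.FormDestroy(Sender: TObject);
begin
  if FHook <> 0 then
  begin
    UnhookWindowsHookEx(FHook);  // remove the thread-local WH_CALLWNDPROC hook again
    FHook := 0;
  end;
end;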
The execution time now becomes :
Win10 : 19552 msecs (about 1300% longer)
Win7 : 8682 msecs (about 75% longer)
Now, as I mentioned, both workstations are on different hardware, but I don't believe that alone could explain the difference. The Win10 machine has an i7 CPU while the Win7 machine has an i3. If anything, I'd expect the i3 to take a bigger hit (less cache, fewer resources...).
So, did WH_CALLWNDPROC hooks get that much slower since Win7?
A quick google search didn't seem to reveal any other report of this issue. Can anybody reproduce my results?
If it can't be reproduced, does anyone have any idea what settings/conflicting application could be causing this? (I already tried disabling Windows Defender real-time scanning and it didn't affect performance.)
EDIT : This was tested under Win10 1803, 64-bit. The test application itself was 32-bit.
EDIT2 : The same application compiled as 64-bit gives the following execution times (without hook / with hook):
Win10 : 780 msecs / 10201 msecs.
Win7 : 6419 msec / 9201 msecs.
EDIT3 : Interestingly enough, when running the application (32-bit) as admin:
Win10 : 12693 msecs / 18028 msecs
Also (on yet another workstation), running as a different user makes a difference:
Win10(1809) / "standard user" : 9430 / 17440 msecs
Win10(1809) / System : 5220 / 10160 msecs (Started remotely through PsExec)
EDIT4 : If run as admin, the application will run faster from a USB key than from a hard disk. (Note: so far I have only tested on systems with a single drive. At this point, I wouldn't exclude that only the OS drive is slower.)
EDIT5 : I found out quite a few more things about this situation.
First, running "As Admin" (Win10) causes a WH_CALLWNDPROCRET hook to be installed in the application. I haven't found where it is coming from (OS, Delphi's framework, another app?). It is definitely not there when simply running the app.
The performance hit doesn't seem to be so much on the hook itself, but on its effect on SendMessage.
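A minimal way to see the SendMessage effect directly (my own sketch, not part of the original test; Button2Click is a hypothetical second handler, and WM_NULL is a no-op message, so the loop measures little besides the message call itself - which is routed through the WH_CALLWNDPROC chain once a hook is installed, just like the CB_GETCURSEL SendMessage that ComboBox1.ItemIndex boils down to):

procedure TForm3.Button2Click(Sender: TObject);
var
  I: Integer;
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  for I := 0 to 1000000 do
    SendMessage(Handle, WM_NULL, 0, 0);  // each sent message passes through any WH_CALLWNDPROC hooks
  SW.Stop;
  ShowMessage(SW.ElapsedMilliseconds.ToString);
end;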
We are in contact with Microsoft support; they have reproduced similar results (on a 100k loop instead of 1M):
Windows 7 - Without hook 0.018396 seconds.
Windows 10 - Without hook 0.025102 seconds.
Windows 7 - With hook 0.167941 seconds.
Windows 10 - With hook 1.105929 seconds.
(The investigation is still ongoing, so no conclusions thus far.)
Those results also suggest many of our workstations perform way worse than they should when there are no hooks involved.
So, WH_CALLWNDPROC and WH_CALLWNDPROCRET hooks do degrade performance quite a bit. And quite a bit more so in Win10 than it did in Win7.
Some of the performance hit is coming from the mitigation code for Spectre and Meltdown. Early reports from Microsoft suggest the rest is apparently from lock contention in the window manager (win32k*.sys).
As for the weird results I got in my investigation:
Running "As Admin" caused an additional hook to be installed in my application, which explains the massive slowdown I witnessed.
Many of the tests I did were on test machines accessed through a remote admin tool... which happens to install global WH_CALLWNDPROC/WH_CALLWNDPROCRET hooks itself, which made my test results flawed. Running locally "fixed" the results. It took me a while to find out about it since my application is 32-bit and the hooks were 64-bit, so my application wasn't notified of them (but still incurred the performance hit).
2020-02-04 : I just received an update from Microsoft. Their engineer identified a few issues that contribute to the performance degradation. The current estimate for a Windows Insider version containing fixes is 2020H1 or early 2020H2.

Call to ExAllocatePoolWithTag never returns

I am having some issues with my virtual HBA driver on Windows Server 2016. I ran the HLK crashdump support test, and 3 out of 10 runs failed. In those failing runs, the crash dump hangs at 0% while taking a complete dump, kernel dump, or minidump.
By kernel debugging my code, I found that the call to ExAllocatePoolWithTag() for buffer allocation never actually returns.
Below is the statement which never returns.
pDeviceExtension->pcmdbuf = (struct mycmdrsp *)ExAllocatePoolWithTag(NonPagedPoolCacheAligned, pcmdqSignalSize, (ULONG)'TA1');
I searched the web regarding this. However, all of the pages I found focus on this function returning NULL, whereas in my case the call never returns at all.
Any help on how to move forward would be highly appreciated.
Thanks in advance.
You can't allocate memory in crash dump mode. You're running at HIGH_LEVEL with interrupts disabled and so you're calling this API at the wrong IRQL.
The typical solution for a hardware adapter is to set the RequestedDumpBufferSize in the PORT_CONFIGURATION_INFORMATION structure during the normal HwFindAdapter call. Then when you're called again in crash dump mode you use the CrashDumpRegion field to get your dump buffer allocation. You then need to write your own "crash dump mode only" allocator to allocate buffers out of this memory region.
It's a huge pain, especially given that it's difficult/impossible to know how much memory you're ultimately going to need. I usually calculate some minimal configuration overhead (e.g. 1 channel, 8 I/O requests at a time, etc.) and then add in a registry-configurable slush. The only benefit is that the environment is stripped down, so you don't need to be in your all-singing, all-dancing configuration.
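The miniport itself is C, but the "crash dump mode only" allocator described above is just bookkeeping over the pre-reserved region; a language-agnostic sketch (written in Delphi here purely for illustration, with invented names, assuming the base and size come from the CrashDumpRegion / RequestedDumpBufferSize fields mentioned above):

type
  TDumpRegionAllocator = record
    Base: PByte;       // start of the pre-reserved crash dump region
    Size: NativeUInt;  // size that was requested via RequestedDumpBufferSize
    Used: NativeUInt;  // bump pointer
    function Alloc(Bytes: NativeUInt): Pointer;
  end;

function TDumpRegionAllocator.Alloc(Bytes: NativeUInt): Pointer;
begin
  Bytes := (Bytes + 7) and not NativeUInt(7);  // keep every allocation 8-byte aligned
  if Used + Bytes > Size then
    Exit(nil);                                 // region exhausted; there is no fallback in dump mode
  Result := Base + Used;
  Inc(Used, Bytes);
end;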

DirectX texture interface to existing memory

I'm writing a rendering app that communicates with an image processor as a sort of virtual camera, and I'm trying to figure out the fastest way to write the texture data from one process to the awaiting image buffer in the other.
Theoretically I think it should be possible with 1 DirectX copy from VRAM directly to the area of memory I want it in, but I can't figure out how to specify a region of memory for a texture to occupy, and thus must perform an additional memcpy. DX9 or DX11 solutions would be welcome.
So far, the docs here: http://msdn.microsoft.com/en-us/library/windows/desktop/bb174363(v=vs.85).aspx have held the most promise.
"In Windows Vista CreateTexture can create a texture from a system memory pointer allowing the application more flexibility over the use, allocation and deletion of the system memory"
I'm running on Windows 7 with the June 2010 DirectX SDK. However, whenever I try to use the function in the way it specifies, the function fails with an invalid-arguments error code. Here is the call I tried as a test:
static char s_TextureBuffer[640*480*4]; // larger than needed
void* p = (void*)s_TextureBuffer;
HRESULT res = g_D3D9Device->CreateTexture(640, 480, 1, 0, D3DFORMAT::D3DFMT_L8, D3DPOOL::D3DPOOL_SYSTEMMEM, &g_ReadTexture, (void**)p);
I tried several different texture formats, but with no luck. I've begun looking into DX11 solutions; it's going slowly since I'm used to DX9. Thanks!

Some Windows API calls fail unless the string arguments are in system memory rather than on the local stack

We have an older massive C++ application and we have been converting it to support Unicode as well as 64-bits. The following strange thing has been happening:
Calls to registry functions and window creation functions, like the following, have been failing:
hWnd = CreateSysWindowExW( ExStyle, ClassNameW.StringW(), Label2.StringW(), Style,
Posn.X(), Posn.Y(),
Size.X(), Size.Y(),
hParentWnd, (HMENU)Id,
AppInstance(), NULL);
ClassNameW and Label2 are instances of our own Text class which essentially uses malloc to allocate the memory used to store the string.
Anyway, when the functions fail and I call GetLastError, it returns the error code for "invalid memory access" (though I can inspect and see the string arguments fine in the debugger). Yet if I change the code as follows, then it works perfectly fine:
BSTR Label2S = SysAllocString(Label2.StringW());
BSTR ClassNameWS = SysAllocString(ClassNameW.StringW());
hWnd = CreateSysWindowExW( ExStyle, ClassNameWS, Label2S, Style,
Posn.X(), Posn.Y(),
Size.X(), Size.Y(),
hParentWnd, (HMENU)Id,
AppInstance(), NULL);
SysFreeString(ClassNameWS); ClassNameWS = 0;
SysFreeString(Label2S); Label2S = 0;
So what gives? Why would the original functions work fine with the arguments in local memory, but when used with Unicode the registry functions require SysAllocString, and when used in 64-bit the window creation functions also require SysAllocString'd string arguments? Our window procedure functions have all been converted to be Unicode, always, and yes, we use SetWindowLongW and call the correct default Unicode DefWindowProcW, etc. That all seems to work fine and handles and draws Unicode properly.
The documentation at http://msdn.microsoft.com/en-us/library/ms632679%28v=vs.85%29.aspx does not say anything about this. While our application is massive, we do use debug heaps and tools like Purify to check for and clean up any memory corruption. Also, at the time of this failure there is still only one main system thread, so it is not a threading issue.
So what is going on? I have read that if string arguments are marshalled anywhere or passed across process boundaries, then you have to use SysAllocString/BSTR, yet we call lots of API functions, and there is lots of code out there which calls these functions using plain local strings.
What am I missing? I have tried Googling this, as someone else must have run into this, but with little luck.
Edit 1: Our StringW function does not create any temporary objects which might go out of scope before the actual API call. The function is as follows:
class Text {
public:
    const wchar_t* StringW() const
    {
        return TextStartW;
    }

    wchar_t* TextStartW; // pointer to current start of text in DataArea
};
I have been running our application with the debug heap and memory checking and other diagnostic tools, and found no source of memory corruption, and looking at the assembly, there is no sign of temporary objects or invalid memory access.
BUT I finally figured it out:
We compile our code with /Zp1, which means byte-aligned structure member packing. SysAllocString (in 64-bit) always returns a pointer that is aligned on an 8-byte boundary. Presumably a 32-bit ANSI C++ application goes through an API layer to the underlying Unicode Windows DLLs, which would also align the pointer for you.
But if you use Unicode, you do not get the incidental pointer alignment that the conversion mapping layer gives you, and if you use 64-bit, of course the situation gets even worse.
I added a method to our Text class which shifts the string pointer so that it is aligned on an eight-byte boundary, and voila, everything runs fine!
Of course the Microsoft people say it must be memory corruption and I am jumping to the wrong conclusion, but there is evidence that this is not the case.
Also, if you use /Zp1 and include windows.h in a 64-bit application, the debugger will tell you sizeof(BITMAP)==28, but calling GetObject on a bitmap will fail and tell you it needs a 32-byte structure. So I suspect that some of Microsoft's APIs are inherently dependent on aligned pointers, and I also know that some optimized assembly (I have seen some from Fortran compilers) takes advantage of that and crashes badly if you ever give it unaligned pointers.
So the moral of all of this is: don't use "funky" compiler arguments like /Zp1. In our case we have to for historical reasons, but the number of times this has bitten us...
Using a bit of psychic debugging, I'm going to guess that the strings in your application are pooled in a read-only section.
It's possible that CreateSysWindowExW is attempting to write to the memory passed in for the window class or title. That would explain why the calls work when the strings are allocated on the heap (SysAllocString) but not when used as constants.
The easiest way to investigate this is to use a low level debugger like windbg - it should break into the debugger at the point where the access violation occurs which should help figure out the problem. Don't use Visual Studio, it has a nasty habit of being helpful and hiding first chance exceptions.
Another thing to try is to enable appverifier on your application - it's possible that it may show something.
Calling a Windows API function does not cross the process boundary, since the various Windows DLLs are loaded into your process.
It sounds like whatever pointer StringW() is returning isn't valid when Windows tries to access it. I would look there - is it possible that the returned pointer goes out of scope and is deleted shortly after the call?
If you share some more details about your string class, that could help diagnose the problem here.

Switching from debug to release configuration having no effect on performance?

I have tested a couple of benchmarking snippets in Delphi, like this one:
uses
  ..., Diagnostics;

procedure TForm2.Button1Click(Sender: TObject);
var
  i, elapsed: integer;
  stopwatch: TStopwatch;
  ff: textfile;
begin
  if FileExists('c:\bench.txt') then
    DeleteFile('c:\bench.txt');
  stopwatch := TStopwatch.Create;
  stopwatch.Reset;
  stopwatch.Start;
  AssignFile(ff, 'c:\bench.txt');
  Rewrite(ff);
  for i := 1 to 999000 do
    write(ff, 'Delphi programmers are ladies men :D');
  CloseFile(ff);
  stopwatch.Stop;
  elapsed := stopwatch.ElapsedMilliseconds;
  ShowMessage(IntToStr(elapsed));
end;
It does not matter whether I run/compile under the debug or release configuration; the result is around 900.
When I switch from debug to release in Visual Studio (for both C++ and C#) my programs become MAGICALLY faster. I am using Delphi 2010, and I activate the release configuration from the Project Manager as well as Project -> Configuration Manager and even Project -> Options -> Delphi Compiler, but with no effect. Why?
If it matters: I am using Windows XP, I have 1 GB of RAM and an Intel Core 2 CPU.
Did you check how the configurations differ? Even if they have names like RELEASE or DEBUG, they are fully configurable. You could even configure them the other way round.
The code you are timing is mostly I/O related, so make sure that the I/O checks are turned off in the RELEASE configuration.
Delphi still creates fast code even when debugged ;)
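For what it's worth, I/O checking can also be toggled directly in code, independent of the build configuration; a minimal sketch (the directive placement is only illustrative):

{$IOCHECKS OFF}  // same switch as "I/O checking" in Project > Options > Delphi Compiler
AssignFile(ff, 'c:\bench.txt');
Rewrite(ff);
if IOResult <> 0 then  // with I/O checking off you have to poll IOResult yourself
  ShowMessage('could not open c:\bench.txt');
{$IOCHECKS ON}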
In addition to what Uwe said, make sure you do a "Build" after switching the configuration. Doing a simple compile or running the app will not recompile all units with the new settings.
Like the other commenters, I also wouldn't expect too much of a difference between the two configurations given the benchmark used. The real bottleneck is the I/O, and that will very likely outweigh any performance differences between DEBUG and RELEASE.
Finally, debugging in Delphi just isn't that much slower than Release builds. Heck, I sometimes run Outlook in the debugger for most of the day (I'm developing Outlook addins) without noticing any perceivable performance difference.
That's a bad test case I think. All you do is write to a file, which means most of the time is spent in Windows code, not in your Delphi code, and hence the compiler settings won't significantly affect total execution time
There's nothing in your main code bulk:
for I := 1 to 999000 do
  write(ff, 'Delphi programmers are ladies men :D');
that requires strenuous checks. Your choices are:
Range checking
Overflow checking
I/O checking
Of those three, only I/O checking will apply, and that is probably the equivalent of adding:
for I := 1 to 999000 do
begin
  hresult := Write(ff, 'Delphi programmers are ladies men :D');
  if hresult < 0 then
    raise EIOException.Create('That''s what your mom told me, in bed.');
end;
And the CMP and JNE CPU instructions are not very complicated. They're dwarfed by writing to the hard drive.
It runs just as fast because it is fast.
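To actually see the two configurations differ, time something CPU-bound where range/overflow checking and optimization matter; a sketch of my own (Button2Click is a hypothetical second handler, not part of the original question):

procedure TForm2.Button2Click(Sender: TObject);
var
  I, J: Integer;
  Sum: Int64;
  A: array[0..1023] of Integer;
  stopwatch: TStopwatch;
begin
  stopwatch := TStopwatch.StartNew;
  Sum := 0;
  for J := 1 to 100000 do
    for I := Low(A) to High(A) do
    begin
      A[I] := I * J;       // overflow checking adds a test here in DEBUG
      Sum := Sum + A[I];   // range checking adds a bounds test per access in DEBUG
    end;
  stopwatch.Stop;
  ShowMessage(IntToStr(stopwatch.ElapsedMilliseconds));
end;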
