Pixel modifying code runs quick in main app, really slow in Delphi 6 DirectShow filter with other problems - performance

I have a Delphi 6 application that sends bitmaps to a DirectShow DLL in real-time, 25 frames a second. The DirectShow DLL is my code too and is also written in Delphi 6 using the DSPACK DirectShow component suite. I have a simple block of code that goes through each pixel in the bitmap modifying the brightness and contrast of the image, if a certain flag is set, otherwise the bitmap is pushed out the DirectShow DLL unmodified (push source video filter). The code used to be in the main application and then I just moved it into the DirectShow DLL. When it was in the main application it ran fine. I could see the changes in the bitmap as expected. However, now that the code resides in the DirectShow DLL it has the following problems:
When the code block below is active the DirectShow DLL is really slow. I have a quad core i5 and it's really slow. I can also see a big spike in the CPU consumption. In contrast, the very same code running in the main application ran fine on an old single core P4. It did hit the CPU noticeably on that old machine but the video was smooth and there were no problems. The images are only 352 x 288 pixels in size.
I don't see the expected changes to the visible bitmap. I can trace the code in the DirectShow DLL and see the numerical values of each pixel properly altered by the code, but the viewable image in the Graph Edit ActiveMovie window looks completely unchanged.
If I deactivate the code, which I can do in real-time, the ActiveMovie window shows video that is as smooth as glass, perfectly rendered with the CPU barely touched. If I reactivate the code the video is now really choppy, probably showing only 1 to 2 frames a second with a long delay before the first frame is shown, and the CPU spikes. Not completely, but a lot more than I would expect.
I tried compiling the DirectShow DLL with everything on including range checking, overflow checking, etc. and there were no warnings or errors during run-time. I then tried compiling for fastest speed and it still had the exact same problems listed above. Something is really wrong and I can't figure out what. Note, I do indeed lock the canvas before modifying the bitmap and unlock it after I'm done. If it weren't for the "everything on" compilation run I noted above I'd say it felt like an FPU Exception was being raised and silently swallowed with every pixel computation, but as I said, no errors or Exceptions are occurring.
UPDATE: I am putting this here so that the solution, which is embedded in one of Roman R's comments, is plainly visible. The problem was that I was not setting the PixelFormat property to pf24bit before accessing the ScanLine property. As Roman suggested, not doing this must make the TBitmap code create a temporary copy of the bitmap. As soon as I added the line of code below the problems went away, both the changes not being visible and the soft page faults. It's an insidious problem because the only object that is affected is the pointer you obtain from the ScanLine property, since (assumption) it points to a temporary copy of the bitmap. That must be why the subsequent TextOut() call still worked, since it operated on the original copy of the bitmap.
clip.PixelFormat := pf24bit; // The missing code line that fixes the problem.
Here's the code block I've been referring to:
function IntToByte(i: Integer): Byte;
begin
  if i > 255 then
    Result := 255
  else if i < 0 then
    Result := 0
  else
    Result := i;
end;
// ---------------------------------------------------------------
procedure brightnessTurboBoost(var clip: TBitmap; rangeExpansionPowerOf2: integer; shiftValue: Byte);
var
  p0: PByte;
  x,y: Integer;
begin
  if (rangeExpansionPowerOf2 = 0) and (shiftValue = 0) then
    exit; // These parameter settings will not change the pixel values.
  for y := 0 to clip.Height-1 do
  begin
    p0 := clip.scanline[y];
    // Can't just do the whole buffer as a big block of bytes since the
    // individual scan lines may be padded for CPU alignment.
    for x := 0 to (clip.Width - 1) * 3 do
    begin
      if rangeExpansionPowerOf2 >= 1 then
        p0^ := IntToByte((p0^ shl rangeExpansionPowerOf2) + shiftValue)
      else
        p0^ := IntToByte(p0^ + shiftValue);
      Inc(p0);
    end;
  end;
end;

There are a few things to say about this code snippet.
First of all, you are using the ScanLine property of the TBitmap class. I have not been dealing with Delphi for many years, so I might be wrong about this, but I am under the impression that ScanLine is not actually a thin accessor, is it? It might be hiding things internally that can dramatically affect performance, such as "if the caller wants to access the bits of the image, then we have to first convert it to a DIB before returning pointers". So something that looks so simple might turn out to be a killer.
"if rangeExpansionPowerOf2 >= 1 then" in the inner loop body? You don't really want to repeat this comparison for every byte. Either make two separate functions or duplicate the whole loop in two versions, one for zero and one for non-zero rangeExpansionPowerOf2, and do this if only once.
"for ... to (clip.Width - 1) * 3 do": I am not really sure that Delphi optimizes the upper boundary evaluation so that it happens only once. You might be doing that multiplication three times for every pixel, when you could do it only once for the whole image.
For top performance, IntToByte should definitely be implemented with MMX to avoid the ifs and process multiple bytes at once.
Still, as you say that the images are only 352x288, I would suspect that the first point (the ScanLine access) is what is ruining the performance.
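The loop-related points in this answer (test the range-expansion flag outside the per-byte loop, compute the row length once) can be sketched as follows. This is only a minimal C++ illustration rather than Delphi, and it assumes a raw 24-bit buffer whose rows start `stride` bytes apart; `clamp_to_byte`, `brightness_boost` and all parameter names are hypothetical and do not come from the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Clamp an int into the 0..255 range of a byte (same role as IntToByte).
static inline std::uint8_t clamp_to_byte(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// Brighten a 24-bit image in place.
// `stride` is the padded length of one row in bytes (assumed >= width * 3).
void brightness_boost(std::uint8_t* pixels, int width, int height,
                      std::size_t stride, int shift_pow2, int shift_value) {
    assert(pixels != nullptr && width >= 0 && height >= 0);
    if (shift_pow2 == 0 && shift_value == 0) {
        return;  // these settings would not change any pixel value
    }
    // Row length in bytes, computed once for the whole image.
    const std::size_t row_bytes = static_cast<std::size_t>(width) * 3;
    assert(stride >= row_bytes);

    for (int y = 0; y < height; ++y) {
        std::uint8_t* row = pixels + static_cast<std::size_t>(y) * stride;
        if (shift_pow2 >= 1) {
            // The flag is tested once per row, not once per byte.
            for (std::size_t x = 0; x < row_bytes; ++x) {
                row[x] = clamp_to_byte((row[x] << shift_pow2) + shift_value);
            }
        } else {
            for (std::size_t x = 0; x < row_bytes; ++x) {
                row[x] = clamp_to_byte(row[x] + shift_value);
            }
        }
    }
}
```

The sketch walks all width * 3 bytes of each row and leaves the padding bytes alone. It says nothing about the cost of obtaining the row pointer, which is what the first point of the answer is concerned with.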

Related

How to set a cursor of non-standard size in Windows 10?

I am attempting to change the cursor in Windows 10 (version 1703) to a custom made one (conditional on some event when a script activates), that is larger than the default 32 by 32 size. The MWE based on my Autohotkey script is the following:
ImagePath = %A_ScriptDir%\foo.cur
Cursor_ID := 32512 ; Standard arrow
Cursor_Size := 128
^0::
    SetSystemCursor( ImagePath, Cursor_ID, Cursor_Size, Cursor_Size )
return
SetSystemCursor( path, id, sizeX, sizeY )
{
    Cursor := DllCall( "LoadImage",
        UInt,0, Str,path, UInt,0x2, Int,sizeX, Int,sizeY, UInt,0x00000010, Ptr)
    DllCall( "SetSystemCursor", Ptr,Cursor, Int,id )
}
(My code is based off of that found at https://autohotkey.com/board/topic/32608-changing-the-system-cursor/.)
As far as I can tell from the documentation of LoadImage, the function SetSystemCursor(...) should load the image with dimensions (sizeX, sizeY) when those parameters are not 0 (since the flag LR_DEFAULTSIZE = 0x00000040 is not set), but instead I get the following behaviour: no matter what sizes I set, the image gets scaled to (sizeX, sizeY), and then down/upscaled to (32, 32). This is most obvious by setting, say Cursor_Size := 2, then I get an upscaled version of a 2 by 2 image.
After some searching around I have found both information suggesting that this should work, and information to the effect that the size of cursors is always dictated by GetSystemMetrics(SM_CXCURSOR) and GetSystemMetrics(SM_CYCURSOR): see "The biggest size of Windows Cursor" (and also GetSystemMetrics).
Additional tests/ideas I've tried:
I checked the dimensions of the image corresponding to the handle returned
by LoadImage, and it seems to be (sizeX, sizeY), just as it should be,
therefore the scaling to 32 most likely happens upon executing SetSystemCursor.
I wanted to see if an application-specific cursor could bypass the
apparent 32 by 32 restriction, so using Resource Hacker, I replaced one of
the resources in Paint. It was scaled down to size 32 in the same way.
Setting the values that are returned by
GetSystemMetrics(SM_CXCURSOR) and GetSystemMetrics(SM_CYCURSOR)
might be an option if these indeed restrict cursor sizes, but I
could not find an appropriate function. I checked
SystemParametersInfo, but the only remotely relevant option
I found was SPI_SETCURSORS, and that just reloads the cursors from
registry.
It might be possible to change a registry value, though it would not
be my preferred solution, as it would most likely require a reboot
to take effect. Additionally, I haven't been able to find the relevant key.
My question would therefore be the following:
Is there a way to add an image of arbitrary size as a cursor in Windows 10, preferably without the need to reboot the computer? If so, how? Do SM_CXCURSOR and SM_CYCURSOR absolutely restrict the cursor's size? If they do, can these values be changed somehow?
EDIT:
It has been pointed out that yes, the documentation of GetSystemMetrics states "the system cannot create cursors of other sizes" than SM_CXCURSOR and SM_CYCURSOR, but at the same time at some of the other webpages I linked, people seem to claim to be able to create arbitrary sized cursors. Hence my request for confirmation/clarification of the matter.
Apart from that, the question about changing these values, or the existence of any other possible workaround would still be important to me.

std::copy runtime_error when working with uint16_t's

I'm looking for input as to why this breaks. See the addendum for contextual information, but I don't really think it is relevant.
I have an std::vector<uint16_t> depth_buffer that is initialized to have 640*480 elements. This means that the total space it takes up is 640*480*sizeof(uint16_t) = 614400.
The code that breaks:
void Kinect360::DepthCallback(void* _depth, uint32_t timestamp) {
    lock_guard<mutex> depth_data_lock(depth_mutex);
    uint16_t* depth = static_cast<uint16_t*>(_depth);
    std::copy(depth, depth + depthBufferSize(), depth_buffer.begin());/// the error
    new_depth_frame = true;
}
where depthBufferSize() will return 614400 (I've verified this multiple times).
My understanding of std::copy(first, amount, out) is that first specifies the memory address to start copying from, amount is how far in bytes to copy until, and out is the memory address to start copying to.
Of course, it can be done manually with something like
#pragma unroll
for(auto i = 0; i < 640*480; ++i) depth_buffer[i] = depth[i];
instead of the call to std::copy, but I'm really confused as to why std::copy fails here. Any thoughts???
Addendum: the context is that I am writing a derived class that inherits from FreenectDevice to work with a Kinect 360. Officially the error is a Bus Error, but I'm almost certain this is because libfreenect interprets an error in the DepthCallback as a Bus Error. Stepping through with lldb, it's a standard runtime_error being thrown from std::copy. If I manually enter depth + 614400 it will crash, though if I have depth + (640*480) it will chug along. At this stage I am not doing something meaningful with the depth data (rendering the raw depth appropriately with OpenGL is a separate issue xD), so it is hard to tell if everything got copied, or just a portion. That said, I'm almost positive it doesn't grab it all.
Contrasted with the corresponding VideoCallback and the call inside of copy(video, video + videoBufferSize(), video_buffer.begin()), I don't see why the above would crash. If my understanding of std::copy were wrong, this should crash too since videoBufferSize() is going to return 640*480*3*sizeof(uint8_t) = 640*480*3 = 921600. The *3 is from the fact that we have 3 uint8_t's per pixel, RGB (no A). The VideoCallback works swimmingly, as verified with OpenGL (and the fact that it's essentially identical to the samples provided with libfreenect...). FYI none of the samples I have found actually work with the raw depth data directly, all of them colorize the depth and use an std::vector<uint8_t> with RGB channels, which does not suit my needs for this project.
I'm happy to just ignore it and move on in some senses because I can get it to work, but I'm really quite perplexed as to why this breaks. Thanks for any thoughts!
The way std::copy works is that you provide start and end points of your input sequence and the location to begin copying to. The end point that you're providing is off the end of your sequence, because your depthBufferSize function is giving an offset in bytes, rather than the number of elements in your sequence.
If you remove the multiply by sizeof(uint16_t), it will work. At that point, you might also consider calling std::copy_n instead, which takes the number of elements to copy.
Edit: I just realised that I didn't answer the question directly.
Based on my understanding of std::copy, it shouldn't be throwing exceptions with the input you're giving it. The only thing in that code that could throw a runtime_error is the locking of the mutex.
Considering you have undefined behaviour as a result of running off of the end of your buffer, I'm tempted to say that has something to do with it.
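To make the element-versus-byte distinction concrete, here is a minimal sketch. It assumes the destination vector has already been sized; `samples_from_bytes` and `copy_depth_samples` are hypothetical helper names used for illustration only and are not part of libfreenect or of the original class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: convert a size in bytes to a count of uint16_t samples.
constexpr std::size_t samples_from_bytes(std::size_t byte_count) {
    return byte_count / sizeof(std::uint16_t);
}

// Copy `sample_count` depth samples from a raw callback pointer into `out`.
void copy_depth_samples(const void* raw, std::size_t sample_count,
                        std::vector<std::uint16_t>& out) {
    assert(out.size() >= sample_count);  // destination must already be large enough
    const auto* depth = static_cast<const std::uint16_t*>(raw);
    // depth + n advances by n elements (n * sizeof(uint16_t) bytes), so the
    // end of the input range is expressed in samples, never in bytes.
    std::copy(depth, depth + sample_count, out.begin());
    // Equivalent form that takes the count directly:
    // std::copy_n(depth, sample_count, out.begin());
}
```

With the numbers from the question, a byte size of 614400 corresponds to 640*480 = 307200 samples, and that sample count is what belongs in the pointer arithmetic. The video callback works with its byte count only because each element there is one byte wide.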

OpenGL core profile incredible slowdown on OS X

I added a new GL renderer to my engine, which uses the core profile. While it runs fine on Windows and/or nvidia cards, it is like 10 times slower on OS X (3 fps instead of 30). The weird thing is, that my compatibility profile renderer runs fine.
I collected some traces with Instruments and the GL profiler:
https://www.dropbox.com/sh/311fg9wu0zrarzm/31CGvUcf2q
It shows that the application spends its time in glDrawRangeElements.
I tried the following things:
use glDrawElements instead (no effect)
flip culling (no effect on speed)
disable some GL_DYNAMIC_DRAW buffers (no effect)
bind index buffer after VAO when drawing (no effect)
converted indices to 4 byte (no effect)
use GL_BGRA textures (no effect)
What I didn't try is to align my vertices to 16 byte boundary and/or convert indices to 4 byte, but seriously, if that would be the issue then why the hell does the standard allow it?
I'm creating the context like this:
NSOpenGLPixelFormatAttribute attributes[] =
{
    NSOpenGLPFAColorSize, 24,
    NSOpenGLPFAAlphaSize, 8,
    NSOpenGLPFADepthSize, 24,
    NSOpenGLPFAStencilSize, 8,
    NSOpenGLPFADoubleBuffer,
    NSOpenGLPFAAccelerated,
    NSOpenGLPFANoRecovery,
    NSOpenGLPFAOpenGLProfile, NSOpenGLProfileVersion3_2Core,
    0
};
NSOpenGLPixelFormat* format = [[NSOpenGLPixelFormat alloc] initWithAttributes:attributes];
NSOpenGLContext* context = [[NSOpenGLContext alloc] initWithFormat:format shareContext:nil];
[self.view setOpenGLContext:context];
[context makeCurrentContext];
Tried on the following specs:
radeon 6630M, OS X 10.7.5
radeon 6750M, OS X 10.7.5
geforce GT 330M, OS X 10.8.3
Do you have any ideas what I might do wrong? Again, it works fine with the compatibility profile (not using VAOs though).
UPDATE: reported to Apple.
UPDATE: Apple doesn't give a damn about the problem... Anyway, I created a small test program which actually runs well. Then I compared the call stacks in Instruments, and found out that when using the engine, glDrawRangeElements makes two calls:
gleDrawArraysOrElements_ExecCore
gleDrawArraysOrElements_Entries_Body
while in the test program it calls only the second. The first call does something like an immediate mode render (gleFlushPrimitivesTCLFunc, gleRunVertexSubmitterImmediate), so it obviously causes the slowdown.
Finally, I was able to reproduce the slowdown. This is just crazy... It is clearly caused by glBindAttribLocation being called on the "my_Position" attribute. Now I did some testing:
1 is default (as returned by glGetAttribLocation)
if I set it to zero, there's no problem
if I set it to 1, the rendering becomes slow
if I set it to any larger number, it is slow again
Obviously I relink the program (check code). It is not a problem in the implementation, I tested it with "normal" values too.
Test program:
https://www.dropbox.com/s/dgg48g1fwgyc5h0/SLOWDOWN_REPRO.zip
How to repro:
open with Xcode
open common/glext.h (don't be disturbed by the name)
modify the GLDECLUSAGE_POSITION constant from 0 to 1
compile and run => slow
changing back to zero => good
I have managed to get myself the same problem in the following circumstance under
OS X Mavericks:
Instanced rendering using array buffers to give each instance its own modelToWorld and inverseNormal matrices; attribute locations are being specified through layout rather than using glGetAttribLocation
leaving one of these array buffers unused in the shader, where location is declared but the attribute isn't actually used for anything in the glsl code
In this case, a call to glDrawElementsInstanced takes up a LOT of CPU time (under normal circumstances, this call uses nearly zero CPU even when drawing several thousand instances).
You can tell that you're getting this specific problem if almost all of the CPU time used within glDrawElementsInstanced is spent in gleDrawArraysOrElements_ExecCore. Making sure that all of the array buffers are actually referenced in your shader code fixes the CPU time back to (nearly) zero.
I suspect that this is one of the situations where leaving a variable out of your main() in GLSL confuses the compiler into deleting all references to that variable, leaving you with a dangling reference to an attribute or uniform.

How to determine programmatically the Windows' Performance settings with Delphi 2010

The following code is to fade my application on close.
procedure TfrmMain.btnClose1Click(Sender: TObject);
var
  i : Integer;
begin
  for i := 255 downto 0 do begin
    frmMain.AlphaBlendValue := i;
    application.ProcessMessages;
  end;
  Close;
end;
With Windows performance set to “Let Windows choose…”
When closing my Delphi app with the above code the fade is almost
instantaneous (maybe ¼ second at the most, if I blink I miss the
transition).
If I set the performance option to “Adjust for best performance”
When exiting the same app the fade takes over 12 seconds.
Using the same code but commenting out the AlphaBlendValue change removes the delay.
I tested this out on both Delphi 2010 and DelphiXE2 and the results are the same.
This was tested on Windows 7 Ultimate 64bit if that makes any difference.
To say the least this behavior puzzles me.
I thought that the form's alpha property was handled by the GPU and would therefore not be affected by Windows performance settings, which should be targeted at maximizing CPU performance.
So as far as this is concerned I'm not sure if this is a Windows 7 bug, a Delphi bug or just my lack of knowledge.
As far as a fix...
Is there a way to tell if Windows is running in crap graphics/max performance mode so that I can disable Alpha fade effects in my apps?
Edit for clarity:
While I would like to fix the fade what I am really looking for is a way to determine what the Windows performance setting is.
I am looking for how to determine a specific Windows setting - when you go into Windows Performance Options there are 3 tabs. On the first tab "Visual Effects" there are 3 canned options and a 4th option for 'Custom'. Minimally I am trying to determine if the option chosen is 'Adjust for best performance', if I could determine what the settings are on this tab even better.
Appreciate any help.
The fundamental problem with your code is that you are forcing 256 distinct updates irrespective of the performance characteristics of the machine. You don't have to use every single alpha blend value between 255 and 0. You can skip some values and still have a smooth fade.
You need to account for the actual graphics performance of the machine. Since you cannot predict that, you should account for real time in your fade code. Doing so will give you a consistent rate of fade irrespective of the performance characteristics of your machine.
So, here's a simple example to demonstrate tying the fade rate to real time:
procedure TfrmMain.btnClose1Click(Sender: TObject);
var
  Stopwatch: TStopwatch;
  NewAlphaBlendValue: Integer;
begin
  Stopwatch := TStopwatch.StartNew;
  while True do
  begin
    NewAlphaBlendValue := 255-(Stopwatch.ElapsedMilliseconds div 4);
    if NewAlphaBlendValue>0 then
      AlphaBlendValue := NewAlphaBlendValue
    else
      break;
  end;
  Close;
end;
The fade has a 1 second duration. You can readily adjust the mathematics to modify the duration to your requirements. This code will produce a smooth fade even on your low performing machine.
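As an illustration of adjusting the mathematics for a different duration, the mapping from elapsed time to alpha can be written as a small pure function. This is a hedged sketch in C++ rather than Delphi; `fade_alpha` and its parameters are hypothetical names, and the linear scaling shown is just one reasonable choice.

```cpp
#include <cassert>
#include <cstdint>

// Map elapsed time to an alpha value: 255 at the start of the fade,
// falling linearly to 0 once `duration_ms` has elapsed.
int fade_alpha(std::int64_t elapsed_ms, std::int64_t duration_ms) {
    assert(duration_ms > 0);
    if (elapsed_ms <= 0) {
        return 255;  // fade has not started yet
    }
    if (elapsed_ms >= duration_ms) {
        return 0;    // fade is complete
    }
    // Integer arithmetic; the result stays in 1..255 inside the fade window.
    return static_cast<int>(255 - (elapsed_ms * 255) / duration_ms);
}
```

A fade loop would call this with the stopwatch reading on each pass and stop when it returns 0, so however many updates the machine manages, the fade ends after roughly `duration_ms`.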
I would also comment that you should not use the global variable frmMain in a TfrmMain method. The TfrmMain method already has access to the instance. It is Self. And of course you can omit the Self. What's more, the call to ProcessMessages is bad. It allows re-entrant handling of queued input messages. You don't want that to happen, so remove the call to ProcessMessages.
You actually ask about detecting the Adjust for best performance setting. But I think that's the wrong question. For a start you should fix your fade code so that the fade duration is independent of graphics performance.
Having done that you may still wish to disable the fade if the user has asked for lower quality appearance settings. I don't think you should look for one of the 3 canned options that you mention. They are quite possibly Windows version specific. Personally I would base the behaviour on the Animate windows when minimizing and maximizing setting. My rationale is that if the user does not want minimize and maximize to be animated, then presumably they don't want window close to be faded.
Here's how to read that setting:
function GetWindowAnimation: Boolean;
var
  AnimationInfo: TAnimationInfo;
begin
  AnimationInfo.cbSize := SizeOf(AnimationInfo);
  if not SystemParametersInfo(SPI_GETANIMATION, AnimationInfo.cbSize,
    @AnimationInfo, 0) then
    RaiseLastOSError;
  Result := AnimationInfo.iMinAnimate<>0;
end;
I think that most of the other settings that you may be concerned with can also be read using SystemParametersInfo. You should be able to work out how to do so by following the documentation.
Sorry for the tardy followup but it took me a while to figure out a working answer to my question and some of the issues behind it.
First, a thank you to David Heffernan for the insight on a better way to handle the fade loop and the information on TStopwatch from Delphi's Diagnostics unit, much appreciated.
In regards to being able to determine the Windows' Performance settings...
When using the following un-optimized fade loop
procedure TfrmMain.btnFadeNCloseClick(Sender: TObject);
var
  i : Integer;
begin
  for i := 255 downto 0 do
    frmMain.AlphaBlendValue := i;
  Close;
end;
the actual Windows Performance Option settings causing the performance issue are "Enable desktop composition" and "Use visual styles on Windows and buttons". If both options are enabled there is no issue, if either setting is not enabled the loop crawls** (about 12 seconds on my system if the form is maximized).
Turns out that turning Aero Glass on or off affects these same 2 settings. So being able to detect whether Aero Glass is on lets me decide whether or not to enable the form effects, such as transition fades and other eye candy, in my apps. Plus now I can also capture that information in my bug reports.
**Note this appears to be an NVidia issue/bug, or at least an issue that is much more severe on systems with NVidia graphics cards. On 2 different NVidia systems (with recent, if not the latest, drivers) I got similar results for a maximized form fade: less than .001 seconds if Aero Glass is on, around 12 seconds if Aero Glass is off. On a system with an Intel graphics card: less than .001 seconds if Aero Glass is on, about 3.7 seconds if Aero Glass is off. Now granted my test sampling is small, 3 NVidia systems (counting my customer who initially reported the issue) and one non-NVidia system, but if I was using a decent NVidia graphics card I would not bother turning Aero Glass off.
Below is the working code to detect if Aero Glass is enabled via Delphi:
This function has been tested on a Windows 7 64-bit system and works with Delphi 2007, 2010 and XE2 (32 & 64-bit).
All of the various versions of the Delphi function below that I found on the net were broken - along with comments of people complaining about getting Access Violation errors.
What finally shed the light on fixing the bad code was Gerry Coll's response to: AccessViolationException in Delphi - impossible (check it, unbelievable...) which was about trying to fix AV errors in a function of the same type.
function ISAeroEnabled: Boolean;
type
  _DwmIsCompositionEnabledFunc = function(var IsEnabled: Bool): HRESULT; stdcall;
var
  Flag : BOOL;
  DllHandle : THandle;
  OsVersion : TOSVersionInfo;
  DwmIsCompositionEnabledFunc: _DwmIsCompositionEnabledFunc;
begin
  Result:=False;
  ZeroMemory(@OsVersion, SizeOf(OsVersion));
  OsVersion.dwOSVersionInfoSize := SizeOf(TOSVERSIONINFO);
  if ((GetVersionEx(OsVersion)) and (OsVersion.dwPlatformId = VER_PLATFORM_WIN32_NT) and
    (OsVersion.dwMajorVersion = 6) and (OsVersion.dwMinorVersion < 2)) then //Vista&Win7 only (no Win8)
  begin
    DllHandle := LoadLibrary('dwmapi.dll');
    try
      if DllHandle <> 0 then
      begin
        @DwmIsCompositionEnabledFunc := GetProcAddress(DllHandle, 'DwmIsCompositionEnabled');
        if (@DwmIsCompositionEnabledFunc <> nil) then
        begin
          if DwmIsCompositionEnabledFunc(Flag)= S_OK then
            Result:= Flag;
        end;
      end;
    finally
      FreeLibrary(DllHandle);
    end;
  end;
end;

Windows Client graphics written off the window to upper-left of screen

I have a Windows WinMain() window in which I draw simple graphics, merely LineTo() and FillRect(). The rectangles move around. After about an hour, the output that used to go to the main window all of a sudden goes to the upper-left corner of my screen, as if client coordinates were being interpreted as screen coordinates. My GetDC()'s and ReleaseDC()'s seem to be balanced, and I even checked the return value from ReleaseDC() to make sure it is not 0 (per MSDN). Sometimes the output moves back to my main window. When I go to the debugger (VS 2010), my coordinates do not seem amiss, but the output is going to the wrong place. I handle WM_PAINT, WM_CREATE, WM_TIMER, and a few others. I do not know how to debug this. Any help would be appreciated.
This has 'not checking return values' written all over it. Pretty crucial in raw Win32 programming, most every API function returns a boolean or a handle where FALSE or NULL indicates failure. GetLastError() provides the error code.
A cheap way to check for this without modifying code is by using the debugger to look at the EAX register value after the API call. A 0 indicates failure. In Visual Studio you can do so by using the @eax and @err pseudo variables in the Watch window, respectively the function return value and the GetLastError value.
This goes bad once Windows starts failing API calls, probably because of a resource leak. You can see it with TaskMgr.exe, Processes tab. View + Select Columns and tick Handles, USER objects and GDI objects. It is usually the latter; restoring the device context and releasing drawing objects is very easy to fumble. You don't have to wait until it fails, a steadily climbing number in one of those columns is the giveaway. It goes belly-up when the value hits 10,000.
You must be calling GetDC(NULL) somewhere by mistake, which would get the DC for the entire desktop.
You could make all your GetDC calls call a wrapper function which asserts if the argument is NULL to help track this down:
#include <assert.h>
HDC GetDCAssert(HWND hWnd)
{
    assert(hWnd);
    return ::GetDC(hWnd);
}
