How do you stop OS X from rebooting when developing OpenGL code?

I'm currently writing an OpenGL renderer on my 2011 13" MacBook Pro with a Sandy Bridge graphics chip.
I'm encountering a lot of kernel panics and reboots when developing OpenGL code. Frequently, whenever I have an error, my system just reboots, rather than giving me a chance to catch the error and retrieve an error code.
I know that it is related to the graphics driver as the resultant problem reporting app displayed at reboot identifies it as the entity that crashed.
The specific issue seems closely related to texture creation. Clearly there is some bug in my code, but regardless, this really shouldn't be rebooting the OS under a high-level API like OpenGL.
Does OS X have any kind of debug mode that I might enable, similar to that of D3D, so that I can catch the error earlier, rather than having to use Russian roulette debugging?
(I'm aware of the OpenGL profiler, Driver Monitor and so on, yet have had little success with using these tools to catch these sorts of problems)

As you mention, OpenGL Profiler is the tool to use. You should check at least the boxes marked "Break on VAR error" and "Break on thread error". If you have trouble with it, let me know and I might be able to help. (I'm no expert but have had some luck with it.)
Beyond that, the crashes you're seeing are probably related to you giving a pointer to OpenGL, and it attempting to read or write memory from that pointer, but the pointer is bad (or the length of the data is wrong). If it's texture related, then perhaps you're attempting to upload or download a texture and passing the wrong width and height, or have the wrong format. I've seen this happen when passing an incorrect number of elements to glDrawElements(). I was confused about whether an "element" was a vertex or an actual object (like a QUAD or TRIANGLE) when it happened to me. The VAR error reporting helped me find that issue.
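To make the two failure modes above concrete, here is a minimal Python sketch of the size checks you can run before handing data to the driver. The helper names are hypothetical and not part of OpenGL. The sketch assumes tightly packed 8-bit components and the default unpack alignment of 4 bytes per row, so treat it as an illustration rather than a complete validator.

```python
# Hypothetical helpers; the names are illustrative and not part of OpenGL.

# glDrawElements takes the number of INDICES to read, not the number of
# primitives, so the count depends on the primitive mode.
INDICES_PER_PRIMITIVE = {
    "GL_POINTS": 1,
    "GL_LINES": 2,
    "GL_TRIANGLES": 3,
    "GL_QUADS": 4,
}


def draw_elements_count(mode, primitive_count):
    """Return the index count to pass for `primitive_count` primitives."""
    return INDICES_PER_PRIMITIVE[mode] * primitive_count


def texture_upload_bytes(width, height, bytes_per_pixel, unpack_alignment=4):
    """Conservative byte size the driver may read for a 2D texture upload.

    Each row is padded up to a multiple of `unpack_alignment` (assumed to
    match the GL_UNPACK_ALIGNMENT setting, which defaults to 4).
    """
    row = width * bytes_per_pixel
    padded_row = (row + unpack_alignment - 1) // unpack_alignment * unpack_alignment
    return padded_row * height


def check_upload(buffer, width, height, bytes_per_pixel, unpack_alignment=4):
    """Raise before the driver gets a chance to read past the buffer."""
    needed = texture_upload_bytes(width, height, bytes_per_pixel, unpack_alignment)
    if len(buffer) < needed:
        raise ValueError(f"buffer has {len(buffer)} bytes, upload reads {needed}")
    return needed
```

For example, a 3x2 RGB image has 9 bytes of pixel data per row, but with the default alignment each row is read as 12 bytes, so an 18-byte buffer is too small and check_upload raises instead of letting the driver read out of bounds.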

Just to come back to this for anyone looking... it turns out that the problem was entirely related to failing to set the current context as different threads began issuing OpenGL commands.
So, each thread needed to lock a mutex, set the OpenGL context, and then begin its work. It would then release the context and then the lock, guaranteeing non-simultaneous access to the one OpenGL context.
So, no deeply unknown behaviour here really, just an inexperienced newbie not fully implementing the guidelines out there. :-)
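The lock, make-current, work, release, unlock sequence described above can be sketched as follows. This is only a simulation in Python: FakeContext is a hypothetical stand-in for a real OpenGL context (on OS X the real calls would be the platform's make-current functions), and it merely counts violations of the "one thread at a time" rule.

```python
import threading


class FakeContext:
    """Hypothetical stand-in for an OpenGL context; tracks the owning thread."""

    def __init__(self):
        self.current_thread = None
        self.violations = 0
        self.commands = []

    def make_current(self):
        # Making the context current while another thread still holds it
        # is exactly the mistake described in the answer.
        if self.current_thread is not None:
            self.violations += 1
        self.current_thread = threading.get_ident()

    def release(self):
        self.current_thread = None

    def issue(self, command):
        # A command issued by a thread that does not own the context
        # counts as a violation.
        if self.current_thread != threading.get_ident():
            self.violations += 1
        self.commands.append(command)


def render_work(context, lock, name, steps):
    # Order matters: take the lock, bind the context, do the work,
    # release the context, and only then release the lock.
    with lock:
        context.make_current()
        try:
            for i in range(steps):
                context.issue(f"{name}:{i}")
        finally:
            context.release()


def run_workers(worker_count=4, steps=100):
    context = FakeContext()
    lock = threading.Lock()
    threads = [
        threading.Thread(target=render_work, args=(context, lock, f"worker{n}", steps))
        for n in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return context.violations, len(context.commands)
```

Releasing the context inside a finally block means an exception in one worker does not leave the context bound to a thread that has already given up the lock.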

Others have responded with potential workarounds. But note that your application should never be able to cause the machine to panic (which these days simply reboots the machine and presents a dialog to submit the report to Apple).
At a minimum, you should send the report to Apple. Additionally, you should file a bug report at http://bugreport.apple.com including the panic log, a system profiler report, and any details you can provide about how to reproduce (ideally, a sample app binary and/or source code). Filing your own bug report will help in many ways -- prioritizing your bug (dupes bump priority), providing reproduction steps in case the problem & fix aren't obvious from the backtrace in the panic report, and opening a channel between you and Apple in case they need more information from you to track it down.

Related

D3D9 Present returns D3DERR_DEVICELOST even in Windowed Mode(!)

Foreword
Since this appears to be a bug in the d3d9 emulation on the Windows side, this would probably be best addressed to Microsoft. If you know where I could get in contact with the DirectX team, please tell me.
For the time being, I'm assuming that the only real chance we have is working around the bug.
What
We're investigating unresponsiveness in the game Test Drive Unlimited 2.
It occurs only when opening the map and only with an "RTX" card (I think the most precise we got is GDDR6, because AMD also seems affected).
After long debugging, we found out that it's not a simple fault of the game; instead, Present returns D3DERR_DEVICELOST even when the game is in windowed mode.
When the game is in fullscreen mode, it properly does the required roundtrip over TestCooperativeLevel and Reset, but after the next frame is rendered, Present has lost the device again, causing the hang.
Now I'm looking for pointers on how to solve this issue. While it's probably some internal state corruption of some sorts, it's definitely triggered by an API Call only present when rendering the Map in that Game.
We will try to dig into d3d9.dll, but my suspicion is that the error code just comes from some Kernel/Driver Call, where our knowledge and tooling ends.
Ideally I'd like to fuzzy-find the draw call by hooking everything and omitting random API calls, but I guess it's not that easy and would cause a lot more errors in most conditions.
Also note that an APITrace we did showed D3D_OK for every single call including EndScene, up until Present, so it's not as simple as checking the return codes.
Using DirectX 9 in debug/diagnostic mode is apparently also not really possible on Windows 10 anymore, even when installing the SDK from June 2010.
Thanks in Advance for any idea and maybe addresses to direct this problem to.

Kernel Panic when killing Node js - Help me figure it out

I am experiencing kernel panics when I kill Node.js under certain circumstances, such as when it is stuck in an infinite loop (always) or when it is a stopped job under Bash (sometimes).
EDIT: My code isn't doing anything network related. I'm running a modified CoffeeScript repl.
I don't expect to be able to get a direct answer since it is a rather complicated problem and may be a bug in node, v8, or OS X for all I know at the moment.
However, I am at least somewhat familiar with all the technical aspects required to find it so I think with the right clues I could narrow it down, prevent it, and send a bug report to the appropriate people.
Feel free to have me investigate anything, up to and including using programs such as SIMBL and Application Enhancer if need be.
Here is the error report from the last kernel panic:
http://pastie.org/3043592
Thanks!
I can't tell for sure, but my suspicions would lie first with the following kernel extensions:
at.obdev.nke.LittleSnitch. Little Snitch messes with the network stack in some pretty major ways, so it seems likely that it might have something to do with your crashes (assuming that your node.js app is using sockets).
com.cisco.nke.ipsec. It also has to do with networking, so I'm also suspicious. Less so, though, because it (theoretically...) should just be adding a Cisco VPN interface.
org.pqrs.driver.NoEjectDelay, org.pqrs.driver.PCKeyboardHack, org.pqrs.driver.KeyRemap4MacBook. They're hacks. Need I say more?
com.shapeservices.msm.driver.MSMFramebuffer, com.shapeservices.msm.driver.MSMVideoDevice. iDisplay is unlikely to be related, but it might be!
If all else fails, submit a bug report at https://bugreport.apple.com.

Periodic GPU performance problem

I have a WinForms application that uses XNA to animate 3D models in a control. The app has been doing just fine for months, but recently I've started to experience periodic pauses in the animation. Setting out to investigate what is going on, I have established these facts:
1. It happens on my machine only; other machines work fine.
2. Removing everything from my render loop does not improve the problem.
In 2. I didn't actually remove everything; I limited my loop to setting the viewport on my GraphicsDevice and then doing a GraphicsDevice.Present.
Trying to dig further I fired up PIX to capture some statistics. Screenshots of two PIX runs can be viewed here (Run6) and here (Run14). Run6 is using my original render loop and Run14 is using the bare-bones Present loop.
PIX tells me that the GPU is periodically doing something, and I assume this is causing the pauses. What could be the cause of this? Or how do I go about finding out what the GPU is actually doing?
Update: since I usually trust my code to be perfect (who's laughing?) I started a new XNA project from scratch to see if it exhibits the same behavior. So starting a new XNA 3.1 Windows Game project and running PIX I get this timeline. The same periodic pauses. So the problem must be lower in the stack, in XNA or Direct3D.
So PIX shows that the GPU is working on something. I can see the list of DX calls made within each frame, and the timing calculations show that the pause occurs during (or after) the IDirect3DDevice9::Present call.
Update 2: I had previously installed and uninstalled XNA 4.0 CTP on the problematic machine. I cannot be certain that this is related but I thought that perhaps a reinstall of the XNA Game Studio 3.1 bits could make a difference. Turns out it did.
The underlying question remains the same (and the bounty is still up): what could affect XNA 3.1 (or DirectX) to make it behave like this and is there any logging/tracing power tool for the DirectX and/or GPU level out there that could shed some light on what is going on?
Note: I'm using XNA 3.1 on a Windows 7 x64 dual-core machine with 8GB RAM.
Note2: also posted this question on the XNA Creators forums here.
You could try to see if you can find something with Xperf that lines up with your periodic problem. For a first trace, do not run your application, but keep open the programs that would normally run alongside it. You could also trace again with the application running, but that could give a cluttered view.
Start the tracing, do this in an elevated prompt.
xperf -on BASE+LATENCY -stackWalk Profile
Wait for a fair amount of time to be sure that the problem is traced.
Stop the tracing and open it like this.
xperf -d trace.etl
xperfview trace.etl
Analyze by looking at the graphs and consulting the tables for specific intervals, and see if you can find something that is related to the problem. The highest chance of finding it would be in the DPC and Interrupts section, but it might as well be something odd in the CPU or I/O section. Good luck!
Also more information on Xperf and how to obtain it, hopefully this delivers results.
If not, you can alternatively try GPUView which has been used for improvements in DWM,
this is also included next to Xperf with the Windows Performance Toolkit so you can easily try both!
log v
... wait for a fair amount of time to be sure that the problem is traced ...
log
gpuview merged.etl
In the case that gpuview gets out of memory you can try to add "/limit 3" or remove the v.
Read the documentation of the tools if you are stuck somewhere.
Hmm ... this seems to be occurring on the GPU, however it sounds like a CPU garbage collection issue. Can you run the CLR profiler and see if you can see any spikes in GC activity that you can correlate to the slowdowns?
I agree that it sounds unlikely since you can clearly see it in PIX, but it might offer a clue as to the cause.
If it's only happening on your own machine, then could it be drivers? Forgive me for being skeptical, but it's a 64 bit machine after all :D
This looks like either a vsync issue or a GPU in its last throes. Since going back to a different version fixed it, and the "bottleneck" is in IDirect3DDevice9::Present, let's go with the former option.
I'm not familiar with XNA so I don't know how much of the workings of D3D are exposed, but do you know what your PresentationParameters are set to?
Specifically, try setting the swap effect to Discard.

Debugging VBO Vertex buffers crashes

I'm using the VBO extension for storing vertex, normal and color buffers (glBindBufferARB).
For some reason, when changing buffers or doing some operation, the application crashes with an access violation. When attaching the debugger I see that the crash is not in my main thread, which performs the OpenGL calls, but in some other thread, executing in a DLL that is related to the NVIDIA graphics driver.
What probably happened is that I passed a bad buffer, or a wrong size, to some buffer call. So my question is, how do I debug this situation? The crash seems to happen some time after the actual call and in a different thread.
Assuming this is about Windows, NVIDIA has a GLExpert tool. It can print various OpenGL warnings/errors.
In some other cases, using GLIntercept OpenGL call interceptor with error checking turned on can be useful.
If the tools do not help, well, then it's good old debugging. Try to narrow down the problem and locate what exactly causes a crash. If it's a NVIDIA specific problem, try installing different drivers and/or asking on NVIDIA developer forums.
I think you may just have to brute force that one.
I.e. comment out any vbo using lines a few at a time till your program doesn't crash anymore. Then you'll have an idea of which lines to focus your attention on and really scrutinize the parameters you're passing.
Also try sprinkling glGetError() calls liberally around your program. Often if you pass a bogus parameter glGetError will tell you something is wrong before it gets to the point of crashing.
One of the best OpenGL/D3D debugging tools is NVIDIA's NvPerfHUD. It won't help you find your exact problem, but it does provide another view of what you are sending into the rendering pipeline.
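As a rough illustration of the "sprinkle error checks" advice, here is a minimal Python sketch. The error-query call in the OpenGL API is glGetError, and it can hold more than one pending error flag, so the sketch keeps calling it until it returns GL_NO_ERROR. The helper names are hypothetical, and get_error is a parameter so the example runs without a GL context; in a real program you would pass your binding's glGetError (an assumption about your setup).

```python
# Standard OpenGL error codes.
GL_NO_ERROR = 0
GL_ERROR_NAMES = {
    0x0500: "GL_INVALID_ENUM",
    0x0501: "GL_INVALID_VALUE",
    0x0502: "GL_INVALID_OPERATION",
    0x0503: "GL_STACK_OVERFLOW",
    0x0504: "GL_STACK_UNDERFLOW",
    0x0505: "GL_OUT_OF_MEMORY",
}


def drain_gl_errors(get_error, max_errors=32):
    """Call get_error() until it reports GL_NO_ERROR; return the error names.

    max_errors is a safety cap (a hypothetical choice) so a misbehaving
    driver cannot keep this loop spinning forever.
    """
    names = []
    for _ in range(max_errors):
        code = get_error()
        if code == GL_NO_ERROR:
            break
        names.append(GL_ERROR_NAMES.get(code, hex(code)))
    return names


def checked(get_error, label, call, *args):
    """Run one GL call, then fail loudly if it left any error flags set."""
    result = call(*args)
    errors = drain_gl_errors(get_error)
    if errors:
        raise RuntimeError(f"{label} raised {', '.join(errors)}")
    return result
```

Wrapping each buffer call this way narrows a delayed crash down to the first call that reports an error, which is usually much closer to the real mistake than the point where the driver thread faults.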
However, I will say that I've only used it with D3D applications so I don't know if it helps as much with OpenGL programs.
EDIT:
I'm not sure why this got voted down. I have debugged VB and IB problems with NvPerfHUD before. Simple things such as bad primitive counts can be diagnosed by looking at each individual draw call.

Arithmetic underflow or overflow exception during debugging

This is the day of weird behavior.
We have a Win32 project made with Delphi 2007, which hosts the .NET runtime and calls into .NET to show new forms, as part of a transition period.
Recently we've begun experiencing exceptions at seemingly random locations and points of our code: Arithmetic overflow or underflow.
The stack trace of one of these looks like this:
at System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG& msg)
at System.Windows.Forms.Application.ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(Int32 dwComponentID, Int32 reason, Int32 pvLoopData)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoopInner(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.ThreadContext.RunMessageLoop(Int32 reason, ApplicationContext context)
at System.Windows.Forms.Application.RunDialog(Form form)
at System.Windows.Forms.Form.ShowDialog(IWin32Window owner)
at System.Windows.Forms.Form.ShowDialog()
at Gatsoft.Gat.UI.Windows.Forms.Remanaging.RemanageForm.DelphiOpenInNewMode(String employeeCode, String departmentCode, DateTime date) in C:\Dev\VS.NET\Gatsoft\Gatsoft.Gat.UI.Windows\Forms\Remanaging\RemanageForm.Delphi.cs:line 67
In the Visual Studio solution, one of the outermost class libraries (i.e. one that pulls in all the references it can) has a specific debug program set, targeted at the Delphi project output. This allows us to debug .NET code from Visual Studio, even though the main bulk of the program is written in Delphi.
The problem only occurs when run from the debugger, not if we just run the exe file directly (either through explorer, shortcuts, or even Ctrl+F5 inside Visual Studio).
There's apparently no spyware on the machine (as hinted by this).
Any other things we can check?
Edit: It looks like the .NET debugger is enabling the SNaN flag, and the Delphi debugger is not. We'll have to investigate this further, but for now I'll accept @Lorenzo Boccaccia's answer.
Apparently Solved
Ok, it looks like we've finally nailed this problem. The problem started occurring without the debugger attached as well, for our testers, so we had to prioritize the problem way up.
Finally we found one common factor among the machines that had the problem: they are Dell Latitude D620 laptops with an NVIDIA Quadro NVS 110M, with an old driver from back in 2006 that came from the system image used to provision the laptops.
I found one post on the web, though I lost the url when I rebooted to update the display driver, that had a .NET service crashing, mostly when the machine was busy doing something on the screen. One way to reproduce his problem was to open a command prompt to C:\ and doing a DIR /S to just force a massive amount of screen updates, which would trigger the crash.
He too had a NVIDIA video card.
The problem on my machine occurred roughly every 2-4 startups of our program, but after updating the video driver I've had 123 successful startups without any problems. (BTW I can recommend AutoHotKey for such things).
So it looks like we've found the culprit, an old/buggy NVIDIA driver.
Updated this question so that perhaps someone in the future can save some time.
Now, if you'll excuse me, I'm going to go cry in a corner.
Jinxed!
I must've jinxed it. No sooner had I posted the above update than a colleague's laptop failed, after updating the video driver.
Still, I'm positive it's a problem outside of our application now, so it just remains to figure out which specific things to update.
Further updates: Ok, my machine is now apparently fixed, but my colleague's machine is not. So far we've updated the BIOS and chipset drivers, and currently SP3 for XP is on its way in.
A burn-in test will be done tonight, where the app will be left overnight starting up, as the problem cropped up either during startup, or at the first time some WinForms .NET code was executed. This app is mainly a Delphi Win32 app, but it hosts the .NET runtime, and the problem seems to be related to .NET code. When we "boot" the .NET runtime, the problem can appear, or when we fire the first .NET window from Win32 then it can also appear.
Statistically I'm ready to release this code now. Over the night the application has been started 3051 times without errors, whereas before I updated the video driver it crashed every 2-4 times.
Prodded and found(!/?)
This bug-fixing ordeal feels like going to the doctor, where the following conversation ensues:
Doc: Does this hurt?
Me: No...
Doc: What about now?
I've prodded and poked the application and finally I think I've found something we did that introduced this problem.
In our app we host the .NET runtime, from a Delphi 2007 Win32 application, and in our glue-code we have the following line (now):
rc := CorBindToRuntimeEx('v2.0.50727', 'wks',
STARTUP_LOADER_OPTIMIZATION_MULTI_DOMAIN or STARTUP_CONCURRENT_GC,
@clsid, @iid, UnkRuntimeEngine);
The two constants in the middle there were originally just a 0, meaning pick the defaults. This change was introduced a few months ago, and the problem has been slowly creeping in on us since then. The change was introduced in order to encourage ANTS profiler to load our Win32 application + hosted .NET runtime in order to do performance profiling, and the changes we introduced back then made that work. Additionally, the problem with arithmetic overflow/underflow has slowly been getting worse, so I bet the problem didn't appear for a while after the change and therefore wasn't attributed to any of the changes we did.
Also, since we only (originally) saw the problem when running through the debugger, we thought something was wrong with Visual Studio and/or Delphi.
Anyway, statistically now, with a browser on one screen doing repeated scrolling up and down triggered by a javascript (apparently needed in order to trigger the bug), then I have been able to successfully start the application 726 times with a 0 in the call, and it crashes 5 out of 17 times with the two constants there.
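As a back-of-the-envelope check of those numbers, the following Python sketch asks how likely 726 clean starts in a row would be if the 5-in-17 failure rate still applied. It assumes each start is an independent trial with a fixed failure rate, which is a simplification and not something measured here; the function name is hypothetical.

```python
import math


def chance_of_clean_streak(failures, trials, streak):
    """Return (failure_rate, log10 of P(streak clean runs in a row)).

    Assumes independent starts with a constant failure rate. The result is
    kept as a base-10 logarithm so tiny probabilities stay readable.
    """
    failure_rate = failures / trials
    log10_p = streak * math.log10(1.0 - failure_rate)
    return failure_rate, log10_p


rate, log10_p = chance_of_clean_streak(failures=5, trials=17, streak=726)
print(f"observed failure rate with the two constants: {rate:.1%}")
print(f"log10 of P(726 clean starts at that rate): {log10_p:.1f}")
```

Under that assumption the probability comes out at roughly 10^-110, so the clean streak with a 0 in the call is very unlikely to be luck.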
Doc: Does this hurt?
And let's not get into who made that change in the first place. I'm sure the culprit wants to be left anonymous... cough
A debug version of a linked DLL could be compiled with signaling NaN support; see http://blogs.msdn.com/oldnewthing/archive/2008/07/02/8679191.aspx for an example of this problem.
That heisenbug was caused by uninitialized variables; here there could be a linked DLL enabling the SNaN feature of the CPU and forgetting to disable it upon returning.
Do the errors still occur if you attach the debugger after starting the application?

Resources