MongoDB on Windows slow *sometimes* using tailable cursors as message queue

We are completing a generic service-bus implementation in C, with clients for C#, Delphi, PL/SQL and PHP.
The library works great and we get awesome performance from our bus, unless the MongoDB database is running on Windows (tested on 2008 R2, 2003 and 7) and no other "special" program is running.
Our test does the following:
Program A sends a message on a capped collection
Program B tails the message queue collection and "wakes up" when a message appears, using a cursor with the awaitData param set to true
When Program B wakes up, it prepares a message and sends a response to Program A by inserting a document in a collection specific to Program A
Program A was already waiting on that second "response" collection and is awakened when Program B (the producer) sends the response back
Loop ends there
Our testing program counts the loop and reports performance on a console application compiled with Visual Studio 2010.
We run everything on one machine, or use a separate machine for MongoDB and run the consumer and producer together on another machine.
We run this on Windows 2008R2, Windows 2003 and Windows 7.
For 2008R2 we used the special mongo build for that OS, while for 2003 and 7 we used the "legacy" 64 bits build.
In a clean OS, with no programs running, our test performs about 32-50 roundtrips per second, which is lousy performance compared to the "good" results we get when everything goes full speed.
Now, here comes the strange thing:
When certain applications are started on the machine where the mongo database runs, our tests speed up to about 450/sec when everything runs on the same machine over loopback, or to about 300/sec when consumer and producer run on one machine and MongoDB runs on another machine over the network.
The reason we never noticed this problem consistently before was because pretty much all the time we had in our development vms Visual Studio open, and Visual Studio is a program that acts as a "mongodb accelerator" (I know this sounds ridiculous, please don't bash me on this statement).
At first we noticed this issue "randomly", essentially when running our tests without VS open. So we tended to blame it on the underlying SAN where vmware runs, or the vm hosts, or cosmic rays or the NSA snooping on our program.
That lasted until we finally figured out the correlation between having VS open and the test results, and narrowed it down to the following:
MongoDB running on a Windows system (as a console application OR as a service), virtual or physical, on 2008 R2, 2003 or 7, runs this pattern slowly: receiving data on a capped collection, waking up a tailing cursor, then sending a response back to the consumer on another capped collection in the same way. That holds unless you simply start a program such as Visual Studio, Delphi XE4, the Google Chrome browser or the CrystalDiskMark disk I/O testing program (other programs may speed up Mongo too). Then MongoDB runs the pattern a full order of magnitude faster.
We could not find exactly what these programs have in common that may cause the issue.
At this point we are stunned by the issue. I even reviewed the MongoDB code used for tailable cursors, but didn't find anything that smells like a potential cause. The code pretty much spins for a maximum of 4 seconds waiting for data to appear; besides the suspicious "sleep" call on every loop iteration, nothing else was eye-catching.
Is it possible that certain programs cause the Sleep() Windows API call to behave differently, and that this makes mongo run these operations on tailable cursors slower?
We think something is indeed "slowing down" because CPU utilization also goes down, as if mongodb is literally "waiting" for something when it's running slow.
I know this pattern works fine on Unix/Linux-based systems; I tried the same codebase on a Mac with no issues, so this smells strongly like a Windows issue.
Anyone else experienced a similar issue out there?

Found the source of the problem.
MongoDB calls the Windows Sleep() API function directly, even for sleep times shorter than 15ms.
Because of the default Windows minimum resolution, anything less than that (at least on Windows 2008 R2, Windows 2003 and Windows 7) will sleep for at least 15ms no matter what.
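A quick way to check this on a given machine is to time short sleeps directly. The following is a minimal, portable sketch, not MongoDB code. It assumes that std::this_thread::sleep_for is subject to the same timer granularity as Sleep(), and the helper names average_sleep_ms and max_round_trips_per_sec are made up for illustration. The second helper encodes a simplifying assumption: every round trip pays for a fixed number of sleeps.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Average wall-clock duration, in milliseconds, of `iterations` sleeps that
// each request `requested_ms`. With a coarse timer tick, the result for a
// 1 ms request is expected to sit near the tick length rather than near 1 ms.
double average_sleep_ms(int requested_ms, int iterations) {
    assert(requested_ms >= 0 && iterations > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(requested_ms));
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}

// Upper bound on round trips per second under the assumed model that each
// round trip waits through `sleeps_per_round_trip` sleeps of `actual_sleep_ms`.
double max_round_trips_per_sec(double actual_sleep_ms, int sleeps_per_round_trip) {
    assert(actual_sleep_ms > 0.0 && sleeps_per_round_trip > 0);
    return 1000.0 / (actual_sleep_ms * sleeps_per_round_trip);
}
```

Calling average_sleep_ms(1, 100) with and without Chrome or Visual Studio running should show whether the effective resolution changes. If a 1 ms request really takes about 15.6 ms (a commonly cited default tick of 1/64 s, not a figure measured in this post), one or two such sleeps per round trip cap throughput at 64 or 32 round trips per second, which is the same ballpark as the 32-50 reported in the question.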
A simple fix in MongoDB is to change time_support.cpp from this:
void sleepmillis(long long s) {
    fassert(16228, s <= 0xffffffff );
    Sleep((DWORD) s);
}
to:
// Declared by hand here; both functions are exported by winmm.dll (link with winmm.lib).
extern "C" unsigned int __stdcall timeBeginPeriod( unsigned int ms );
extern "C" unsigned int __stdcall timeEndPeriod( unsigned int ms );
// Note below the arbitrary nature of the 50ms set as the "minimum" timer resolution.
// There seems to be no complete agreement on what *is* the default timer resolution in Windows.
// To be on the "safe side" let's use 50ms
#define BELOW_WINDOWS_MIN_RESOLUTION(s) ((s) > 0 && (s) < 50)
void sleepmillis(long long s) {
    fassert(16228, s <= 0xffffffff );
    // When our waiting period falls below the Windows min resolution, let's set the resolution
    // to 1 ms. Note that this change may affect all kernel scheduler thread operations.
    // Apparently this changes the Windows kernel "quantum" length
    // see http://msdn.microsoft.com/en-us/library/windows/desktop/dd757624(v=vs.85).aspx
    // Applications such as Google Chrome seem to do this during the life of Chrome, that's why
    // running apps which do this "accelerate" certain mongo operations that depend on proper
    // Sleep() resolution on Windows
    if(BELOW_WINDOWS_MIN_RESOLUTION(s)) timeBeginPeriod(1);
    Sleep((DWORD) s);
    if(BELOW_WINDOWS_MIN_RESOLUTION(s)) timeEndPeriod(1);
}
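The patch above raises and restores the resolution around every short sleep. Another option, sketched here under stated assumptions, is to hold the request for a whole scope (for example, the lifetime of the process) with a small RAII guard. ScopedTimerResolution is a hypothetical name and not part of MongoDB. It relies only on timeBeginPeriod and timeEndPeriod, and on platforms other than Windows it does nothing.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#include <mmsystem.h>               // timeBeginPeriod / timeEndPeriod
#pragma comment(lib, "winmm.lib")   // MSVC only: pulls in the import library
#endif

// Hypothetical guard: requests a timer resolution of `period_ms` while alive
// and withdraws the request on destruction, so the two calls stay paired.
class ScopedTimerResolution {
public:
    explicit ScopedTimerResolution(unsigned int period_ms) : period_ms_(period_ms) {
        assert(period_ms > 0);
#ifdef _WIN32
        active_ = (timeBeginPeriod(period_ms_) == TIMERR_NOERROR);
#endif
    }

    ~ScopedTimerResolution() {
#ifdef _WIN32
        if (active_) timeEndPeriod(period_ms_);
#endif
    }

    // Copying would end the period twice, so the guard is not copyable.
    ScopedTimerResolution(const ScopedTimerResolution&) = delete;
    ScopedTimerResolution& operator=(const ScopedTimerResolution&) = delete;

    bool active() const { return active_; }          // always false off Windows
    unsigned int period_ms() const { return period_ms_; }

private:
    unsigned int period_ms_;
    bool active_ = false;
};
```

Declaring one ScopedTimerResolution guard(1); near the top of main() would keep the finer resolution for the whole run. The trade-off is the one the code comments already mention: the setting can affect scheduling system-wide for as long as the request is held.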

Related

Delphi Application Hangs

I have a working application that needs to clear and setup very large buffers from time to time. Sometimes the operation takes longer than 5 seconds to complete and then Desktop Window Manager times out and declares the application to be hung.
Is there any technique to cause DWM to allow more time before this happens?
Delphi Seattle, Windows 10, 64-bit application

The right solution is to put the long running task in a thread so that it does not block your UI thread. You should do that.
If you cannot bring yourself to take that task on, and it can be quite tricky, you can always disable ghosting by calling DisableProcessWindowsGhosting. But you really ought not to do that. You really ought to put the work in a thread.
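As a rough illustration of the first suggestion, here is a sketch in standard C++ rather than Delphi (where a TThread descendant would play the same role). The names clear_large_buffer and run_without_blocking are hypothetical, and pump stands in for whatever keeps the UI message loop serviced while the work runs elsewhere.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

// Stand-in for the slow job: allocate a large buffer and clear it.
std::size_t clear_large_buffer(std::size_t bytes) {
    assert(bytes > 0);
    std::vector<unsigned char> buffer(bytes, 0xFF);
    std::fill(buffer.begin(), buffer.end(), static_cast<unsigned char>(0));
    return buffer.size();
}

// Runs the job on a worker thread. The calling (UI) thread never blocks for
// more than about 10 ms at a time and calls `pump` between waits.
template <typename Pump>
std::size_t run_without_blocking(std::size_t bytes, Pump pump) {
    std::future<std::size_t> job =
        std::async(std::launch::async, clear_large_buffer, bytes);
    while (job.wait_for(std::chrono::milliseconds(10)) != std::future_status::ready) {
        pump();  // the UI keeps handling input and paint messages here
    }
    return job.get();
}
```

In a real GUI application the worker would normally post a completion message back to the UI thread instead of being polled, but the polling loop keeps the sketch self-contained.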

Does Windows Server 2012 support the Beep() winapi function?

My office recently transitioned from running Windows 7 to using remote desktop sessions running on Windows Server 2012. Some VBA macros I was using include calls to the "Beep" function in the Windows API, which I used as an audio alert that execution had completed. This is the version of "Beep" that takes in a frequency parameter and a duration parameter, and plays a beep of that frequency in Hz for that duration in ms. On the new system, however, no sound is played, and there is no delay after calls to the function. Did Microsoft remove the Beep function in Server 2012?
EDIT: this is a different question than the linked discussion about Beep in Windows 7. I was previously using this code on Windows 7 without any problems. I have checked that my computer's speakers are turned on, that they are plugged in, and that the system volume is not muted. The behavior here is that Beep() is being treated as a null; nothing happens when it is executed. If the sound were simply muted, the code would at least delay until the indicated number of milliseconds had elapsed; instead, Beep() is skipped like it's not there at all in my code. Take, for example, the following code:
Sub beepTest()
    k = Beep(523, 5000)
    MsgBox("All Done!")
End Sub
On the old Windows 7 system, the computer would play a tone over the plugged in speakers for 5 seconds, then pop up with the message box. When running inside the Server 2012 RDP, however, the call to Beep() is essentially ignored, and the message box appears instantaneously, without a 5 second delay. This means that it can not be just an unplugged or muted sound device; if it were, there would be no sound audible, but there would still be a delay, similar to calling the Sleep() function. I wanted to know whether Microsoft had possibly removed support for this API in Server 2012, which would explain why my code isn't working as intended.

Windows 7: poor GUI response in my program while downloading data; is there some way to improve this?

I've written a program that (among other things) downloads multiple large files from a server on the LAN, using TCP. This program runs fine under Linux, MacOS/X, and generally under Windows as well (it uses Qt for the GUI and straight sockets calls for networking), but on certain Windows machines the download appears to be too much for the machine to handle, and I'm wondering if anyone has any ideas as to why that is and what can be done about it.
When downloading files, my program spawns a separate I/O thread that basically just sits in a loop, downloading data over TCP and writing it to a file, writing 128KB per call to QFile::write(). Each file is typically several hundred megabytes long, and a typical download session writes out several dozen of these files. Note that the I/O thread runs independently of the GUI thread, so I wouldn't expect it to affect the GUI's performance much if at all -- especially not when running on a multicore PC.
The PC in question is a Core 2 Quad Q6600 running at 2.40GHz, with 4GB of RAM. It's running Windows 7 Ultimate SP1, 32-bit. It is receiving data over a Gigabit Ethernet connection and writing it to files on the NTFS-formatted boot partition of the 232GB internal Hitachi ATA drive.
The symptom is that sometimes during a download (seemingly at random) the program's GUI will become non-responsive for 10 to 30 seconds at a time, and often the title bar of the window will have "(not responding)" appended to it. The symptom will then clear up again and the download will proceed normally again. Another symptom is that the desktop is extremely sluggish during the download... for example, if I click on the "Start" button, the Start menu will take ~30 seconds to populate, instead of being populated near-instantaneously as I would expect.
Note that Task Manager shows plenty of free memory, but it does show short spikes of CPU usage to 100% on one of the 4 cores, at the same time the problems are seen.
The data is arriving over Gigabit Ethernet, and if I have my program just receive the data and throw it away (without writing it to the hard drive), the machine can maintain a constant download rate of about 96MB/sec without breaking a sweat. If I write the received data to a file, however, the download rate decreases to about 37MB/sec, and the symptoms described above start to appear.
The interesting thing is that just for curiosity's sake I added this call to my I/O thread's entry function, just before the beginning of its event loop:
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
When I did that, the "(not responding)" symptoms cleared, but then download speed was reduced to only ~25MB/sec.
So my questions are:
Does anyone know what might be causing the sporadic hangups of the GUI when the hard drive is under a heavy write-load?
Why does lowering the I/O thread's priority cause the download rate to drop so much, given that there are three idle cores on the machine? I would think that even a lower-priority thread would have plenty of CPU available in this situation.
Is there any way to get a maximum download rate without causing Windows' desktop responsiveness and/or my app's GUI responsiveness to suffer problems?

Without seeing any code it is hard to answer, but this seems to be related to the processors and the fact that your download thread is not leaving any room for other threads to perform other operations.
It seems it never waits, and that the network card driver is not well written.
Are you sure your thread enters an idle state when there is no incoming data?
On a single-processor OS, a for (;;) {} will consume 100% CPU, and if it talks continuously to the kernel it may stop other processes or threads from doing so, especially if there is a bug or very bad behaviour in some network card driver, as in your case.
By putting the thread priority below normal you are probably asking the OS to run your thread less often, which, through some lucky combination of things, keeps the system from hanging too much.
Check the code; maybe you are forgetting something?
Check whether adding a sleep(0), to force the OS to yield to another thread now and then, makes things better. That is only a temporary fix, though: you should find out why your thread is consuming 100% CPU, if it is.

Program runs slow on just a couple of computers

I have a program that I run on multiple network PCs. When I compiled the most recent version, it runs extremely slowly on 2 PCs on the network, but runs fine for everyone else.
This used to happen with my old dev PC when I had an additional 2gb RAM installed. When I would remove the additional 2gb and recompile, it would then work fine for everyone.
Now, I am on a completely new machine and am having the same issue. I've tried to rebuild the project after rebooting, but still have the same issue.
For all other PCs, the program loads in about 3-5 seconds. On these 2 PCs, it takes anywhere from 45 seconds to 1.5 mins to load...
One of the PCs is an older Dell Dimension 8200, but the other is a newer OptiPlex that is identical to several other PCs on the network, so this is what is really making it so confusing.
For now, I've had to revert to the old version so it will run correctly for everyone.
Does anyone have any idea of anything to try?
Thanks in advance!!!
Edit:
Ok, it was an exhausting day yesterday trying various things to solve this issue. Here is what I tried and where the problem begins:
Using the new program
Went back to old versions of all updated components, but still had the same issue
Using the old program
I decided to go back to the drawing board and start from the old version of the application and incrementally add the new features a small piece at a time.
Recompiled the old version using the old components - program works fine
Updated to new DevExpress components - program works fine
Updated to new ESBPCS components - program works fine
Updated to new DeepSoftware components - program works fine
Ok, so now we know there is nothing with the component sets I've updated...
Added 1 image to each of 2 image lists - program works fine
Added new database table - program works fine
Added code to open and close the new table - program works fine
Added new action to action list and added a menu item and toolbar button to new action (action does nothing at this point) - program works fine
Added a new BLANK form to the application and added code to open the new form - BAM!!!
So, adding just one form to the application is what's causing the issue! I removed all the code for the opening of the form, commented out the uses clauses and removed the uses entry from the project source and everything is back to normal!
Anybody have any idea about this?
Thanks!
Edit 2:
For @Warren P - here is my .DPR source:
program Scheduler;
uses
  ExceptionLog,
  Forms,
  SchedulerMainUnit in 'SchedulerMainUnit.pas' {FrmMain},
  SchedulerDBInfoUnit in 'SchedulerDBInfoUnit.pas' {FrmDBInfo},
  SchedulerHistoryUnit in 'SchedulerHistoryUnit.pas' {FrmHistory},
  SchedulerOptionsUnit in 'SchedulerOptionsUnit.pas' {FrmOptions},
  SchedulerExtVersionUnit in 'SchedulerExtVersionUnit.pas' {FrmExtVersion},
  SchedulerSplashUnit in 'SchedulerSplashUnit.pas' {FrmSplash},
  SchedulerInfoUnit in 'SchedulerInfoUnit.pas' {FrmInfo},
  SchedulerShippedUnit in 'SchedulerShippedUnit.pas' {FrmShipped}; {<-- This is the new form with the issue}
{$R *.res}
begin
  Application.Initialize;
  Application.Title := 'SmartWool WIP Scheduling Assistant';
  Application.CreateForm(TFrmMain, FrmMain);
  Application.CreateForm(TFrmDBInfo, FrmDBInfo);
  Application.CreateForm(TFrmHistory, FrmHistory);
  Application.CreateForm(TFrmOptions, FrmOptions);
  Application.CreateForm(TFrmExtVersion, FrmExtVersion);
  Application.Run;
end.
And here is the initialization section of the main form to create the splash:
initialization
  FrmSplash:=TFrmSplash.Create(Application);
  FrmSplash.Show;
  FrmSplash.Refresh;
Edit 3:
Anybody??? Please?

It could be that the program is waiting for timeouts when trying to access resources that are not available on that machine such as network drives or Internet hosts.
Try running Process Monitor when starting up your program and look for file open calls. Filter the output so it only shows your process.
http://technet.microsoft.com/en-us/sysinternals/bb896645

Performance problems can seem very daunting at first.
I have been on many teams where people have tried to guess at a reason for performance problems. This sometimes works, but is far less effective than actually measuring the code.
When the problem is reproducible on a development machine, I would recommend a profiler.
There was a previous question that asked about Delphi Profiling tools, which lists several possible tools you could use.
When you can't reproduce the problem on your development machine, then it becomes a bit more difficult, but not impossible. Typically I have found that problems are related to an application dependency that is different, and not performing well. Understanding the external influences on your application can help pinpoint the problem.
Common external problems I have seen in some of my applications:
Network
Database
Application Servers
Installation or Data File Location (i.e. Disk Performance)
Virus and Malware Scanners
Other applications interfering with yours, such as a virus.
To monitor for items related to the network (i.e. Database, web services, etc...) I typically use Wireshark, which lets me see whether resources are responding in the expected times. My most common problem is poorly performing DNS, which can be found using Wireshark.
You can use the AutoRuns program to determine everything that starts up when your computer does; it's useful for determining differences between machines.
But most of all I have logging that can be turned on in my applications and this allows me to isolate the problem to a specific area of code. This narrowing down to a specific section of code reduces the guessing, and allows you to focus on a few possible problems.

I created a log function for this that I call at specific places (in your case especially during startup). It adds a timestamp to each log text and stores them in a TMemo that is regularly saved. Not only very helpful when debugging, but may also shed some light on your problem.
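A minimal sketch of such a log, written in C++ for illustration. The description above stores entries in a TMemo that is saved regularly, whereas this hypothetical StartupLog class only keeps them in memory; the class name, its methods and the slowest_step helper are all assumptions, not the original code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical startup logger: stamps each message with the milliseconds
// elapsed since the logger was constructed.
class StartupLog {
    using Clock = std::chrono::steady_clock;

public:
    struct Entry {
        double elapsed_ms;
        std::string text;
    };

    StartupLog() : start_(Clock::now()) {}

    void log(std::string text) {
        const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start_;
        entries_.push_back(Entry{elapsed.count(), std::move(text)});
    }

    // Index of the entry preceded by the longest gap, i.e. the step that took
    // the most time to reach. Requires at least one entry.
    std::size_t slowest_step() const {
        assert(!entries_.empty());
        std::size_t worst = 0;
        double worst_gap = entries_[0].elapsed_ms;
        for (std::size_t i = 1; i < entries_.size(); ++i) {
            const double gap = entries_[i].elapsed_ms - entries_[i - 1].elapsed_ms;
            if (gap > worst_gap) {
                worst_gap = gap;
                worst = i;
            }
        }
        return worst;
    }

    const std::vector<Entry>& entries() const { return entries_; }

private:
    Clock::time_point start_;
    std::vector<Entry> entries_;
};
```

Calling log() before and after each startup step (form creation, database open, and so on) and then looking at slowest_step() on one of the slow PCs should narrow a 45-second load down to a single step.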

Are you using code signing - ie Microsoft Authenticode? If so, then outdated certificate authorities on the computers can cause significant delays to startup.

First, I would try to defragment the hard disk. If it is still slow, I would check the power supply. Maybe your hard disks are getting insufficient power.

Check if there is the same antivirus software on those 2 problematic computers. If so, then your Delphi application may match byte pattern used in some virus made in Delphi. Update virus definitions to solve it, or report false alarm to antivirus company, or change antivirus software.
Check whether any printer is installed on those 2 problematic computers. If none is, add any printer and try again.

Idea 1:
One reason I have seen for very slow application load time, is when printing or reporting system components like Developer Express Express Print, are in your application.
The problem I saw when using Developer Express Printing components, is that I had an offline or non-responsive network printer in my list of printers (check the control panel printer icon) that was not responding. Some of those Developer Express components seem to read some information from each printer you have installed, and the solution was to go to those clients, and delete old printers from their control panel, that were no longer being used. Each not-responding network printer added up to 60 seconds for a TCP Timeout, to the startup time of my application.
Update - Idea 2:
Download MS DebugView and install it on the machine that runs slowly. Now go back to your main development PC, open the IDE, and open your main project file (right click on the project, view project source in project viewer); this will show you the contents of your main project source file (.dpr). Go to the main begin....end. block. Now set a breakpoint on the main begin statement, and single step INTO (not OVER) and you will see all the module initialization sections. In each one add this: OutputDebugString('ModuleName').
Now when you run this inside the Delphi IDE you will see messages, see how far apart they arrive, and understand what is taking a long time to initialize. Instead of installing the Delphi IDE on the machine that runs slowly, run DebugView (a single executable of less than 400 KB) there; it will show you these debug messages, along with a nice time display (##.# seconds) for each message.
MS Debug view is here.

Are you allowing the forms to be constructed on initialization within the DPR source? If so, you may do well to consider whether or not you want those forms sucking up memory the entire time, more-over if you want those forms to be wasting the application's time on load.
A rule of thumb: If the form is used a LOT during the application's execution, allow it to be constructed when the application loads (this will work out faster over-all than constructing the instance "on-demand").
If the form is not used very often at all (for example, a Dialog or an About Box), delete the "Application.CreateForm" line from the DPR source, and instead construct your instance on request...
var
  LForm: TfrmAbout;
begin
  Application.CreateForm(TfrmAbout, LForm); // form class first, then the instance variable
  try
    LForm.ShowModal;
  finally
    LForm.Free;
  end;
end;
Now that form (which may not even be displayed during the program's execution) is not sucking up system resources, and will not slow down the application's load time.
It may not solve your problem 100%, but it should certainly help!

Invoke Blue Screen of Death using Managed Code

Just curious here: is it possible to invoke a Windows Blue Screen of Death using .net managed code under Windows XP/Vista? And if it is possible, what could the example code be?
Just for the record, this is not for any malicious purpose, I am just wondering what kind of code it would take to actually kill the operating system as specified.

The keyboard thing is probably a good option, but if you need to do it by code, continue reading...
You don't really need anything to barf, per se, all you need to do is find the KeBugCheck(Ex) function and invoke that.
http://msdn.microsoft.com/en-us/library/ms801640.aspx
http://msdn.microsoft.com/en-us/library/ms801645.aspx
For manually initiated crashes, you want to use 0xE2 (MANUALLY_INITIATED_CRASH) or 0xDEADDEAD (MANUALLY_INITIATED_CRASH1) as the bug check code. They are reserved explicitly for that use.
However, finding the function may prove to be a bit tricky. The Windows DDK may help (check Ntddk.h) - I don't have it available at the moment, and I can't seem to find decisive info right now - I think it's in ntoskrnl.exe or ntkrnlpa.exe, but I'm not sure, and don't currently have the tools to verify it.
You might find it easier to just write a simple C++ app or something that calls the function, and then just running that.
Mind you, I'm assuming that Windows doesn't block you from accessing the function from user-space (.NET might have some special provisions). I have not tested it myself.

I do not know if it really works and I am sure you need Admin rights, but you could set the CrashOnCtrlScroll Registry Key and then use a SendKeys to send CTRL+Scroll Lock+Scroll Lock.
But I believe that this HAS to come from the Keyboard Driver, so I guess a simple SendKeys is not good enough and you would either need to somehow hook into the Keyboard Driver (sounds really messy) or check whether CrashDump has an API that can be called with P/Invoke.
http://support.microsoft.com/kb/244139
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters
Name: CrashOnCtrlScroll
Data Type: REG_DWORD
Value: 1
Restart

I would have to say no. You'd have to p/invoke and interact with a driver or other code that lives in kernel space. .NET code lives far removed from this area, although there has been some talk about managed drivers in future versions of Windows. Just wait a few more years and you can crash away just like our unmanaged friends.

As far as I know a real BSOD requires failure in kernel mode code. Vista still has BSODs but they're less frequent because the new driver model has fewer drivers in kernel mode. Any user-mode failures will just result in your application being killed.
You can't run managed code in kernel mode. So if you want to BSOD you need to use PInvoke. But even this is quite difficult. You need to do some really fancy PInvokes to get something in kernel mode to barf.
But among the thousands of SO users there is probably someone who has done this :-)

You could use OSR Online's tool that triggers a kernel crash. I've never tried it myself but I imagine you could just run it via the standard .net Process class:
http://www.osronline.com/article.cfm?article=153

I once managed to generate a BSOD on Windows XP using System.Net.Sockets in .NET 1.1 irresponsibly. I could repeat it fairly regularly, but unfortunately that was a couple of years ago and I don't remember exactly how I triggered it, or have the source code around anymore.

Try live video input using DirectShow in DirectX 8 or DirectX 9; most of the calls go to kernel-mode video drivers. I succeeded in producing lots of blue screens when running a callback procedure from a live video capture source; in particular, if your callback takes a long time, it can halt the entire kernel driver.

It's possible for managed code to cause a bugcheck when it has access to faulty kernel drivers. However, it would be the kernel driver that directly causes the BSOD (for example, uffe's DirectShow BSODs, Terence Lewis's socket BSODs, or BSODs seen when using BitTorrent with certain network adapters).
Direct user-mode access to privileged low-level resources may cause a bugcheck (for example, scribbling on Device\PhysicalMemory, if it doesn't corrupt your hard disk first; Vista doesn't allow user-mode access to physical memory).
If you just want a dump file, Mendelt's suggestion of using WinDbg is a much better idea than exploiting a bug in a kernel driver. Unfortunately, the .dump command is not supported for local kernel debugging, so you would need a second PC connected over serial or 1394, or a VM connected over a virtual serial port. LiveKd may be a single-PC option, if you don't need the state of the memory dump to be completely self-consistent.

This one doesn't need any kernel-mode drivers, just SeDebugPrivilege. You can mark your process as critical with NtSetInformationProcess or RtlSetProcessIsCritical, and then just kill your process. You will see the same bugcheck code as when you kill csrss.exe, because you set the same "critical" flag on your process.

Unfortunately, I know how to do this as a .NET service on our server was causing a blue screen. (Note: Windows Server 2008 R2, not XP/Vista).
I could hardly believe a .NET program was the culprit, but it was. Furthermore, I've just replicated the BSOD in a virtual machine.
The offending code, which causes a 0x000000F4:
string name = string.Empty; // This is the cause of the problem, should check for IsNullOrWhiteSpace
foreach (Process process in Process.GetProcesses().Where(p => p.ProcessName.StartsWith(name, StringComparison.OrdinalIgnoreCase)))
{
    Check.Logging.Write("FindAndKillProcess THIS SHOULD BLUE SCREEN " + process.ProcessName);
    process.Kill();
    r = true;
}
If anyone's wondering why I'd want to replicate the blue screen, it's nothing malicious. I've modified our logging class to take an argument telling it to write direct to disk as the actions prior to the BSOD weren't appearing in the log despite .Flush() being called. I replicated the server crash to test the logging change. The VM duly crashed but the logging worked.
EDIT: Killing csrss.exe appears to be what causes the blue screen. As per comments, this is likely happening in kernel code.
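The underlying trap is that an empty prefix matches every name, so the filter selects every process. Here is a small C++ analogue of that logic with the missing guard added. The function names starts_with and select_by_prefix are hypothetical, and the comparison is case-sensitive for brevity, unlike the OrdinalIgnoreCase comparison in the C# snippet.

```cpp
#include <cassert>
#include <string>
#include <vector>

// True when `name` begins with `prefix`. An empty prefix matches every name,
// which is the behaviour that made the C# filter select all processes.
bool starts_with(const std::string& name, const std::string& prefix) {
    return name.compare(0, prefix.size(), prefix) == 0;
}

// Guarded selection: an empty or whitespace-only prefix selects nothing
// instead of everything.
std::vector<std::string> select_by_prefix(const std::vector<std::string>& names,
                                          const std::string& prefix) {
    std::vector<std::string> selected;
    if (prefix.find_first_not_of(" \t\r\n") == std::string::npos) {
        return selected;  // refuse to match on a blank filter
    }
    for (const std::string& name : names) {
        if (starts_with(name, prefix)) {
            selected.push_back(name);
        }
    }
    return selected;
}
```

With the guard in place, a blank name yields an empty list, so nothing is killed; without it, the list would include csrss.exe along with everything else.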

I found that if you run taskkill /F /IM svchost.exe as an Administrator, it tries to kill just about every service host at once.
