Give all possible resources to a program - Windows

I created a program in C# to work with 2.5 million records in Oracle Express (a local instance), parse/split those records, and create an additional 5 million records.
I added some code to print timings on the screen and it seems fairly fast: it processes 1K records every 9 seconds, which means it will take more than 6 hours to finish.
Now, with Task Manager I can see the program is using at most 6% of the CPU and around 50 MB of memory. I understand that the OS and Oracle itself need resources to operate, but is there a way to tell this little program "hey, it's OK, go ahead and use at least 50% of the CPU; there are 4 GB of RAM, so knock yourself out"?
Note: one of the reasons I'm using a local instance of Oracle Express is to reduce the network bottleneck. Also, I won't run this process very often, but I was intrigued to see whether this is possible at all.
Please forgive my noobness,
Thanks!

The operating system will give your program all the resources it needs. The reason your process is not consuming all of the CPU is probably that it spends more time waiting on the I/O subsystem than on the processor.
If you want to see whether you can consume more CPU cycles, try writing a program that runs a tight infinite loop as fast as possible (something like the sketch below) and you will see the difference in CPU usage.
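For illustration, here is a minimal C# sketch of such a loop (my own example, not code from the question); run it and Task Manager should show one core pegged, in contrast with the IO-bound loader:

class Spin
{
    // Burns one core flat out, purely to compare Task Manager readings
    // against the IO-bound database loader. Stop it with Ctrl+C.
    static void Main()
    {
        long counter = 0;
        while (true)
        {
            counter++;   // pure CPU work: no I/O, no waiting
        }
    }
}

On a quad-core box a single-threaded spin like this shows roughly 25% total CPU, which is another hint that the 6% figure above means waiting, not computing.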

A number of thoughts, not really answers I guess, but:
You could raise the priority of the application's thread; however, it's possible that the code is less efficient than you think, so...
Have you run a profiler on it?
If it's currently a single-threaded app, you could look at parsing the records in batches and running those batches in parallel (see the sketch after this list).
Without knowing much detail about how the records are split, is it possible to hand more of that work off to Oracle itself? That way it would matter less whether the database is across the network or local.
If your app is drawing/updating a screen or UI, that will almost certainly slow the work down. An example: I ran an app which sorted about 10K emails into around 250K rows in a database; when I added an item to a listbox for each line, the run time went from short to ridiculous (I got bored and killed it). So, again, offloading the work to a background thread that makes as few UI updates as possible can help.
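As a rough sketch of the batching idea mentioned above (FetchBatches, SplitAndInsert and Record are placeholders for whatever the program already does, and it assumes .NET 4's Parallel.ForEach):

using System.Collections.Generic;
using System.Threading.Tasks;

class Record { /* fields for the parsed/split data */ }

class BatchRunner
{
    // Placeholder: yields the 2.5M source records in chunks (e.g. 10K at a time).
    static IEnumerable<IList<Record>> FetchBatches() { yield break; }

    // Placeholder: parses one batch and writes the derived rows back to Oracle.
    // Each worker should use its own connection rather than sharing one.
    static void SplitAndInsert(IList<Record> batch) { /* parse, split, insert */ }

    static void Run()
    {
        Parallel.ForEach(
            FetchBatches(),
            new ParallelOptions { MaxDegreeOfParallelism = 4 },   // roughly one worker per core
            batch => SplitAndInsert(batch));
    }
}

Whether this helps depends on where the time really goes; if the bottleneck is the Oracle inserts, batching the commits (say every few thousand rows) will probably buy more than extra threads.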

Related

Elimination of run time variation over repeated executions of the same program

I am trying to design an online programming contest judge, and one of the things I need to ensure is that when the same code is compiled (assuming the requirement), given the same input, it takes exactly the same amount of time to execute, each time it is run.
Currently, I am using a simple Python script that has two threads, one of which invokes a blocking system call that starts the execution of the test code, while the other keeps track of time and sends a kill signal to the child process after the time limit expires. Incidentally, I am doing this inside a virtual machine for reasons of security and convenience (setting up a proper chroot is far too complicated, and riskier).
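The question's script is Python, but for the record, here is the same timeout-and-kill pattern sketched in C#. This is a sketch only, assuming the submission is launched as an external process; ./solution and the 2-second limit are made-up values:

using System;
using System.Diagnostics;

class Judge
{
    static void Main()
    {
        var psi = new ProcessStartInfo("./solution")   // hypothetical submission binary
        {
            UseShellExecute = false,
            RedirectStandardInput = true,
            RedirectStandardOutput = true
        };

        using (var proc = Process.Start(psi))
        {
            // Equivalent of the blocking call plus the watchdog thread: wait up
            // to the time limit, then kill the child if it is still running.
            if (!proc.WaitForExit(2000))               // 2000 ms limit (example value)
            {
                proc.Kill();
                Console.WriteLine("Time limit exceeded");
            }
        }
    }
}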
However, given identical conditions (i.e. when I restore a snapshot), I still get a variation of approximately 50 ms either way in the execution time. As this prevents setting strict time limits, is there any way to eliminate this variation?
I'm not an expert in that field, but I don't think you can do it. Even if you restore the snapshot inside the VM, the state of the "outside" machine is going to be quite different. You have two OSes running, each with multiple processes that will compete for resources at some point. If it's a website or a PC with an internet connection, it can get hit by varying numbers of connections (or requests), and that will make processes start running, consume resources, and so on. If some application tries to access the hard disk, the initial position of the disk head matters a lot for seek time, etc.
If you want a "deterministic" limit, you might want to check whether you can count how many instructions were executed by a given process, or something along those lines.
Anyway, I've participated in several programming contests, and as far as I know, they don't care about 50 ms differences. If you use a proper algorithm, you can come in under the time limit by a really big margin. So I'd advise you to live with it and just account for it in the rules.

Drastic performance improvement in .NET CF after app gets moved out of the foreground. Why?

I have a large (500K lines) .NET CF (C#) program, running on CE6/.NET CF 3.5 (v3.5.10181.0) on a Freescale i.MX31 (ARM) @ 400 MHz. It has 128 MB of RAM, with ~80 MB available to applications. My app is the only significant one running (this is a dedicated, embedded system). Managed memory in use (as reported by GC.Collect) is about 18 MB.
To give a better idea of the app's size, here are some stats culled from the .NET CF Remote Performance Monitor after starting up the application:
GC:
Garbage Collections: 131
Bytes Collected by GC: 97,919,260
Managed Bytes in use after GC: 17,774,992
Total Bytes in use after GC: 24,117,424
GC Compactions: 41
JIT:
Native Bytes Jitted: 10,274,820
Loader:
Classes Loaded: 7,393
Methods Loaded: 27,691
Recently, I have been trying to track down a performance problem. I found that my benchmark after running the app in two different startup configurations would run in approximately 2 seconds (slow case) vs. 1 second (fast case). In the slow case, the time for the benchmark could change randomly from EXE run to EXE run from 1.1 to 2 seconds, but for any given EXE run, would not change for the life of the application. In other words, you could re-run the benchmark and the time for the test stays the same until you restart the EXE, at which point a new time is established and consistent.
I could not explain the 1.1 to 2x slowdown via any conventional mechanism, or by narrowing the slowdown to any particular part of the benchmark code. It appeared that the overall process was just running slower, almost like a thread was spinning and taking away some of "my" CPU.
Then, I randomly discovered that just by switching away from my app (the GUI loses the foreground) to another app, my performance issue disappears. It stays gone even after returning my app to the foreground. I now have a tentative workaround where my app after startup launches an auxiliary app with a 1x1 size window that kills itself after 5ms. Thus the aux app takes the foreground, then relinquishes it.
The question is, why does this speed up my application?
I know that code gets pitched when a .NET CF app loses the foreground. I also notice that when performing a "GC Heap" capture with .NET CF Remote Performance Monitor, a code pitch is logged, and this also triggers the performance improvement in my app. So I suspect that code pitching is related to, or even responsible for, fixing the performance. But I'm at a loss as to how to determine whether that is really the case, or to explain why pitching code could help in this way. Does pitching lots of code somehow improve the locality of reference of code pages (which are re-JITted, presumably near each other in memory) enough to matter this much? (My benchmark spans probably three dozen routines and hundreds of lines of code.)
Most importantly, what can I do in my app to reliably avoid this slower condition? Any pointers to relevant .NET CF / JIT / code pitching information would be greatly appreciated.
Your app going to the background auto-triggers a GC.Collect, which collects, may compact the GC Heap and may pitch code. Have you checked to see if a manual GC.Collect without going to the background gives the same behavior? It might not be pitching that's giving the perf gain, it might be collection or compaction. If a significant number of dead roots are swept up, walking the root tree may be getting faster. Can't say I've specifically seen this issue, so this is all conjecture.
Send your app a WM_HIBERNATE message before your performance loop. That will clean things up.
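For reference, here is a minimal sketch of both suggestions, a manual collection and a self-sent WM_HIBERNATE, in .NET CF C#. The 0x3FF message value and the coredll.dll P/Invoke are the standard Windows CE ones, but treat this as an untested illustration rather than the poster's actual fix:

using System;
using System.Runtime.InteropServices;
using System.Windows.Forms;

static class MemoryKick
{
    const uint WM_HIBERNATE = 0x3FF;   // Windows CE "release what memory you can" message

    [DllImport("coredll.dll")]
    static extern int SendMessage(IntPtr hWnd, uint msg, IntPtr wParam, IntPtr lParam);

    // Call this just before the benchmark / performance-critical loop.
    public static void Kick(Form mainForm)
    {
        GC.Collect();                                 // test whether a collection alone gives the speed-up
        SendMessage(mainForm.Handle, WM_HIBERNATE,
                    IntPtr.Zero, IntPtr.Zero);        // ask the app to hibernate, which may pitch code
    }
}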
We have a similar issue with our .NET CF application.
Over time, our application progressively slows down, eventually grinding to a halt, which I suspect is due to high CPU load or, as #wil-s says, as if a thread is spinning and consuming CPU. The only conclusion I've reached so far is that either we have a rogue thread in our code, or there's an under-the-covers issue in .NET CF, maybe with the JITter.
Closing the application and re-launching it returns our application to normal, expected performance.
I have yet to implement a code change to test issuing WM_HIBERNATE or launching a dummy app which quits itself (as above) to force a code pitch, but based on the above comments I'm fairly sure this will resolve our issue (so many thanks for that).
However, I'm also interested to know whether a root cause was ever found for this specific issue.
Incidentally, and seemingly off topic (but bear with me), we're using a Freescale i.MX28 processor and are experiencing unpredictable FlashDisk corruption: 2K blocks of 0xFF (erased blocks) appear in random files located on NAND flash.
I'm mentioning this as I now believe the CPU and FlashDisk corruption issues are linked, due to this article as well as this one:
https://electronics.stackexchange.com/questions/26720/flash-memory-corruption-due-to-electricals
In the article, #jwygralak67 comments:
I recently worked through a flash corruption issue, on a WinCE system,
as part of a development team. We would sporadically find 2K blocks of
flash that were erased. (All bytes 0xFF) For about 6 months we tested
everything from ESD, to dirty power to EMI and RFI interference, we
bought brand new devices and tracked usage to make sure we weren't
exceeding the erase cycle limit and burning out blocks, we went through
our (application level) software with a fine-toothed comb.
In the end it turned out to be an obscure bug in the very low level
flash driver code, which only occurred under periods of heavy CPU
load. The driver came from a 3rd party. We informed them of the issue
we found, but I don't know if they ever released a patch.
Unfortunately, we have yet to make contact with him.
With all of this in mind, if we work around the high CPU load, the corruption may potentially no longer occur. Another case of conjecture!
On that assumption, however, we still don't have a firm root cause for either issue, which is what I'm desperately seeking!
Any knowledge or insight, however small, would be very gratefully received.
#ctacke - we've spoken before regarding OpenNETCF via email, so I'm pleased to see your name!

How to gain control of a 5GB heap in Haskell?

Currently I'm experimenting with a little Haskell web server written in Snap that loads a lot of data and makes it available to clients. And I am having a very, very hard time gaining control over the server process. At random moments the process uses a lot of CPU for seconds to minutes and becomes unresponsive to client requests. Sometimes memory usage spikes (and sometimes drops) by hundreds of megabytes within seconds.
Hopefully someone has more experience with long running Haskell processes that use lots of memory and can give me some pointers to make the thing more stable. I've been debugging the thing for days now and I'm starting to get a bit desperate here.
A little overview of my setup:
On server startup I read about 5 gigabytes of data into a big (nested) Data.Map-like structure in memory. The nested map is value-strict, and all values inside the map are of datatypes with all their fields made strict as well. I've put a lot of time into ensuring that no unevaluated thunks are left. The import (depending on my system load) takes around 5-30 minutes. The strange thing is that the fluctuation between consecutive runs is way bigger than I would expect, but that's a different problem.
The big data structure lives inside a 'TVar' that is shared by all client threads spawned by the Snap server. Clients can request arbitrary parts of the data using a small query language. The amount of data requested is usually small (up to 300 KB or so) and only touches a small part of the data structure. All read-only requests are done using a 'readTVarIO', so they don't require any STM transactions.
The server is started with the following flags: +RTS -N -I0 -qg -qb. This starts the server in multi-threaded mode and disables idle-time GC and parallel GC. This seems to speed things up a lot.
The server mostly runs without any problem. However, every now and then a client request times out, the CPU spikes to 100% (or even over 100%), and it keeps doing this for a long while. Meanwhile the server does not respond to requests anymore.
There are a few reasons I can think of that might cause the CPU usage:
The request just takes a lot of time because there is a lot of work to be done. This is somewhat unlikely, because it sometimes happens for requests that have proven to be very fast in previous runs (by fast I mean 20-80 ms or so).
There are still some unevaluated thunks that need to be computed before the data can be processed and sent to the client. This is also unlikely, for the same reason as the previous point.
Somehow garbage collection kicks in and starts scanning my entire 5 GB heap. I can imagine this can take up a lot of time.
The problem is that I have no clue how to figure out what is going on exactly, or what to do about it. Because the import process takes such a long time, profiling results don't show me anything useful. There seems to be no way to conditionally turn the profiler on and off from within code.
I personally suspect the GC is the problem here. I'm using GHC7 which seems to have a lot of options to tweak how GC works.
What GC settings do you recommend when using large heaps with generally very stable data?
Large memory usage and occasional CPU spikes are almost certainly the GC kicking in. You can check whether this is indeed the case by using RTS options such as -B, which causes GHC to beep whenever there is a major collection, -t, which will give you statistics after the fact (in particular, see whether the GC times are really long), or -Dg, which turns on debugging info for GC calls (though you need to compile with -debug).
There are several things you can do to alleviate this problem:
On the initial import of the data, GHC is wasting a lot of time growing the heap. You can tell it to grab all of the memory you need at once by specifying a large -H.
A large heap with stable data will get promoted to an old generation. If you increase the number of generations with -G, you may be able to get the stable data to be in the oldest, very rarely GC'd generation, whereas you have the more traditional young and old heaps above it.
Depending on the memory usage of the rest of the application, you can use -F to tweak how much GHC will let the old generation grow before collecting it again. You may be able to tune this parameter so that the old generation is effectively never collected.
If there are no writes and you have a well-defined interface, it may be worthwhile making this memory unmanaged by GHC (via the C FFI) so that a major GC never touches it at all.
This is all speculation, so please test with your particular application.
I had a very similar issue with a 1.5 GB heap of nested Maps. With the idle GC on (the default) I would get 3-4 seconds of freeze on every GC, and with the idle GC off (+RTS -I0) I would get 17 seconds of freeze after a few hundred queries, causing client time-outs.
My "solution" was first to increase the client time-out and ask people to tolerate the fact that, while 98% of queries took about 500 ms, about 2% of them would be dead slow. However, wanting a better solution, I ended up running two load-balanced servers and taking each of them out of the cluster for a performGC every 200 queries, then putting it back in action.
Adding insult to injury, this was a rewrite of an original Python program, which never had such problems. In fairness, we did get about 40% performance increase, dead-easy parallelization and a more stable codebase. But this pesky GC problem...

How to force workflow runtime to use more CPU power?

Hello
I have quite an unusual problem, because I think that in my case the workflow runtime doesn't use enough CPU power. The scenario is as follows:
I send a lot of messages to queues. I use the EnqueueItem method from the WorkflowRuntime class.
I create a new workflow instance with the CreateWorkflow method of the WorkflowRuntime class.
I wait until the new workflow has moved to its first state. Under normal conditions this takes tens of seconds (the workflow is complicated). When messages are being sent to the queues at the same time (as described in point 1), it takes a minute or more.
I observe low CPU utilization (8 cores), no more than 15%. I should add that the workflow logic runs in a separate process, which I communicate with over WCF.
You've got logging, which you think is not a problem, but you don't know. There are many database operations. Those need to block for I/O. Having more cores will only help if different threads can run unimpeded.
I hate to sound like a stuck record, always trotting out the same answer, but you are guessing at what the problem is, and you're asking other people to guess too. People are very willing to guess, but guesses don't work. You need to find out what's happening.
To find out what's happening, the method I use is, get it running under a debugger. (Simplify the problem by going down to one core.) Then pause the whole thing, look at each active thread, and find out what it's waiting for. If it's waiting for some CPU-bound function to complete for some reason, fine - make a note of it. If it's waiting for some logging to complete, make a note. If it's waiting for a DB query to complete, note it. If it's waiting at a mutex for some other thread, note it.
Do this for each thread, and do it several times. Then, you can really say you know what it's doing. When you know what it's waiting for and why, you'll have a pretty good idea how to improve it. That's a variation on this technique.
What are you doing in the work item?
If you have any sort of cross-thread synchronisation (critical sections, etc.) then this could cause you to spend time stalling threads while they wait for resources to become free.
For example, if you are doing any sort of file access then you are going to spend considerable time blocked waiting for loads to complete, and this will leave your threads idle a lot of the time. You could throw more threads at the problem, but then you'd end up generating more disk requests and the resource contention would become even more of a problem.
Those are a couple of potential ideas, but I'd really need to know what you are doing before I can be more useful...
Edit: in answer to your comments...
1) OK
2) You'd perform terribly with 2000 threads working flat out, due to switching overhead. In fact, running 20-25 threads on an 8-core machine may be a bad plan too, because if they all run at high speed they will spend time stealing each other's runtime, and regular context switches (software thread switches) are very expensive. They may not be as expensive as the waits your code is suffering, though.
3) Logging? Do you just submit log entries to an asynchronous queue that spits them out to disk when it has the opportunity, or are they synchronous file writes? If they are asynchronous, can you guarantee that there isn't a maximum number of requests that can be queued before you DO have to wait? And if you do have to wait, how many threads end up in contention for the space that just opened up? There are a lot of ifs there alone.
4) Database operations, even on the best databases, are likely to block if two threads make similar calls into the database simultaneously. A good database is designed to limit this, but it's quite likely that at least some clashing will happen.
Suffice to say, you will want to get a good thread profiler to see where time is REALLY being lost. Failing that, you will just have to live with the performance or attack the problem in a different way...
WF3 performance is a little on the slow side. If you are using .NET 4 you will get better performance by moving to WF4. Mind you, that means a rewrite, as WF4 is a completely different product.
As for WF3, there is a white paper here that should give you plenty of information on improving things beyond the standard settings. Look for things like increasing the number of threads used by the DefaultWorkflowSchedulerService, or switching to the ManualWorkflowSchedulerService, and disabling the performance counters, which are enabled by default (a sketch of the scheduler tweak follows).
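As an illustration of the scheduler suggestion only (the thread count of 32 is an arbitrary example, not a figure from the white paper):

using System.Workflow.Runtime;
using System.Workflow.Runtime.Hosting;

class RuntimeHost
{
    static WorkflowRuntime CreateRuntime()
    {
        var runtime = new WorkflowRuntime();

        // Allow up to 32 workflows to run simultaneously (tune this to your workload).
        runtime.AddService(new DefaultWorkflowSchedulerService(32));

        runtime.StartRuntime();
        return runtime;
    }
}

Alternatively, ManualWorkflowSchedulerService hands scheduling back to you (you call RunWorkflow yourself), which is often a better fit when the host is already a WCF or ASP.NET service.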

Preventing a heavy process from sinking in the swap file

Our service tends to fall asleep during the night on our client's server and then has a hard time waking up. What seems to happen is that the process heap, which is sometimes several hundred MB, is moved to the swap file. This happens at night, when our service is not used and other jobs are scheduled to run (DB backups, AV scans, etc.). When this happens, after a few hours of inactivity the first call to the service takes up to a few minutes (subsequent calls take seconds).
I'm quite certain it's an issue of virtual memory management, and I really hate the idea of forcing the OS to keep our service in physical memory. I know doing that will hurt other processes on the server and decrease the overall server throughput. That said, our clients just want our app to be responsive; they don't care if nightly jobs take longer.
I vaguely remember there's a way to force Windows to keep pages in physical memory, but I really hate that idea. I'm leaning more towards some internal or external watchdog that will initiate higher-level functionality (there is already some internal scheduler that does very little and makes no difference). If there were a third-party tool that provided that kind of service, it would be just as good.
I'd love to hear any comments, recommendations and common solutions to this kind of problem. The service is written in VC2005 and runs on Windows servers.
As you mentioned, forcing the app to stay in memory isn't the best way to share resources on the machine. A quick solution that you might find that works well is to simply schedule an event that wakes your service up at a specific time each morning before your clients start to use it. You can just schedule it in the windows task scheduler with a simple script or EXE call.
I'm not saying you want to do this, or that it is best practice, but you may find it works well enough for you. It seems to match what you've asked for.
Summary: touch every page in the process, one page at a time, on a regular basis.
What about a thread that runs in the background and wakes up once every N seconds? Each time the thread wakes up, it attempts to read from address X; the attempt is protected with an exception handler in case you read a bad address. Then increment X by the size of a page.
With the usual 4 KB page size there are 1,048,576 pages in 4 GB, 786,432 in 3 GB, and 524,288 in 2 GB. Divide your idle time (the overnight dead time) by the number of pages to work out how often you can (attempt to) touch each page.
#include <windows.h>

// Background thread: touch one page per pass so the heap never pages out completely.
DWORD WINAPI TouchPagesThread(LPVOID)
{
    const DWORD N = 30;                   // seconds between touches; tune to your idle window
    SYSTEM_INFO si;
    GetSystemInfo(&si);                   // si.dwPageSize is the VM page size

    const BYTE *ptr = (const BYTE *)si.lpMinimumApplicationAddress;
    while (TRUE)
    {
        __try
        {
            BYTE b = *(volatile const BYTE *)ptr;   // force an actual read of the page
            (void)b;
        }
        __except (EXCEPTION_EXECUTE_HANDLER)
        {
            // ignore, some pages won't be accessible
        }
        ptr += si.dwPageSize;
        if (ptr >= (const BYTE *)si.lpMaximumApplicationAddress)
            ptr = (const BYTE *)si.lpMinimumApplicationAddress;   // wrap around
        Sleep(N * 1000);
    }
}
You can get the page size from the dwPageSize field of the SYSTEM_INFO structure filled in by GetSystemInfo(), as in the snippet above.
Don't try to avoid the exception handler by using if (!IsBadReadPtr(ptr)), because other threads in the app may be modifying memory protections at the same time. If you come unstuck because of this, it will be almost impossible to identify why (it will most likely be a non-repeatable race condition), so don't waste time on it.
Of course, you'd want to turn this thread off during the day and only run it during your dead time.
A third approach could be to have your service run a thread that does something trivial, like incrementing a counter, and then sleeps for a fairly long period, say 10 seconds. This should have minimal effect on other applications but keep at least some of your pages in memory.
The other thing to ensure is that your data is localized.
In other words: do you really need all 300 MiB of memory before you can do anything? Can the data structures you use be rearranged so that any particular request can be satisfied with only a few megabytes?
For example:
If your 300 MiB of heap memory contains facial-recognition data, can the data be arranged internally so that male and female face data are stored together? Or big noses separately from small noses?
If it has some sort of logical structure, can it be sorted so that a binary search can be used to skip over a lot of pages?
If it's a proprietary, in-memory database engine, can the data be better indexed/clustered so it doesn't require so many memory page hits?
If they're image textures, can commonly used textures be located near each other?
Do you really need all 300 MiB of memory before you can do anything? Can you not service a request without all of that data back in memory?
Otherwise: a scheduled task at 6 AM to wake it up.
In terms of cost, the cheapest and easiest solution is probably just to buy more RAM for that server, and then you can disable the page file entirely. If you're running 32-bit Windows, just buy 4GB of RAM. Then the entire address space will be backed with physical memory, and the page file won't be doing anything anyway.
