Performance tuning where CPU is not pinned and there is plenty of memory

I'm benchmarking a Windows server web application that, for argument's sake, has a single method called parseText().
Running a single request takes less than 10 ms; however, when I ramp it up to 10 simultaneous requests, things slow down drastically, to say 1 second per request.
The CPU is not pinned and there's plenty of memory available, so I'm confused as to what the bottleneck is.
One thought was that memory latency or bus bandwidth could be an issue, but I'm not sure which perfmon counters would best indicate something like this.
Can someone suggest some counters to check that may shed some light on the matter?

My first guess would be either disk IO or mutexes.
For disk, try adding the PhysicalDisk counters Disk Read Bytes/sec and Disk Write Bytes/sec, and also Disk Reads/sec and Disk Writes/sec (i.e. both total bytes and actual I/O operation counts for reads and writes), and make sure they aren't spiking. You could also add Avg. Disk Queue Length if you are keen. You are looking for big shifts, like 10 MB/sec, or lots of small I/Os.
For mutexes, which might be contended as a side effect of memory allocation (very frequent allocation can cause this), try adding the System counters Context Switches/sec and maybe System Calls/sec. These bounce around a bit under general load, so get a feel for the baseline first and then see what happens under your benchmark.
If you think it is caused by memory bandwidth (i.e. exhausting the front-side bus), then I don't think perfmon can measure that; you would need to switch to something more like VTune, which may or may not be an option for you. An example of exhausting main-memory bandwidth would be a program that allocates large amounts of memory and then initialises every byte to some value, and does this lots of times. If you think this is your issue, you might need to isolate the routine using code profilers and other such tools, but that is hard if you are outside the program and just observing.
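If you want a feel for what that pathological pattern looks like, here is a minimal Go sketch (the sizes and iteration count are arbitrary, and the language is chosen only because it keeps the example short): it repeatedly allocates a large buffer and writes every byte, and running several copies concurrently makes them compete for memory bandwidth.
package main

func main() {
	const size = 256 << 20 // 256 MB per pass; arbitrary
	for pass := 0; pass < 100; pass++ {
		buf := make([]byte, size) // large allocation
		for i := range buf {      // touch every byte: streams through main memory
			buf[i] = byte(pass)
		}
	}
}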

Related

How could I make a Go program use more memory? Is that recommended?

I'm looking for an option similar to -Xmx in Java, that is, a way to assign the maximum runtime memory my Go application can utilise. I was checking the runtime package, but I'm not entirely sure that is the way to go.
I tried setting something like this with func SetMaxStack() (likely very stupid):
debug.SetMaxStack(5000000000) // bytes
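// note: SetMaxStack limits the stack of a single goroutine; it does not set or raise a heap limit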
model.ExcelCreator()
The reason I am looking to do this is that there is currently an ample amount of RAM available, but the application won't consume more than 4-6% of it. I might be wrong here, but it could be forcing GC to happen much more often than needed, leading to a performance issue.
What I'm doing:
Getting a large dataset from an RDBMS and processing it to write out to Excel.
Another reason I am looking for such an option is to limit the maximum usage of RAM on the server where it will ultimately be deployed.
Any hints on this would be greatly appreciated.
The current stable Go (1.10) has only a single knob which may be used to trade memory for lower CPU usage by the garbage collection the Go runtime performs.
This knob is called GOGC, and its description reads
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
So basically setting it to 200 would supposedly double the amount of memory the Go runtime of your running process may use.
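For illustration, here is a minimal sketch of both ways to turn this knob, assuming the defaults described above (the value 200 is arbitrary):
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to running the program with GOGC=200 in the environment:
	// trigger a collection when freshly allocated data reaches 200% of the
	// live data, instead of the default 100%. Returns the previous setting.
	old := debug.SetGCPercent(200)
	fmt.Println("previous GC percent:", old)

	// ... run the real workload here, e.g. model.ExcelCreator() ...
}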
Having said that, I'd note that the Go runtime actually tries to adjust the behaviour of its garbage collector to the workload of your running program and the CPU processing power at hand.
I mean that normally there's nothing wrong with your program not consuming lots of RAM: if the collector happens to sweep the garbage fast enough without hampering performance in a significant way, I see no reason to worry. Go's GC is one of the most intensely fine-tuned parts of the runtime, and it works very well in fact.
Hence you may try to take another route:
Profile memory allocations of your program.
Analyze the profile and try to figure out where the hot spots are, and whether (and how) they can be optimized. You might start here and continue with the gazillion other intros to this stuff.
Optimize. Typically this amounts to making certain buffers reusable across different calls to the same function(s) consuming them, preallocating slices instead of growing them gradually, using sync.Pool where deemed useful, etc. (see the sketch after this list). Such measures may actually increase the memory truly used (that is, by live objects, as opposed to garbage), but they may lower the pressure on the GC.
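Here is the sketch referred to above: a hedged illustration of preallocating a slice and reusing buffers through sync.Pool. The formatRow and collectRows helpers are made up for the example and are not from the original program.
package main

import (
	"bytes"
	"sync"
)

// Reuse scratch buffers across calls instead of allocating a fresh one each time.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// formatRow joins cells with tabs using a pooled buffer (hypothetical helper).
func formatRow(cells []string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf)
	buf.Reset()
	for i, c := range cells {
		if i > 0 {
			buf.WriteByte('\t')
		}
		buf.WriteString(c)
	}
	return buf.String() // copies the bytes, so returning the buffer to the pool is safe
}

// collectRows preallocates the result slice instead of growing it gradually.
func collectRows(n int) []string {
	rows := make([]string, 0, n)
	for i := 0; i < n; i++ {
		rows = append(rows, formatRow([]string{"a", "b", "c"}))
	}
	return rows
}

func main() {
	_ = collectRows(1000)
}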

Xcode Memory Utilized

So in Xcode, the Debug Navigator shows CPU usage and memory usage. When you click on Memory, it says 'Memory Utilized'.
In my app I am using the latest RestKit (0.20.x), and every time I make a GET request using getObjectsAtPath (which doesn't even return a very large payload), the memory utilized increases by about 2 MB. So if I refresh my app 100 times, Memory Utilized will have grown by over 200 MB.
However, when I run the Leaks tool, the Live Bytes figure remains fairly small and does not increase with each new request. Live Bytes stays below 10 MB the whole time.
So do I have a memory issue or not? Memory Utilized grows like crazy, but Live Bytes suggests everything is okay.
You can use Heapshot Analysis to evaluate the situation. If that shows no growth, then the memory consumption may be virtual memory which may (for example) reside in a cache or store that supports eviction and recreation, so you should also look for growth in virtual memory regions.
If you keep making requests (e.g. try 200 refreshes), the memory will likely decrease at some point - or you will have memory warnings and ultimately allocation requests may fail. Determine how memory is reduced, if this is the case. Otherwise, you will need to determine where it is created and possibly referenced.
Also, test on a device in this case. The simulator is able to utilise much more memory than a device simply because it has more to work with. Memory constraints are not simulated.

How do data and memory affect execution speed?

If my application is allotted 500MB of RAM, but operates on 2GB of data, how will this work? What would be the impact on execution speed?
Don't know - depends on what you're doing and how your application operates on data.
You can't fit 10 pounds of anything into a 5 pound bag. You'll have to page or operate on a stream or something.
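As a rough, language-agnostic illustration of the streaming approach (written in Go only to match the earlier code sample on this page; the file name and per-record work are placeholders), process the data one record at a time instead of loading it all:
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("huge-input.dat") // hypothetical 2 GB file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	records := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		_ = scanner.Bytes() // process one record at a time; only a small buffer stays resident
		records++
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("processed", records, "records")
}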
If you exhaust the memory available to you, you're likely to see an OutOfMemoryError.
Not enough information to give a real answer.
Every time it needed to access data outside the current 500 MB window, it would need to swap out to disk. That swapping takes some time, depending on disk speed, and would potentially slow down your application.

Memory management for intentionally memory intensive applications

Note: I am aware of the question Memory management in memory intensive application, however that question appears to be about applications that make frequent memory allocations, whereas my question is about applications intentionally designed to consume as much physical memory as is safe.
I have a server application that uses large amounts of memory in order to perform caching and other optimisations (think SQL Server). The application runs on a dedicated machine, and so can (and should) consume as much memory as it wants / is able to in order to speed up and increase throughput and response times without worry of impacting other applications on the system.
The trouble is that if memory usage is underestimated, or if load increases, it's possible to end up with nasty failures as memory allocations fail - in this situation, obviously the best thing to do is to free up memory in order to prevent the failure, at the expense of performance.
Some assumptions:
The application is running on a dedicated machine
The memory requirements of the application exceed the physical memory on the machine (that is, if additional memory were available to the application, it would always be able to use that memory in some way to improve response times or throughput)
The memory is effectively managed in a way such that memory fragmentation is not an issue.
The application knows what memory can be safely freed, and what memory should be freed first for the least performance impact.
The app runs on a Windows machine
My question is - how should I handle memory allocations in such an application? In particular:
How can I predict whether or not a memory allocation will fail?
Should I leave a certain amount of memory free to ensure that core OS operations remain responsive (and don't, in that way, adversely impact the application's performance), and how can I find out how much memory that is?
The core objective is to prevent failures as a result of using too much memory, while at the same time using up as much memory as possible.
I'm a C# developer, however my hope is that the basic concepts for any such app are the same regardless of the language.
In Linux, memory usage is roughly divided into the following levels:
0-30% - no swapping
30-60% - swap dirty pages only
60-90% - swap clean pages as well, based on the LRU policy
90%+ - invoke the OOM (out of memory) killer and kill the process consuming the most memory
check this - http://linux-mm.org/OOM_Killer
I think Windows might have a similar policy, so you can check the memory stats and make sure you never reach the maximum threshold.
One way to stop consuming more memory is to go to sleep and give the memory cleanup threads more time to run.
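If you do go the route of watching the memory stats yourself, the Win32 call to look at is GlobalMemoryStatusEx. Below is a sketch in Go (only to keep the examples on this page in one language; from C# you would P/Invoke the same function), with an arbitrary 10% free-memory threshold purely for illustration:
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// memoryStatusEx mirrors the Win32 MEMORYSTATUSEX structure.
type memoryStatusEx struct {
	Length               uint32
	MemoryLoad           uint32
	TotalPhys            uint64
	AvailPhys            uint64
	TotalPageFile        uint64
	AvailPageFile        uint64
	TotalVirtual         uint64
	AvailVirtual         uint64
	AvailExtendedVirtual uint64
}

var procGlobalMemoryStatusEx = syscall.NewLazyDLL("kernel32.dll").NewProc("GlobalMemoryStatusEx")

// availablePhysical returns available and total physical memory in bytes (Windows only).
func availablePhysical() (avail, total uint64, err error) {
	var m memoryStatusEx
	m.Length = uint32(unsafe.Sizeof(m))
	ret, _, callErr := procGlobalMemoryStatusEx.Call(uintptr(unsafe.Pointer(&m)))
	if ret == 0 {
		return 0, 0, callErr
	}
	return m.AvailPhys, m.TotalPhys, nil
}

func main() {
	avail, total, err := availablePhysical()
	if err != nil {
		panic(err)
	}
	fmt.Printf("available: %d MB of %d MB physical\n", avail>>20, total>>20)
	// Hypothetical policy: start evicting cache entries (rather than allocating
	// more) once less than 10% of physical memory remains free.
	if avail < total/10 {
		fmt.Println("low on memory: shrink the cache before allocating more")
	}
}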
That is a very good question, and bound to be subjective as well, because a fundamental of C# is that all memory management is done by the runtime, i.e. the garbage collector. The GC is a non-deterministic entity that sweeps memory for reclamation; when it kicks in depends on how memory gets allocated and fragmented, so knowing in advance is not an easy thing to do.
Properly managing the memory sounds tedious, but common sense applies, such as using the using clause to ensure objects get disposed. You could put in a handler to trap OutOfMemoryException, but that is an awkward approach: if the program has run out of memory, does it just seize up and bomb out, or should it wait patiently for the GC to kick in? Again, determining that is tricky.
The load of the system can adversely affect the GC's job, almost to the point of a denial of service where everything just grinds to a halt. Again, since the specifications of the machine and the nature of its workload are unknown, I cannot answer fully, but I'll assume it has loads of RAM.
In essence, while an excellent question, I think you should not worry about it and leave it to the .NET CLR to handle the memory allocation/fragmentation as it seems to do a pretty good job.
Hope this helps,
Best regards,
Tom.
Your question reminds me of an old discussion, "So what's wrong with 1975 programming?". The architect of varnish-cache argues that, instead of telling the OS to get out of the way and managing all memory yourself, you should cooperate with the OS and let it understand what you intend to do with memory.
For example, instead of simply reading data from disk, you should use memory-mapped files. This allows the OS to apply its LRU algorithm to write data back to disk when memory becomes scarce. At the same time, as long as there is enough memory, your data will stay in memory. Thus, your application may potentially use all available memory without risking getting killed.
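A short sketch of that idea in Go, using the golang.org/x/exp/mmap package for a read-only mapping (the file name is a placeholder): pages are faulted in on demand and the kernel's LRU decides what stays resident.
package main

import (
	"fmt"
	"log"

	"golang.org/x/exp/mmap"
)

func main() {
	r, err := mmap.Open("large-dataset.bin") // hypothetical file, possibly larger than RAM
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	// Touch a byte every 1 MB; only the touched pages need to be resident,
	// and the OS may evict them again under memory pressure.
	var sum int
	for off := 0; off < r.Len(); off += 1 << 20 {
		sum += int(r.At(off))
	}
	fmt.Println("checksum of sampled bytes:", sum)
}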

What is the performance hit of Performance Counters

When considering using performance counters on my company's .NET-based site, I was wondering how big the overhead of using them is.
Do I want to have my site continuously update its counters, or am I better off only doing so when I measure?
The overhead of setting up the performance counters is generally not high enough to worry about (setting up a shared memory region and some .NET objects, along with CLR overhead because the CLR actually does the management for you). Here I'm referring to classes like PerformanceCounter.
The overhead of registering the performance counters can be fairly slow, but generally it is not a concern because it is intended to happen once at setup time, since you are changing machine-wide state; it will be dwarfed by any copying that you do. It's not generally something you want to do at runtime. Here I'm referring to PerformanceCounterInstaller.
The overhead of updating a performance counter generally comes down to the cost of performing an interlocked operation on the shared memory. This is slower than a normal memory access but is a processor primitive (that's how it gets atomic operations across the entire memory subsystem, including caches). Generally this cost is not high enough to worry about. It could be 10 times a normal memory operation, potentially worse depending on the update and what contention is like across threads and CPUs. But consider this: it's literally impossible to do better than interlocked operations for cross-process communication with atomic updates, and no locks are held. Here I refer to PerformanceCounter.Increment and similar methods.
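To get an intuition for that cost model, here is a small, unscientific sketch in Go (not the .NET API) comparing a plain increment with an atomic one, the same kind of interlocked primitive the counter update relies on; the ratio will vary a lot by CPU and contention.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	const n = 50000000

	var plain int64
	start := time.Now()
	for i := 0; i < n; i++ {
		plain++ // ordinary increment
	}
	plainDur := time.Since(start)

	var shared int64
	start = time.Now()
	for i := 0; i < n; i++ {
		atomic.AddInt64(&shared, 1) // interlocked/atomic add on shared memory
	}
	atomicDur := time.Since(start)

	fmt.Println("plain:", plainDur, "atomic:", atomicDur, "values:", plain, shared)
}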
The overhead of reading a performance counter is generally a read from shared memory. As others have said, you want to sample on a reasonable period (just like any other sampling); just think of PerfMon and try to keep the sampling on a human scale (think seconds instead of milliseconds) and you probably won't have any problems.
Finally, an appeal to experience: Performance counters are so lightweight that they are used everywhere in Windows, from the kernel to drivers to user applications. Microsoft relies on them internally.
Advice: the real question with performance counters is the learning curve in understanding them (which is moderate) and in measuring the right things (which seems easy, but often you get it wrong).
The performance impact is negligible in updating. Microsoft's intent is that you always write to the performance counters. It's the monitoring of (or capturing of) those performance counters that will cause a degradation of performance. So, only when you use something like perfmon to capture the data.
In effect, the performance counter objects will have the effect of only "doing it when you measure."
I've tested these a LOT.
On an old Compaq 1 GHz single-processor machine, I was able to create about 10,000 counters and monitor them remotely for about 20% CPU usage. These weren't custom counters, just checking CPU or whatever.
Basically, you can monitor all the counters on any decent newer machine with very little impact.
Instantiating the objects can take a long time though, from a few seconds to a few minutes. I suggest you multithread this for all the counters you collect, otherwise your app will sit there forever creating these objects. I'm not sure what Microsoft does on creation that takes so long, but you can do it for 1,000 counters with 1,000 threads in the same time you can do it for 1 counter and 1 thread.
A performance counter is just a pointer to 4/8 bytes in shared memory (aka a memory-mapped file), so its cost is very similar to that of accessing an int/long variable.
I agree with famoushamsandwich, but would add that as long as your sampling rate is reasonable (5 seconds or more) and you monitor a reasonable set of counters, then the impact of measuring is negligible as well (in most cases).
What I have found is that it is not that slow for the majority of applications. I wouldn't put one in a tight loop, or in something that is called thousands of times a second.
Secondly, I found that programmatically creating the performance counters is very slow, so make sure that you create them beforehand and not in code.
