How to diagnose performance issues inside MongoDB using mongostat

I've been using mongostat to diagnose overall activity inside my MongoDB instance. How can I also use it to diagnose performance issues / degradation?
One field I'm really interested in learning more about is locked % and expected behavior based on the results from all the other fields.
I feel this feature is kinda vague and needs to be fleshed out a bit more.

The locked % is the percentage of time the global write lock (remember, mongo has a process-wide write lock) is held per sample. This percentage will increase when you increase the number of writes (inserts, updates, removes, db.eval(), etc.). A high value means the database is spending a lot of time locked waiting for writes to finish, and as a result no queries can complete until the lock is released. As such the overall query throughput will be reduced (sometimes dramatically).
"faults" means that mongo is trying to hit data that is mapped into the virtual memory space but not in physical memory. Basically it means it's hitting disk rather than memory and is an indication you do not have enough RAM (or, for example, your index isn't right-balanced). If memory serves this is available on Linux only though. This should be as close to 0 as you can get it.
"qr:qw" are the read and write query queues, and if these are not zero it means the server is receiving more queries than it is able to process. This isn't necessarily a problem unless the number is consistently high or growing; it indicates the overall system performance is not high enough to support your query throughput.
Most other fields are pretty self-explanatory or not that useful. netIn/netOut is useful if you expect to be I/O bound, which will happen if these values approach your maximum network bandwidth.

Related

ESENT internals: expected behavior of JetPrereadKeys()

I have an application working with a significant amount of data (100 GB+) stored in ESENT. The schema of the table is: 12-byte JET_bitColumnFixed keys and JET_coltypLongBinary values with a typical size of around 2 KiB. The page size is set to 32 KiB. I don't alter the default 1024 byte size threshold for external long values, so I think that these values are mostly being stored externally.
I am interested in improving the cold cache seek and retrieve performance, because the operations happen in batches and the keys are known in advance. As far as I understand, the JetPrereadKeys() API is designed to improve performance in such cases, but as it turns out, I don't see any changes in the actual behavior with or without this call.
More details follow:
In my case JetPrereadKeys() always reports an adequate number of pre-read keys, equal to the number of keys I have submitted when calling the API. The submitted keys are appropriately sorted, as stated in the documentation.
I tried both synchronous and asynchronous approaches, where the asynchronous approach is: send the pre-read call to a thread pool, while continuing to seek and retrieve data on the current thread.
I tried both available caching modes of ESENT, either where it uses MMAP or a dedicated page cache, by trying all available combinations of the JET_paramEnableViewCache and JET_paramEnableFileCache parameters.
I cannot, with a small exception, see any difference in the logged I/O operations with and without the pre-read. That is, I would expect this operation to result in a (preferably, asynchronous) fetch of the necessary internal nodes of a B-Tree. But the only thing I see is an occasional synchronous small read coming up from the stack of the JetPrereadKeys() itself. The size of the read is small, in the sense that I don't think that it could possibly prefetch all the required information.
If I debug the Windows Search service, I can break on various calls to JetPrereadKeys(). So there is at least one real-world example where this API is being called, presumably for a reason.
All my experiments were performed after a machine restart, to ensure that the database page cache is empty.
Questions:
What is the expected behavior of the JetPrereadKeys() in the described case?
Should I expect to see a different I/O pattern and better performance if I use this API? Should I expect to see a synchronous or an asynchronous pre-read of the data?
Is there another approach I could try to improve the I/O performance by somehow hinting ESENT about an upcoming batch?
The JetPrereadKeys() API does sync reads down to the parent-of-leaf level, and then enqueues async IOs for all the leaf pages needed for the desired keys / records ... I think that answers #2. If your main table's record tree (note the burst Long Values / LVs are stored in a separate tree) is shallow or entirely cached, then JetPrereadKeys() may not help. However, if the primary tree on your table is large and deep, then this API can help dramatically ... it just depends upon the shape and spread of the data you are retrieving. You can tell some basics about your table by dumping space and looking at the depth of the trees and getting a sense of the "Data" pages; might I suggest:
esentutl /ms Your.Db /v /fName,Depth,Internal,Data
This lists the name of each table, its depth, how many internal pages and how many leaf-level data pages it has. Separate lines will be listed for the main record tree (under the table name) and then for the LVs (as "[Long Values]") below it.
Also note this preread of keys does not extend to the burst LVs ... so there again, if you immediately read a burst LV column you'll be pinned behind IO, unfortunately.
The default mode is for ESE to allocate and control its own database buffer / page cache exclusively. The JET_paramEnableFileCache is primarily meant for (usually smaller) client processes that quit (or at least JetTerm/JetDetach their DB) and restart a lot ... ESE's private buffer cache will be lost on every quit, but with JET_paramEnableFileCache set the data may still be in the OS file cache if they quit recently. It is not recommended for large DBs though, because it causes data to be double cached, in the ESE buffer cache and in the NTFS / ReFS file cache. The JET_paramEnableViewCache enhances the previous param and ameliorates this double caching somewhat ... but it can only save memory / avoid double buffering on clean / un-modified page buffers. For big DBs, leave both these params off / false. Also, if you do not use these params, it is easier to test cold perf ... just copy a big (100 MB, maybe 1 or 2 GB) file around a couple of times on your HD after your app quits (to clear the HD cache), and your data will be cold. ;-)
So now that we have mentioned caching ... one final thing, which I think is probably your actual problem (if it isn't the "shape of your data" I mention above) ... open perfmon and find the "Database" and/or "Database ==> Instances" perf objects (these are for ESENT), see how big your cache size is [either "Database Cache Size" or "Database Cache Size (MB)"] and how big your available pool is ["Database Cache % Available"] ... you'll of course have to take this % and do the math against your database cache size to get an idea ... BUT if this is low, this could be your problem ... because JetPrereadKeys will only use already available buffers, so you have to have a healthy / big enough available pool. Either increase JET_paramCacheSizeMin to be larger, or set JET_paramStartFlushThreshold / JET_paramStopFlushThreshold to keep your available cache a larger % of the total cache size ... note they are set in proportion to JET_paramCacheSizeMax, so setting:
JET_paramCacheSizeMin = 500
JET_paramCacheSizeMax = 100000
JET_paramStartFlushThreshold = 1000
JET_paramStopFlushThreshold = 2000
would mean your start and stop thresholds are 1% and 2% respectively of your current cache size, whatever it happens to be. So if the cache is at 500 buffers (the min), 5 and 10 would be your start/stop thresholds, i.e. the range your available pool would be kept in; if the cache later grew to 10000 buffers, then your available pool would range between 100 and 200 buffers. Anyways, you want these numbers to give a big enough range that you have plenty of buffers for all the leaf pages JetPrereadKeys may want.
I didn't explain every term in this email, because you looked pretty advanced above - talking B-tree internal nodes and such ... but if something isn't clear, just ask and I'll clear it up.
Thanks,
Brett Shirley [MSFT]
Extensible Storage Engine Developer
This posting is provided "AS IS" with no warranties, and confers no rights.
P.S. - One last thing you may enjoy playing around with: JetGetThreadStats / JET_THREADSTATS, it tells you some of our internal operations that we do under the API. You basically read the values before and after a JET API call, and subtract them to get the # of operations for that JET API. So you will see cPagePreread in there ... this will be a good way to see if JetPrereadKeys is dispatching off the async IOs that should help perf. Note that particular counter was unfortunately broken in an older OS, but I don't remember when it was broken and fixed ... win7 to win8, or win8 to win8.1. If you are on Win10, then no problem, it was definitely fixed by then. ;-) And also cPageRead is the count of sync page reads (which may go up for the internal nodes) ... I think you'll find these very instructive for the costs of various JET APIs.

How could I make a Go program use more memory? Is that recommended?

I'm looking for an option similar to -Xmx in Java, that is, a way to set the maximum runtime memory my Go application can utilise. I was checking the runtime package, but I'm not entirely sure that is the way to go.
I tried setting something like this with debug.SetMaxStack() (likely very stupid):
debug.SetMaxStack(5000000000) // bytes
model.ExcelCreator()
The reason why I am looking to do this is that currently there is an ample amount of RAM available, but the application won't consume more than 4-6%. I might be wrong here, but it could be forcing GC to happen much more often than needed, leading to performance issues.
What I'm doing
Getting a large dataset from an RDBMS and processing it to write out an Excel file.
Another reason why I am looking for such an option is to limit the maximum usage of RAM on the server where it will be ultimately deployed.
Any hints on this would be greatly appreciated.
The current stable Go (1.10) has only a single knob which may be used to trade memory for lower CPU usage by the garbage collection the Go runtime performs.
This knob is called GOGC, and its description reads
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
So basically setting it to 200 would supposedly double the amount of memory the Go runtime of your running process may use.
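For instance, a minimal sketch of doing this from code rather than via the environment (the call to the asker's model.ExcelCreator() is just a placeholder) could look like this:

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // Equivalent to starting the program with GOGC=200: the heap may
    // grow to roughly twice the live data before the next collection.
    old := debug.SetGCPercent(200)
    fmt.Println("previous GOGC value:", old)

    // ... do the real work here, e.g. model.ExcelCreator() ...
}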
Having said that I'd note that the Go runtime actually tries to adjust the behaviour of its garbage collector to the workload of your running program and the CPU processing power at hand.
I mean that normally there's nothing wrong with your program not consuming lots of RAM: if the collector happens to sweep the garbage fast enough without hampering performance in a significant way, I see no reason to worry. Go's GC is one of the most intensely fine-tuned parts of the runtime, and it works very well in fact.
Hence you may try to take another route:
1. Profile memory allocations of your program.
2. Analyze the profile and try to figure out where the hot spots are, and whether (and how) they can be optimized. You might start here and continue with the gazillion other intros to this stuff.
3. Optimize. Typically this amounts to making certain buffers reusable across different calls to the same function(s) consuming them, preallocating slices instead of growing them gradually, using sync.Pool where deemed useful, etc. (see the sketch after this list).
Such measures may actually increase the memory truly used (that is, by live objects, as opposed to garbage), but they may lower the pressure on the GC.
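To make the last point concrete, here is a minimal sketch of buffer reuse with sync.Pool plus slice preallocation (the renderRow/collectRows helpers and their sizes are made up for illustration, not taken from the question):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// A pool of reusable buffers: instead of allocating a fresh bytes.Buffer
// for every row (and leaving it to the GC), buffers are returned to the
// pool and reused.
var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

func renderRow(cells []string) []byte {
    buf := bufPool.Get().(*bytes.Buffer)
    defer bufPool.Put(buf)
    buf.Reset()

    for _, c := range cells {
        buf.WriteString(c)
        buf.WriteByte('\t')
    }
    // Copy the result out, since buf goes back into the pool.
    out := make([]byte, buf.Len())
    copy(out, buf.Bytes())
    return out
}

func collectRows(n int) [][]byte {
    // Preallocate the slice instead of growing it gradually.
    rows := make([][]byte, 0, n)
    for i := 0; i < n; i++ {
        rows = append(rows, renderRow([]string{"a", "b", "c"}))
    }
    return rows
}

func main() {
    fmt.Println("rows:", len(collectRows(100000)))
}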

Performance of CPU

While going through Computer Organisation by Patterson, I encountered a question where I am completely stuck. The question is:
Suppose we know that an application that uses both a desktop client and a remote server is limited by network performance. For the following changes state whether only the throughput improves, both response time and throughput improve, or neither improves.
And the changes made are:
More memory is added to the computer
If we add more memory, shouldn't both the throughput and the execution time improve?
To be clear, the definitions of throughput and response time given in the book are:
Throughput: The amount of work done in a given time.
Response time: the time required to complete a task; tasks include I/O device activities, operating system overhead, disk access, and memory access.
Assume the desktop client is your internet browser, and the server is on the internet, for example the stackoverflow website. If you're having network performance problems, adding more RAM to your computer won't make browsing the internet any faster.
More memory helps only when the application needs more memory. For any other limitation, the additional memory will simply remain unused.
You have to think like a text book here. If your only given constraint is network performance, then you have to assume that there are no other constraints.
Therefore, the question boils down to: how does increasing the memory affect network performance?
If you throw in other constraints such as the system is low on memory and actively paging, then maybe response time improves with more memory and less paging. But the only constraint given is network performance.
It won't make a difference, as you are already bound by the network performance. Imagine you have a large tank of water and a tiny pipe coming out of it. Suppose you want to get more water out within a given amount of time (throughput). Does it make sense to add more water to the tank to achieve that? No, as we are bound by the width of the pipe. Either you add more pipes or you widen the pipe you have.
Going back to your question: if the whole system is bound by network performance, you need to add more bandwidth to see any improvement. Doing anything else is pointless.

In what applications caching does not give any advantage?

Our professor asked us to think of an embedded system design where caches cannot be used to their full advantage. I have been trying to find such a design but could not find one yet. If you know such a design, can you give a few tips?
Caches exploit the fact that data (and code) exhibit locality.
So an embedded system which does not exhibit locality will not benefit from a cache.
Example:
An embedded system has 1MB of memory and 1kB of cache.
If this embedded system accesses memory with short jumps, it will stay for a long time in the same 1 kB area of memory, which can be cached successfully.
If it instead jumps frequently between distant places inside this 1 MB, then there is no locality and the cache will be used badly.
Also note that depending on architecture you can have different caches for data and code, or a single one.
More specific example:
If your embedded system spends most of its time accessing the same data and (e.g.) running in a tight loop that will fit in cache, then you're using cache to a full advantage.
If your system is something like a database that fetches random data from any memory range, then the cache cannot be used to its full advantage. (Because the application is not exhibiting locality of data/code.)
Another, somewhat unusual example:
Sometimes if you are building a safety-critical or mission-critical system, you will want your system to be highly predictable. Caches make your code's execution time very unpredictable, because you can't predict whether a certain piece of memory is cached or not, and thus you don't know how long it will take to access it. If you disable the cache, you can judge your program's performance more precisely and calculate the worst-case execution time. That is why it is common to disable the cache in such systems.
I do not know what your background is, but I suggest reading about what the "volatile" keyword does in the C language.
Think about how a cache works. For example if you want to defeat a cache, depending on the cache, you might try having your often accessed data at 0x10000000, 0x20000000, 0x30000000, 0x40000000, etc. It takes very little data at each location to cause cache thrashing and a significant performance loss.
Another one is that caches generally pull in a "cache line". A single instruction fetch may cause 8 or 16 or more bytes or words to be read. Any situation where on average you use only a small percentage of the cache line before it is evicted to bring in another cache line will make your performance with the cache on go down.
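As a rough illustration of wasting cache lines (generic Go rather than embedded code; the array size and stride are arbitrary), both runs below touch exactly the same elements, but the strided version uses only one word of each fetched line per pass and has to re-fetch lines it already evicted:

package main

import (
    "fmt"
    "time"
)

const n = 1 << 24 // 16M int64s, far larger than any cache

// sum visits every element of data, in passes separated by `stride`.
// With stride 1, every word of each fetched cache line is used before
// moving on; with a large stride only one word per line is used per pass,
// and by the time a later pass comes back for the neighbouring words the
// line has long been evicted, so it is fetched from memory again.
func sum(data []int64, stride int) int64 {
    var total int64
    for start := 0; start < stride; start++ {
        for i := start; i < len(data); i += stride {
            total += data[i]
        }
    }
    return total
}

func main() {
    data := make([]int64, n)
    for _, stride := range []int{1, 16} {
        t0 := time.Now()
        s := sum(data, stride)
        fmt.Printf("stride %2d: %v (sum=%d)\n", stride, time.Since(t0), s)
    }
}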
In general you have to first understand your cache, then come up with ways to defeat the performance gain, then think about any real world situations that would cause that. Not all caches are created equal so there is no one good or bad habit or attack that will work for all caches. Same goes for the same cache with different memories behind it or a different processor or memory interface or memory cycles in front of it. You also need to think of the system as a whole.
EDIT:
Perhaps I answered the wrong question. "not ... to their full advantage" is a much simpler question: in what situations does the embedded application have to touch memory beyond the cache (after the initial fill)? Going to main memory wipes out the word "full" in "full advantage", IMO.
Caching does not offer an advantage, and is actually a hindrance, in controlling memory-mapped peripherals. Things like coprocessors, motor controllers, and UARTs often appear as just another memory location in the processor's address space. Instead of simply storing a value, those locations can cause something to happen in the real world when written to or read from.
Cache causes problems for these devices because when software writes to them, the peripheral doesn't immediately see the write. If the cache line never gets flushed, the peripheral may never actually receive a command even after the CPU has sent hundreds of them. If writing 0xf0 to 0x5432 was supposed to cause the #3 spark plug to fire, or the right aileron to tilt down 2 degrees, then the cache will delay or stop that signal and cause the system to fail.
Similarly, the cache can prevent the CPU from getting fresh data from sensors. The CPU reads repeatedly from the address, and cache keeps sending back the value that was there the first time. On the other side of the cache, the sensor waits patiently for a query that will never come, while the software on the CPU frantically adjusts controls that do nothing to correct gauge readings that never change.
In addition to the almost complete answer by Halst, I would like to mention one additional case where caches may be far from an advantage. If you have a multi-core SoC where all cores, of course, have their own cache(s), then depending on how the program code utilizes these cores, the caches can be very ineffective. This may happen if, for example, due to incorrect design or program specifics (e.g. multi-core communication), some data block in RAM is concurrently used by two or more cores.
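One concrete form of this is false sharing: two cores keep updating different variables that happen to live in the same cache line, so the line bounces between their caches. A minimal sketch in Go (generic code, not embedded; the 64-byte line size is an assumption):

package main

import (
    "fmt"
    "sync"
    "time"
)

const iters = 1 << 26

// Two counters that share a single cache line.
type sharedLine struct {
    a, b int64
}

// The same two counters padded onto separate lines (assuming 64-byte lines).
type paddedLines struct {
    a int64
    _ [56]byte
    b int64
}

// run starts two goroutines, each hammering its own counter, and times them.
func run(incA, incB func()) time.Duration {
    var wg sync.WaitGroup
    wg.Add(2)
    start := time.Now()
    go func() {
        defer wg.Done()
        for i := 0; i < iters; i++ {
            incA()
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < iters; i++ {
            incB()
        }
    }()
    wg.Wait()
    return time.Since(start)
}

func main() {
    var s sharedLine
    var p paddedLines

    tShared := run(func() { s.a++ }, func() { s.b++ })
    tPadded := run(func() { p.a++ }, func() { p.b++ })

    // Each core's write invalidates the other core's copy of the shared
    // line, so the first run is typically noticeably slower.
    fmt.Println("same cache line:     ", tShared, s.a+s.b)
    fmt.Println("separate cache lines:", tPadded, p.a+p.b)
}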

How to gain control of a 5GB heap in Haskell?

Currently I'm experimenting with a little Haskell web server written in Snap that loads a lot of data and makes it available to the client. And I have a very, very hard time gaining control over the server process. At random moments the process uses a lot of CPU for seconds to minutes and becomes unresponsive to client requests. Sometimes memory usage spikes (and sometimes drops) by hundreds of megabytes within seconds.
Hopefully someone has more experience with long running Haskell processes that use lots of memory and can give me some pointers to make the thing more stable. I've been debugging the thing for days now and I'm starting to get a bit desperate here.
A little overview of my setup:
On server startup I read about 5 gigabytes of data into a big (nested) Data.Map-alike structure in memory. The nested map is value strict and all values inside the map are of datatypes with all their fields made strict as well. I've put a lot of time into ensuring no unevaluated thunks are left. The import (depending on my system load) takes around 5-30 minutes. The strange thing is that the fluctuation between consecutive runs is way bigger than I would expect, but that's a different problem.
The big data structure lives inside a 'TVar' that is shared by all client threads spawned by the Snap server. Clients can request arbitrary parts of the data using a small query language. The amount of data requested is usually small (up to 300 KB or so) and only touches a small part of the data structure. All read-only requests are done using a 'readTVarIO', so they don't require any STM transactions.
The server is started with the following flags: +RTS -N -I0 -qg -qb. This starts the server in multi-threaded mode and disables idle-time GC and parallel GC. This seems to speed up the process a lot.
The server mostly runs without any problem. However, every now and then a client request times out and the CPU spikes to 100% (or even over 100%) and keeps doing this for a long while. Meanwhile the server does not respond to requests anymore.
There are a few reasons I can think of that might cause the CPU usage:
The request just takes a lot of time because there is a lot of work to be done. This is somewhat unlikely because sometimes it happens for requests that have proven to be very fast in previous runs (with fast I mean 20-80ms or so).
There are still some unevaluated thunks that need to be computed before the data can be processed and sent to the client. This is also unlikely, with the same reason as the previous point.
Somehow garbage collection kicks in and start scanning my entire 5GB heap. I can imagine this can take up a lot of time.
The problem is that I have no clue how to figure out what exactly is going on and what to do about it. Because the import process takes such a long time, profiling results don't show me anything useful. There seems to be no way to conditionally turn the profiler on and off from within code.
I personally suspect the GC is the problem here. I'm using GHC7 which seems to have a lot of options to tweak how GC works.
What GC settings do you recommend when using large heaps with generally very stable data?
Large memory usage and occasional CPU spikes is almost certainly the GC kicking in. You can see if this is indeed the case by using RTS options like -B, which causes GHC to beep whenever there is a major collection, -t which will tell you statistics after the fact (in particular, see if the GC times are really long) or -Dg, which turns on debugging info for GC calls (though you need to compile with -debug).
There are several things you can do to alleviate this problem:
On the initial import of the data, GHC is wasting a lot of time growing the heap. You can tell it to grab all of the memory you need at once by specifying a large -H.
A large heap with stable data will get promoted to an old generation. If you increase the number of generations with -G, you may be able to get the stable data to be in the oldest, very rarely GC'd generation, whereas you have the more traditional young and old heaps above it.
Depending on the memory usage of the rest of the application, you can use -F to tweak how much GHC will let the old generation grow before collecting it again. You may be able to tweak this parameter so that the old generation is effectively never collected.
If there are no writes, and you have a well-defined interface, it may be worthwhile making this memory un-managed by GHC (use the C FFI) so that there is no chance of a super-GC ever.
These are all speculation, so please test with your particular application.
I had a very similar issue with a 1.5GB heap of nested Maps. With the idle GC on by default I would get 3-4 secs of freeze on every GC, and with the idle GC off (+RTS -I0), I would get 17 secs of freeze after a few hundred queries, causing a client time-out.
My "solution" was first to increase the client time-out and asking that people tolerate that while 98% of queries were about 500ms, about 2% of the queries would be dead slow. However, wanting a better solution, I ended up running two load-balanced servers and taking them offline from the cluster for performGC every 200 queries, then back in action.
Adding insult to injury, this was a rewrite of an original Python program, which never had such problems. In fairness, we did get about 40% performance increase, dead-easy parallelization and a more stable codebase. But this pesky GC problem...

Resources