Java 8 Concurrent Hash Map get Performance/Alternative - performance

I have a high throughput low latency application (3000 Request/Sec, 100ms per request), and we heavily use Java 8 ConcurrentHashMap for performing lookups. Usually these maps are updated by a single background thread and multiple threads read from these maps.
I am seeing a performance bottleneck, and on profiling I find ConcurrentHashMap.get as being the hotspot and taking majority of the time.
I another case, I see ConcurrentHashMap.computeIfAbsent being the hotspot, although the mapping-function has very small latency and the profile shows computeIfAbsent spending 90% of the time executing itself, and very less time in executing the mapping-function.
My question is there any way i could improve the performance? I have around 80 threads concurrently reading from CHM.

I have around 80 threads concurrently reading from CHM.
The simplest things to do are
if you have a CPU bound process, don't have more active threads than you have CPUs, otherwise this will only add overhead and if these threads hold a lock while not running, because you have too many threads, it will really not help.
increase the number of partitions. You will want to have at least 4x the number of segments/partitions and you have threads accessing a single map. However, you will get strange behaviour in CHM if you access it with more than 40 threads due to the way cache coherency works. I suggest using a more efficient data structure for higher degrees of concurrency. In Java 8 the concurrencyLevel is a hint, but it is better than leaving the default initialise size of 16.
don't spend so much time in CHM. Find a way to do useful work without hitting a shared resource and your threads will run much more efficiently.
If you have any latencies you can see in a low latency system, you have a problem IMHO.

Related

maxed CPU performance - stack all tasks or aim for less than 100%?

I have 12 tasks to run on an octo-core machine. All tasks are CPU intensive and each will max out a core.
Is there a theoretical reason to avoid stacking tasks on a maxed out core (such as overhead, swapping across tasks) or is it faster to queue everything?
Task switching is a waste of CPU time. Avoid it if you can.
Whatever the scheduler timeslice is set to, the CPU will waste its time every time slice by going into the kernel, saving all the registers, swapping the memory mappings and starting the next task. Then it has to load in all its CPU cache, etc.
Much more efficient to just run one task at a time.
Things are different of course if the tasks use I/O and aren't purely compute bound.
Yes it's called queueing theory https://en.wikipedia.org/wiki/Queueing_theory. There are many different models https://en.wikipedia.org/wiki/Category:Queueing_theory for a range of different problems I'd suggest you scan them and pick the one most applicable to your workload then go and read up on how to avoid the worst outcomes for that model, or pick a different, better, model for dispatching your workload.
Although the graph at this link https://commons.wikimedia.org/wiki/File:StochasticQueueingQueueLength.png applies to Traffic it will give you an idea of what is happening to response times as your CPU utilisation increases. It shows that you'll reach an inflection point after which things get slower and slower.
More work is arriving than can be processed with subsequent work waiting longer and longer until it can be dispatched.
The more cores you have the further to the right you push the inflection point but the faster things go bad after you reach it.
I would also note that unless you've got some really serious cooling in place you are going to cook your CPU. Depending on it's design it will either slow itself down, making your problem worse, or you'll trigger it's thermal overload protection.
So a simplistic design for 8 cores would be, 1 thread to manage things and add tasks to the work queue and 7 threads that are pulling tasks from the work queue. If the tasks need to be performed within a certain time you can add a TimeToLive value so that they can be discarded rather than executed needlessly. As you are almost certainly running your application in an OS that uses a pre-emptive threading model consider things like using processor affinity where possible because as #Zan-Lynx says task/context switching hurts. Be careful not to try to build your OS'es thread management again as you'll probably wind up in conflict with it.
tl;dr: cache thrash is Bad
You have a dozen tasks. Each will have to do a certain amount of work.
At an app level they each processed a thousand customer records or whatever. That is fixed, it is a constant no matter what happens on the hardware.
At the the language level, again it is fixed, C++, java, or python will execute a fixed number of app instructions or bytecodes. We'll gloss over gc overhead here, and page fault and scheduling details.
At the assembly level, again it is fixed, some number of x86 instructions will execute as the app continues to issue new instructions.
But you don't care about how many instructions, you only care about how long it takes to execute those instructions. Many of the instructions are reads which MOV a value from RAM to a register. Think about how long that will take. Your computer has several components to implement the memory hierarchy - which ones will be involved? Will that read hit in L1 cache? In L2? Will it be a miss in last-level cache so you wait (for tens or hundreds of cycles) until RAM delivers that cache line? Did the virtual memory reference miss in RAM, so you wait (for milliseconds) until SSD or Winchester storage can page in the needed frame? You think of your app as issuing N reads, but you might more productively think of it as issuing 0.2 * N cache misses. Running at a different multi-programming level, where you issue 0.3 * N cache misses, could make elapsed time quite noticeably longer.
Every workload is different, and can place larger or smaller demands on memory storage. But every level of the memory hierarchy depends on caching to some extent, and higher multi-programming levels are guaranteed to impact cache hit rates. There are network- and I/O-heavy workloads where very high multi-programming levels absolutely make sense. But for CPU- and memory-intensive workloads, when you benchmark elapsed times you may find that less is more.

Reaching limits of Apache Storm

We are trying to implement a web application with Apache Storm.
Applicationreceives a huge load of ad-requests (100 TPS - a hundred transactions / second ),makes some simple calculation on them and then stores the result in a NoSQL database with a maximum latency of 10 ms.
We are using Cassandra as a sink for its writing capabilities.
However, we have already overpassed the 8 ms requirement, we are in 100ms.
We tried to minimize the size of buffers (Disruptor buffers) and to well balance the topology, using the parallelism of bolts.
But we still in 20ms.
With 4 worker ( 8 cores / 16GB ) we are at 20k TPS which is still very low.
Is there any suggestions for optimization orare we just reaching the limits of Apache Storm(limits of Java)?
I don't know the platform you're using, but in C++ 10ms is eternity. I would think you are using the wrong tools for the job.
Using C++, serving some local query should take under a microsecond.
Non-local queries that touch multiple memory locations and/or have to wait for disk or network I/O, have no choice but taking more time. In this case parallelism is your best friend.
You have to find the bottleneck.
Is it I/O?
Is it CPU?
Is it memory bandwidth?
Is it memory access time?
After you've found the bottleneck, you can either improve it, async it and/or multiply (=parallelize) it.
There's a trade-off between low latency and high throughput.
If you really need to have high throughput, you should rely on batching adjusting size of buffers bigger, or using Trident.
Trying to avoid transmitting tuples to other workers helps low latency. (localOrShuffleGrouping)
Please don't forget to monitor GC which causes stop-the-world. If you need low-latency, it should be minimized.

What is the maximum number of threads that can be running in a Delphi application?

In a Delphi application, what is the maximum number of concurrent threads that can be running at one time ? Suppose that a single thread processing time is about 100 milliseconds.
The number of concurrent threads is limited by available resources. However, keep in mind that every thread uses a minimum amount of memory (usually 1MB by default, unless you specify differently), and the more threads you run, the more work the OS has to do to manage them, and the more time it takes just to keep switching between them so they have fair opportunity to run. A good rule of thumb is to not have more threads than there are CPUs available, since that will be the maximum number of threads that can physically run at any given moment. But you can certainly have more threads than CPUs, the OS will simply schedule them accordingly, which can degrade performance if you have too many running at a time. So you need to think about why you are using threads in the first place and plan accordingly to trade off between performance, memory usage, overhead, etc. Multi-threaded programming is not trivial, so do not treat it lightly.
This is memory dependent, there is no fixed limit to how many threads or other objects that you can create. At some point, if you allocate too much memory, you may get an "out of memory" exception, so you should think about how many threads you really need to invoke and go from there. Also keep in mind the more threads that you invoke, you should expect the processing time for all of the threads to decrease. So you may not get the performance that you're looking for if you have too many concurrent threads at once. I hope that this helps!

Optimal number of threads per core

Let's say I have a 4-core CPU, and I want to run some process in the minimum amount of time. The process is ideally parallelizable, so I can run chunks of it on an infinite number of threads and each thread takes the same amount of time.
Since I have 4 cores, I don't expect any speedup by running more threads than cores, since a single core is only capable of running a single thread at a given moment. I don't know much about hardware, so this is only a guess.
Is there a benefit to running a parallelizable process on more threads than cores? In other words, will my process finish faster, slower, or in about the same amount of time if I run it using 4000 threads rather than 4 threads?
If your threads don't do I/O, synchronization, etc., and there's nothing else running, 1 thread per core will get you the best performance. However that very likely not the case. Adding more threads usually helps, but after some point, they cause some performance degradation.
Not long ago, I was doing performance testing on a 2 quad-core machine running an ASP.NET application on Mono under a pretty decent load. We played with the minimum and maximum number of threads and in the end we found out that for that particular application in that particular configuration the best throughput was somewhere between 36 and 40 threads. Anything outside those boundaries performed worse. Lesson learned? If I were you, I would test with different number of threads until you find the right number for your application.
One thing for sure: 4k threads will take longer. That's a lot of context switches.
I agree with #Gonzalo's answer. I have a process that doesn't do I/O, and here is what I've found:
Note that all threads work on one array but different ranges (two threads do not access the same index), so the results may differ if they've worked on different arrays.
The 1.86 machine is a macbook air with an SSD. The other mac is an iMac with a normal HDD (I think it's 7200 rpm). The windows machine also has a 7200 rpm HDD.
In this test, the optimal number was equal to the number of cores in the machine.
I know this question is rather old, but things have evolved since 2009.
There are two things to take into account now: the number of cores, and the number of threads that can run within each core.
With Intel processors, the number of threads is defined by the Hyperthreading which is just 2 (when available). But Hyperthreading cuts your execution time by two, even when not using 2 threads! (i.e. 1 pipeline shared between two processes -- this is good when you have more processes, not so good otherwise. More cores are definitively better!) Note that modern CPUs generally have more pipelines to divide the workload, so it's no really divided by two anymore. But Hyperthreading still shares a lot of the CPU units between the two threads (some call those logical CPUs).
On other processors you may have 2, 4, or even 8 threads. So if you have 8 cores each of which support 8 threads, you could have 64 processes running in parallel without context switching.
"No context switching" is obviously not true if you run with a standard operating system which will do context switching for all sorts of other things out of your control. But that's the main idea. Some OSes let you allocate processors so only your application has access/usage of said processor!
From my own experience, if you have a lot of I/O, multiple threads is good. If you have very heavy memory intensive work (read source 1, read source 2, fast computation, write) then having more threads doesn't help. Again, this depends on how much data you read/write simultaneously (i.e. if you use SSE 4.2 and read 256 bits values, that stops all threads in their step... in other words, 1 thread is probably a lot easier to implement and probably nearly as speedy if not actually faster. This will depend on your process & memory architecture, some advanced servers manage separate memory ranges for separate cores so separate threads will be faster assuming your data is properly filed... which is why, on some architectures, 4 processes will run faster than 1 process with 4 threads.)
The answer depends on the complexity of the algorithms used in the program. I came up with a method to calculate the optimal number of threads by making two measurements of processing times Tn and Tm for two arbitrary number of threads ‘n’ and ‘m’. For linear algorithms, the optimal number of threads will be N = sqrt ( (mn(Tm*(n-1) – Tn*(m-1)))/(nTn-mTm) ) .
Please read my article regarding calculations of the optimal number for various algorithms: pavelkazenin.wordpress.com
The actual performance will depend on how much voluntary yielding each thread will do. For example, if the threads do NO I/O at all and use no system services (i.e. they're 100% cpu-bound) then 1 thread per core is the optimal. If the threads do anything that requires waiting, then you'll have to experiment to determine the optimal number of threads. 4000 threads would incur significant scheduling overhead, so that's probably not optimal either.
I thought I'd add another perspective here. The answer depends on whether the question is assuming weak scaling or strong scaling.
From Wikipedia:
Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor.
Strong scaling: how the solution time varies with the number of processors for a fixed total problem size.
If the question is assuming weak scaling then #Gonzalo's answer suffices. However if the question is assuming strong scaling, there's something more to add. In strong scaling you're assuming a fixed workload size so if you increase the number of threads, the size of the data that each thread needs to work on decreases. On modern CPUs memory accesses are expensive and would be preferable to maintain locality by keeping the data in caches. Therefore, the likely optimal number of threads can be found when the dataset of each thread fits in each core's cache (I'm not going into the details of discussing whether it's L1/L2/L3 cache(s) of the system).
This holds true even when the number of threads exceeds the number of cores. For example assume there's 8 arbitrary unit (or AU) of work in the program which will be executed on a 4 core machine.
Case 1: run with four threads where each thread needs to complete 2AU. Each thread takes 10s to complete (with a lot of cache misses). With four cores the total amount of time will be 10s (10s * 4 threads / 4 cores).
Case 2: run with eight threads where each thread needs to complete 1AU. Each thread takes only 2s (instead of 5s because of the reduced amount of cache misses). With four cores the total amount of time will be 4s (2s * 8 threads / 4 cores).
I've simplified the problem and ignored overheads mentioned in other answers (e.g., context switches) but hope you get the point that it might be beneficial to have more number of threads than the available number of cores, depending on the data size you're dealing with.
4000 threads at one time is pretty high.
The answer is yes and no. If you are doing a lot of blocking I/O in each thread, then yes, you could show significant speedups doing up to probably 3 or 4 threads per logical core.
If you are not doing a lot of blocking things however, then the extra overhead with threading will just make it slower. So use a profiler and see where the bottlenecks are in each possibly parallel piece. If you are doing heavy computations, then more than 1 thread per CPU won't help. If you are doing a lot of memory transfer, it won't help either. If you are doing a lot of I/O though such as for disk access or internet access, then yes multiple threads will help up to a certain extent, or at the least make the application more responsive.
Benchmark.
I'd start ramping up the number of threads for an application, starting at 1, and then go to something like 100, run three-five trials for each number of threads, and build yourself a graph of operation speed vs. number of threads.
You should that the four thread case is optimal, with slight rises in runtime after that, but maybe not. It may be that your application is bandwidth limited, ie, the dataset you're loading into memory is huge, you're getting lots of cache misses, etc, such that 2 threads are optimal.
You can't know until you test.
You will find how many threads you can run on your machine by running htop or ps command that returns number of process on your machine.
You can use man page about 'ps' command.
man ps
If you want to calculate number of all users process, you can use one of these commands:
ps -aux| wc -l
ps -eLf | wc -l
Calculating number of an user process:
ps --User root | wc -l
Also, you can use "htop" [Reference]:
Installing on Ubuntu or Debian:
sudo apt-get install htop
Installing on Redhat or CentOS:
yum install htop
dnf install htop [On Fedora 22+ releases]
If you want to compile htop from source code, you will find it here.
The ideal is 1 thread per core, as long as none of the threads will block.
One case where this may not be true: there are other threads running on the core, in which case more threads may give your program a bigger slice of the execution time.
One example of lots of threads ("thread pool") vs one per core is that of implementing a web-server in Linux or in Windows.
Since sockets are polled in Linux a lot of threads may increase the likelihood of one of them polling the right socket at the right time - but the overall processing cost will be very high.
In Windows the server will be implemented using I/O Completion Ports - IOCPs - which will make the application event driven: if an I/O completes the OS launches a stand-by thread to process it. When the processing has completed (usually with another I/O operation as in a request-response pair) the thread returns to the IOCP port (queue) to wait for the next completion.
If no I/O has completed there is no processing to be done and no thread is launched.
Indeed, Microsoft recommends no more than one thread per core in IOCP implementations. Any I/O may be attached to the IOCP mechanism. IOCs may also be posted by the application, if necessary.
speaking from computation and memory bound point of view (scientific computing) 4000 threads will make application run really slow. Part of the problem is a very high overhead of context switching and most likely very poor memory locality.
But it also depends on your architecture. From where I heard Niagara processors are suppose to be able to handle multiple threads on a single core using some kind of advanced pipelining technique. However I have no experience with those processors.
Hope this makes sense, Check the CPU and Memory utilization and put some threshold value. If the threshold value is crossed,don't allow to create new thread else allow...

What is the performance hit of Performance Counters

When considering using performance counters as my companies' .NET based site, I was wondering how big the overhead is of using them.
Do I want to have my site continuously update it's counters or am I better off to only do when I measure?
The overhead of setting up the performance counters is generally not high enough to worry about (setting up a shared memory region and some .NET objects, along with CLR overhead because the CLR actually does the management for you). Here I'm referring to classes like PerformanceCounter.
The overhead of registering the perfromance counters can be decently slow, but generally is not a concern because it is intended to happen once at setup time because you want to change machine-wide state. It will be dwarfed by any copying that you do. It's not generally something you want to do at runtime. Here I'm referring to PerformanceCounterInstaller.
The overhead of updating a performance counter generally comes down to the cost of performing an Interlocked operation on the shared memory. This is slower than normal memory access but is a processor primitive (that's how it gets atomic operations across the entire memory subsystem including caches). Generally this cost is not high to worry about. It could be 10 times a normal memory operation, potentially worse depending on the update and what contention is like across threads and CPUs. But consider this, it's literally impossible to do any better than interlocked operations for cross-process communication with atomic updates, and no locks are held. Here I refer to PerformanceCounter.Increment and similar methods.
The overhead of reading a performance counter is generally a read from shared memory. As others have said, you want to sample on a reasonable period (just like any other sampling) but just think of PerfMon and try to keep the sampling on a human scale (think seconds instead of milliseconds) and you proably won't have any problems.
Finally, an appeal to experience: Performance counters are so lightweight that they are used everywhere in Windows, from the kernel to drivers to user applications. Microsoft relies on them internally.
Advice: The real question with performance counters is the learning curve in understanding (which is moderate) and one measuring the right things (seems easy but often you get it wrong).
The performance impact is negligible in updating. Microsoft's intent is that you always write to the performance counters. It's the monitoring of (or capturing of) those performance counters that will cause a degradation of performance. So, only when you use something like perfmon to capture the data.
In effect, the performance counter objects will have the effect of only "doing it when you measure."
I've tested these a LOT.
On an old compaq 1Ghz 1 processor machine, I was able to create about 10,000 counters and monitor them remotely for about 20% CPU usage. These aren't custom counters, just checking CPU or whatever.
Basically, you can monitor all the counters on any decent newer machine with very little impact.
The instantiation of the object can take a long time tho, a few seconds to a few minutes. I suggest you multithread this for all the counters you collect otherwise your app will sit there forever creating these objects. Not sure what MS does once you create it that takes so long, but you can do it for 1000 counters with 1000 threads in the same time you can do it for 1 counter and 1 thread.
A performance counter is just a pointer to 4/8 bytes in shared memory (aka memory mapped file), so their cost is very similar to that of accessing an int/long variabile.
I agree with famoushamsandwich, but would add that as long as your sampling rate is reasonable (5 seconds or more) and you monitor a reasonable set of counters, then the impact of measuring is negligible as well (in most cases).
The thing that I have found is that it is not that slow for the majority of applications. I wouldn't put one in a tight loop, or something that is called thousands of times a second.
Secondly, I found that programmatically creating the performance counters is very slow, so make sure that you create them before hand and not in code.

Resources