Jersey, Java 8, Performance, Slow URLConnectionClientHandler.getInputStream - performance

We are currently facing very weird problem in our enterprise application.
We are using
- Jersey 1.17
- JDK 1.8 (which we have recently migrated)
In our application what we do is, we make REST http calls to get the data from our different application server. Everything is normal in terms of speed/performance unless the load is increased.
When we excute the method (CallXService in below trace) which internally makes 10 to 15 Rest calls to different application the two reading are captured like
7 seconds - when there is almost no load.
28 seconds - when there is more load.
When we deep dive into the calls of 28 seconds call we get following trace along with the time captured.
JerseyRestClient.CallXService:469(0 ms self time, 6143 ms total time)
WebResource.post:251(0 ms self time, 6143 ms total time)
WebResource.handle:680(0 ms self time, 6143 ms total time)
Client.handle:648(0 ms self time, 6143 ms total time)
URLConnectionClientHandler.handle:149(0 ms self time, 6143 ms total time)
URLConnectionClientHandler._invoke:249(10 ms self time, 6143 ms total time)
URLConnectionClientHandler.getInputStream:310(6091 ms self time, 6091 ms total time)
Note the time taken by getInputStream here.
Where as in first 7 seconds getInputStream is not taking much time.
What does that time tells
1) Slow responding server
2) Network speed while getting the resource
3) Or the problem of Java 8 and Jersey 1.17.
Any help is much appreciated. Thanks !

Related

JMeter - Throughput Shaping Timer does not keep the requests/sec rate

I am using Ultimate Thread Group and fixed 1020 threads count for entire test duration - 520 seconds.
I've made a throughput diagram as follows:
The load increses over 10 seconds so the spikes shouldn't be very steep. Since the max RPS is 405 and max response time is around 25000ms 1020 threads should be enough.
However, when I run the test (jmeter -t spikes-nomiss.jmx -l spikes-nomiss.csv -e -o spikes-nomiss -n) I have the following graph for hits/seconds.
The threads are stopped for few seconds and suddenly 'wake up'. I can't find a reason for it. The final minute has a lot higher frequency of the calls. I've set heap size to 2GBs and resources are available, the CPU usage does not extend 50% during peaks, and memory is around 80% (4Gbs of ram on the machine). Seeking any help to fix the freezes.
Make sure to monitor JMeter's JVM using JConsole as it might be the case JMeter is not capable of create spikes due to insufficient resources. The slowdowns can be caused by excessive Garbage Collection
It might be the case 1020 threads are not enough to reach the desired throughput as it depends mainly on your application response time. If your application response time is higher than 300 milliseconds - you will not be able to get 405 RPS using 1020 threads. It might be a better idea to consider using Concurrency Thread Group which can be connected to the Throughput Shaping Timer via Schedule Feedback function

WinDbg runaway command output explained

I have a production CPU issue, after days of regular activity suddenly the CPU starts to peak. I've saved the dump file and run the !runaway command to get the list of highest CPU time consuming threads. the output is below:
User Mode Time
Thread Time
21:110 0 days 10:51:39.781
19:f84 0 days 10:41:59.671
5:cc4 0 days 0:53:25.343
48:74 0 days 0:34:20.140
47:1670 0 days 0:34:09.812
13:460 0 days 0:32:57.640
8:14d4 0 days 0:19:30.546
7:d90 0 days 0:03:15.000
23:1520 0 days 0:02:21.984
22:ca0 0 days 0:02:08.375
24:72c 0 days 0:02:01.640
29:10ac 0 days 0:01:58.671
27:1088 0 days 0:01:44.390
As you can see, the output shows I've 2 threads: 21 & 19, that consumes more than 20 hours of CPU time combined ,I was able to track the callstack of 1 of those threads like so:
~21s
!CLRStack
the output doesn't matter at the moment, let's call it the "X callstack"
What I would like, is an explanation about the !runaway command output. from what I understand, a dump file is a snapshot of the current state of the application. so my questions are:
How can the runaway command shows 10:51 hours value for thread 21, when the dumping process only took a few seconds?
Does it mean that the specific "instance" of the X callstack I've found with the !CLRStack command is hang more than 10 hours? or it's the total time the 21 thread executed his whole X callstacks executions? If so, it seems strange that the 21 thread responsible for so many executions of the X callstacks. As I know the origin is a web request (the runtime should assign a random thread for each call)
I've a speculation that may answer those 2 questions:
Maybe the windbg calculate the time by taking the thread callstack actual time and dividing it by the scope of the dumping process, so if for example the specific execution of the X callstack took 1 second and the whole dumping process took 3 seconds (33%), while the process was running for total of 24 hours the output will show:
8 hours (33% of 24 hours)
Am I right, or completely got it wrong?
This answer is intended to be comprehensible for the OP. It's not intended to be correct into all bits and bytes.
[...] and dividing it by the scope of the dumping process [...]
This understanding is probably the root of all evil: dumping a process only gives you the state of the process at a certain point in time. The duration of dumping the process is 0.0 seconds, since all threads are suspended during the operation. (so, relative time for your process, nothing has changed and time is standing still; of course wall clock time changes)
You are thinking of dumping a process as monitoring it over a longer period of time, which is not the case. Dumping a process just takes time because it involves disk activity etc.
So no, there is no "scope" and thus you cannot (it's really hard) measure performance issues with crash dumps.
How can the runaway command shows 10:51 hours value for thread 21, [...]
How can your C# program know how long the program is running if you only have a timer event that fires every second? The answer is: it uses a variable and increases the value.
That's roughly how Windows does it. Windows is responsible for thread scheduling and each time it re-schedules threads, it updates a variable that contains the thread time.
When writing the crash dump, the information that was collected by the OS long time ago already, is included in the crash dump.
[...] when the dumping process only took a few seconds?
Since the crash dump is taken by a thread of WinDbg, the time for that is accounted on that thread. You would need to debug WinDbg and do !runaway on a WinDbg thread to see how much CPU time that took. Potentially a nice exercise and the .dbgdbg (debug the debugger) command may be new to you; other than that, this particular case is not really helpful.
Does it mean that the specific "instance" of the X callstack I've found with the !CLRStack command is hang more than 10 hours?
No. It means that at the point in time when you created the crash dump, that specific method was executed. Not more, not less.
This information is unrelated to !runaway, because the thread may have been doing something totally different for a long time, but that ended just a moment ago.
or it's the total time the 21 thread executed his whole X callstacks executions?
No. A crash dump does not contain such detailed performance data. You need a performance profiler like JetBrains dotTrace do get that information. A profiler will look at callstacks very often, then aggregate identical call stacks and derive CPU time per call stack.

Python 3 multiprocessing: optimal chunk size

How do I find the optimal chunk size for multiprocessing.Pool instances?
I used this before to create a generator of n sudoku objects:
processes = multiprocessing.cpu_count()
worker_pool = multiprocessing.Pool(processes)
sudokus = worker_pool.imap_unordered(create_sudoku, range(n), n // processes + 1)
To measure the time, I use time.time() before the snippet above, then I initialize the pool as described, then I convert the generator into a list (list(sudokus)) to trigger generating the items (only for time measurement, I know this is nonsense in the final program), then I take the time using time.time() again and output the difference.
I observed that the chunk size of n // processes + 1 results in times of around 0.425 ms per object. But I also observed that the CPU is only fully loaded the first half of the process, in the end the usage goes down to 25% (on an i3 with 2 cores and hyper-threading).
If I use a smaller chunk size of int(l // (processes**2) + 1) instead, I get times of around 0.355 ms instead and the CPU load is much better distributed. It just has some small spikes down to ca. 75%, but stays high for much longer part of the process time before it goes down to 25%.
Is there an even better formula to calculate the chunk size or a otherwise better method to use the CPU most effective? Please help me to improve this multiprocessing pool's effectiveness.
This answer provides a high level overview.
Going into detais, each worker is sent a chunk of chunksize tasks at a time for processing. Every time a worker completes that chunk, it needs to ask for more input via some type of inter-process communication (IPC), such as queue.Queue. Each IPC request requires a system call; due to the context switch it costs anywhere in the range of 1-10 μs, let's say 10 μs. Due to shared caching, a context switch may hurt (to a limited extent) all cores. So extremely pessimistically let's estimate the maximum possible cost of an IPC request at 100 μs.
You want the IPC overhead to be immaterial, let's say <1%. You can ensure that by making chunk processing time >10 ms if my numbers are right. So if each task takes say 1 μs to process, you'd want chunksize of at least 10000.
The main reason not to make chunksize arbitrarily large is that at the very end of the execution, one of the workers might still be running while everyone else has finished -- obviously unnecessarily increasing time to completion. I suppose in most cases a delay of 10 ms is a not a big deal, so my recommendation of targeting 10 ms chunk processing time seems safe.
Another reason a large chunksize might cause problems is that preparing the input may take time, wasting workers capacity in the meantime. Presumably input preparation is faster than processing (otherwise it should be parallelized as well, using something like RxPY). So again targeting the processing time of ~10 ms seems safe (assuming you don't mind startup delay of under 10 ms).
Note: the context switches happen every ~1-20 ms or so for non-real-time processes on modern Linux/Windows - unless of course the process makes a system call earlier. So the overhead of context switches is no more than ~1% without system calls. Whatever overhead you're creating due to IPC is in addition to that.
Nothing will replace the actual time measurements. I wouldn't bother with a formula and try a constant such as 1, 10, 100, 1000, 10000 instead and see what works best in your case.

Is QueryPerformanceCounter counter process specific?

https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx
https://msdn.microsoft.com/en-us/library/ms644904(VS.85).aspx
Imagine that I measure some part of code (20ms)
Context switching happend. And my thread was displaced by another thread which was executed (20 ms)
Then I receive quantum of time back from scheduler and perform some cals during 1ms.
If calculate elapsed time then what time will I receive? 41ms or 21 ms?
If calculate elapsed time then what time will I receive? 41ms or 21 ms?
QueryPerformanceCounter reports wall clock time. So the answer will be 41ms.

High Resolution GetMessageTime()?

WINAPI has a function GetMessageTime() that returns the time a message was generated in system time, which has a resolution of 10 to 16 ms. Is there an effective way to get the time an event occurred in interrupt time (100 ns precision), or in some other format with at least 1 ms precision?
Even with Raw Input, the message timing isn't delivered with < 10 ms resolution. (The timing data comes from the WM_INPUT message.) As far as I can tell from the keyboard driver sources, the timing data simply isn't collected with < 10 ms resolution.

Resources