Win32Native.Readfile waiting for synchronisation, a performance bottleneck?

Win32Native.Readfile waiting for synchronisation, a performance bottleneck? - winapi

I profiled an application. Basically every thread reads an XML file from a network share, deserializes an object, logs to local files, asynchronously logs to db and calls a web service.
Amount of Threads is about 14 on a 24 core machine.
Redgate profiler shows me the multithreaded application is waiting for synchronisation 70% of the time. Is this an alarming signal or to be expected? Further if you can give advice how to approach analysing such a profiler log please share your knowledge.

Waiting for synchronization just means that a thread is suspended while waiting for another thread to complete an operation. Whether or not you should be concerned about this depends on how long you expect the operation on that thread to take to reach completion.
If the stack indicates a read/write, then it may just mean the disk is slow, for example. Maybe you can minimize that by changing your code; maybe it's just a flaky network or disk drive.

Related

Web Server Performance Degradation

The web application is running on Springboot and deployed on WebLogic.
We have assigned 400 as max threads and JDBC to be 100 connections.
When we perform load testing on the web application, the performance is optimal when the load is low (the response time is less than 200ms for most of the http request that we called).
When we increase the load, we can see that the thread count increases and jdbc count also increases gradually but no where near to max. However, the response time is getting much longer and it could take more than 5 seconds to response.
CPU usage, thread count, memory, JDBC connection seems to be normal during these period.
Another observation is that during testing and we saw that the performance is degrading, we used another machine to make a http call to the server that is only retrieving text without any DB or logic, and even this simple http call will take 10s to respond. (And the server resources is still not MAX!)
So, we are wondering what keep them waiting ?
Any other possible bottleneck?

If the server doesn't lack resources like CPU/RAM/etc. only a profiler can tell you where your application spends the most time which might be in:
Waiting in a queue for next thread/db connection from the pool to be available
Slow database query
Inefficient functions/algorithms which a subject to optimization
WebLogic configuration not suitable for high loads
JVM configuration not suitable for high loads (i.e. system is doing garbage collection to often/too long)
So I would recommend re-running your test with profiler tool telemetry enabled and at the same time monitoring essential JVM metrics using i.e. JMXMon Sample Collector which can be used for monitoring your application-specific metrics as well. It's a plugin which can be installed using JMeter Plugins Manager

For a detailed approach on how ago about identifying poor thread performance I suggest you take look at the TSA Method by Brendan Gregg.

Jmeter threads stuck during the load test

I am running a load test using JMeter with 200 users for approx 1hr. So, the observation is that a few threads are stuck even after the duration completes. Like 60 out of 200 get stuck. When I take the thread dump and observe that these threads are in a Runnable state. Any suggestions for resolving this issue? And I do not see anything meaningful from the JMeter log file.

You will find an unexpected increase in response time at the end of that time.
This is because of the thread's insufficient ramp-down time. Some of your threads were active and made requests to the server and didn't receive the response but threads were closed forcefully. If your JMeter test is stopped forcefully, all the active threads will be closed immediately. So the requests generated by those threads will get higher response time.
You can use Ultimate Thread Group for graceful shutdown time(ramp-down time) of threads just like the ramp-up time.
Here is an example setting:

This is not a normal behaviour for a JMeter test, most probably it indicates that either JMeter engine is overloaded (not properly configured for high loads) or the machine where JMeter is running is overloaded (i.e. lacks RAM and starts intensive swapping)
Make sure to follow JMeter Best Practices (run your test in non-GUI mode, remove all Listeners and test elements you don't need, increase JMeter heap size, etc.)
Make sure to monitor the essential health metrics of the machine where JMeter is running (CPU, RAM, Network and Disk IO, Swap file usage). You can use JMeter PerfMon Plugin for this if you don't have any better software
It might be the case you'll have to switch to Distributed Testing, 200 virtual users doesn't seem to be a "high" load to me, but it depends on what exactly these users are doing, if they're uploading/downloading large files it may be sufficient to cause the problems
Going forward consider adding the thread dump and jmeter log file contents to your question as it doesn't contain any clues so we can only come up with "blind shot" answers

You may want to check your HTTP timeouts.
I usually set Connect Timeout to 5000 milliseconds, and Response Time out to 30000.
Your values may vary for your specific environment/ application.
In this way, if things go bad on the server under test, all requests terminate within the timeout (with errors).
You have also to consider that, if you are retrieving an HTML page with all its embedded objects, and the web server is stuck, you need to wait for multiple timeouts to expire before the operation terminate.

Monitor and create logs of W3WP.exe process CPU utilization and request URL's when CPU spikes more than 50%

In our Application there are more the 2000 pages which are deployed in prod server. Sometime when user browse some URL's the CPU spikes going more than 70%. I can not find when it's occurs and which URL create this. So can any one tell me best open source tool to Monitor and Create logs of W3WP.exe process CPU utilization and request URL's when CPU spikes more than 50%.

procdump + windbg
There is a sysinternals tool called procdump which can automatically create a memory dump of your process for analysis when cpu exceeds a threshold.
From the command line usage:
-c CPU threshold at which to create a dump of the process.
Once you have a process dump you will need to load it into windbg in order to analyze what's taking up all the cpu cycles. Covering off windbg is pretty big, but here's briefly what you need to do:
load the SOS dll (managed debug extension)
call the !runaway command to get list of long running threads
dive into a long running thread by selecting it and calling !clrstack command
There are many blogs on using windbg. Here is one example. A great resource on analyzing these types of issues is Tess Ferrandez's blog.
perfmon + procdump + windbg
Perfmon can help you see if the issue is related to high rates of memory allocation which is causing garbage collection. You can look at CPU for w3wp as well as allocation rates for the process and the number of Gen 2 collections occurring. Gen 2 collections mean Gen 1 and 0 are also collected, meaning it can be an expensive operation. Counters to look at:
# Gen 2 Collections
% Time in GC
Allocated Bytes/second
If you see some very high allocation rates, you will still need a memory dump (procdump) and windbg to analyse what the root cause is.
Again - Tess Ferrandez has a blog post on this flavor of high cpu. In this post the issue is allocating large objects onto the heap.
perfmon + appcmd
I haven't tried this myself but in theory it should work, and is simpler than other options - though will not produce same level of detail. You can configure perfmon alerts on cpu for w3wp.exe. The alerts can be configured to run a task. You can create a batch file which runs the appcmd IIS tool and tell it to dump all the running requests:
appcmd list requests > c:\temp\high-cpu-requests.txt
This way you will get a list of long running requests when the cpu is high, and hopefully be able to work out offending page from there.

IIS Advanced Logging may help you here.
Whilst it will not give you CPU Utilisation per request, it can log CPU utilisation in general. What you could do is try and match these spikes to the requests that come before it.

Windows 2003 server socket error 10055

I was running a very big application on Windows 2003 server. It creates almost 900 threads and a single thread who is operating on a socket. It's a C++ application which I had compiled with Visual Studio environment.
After almost 17-20 hours of testing, I get 10055 socket error while sending the data.
Apart from this error my application runs excellently without any error or issue. It's a quad core system with 4 GiB of RAM and this application occupies around 30-40% CPU (on all 4 CPUs) in all of its running.
Can anyone here help me to pass through this. I had searched almost everything on google regarding this error but could not get anything relevant to my case.

I think, it's impossible to say mo than:
Error 10055 means that Windows has run
out of TCP/IP socket buffers because
too many connections are open at once.
http://kbase.pscs.co.uk/index.php?article=93
https://wiki.pscs.co.uk/how_to:10055

I have seen this symptom before in an IOCP socket system. I had to throttle outgoing async socket sends so that not too much data gets queued in the kernel waiting to be sent on the socket.
Although the error text says this happens due to number of connections, that's not my experience. If you write a tight loop doing async sends on a single socket, with no throttling, you can hit this very quickly.
Possibly #Len Holgate has something to add here, he's my "goto guy" for Windows sockets problems.

It creates almost 900 threads
That's partially your problem. Each thread is likely using the default 1MB of stack. You start to approach a GB of thread overhead. Chances of running out of memory are high. The whole point of using IOCP is so that you don't have to create a "thread per connection". You can just create several threads (from 1x - 4x the number of CPUs) to listen on the completion port handler and have each thread service a different request to maximize scalability.
I recall reading an article linked off of Stack Overflow that buffers you post for pending IOCP operations are setup such that the operating system WILL NOT let the memory swap out from physical memory to disk. And then you can run out of system resources when the connection count gets high.
The workaround, if I recall correctly, is to post a 0 byte buffer (or was it a 1 byte buffer) for each socket connection. When data arrives, your completion port handler will return, and that's a hint to your code to post a larger buffer. If I can find the link, I'll share it.

Monitoring tools accuracy - Debugging application latency

we are having latency issues in one of our network application. Most of the time requests are being handled within 100ms. But sometime it can take up to a few seconds for no apparent reason.
So I hooked up some monitoring tools and looked up what was happening (Wireshark to monitor the network externally through port replication and Process Monitor to see what was happening on the local machine).
I was able to match tcp packets and they usually where within a millisecond of eachother in both logs file. But in one occurence, the last packet of a series was delayed by more then 250ms in Process Monitor compared to wireshark (and the application erratic behavior - due to latency - was being observed).
Since Wireshark was hooked up on another computer I'm quite sure that what was being monitored was accurate : all the packed did reach the network card on time.
As for Process Monitor I'm not totally sure about how it work : when is the network data being registered? Is it when it reach the network card? When it is made available to the application? When the application reads the data?
During these 250ms there were a few other events being registered which let me believe that Process Monitor was recording correctly and that this 250ms delay wasn't "created" by it.
Any help regarding the behavior of Process Monitor, the current method I use to dig down the problem or what you think could be the problem would be much appreciated.

Option 2
Perhaps you're experiencing the infamous 250ms delays that the GC cause from time to time (link). You can accurately measure GC suspensions using a specialized CLR host (link)
Option 1 - was ruled out
Since you are using TCP, I'd suggest that you'll turn on the NoDelay option on your socket just to eliminate the possibility that you're suffering from a clash between Nagle's Algorithm and the Delayed ACKs Algorithm. If you're experiencing "batching" of packets while sometimes a packet is "delayed" for about 200ms, then it just might be the issue.
A more in-depth explanation of this behavior can be found here.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio