I'm having intermittent issues with WCF calls taking minutes where they used to be sub-second.
I'm using the xxAsync methods and a new proxy with each call. The binding is HTTP, and up until our latest code release this worked fine. We've reviewed the release and can't see anything that would directly explain the issue.
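For reference, the proxy-per-call pattern looks roughly like this (a minimal sketch; the service, operation and return-type names are placeholders, not our actual contract):

```csharp
// Minimal sketch of the proxy-per-call pattern in question. OrdersClient, GetOrderAsync
// and Order are placeholders for the generated WCF client, its xxAsync operation and its
// return type, not our actual service contract.
public async Task<Order> FetchOrderAsync(int id)
{
    var client = new OrdersClient();            // fresh proxy for this call, HTTP binding
    try
    {
        var order = await client.GetOrderAsync(id);
        client.Close();                         // close the channel cleanly on success
        return order;
    }
    catch
    {
        client.Abort();                         // a faulted channel must be aborted, not closed
        throw;
    }
}
```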
In the screenshot I have highlighted the key entries. Between these two lines we lose 2 minutes.
My question is - what is happening between/inside these two events that might lead me to the issue?
Related question:
We are working on an ASP.NET 5 Web API project that is now in production, but we are experiencing an issue where it intermittently becomes unresponsive throughout the day.
A few notes about the application architecture: it is an ASP.NET Web API project using a MariaDB database on a separate EC2 instance within the same private network. The connection string uses the private IP of the database server to avoid any name-resolution issues. The site is hosted via IIS 10.
The application itself has been developed carefully following Microsoft's recommended best practices, with a heavy focus on async operations, minimizing query response times, and offloading more expensive operations to background services.
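To make the async/offloading point concrete, here is a minimal sketch of the kind of pattern we follow; the queue type and the usage line are illustrative, not our actual code:

```csharp
// Illustrative only: request handlers enqueue expensive work and return immediately;
// a single background loop (started once at application startup) drains the queue.
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public class BackgroundWorkQueue
{
    private readonly BlockingCollection<Func<CancellationToken, Task>> _work =
        new BlockingCollection<Func<CancellationToken, Task>>();

    public void Enqueue(Func<CancellationToken, Task> workItem) => _work.Add(workItem);

    public Task RunAsync(CancellationToken token) => Task.Run(async () =>
    {
        foreach (var workItem in _work.GetConsumingEnumerable(token))
        {
            try { await workItem(token); }
            catch (Exception) { /* log and keep the loop alive */ }
        }
    }, token);
}

// Usage from a request handler: queue the slow part instead of awaiting it in the request.
// _queue.Enqueue(ct => invoiceService.GenerateAsync(orderId, ct));
```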
The app is extremely responsive. It delivers sub-100ms responses on almost all requests, even the more complicated ones, and that level of performance holds right up until it becomes unresponsive. We tend to see 10-30 requests per second and 300-500 SELECT queries per second at peak usage, so nothing too extreme. However, randomly (2-3 times over a 24-hour period) it begins hanging on requests and simply stops responding to them. During this time the database is still extremely responsive, and we are never over 300 connections out of our 512-connection limit.
The resources on the application server itself are never taxed much at all: the CPU never gets above ~20% and memory usage sits around 20-30%.
If I stop the site in IIS and start it again while this is happening, it quickly comes back online. If I don't, it stays down for a few minutes until IIS finally kills it due to a failed health check. There are no real errors generated in response to the issue other than the typical errors caused by a hanging process, such as connection-terminated errors. The only thing that gave me pause was that there were a few connection timeouts when getting a connection from the pool, but, like I said, the connections to the server are never close to the limit.
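One thing worth noting about those pool timeouts: the ADO.NET connection pool lives on the web server, per process and per connection string (typically capped at 100 by default), so acquiring a connection can time out even while the database server is far below its 512-connection limit. Our data access is meant to follow this pattern (an illustrative sketch, assuming the MySql.Data provider, not our actual code) so connections go back to the pool as soon as each query completes:

```csharp
// Illustrative sketch (not our actual code): open asynchronously, dispose as soon as the
// query completes, so the connection goes straight back to the client-side pool.
using System.Threading.Tasks;
using MySql.Data.MySqlClient;   // assuming the MySql.Data ADO.NET provider

public static class Db
{
    public static async Task<long> CountActiveUsersAsync(string connectionString)
    {
        using (var conn = new MySqlConnection(connectionString))
        using (var cmd = new MySqlCommand("SELECT COUNT(*) FROM users WHERE active = 1", conn))
        {
            await conn.OpenAsync();
            return (long)await cmd.ExecuteScalarAsync();   // COUNT(*) maps to a 64-bit integer
        }
    }
}
// Leaving the using scope returns the connection to the pool even if the query throws.
```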
Also, this app and version have been in production for months, and it wasn't until the traffic volume started to grow that we began seeing these issues. At this point I am at a loss for next troubleshooting steps and I'm seeking suggestions.
In the IIS App Pool advanced settings, set Start Mode to AlwaysRunning.
I never found a root cause for this issue; however, after updating to newer versions of .NET MVC the issue went away. My best guess is that changes to Kestrel resolved it, although I have no idea what specific change that might have been. I have gone through the change logs a few times and didn't see anything that specifically jumped out at me.
We have a legacy VB6 app which has started, from time to time, to hang. We thought it might be related to a shift to Citrix, but we can now replicate the behaviour on a thick client on Windows 10. We don't think we have seen this on earlier Windows versions, but we are still checking logs to confirm that.
We see the behaviour when tabbing into a text box and then tabbing out. As we pass through it, we make a simple ADO call to look up/validate some data in the text box. As part of the normal program flow we log:
“Opening Dataset: SELECT ... FROM ... ”
“Opened Dataset”
Between these two log statements is simple ADO data-retrieval code with which we have had no problems previously. It lives in an ActiveX DLL and runs synchronously. Most importantly, between these two log statements there is no DoEvents or API call that would yield control. As far as we can see, it should be a purely synchronous operation.
When the system hangs, which happens sporadically, we can see other logging statements appear between these two. They are either resource status (e.g. how much memory and how many GDI/USER objects, which would usually be logged because a timer has fired in the main form) or focus-type events, which aren't timer-driven, at least in our codebase.
“Opening Dataset: SELECT ... FROM ... ”
“Resource Status: ...”
“Opened Dataset”
or
“Opening Dataset: SELECT ... FROM ... ”
“TextItem.OnLostFocus Item1 ...”
“TextItem.Validate ...”
“TextItem.OnGotFocus Item2 ...”
“Opened Dataset”
So my initial question is: in what scenario can what should be a synchronous operation be interrupted and appear to act asynchronously?
For example (and we aren't doing this), I could imagine writing some unsafe code where a multimedia timer on another thread, given an AddressOf pointer to a function in one of our modules, initiates execution of our code outside the correct control flow. Other than something like that, I just can't see how synchronous VB6 code could be interrupted in this way.
I'd be really grateful for any thoughts, suggestions or advice. I'm sorry if this is so vague; it perhaps reflects how much I'm struggling to get my head around this problem.
Just to say, we tracked this down to Windows 10 plus an old (out of support) socket component we are using. It looks like it is pumping the message queue "at the wrong time" and hence we are seeing UI events appear in the middle of a synchronous process. We don't see this behaviour on earlier Windows versions.
I don't know what may have changed in Win10 which would result in this, but we obviously need to upgrade.
In our case we had a few long-running timers to pull status/changes from the DB, which caused this. We are using ADO with SQL Native Client and MARS, which worked great up until Windows 10, where intermittent lock-ups occurred. Logging and WinDbg confirmed this was happening when two requests were hitting the ADO connection at the same time. The error from ADO was "Unable to open a logical session", error number -2147467259, and it actually caused SQL Server 2014 (running on another machine) to block all other client queries from multiple different applications and machines until the locked-up app was killed. I could not replicate this in the IDE, as apparently that forces timers to work the way they always did. The fix was to make our ADO implementation async and put a connection manager on top of the SQL connections to force requestors to wait their turn (basically taking the Windows 10 async'd timer behaviour back out). The only performance impact was the additional few milliseconds of delaying a timer-fired SQL query when it collided with another query.
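The "connection manager" idea is roughly this (sketched in C# for brevity; the real fix lives in the VB6/ADO layer and the type names here are made up):

```csharp
// Conceptual sketch of the connection manager: one gate per ADO connection so that a
// timer-fired query and a UI-initiated query can never hit the connection at the same time.
using System;
using System.Threading;
using System.Threading.Tasks;

public class SerializedConnection
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);

    // Runs the supplied query only when no other caller is using the connection.
    public async Task<T> RunAsync<T>(Func<Task<T>> query)
    {
        await _gate.WaitAsync();        // colliding callers queue up here for a few milliseconds
        try
        {
            return await query();
        }
        finally
        {
            _gate.Release();            // next waiter proceeds
        }
    }
}
```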
I have a pre-configured GeoServer and can successfully start it up, but on odd occasions it takes a very long time to initialise gwc-gs.xml. It can take over 30 minutes before it initialises gwc-gs.xml and just hangs. Therefore, I tried to configure everything else except gwc-gs.xml and found a /gwc/rest/reload API call, which also failed because of a known bug in version 12.15 that I am using. However, I have also read on forums that the GWC reload API call is known to have no effect even after being called successfully. I would be very grateful if someone who has experienced similar issues could advise.
Sometimes, perhaps once every few hundred AJAX requests and/or when AJAX requests are firing in quick succession, I've seen a request hang for up to several minutes before it completes. The weird thing is that the request completes successfully, and neither my machine nor the server is really doing anything (e.g. CPU and other resources are not spiking during the "hang").
I've noticed this issue with various web services too, so it isn't just my own website. It also isn't database-related, as it has happened on non-database sites, and it only seems to show up in non-localhost environments.
When I use AJAX I am using jQuery, so this might be a jQuery issue. I mostly use Firefox, so I don't know whether this is just a Firefox issue or something any browser could hit. I've run into the issue on multiple computers in multiple locations.
If you have run into this issue before and "fixed" it, I would appreciate the solution you came up with.
Use an HTTP debugger like Firebug or Fiddler to track the AJAX requests: see how long each request takes (a possible timeout issue due to a server setting?) and what HTTP response status code it returns when it fails. Use the HTTP response status code to troubleshoot the issue.
I am using a modified version of the TaskCloud example to try to read/write my own data.
While testing a deployed version, I've noticed that the round-trip response time is slow.
From my Android device, I have a 100ms ping response to appspot.com.
I have changed the AppEngine application to do nothing (the Google Dashboard shows insignificant average latency).
The problem is that the time it takes for HttpClient's client.execute(post) is about 3 seconds.
(This is the time even when an instance is already loaded.)
Any suggestions would be greatly appreciated.
EDIT: I've watched the Google I/O video showing the CloudTasks Android-AppEngine app, and you can see that refreshing the list (a single call to AppEngine) takes about 3 seconds as well. The presenter says something about performance which I didn't fully get (debuggers running at both ends?).
The video: http://www.youtube.com/watch?v=M7SxNNC429U&feature=related
Time location: 0:46:45
I'll keep investigating...
Thanks for your help so far.
EDIT 2: Back to this issue...
I've used the Shark packet sniffer to find out what is happening. Some of the time is spent negotiating an SSL connection for each server call. Using HTTP (and ACSID) is faster than HTTPS (and SACSID).
new DefaultHttpClient() and new HttpPost() are used for each server call.
EDIT 3:
Looking at the sniffer logs again, there is an almost 2-second delay before the actual POST.
I have also found that the issue exists on Android 2.2 (all versions) but is resolved on Android 2.3.
EDIT 4: It's been resolved. Please see my answer below.
It's difficult to answer your question since no detail about your app is provided. Anyway, you can try the Appstats tool provided by Google to analyze the bottleneck.
After using the Shark sniffer, I was able to understand the exact issue and I've found the answer in this question.
I have used Liudvikas Bukys's comment and solved the problem using the suggested line:
// Disables the Expect: 100-continue handshake, which was behind the ~2-second delay before each POST.
post.getParams().setBooleanParameter(CoreProtocolPNames.USE_EXPECT_CONTINUE, false);
Often the first call to your GAE app will take longer than subsequent calls. You should make yourself familiar with loading and warm-up requests and how GAE handles instances of your app: http://code.google.com/intl/de-DE/appengine/docs/adminconsole/instances.html
Some things you could also try:
make your app handle more than one request per instance (make sure your app is threadsafe!) http://code.google.com/intl/de-DE/appengine/docs/java/config/appconfig.html#Using_Concurrent_Requests
enable the Always On feature in the app admin console (this will cost you)