wcf operation times out without error - performance

I have a .NET 3.5 BasicHttpBinding no security WCF service hosted on IIS 6.0.
I have service throttling bumped up as per MS recommendations.
My service operation is getting called a few hundreds of time conccurrently, and at some point the client gets a timeout exception (59:00, that's whats set in the server and client timeouts).
If I raise the timeout it just hits the new limit.
It seems like the application just "freezes" somewhere and we have not been able to figure out how this happens.
WCF tracing on the server side doesn't come up with anything.
Any ideas regarding what could be the issue?
Thanks

I assume your WebService is not using the new async/await especially wrt the database calls. In that case its because you are blocking your limited threads.
In more detail. IIS/ASP.net only creates a limited number of threads to handle requests. The first...say 8 requests spin up threads and start working. At some point they will hit the database (I am assuming a traditional n-tier app). Those threads sleep. The next say...992 requests hit IIS and are held in a queue.
At some point the database calls return, process stuff...send data to the client. Another 8 requests are dequeued...hit the database...etc...
However each set of 8 requests takes a finite time to complete. With over 900 requests ahead of them, the last 100 or so threads will take at the very least 100 * latency * number of roundtrips before they can start up. If your latency * number of roundtrips is high...your last request will take a long time before it even gets dequeued, hence the timeout.
Two remedies exists. The first, create more threads....will use up all your memory and your IIS crashes. The second is to use .net 4.5 and async/await.
See here for more information

Related

ASP.NET 5 Web API application intermittently unresponsive

We are working on an ASP.NET 5 Web API project that is in production now but we are experiencing an issue where it becomes unresponsive intermittently throughout the day.
A few notes about the application architecture. It is an ASP.NET Web API project using a MariaDB database on a separate EC2 instance within the same private network. The connection string uses the private IP of the database server to avoid any name resolution issues. The site is hosted via IIS 10.
The application itself has been developed carefully following the best practices provided by Microsoft. Heavy focus on async operations, minimizing query response times and offloading more expensive operations into background services.
The app is extremely responsive. It performs with sub 100ms responses on almost all requests, even the more complicated requests, and all the way up until it becomes unresponsive this high level of performance remains the same. We tend to see between 10-30 requests per second and 300-500 select queries per second at peak usage so not too extreme. However, randomly (2-3 times over a 24 hour period) it will begin hanging on requests and simply not respond to the request. During this time, the database is still extremely responsive and we are never over 300 connections out of our 512 connection limit.
The resources on the application server itself are never really taxed much at all. The CPU never gets above ~20% and the memory usage sits around 20-30%.
If I were to stop the site in IIS and start it again while this is happening, it will quickly come back online. If I don't it will be down for a few minutes until IIS finally kills it due to a failed health check. There are no real errors generated as a response to the issue other than typical errors caused by the hanging of the process such as connection terminated errors. The only thing I have seen before that gave me pause was the fact that there a few connection timeouts when getting the connection from the pool, but like I said, the connections to the server are never close to the limit.
Also, this app and version has been in production for months and it wasn't until the traffic volume started to grow that we started seeing these issues. At this point, I am at a loss for next steps of troubleshooting and I'm seeking suggestions.
In IIS App Pool advanced settings set Start Mode to AlwaysRunning
I never found a root cause for this issue, however, after updating to newer versions of .NET MVC this issue went away. My best guess is that changes with the Kestrel possibly resolved this issue, although, I have no idea what specific change that might have been. I have gone through the change logs a few times and didn't see anything that specifically jumped out at me.

Limit concurrent queries in Spring JPA

I have a simple rest endpoint that executes Postgres procedure.
This procedure returns the current state of device.
For example:
20 devices.
Client app connect to API and make 20 responses to that endpoint every second.
For x clients there are x*20 requests.
For 2 clients 40 requests.
It causes a big cpu load on Postgres server only if there are many clients and/or many devices.
I didn’t create it but I need to redesign it.
How to limit concurrent queries to db only for it? It would be a hot fix.
My second idea is to create background worker that executes queries only one in the same time. Then the endpoint fetches data from memory.
I would try the simple way first. Try to reduce
the amount of database connections in the pool OR
the amount of working threads in the build-in Tomcat.
More flexible option would be to put the logic behind a thread pool limiting the amount of working threads. It is not trivial, if the Spring context and database is used inside a worker. Take a look on a Spring annotation #Async.
Offtopic: The solution we are discussing here looks like a workaround. The discussed solution alone will most probably increase the throughput only by factor 2 maybe 3. It is not JEE conform and it will be most probably not very stable. It is better to refactor the application avoiding such a problem. Another option would be to buy a new database server.
Update: JEE compliant solution would be to implement some sort of bulkhead pattern. It will limit the amount of concurrent running requests and reject it, if the some critical number is reached. The server application answers with "503 Service Unavailable". The client application catches this status and retries a second later (see "exponential backoff").

WCF - WebHttpBinding - RESTful - Performance Issue

first time poster so go easy on me.
I am currently trying to address a performance issue when hitting my web service after a one minute period of inactivity. Literally after one minute of THAT user not hitting the web service then the next call will take 15 seconds before actually hitting the service operation. If you keep making random (not the same service operation just so you guys don't think it is "caching" the call) service operation calls the service returns immediately (less than a second).
Here are some "timings" I decided to take so you can see how I came to the one minute of inactivity:
2:04PM
2:16PM --15 seconds
2:21PM --15 seconds
2:24PM --15 seconds
2:25PM --15 seconds
Again, if you hit the web service continuously without a one minute period of inactivity then ALL methods will return in less than a second.
Here are some details regarding my web service:
WCF, WebHttpBinding, RESTful, using HTTPs.
Basic Authentication + Custom Authentication using IDispatchMessageInspector. Authentication happens with EVERY call (except to the Initializer.aspx page).
Custom Initialization.aspx page has been created which is called every night after the Application Pool is recycled. This page caches a bunch of global data used by all users along with starting that compile.
Application Pool ONLY recycles every night at 2AM. Worker threads are never killed off because timeout is disabled.
I heard about ReliableSession but as the setting implies that sounds like it would only work for PerSession, not PerCall.
Is there any way to resolve this or am I stuck to resorting to "pinging" the server every 45 seconds using a dummy service operation?
Found out the issue. We have multiple domain controllers. When the user was getting authenticated it would start from the forest level and work its way down to the actual domain controller that server resided on. The firewalls that were put in place were blocking all domain controllers except what the server resided on.
So basically, it would fail to communicate to the N+ domain controllers until it finally reached the only one it could.
You can fix this a number of ways but we just created firewall rules to allow the web server to communicate to the domain controller the users needed to be authenticated against.

Load testing WCF services gives huge (>200 sec) responses

I have a service being load tested by a third party. A few minutes after starting, we start to see requests hanging for a very long period of time and the caller ultimately times out (after 60 seconds).
They are testing with 15 users with each user using two devices at once, so a total of 30 connections.
The service is a simple façade to a more complex operation, calling an external system. Benchmarking our communications to the external system looks as though everything is responding in the time we would expect (sub 200ms).
The IIS logs reveals a bunch of very high requests (> 200sec) which ultimately do return a 200 and have Win32 error code ERROR_NETNAME_DELETD (error 64). I have checked the Service Log and can match up the response to the request (based on the SOAP message id) and can see that we do eventually respond with the correct information (although the client has long given up).
Any ideas as to what could be causing this behavior? We're hosting in IIS using wsHttpBinding and we're using WS-Security with x509 certificates (message & transport encryption).
We don't have benchmark logging inside of our service but the code is a very simple mapping of the WCF request to the server request, making the request, and mapping the response to the WCF response. We do this manually and there is no parsing involved (straight assignments).
After a detailed investigation, including getting Microsoft support involved we were hitting up against the serviceThrottling defaults, specifically the maxConcurrentSessions. We determined this from perfmon - there is a counter for this. We were unsure as to why we saw this as the service behaved when called with a .NET client.
It turns out that the Java consumer of this application, using CXF, was not respecting the WSDL (specifically the bit about WS-SecureConversation) and closing sessions out when it closed its connection.
Our solution was to jack up the maxConcurrentSessions to a high number, set the inactivityTimeout down low (to a minute) to force session abandonment. In addition, we set establishSecurityContext to false to avoid the WSS negotiation consuming an additional session.
The solution is inelegant as the service logs are littered with errors about forced session closures, but it fixed the issue we were seeing here. Unfortunately we had a requirement for WS-Security so our solution needed to stick with that.
I hope this helps someone as this was an interesting and time consuming problem to pin down.

Ajax-calls: big differences between server runtime and client waiting time

I have two REST endpoints driving some navigation in a web site. Both create nearly the same response, but one gets its data straight from the db whereas the other has to ask a search engine (solr) first to get some data and then do the db calls.
If i profile both endpoints via JProfiler i get a higher runtime (approx. 60%) for the second one (about 31ms vs. 53ms). That's as expected.
Profile result:
If i view the same ajax calls from the client side i get a very different picture.
The faster of the both calls takes about 146 ms waiting and network time
The slower of the both calls takes about 1.4 seconds waiting and network
Frontend timing is measured via chrome developer tools. The server is a tomcat 7.0.30 running in STS 3.2. Client and server live on the same system, db and solr are external so there should be no network latency between tomcat and the browser. As a side note: The faster response has the bigger payload (2.6 vs 4.5 kb).
I have no idea why the slower of the both calls takes about 60% more server time but in sum nearly 1000% more "frontend time".
The question is: Is there any way i can figure out where this timing differences originate?
By default, the CPU views in JProfiler show times in the "Runnable" thread state. If a thread reads data from a socket connection or waits for some condition, that time is not included in the "Runnable" thread state.
In the upper right corner of the CPU views there is a thread state selector. If you change that to "All states", you will get times that you can compare with the wall clock times from the browser.

Resources