We deployed our WebAPI as an azure website under the standard plan and have turned on Always On. After getting multiple memory and CPU alerts we decided on checking the logs via xyz.scm.azurewebsites.net. It seems Always ON has a high response time. Could this be causing high memory and CPU issues. Sometimes the alerts come when none is even using the system and auto resolve within 5 mins.
The always on feature only invokes the root of your web app every 5 minutes.
If this is causing high memory or cpu it could be a memory leak within your application because if you don't use the always on feature your process gets recycled on idle.
You should check what your app does if you invoke it with the root path and determine why this is causing high response time.
Related
We have a memory problem with our Azure Windows App Service Plan (service level is P1v3 with 1 instance – this means 8 GB memory).
We are running two small .NET 6 App Services on it (some web APIs), that use custom containers – without problems.
They’re not in production and receive a very low number of requests.
However, when looking at the service plan’s memory usage in Diagnose and Solve Problems / Memory Analysis, we see an unexplained 80% memory percent usage – in a stable way:
And the real problem occurs when we try to start a third app service on the plan. We get this "out of memory" error in our log stream :
ERROR - Site: app-name-dev - Unable to start container.
Error message: Docker API responded with status code=InternalServerError,
response={"message":"hcsshim::CreateComputeSystem xxxx:
The paging file is too small for this operation to complete."}
So it looks like docker doesn’t have enough mem to start the container. Maybe because of the 80% mem usage ?
But our apps actually have very low memory needs. When running them locally on dev machines, we see about 50-150M memory usage (when no requests occur).
In Azure, the private bytes graph in “availability and performance” shows very moderate consumption for the biggest app of the two:
Unfortunately, the “Memory drill down” is unavailable:
(needless to say, waiting hours doesn’t change the message…)
Even more strange, stopping all App Services of the App Service Plan still show a Memory Percentage of 60% in the Plan.
Obviously some memory is being retained by something...
So the questions are:
Is it normal to have 60% memory percentage in an App Service Plan with no App Services running ?
If not, could this be due to a memory leak in our app ? But app services are ran in supposedly isolated containers, so I'm not sure this is possible. Any other explanation is of course welcome :-)
Why can’t we access the memory drill down ?
Any tips on the best way to fit "small" Docker containers with low memory usage in Azure App Service ? (or maybe in another Azure resource type...). It's a bit frustrating to be able to use ony 3GB out of a 8GB machine...
Further details:
First app is a .NET 6 based app, with its docker image based on aspnet:6.0-nanoserver-ltsc2022
Second app is also a .NET 6 based app, but has some windows DLL dependencies, and therefore is based on aspnet:6.0-windowsservercore-ltsc2022
Thanks in advance!
EDIT:
I added more details and changed the questions a bit since I was able to stop all app services tonight.
We are working on an ASP.NET 5 Web API project that is in production now but we are experiencing an issue where it becomes unresponsive intermittently throughout the day.
A few notes about the application architecture. It is an ASP.NET Web API project using a MariaDB database on a separate EC2 instance within the same private network. The connection string uses the private IP of the database server to avoid any name resolution issues. The site is hosted via IIS 10.
The application itself has been developed carefully following the best practices provided by Microsoft. Heavy focus on async operations, minimizing query response times and offloading more expensive operations into background services.
The app is extremely responsive. It performs with sub 100ms responses on almost all requests, even the more complicated requests, and all the way up until it becomes unresponsive this high level of performance remains the same. We tend to see between 10-30 requests per second and 300-500 select queries per second at peak usage so not too extreme. However, randomly (2-3 times over a 24 hour period) it will begin hanging on requests and simply not respond to the request. During this time, the database is still extremely responsive and we are never over 300 connections out of our 512 connection limit.
The resources on the application server itself are never really taxed much at all. The CPU never gets above ~20% and the memory usage sits around 20-30%.
If I were to stop the site in IIS and start it again while this is happening, it will quickly come back online. If I don't it will be down for a few minutes until IIS finally kills it due to a failed health check. There are no real errors generated as a response to the issue other than typical errors caused by the hanging of the process such as connection terminated errors. The only thing I have seen before that gave me pause was the fact that there a few connection timeouts when getting the connection from the pool, but like I said, the connections to the server are never close to the limit.
Also, this app and version has been in production for months and it wasn't until the traffic volume started to grow that we started seeing these issues. At this point, I am at a loss for next steps of troubleshooting and I'm seeking suggestions.
In IIS App Pool advanced settings set Start Mode to AlwaysRunning
I never found a root cause for this issue, however, after updating to newer versions of .NET MVC this issue went away. My best guess is that changes with the Kestrel possibly resolved this issue, although, I have no idea what specific change that might have been. I have gone through the change logs a few times and didn't see anything that specifically jumped out at me.
I have deployed a simple Spring boot app in Google App Engine Flexible. The app. has two APIs, one to add the user data into the DB (xxx.appspot.com/add) the other to get all the user data from the DB (xxx.appspot.com/all).
I wanted to see how GAE scales for the load, hence used JMeter to create a load with 100 user concurrency ramped up in 10 seconds and calls these two APIs in half a second delay, forever. While it runs fine for sometime (with just one instance), it starts to fail after 30 seconds or so with a "java.net.SocketException" or "The server responded with a status of 502".
After this error, when I try to access the same API from the browser, it displays,
Error: Server Error
The server encountered a temporary error and could not complete your
request. Please try again in 30 seconds.
The service is back to normal after 30 mins or so, and whenever the load test happens it repeats the same behavior as mentioned above. I expect GAE to auto-scale based on the load coming in to handle it without any down time (using multiple instances), instead it just crashes or blocks the service (without any information in the log). My app.yaml configuration is,
runtime: java
env: flex
service: hello-service
automatic_scaling:
min_num_instances: 1
max_num_instances: 10
I am a bit stuck with this one, Any help would be greatly appreciated. Thanks in advance.
The solution was to increase the resource configuration, details below.
Given that I did not set a resource parameter, it defaulted to the pre-defined values for both CPU and Memory. In this case, the default
memory was set at 0.6GB. App Engine Flex instances uses about 0.4GB
for overhead processes. Given Java is known to consume higher memory, there is a
great likelihood that the overhead processes consumed more than the
approximate 0.4GB value. Now instances in App Engine are restarted due
to a variety of reasons including optimization due to memory use. This
explains why your instances went off and it shows Tomcat is starting
up (they got restarted) and ends up in 502 error due to the nginx is
not able to complete the request. Fixing the above may lessen if not completely eliminate the 502s.
After I have specified the resources attribute and increased the configuration in app.yaml 502 error seems to be gone.
I am using Windows Azure SDK 2.2 and have created an Azure cloud service that uses an in-role cache.
I have 2 instances of the service running under normal conditions.
When the services scales (up to 3 instances, or back down to 2 instances), I get lots of DataCacheExceptions. These are often accompanied by Azure db connection failures from the process going in inside the cache. (If I don't find the entry I want in the cache, I get it from the db and put it into the cache. All standard stuff.)
I have implemented retry processes on the cache gets and puts, and use the ReliableSqlConnection object with a retry process for db connection using the Transient Fault Handling application block.
The retry process uses a fixed interval retrying every second for 5 tries.
The failures are typically;
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later
Any idea why the scaling might cause these exceptions?
Should I try a less aggressive retry policy?
Any help appreciated.
I have also noticed that I am getting a high percentage (> 70%) cache miss rate and when the system is struggling, there is high cpu utilisation (> 80%).
Well, I haven't been able to find out any reason for the errors I am seeing, but I have 'fixed' the problem, sort of!
When looking at the last few days processing stats, it is clear the high cpu usage corresponds with the cloud service having 'problems'. I have changed the service to use two medium instances instead of two small instances.
This seems to have solved the problem, and the service has been running quite happily, low cpu usage, low memory usage, no exceptions.
So, whilst still not discovering what the source of the problems were, I seem to have overcome them by providing a bigger environment for the service to run in.
--Late news!!! I noticed this morning that from about 06:30, the cpu usage started to climb, along with the time taken for the service to process as it should. Errors started appearing and I had to restart the service at 10:30 to get things back to 'normal'. Also, when restarting the service, the OnRoleRun process threw loads of DataCacheExceptions before it started running again, 45 minutes later.
Now all seems well again, and I will monitor for the next hours/days...
There seems to be no explanation for this, remote desktop to the instances show no exceptions in the event log, other logging is not showing application problems, so I am still stumped.
I have ASP.NET MVC app which accept file uploads and has result pooling using SignalR. The app hosted on Prod server with IIS7, 4 Gb Ram and two cores CPU.
The app on Dev server works perfectly but when I host it on Prod server with about 50 000 users per day the app become unrresponsible after five minutes of running. The web page request time increase dramatically and it takes about 30 seconds to load one page. I have tried to record all MvcApplication.Application_BeginRequest event call and got 9000 hits in 5 minutes. Not sure is this acceptable number of hits or not for app like this.
I have used ANTS Performance Profiler(not useful in Prod app profiling, slow and eats all memory) to profile code but profiler do not show any time delay issues in my code/MSSQL queries.
Also I have tried to monitored CPU and RAM spike problems but I didn't find any. CPU percentage sometimes goes to 15% but never up and memory usage is normal.
I suspect that there is something wrong with request or threads limits in ASP.NET/IIS7 but don't know how to profile it.
Could someone suggest any profiling solutions which could help in this situation? Tried to hunt the problem for two week already without any result :(
You may try using the MiniProfiler and more specifically the MiniProfiler.MVC3 NuGet package which is specifically created for ASP.NET MVC applications. It will show you all kind of useful information such as the time spend for different methods in the execution of the request.