High memory and CPU consumption for Rails application on Google Cloud - Passenger

I have a Compute Engine instance on Google Cloud with a 4-core Ivy Bridge CPU and 15 GB RAM, and I have deployed my Rails application on it.
Before this I hosted my Rails application on DigitalOcean, where I was getting good throughput and CPU and memory consumption was minimal.
Memory consumption never crossed 3 GB on DigitalOcean, and CPU consumption peaked at around 50% - 55%.
On DigitalOcean I had a single instance with a 4-core CPU and 8 GB RAM. I was even running MySQL, Redis and Sidekiq on the same instance, and it could still handle the load easily.
But since moving to Google Cloud I have been facing problems with the same code.
I was actually expecting more throughput from Google Cloud, as Google has data centers in Asia, but I started facing issues instead.
When I restart Apache everything goes back to normal, but after 2 - 3 hours memory and CPU consumption climb again, and finally the instance stops responding to requests.
I checked the logs and there is not much increase in traffic. I also checked the logs from the high-load period to see whether someone is attacking the server.
But all the requests I found are from valid browsers with valid user agents.
I don't understand why this is happening.
At first I suspected a DDoS/DoS attack, but I didn't find anything suspicious in the logs (Apache access logs and Rails logs).
Please help me.
Hoping for some suggestions that I can try in order to debug the issue.
Thanks :)
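One way to narrow down which processes grow between Apache restarts is to snapshot per-process resident memory at intervals and compare the totals. The following is a minimal sketch in Python, assuming a Linux host where `ps -eo rss,comm` is available (RSS is reported in KiB). The function name and the sample text are illustrative and not taken from the question.

```python
from collections import defaultdict


def rss_by_command(ps_output: str) -> dict:
    """Sum resident set size (KiB) per command name.

    Expects text in the shape produced by `ps -eo rss,comm`:
    a header line followed by "<rss> <command>" rows.
    """
    totals = defaultdict(int)
    for line in ps_output.strip().splitlines():
        parts = line.split(None, 1)
        # Skip the header and any row that does not start with a number.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rss_kib, command = parts
        totals[command.strip()] += int(rss_kib)
    return dict(totals)


# Hypothetical sample, made up for illustration only.
SAMPLE = """\
  RSS COMMAND
51200 apache2
204800 ruby
307200 ruby
10240 sidekiq
"""

if __name__ == "__main__":
    ranked = sorted(rss_by_command(SAMPLE).items(), key=lambda kv: -kv[1])
    for command, kib in ranked:
        print(f"{command:10s} {kib / 1024:8.1f} MiB")
```

Running this every few minutes against real `ps` output and logging the result would show whether the growth sits in the Ruby application processes, in Apache itself, or somewhere else.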

Related

Unexplained memory usage on Azure Windows App Service Plan - Drill down missing

We have a memory problem with our Azure Windows App Service Plan (service level is P1v3 with 1 instance – this means 8 GB memory).
We are running two small .NET 6 App Services on it (some web APIs), that use custom containers – without problems.
They’re not in production and receive a very low number of requests.
However, when looking at the service plan’s memory usage in Diagnose and Solve Problems / Memory Analysis, we see an unexplained and stable memory usage of 80%.
And the real problem occurs when we try to start a third app service on the plan. We get this "out of memory" error in our log stream:
ERROR - Site: app-name-dev - Unable to start container.
Error message: Docker API responded with status code=InternalServerError,
response={"message":"hcsshim::CreateComputeSystem xxxx:
The paging file is too small for this operation to complete."}
So it looks like Docker doesn’t have enough memory to start the container, maybe because of the 80% memory usage?
But our apps actually have very low memory needs. When running them locally on dev machines, we see about 50-150 MB memory usage (when no requests occur).
In Azure, the private bytes graph in “availability and performance” shows very moderate consumption for the biggest app of the two.
Unfortunately, the “Memory drill down” is unavailable.
(needless to say, waiting hours doesn’t change the message…)
Even stranger, with all App Services of the App Service Plan stopped, the Plan still shows a Memory Percentage of 60%.
Obviously some memory is being retained by something...
So the questions are:
Is it normal to have 60% memory percentage in an App Service Plan with no App Services running?
If not, could this be due to a memory leak in our app? But app services are run in supposedly isolated containers, so I'm not sure this is possible. Any other explanation is of course welcome :-)
Why can’t we access the memory drill down?
Any tips on the best way to fit "small" Docker containers with low memory usage in Azure App Service? (Or maybe in another Azure resource type...) It's a bit frustrating to be able to use only 3 GB out of an 8 GB machine...
Further details:
First app is a .NET 6 based app, with its docker image based on aspnet:6.0-nanoserver-ltsc2022
Second app is also a .NET 6 based app, but has some Windows DLL dependencies, and therefore is based on aspnet:6.0-windowsservercore-ltsc2022
Thanks in advance!
EDIT:
I added more details and changed the questions a bit since I was able to stop all app services tonight.

Azure Website Kudu HTMLLog Analysis shows Always On with high response time

We deployed our WebAPI as an Azure website under the standard plan and have turned on Always On. After getting multiple memory and CPU alerts, we decided to check the logs via xyz.scm.azurewebsites.net. It seems Always On has a high response time. Could this be causing the high memory and CPU issues? Sometimes the alerts come when no one is even using the system, and they auto-resolve within 5 minutes.
The Always On feature only invokes the root of your web app every 5 minutes.
If this is causing high memory or CPU usage, it could be a memory leak within your application, because if you don't use the Always On feature your process gets recycled on idle.
You should check what your app does if you invoke it with the root path and determine why this is causing high response time.
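To see how the root path behaves outside of the Always On ping, you could time a few requests to it yourself. The following is a minimal sketch in Python using only the standard library. The URL is a placeholder assumed from the site name in the question, and the helper names are illustrative.

```python
import time
import urllib.request
from typing import Callable, List


def time_calls(fetch: Callable[[], object], attempts: int = 3) -> List[float]:
    """Call `fetch` repeatedly and return the wall-clock duration of each call."""
    durations = []
    for _ in range(attempts):
        start = time.perf_counter()
        fetch()
        durations.append(time.perf_counter() - start)
    return durations


def fetch_root(url: str = "https://xyz.azurewebsites.net/") -> bytes:
    """Download the root page. The default URL is a placeholder, not a real site."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


# Usage (performs real HTTP requests, so it is left commented out):
# for seconds in time_calls(fetch_root):
#     print(f"{seconds:.2f} s")
```

If the root request is consistently slow even when nobody else is using the system, the work done by the root action itself is the first place to look.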

Constant CPU usage and periodical API requests on Compute Engine Instance

Recently I deployed a Compute Engine instance created from the LAMP template.
A few days after deployment I started to see constant CPU usage (~8%) and also periodic API requests (every ~30 seconds).
I have not performed any API activity (haven't created any applications) and I see ZERO CPU usage inside the VM.
Any ideas what is happening?
The API requests were resolved by redeploying the whole machine, but the CPU usage still remains around 8%.

Azure in role cache exceptions when service scales

I am using Windows Azure SDK 2.2 and have created an Azure cloud service that uses an in-role cache.
I have 2 instances of the service running under normal conditions.
When the service scales (up to 3 instances, or back down to 2 instances), I get lots of DataCacheExceptions. These are often accompanied by Azure db connection failures from the process going on inside the cache lookup. (If I don't find the entry I want in the cache, I get it from the db and put it into the cache. All standard stuff.)
I have implemented retry processes on the cache gets and puts, and use the ReliableSqlConnection object with a retry process for db connection using the Transient Fault Handling application block.
The retry process uses a fixed interval retrying every second for 5 tries.
The failures are typically:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode:SubStatus:There is a temporary failure. Please retry later
Any idea why the scaling might cause these exceptions?
Should I try a less aggressive retry policy?
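For comparison with the fixed one-second interval described above, a less aggressive policy is exponential backoff with jitter. The following is a minimal sketch in Python, not the Transient Fault Handling application block's API. The function name, parameters, and defaults are all assumptions chosen for illustration.

```python
import random
import time


def retry_with_backoff(operation, attempts=5, base=1.0, cap=30.0,
                       sleep=time.sleep, rng=random.random):
    """Run `operation`, retrying on failure with exponential backoff and full jitter.

    The wait before retry number n is a random value between 0 and
    min(cap, base * 2**n) seconds, so retries from several instances
    spread out instead of arriving together every second.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # In real code, catch only the errors known to be transient.
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            window = min(cap, base * 2 ** attempt)
            sleep(rng() * window)
```

With the defaults, the wait windows grow as 1, 2, 4 and 8 seconds, instead of a steady stream of retries while the cache cluster is still rebalancing.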
Any help appreciated.
I have also noticed that I am getting a high percentage (> 70%) cache miss rate and when the system is struggling, there is high cpu utilisation (> 80%).
Well, I haven't been able to find out any reason for the errors I am seeing, but I have 'fixed' the problem, sort of!
When looking at the last few days' processing stats, it is clear the high CPU usage corresponds with the cloud service having 'problems'. I have changed the service to use two medium instances instead of two small instances.
This seems to have solved the problem, and the service has been running quite happily, low cpu usage, low memory usage, no exceptions.
So, whilst still not discovering what the source of the problems was, I seem to have overcome them by providing a bigger environment for the service to run in.
--Late news!!! I noticed this morning that from about 06:30, the cpu usage started to climb, along with the time taken for the service to process as it should. Errors started appearing and I had to restart the service at 10:30 to get things back to 'normal'. Also, when restarting the service, the OnRoleRun process threw loads of DataCacheExceptions before it started running again, 45 minutes later.
Now all seems well again, and I will monitor for the next hours/days...
There seems to be no explanation for this. Remote desktop to the instances shows no exceptions in the event log, and other logging is not showing application problems, so I am still stumped.

ASP.NET MVC lost in finding bottleneck

I have an ASP.NET MVC app which accepts file uploads and has result polling using SignalR. The app is hosted on a Prod server with IIS7, 4 GB RAM and a two-core CPU.
The app works perfectly on the Dev server, but when I host it on the Prod server, with about 50,000 users per day, it becomes unresponsive after five minutes of running. Page request time increases dramatically, and it takes about 30 seconds to load one page. I recorded every MvcApplication.Application_BeginRequest event call and got 9000 hits in 5 minutes. I'm not sure whether this is an acceptable number of hits for an app like this.
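To put those numbers in perspective, the hit count can be turned into a request rate, and Little's law (requests in flight = arrival rate × time per request) gives a rough idea of how many requests would be open at once. The following is a minimal sketch in Python. It assumes the rate stays steady while pages take 30 seconds, which may not hold once the server starts rejecting or queueing requests.

```python
def requests_per_second(hits: int, window_seconds: float) -> float:
    """Average arrival rate over the measurement window."""
    return hits / window_seconds


def requests_in_flight(rate_per_second: float, seconds_per_request: float) -> float:
    """Little's law: average number of requests being served at the same time."""
    return rate_per_second * seconds_per_request


rate = requests_per_second(9000, 5 * 60)   # 9000 hits in 5 minutes
in_flight = requests_in_flight(rate, 30)   # 30 seconds per page

print(f"{rate:.0f} requests/s, about {in_flight:.0f} requests in flight")
```

Under these assumptions that is about 30 requests per second and roughly 900 requests open at once, which is the kind of figure to compare against request queue and thread pool limits.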
I have used ANTS Performance Profiler (not useful for profiling the Prod app: it is slow and eats all the memory) to profile the code, but the profiler does not show any time delay issues in my code or MSSQL queries.
I have also tried to monitor for CPU and RAM spikes but didn't find any. CPU usage sometimes goes up to 15% but never higher, and memory usage is normal.
I suspect that there is something wrong with the request or thread limits in ASP.NET/IIS7, but I don't know how to profile that.
Could someone suggest any profiling solutions which could help in this situation? I have been hunting this problem for two weeks already without any result :(
You may try using MiniProfiler, and more specifically the MiniProfiler.MVC3 NuGet package, which was created specifically for ASP.NET MVC applications. It will show you all kinds of useful information, such as the time spent in different methods during the execution of the request.

Resources