Panels to have in Kibana dashboard for troubleshooting applications - elasticsearch

What are some good panels to have in a Kibana dashboard for developers to troubleshoot issues in applications? I am trying to create a dashboard that developers could use to pinpoint where the app is having issues, so that they can resolve it. These are a few factors that I have considered:
CPU usage of the pod, memory usage of the pod, network in and out, and application logs are the ones I have in mind. Are there any other panels I could add so that developers get an idea of where to check if something goes wrong in the app?
For example, application slowness could be caused by high CPU consumption, the app going down could be caused by an OOM kill, slow requests could be due to latency or cache issues, etc. Is there anything else I should take into consideration? If so, please suggest.

So here are a few things that we could add:
Number of pods, deployments, daemonsets and statefulsets present in the cluster
CPU utilised by each pod (pod-wise breakdown)
Memory utilised by each pod (pod-wise breakdown)
Network in/out
Top memory/CPU-consuming pods and nodes (see the query sketch after this list)
Latency
Persistent disk details
Error logs as annotations in TSVB
Log streams to check logs within the dashboard
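For the "top memory/CPU-consuming pods" panel, the metrics usually come from Metricbeat's Kubernetes module. As a rough, hedged sketch of the kind of aggregation such a panel runs, here is a Python example; the metricbeat-* index pattern and the field names kubernetes.pod.name and kubernetes.pod.cpu.usage.node.pct are assumptions and may differ depending on your Metricbeat version and setup.

```python
# Sketch: top CPU-consuming pods over the last 15 minutes, roughly what a
# Lens/TSVB "top N" panel computes. Index pattern and field names are
# assumptions -- adjust them to whatever your Metricbeat mapping contains.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # replace with your cluster URL

query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
    "aggs": {
        "by_pod": {
            "terms": {
                "field": "kubernetes.pod.name",
                "size": 10,
                "order": {"avg_cpu": "desc"},
            },
            "aggs": {
                "avg_cpu": {"avg": {"field": "kubernetes.pod.cpu.usage.node.pct"}}
            },
        }
    },
}

resp = es.search(index="metricbeat-*", body=query)
for bucket in resp["aggregations"]["by_pod"]["buckets"]:
    print(f'{bucket["key"]}: {bucket["avg_cpu"]["value"]:.2%} of node CPU')
```

The same query shape, with the average taken over a memory field such as kubernetes.pod.memory.usage.bytes instead, would drive the top-memory panel.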

Related

Losing Provenance records in Apache NiFi

We work with a lot of data and have a high throughput of files going through our NiFi instances. We have recently been losing provenance records and don't understand what the cause is.
Below are some details, if relevant:
We have our provenance repository on its own drive in the cloud, and are not seeing any high IO usage or resource contention.
We have added additional threads to this, as well as 999k file handles.
If it means anything, provenance data is kept for 2 weeks in our configuration.
We are on NiFi version 1.15.3, but are planning an upgrade in the near future.
Any ideas on what the cause may be and how to remediate this? Thanks!

Unexplained memory usage on Azure Windows App Service Plan - Drill down missing

We have a memory problem with our Azure Windows App Service Plan (service level is P1v3 with 1 instance – this means 8 GB memory).
We are running two small .NET 6 App Services on it (some web APIs), that use custom containers – without problems.
They’re not in production and receive a very low number of requests.
However, when looking at the service plan's memory usage in Diagnose and Solve Problems / Memory Analysis, we see an unexplained and stable memory usage of about 80%:
And the real problem occurs when we try to start a third app service on the plan. We get this "out of memory" error in our log stream :
ERROR - Site: app-name-dev - Unable to start container.
Error message: Docker API responded with status code=InternalServerError,
response={"message":"hcsshim::CreateComputeSystem xxxx:
The paging file is too small for this operation to complete."}
So it looks like Docker doesn't have enough memory to start the container, maybe because of the 80% memory usage?
But our apps actually have very low memory needs. When running them locally on dev machines, we see about 50-150 MB memory usage (when no requests occur).
In Azure, the private bytes graph in “availability and performance” shows very moderate consumption for the bigger of the two apps:
Unfortunately, the “Memory drill down” is unavailable:
(needless to say, waiting hours doesn’t change the message…)
Even stranger, after stopping all App Services in the App Service Plan, the plan still shows a memory percentage of 60%.
Obviously some memory is being retained by something...
So the questions are:
Is it normal to have a 60% memory percentage in an App Service Plan with no App Services running?
If not, could this be due to a memory leak in our app? But App Services run in supposedly isolated containers, so I'm not sure this is possible. Any other explanation is of course welcome :-)
Why can't we access the memory drill down?
Any tips on the best way to fit "small" Docker containers with low memory usage into Azure App Service (or maybe into another Azure resource type)? It's a bit frustrating to only be able to use 3 GB out of an 8 GB machine...
Further details:
First app is a .NET 6 based app, with its docker image based on aspnet:6.0-nanoserver-ltsc2022
Second app is also a .NET 6 based app, but has some windows DLL dependencies, and therefore is based on aspnet:6.0-windowsservercore-ltsc2022
Thanks in advance!
EDIT:
I added more details and changed the questions a bit since I was able to stop all app services tonight.

Ubuntu server CPU utilisation increasing very quickly after installing ELK

I installed Elasticsearch, Logstash and Kibana on an Ubuntu server. Before starting these services the CPU utilization was less than 5%, and within a minute of starting them the CPU utilization crossed 85%. I don't know why this is happening. Can anyone help me with this issue?
Thanks in advance.
There is not enough information in your question to give you a specific answer, but I will point out a few possible scenarios and how to deal with them.
Did you wait long enough? Sometimes there is a warm-up phase that consumes more CPU until all services are registered and have finished booting. On a fairly small machine this can consume more CPU and take longer to finish.
Folder write permissions. If any of the ELK components fails because of restricted access to directories it needs (for logging, creating subfolders for sincedb files, and so on), it can end up in an infinite loop, retrying again and again while consuming high CPU.
Connection issues. ES should be the first component to start; if it fails, Kibana and Logstash will keep trying to connect to ES until they succeed, which can cause high CPU.
Bad Logstash configuration. If Logstash fails to read the configured input file, or the parsing is excessive (for example, the first "match" in your filter section is the least common pattern), it might consume high CPU.
For further investigation:
I suggest you do not start all of them together. Start ES first; if everything goes well, start Kibana, and lastly start Logstash (a simple readiness check is sketched after this answer).
Check the logs of all the ELK components for error messages, failures, etc.
For a better answer I will need the YAML configuration of all 3 components (ES, Kibana, Logstash), as well as the Logstash pipeline configuration file.
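As a hedged illustration of the "start ES first" suggestion above, here is a minimal Python sketch that polls the cluster health endpoint and only starts the other two services once Elasticsearch reports yellow or green. The localhost:9200 URL, the absence of authentication, and the systemd service names kibana and logstash are assumptions about your setup.

```python
# Sketch: wait for Elasticsearch to be healthy before starting Kibana and Logstash.
# Assumes ES listens on localhost:9200 without auth and that the services are
# managed by systemd under the names "kibana" and "logstash" -- adjust as needed.
import subprocess
import time

import requests

def wait_for_es(url="http://localhost:9200/_cluster/health", timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            status = requests.get(url, timeout=5).json().get("status")
            if status in ("yellow", "green"):
                print(f"Elasticsearch is up (cluster status: {status})")
                return True
        except requests.RequestException:
            pass  # ES not reachable yet, keep polling
        time.sleep(10)
    return False

if wait_for_es():
    subprocess.run(["sudo", "systemctl", "start", "kibana"], check=True)
    subprocess.run(["sudo", "systemctl", "start", "logstash"], check=True)
else:
    print("Elasticsearch did not become healthy; check its logs first.")
```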
I would recommend analysing the CPU cycles consumed by each of the Elasticsearch, Logstash and Kibana processes.
Check specifically which of these processes is consuming the most memory/CPU, via the top command for example (a small script equivalent of this check is sketched below).
Start only ES first and allow it to settle and the node to start completely before starting Kibana, and maybe Logstash after that.
Send me the logs for each and I can assist if there are any errors.
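For the "which process consumes the most CPU/memory" check, here is a small top-like sketch in Python, assuming the processes can be recognised by the strings elasticsearch, logstash or kibana in their name or command line and that the psutil package is installed.

```python
# Sketch: a "top"-like check of CPU and memory for the ELK processes.
# Assumes they can be identified by "elasticsearch", "logstash" or "kibana"
# appearing in the process name or command line; requires the psutil package.
import time

import psutil

TARGETS = ("elasticsearch", "logstash", "kibana")

procs = []
for p in psutil.process_iter(["name", "cmdline"]):
    text = " ".join([p.info["name"] or ""] + (p.info["cmdline"] or [])).lower()
    if any(t in text for t in TARGETS):
        p.cpu_percent(None)  # first call only initialises the per-process counter
        procs.append(p)

time.sleep(1.0)  # sample window so the next cpu_percent() call returns a real delta

for p in procs:
    try:
        cpu = p.cpu_percent(None)
        rss_mb = p.memory_info().rss / 1024 / 1024
        print(f"pid={p.pid:<8} cpu={cpu:5.1f}%  rss={rss_mb:8.1f} MB  {p.info['name']}")
    except psutil.NoSuchProcess:
        pass  # process exited between sampling and reporting
```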

Azure Website Kudu HTMLLog Analysis shows Always On with high response time

We deployed our WebAPI as an Azure website under the Standard plan and have turned on Always On. After getting multiple memory and CPU alerts we decided to check the logs via xyz.scm.azurewebsites.net. It seems Always On has a high response time. Could this be causing the high memory and CPU issues? Sometimes the alerts come when no one is even using the system and auto-resolve within 5 minutes.
The Always On feature only invokes the root of your web app every 5 minutes.
If this is causing high memory or CPU usage, it could be a memory leak within your application, because without Always On your process gets recycled when idle.
You should check what your app does when it is invoked on the root path and determine why that causes the high response time.
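To see what that keep-alive request experiences, you could time a few requests against the root path yourself. Below is a minimal sketch; https://app-name-dev.azurewebsites.net/ is only a placeholder for your site's URL.

```python
# Sketch: mimic the Always On ping -- request the root path and time it,
# to see what the platform's keep-alive request actually experiences.
import time

import requests

URL = "https://app-name-dev.azurewebsites.net/"  # placeholder, replace with your app

for i in range(5):
    start = time.perf_counter()
    resp = requests.get(URL, timeout=60)
    elapsed = time.perf_counter() - start
    print(f"attempt {i + 1}: status={resp.status_code} time={elapsed:.2f}s")
    time.sleep(5)
```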

High memory and CPU consumption for rails application on google cloud

I have a Compute Engine instance on Google Cloud with a 4-core Ivy Bridge CPU and 15 GB RAM, and I have deployed my Rails application on it.
Before this I hosted my Rails application on DigitalOcean, where I was getting good throughput and the CPU and memory consumption was minimal.
Memory consumption never crossed 3 GB on DigitalOcean, and CPU consumption peaked around 50-55%.
On DigitalOcean I had a single instance with a 4-core CPU and 8 GB RAM, and even though I was running MySQL, Redis and Sidekiq on the same instance, it could handle the load easily.
But when I moved to Google Cloud I started facing problems with the same code.
Actually I was expecting more throughput from Google Cloud, as Google has data centers in Asia, but instead I started facing this issue.
When I restart Apache everything comes back to normal, and then after 2-3 hours it again keeps consuming memory and CPU until finally the instance stops responding to requests.
I checked the logs and there is no significant increase in traffic; I also checked the logs during the high-load periods to see whether someone was attacking the servers.
But all the requests I found are from valid browsers with valid user agents.
I don't understand why this is happening.
At first I thought it might be a DDoS/DoS attack, but I didn't find anything suspicious in the logs (Apache access logs and Rails logs).
Please help me.
Hoping for some good solution that I can try and debug the issue.
Thanks :)
