High CPU Utilization issue on Moqui instance - amazon-ec2

We are facing a problem of random high CPU utilization on our production server, which makes the application unresponsive and forces us to restart it. We have done initial diagnostics but could not reach a conclusion.
We are using the following configuration for the production server:
Amazon EC2 m4.large (8GB RAM), Ubuntu 14.04 LTS
Amazon RDS t2.small (2GB RAM), MySQL database
Java heap size: -Xms2048M -Xmx4096M (see the example launch command below)
Database connection pool size: minimum 20, maximum 150
MaxThreads: 100
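For reference, heap settings like these are typically passed on the JVM command line when the application is launched. A hypothetical launch command (the moqui.war name and path are assumptions; adjust to your actual start script) might look like:
java -Xms2048M -Xmx4096M -jar moqui.war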
The two results below are from the top command.
1) At 6:52:50 PM
KiB Mem : 8173968 total, 2100304 free, 4116436 used, 1957228 buff/cache
KiB Swap: 1048572 total, 1047676 free, 896 used. 3628092 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20698 root 20 0 6967736 3.827g 21808 S 3.0 49.1 6:52.50 java
2) At 6:53:36 PM
KiB Mem : 8173968 total, 2099000 free, 4116964 used, 1958004 buff/cache
KiB Swap: 1048572 total, 1047676 free, 896 used. 3627512 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20698 root 20 0 6967736 3.828g 21808 S 200.0 49.1 6:53.36 java
Note:
Number of Concurrent users - 5 or 6 (at this time)
Number of requests between 6:52:50 PM and 6:53:36 PM - 4
The results show that CPU utilization increases drastically.
Any suggestions or directions that could lead to a solution?
Additionally, the following is the CPU utilization graph for the last week.
Thanks!

Without seeing a stack trace, I'd guess the problem is likely Jetty, as there are recently documented Jetty bugs that cause the behaviour you describe on EC2 (a Google search will turn them up). I would recommend taking a couple of stack trace dumps while the CPU is at 100% to confirm it is Jetty; if it is, the Jetty documentation on the bug should tell you whether you simply need to update Jetty.
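For example, using the Java PID from the top output above (20698), you could capture a few thread dumps a couple of seconds apart while the CPU is pegged and compare them:
jstack -l 20698 > /tmp/threads-$(date +%s).txt
# or, if jstack is unavailable, send SIGQUIT so the JVM prints the dump to its console log:
kill -3 20698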

Related

Need help understanding my ElasticSearch Cluster Health

When querying my cluster, I noticed these stats for one of the nodes. I am new to Elastic and would like the community's help in understanding what they mean and whether I need to take any corrective measures.
Does the heap used look on the higher side, and if so, how would I rectify it? Any comments on the system memory used would also be helpful; it feels like it is on the really high side as well.
These are the JVM-level stats:
JVM
Version OpenJDK 64-Bit Server VM (1.8.0_171)
Process ID 13735
Heap Used % 64%
Heap Used/Max 22 GB / 34.2 GB
GC Collections (Old/Young) 1 / 46,372
Threads (Peak/Max) 163 / 147
These are the OS-level stats:
Operating System
System Memory Used % 90%
System Memory Used 59.4 GB / 65.8 GB
Allocated Processors 16
Available Processors 16
OS Name Linux
OS Architecture amd64
As you state that you are new to Elasticsearch, I suggest you go through the cluster API as well as the cat API; you can find the documentation for the cluster API and the cat API online.
They will help you understand these stats in more depth.
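As a starting point, these are the kinds of calls that suggestion refers to (assuming Elasticsearch is listening on localhost:9200; the available columns vary by version):
curl -s 'localhost:9200/_cluster/health?pretty'
curl -s 'localhost:9200/_cat/nodes?v'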

pprof usage and interpretation

We believe our go app has a memory leak.
To find out what's going on, we are trying pprof, but we are having a hard time understanding the readings.
When connecting with go tool pprof http://localhost:6060/debug/pprof/heap?debug=1, a sample output is:
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 17608.45kB, 100% of 17608.45kB total
Showing top 10 nodes out of 67
flat flat% sum% cum cum%
12292.12kB 69.81% 69.81% 12292.12kB 69.81% github.com/acct/repo/vendor/github.com/.../funcA /../github.com/acct/repo/vendor/github.com/../fileA.go
1543.14kB 8.76% 78.57% 1543.14kB 8.76% github.com/acct/repo/../funcB /../github.com/acct/repo/fileB.go
1064.52kB 6.05% 84.62% 1064.52kB 6.05% github.com/acct/repo/vendor/github.com/../funcC /../github.com/acct/repo/vendor/github.com/fileC.go
858.34kB 4.87% 89.49% 858.34kB 4.87% github.com/acct/repo/vendor/golang.org/x/tools/imports.init /../github.com/acct/repo/vendor/golang.org/x/tools/imports/zstdlib.go
809.97kB 4.60% 94.09% 809.97kB 4.60% bytes.makeSlice /usr/lib/go/src/bytes/buffer.go
528.17kB 3.00% 97.09% 528.17kB 3.00% regexp.(*bitState).reset /usr/lib/go/src/regexp/backtrack.go
(Please forgive the clumsy obfuscation)
We interpret funcA as consuming nearly 70% of the memory in use, but that is only around 12MB.
However, top shows:
top - 18:09:44 up 2:02, 1 user, load average: 0,75, 0,56, 0,38
Tasks: 166 total, 1 running, 165 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3,7 us, 1,6 sy, 0,0 ni, 94,3 id, 0,0 wa, 0,0 hi, 0,3 si, 0,0 st
KiB Mem : 16318684 total, 14116728 free, 1004804 used, 1197152 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14451260 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4902 me 20 0 1,371g 0,096g 0,016g S 12,9 0,6 1:58.14 mybin
which suggests 1.371GB of memory used... where has it gone?
Also, the pprof docs are quite sparse, and we are having difficulty even understanding how the tool should be used. Our binary is a daemon. For example:
If we start a reading with go tool pprof http://localhost:6060/debug/pprof/heap, is this a one-time reading at this particular time, or an aggregate over time?
Sometimes running text again later in interactive mode seems to report the same values. Are we actually looking at the same data? Do we need to restart go tool pprof... to get fresh values?
Is it a reading of the complete heap, of some specific goroutine, or of a specific point in the stack?
Finally, is this interpretation correct (from http://localhost:6060/debug/pprof/):
/debug/pprof/
profiles:
0 block
64 goroutine
45 heap
0 mutex
13 threadcreate
The binary has 64 open goroutines and a total of 45MB of heap memory?
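For reference, here is a minimal sketch of how such a profiling endpoint is typically exposed in a Go daemon (assuming the standard net/http/pprof package; your setup may differ). Each HTTP request to /debug/pprof/heap produces a fresh snapshot of live heap allocations, so a new go tool pprof invocation sees current data rather than an aggregate:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on a local-only port alongside the daemon's real work.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the daemon's actual work loop
}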

Should the stats reported by Go's runtime.ReadMemStats approximately equal the resident memory set reported by ps aux?

In Go, should the "Sys" stat, or any other stat or combination of stats reported by runtime.ReadMemStats, approximately equal the resident set size (RSS) reported by ps aux?
Alternatively, assuming some memory may be swapped out, should the Sys stat be approximately greater than or equal to the RSS?
We have a long-running web service that deals with a high frequency of requests and we are finding that the RSS quickly climbs up to consume virtually all of the 64GB memory on our servers. When it hits ~85% we begin to experience considerable degradation in our response times and in how many concurrent requests we can handle. The run I've listed below is after about 20 hours of execution, and is already at 51% memory usage.
I'm trying to determine if the likely cause is a memory leak (we make some calls to CGO). The data seems to indicate that it is, but before I go down that rabbit hole I want to rule out a fundamental misunderstanding of the statistics I'm using to make that call.
This is an amd64 build targeting linux and executing on CentOS.
Reported by runtime.ReadMemStats:
Alloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed
Sys: 3686471104 bytes (3515.69MB) // bytes obtained from system (sum of XxxSys below)
HeapAlloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed (same as Alloc above)
HeapSys: 3104931840 bytes (2961.09MB) // bytes obtained from system
HeapIdle: 1672339456 bytes (1594.87MB) // bytes in idle spans
HeapInuse: 1432592384 bytes (1366.23MB) // bytes in non-idle span
Reported by ps aux:
%CPU %MEM VSZ RSS
1362 51.3 306936436 33742120
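For context, here is a minimal sketch of how the figures above can be collected in-process (assuming the standard runtime package; the exact field set varies slightly across Go versions):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Sys is what the Go runtime has reserved from the OS; the RSS reported by ps
	// also includes memory not managed by the Go runtime (e.g. CGO allocations).
	fmt.Printf("Alloc=%d Sys=%d\n", m.Alloc, m.Sys)
	fmt.Printf("HeapAlloc=%d HeapSys=%d HeapIdle=%d HeapInuse=%d HeapReleased=%d\n",
		m.HeapAlloc, m.HeapSys, m.HeapIdle, m.HeapInuse, m.HeapReleased)
}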

How much resource is reserved on a mesos-slave

How does mesos-slave calculate its available resources? In the web UI, mesos-master shows 2.9G of memory available on a slave, but when I run "free -m":
free -m
total used free shared buffers cached
Mem: 3953 2391 1562 0 1158 771
-/+ buffers/cache: 461 3491
Swap: 4095 43 4052
and the --resources parameter was not given.
I want to know how the Mesos scheduler calculates the available resources.
The function that calculates the available resources offered by slaves can be seen here; in particular, the memory part is lines 98 to 114.
If the machine has more than 2GB of RAM, Mesos will offer total - Gigabytes(1). In your case the machine has ~4GB, which is why you're seeing ~3GB in the Web UI.
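As a rough sketch of that heuristic in Go (the actual Mesos source is C++, and the small-machine branch below is a placeholder; check the linked function for the exact rules):

package main

import "fmt"

const gbInMB = 1024

// offeredMemMB sketches the rule described above: on machines with more than 2GB
// of RAM, Mesos offers total minus 1GB, leaving the remainder for the OS and agent.
func offeredMemMB(totalMB int) int {
	if totalMB > 2*gbInMB {
		return totalMB - gbInMB
	}
	return totalMB // placeholder; the real small-machine rule differs
}

func main() {
	// The ~4GB machine from the question ends up offering ~2.9GB, matching the Web UI.
	fmt.Println(offeredMemMB(3953))
}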

Elasticsearch High CPU When Idle

I'm fairly new to Elasticsearch, and I've bumped into an issue that I'm having difficulty even troubleshooting. My Elasticsearch (1.1.1) node is currently spiking the CPU even though no searching or indexing is going on. CPU usage isn't always at 100%, but it jumps up there quite a bit and the load is very high.
Previously, the indices on this node ran perfectly fine for months without any issue. This just started today and I have no idea what's causing it.
The problem persists even after I restart ES, and in pure desperation I even restarted the server; it had no effect on the issue.
Here are some stats to help frame the issue, but I'd imagine there's more information that's needed. I'm just not sure what to provide.
Elasticsearch 1.1.1
Gentoo Linux 3.12.13
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.7) (Gentoo build 1.6.0_27-b27)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
One node, 5 shards, 0 replicas
32GB RAM on system, 16GB Dedicated to Elasticsearch
RAM does not appear to be the issue here.
Any tips on troubleshooting the issue would be appreciated.
Edit: Info from top if it's helpful at all.
top - 19:56:56 up 3:22, 2 users, load average: 10.62, 11.15, 9.37
Tasks: 123 total, 1 running, 122 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98.5 us, 0.6 sy, 0.0 ni, 0.7 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32881532 total, 31714120 used, 1167412 free, 187744 buffers
KiB Swap: 4194300 total, 0 used, 4194300 free, 12615280 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2531 elastic+ 20 0 0.385t 0.020t 3.388g S 791.9 64.9 706:00.21 java
As Andy Pryor mentioned, the background merging might have been what was causing the issue. Our index rollover had been paused and two of our current indices were over 200GB. Rolling them over appeared to have resolved the issue and we've been humming along just fine since.
Edit:
The high load while seemingly idle was caused by merges on several very large indices that were not being rolled over weekly, due to a failure of an internal process. After addressing this oversight, merge times were short and the high load subsided.
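For anyone troubleshooting the same symptom, the nodes hot threads API is a quick way to confirm that merge threads (rather than search or indexing) are consuming the CPU, and the cat indices API shows which indices have grown unusually large (assuming Elasticsearch is listening on localhost:9200):
curl -s 'localhost:9200/_nodes/hot_threads'
curl -s 'localhost:9200/_cat/indices?v'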
