Elasticsearch High CPU When Idle - elasticsearch

I'm fairly new to Elasticsearch and I've bumped into an issue that I'm having difficulty even troubleshooting. My Elasticsearch node (1.1.1) is currently spiking the CPU even though no searching or indexing is going on. CPU usage isn't always at 100%, but it jumps up there quite a bit and load is very high.
Previously, the indices on this node ran perfectly fine for months without any issue. This just started today and I have no idea what's causing it.
The problem persists even after I restart ES, and I even restarted the server out of pure desperation. No effect on the issue.
Here are some stats to help frame the issue, but I imagine more information is needed; I'm just not sure what to provide.
Elasticsearch 1.1.1
Gentoo Linux 3.12.13
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.7) (Gentoo build 1.6.0_27-b27)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
One node, 5 shards, 0 replicas
32GB RAM on system, 16GB Dedicated to Elasticsearch
RAM does not appear to be the issue here.
Any tips on troubleshooting the issue would be appreciated.
Edit: Info from top if it's helpful at all.
top - 19:56:56 up 3:22, 2 users, load average: 10.62, 11.15, 9.37
Tasks: 123 total, 1 running, 122 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98.5 us, 0.6 sy, 0.0 ni, 0.7 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32881532 total, 31714120 used, 1167412 free, 187744 buffers
KiB Swap: 4194300 total, 0 used, 4194300 free, 12615280 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2531 elastic+ 20 0 0.385t 0.020t 3.388g S 791.9 64.9 706:00.21 java

As Andy Pryor mentioned, the background merging might have been what was causing the issue. Our index rollover had been paused and two of our current indices were over 200GB. Rolling them over appeared to have resolved the issue and we've been humming along just fine since.
Edit:
The high load while seemingly idle turned out to be caused by merges on several very large indices that were no longer being rolled over; the internal process that should have rolled them over weekly had failed. After addressing this oversight, merge times were short and the high load subsided.
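For anyone hitting the same symptoms, the checks that surfaced the merge activity looked roughly like this; a minimal sketch, assuming the node listens on the default HTTP port 9200 (all three endpoints exist in the 1.x line):
# See what the busy threads are actually doing; long-running merges show up here
$ curl -s 'localhost:9200/_nodes/hot_threads'
# Index sizes and doc counts; our two 200GB+ indices stood out immediately
$ curl -s 'localhost:9200/_cat/indices?v'
# Node stats include a "merges" section with the current merge count and total merge time
$ curl -s 'localhost:9200/_nodes/stats?pretty' | grep -A 8 '"merges"'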

Related

Need help understanding my ElasticSearch Cluster Health

When querying my cluster, I noticed these stats for one of my nodes. I'm new to Elastic and would like the community's help in understanding what they mean and whether I need to take any corrective measures.
Does the heap used look on the higher side, and if so, how would I rectify it? Also, any comments on the system memory used would be helpful; it feels like it's on the really high side as well.
These are the JVM-level stats:
JVM
Version OpenJDK 64-Bit Server VM (1.8.0_171)
Process ID 13735
Heap Used % 64%
Heap Used/Max 22 GB / 34.2 GB
GC Collections (Old/Young) 1 / 46,372
Threads (Peak/Max) 163 / 147
These are the OS-level stats:
Operating System
System Memory Used % 90%
System Memory Used 59.4 GB / 65.8 GB
Allocated Processors 16
Available Processors 16
OS Name Linux
OS Architecture amd64
Since you state that you are new to Elasticsearch, I'd suggest going through the cluster APIs as well as the cat APIs; you can find them in the documentation under "cluster API" and "cat API".
This will help you understand these stats in more depth.
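For example, a minimal sketch (assuming the node is reachable on localhost:9200; the _cat column names are as in recent releases):
# Overall cluster status, shard counts, pending tasks
$ curl -s 'localhost:9200/_cluster/health?pretty'
# Per-node heap, RAM and CPU at a glance
$ curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu'
# The JVM and OS numbers you pasted come from here
$ curl -s 'localhost:9200/_nodes/stats/jvm,os?pretty'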

High CPU Utilization issue on Moqui instance

We are facing a problem of random high CPU utilization on our production server, which makes the application unresponsive and forces us to restart it. We have done an initial round of diagnostics but couldn't reach a conclusion.
We are using the following configuration for the production server:
Amazon EC2, 8 GB RAM (m4.large), Ubuntu 14.04 LTS
Amazon RDS, 2 GB RAM (t2.small), MySQL database
Java heap size: -Xms2048M -Xmx4096M
Database connection pool size: minimum 20, maximum 150
MaxThreads 100
Below are two results from the top command:
1) At 6:52:50 PM
KiB Mem : 8173968 total, 2100304 free, 4116436 used, 1957228 buff/cache
KiB Swap: 1048572 total, 1047676 free, 896 used. 3628092 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20698 root 20 0 6967736 3.827g 21808 S 3.0 49.1 6:52.50 java
2) At 6:53:36 PM
KiB Mem : 8173968 total, 2099000 free, 4116964 used, 1958004 buff/cache
KiB Swap: 1048572 total, 1047676 free, 896 used. 3627512 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20698 root 20 0 6967736 3.828g 21808 S 200.0 49.1 6:53.36 java
Note:
Number of Concurrent users - 5 or 6 (at this time)
Number of requests between 6:52:50 PM and 6:53:36 PM - 4
The results show that CPU utilization increased drastically.
Any suggestions or pointers that could lead to a solution?
Additionally, following is the cpu utilization graph for last week.
Thanks!
Without seeing a stack trace, I'd guess that the problem is likely Jetty, as there have been recently documented bugs in Jetty causing the behaviour you describe on EC2 (a Google search will turn them up). I would recommend you take a couple of thread dumps while the CPU is at 100% to confirm it is Jetty; if it is, the Jetty documentation on the bug may show that you simply need to update Jetty.
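A minimal sketch of capturing those dumps while the CPU is pegged (assuming the JDK tools are on the PATH; 20698 is the Java PID from your top output, and the thread id below is just a placeholder):
# Per-thread CPU inside the JVM; note the PID column (thread id) of the hottest threads
$ top -H -p 20698
# Convert a hot thread id to hex so it can be matched against "nid=0x..." in the dump
$ TID=12345   # replace with the hot thread id reported by top -H
$ printf '%x\n' "$TID"
# Take two or three dumps ~10 seconds apart while the CPU is at 100%
$ jstack -l 20698 > dump-$(date +%s).txt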

Issues with consistent speed when using lein test

Disclaimer: I am running this on a mid-2012 MacBook Air (i7-3667U, 8 GB RAM) with the 64-bit JVM.
Running the test suite for an application with lein t is, to my mind, abnormally slow. Most of the tests involve MongoDB (creating and dropping collections), so I assumed the bottleneck was database I/O and moved to MongoDB Enterprise, which allows running the storage engine in memory.
with a mongo.conf of:
storage:
  engine: inMemory
  dbPath: /Users/beoliver/data/testdb
  inMemory:
    engineConfig:
      inMemorySizeGB: 1
mongod is started with the flag --config ~/path/to/mongo.conf
I added the java flags to the project
:jvm-opts ["-XX:-OmitStackTraceInFastThrow" "-Xmx4g" "-Xms1g"]
to try and avoid extra swaps.
This appeared to fix the issue and the tests ran as:
time lein t
...
lein t 238.71s user 8.72s system 59% cpu 6:57.92 total
This is reasonable compared with the results from other team members.
But then, re-running the tests, the speed is back to the original (around the half-hour mark).
lein t 252.53s user 13.76s system 16% cpu 26:52.45 total
CPU usage peaks at about 50% but for the most part sits below 5% (including stretches where it idles below 1%).
Real memory size: 1.55 GB
Virtual memory size : 8.08 GB
Shared Memory Size: 18.0 MB
Private Memory Size : 1.67 GB
Has anyone had similar experiences? Suggestions? Is there a good way of profiling this, better than staring at Activity Monitor?
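One way to narrow it down before reaching for a full profiler; a sketch, assuming the test JVM and mongod are both local:
# Find the JVM running the tests
$ jps -l
# Watch GC while the tests run; constant full GCs would explain a slow, low-CPU run
$ jstat -gcutil <pid-from-jps> 1000
# Check whether the machine is waiting on disk rather than CPU (macOS syntax; on Linux use: iostat -x 1)
$ iostat -w 1
# Confirm mongod actually picked up the in-memory engine
$ mongo --eval 'printjson(db.serverStatus().storageEngine)'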

Adobe Experience Manager (AEM), Java garbage collection tuning and memory management

I am currently using Adobe Experience Manager for a client's site (Java). It uses OpenJDK:
#java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el6_5-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
It is running on Rackspace with the following:
vCPU: 4
Memory: 16GB
Guest OS: Red Hat Enterprise Linux 6 (64-bit)
Since it has been in production, I have been experiencing very slow performance from the application. It goes like this: I launch the app, everything is smooth, then 3 to 4 days later the CPU usage spikes to 400% (~4000 users/day hit the site). I got a few OOM exceptions (1 or 2), but mostly the site was exceptionally slow and never actually threw an OOM. Since I am a novice at Java memory management, I started reading about how it works and found tools like jstat. When the system was overwhelmed the second time around, I ran:
#top
Got the PID of the Java process, pressed Shift+H, and noted the PIDs of the threads with high CPU percentage. Then I ran
#sudo -u aem jstack <PID>
Got a thread dump, converted the thread PIDs I had written down to hex, and searched for them in the dump. After all that, I finally found that it was, not surprisingly, the garbage collector that was flipping out for some reason.
I started reading a lot about Java GC tuning, came up with the following Java options, and restarted the application with them:
java
-Dcom.day.crx.persistence.tar.IndexMergeDelay=0
-Djackrabbit.maxQueuedEvents=1000000
-Djava.io.tmpdir=/srv/aem/tmp/
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/srv/aem/tmp/
-Xms8192m -Xmx8192m
-XX:PermSize=256m
-XX:MaxPermSize=1024m
-XX:+UseParallelGC
-XX:+UseParallelOldGC
-XX:ParallelGCThreads=4
-XX:NewRatio=1
-Djava.awt.headless=true
-server
-Dsling.run.modes=publish
-jar crx-quickstart/app/cq-quickstart-6.0.0-standalone.jar start
-c crx-quickstart -i launchpad -p 4503
-Dsling.properties=conf/sling.properties
And it looks like it is performing much better but I think that it probably needs more GC tuning.
When I run:
#sudo -u aem jstat -gcutil <PID>
I get this:
S0 S1 E O P YGC YGCT FGC FGCT GCT
0.00 0.00 55.97 100.00 45.09 4725 521.233 505 4179.584 4700.817
This is 4 days after restarting it.
When I run:
#sudo -u aem jstat -gccapacity <PID>
I get this:
NGCMN NGCMX NGC S0C S1C EC
4194304.0 4194304.0 4194304.0 272896.0 279040.0 3636224.0
OGCMN OGCMX OGC OC PGCMN PGCMX
4194304.0 4194304.0 4194304.0 4194304.0 262144.0 1048576.0
PGC PC YGC FGC
262144.0 262144.0 4725 509
This is 4 days after restarting it.
These results are much better than when I started, but I think it can get even better. I'm not really sure what to do next as I'm no GC pro, so I was wondering if you have any tips or advice on how I could get better app/GC performance, and whether anything obvious stands out, like the ratios and sizes of the young and old generations.
How should I set the survivor and Eden sizes/ratios?
Should I change the GC type, e.g. use CMS or G1?
How should I proceed?
Any advice would be helpful.
Best,
Nicola
The young-to-old generation ratio is typically around 1:3, but it can vary depending on how much the application allocates short-lived versus long-lived objects. If short-lived objects dominate, the young space can be enlarged, for example to 2:3 (young:old). The reason for increasing the ratio is to avoid frequent scavenge (young-generation) GC cycles: when many short-lived objects are allocated, the young space fills quickly and triggers scavenge GC cycles, which in turn hurt application performance. Increasing the ratio beyond its current value can reduce the number of scavenge GC cycles. When the young space is increased, the survivor and Eden spaces grow with it.
The CMS policy is used to reduce application pause times, while the G1 policy targets large heaps with high throughput. The GC policy can be changed based on the needs of the application.
Recommended use cases for G1:
The first focus of G1 is to provide a solution for users running applications that require large heaps with limited GC latency, meaning heap sizes of around 6 GB or larger and stable, predictable pause times below 0.5 seconds.
Since you use an 8 GB heap, you can test the G1 GC policy in the same environment to check GC performance.
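A hedged sketch of how the startup line above might change to try G1 on this Java 7 build: the ParallelGC and NewRatio flags are dropped because G1 sizes the generations itself, and the pause-time target of 200 ms is just an example value.
java \
  -Dcom.day.crx.persistence.tar.IndexMergeDelay=0 \
  -Djackrabbit.maxQueuedEvents=1000000 \
  -Djava.io.tmpdir=/srv/aem/tmp/ \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/srv/aem/tmp/ \
  -Xms8192m -Xmx8192m \
  -XX:PermSize=256m \
  -XX:MaxPermSize=1024m \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -Djava.awt.headless=true \
  -server \
  -Dsling.run.modes=publish \
  -jar crx-quickstart/app/cq-quickstart-6.0.0-standalone.jar start \
  -c crx-quickstart -i launchpad -p 4503 \
  -Dsling.properties=conf/sling.properties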

Can't compile using GCC on EC2

I am trying to compile a program using GCC on an AWS EC2 instance (c1.medium). The cc1plus processes start correctly, but after a while they stop using any CPU, the whole compilation slows down, and it never finishes.
In top I can see that the "wa" stat increases drastically at the same time as the compilation slows down.
Initially:
%Cpu(s): 88.1 us, 5.4 sy, 0.0 ni, 0.0 id, 0.5 wa, 0.0 hi, 0.0 si, 6.0 st
When the compilation processes slow down:
%Cpu(s): 0.2 us, 0.3 sy, 0.0 ni, 50.2 id, 49.3 wa, 0.0 hi, 0.0 si, 0.0 st
I have tried a lot of different instance types, all with the same result.
As I understand it, a high wa/iowait means a slow disk. I have therefore also tried compiling the application on different mounts in the EC2 instance, but this did not result in any improvement.
Does anyone have any experience in compiling c/c++ applications on EC2 and know how to solve this problem?
UPDATE 2013-03-06 08:00
As requested in the comments:
$ gcc --version
gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2
The solution was to use a machine with more than 8 GB of RAM. Apparently GCC used a lot of RAM for compiling this specific program.
Glad to see you found the solution yourself.
I've also noticed you can get this sort of hang-up behavior on a micro instance when doing processor-heavy operations such as compiling code. Always do that kind of work on at least a small instance, then, if necessary, switch back to a micro when you are done.
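If anyone else hits this, a quick sketch for confirming memory pressure during the build and working around it on a RAM-starved instance (the swap file path and size are just examples):
# Watch free memory, swap activity (si/so) and iowait while the compile runs
$ vmstat 1
# Reducing parallel build jobs also lowers peak memory use
$ make -j1
# Temporary workaround: add a swap file
$ sudo dd if=/dev/zero of=/var/swapfile bs=1M count=4096
$ sudo chmod 600 /var/swapfile
$ sudo mkswap /var/swapfile
$ sudo swapon /var/swapfile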
