Elasticsearch load is not distributed evenly

I am facing a strange issue with Elasticsearch. I have 8 nodes with the same configuration (16 GB RAM and an 8-core CPU).
One node, "es53node6", always has a high load, as shown in the screenshot below. Also, 5-6 nodes kept stopping automatically yesterday, every 3-4 hours.
What could be the reason?
ES version: 5.3

There can be a fair share of reasons. Maybe all data is stored on that node (which should not happen by default), or maybe you are sending all requests to that single node.
Also, there is no automatic stopping built into Elasticsearch. You can configure Elasticsearch to stop the JVM process when an out-of-memory exception occurs, but this is not enabled by default, as it relies on a more recent JVM.
You can use the hot threads API to check where the CPU time is spent in Elasticsearch.
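For example, a minimal sketch of querying the hot threads and shard allocation APIs with Python (the node name and HTTP endpoint are assumptions; adjust for your cluster):

```python
import requests

ES = "http://es53node6:9200"  # assumption: the busy node's HTTP endpoint

# Hot threads for the overloaded node only; the response is plain text
# describing where CPU time is currently being spent.
hot = requests.get(f"{ES}/_nodes/es53node6/hot_threads", params={"threads": 5})
print(hot.text)

# Shard/disk allocation per node, to rule out all data sitting on one node.
alloc = requests.get(f"{ES}/_cat/allocation", params={"v": ""})
print(alloc.text)
```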

Related

GC Overhead limit exceeded in WSO2 EI 6.6

In WSO2 EI 6.6, a proxy stopped working abruptly. Upon analysis, we observed the error "GC Overhead limit exceeded" in the WSO2 Carbon log; after this error, nothing happens in the EI.
The proxy logic fetches data from a SQL Server table, builds an XML message, and sends it to an external API. The proxy runs at a 5-minute interval, and in each interval a maximum of 5 records is pushed to the API.
After restarting the WSO2 Carbon services, the proxy starts working again. Currently we restart the services every 3 days to avoid this issue.
I need to know how to identify the underlying issue and resolve it.
This means the JVM has run out of allocated memory. There can be many reasons for this. For example, if you haven't allocated enough memory to the JVM, you can easily run out of memory. If that's not the case, you need to analyze a memory dump and see what's occupying the memory and causing it to fill up.
Generally, when you see the mentioned error, the JVM automatically creates a heap dump (heap-dump.hprof) in the <EI_HOME>/repository/logs directory. You can try analyzing that dump to find the root cause. If the server doesn't generate a heap dump, manually take one when memory usage is higher than the expected level and analyze it.
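If the server does not produce a dump on its own, here is a rough sketch of capturing one manually with the JDK's jmap tool (the process id and output path are placeholders):

```python
import subprocess

# Assumption: the WSO2 EI JVM's process id, e.g. found via `jps -l`.
pid = "12345"
dump_file = "/tmp/ei-heap-dump.hprof"

# jmap ships with the JDK; "-dump:live" triggers a GC first and then writes a binary
# heap dump that can be opened in Eclipse MAT or VisualVM to see what fills the heap.
subprocess.run(["jmap", f"-dump:live,format=b,file={dump_file}", pid], check=True)
print(f"Heap dump written to {dump_file}")
```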

Ubuntu server CPU utilisation increasing very quickly after installing ELK

I installed Elasticsearch, Logstash, and Kibana on an Ubuntu server. Before starting these services the CPU utilization was less than 5%, and within a minute of starting them it crossed 85%. I don't know why this is happening. Can anyone help me with this issue?
Thanks in advance.
There is not enough information in your question to give you a specific answer, but I will point out a few possible scenarios and how to deal with them.
Did you wait long enough? Sometimes there is a warm-up period that consumes more CPU until all services are registered and finish booting. On a fairly small machine this can take longer and consume more CPU.
Folder write permissions. If any of the ELK components fails due to restricted access to directories it needs (for logging, creating subfolders for sincedb files, and so on), it can go into an endless retry loop that consumes high CPU.
Connection issues. ES should be the first component to start; if it fails, Kibana and Logstash will keep trying to connect to ES until they succeed, which can cause high CPU.
Bad Logstash configuration. If Logstash fails to read the files referenced in its configuration, or if you have bad or excessive parsing (for example, the first "match" in your filter section matches the least common pattern), it might consume high CPU.
For further investigation:
I suggest you do not start all of them together. Start ES first; if everything goes well, start Kibana and lastly Logstash (see the sketch after this list).
Check the logs of all the ELK components for error messages, failures, etc.
For a better answer I would need the YAML configuration of all 3 components (ES, Kibana, Logstash).
I would also need the Logstash pipeline configuration file.
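As a rough sketch of the "start ES first" advice, something like the following can poll Elasticsearch until it is healthy before Kibana and Logstash are started (the endpoint and timings are assumptions):

```python
import time
import requests

ES = "http://localhost:9200"  # assumption: default local Elasticsearch endpoint

# Poll cluster health until Elasticsearch reports at least "yellow" before starting
# Kibana and Logstash, so they don't spin retrying a connection and burning CPU.
for _ in range(60):
    try:
        health = requests.get(f"{ES}/_cluster/health", timeout=5).json()
        if health.get("status") in ("yellow", "green"):
            print("Elasticsearch is up:", health["status"])
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    print("Elasticsearch did not come up; check its logs before starting Kibana/Logstash")
```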
I would recommend analysing the CPU cycles consumed by each of the Elasticsearch, Logstash, and Kibana processes.
Check specifically which of these processes is consuming the most memory/CPU, via the top command for example (a small sketch follows below).
Start only ES first and allow it to settle and the node to start completely before starting Kibana, and maybe Logstash after that.
Send me the logs for each and I can assist if there are any errors.
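If you prefer scripting this over reading top by hand, here is a rough sketch using the psutil package (the process name filters are assumptions; Elasticsearch and Logstash usually show up as `java`, Kibana as `node`):

```python
import time
import psutil

# Rough snapshot of CPU/memory per process; the name filters are assumptions.
TARGETS = ("elasticsearch", "logstash", "kibana", "java", "node")

procs = [p for p in psutil.process_iter(["name"])
         if any(t in (p.info["name"] or "").lower() for t in TARGETS)]

for p in procs:          # prime cpu_percent, which measures since the previous call
    p.cpu_percent(None)
time.sleep(2)

for p in procs:
    try:
        print(f"{p.pid:>7} {p.info['name']:<15} "
              f"cpu={p.cpu_percent(None):5.1f}% mem={p.memory_percent():5.1f}%")
    except psutil.NoSuchProcess:
        pass
```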

Analytics server needing high application storage memory

Our application went live just a few months back. We have configured 2 mobile analytics servers with 8 GB of RAM and 50 GB of SAN space each. We have observed that the Analytics server is using a huge amount of SAN space; it is already 85% consumed on each server. Here are a few more details on how it is configured:
Number of Active Shards 24
Number of Nodes 2
Number of Data Nodes 2
MFP version : Server version: 7.1.0.00.20160801-2314
I have also noticed that the document count is a huge number, almost 500K, and it is taking 28 GB of memory.
Is this expected, or is this some sort of configuration issue? Is there any way to clean up and release some memory?
Elasticsearch (on which MobileFirst Operational Analytics is built) is very memory-intensive, and the memory usage is a function of how much data you have stored. 500K documents is not very much in the grand scheme of things, but the amount of SAN space and memory that uses depends on what is in the documents. You don't mention what version (and iFix level) of MobileFirst Platform Foundation you're using, and it's difficult to guide you without knowing that information.

But, as a start, if you are collecting server logs in Operational Analytics, I'd recommend you stop doing that unless you truly need the server logs in Operational Analytics for some reason: in your application runtime, set the JNDI property "wl.analytics.logs.forward" to "false" (assuming you're using MobileFirst Platform Foundation 7.1).

Then, in the Analytics Dashboard, set the TTL value for "TTL_ServerLogs" to a very small value, and check the box to apply the TTL value to existing documents (to do this, you must be running a more recent iFix level of MobileFirst Platform Foundation, as older builds didn't include this checkbox; again, assuming you're using 7.1). This should purge the existing server logs, which should free up some memory and SAN space. While you're in that panel, you may wish to set the other TTL values to a value appropriate for your environment, if you have not done so already.
If you're running something older than 7.1, or the build you're running doesn't have the checkbox to apply TTL values retroactively, the process to purge existing data is more complicated - in that case, please open a PMR and the support team can guide you.
If you can't purge data (e.g., if you have to keep collecting server logs, or save old data for a long time), you should add additional nodes to your Elasticsearch cluster to distribute the load over the additional nodes, so the resource utilization of each node will be less.
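If you want to see exactly which indices are consuming the space before and after the purge, here is a rough sketch against the underlying Elasticsearch HTTP API (the host and port are assumptions, and the embedded Elasticsearch in MobileFirst may not expose HTTP at all in your setup):

```python
import requests

ES = "http://localhost:9200"  # assumption: the analytics Elasticsearch HTTP endpoint

# List every index with its document count and on-disk size, so you can see
# whether the server-log indices dominate the SAN usage.
resp = requests.get(f"{ES}/_cat/indices", params={"v": ""})
print(resp.text)
```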

Grails Quartz job performance degrades / slows over time

We have a situation where we have a Grails 2.3.11-based application that uses Quartz (version 2.2.1 / Grails Quartz plugin version 1.0.2) jobs to run certain long-running processes (1-5 minutes) in the background so that a polling service allows the browser to fetch the progress. This is used primarily for import and export of data from the application. For example, when the application first starts, the export of 200,000+ rows takes approx 2 minutes. The following day the export takes 3+ minutes. The third day the export takes more than 6 minutes.
We have narrowed the problem down to just the Quartz jobs. When the system is in the degraded state, all other web pages respond with nearly identical response times as when the system is in optimal condition. It appears that the Quartz jobs tend to slow down linearly or incrementally over a period of 2 to 3 days. This may be related to usage or to elapsed time; we are uncertain which.
We are familiar with the memory leak bug reported by Burt Beckwith and added the fix to our code. We were experiencing the memory leak before, but now memory management appears to be healthy, even when the job performance is 5-10x slower than normal.
The jobs use GORM for most of the queries. We've optimized some to use criteria with projections so they are lightweight, but we haven't been able to change all the logic over, so there are still a number of GORM objects. In the case of the exports we've changed the queries to be read-only. The logic also clears out the Hibernate session appropriately to limit the number of objects in memory.
Here are a few additional details:
The level-2 cache is disabled
Running on Tomcat 7
Using MySQL 5.6
Using Java JDK 1.7.0_72
Linux
System I/O, swapping and CPU are all well within reason
Heap and Permgen memory usage is reasonable
Garbage collection is stable and reasonably consistent based on usage levels
The issue occurs even when there is only a single user
We have done periodic stack/thread dump analysis
We have been profiling the application with xRebel (development) and AppDynamics (production) as well we have Java Melody installed into the application
We had this problem with Grails 1.3.8 but recently upgraded to 2.3.11, which may have exacerbated the problem.
Any suggestions would be greatly appreciated.
Thanks,
John

MarkLogic latency: Document not found

I am working in a clustered MarkLogic environment where we have 10 nodes. All nodes are shared E&D (evaluator and data) nodes.
The problem that we are facing:
When a page is written to MarkLogic, it takes some time (up to 3 seconds) for all the nodes in the cluster to get updated, and if during this time I do a read operation to fetch the previously written page, it is not found.
Has anyone experienced this latency issue and looked at eliminating it? If so, please let me know.
Thanks
It's normal for a new document to only appear after the database transaction commits. But it is not normal for a commit to take 3 seconds.
Which version of MarkLogic Server?
Which OS and version?
Can you describe the hardware configuration?
How large are these documents? All other things equal, update time should be proportional to document size.
Can you reproduce this with a standalone host? That should eliminate cluster-related network latency from the transaction, which might tell you something. Possibly your cluster network has problems, or possibly one or more of the hosts has problems.
If you can reproduce the problem with a standalone host, use system monitoring to see what that host is doing at the time. On Linux I favor something like iostat -Mxz 5 and top, but other tools can also help. The problem could be disk I/O, though it would have to be really slow to result in 3-second commits. Or it might be that your servers are low on RAM, so they are paging during the commit phase.
If you can't reproduce it with a standalone host, then I think you'll have to run similar system monitoring on all the hosts in the cluster. That's harder, but for 10 hosts it is just barely manageable.
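To put a number on the read-after-write delay, here is a rough sketch using the MarkLogic REST API (host names, port, credentials, and the REST app server itself are all assumptions about your environment):

```python
import time
import requests
from requests.auth import HTTPDigestAuth

AUTH = HTTPDigestAuth("admin", "admin")          # assumption: credentials
WRITE = "http://node1:8000/v1/documents"         # assumption: REST API app servers
READ = "http://node2:8000/v1/documents"          # read from a different host in the cluster
URI = "/latency-test/doc1.xml"

# Write the document through one node ...
requests.put(WRITE, params={"uri": URI}, data="<page>test</page>",
             headers={"Content-Type": "application/xml"}, auth=AUTH).raise_for_status()

# ... then time how long until a read through another node sees it.
start = time.time()
for _ in range(100):
    r = requests.get(READ, params={"uri": URI}, auth=AUTH)
    if r.status_code == 200:
        print(f"document visible after {time.time() - start:.2f}s")
        break
    time.sleep(0.1)
else:
    print("document still not visible after ~10s")
```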
