What is OS Load in Elasticsearch node stat? - elasticsearch

In the Elasticsearch node stats API, when I send a query for OS stats:
curl -XGET "http://esls1.ping-service.com:9200/_nodes/stats/os"
In the response I get a metric load_average:
"load_average": [0,0.04,0.13]
What does it mean?

That is the currently calculated average load of the system, and how it is obtained is specific to the operating system Elasticsearch is installed on.
ES uses Sigar to get this kind of information. The three numbers are the average load calculated over 1-minute, 5-minute and 15-minute intervals.
For Linux, for example, Sigar reads /proc/loadavg to get this information from the system. You can find more about this specific calculation in this SO post.
For AIX, Sigar uses the perfstat_cpu_total subroutine, if I'm not mistaken, to get the same information.

Sigar has not been used in Elasticsearch since the first beta of 2.0.0: github.com/elastic/elasticsearch/pull/12010 github.com/elastic/elasticsearch/issues/11034
Since then, Elasticsearch reports generic OS load metrics, similar to what you see with the top command. See here for an explanation of what this means: https://askubuntu.com/questions/532845/what-is-system-load
Beware: this means that if you run ES in a Docker container, the load shown will actually be that of the host machine, not of the Docker container alone!
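One quick way to see where these numbers come from on a Linux host (a rough sketch; the host URL and the 1.x-style array shown above are assumptions, and newer ES versions format the load field differently):
curl -s "http://esls1.ping-service.com:9200/_nodes/stats/os" | grep -o '"load_average":[^]]*]'
cat /proc/loadavg    # the kernel's own 1-, 5- and 15-minute load averages
uptime               # the same three averages, as also shown by top
If ES runs inside Docker, the values it reports will match the host's /proc/loadavg rather than the container's own activity, which is the caveat above.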

Related

Datadog monitoring Disk usage

I want to use Datadog for monitoring my EC2 instance disk utilization and create alerts for it. I am using the system.disk.in_use metric, but I am not getting my root mount point in the "from" section (avg:system.disk.in_use{device:/dev/loop0} by {host}), and my root mount point is /dev/root. I can see every loop mount point in the list but not the root. Because of this, the data I am getting in the monitor differs from the actual server: for example, df -hT shows root at 99% on the server, but Datadog monitoring shows 60%.
I am not too familiar with how to use Datadog; can someone please help?
I tried to research it but was not able to resolve the issue.
You can also try using the device label to read only the root volume, such as:
avg:system.disk.in_use{device_label:/} by {host}
I personally found the metric system.disk.in_use to track the total, so instead I added a formula that calculates utilization from system.disk.total and system.disk.free, which turned out to be more accurate.
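A rough sketch of that approach as a single metric query (these are the standard Datadog system check metrics; the device tag is just an example, and whether you write it as one query or as a monitor formula over two separate queries depends on how you build the monitor):
(avg:system.disk.total{device:/dev/root} by {host} - avg:system.disk.free{device:/dev/root} by {host}) / avg:system.disk.total{device:/dev/root} by {host} * 100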

Elasticsearch bulk update is extremely slow

I am indexing a large amount of daily data, ~160GB per index, into Elasticsearch. I am facing a case where I need to update almost all the docs in the indices with a small amount of data (~16GB) which is of the format:
id1,data1
id1,data2
id2,data1
id2,data2
id2,data3
.
.
.
My update operations start at 16,000 lines per second, but within 5 minutes they drop to 1,000 lines per second and do not go up after that. Updating this 16GB of data currently takes longer than indexing the entire 160GB.
My conf file for the update operation currently looks as follows:
output {
  elasticsearch {
    action => "update"
    doc_as_upsert => true
    hosts => ["host1","host2","host3","host4"]
    index => "logstash-2017-08-1"
    document_id => "%{uniqueid}"
    document_type => "daily"
    retry_on_conflict => 2
    flush_size => 1000
  }
}
The optimizations I have done to speed up indexing in my cluster, based on the suggestions at https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html, are:
Setting "indices.store.throttle.type" : "none"
Index "refresh_interval" : "-1"
I am running my cluster on 4 d2.8xlarge EC2 instances. I have allocated 30GB of heap to each node.
While the update is happening, barely any CPU is used and the load is very low as well.
Despite everything, the update is extremely slow. Is there something obvious that I am missing that is causing this issue? Looking at the threadpool data, I find that the number of threads working on bulk operations is constantly high.
Any help on this issue would be appreciated.
Thanks in advance
There are a couple of rule-outs to try here.
Memory Pressure
With 244GB of RAM, this is not terribly likely, but you can still check it out. Find the jstat command in the JDK for your platform, though there are visual tools for some of them. You want to check both your Logstash JVM and the ElasticSearch JVMs.
jstat -gcutil -h7 {PID of JVM} 2s
This will give you a readout of the various memory pools, garbage-collection counts, and GC timings for that JVM as it works. It updates every 2 seconds and prints headers every 7 lines. Spending excessive time in FGCT (full GC time) is a sign that the heap is under-allocated.
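If you're not sure which PIDs to feed to jstat, something along these lines usually works (jps ships with the JDK; the pgrep patterns are just typical process names and may differ on your setup):
jps -l                                  # lists running JVMs with their PIDs and main class/jar
pgrep -f logstash                       # fallback if jps isn't on the PATH
pgrep -f org.elasticsearch.bootstrap    # fallback for the Elasticsearch nodes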
I/O Pressure
The d2.8xlarge is a dense-storage instance, and may not be great for a highly random, small-block workload. If you're on a Unix platform, top will tell you how much time you're spending in IOWAIT state. If it's high, your storage isn't up to the workload you're sending it.
If that's the case, you may want to consider provisioned-IOPS EBS volumes rather than instance-local storage. Or, if your data will fit, consider an instance in the i3 family of high-I/O instances instead.
Logstash version
You don't say what version of Logstash you're using. Being StackOverflow, you're likely to be using 5.2. If that's the case, this isn't a rule-out.
But, if you're using something in the 2.x series, you may want to set the -w flag to 1 at first, and work your way up. Yes, that's single-threading this. But the ElasticSearch output has some concurrency issues in the 2.x series that are largely fixed in the 5.x series.
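On a 2.x Logstash, that would look something like the following (the path and config file name are placeholders):
bin/logstash -f update.conf -w 1    # start with a single worker, then raise -w gradually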
With Elasticsearch 6.0 we had exactly the same issue of slow updates on AWS, and the culprit was slow I/O. The same data upserted completely fine on a local test stack, but once in the cloud on the EC2 stack, everything died after an initial burst of speedy inserts lasting only a few minutes.
The local test stack was a low-spec server in terms of memory and CPU, but it had SSDs.
The AWS stack used EBS volumes with the default gp2 300 IOPS.
Converting the volumes to type io1 with 3000 IOPS solved the issue and everything got back on track.
I am using the Amazon AWS Elasticsearch service, version 6.0. I need heavy writes/inserts from a series of JSON files into Elasticsearch, for 10 billion items. The elasticsearch-py bulk write speed was really slow most of the time, with only occasional bursts of high-speed writes. I tried all kinds of methods, such as splitting the JSON files into smaller pieces, reading the JSON files with multiple processes, and using parallel_bulk to insert into Elasticsearch; nothing worked. Finally, after I upgraded to an io1 EBS volume, everything went smoothly with 10,000 write IOPS.
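If you take the io1 route on existing volumes, current-generation EBS volumes can be modified in place; a rough sketch with the AWS CLI (the volume ID is a placeholder):
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type io1 --iops 3000
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0    # check modification progress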

Neo4j in Docker - Max Heap Size Causes Hard crash 137

I'm trying to spin up a Neo4j 3.1 instance in a Docker container (through Docker-Compose), running on OSX (El Capitan). All is well, unless I try to increase the max-heap space available to Neo above the default of 512MB.
According to the docs, this can be achieved by adding the environment variable NEO4J_dbms_memory_heap_maxSize, which then causes the server wrapper script to update the neo4j.conf file accordingly. I've checked and it is being updated as one would expect.
The problem is, when I run docker-compose up to spin up the container, the Neo4j instance crashes out with a 137 status code. A little research tells me this is a linux hard-crash, based on heap-size maximum limits.
$ docker-compose up
Starting elasticsearch
Recreating neo4j31
Attaching to elasticsearch, neo4j31
neo4j31 | Starting Neo4j.
neo4j31 exited with code 137
My questions:
Is this due to a Docker or an OSX limitation?
Is there a way I can modify these limits? If I drop the requested limit to 1GB, it will spin up, but still crashes once I run my heavy query (which is what caused the need for increased Heap space anyway).
The query that I'm running is a large-scale update across a lot of nodes (>150k) containing full-text attributes, so that they can be synchronised to ElasticSearch using the plug-in. Is there a way I can get Neo to step through, say, 500 nodes at a time, using only Cypher? (I'd rather avoid writing a script if I can; it feels a little dirty for this.)
My docker-compose.yml is as follows:
---
version: '2'
services:
  # ---<SNIP>
  neo4j:
    image: neo4j:3.1
    container_name: neo4j31
    volumes:
      - ./docker/neo4j/conf:/var/lib/neo4j/conf
      - ./docker/neo4j/mnt:/var/lib/neo4j/import
      - ./docker/neo4j/plugins:/plugins
      - ./docker/neo4j/data:/data
      - ./docker/neo4j/logs:/var/lib/neo4j/logs
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_dbms_memory_heap_maxSize=4G
  # ---<SNIP>
Is this due to a Docker or an OSX limitation?
No. Increase the amount of RAM available to Docker to resolve this issue.
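It is worth checking how much memory the Docker VM actually has before retrying (a quick sketch; on Docker for Mac the limit itself is raised in the Docker preferences rather than on the command line):
docker info --format '{{.MemTotal}}'    # total memory available to containers, in bytes
docker stats neo4j31 --no-stream        # current memory usage and limit for the container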
Is there a way I can modify these limits? If I drop the requested limit to 1GB, it will spin up, but still crashes once I run my heavy query (which is what caused the need for increased heap space anyway).
The query that I'm running is a large-scale update across a lot of nodes (>150k) containing full-text attributes, so that they can be synchronised to ElasticSearch using the plug-in. Is there a way I can get Neo to step through, say, 500 nodes at a time, using only Cypher? (I'd rather avoid writing a script if I can; it feels a little dirty for this.)
N/A. This is a Neo4j-specific question. It might be better to separate this from the Docker questions listed above.
3. The query that I'm running is a large-scale update across a lot of nodes (>150k) containing full-text attributes, so that they can be synchronised to ElasticSearch using the plug-in. Is there a way I can get Neo to step through, say, 500 nodes at a time, using only Cypher? (I'd rather avoid writing a script if I can; it feels a little dirty for this.)
You can do this with the help of the APOC plugin for Neo4j, more specifically apoc.periodic.iterate or apoc.periodic.commit.
If you use apoc.periodic.commit, your first MATCH should be specific: as in the example below, mark which nodes you have already synced, because otherwise it can fall into an endless loop:
call apoc.periodic.commit("
match (user:User) WHERE user.synced = false
with user limit {limit}
MERGE (city:City {name:user.city})
MERGE (user)-[:LIVES_IN]->(city)
SET user.synced =true
RETURN count(*)
",{limit:10000})
If you use apoc.periodic.iterate you can run it in parallel mode:
CALL apoc.periodic.iterate(
"MATCH (o:Order) WHERE o.date > '2016-10-13' RETURN o",
"with {o} as o MATCH (o)-[:HAS_ITEM]->(i) WITH o, sum(i.value) as value
CALL apoc.es.post(host-or-port,index-or-null,type-or-null,
query-or-null,payload-or-null) yield value return *", {batchSize:100, parallel:true})
Note that there is no need for a second MATCH clause, and apoc.es.post is an APOC procedure that can send POST requests to Elasticsearch.
See the documentation for more info.

Elasticsearch get works half of the time

I recently ran into a problem with elasticsearch, versions 1.0.1, 1.2.2, 1.2.4, and 1.4.1.
When I try to get a document by ID, GET http://localhost:9200/thing/otherthing/700254a4-4e72-46b9-adeb-d498159652cb, it returns the document half the time; the other half I get a "found" : false response. (These alternate literally every other time: I do a GET and it works, do another GET and it doesn't.)
These documents have no custom routing.
I have tried completely uninstalling Elasticsearch and removing all files related to it, then re-installing from the official repo, to no avail, and Google doesn't turn up any similar problems or ideas on how to solve this.
The only thing I can think of that would cause a repeatable failure like this is an unassigned shard or replica that holds this document.
Do you know how many replicas you have?
I believe reads are round-robin, so if you only have 2 copies of the data (1 primary + 1 replica), and one has become unassigned (after being written to), then you might see a failure like this.
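To confirm or rule that out, cluster health and the shard table for the index are worth a look (a sketch; adjust host and port, and "thing" is the index from the GET above):
curl -s "http://localhost:9200/_cluster/health?pretty"    # check the unassigned_shards count
curl -s "http://localhost:9200/_cat/shards/thing?v"       # shows which shard copies are STARTED vs UNASSIGNED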

MongoDB Sharded, Replica'd Cluster, Query commands for all configuration / running statuses?

I find that I have gotten confused on all the different commands I can use as an admin to discover what might be wrong with my MongoDB cluster. For example, running the split command, I got an error that "pre-condition failed". I found the part of the code related to its error message, 13105, https://github.com/mongodb/mongo/blob/master/src/mongo/client/syncclusterconnection.cpp#L219 , but I'm still confused on what I am seeing.
So I wanted to systematically check every part of my cluster. These are the commands I have run and remember right off the bat, but I'm pretty sure I'm forgetting some! I was never a DBA before looking at Mongo, so it would really help me to have a debugging checklist of info to get.
So my question is: have I got all the commands I need to get the status and configuration of each aspect of my cluster from a Mongo console? For example, I feel sure I was able to see the sizes of each shard chunk used, but I haven't captured that yet.
thank you!
Update 2014-11-22, modified the list below to include my summary findings and #wdberkeley's answer.
//TIPS For MongoDB Debugging
//1.Get General info on configs passed and the IP's (in Bash)
ps aux | grep -in mongo
//1A.Alternatively, to see details of configs on a single server
use admin
db.serverCmdLineOpts()
//2.Shard chunk/ranges status (from the mongoS balancer node)
sh.status()
//2A.Shard Status, verbose output
sh.status(true)
//3.Replica Set Status (from a replica set node, NOT from a mongoS)
rs.status()
//4.Mongo Server info (from anywhere; VERY LONG output)
db.serverStatus()
//5.Log summary from configured file or copy/paste of system.output into a .log file (from Bash)
mloginfo myCapturedLogs.log --distinct
//6.Diagnostic tools (from Bash)
mongostat
mongotop
//7.Check Q&A / Reference sites.
Refs;
[6] The amazing mtools includes log parsing and log timeline visualization. https://github.com/rueckstiess/mtools
[7] Q&A / Reference sites' URLS; 7A.mongo Shell Quick Reference http://docs.mongodb.org/manual/reference/mongo-shell/ ; 7B.StackOverflow; 7C.The stackexchange solely for dba questions, eg https://dba.stackexchange.com/questions/48232/mongodb-config-servers-not-in-sync ; 8.Mongo Diagnostics FAQ http://docs.mongodb.org/manual/faq/diagnostics/#faq-memory
sh.status, rs.status, and db.serverStatus are the main ones. Verbose output (sh.status(true)) should list all the chunk sizes for you. There are other potentially useful functions; e.g. to see the parsed configuration options, you'd use db.serverCmdLineOpts. There's a reference for mongo shell functions that you can look at to see whether there are any more functions you're interested in. There are also command-line tools like mongostat and mongotop that will give you useful information on the activity of your cluster.
If you post the error and how you caused it, I can try to give more specific advice on what's worth looking at for that error as well.
