Elasticsearch Index Status - elasticsearch

I am trying to set up a scripted reindex operation as suggested in: http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
Following the suggestion of creating a new index, aliasing it, and then deleting the old index, I need a way to tell when the indexing operation on the new index is complete, ideally via the REST interface.
The index has 80 million rows and indexing can take a few hours.
I can't find anything helpful in the docs.

You can try the _stats API: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-stats.html
For example:
{
  "_shards" : {
    "total" : 10,
    "successful" : 5,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "docs" : {
        "count" : 0,
        "deleted" : 0
      },
      "store" : {
        "size_in_bytes" : 575,
        "throttle_time_in_millis" : 0
      },
      "indexing" : {
        "index_total" : 0,
        "index_time_in_millis" : 0,
        "index_current" : 0,
        "delete_total" : 0,
        "delete_time_in_millis" : 0,
        "delete_current" : 0,
        "noop_update_total" : 0,
        "is_throttled" : false,
        "throttle_time_in_millis" : 0
      },
      ...
I think you can compare _all.total.docs.count with _all.total.indexing.index_current.
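As a minimal polling sketch (the index name new_index, the expected total, the polling interval, and the use of jq are illustrative assumptions, not from the original answer), you could watch the primary doc count until it reaches the number of rows being reindexed:

EXPECTED=80000000
while true; do
  # read _all.primaries.docs.count from the filtered stats response
  COUNT=$(curl -s 'localhost:9200/new_index/_stats/docs' | jq '._all.primaries.docs.count')
  echo "docs indexed so far: $COUNT"
  [ "$COUNT" -ge "$EXPECTED" ] && break
  sleep 60
done

Once the count stops growing (and indexing.index_current is back to 0), the alias can be switched over.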

Related

Elasticsearch max of field combined with a unique field

I have an index with two fields:
name: uuid
version: long
I now want to count only the documents (on a very large index, 1 million+ entries) whose version is the highest for their name. For example, a query on an index with the following documents:
{name="a", version=1}
{name="a", version=2}
{name="a", version=3}
{name="b", version=1}
... would return:
count=2
Is this possible somehow? I cannot find a solution for this particular problem.
You are effectively describing a count of distinct names, which you can do with a cardinality aggregation.
Request:
GET test1/_search
{
  "aggs" : {
    "distinct_count" : {
      "cardinality" : {
        "field" : "name.keyword"
      }
    }
  },
  "size": 0
}
Response:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "distinct_count" : {
      "value" : 2
    }
  }
}
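One caveat worth noting (not part of the original answer): the cardinality aggregation is approximate. For an index in the low millions you can raise precision_threshold (maximum 40000) so that counts below that threshold stay close to exact; the value below is illustrative:

GET test1/_search
{
  "size": 0,
  "aggs" : {
    "distinct_count" : {
      "cardinality" : {
        "field" : "name.keyword",
        "precision_threshold" : 40000
      }
    }
  }
}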

Elasticsearch get snapshot size

I'm looking for a way to get the storage size of a specific Elasticsearch snapshot. The snapshots are located on a shared filesystem.
It seems there is no API for this?
To get the size or status of an Elasticsearch snapshot, use the snapshot status API:
curl -X GET "localhost:9200/_snapshot/my_repository/my_snapshot/_status?pretty"
Note: substitute your own repository and snapshot names in the curl above.
Sample Output:
"snapshots" : [
{
"snapshot" : "index-01",
"repository" : "my_repository",
"uuid" : "OKHNDHSKENGHLEWNALWEERTJNS",
"state" : "SUCCESS",
"include_global_state" : true,
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 2,
"failed" : 0,
"total" : 2
},
"stats" : {
"incremental" : {
"file_count" : 149,
"size_in_bytes" : 8229187919
},
"total" : {
"file_count" : 463,
"size_in_bytes" : 169401330819
},
"start_time_in_millis" : 1631622333285,
"time_in_millis" : 208851,
"number_of_files" : 149,
"processed_files" : 149,
"total_size_in_bytes" : 8229187919,
"processed_size_in_bytes" : 8229187919
},
"indices" : {
"graylog_130" : {
"shards_stats" : {
"initializing" : 0,
"started" : 0,
"finalizing" : 0,
"done" : 2,
"failed" : 0,
"total" : 2
},
"stats" : {
"incremental" : {
"file_count" : 149,
"size_in_bytes" : 8229187919
},
"total" : {
"file_count" : 463,
"size_in_bytes" : 169401330819
},
"start_time_in_millis" : 1631622333285,
"time_in_millis" : 208851,
"number_of_files" : 149,
"processed_files" : 149,
"total_size_in_bytes" : 8229187919,
"processed_size_in_bytes" : 8229187919
},
"shards" : {
"0" : {
"stage" : "DONE",
"stats" : {
"incremental" : {
"file_count" : 97,
"size_in_bytes" : 1807163337
},
"total" : {
"file_count" : 271,
"size_in_bytes" : 84885391182
},
"start_time_in_millis" : 1631622334048,
"time_in_millis" : 49607,
"number_of_files" : 97,
"processed_files" : 97,
"total_size_in_bytes" : 1807163337,
"processed_size_in_bytes" : 1807163337
}
},
"1" : {
"stage" : "DONE",
"stats" : {
"incremental" : {
"file_count" : 52,
"size_in_bytes" : 6422024582
},
"total" : {
"file_count" : 192,
"size_in_bytes" : 84515939637
},
"start_time_in_millis" : 1631622333285,
"time_in_millis" : 208851,
"number_of_files" : 52,
"processed_files" : 52,
"total_size_in_bytes" : 6422024582,
"processed_size_in_bytes" : 6422024582
}
}
}
}
In the above output, look for
"total" : {
"file_count" : 463,
"size_in_bytes" : 169401330819
}
Now convert size_in_bytes to GB and you have the size of the snapshot in GB.
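For instance, a quick way to pull that number out and convert it (assumes jq is installed; repository and snapshot names as in the curl above):

# total size of the snapshot in GiB
curl -s "localhost:9200/_snapshot/my_repository/my_snapshot/_status" |
  jq '.snapshots[0].stats.total.size_in_bytes / 1024 / 1024 / 1024'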
You can also get the storage used by an index with the _cat API (primary store size); the first snapshot should be roughly the size of the index.
For incremental snapshots it depends: snapshots are taken at the segment level, so an incremental snapshot may be much smaller depending on your indexing, and merges can cause new segments to form, etc.
https://www.elastic.co/blog/found-elasticsearch-snapshot-and-restore gives a nice overview.
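For example (index name taken from the sample output above), the _cat API shows the primary store size directly:

curl -s "localhost:9200/_cat/indices/graylog_130?v&h=index,pri.store.size,store.size"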
I need the exact size used on storage.
For now I use the following approach: separate directories per index/snapshot, so I can read the storage used at the system level (with du) for a specific index or snapshot.
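A minimal example of that approach (the repository path is illustrative):

# on-disk size of one repository directory on the shared filesystem
du -sh /mnt/backups/elasticsearch/my_repository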

Count the number of duplicates in elasticsearch

I have an application inserting a numbered sequence of logs into elasticsearch.
Under certain conditions, after stopping my application, I find that in elasticsearch there are more logs than I have actually generated.
This simple aggregation helped me find out that a few duplicates are present:
curl -XGET 'localhost:9200/logstash-*/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "msgnum_terms": {
      "terms": {
        "field": "msgnum.raw",
        "min_doc_count": 2,
        "size": 0
      }
    }
  }
}'
msgnum is the field containing the numeric sequence. Normally it should be unique, so the resulting doc_count values should never exceed 1. Instead I get something like:
{
  "took" : 33,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 100683,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "msgnum_terms" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "4097",
        "doc_count" : 2
      }, {
        "key" : "4099",
        "doc_count" : 2
      ...
      }, {
        "key" : "5704",
        "doc_count" : 2
      } ]
    }
  }
}
How can I count the exact number of duplicates in order to make sure that they are the only cause of mismatch between number of generated log lines and number of hits in elasticsearch?
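One hedged cross-check (using the same msgnum.raw field; the cardinality aggregation is approximate, so treat the result as an estimate): compare the total hit count with the number of distinct msgnum values, and the difference is the number of surplus duplicate documents.

curl -XGET 'localhost:9200/logstash-*/_search?pretty' -d '{
  "size": 0,
  "aggs": {
    "distinct_msgnum": {
      "cardinality": { "field": "msgnum.raw" }
    }
  }
}'
# duplicates = hits.total - aggregations.distinct_msgnum.value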

elasticsearch: How to interpret log file (cluster went to yellow status)?

Elasticsearch 1.7.2 on CentOS, 8GB RAM, 2 node cluster.
We posted the whole log here: http://pastebin.com/zc2iG2q4
When we look at /_cluster/health, we see 2 unassigned shards:
{
  "cluster_name" : "elasticsearch-prod",
  "status" : "yellow",                    <--------------------------
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 5,
  "active_shards" : 8,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 2,                <--------------------------
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
In the log, we see:
marking and sending shard failed due to [failed to create shard]
java.lang.OutOfMemoryError: Java heap space
And other errors.
The only memory-related config value we have is:
indices.fielddata.cache.size: 75%
We are looking to:
understand the log more completely
understand what action we need to take to address the situation now (recover) and prevent it in the future
Additional details:
1) ES_HEAP_SIZE is stock, no changes. (Further, looking around, it is not clear where best to change it... /etc/init.d/elasticsearch?)
2) Our JVM stats are below. (Please note: as a test, I modified /etc/init.d/elasticsearch, added export ES_HEAP_SIZE=4g in place of the existing "export ES_HEAP_SIZE" line, and restarted ES. Comparing two otherwise identical nodes, one with the changed file and one stock, the values below appear identical.)
"jvm" : {
"timestamp" : 1448395039780,
"uptime_in_millis" : 228297,
"mem" : {
"heap_used_in_bytes" : 81418872,
"heap_used_percent" : 7,
"heap_committed_in_bytes" : 259522560,
"heap_max_in_bytes" : 1037959168,
"non_heap_used_in_bytes" : 50733680,
"non_heap_committed_in_bytes" : 51470336,
"pools" : {
"young" : {
"used_in_bytes" : 52283368,
"max_in_bytes" : 286326784,
"peak_used_in_bytes" : 71630848,
"peak_max_in_bytes" : 286326784
},
"survivor" : {
"used_in_bytes" : 2726824,
"max_in_bytes" : 35782656,
"peak_used_in_bytes" : 8912896,
"peak_max_in_bytes" : 35782656
},
"old" : {
"used_in_bytes" : 26408680,
"max_in_bytes" : 715849728,
"peak_used_in_bytes" : 26408680,
"peak_max_in_bytes" : 715849728
}
}
},
"threads" : {
"count" : 81,
"peak_count" : 81
},
"gc" : {
"collectors" : {
"young" : {
"collection_count" : 250,
"collection_time_in_millis" : 477
},
"old" : {
"collection_count" : 1,
"collection_time_in_millis" : 22
}
}
},
"buffer_pools" : {
"direct" : {
"count" : 112,
"used_in_bytes" : 20205138,
"total_capacity_in_bytes" : 20205138
},
"mapped" : {
"count" : 0,
"used_in_bytes" : 0,
"total_capacity_in_bytes" : 0
}
}
},
Solved.
The key here is the error "java.lang.OutOfMemoryError: Java heap space"
Another day, another gem from the ES docs:
https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
says (emphasis mine):
The default installation of Elasticsearch is configured with a 1 GB heap. For just about every deployment, this number is far too small. If you are using the default heap values, your cluster is probably configured incorrectly.
Resolution:
Edit: /etc/sysconfig/elasticsearch
Set ES_HEAP_SIZE=4g // this system has 8GB RAM
Restart ES
And tada.... the unassigned shards are magically assigned, and the cluster goes green.
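To confirm the new heap size actually took effect (a quick check, not part of the original answer), the node stats report the configured maximum:

curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_max_in_bytes
# with ES_HEAP_SIZE=4g this should report roughly 4294967296 instead of the ~1 GB default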

How to read verbose output from MongoDB explain(1)

I have the following query.explain(1) output. It is verbose, and my question is how to read it. What is the order of the operations? Does it start with GEO_NEAR_2DSPHERE or with LIMIT? What does the field advanced express?
And most importantly, where is this documented? I could not find it in the MongoDB manual :(
Query:
db.nodesWays.find(
  {
    geo: {
      $nearSphere: {
        $geometry: {
          type: "Point",
          coordinates: [lon, lat]
        }
      }
    },
    "amenity": "restaurant"
  },
  { name: 1 }
).limit(10).explain(1)
The output:
{
  "cursor" : "S2NearCursor",
  "isMultiKey" : false,
  "n" : 10,
  "nscannedObjects" : 69582,
  "nscanned" : 69582,
  "nscannedObjectsAllPlans" : 69582,
  "nscannedAllPlans" : 69582,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 543,
  "nChunkSkips" : 0,
  "millis" : 606,
  "indexBounds" : {
  },
  "allPlans" : [
    {
      "cursor" : "S2NearCursor",
      "isMultiKey" : false,
      "n" : 10,
      "nscannedObjects" : 69582,
      "nscanned" : 69582,
      "scanAndOrder" : false,
      "indexOnly" : false,
      "nChunkSkips" : 0,
      "indexBounds" : {
      }
    }
  ],
  "server" : "DBTest:27017",
  "filterSet" : false,
  "stats" : {
    "type" : "LIMIT",
    "works" : 69582,
    "yields" : 543,
    "unyields" : 543,
    "invalidates" : 0,
    "advanced" : 10,
    "needTime" : 69572,
    "needFetch" : 0,
    "isEOF" : 1,
    "children" : [
      {
        "type" : "PROJECTION",
        "works" : 69582,
        "yields" : 543,
        "unyields" : 543,
        "invalidates" : 0,
        "advanced" : 10,
        "needTime" : 0,
        "needFetch" : 0,
        "isEOF" : 0,
        "children" : [
          {
            "type" : "FETCH",
            "works" : 69582,
            "yields" : 543,
            "unyields" : 543,
            "invalidates" : 0,
            "advanced" : 10,
            "needTime" : 69572,
            "needFetch" : 0,
            "isEOF" : 0,
            "alreadyHasObj" : 4028,
            "forcedFetches" : 0,
            "matchTested" : 10,
            "children" : [
              {
                "type" : "GEO_NEAR_2DSPHERE",
                "works" : 69582,
                "yields" : 0,
                "unyields" : 0,
                "invalidates" : 0,
                "advanced" : 4028,
                "needTime" : 0,
                "needFetch" : 0,
                "isEOF" : 0,
                "children" : [ ]
              }
            ]
          }
        ]
      }
    ]
  }
}
Looking at the stats tree, the sequence is:
GEO_NEAR_2DSPHERE -> scans 69582 index entries.
FETCH and LIMIT -> fetch the matching documents, up to the limited number of documents.
PROJECTION -> project to return only the required fields.
MongoDB wraps everything in LIMIT so that the plan mirrors the query's syntax, which makes it easier to interpret. The advanced field is the number of results a stage handed up to its parent stage: here GEO_NEAR_2DSPHERE advanced 4028 candidates, while LIMIT advanced only the final 10.
The query uses an unnamed index of type S2NearCursor. In addition to the index, it also retrieved the whole document to filter further on amenity; you may want to explore indexing that field as well (a sketch follows below).
By the way, this is a known bug in MongoDB: the index name is missing from the output when an S2NearCursor index is used.
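A hedged sketch of that indexing suggestion, in the mongo shell (the key order is illustrative; compound indexes may combine a 2dsphere key with ordinary keys, but verify the behavior against your MongoDB version):

// compound geospatial index so that both the $nearSphere search and the
// amenity filter can be served from the same index
db.nodesWays.createIndex({ geo: "2dsphere", amenity: 1 })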
As for detailed documentation, I haven't found much myself either, but there are a few blog posts you can browse:
explain.explain() – Understanding Mongo Query Behavior
Speeding Up Queries: Understanding Query Plans
I especially recommend paying attention to the last paragraph of each of the two posts. Tune, generate the query plan, and try to explain the plan yourself; after a few rounds you'll get a feel for how it works.
Happy explaining. : )
