This morning, we were alerted that a few machines in a 4-node Elasticsearch cluster were running low on disk space. We're using 5 shards with one replica per shard.
The status report shows index sizes that line up with the disk usage we're seeing. The confusing thing, however, is that the replica shards are far out of line for shards 2 and 4. I understand that shard sizes can vary between copies, but the size differences we are seeing are enormous:
"shards": {
...
"2": [
{
"routing": {
"state": "STARTED",
"primary": true,
"node": "LV__Sh-vToyTcuuxwnZaAg",
"relocating_node": null,
"shard": 2,
"index": "eventdata"
},
"state": "STARTED",
"index": {
"size_in_bytes": 87706293809
},
......
},
{
"routing": {
"state": "STARTED",
"primary": false,
"node": "U7eYVll7ToWS9lPkyhql6g",
"relocating_node": null,
"shard": 2,
"index": "eventdata"
},
"state": "STARTED",
"index": {
"size_in_bytes": 42652984946
},
Some interesting data points:
There's a lot of merging happening on the cluster at the moment. Could this be a factor? If merges make a copy of the index prior to merging, that would explain a lot.
The bigger shard is almost exactly twice the size of the smaller shard. Coincidence? (I think not.)
Why are we seeing primary shards at double the size of their replicas in our cluster? Merging? Some other reason?
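For spotting this kind of divergence programmatically, here is a minimal Python sketch (not from the question; the field paths follow the status output quoted above, and the 1.5x threshold is an arbitrary choice) that flags shards whose largest copy exceeds the smallest by more than a given factor:

```python
def find_divergent_shards(shards, ratio_threshold=1.5):
    """Return shard ids whose largest copy exceeds the smallest by the threshold."""
    divergent = []
    for shard_id, copies in shards.items():
        sizes = [copy["index"]["size_in_bytes"] for copy in copies]
        if len(sizes) > 1 and max(sizes) / min(sizes) > ratio_threshold:
            divergent.append(shard_id)
    return divergent

# sizes taken from shard 2 in the question
sample = {
    "2": [
        {"routing": {"primary": True}, "index": {"size_in_bytes": 87706293809}},
        {"routing": {"primary": False}, "index": {"size_in_bytes": 42652984946}},
    ]
}
print(find_divergent_shards(sample))  # → ['2']
```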
There is a microservice-based architecture wherein each service has a different type of entity. For example:
Service-1:
{
  "entity_type": "SKU",
  "sku": "123",
  "ext_sku": "201",
  "store": "1",
  "product": "abc",
  "timestamp": 1564484862000
}
Service-2:
{
  "entity_type": "PRODUCT",
  "product": "abc",
  "parent": "xyz",
  "description": "curd",
  "unit_of_measure": "gm",
  "quantity": "200",
  "timestamp": 1564484863000
}
Service-3:
{
  "entity_type": "PRICE",
  "meta": {
    "store": "1",
    "sku": "123"
  },
  "price": "200",
  "currency": "INR",
  "timestamp": 1564484962000
}
Service-4:
{
  "entity_type": "INVENTORY",
  "meta": {
    "store": "1",
    "sku": "123"
  },
  "in_stock": true,
  "inventory": 10,
  "timestamp": 1564484864000
}
I want to write an Audit Service backed by Elasticsearch that ingests all these entities and indexes them by entity_type, store, sku, and timestamp.
Will Elasticsearch be a good choice here? Also, how will the indexing work? For example, if I search for store=1, it should return all the different entities that have store set to 1. Secondly, will I be able to get all the entities between two timestamps?
Will ES and Kibana (to visualize) be good choices here?
Yes. Your use case is pretty much exactly what is described in the docs under filter context:
In filter context, a query clause answers the question “Does this
document match this query clause?” The answer is a simple Yes or
No — no scores are calculated. Filter context is mostly used for
filtering structured data, e.g.
Does this timestamp fall into the range 2015 to 2016?
Is the status field set to published?
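As a concrete illustration, the audit case above maps onto a filter-context query that filters on store and a timestamp range with no scoring. The field names (store, timestamp) come from the entities in the question; the helper itself is a sketch for illustration:

```python
import json

def audit_filter_query(store, ts_from, ts_to):
    """Build a bool/filter query body: exact match on store plus a timestamp range."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"store": store}},
                    {"range": {"timestamp": {"gte": ts_from, "lte": ts_to}}},
                ]
            }
        }
    }

body = audit_filter_query("1", 1564484862000, 1564484962000)
print(json.dumps(body, indent=2))
```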
I'm still relatively new to Elasticsearch. Currently, I'm attempting to switch from Solr to Elasticsearch, and I'm seeing a huge increase in CPU usage when ES serves our production website. The site sees anywhere from 10,000 to 30,000 requests to ES per second. Solr handles that load just fine on our current hardware.
The books index mapping: https://pastebin.com/bKM9egPS
A query for a book: https://pastebin.com/AdfZ895X
ES is hosted on AWS on an m4.xlarge.elasticsearch instance.
Our cluster is set up as follows (anything not included is default):
"persistent": {
"cluster": {
"routing": {
"allocation": {
"cluster_concurrent_rebalance": "2",
"node_concurrent_recoveries": "2",
"disk": {
"watermark": {
"low": "15.0gb",
"flood_stage": "5.0gb",
"high": "10.0gb"
}
},
"node_initial_primaries_recoveries": "4"
}
}
},
"indices": {
"recovery": {
"max_bytes_per_sec": "60mb"
}
}
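For reference, the non-default settings above flatten to this PUT _cluster/settings body (a sketch expressing the same values, not a tuning recommendation):

```python
# Values copied from the cluster settings dump above, written with
# flattened setting names as accepted by PUT _cluster/settings.
settings_body = {
    "persistent": {
        "cluster.routing.allocation.cluster_concurrent_rebalance": "2",
        "cluster.routing.allocation.node_concurrent_recoveries": "2",
        "cluster.routing.allocation.node_initial_primaries_recoveries": "4",
        "cluster.routing.allocation.disk.watermark.low": "15.0gb",
        "cluster.routing.allocation.disk.watermark.high": "10.0gb",
        "cluster.routing.allocation.disk.watermark.flood_stage": "5.0gb",
        "indices.recovery.max_bytes_per_sec": "60mb",
    }
}
```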
Our nodes have the following configuration:
"_nodes": {
"total": 2,
"successful": 2,
"failed": 0
},
"cluster_name": "cluster",
"nodes": {
"####": {
"name": "node1",
"version": "6.3.1",
"build_flavor": "oss",
"build_type": "zip",
"build_hash": "####",
"roles": [
"master",
"data",
"ingest"
]
},
"###": {
"name": "node2",
"version": "6.3.1",
"build_flavor": "oss",
"build_type": "zip",
"build_hash": "###",
"roles": [
"master",
"data",
"ingest"
]
}
}
Can someone please help me figure out what exactly is happening so I can get this deployment finished?
I have a managed cluster hosted by elastic.co. Here is the configuration:
|Platform => Amazon Web Services| |Memory => 4 GB|
|Storage => 96 GB| |SSD => Yes| |High availability => Yes, 2 data centers|
Each index in this cluster contains log data for exactly one day. The average index size is 15 MB and the average doc count is 15,000. The cluster is not under any kind of pressure (JVM, indexing & search time, and disk space are all well within their comfort zones).
When I opened a previously closed index, the cluster turned RED. Here are some metrics I found by querying Elasticsearch:
GET /_cluster/allocation/explain
{
  "index": "some_index_name",   # 1 primary shard, 1 replica shard
  "shard": 0,
  "primary": true
}
Response:
"unassigned_info": {
  "reason": "ALLOCATION_FAILED",
  "failed_allocation_attempts": 3,
  "details": "failed recovery, failure RecoveryFailedException[[some_index_name][0]: Recovery failed on {instance-*****}{Hash}{HASH}{IP}{IP}{logical_availability_zone=zone-1, availability_zone=***, region=***}]; nested: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: IndexNotFoundException[no segments* file found in store(mmapfs(/app/data/nodes/0/indices/MFIFAQO2R_ywstzqrfbY4w/0/index)): files: []]; ",
  "last_allocation_status": "no_valid_shard_copy"
},
"can_allocate": "no_valid_shard_copy",
"allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt",
"node_allocation_decisions": [
  {
    "node_name": "instance-***",
    "node_decision": "no",
    "store": {
      "in_sync": false,
      "allocation_id": "RANDOM_HASH",
      "store_exception": {
        "type": "index_not_found_exception",
        "reason": "no segments* file found in SimpleFSDirectory#/app/data/nodes/0/indices/RANDOM_HASH/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory#346e1b99: files: []"
      }
    }
  },
  {
    "node_name": "instance-***",
    "node_attributes": {
      "logical_availability_zone": "zone-0"
    },
    "node_decision": "no",
    "store": {
      "found": false
    }
  }
I've tried rerouting the shards to a node, even with the data-loss flag set to true:
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "some_index_name",
        "shard": 0,
        "node": "instance-***",
        "accept_data_loss": true
      }
    }
  ]
}
Response:
"acknowledged": true,
"state": {
  "version": 338190,
  "state_uuid": "RANDOM_HASH",
  "master_node": "RANDOM_HASH",
  "blocks": {
    "indices": {
      "restored_**": {
        "4": {
          "description": "index closed",
          "retryable": false,
          "levels": [
            "read",
            "write"
          ]
        }
      },
      "restored_**": {
        "4": {
          "description": "index closed",
          "retryable": false,
          "levels": [
            "read",
            "write"
          ]
        }
      }
    }
  },
  "routing_table": {
    "indices": {
      "SOME_INDEX_NAME": {
        "shards": {
          "0": [
            {
              "state": "INITIALIZING",
              "primary": true,
              "relocating_node": null,
              "shard": 0,
              "index": "SOME_INDEX_NAME",
              "recovery_source": {
                "type": "EXISTING_STORE"
              },
              "allocation_id": {
                "id": "HASH"
              },
              "unassigned_info": {
                "reason": "ALLOCATION_FAILED",
                "failed_attempts": 4,
                "delayed": false,
                "details": "same as explanation above ^ ",
                "allocation_status": "no_valid_shard_copy"
              }
            },
            {
              "state": "UNASSIGNED",
              "primary": false,
              "node": null,
              "relocating_node": null,
              "shard": 0,
              "index": "some_index_name",
              "recovery_source": {
                "type": "PEER"
              },
              "unassigned_info": {
                "reason": "INDEX_REOPENED",
                "delayed": false,
                "allocation_status": "no_attempt"
              }
            }
          ]
        }
      },
Any kind of suggestion is welcomed. Thanks and regards.
This occurs when the master node is brought down abruptly.
Here are the steps I took to resolve the same issue when I encountered it:
Step 1: Check the allocation
curl -XGET http://localhost:9200/_cat/allocation?v
Step 2: Check the shard stores
curl -XGET http://localhost:9200/_shard_stores?pretty
Look out for the "index", "shard", and "node" entries that carry the error you displayed.
The ERROR should be --> "no segments* file found in SimpleFSDirectory#/...."
Step 3: Now reroute that index as shown below
curl -XPOST 'http://localhost:9200/_cluster/reroute?master_timeout=5m' \
-H 'Content-Type: application/json' \
-d '{ "commands": [ { "allocate_empty_primary": { "index": "IndexFromStep2", "shard": ShardFromStep2, "node": "NodeFromStep2", "accept_data_loss": true } } ] }'
Step 4: Repeat Steps 2 and 3 until you see this output:
curl -XGET 'http://localhost:9200/_shard_stores?pretty'
{
"indices" : { }
}
Your cluster should go green soon.
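Steps 2 and 3 above can also be scripted. The following is an illustrative Python sketch (a hypothetical helper, not a tested tool) that walks a _shard_stores response and builds one allocate_empty_primary command per store reporting an exception:

```python
def build_reroute_commands(shard_stores):
    """Collect allocate_empty_primary commands for stores with a store_exception."""
    commands = []
    for index, idx_data in shard_stores.get("indices", {}).items():
        for shard_id, shard_data in idx_data.get("shards", {}).items():
            for store in shard_data.get("stores", []):
                if "store_exception" not in store:
                    continue
                # the node description sits under a dynamic node-id key
                node_name = next(v["name"] for v in store.values()
                                 if isinstance(v, dict) and "name" in v)
                commands.append({
                    "allocate_empty_primary": {
                        "index": index,
                        "shard": int(shard_id),
                        "node": node_name,
                        "accept_data_loss": True,
                    }
                })
    return commands

# shortened sample shaped like a _shard_stores response
sample = {
    "indices": {
        "IndexFromStep2": {
            "shards": {
                "0": {
                    "stores": [{
                        "nodeIdHash": {"name": "NodeFromStep2"},
                        "allocation_id": "abc",
                        "store_exception": {
                            "type": "index_not_found_exception",
                            "reason": "no segments* file found in SimpleFSDirectory#..."
                        }
                    }]
                }
            }
        }
    }
}
print(build_reroute_commands(sample))
```

The resulting list can be posted as the "commands" array of a _cluster/reroute request.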
My cluster suddenly went red because of a failed index shard allocation. When I run
GET /_cluster/allocation/explain
{
  "index": "my_index",
  "shard": 0,
  "primary": true
}
Output:
{
  "shard": {
    "index": "twitter_tracker",
    "index_uuid": "mfXc8oplQpq2lWGjC1TxbA",
    "id": 0,
    "primary": true
  },
  "assigned": false,
  "shard_state_fetch_pending": false,
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2018-01-02T08:13:44.513Z",
    "failed_attempts": 1,
    "delayed": false,
    "details": "failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: NotSerializableExceptionWrapper[shard_lock_obtain_failed_exception: [twitter_tracker][0]: obtaining shard lock timed out after 5000ms]; ",
    "allocation_status": "no_valid_shard_copy"
  },
  "allocation_delay_in_millis": 60000,
  "remaining_delay_in_millis": 0,
  "nodes": {
    "n91cV7ocTh-Zp58dFr5rug": {
      "node_name": "elasticsearch-24-384-node-1",
      "node_attributes": {},
      "store": {
        "shard_copy": "AVAILABLE"
      },
      "final_decision": "YES",
      "final_explanation": "the shard can be assigned and the node contains a valid copy of the shard data",
      "weight": 0.45,
      "decisions": []
    },
    "_b-wXdjGRdGLEtvY76PDSA": {
      "node_name": "elasticsearch-24-384-node-2",
      "node_attributes": {},
      "store": {
        "shard_copy": "NONE"
      },
      "final_decision": "NO",
      "final_explanation": "there is no copy of the shard available",
      "weight": 0,
      "decisions": []
    }
  }
}
What would be the solution? This happened on my production node. My Elasticsearch version is 5.0, and I have two nodes.
It is an issue that every Elasticsearch cluster developer will bump into eventually. :)
A safe way to reroute your red index:
curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed'
This command will take some time, but you won't get an allocation error while data is transferring.
The issue is explained in more detail here.
I solved my issue with the following command:
curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "da-prod8-other",
        "shard": 3,
        "node": "node-2-data-pod",
        "accept_data_loss": true
      }
    }
  ]
}'
Note that you might lose data here. It worked well for me, though running the command was thrilling; luckily it went fine. For more details, check this thread.
I'm having some trouble understanding some of the data in the response from the SonarQube GET api/measures/component_tree API.
Some metrics have a value attribute while others don't. I've figured out that the value displayed in the UI is "value" unless it does not exist, in which case the value at the earliest period is used. The other periods are then basically deltas between measurements. Would anyone be able to provide some details about what the response values actually mean? Unfortunately, the API documentation that SonarQube provides doesn't give any detail about response data. Specifically, I'm wondering when a value attribute would and would not be present, what the index means since not all metrics have the same indexes (i.e., some go 1-4, others have just 3 and 4), and what the period data represents.
{
  "metric": "new_lines_to_cover",
  "periods": [
    {
      "index": 1,
      "value": "572"
    },
    {
      "index": 2,
      "value": "572"
    },
    {
      "index": 3,
      "value": "8206"
    },
    {
      "index": 4,
      "value": "186574"
    }
  ]
},
{
  "metric": "duplicated_lines",
  "value": "80819",
  "periods": [
    {
      "index": 1,
      "value": "-158"
    },
    {
      "index": 2,
      "value": "-158"
    },
    {
      "index": 3,
      "value": "-10544"
    },
    {
      "index": 4,
      "value": "-6871"
    }
  ]
},
{
  "metric": "new_line_coverage",
  "periods": [
    {
      "index": 3,
      "value": "3.9900249376558605"
    },
    {
      "index": 4,
      "value": "17.221615720524017"
    }
  ]
},
Your heuristic is very close to the truth:
if the metric starts with "new_", it computes new elements over a period of time. Starting with 6.3, only the leak period is supported;
otherwise, "value" represents the raw value.
For example, to count issues:
violations computes the total number of issues
new_violations computes the number of new issues in the leak period
To learn more about the leak period concept in SonarQube, please check this article.
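The rule above can be sketched as a small helper. This is an illustrative sketch, not SonarQube's implementation; treating index 1 as the leak-period fallback is an assumption based on the behaviour described in the question:

```python
def displayed_value(measure, leak_index=1):
    """Return the top-level "value" when present; otherwise fall back to the
    leak-period value (new_* metrics carry only per-period values)."""
    if "value" in measure:
        return measure["value"]
    for period in measure.get("periods", []):
        if period["index"] == leak_index:
            return period["value"]
    return None

# measures shortened from the response shown in the question
new_lines = {"metric": "new_lines_to_cover",
             "periods": [{"index": 1, "value": "572"}]}
dup_lines = {"metric": "duplicated_lines", "value": "80819",
             "periods": [{"index": 1, "value": "-158"}]}

print(displayed_value(new_lines))  # → 572
print(displayed_value(dup_lines))  # → 80819
```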