My cluster suddenly went to red. Because of an index shard allocation fail. when i run
GET /_cluster/allocation/explain
{
"index": "my_index",
"shard": 0,
"primary": true
}
output:
{
"shard": {
"index": "twitter_tracker",
"index_uuid": "mfXc8oplQpq2lWGjC1TxbA",
"id": 0,
"primary": true
},
"assigned": false,
"shard_state_fetch_pending": false,
"unassigned_info": {
"reason": "ALLOCATION_FAILED",
"at": "2018-01-02T08:13:44.513Z",
"failed_attempts": 1,
"delayed": false,
"details": "failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: NotSerializableExceptionWrapper[shard_lock_obtain_failed_exception: [twitter_tracker][0]: obtaining shard lock timed out after 5000ms]; ",
"allocation_status": "no_valid_shard_copy"
},
"allocation_delay_in_millis": 60000,
"remaining_delay_in_millis": 0,
"nodes": {
"n91cV7ocTh-Zp58dFr5rug": {
"node_name": "elasticsearch-24-384-node-1",
"node_attributes": {},
"store": {
"shard_copy": "AVAILABLE"
},
"final_decision": "YES",
"final_explanation": "the shard can be assigned and the node contains a valid copy of the shard data",
"weight": 0.45,
"decisions": []
},
"_b-wXdjGRdGLEtvY76PDSA": {
"node_name": "elasticsearch-24-384-node-2",
"node_attributes": {},
"store": {
"shard_copy": "NONE"
},
"final_decision": "NO",
"final_explanation": "there is no copy of the shard available",
"weight": 0,
"decisions": []
}
}
}
What will be the solution? This is happened in my production node. My elasticsearch version 5.0. and i have two nodes
It is an issue that every Elastic Cluster developer will bump to anyway :)
Safeway to reroute your red index.
curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed
This command will take some time, but you won't get allocation error while data transferring.
Here is issue explained wider.
I solved my issue with the following command.
curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -d '{
"commands" : [ {
"allocate_stale_primary" :
{
"index" : "da-prod8-other", "shard" : 3,
"node" : "node-2-data-pod",
"accept_data_loss" : true
}
}
]
}'
So here u might lose the data. For me it works well. It was thrilling while running the command. luckily it worked well. For more details enter link description here Check this thread.
Related
I'm still relatively new to Elasticsearch and, currently, I'm attempting to switch from Solr to Elasticsearch and am seeing a huge increase in CPU usage when ES is on our production website. The site sees anywhere from 10,000 to 30,000 requests to ES per second. Solr handles that load just fine with our current hardware.
The books index mapping: https://pastebin.com/bKM9egPS
A query for a book: https://pastebin.com/AdfZ895X
ES is hosted on AWS on an m4.xlarge.elasticsearch instance.
Our cluster is set up as follows (anything not included is default):
"persistent": {
"cluster": {
"routing": {
"allocation": {
"cluster_concurrent_rebalance": "2",
"node_concurrent_recoveries": "2",
"disk": {
"watermark": {
"low": "15.0gb",
"flood_stage": "5.0gb",
"high": "10.0gb"
}
},
"node_initial_primaries_recoveries": "4"
}
}
},
"indices": {
"recovery": {
"max_bytes_per_sec": "60mb"
}
}
Our nodes have the following configuration:
"_nodes": {
"total": 2,
"successful": 2,
"failed": 0
},
"cluster_name": "cluster",
"nodes": {
"####": {
"name": "node1",
"version": "6.3.1",
"build_flavor": "oss",
"build_type": "zip",
"build_hash": "####",
"roles": [
"master",
"data",
"ingest"
]
},
"###": {
"name": "node2",
"version": "6.3.1",
"build_flavor": "oss",
"build_type": "zip",
"build_hash": "###",
"roles": [
"master",
"data",
"ingest"
]
}
}
Can someone please help me figure out what exactly is happening so I can get this deployment finished?
I'm trying to backup an index in ElasticSearch 6.2.2 on Windows and restore it on ElasticSearch 6.2.3 on linux. I'm taking following steps but I never see the snapshots on the linux machine (based on https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html):
Register a snapshot repository on Windows
PUT /_snapshot/my_backup
{
"type": "fs",
"settings": {
"location": "C:\\EsBackup",
"compress": true
}
}
Create a snapshot on Windows
PUT /_snapshot/my_backup/bcp?wait_for_completion=true
{
"indices":"myindex"
}
Check that snapshot is created
GET /_snapshot/my_backup/bcp
It shows success
{
"snapshots": [
{
"snapshot": "bcp",
"uuid": "xGLvw4JnRqadcyg4fxH1HA",
"version_id": 6020299,
"version": "6.2.2",
"indices": [
"bcp"
],
"include_global_state": true,
"state": "SUCCESS",
"start_time": "2018-05-10T16:48:22.970Z",
"start_time_in_millis": 1525970902970,
"end_time": "2018-05-10T16:48:36.816Z",
"end_time_in_millis": 1525970916816,
"duration_in_millis": 13846,
"failures": [],
"shards": {
"total": 3,
"failed": 0,
"successful": 3
}
}
]
}
I copy the files to linux and check that size and number of files is the same. (It's around 200 files with size of 500MB). I copy everything to a folder ~/elasticsearch-6.2.3/backup
Initialize the repository
curl -X PUT "localhost:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'{"type": "fs","settings": {"location": "~/elasticsearch-6.2.3/backup","compress": true}}'
And got acknowledgement: {"acknowledged":true}
So I check which snapshots are available
curl -X GET "localhost:9200/_snapshot/my_backup/_status"
But the list is empty {"snapshots":[]}
And restore
curl -X POST "localhost:9200/_snapshot/my_backup/bcp/_restore?pretty"
returns an error
{
"error" : {
"root_cause" : [
{
"type" : "snapshot_restore_exception",
"reason" : "[my_backup:bcp] snapshot does not exist"
}
],
"type" : "snapshot_restore_exception",
"reason" : "[my_backup:bcp] snapshot does not exist"
},
"status" : 500
}
What step am I missing? I tried to _verify the repository and it's valid. Version 6.2.2 and 6.2.3 should be compatible.
Add the following in elasticsearch.yml file:
path.repo: /path/to/repo
I'm not sure why, but the home path cannot use ~, but if stated as "/home/user/elasticsearch-6.2.3/backup" then it works
I have a managed cluster hosted by elastio.co. Here is the configuration
|Platform => Amazon Web Services| |Memory => 4 GB|
|Storage => 96 GB| |SSD => Yes| |High availability => Yes 2 data centers|
Each index in this cluster contain log data of exactly one day. Average index size is 15 mb and average doc count is 15000. The cluster is not in any way under any kind of pressure (JVM, Indexing & Searching time, Disk Space all are in very comfort zone)
When I opened a previously closed index the cluster is turned RED. Here are some matrices I found querying the elasticsearch.
GET /_cluster/allocation/explain
{
"index": "some_index_name", # 1 Primary shard , 1 replica shard
"shard": 0,
"primary": true
}
Response :
"unassigned_info": {
"reason": "ALLOCATION_FAILED"
"failed_allocation_attempts": 3,
"details": "failed recovery, failure RecoveryFailedException[[some_index_name][0]: Recovery failed on {instance-*****}{Hash}{HASH}{IP}{IP}{logical_availability_zone=zone-1, availability_zone=***, region=***}]; nested: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: IndexNotFoundException[no segments* file found in store(mmapfs(/app/data/nodes/0/indices/MFIFAQO2R_ywstzqrfbY4w/0/index)): files: []]; ",
"last_allocation_status": "no_valid_shard_copy"
},
"can_allocate": "no_valid_shard_copy",
"allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt",
"node_allocation_decisions": [
{
"node_name": "instance-***",
"node_decision": "no",
"store": {
"in_sync": false,
"allocation_id": "RANDOM_HASH",
"store_exception": {
"type": "index_not_found_exception",
"reason": "no segments* file found in SimpleFSDirectory#/app/data/nodes/0/indices/RANDOM_HASH/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory#346e1b99: files: []"
}
}
},
{
"node_name": "instance-***",
"node_attributes": {
"logical_availability_zone": "zone-0",
},
"node_decision": "no",
"store": {
"found": false
}
}
I've tried rerouting the shards to a node. Even setting data loss flag to true.
POST _cluster/reroute
{
"commands" : [
{"allocate_stale_primary" : {
"index" : "some_index_name", "shard" : 0,
"node" : "instance-***",
"accept_data_loss" : true
}
}
]
}
Response:
"acknowledged": true,
"state": {
"version": 338190,
"state_uuid": "RANDOM_HASH",
"master_node": "RANDOM_HASH",
"blocks": {
"indices": {
"restored_**: {
"4": {
"description": "index closed",
"retryable": false,
"levels": [
"read",
"write"
]
}
},
"restored_**": {
"4": {
"description": "index closed",
"retryable": false,
"levels": [
"read",
"write"
]
}
}
}
},
"routing_table": {
"indices": {
"SOME_INDEX_NAME": {
"shards": {
"0": [
{
"state": "INITIALIZING",
"primary": true,
"relocating_node": null,
"shard": 0,
"index": "SOME_INDEX_NAME",
"recovery_source": {
"type": "EXISTING_STORE"
},
"allocation_id": {
"id": "HASH"
},
"unassigned_info": {
"reason": "ALLOCATION_FAILED",
"failed_attempts": 4,
"delayed": false,
"details": "same as explanation above ^ ",
"allocation_status": "no_valid_shard_copy"
}
},
{
"state": "UNASSIGNED",
"primary": false,
"node": null,
"relocating_node": null,
"shard": 0,
"index": "some_index_name",
"recovery_source": {
"type": "PEER"
},
"unassigned_info": {
"reason": "INDEX_REOPENED",
"delayed": false,
"allocation_status": "no_attempt"
}
}
]
}
},
Any kind of suggestion is welcomed. Thanks and regards.
This occurs when the master-node is brought down abruptly.
Here are the steps I took to resolve the same issue, that I had encountered ,
Step 1: Check the allocation
curl -XGET http://localhost:9200/_cat/allocation?v
Step 2: Check the shard stores
curl -XGET http://localhost:9200/_shard_stores?pretty
Look out for "index", "shard" and "node" that has the error that you displayed.
The ERROR should be --> "no segments* file found in SimpleFSDirectory#/...."
Step 3: Now reroute that index as shown below
curl -XPOST 'http://localhost:9200/_cluster/reroute?master_timeout=5m' \
-d '{ "commands": [ { "allocate_empty_primary": { "index": "IndexFromStep2", "shard": ShardFromStep2 , "node": "NodeFromStep2", "accept_data_loss" : true } } ] }'
Step 4: Repeat Step2 and Step3 until you see this output.
curl -XGET 'http://localhost:9200/_shard_stores?pretty'
{
"indices" : { }
}
Your cluster should go green soon.
I have issue with latest fsriver plugin.
I executed following command to index document
PUT _river/mynewriver2/_meta
{
"type": "fs",
"fs": {
"url": "d://tmp",
"update_rate": "1h",
"includes": [ "*.doc" , "*.xls", "*.txt" ]
},
"index": {
"index": "docs1",
"type": "doc1",
"bulk_size": 50
}
}
Inside d://tmp I have a simple txt file with person name.
But when I am executing the command to check document, I am not getting any document.
GET docs1/doc1/_search
output :
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
In elasticsearch console, I have following log:
[2015-05-23 12:40:40,645][INFO ][cluster.metadata ] [Ulysses] [.marvel-2015.05.23] update_mapping [cluster_stats] (dynamic)
[2015-05-23 12:40:54,037][INFO ][cluster.metadata ] [Ulysses] [_river] creating index, cause [auto(index api)], templates [], shards [1]/[1], mappings [mynewriver2]
[2015-05-23 12:40:56,511][INFO ][cluster.metadata ] [Ulysses] [_river] update_mapping [mynewriver2] (dynamic)
[2015-05-23 12:40:57,023][INFO ][fr.pilato.elasticsearch.river.fs.river.FsRiver] [Ulysses] [fs][mynewriver2] Starting fs river scanning
[2015-05-23 12:40:57,309][INFO ][cluster.metadata ] [Ulysses] [docs1] creating index, cause [api], templates [], shards [5]/[1], mappings []
[2015-05-23 12:41:00,762][INFO ][cluster.metadata ] [Ulysses] [.marvel-2015.05.23] update_mapping [index_event] (dynamic)
I am running elasticsearch 1.5.2 in windows 7 ( 64 bit).
Since you're on a Windows system, it looks like the specified path is not correct according to the documentation, i.e. you should either use two back slashes instead of two forward slashes in your path OR a single forward slash. Can you try to delete your river and re-create it like this
PUT _river/mynewriver2/_meta
{
"type": "fs",
"fs": {
"url": "d:\\tmp",
"update_rate": "1h",
"includes": [ "*.doc" , "*.xls", "*.txt" ]
},
"index": {
"index": "docs1",
"type": "doc1",
"bulk_size": 50
}
}
or like this:
PUT _river/mynewriver2/_meta
{
"type": "fs",
"fs": {
"url": "d:/tmp",
"update_rate": "1h",
"includes": [ "*.doc" , "*.xls", "*.txt" ]
},
"index": {
"index": "docs1",
"type": "doc1",
"bulk_size": 50
}
}
This morning, we were alerted that a few machines in a 4 node Elasticsearch clusters are running low on disk space. We're using 5 shards, with single replication.
The status report shows index sizes that line up with the disk usage we're seeing. However, the confusing thing, is that the replicated shards are very out of line for shards 2 and 4. I understand that shard sizes can vary between replicas; however the size differences we are seeing are enormous:
"shards": {
...
"2": [
{
"routing": {
"state": "STARTED",
"primary": true,
"node": "LV__Sh-vToyTcuuxwnZaAg",
"relocating_node": null,
"shard": 2,
"index": "eventdata"
},
"state": "STARTED",
"index": {
"size_in_bytes": 87706293809
},
......
},
{
"routing": {
"state": "STARTED",
"primary": false,
"node": "U7eYVll7ToWS9lPkyhql6g",
"relocating_node": null,
"shard": 2,
"index": "eventdata"
},
"state": "STARTED",
"index": {
"size_in_bytes": 42652984946
},
Some interesting data points:
There's a lot of merging happening on the cluster at the moment. Could this be a factor? If merges make a copy of the index prior to merging, then that would explain a lot.
The bigger shard is almost exactly twice the size of the the smaller shard. coincidence? (I think not).
Why are we seeing indices at double the size of their replicas in our cluster? merging? some other reason?