We are running a 7-node cluster with "ZERO" replicas, like this:
{
"cluster_name": "my_cluster",
"status": "green",
"timed_out": false,
"number_of_nodes": 7,
"number_of_data_nodes": 7,
"active_primary_shards": 3325,
"active_shards": 3325,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 100.0
}
The Elasticsearch cluster state changes from "green" to "yellow" intermittently. The other interesting thing I noticed is that shards are initializing during these intermittent state changes, which correlates with the transitions. Is this due to the cluster running with "ZERO" replicas?
What could cause the above behavior?
1. Find the affected indices with:
http://IP_MASTER:9200/_cat/indices?v
2. Find the node whose shard for that index keeps flipping between assigned and unassigned:
http://IP_MASTER:9200/_cat/shards?v
Then restart the elasticsearch service on that node; a sketch of both cat calls follows below.
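For example, a rough sketch of steps 1 and 2 (IP_MASTER is your master's address; the graylog_92 index from the CheckIndex example below is used purely as a placeholder):
curl -s 'http://IP_MASTER:9200/_cat/indices?v'
# show only shards that are currently initializing or unassigned, plus the node they live on
curl -s 'http://IP_MASTER:9200/_cat/shards?v' | grep -E 'INITIALIZING|UNASSIGNED'
# narrow the view down to a single suspect index
curl -s 'http://IP_MASTER:9200/_cat/shards/graylog_92?v'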
If the problem persists, you have two options.
A. Lucene CheckIndex (checks just that shard)
java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /mnt/nas/elasticsearch/graylog-production/nodes/0/indices/graylog_92/0/index/ -verbose -exorcise
If it says it cannot find the segments, try to locate that path, cd into it, and run the command from there (see the sketch after these options).
B. Elasticsearch index fix (it checks every index and is very slow)
index.shard.check_on_startup: fix
Set this option in the elasticsearch.yml of that node.
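A rough sketch of option A in practice; the lib and data paths are assumptions, so adjust them to your installation, and stop Elasticsearch on that node first, since -exorcise drops any corrupt segments it finds:
# find the lucene-core jar that ships with Elasticsearch
LUCENE_JAR=$(ls /usr/share/elasticsearch/lib/lucene-core-*.jar)
# the shard's Lucene index lives under <data path>/nodes/0/indices/<index>/<shard>/index
cd /mnt/nas/elasticsearch/graylog-production/nodes/0/indices/graylog_92/0/index/
# run CheckIndex from inside that directory, as option A suggests
java -cp "$LUCENE_JAR" -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex . -verbose -exorcise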
I would like to ask a question.
I have created a cluster according to this guide:
https://docs.opendaylight.org/en/stable-magnesium/getting-started-guide/clustering.html
I would like to verify that it is working; can someone help me with how to do that?
Also, is it possible to connect this cluster, or those 3 controllers, to one Mininet topology? Or can it not be done?
EDIT
I would also like to ask why not all bundles are active.
Is that going to be a problem?
I'm not sure if you can specify multiple controllers on the mininet command
line, but it's worth a try. Otherwise, you can try setting up the controllers
in a Mininet .py config file, as this person explains in this post.
To verify the cluster is working, there are many ways, but you can try some
REST calls to check the status of things. We have some examples in upstream
CSIT tests. If you install the odl-jolokia feature, you can send a GET to:
jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-config,type=DistributedConfigDatastore
that is checking the default shard status for the config datastore. You'll get
some output like this:
content={
"request": {
"mbean": "org.opendaylight.controller:Category=Shards,name=member-1-shard-default-config,type=DistributedConfigDatastore",
"type": "read"
},
"status": 200,
"timestamp": 1588524930,
"value": {
"AbortTransactionsCount": 0,
"CommitIndex": 70,
"CommittedTransactionsCount": 0,
"CurrentTerm": 7,
"FailedReadTransactionsCount": 0,
"FailedTransactionsCount": 0,
"FollowerInfo": [],
"FollowerInitialSyncStatus": true,
"InMemoryJournalDataSize": 33,
"InMemoryJournalLogSize": 1,
"LastApplied": 70,
"LastCommittedTransactionTime": "1970-01-01 00:00:00.000",
"LastIndex": 70,
"LastLeadershipChangeTime": "2020-05-03 16:54:45.034",
"LastLogIndex": 70,
"LastLogTerm": 7,
"LastTerm": 7,
"Leader": "member-2-shard-default-config",
"LeadershipChangeCount": 1,
"PeerAddresses": "member-3-shard-default-config: akka.tcp://opendaylight-cluster-data#10.30.170.119:2550/user/shardmanager-config/member-3-shard-default-config, member-2-shard-default-config: akka.tcp://opendaylight-cluster-data#10.30.170.113:2550/user/shardmanager-config/member-2-shard-default-config",
"PeerVotingStates": "member-3-shard-default-config: true, member-2-shard-default-config: true",
"PendingTxCommitQueueSize": 0,
"RaftState": "Follower",
"ReadOnlyTransactionCount": 0,
"ReadWriteTransactionCount": 0,
"ReplicatedToAllIndex": 69,
"ShardName": "member-1-shard-default-config",
"SnapshotCaptureInitiated": false,
"SnapshotIndex": 69,
"SnapshotTerm": 7,
"StatRetrievalError": null,
"StatRetrievalTime": "557.3 \u03bcs",
"TxCohortCacheSize": 0,
"VotedFor": "member-2-shard-default-config",
"Voting": true
}
}
Lots of info there, but the RaftState says Follower, so you know this node
is one of the two followers. One node will be the leader.
Another thing we check is the sync status, to make sure it's "true". Use this
URI:
jolokia/read/org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore
The output will look similar; the key field to check is "SyncStatus", which should be true.
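A rough sketch of issuing these jolokia reads with curl; the host, port 8181, and admin/admin credentials are assumptions based on a default Karaf setup, so adjust them to your deployment:
curl -u admin:admin 'http://localhost:8181/jolokia/read/org.opendaylight.controller:Category=Shards,name=member-1-shard-default-config,type=DistributedConfigDatastore'
curl -u admin:admin 'http://localhost:8181/jolokia/read/org.opendaylight.controller:Category=ShardManager,name=shard-manager-operational,type=DistributedOperationalDatastore'
On the other members, replace member-1 with member-2 or member-3 in the shard MBean name.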
We encountered a production incident in which the Elasticsearch cluster health check returned a red status. The health check report shows that .marvel-2019.06.20 has 2 unassigned_shards, which seems to be the root cause.
curl -XGET 'localhost:9200/_cluster/health?level=indices&pretty'
{
"cluster_name" : "sap-jam-jam8",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 122,
"active_shards" : 239,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 7,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"indices" : {
...
...
".marvel-2019.06.20" : {
"status" : "red",
"number_of_shards" : 1,
"number_of_replicas" : 1,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 2
}
}
We checked the Elasticsearch configuration and found that cluster.routing.allocation had been disabled:
curl -XGET 'localhost:9200/_cluster/settings?pretty'
{
"persistent" : { },
"transient" : {
"cluster" : {
"routing" : {
"allocation" : {
"enable" : "none"
}
}
}
}
}
As this Stack Overflow post suggested, we forced the shard to be assigned, and the issue went away:
curl -XPOST -d '{ "commands" : [ {
"allocate" : {
"index" : ".marvel-2014.05.21",
"shard" : 0,
"node" : "SOME_NODE_HERE",
"allow_primary":true
}
} ] }' http://localhost:9200/_cluster/reroute?pretty
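Since the transient setting shown above is what blocked allocation in the first place, it usually also needs to be re-enabled, otherwise shards of newly created indices (such as the next day's marvel index) will stay unassigned. A rough sketch (the Content-Type header is only required on newer Elasticsearch versions):
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_cluster/settings?pretty' -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}'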
After resolving this incident, I think it's necessary to understand the basic concept of shard allocation. I did some research, but the following questions are still confusing to me.
1. Why does Elasticsearch need to assign shards to other nodes?
In my case, we have two Elasticsearch nodes, A and B. Two shards have already been created on A and consume disk space.
When B is not available, why not just activate those two shards on server A?
At least that would give a yellow health status.
2. What is the procedure for assigning a shard?
In the first question, we suppose both the primary shard and the replica have been created on server A. When we say "assign the shard to B", what does that mean?
Does that mean copying the shard from server A to server B?
3. How do we explain zero active shards?
Both the primary shard and the replica have been created, but neither is active. How is that possible? Besides disk storage, is there other overhead to activating a shard, e.g. memory?
".marvel-2019.06.20" : {
"status" : "red",
"number_of_shards" : 1,
"number_of_replicas" : 1,
"active_primary_shards" : 0,
"active_shards" : 0, // both shards are inactive.
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 2
}
4. Is the following assumption true?
To make a shard active, Elasticsearch needs to do the following steps:
Create a shard.
Find a server that has enough disk space and RAM to run it.
Copy this shard from the source server to the destination server.
Activate this shard.
References
Elasticsearch blog: Red Elasticsearch Cluster? Panic no longer
Stack Overflow: elasticsearch - what to do with unassigned shards
I'm no expert, but here are some thoughts:
You have a 2-node cluster. ES will try to allocate shards on both nodes to provide HA in case one of the nodes fails (and also for better read/write performance).
Related to 1: ES tries to allocate shards across nodes to provide HA. If both shards are allocated on node A and you add node B to the cluster, ES will try to move the replica shards to B.
Only the index is created, not the shards. So it tells you "OK, I've managed to create the index, but I can't find anywhere to place its data". You can find out why by running cat shards (https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-shards.html); see the sketch after these points. Shards do cost memory, as their metadata has to be cached.
That's roughly the procedure for adding replica shards, with possibly a few additional conditions (like trying to create the replica on a node which doesn't contain the primary shard, which has the least load, and which can hold the data...).
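For point 3, a rough sketch of using cat shards to see why a shard is unassigned; the unassigned.reason column is available on reasonably recent versions, and the index name is taken from the question:
curl -s 'localhost:9200/_cat/shards/.marvel-2019.06.20?v&h=index,shard,prirep,state,node,unassigned.reason'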
After restarting the cluster I lost my data. How can I recover it?
Elasticsearch version 1.4
{
"cluster_name": "mycluster",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 113,
"active_shards": 113,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 110
}
"status": "yellow" means that not all the replicas could be assigned (actually "unassigned_shards": 110), but all your primary shards and thus all your data is there: "active_primary_shards": 113,
Why are the replicas not allocated? Because you only have a single node in your cluster: "number_of_nodes": 1,. So either only one node is up or if there are multiple ones then they couldn't form a cluster (more details should be in the logs then).
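A quick way to confirm how many nodes actually joined the cluster (assuming curl access to any node on port 9200):
curl 'localhost:9200/_cat/nodes?v'
If only one node shows up here while you expect more, the missing nodes either are not running or could not join, and their logs should say why.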
Generally, your Elasticsearch version is ancient and a lot has been improved around resiliency. If you value your data, start planning an upgrade.
I started elasticsearch.bat and completed my first indexing using the NEST
ElasticClient.Index call.
Then I made my first query using:
var results = ElasticClient.Search<Product>(body =>
body.Query(query =>
query.QueryString(qs => qs.Query(key))));
This is all I have done. Later I restarted the Elasticsearch console using elasticsearch.bat, and now it keeps giving me the error NoShardAvailableException. I deleted and re-downloaded elasticsearch.bat and I keep getting the same error. How can I resolve it?
I am using version 1.7.1, and by the way, I also have the Marvel plugin installed.
Your problem is not related to the version, so updating will not resolve the issue. The issue is that shards cannot be assigned to nodes, as shown by your call; see "status": "red" and "unassigned_shards": 8:
{
"cluster_name": "elasticsearch",
"status": "red",
"timed_out": false,
"number_of_nodes": 2,
"number_of_data_nodes": 2,
"active_primary_shards": 8,
"active_shards": 16,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 8,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0
}
First off, you can try reassigning the unassigned shards using the cluster reroute API (see the Elasticsearch documentation for more on this):
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands": [
{"allocate": {
"index": "{your_index_name}",
"shard": 3,
"node": "{your_assigning_node_ide}",
"allow_primary": true }
}]
}'
Which shards are unassigned? To see this, use:
curl -XGET http://localhost:9200/_cat/shards | grep UNASSIGNED
When you know which shards are causing the problem, you can start by trying to recover the indices, using the indices recovery API:
curl -XGET http://localhost:9200/index1,index2/_recovery
I find the grep UNASSIGNED statement particularly useful if only a couple of shards out of many are unassigned. Sometimes it is just easier (depending, of course, on how easy it is to refill your indices) to delete and refill the index; in that case (delete indices):
curl -XDELETE 'http://localhost:9200/concept_cv,concept_pl,concept_pt/'
Then reinsert your data.
This issue was most probably due to an incorrect shutdown of your cluster, possibly also OOM exceptions. For more information on status red, see:
https://t37.net/how-to-fix-your-elasticsearch-cluster-stuck-in-initializing-shards-mode.html
http://elasticsearch-users.115913.n3.nabble.com/how-to-resolve-elasticsearch-status-red-td4020369.html
I'm seeing very high CPU on my two Elasticsearch nodes, and profiling shows that it's associated with Elasticsearch replication.
I've executed the health status command:
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
which returns this:
{
"cluster_name" : "elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 2003,
"active_shards" : 4006,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 34
}
I can see the status is red, so there is some sort of problem which is presumably resulting in the high CPU.
But how do I find out what is actually wrong so I can rectify it?
I'd start with the debugging approach outlined here:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_cluster_health.html
To recap, try running:
GET _cluster/health?level=indices
This will give you info about which indices are having issues - but in general a red status means you have missing primary and replica shards, which means you have missing data (not good).
I'd take a look at individual node health:
GET _nodes/stats
From there I'd focus on heap (memory) usage and disk usage - in particular look for a full disk. And then I'd be logging into each node separately to check disk usage.
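If curl access is handier than logging in to each box, a rough sketch of pulling just the heap and disk figures (host and port are assumptions):
curl 'localhost:9200/_nodes/stats/jvm,fs?pretty'
curl 'localhost:9200/_cat/allocation?v'
The first call returns per-node JVM heap and filesystem stats; the second gives a one-line-per-node view of disk used, disk available, and shard count.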