Finding out on which data path a shard is located in Elasticsearch - elasticsearch

I have multiple path.data entries configured for my Elasticsearch cluster.
The official documentation states that only a single path is used for a single shard, so a shard is never split across multiple paths.
I'd like to find a way to determine which path on which node is used for a specific shard (primary or replica), like index my-index primary shard 0 → node RQzJvAgLTDOnEnmIjYU9FA path /mnt/data1. I tried /_nodes, /_stats, /_segments, and /_shard_stores, but there are no references to paths in any of them.

You can find that info using the indices stats API by specifying the level=shards parameter:
GET index/_stats?level=shards
It will return a structure like this:
"indices": {
"listings-master": {
"primaries": {
...
},
"total": {
...
},
"shards": {
"0": [
{
"shard_path": {
"state_path": "/app/data/nodes/0",
"data_path": "/app/data/nodes/0",
"is_custom_data_path": false
},
...
}
...
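If you only need the path and node for each shard copy, the response filtering parameter filter_path (available on all REST APIs) can trim the output down; a sketch, with the index name assumed:
GET my-index/_stats?level=shards&filter_path=indices.*.shards.*.shard_path.data_path,indices.*.shards.*.routing.node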

Not easily, but with a small Python script I got the info I want. Here is the script:
import json

# shard.json contains the output of GET _stats?level=shards
with open('shard.json') as json_file:
    data = json.load(json_file)

indices = data['indices']
for indice in indices:
    shards = indices[indice]['shards']
    for nshard in shards.keys():
        # each shard number maps to a list of copies (primary + replicas)
        for elt in shards[nshard]:
            path = elt['shard_path']['data_path']
            node = elt['routing']['node']
            print(indice, '\t', nshard, '\t', node, '\t', path)
Then you obtain output like:
log-2020.11.06 1 oxx /datassd/elasticsearch/nodes/0
log-2020.11.06 0 oxx /datassd/elasticsearch/nodes/0
log-2020.11.05 1 oxx /datassd/elasticsearch/nodes/0
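For completeness, the shard.json input above can be produced by saving the stats output first, e.g. (host assumed):
curl -s 'http://localhost:9200/_stats?level=shards' -o shard.json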

Related

Redis clients can update cache simultaneously causing wrong state to be saved

I have been building a simple application that uses Redis as a cache to store data regarding a game where each user has a score, and after a user completes a task the score is updated for that user.
My problem is that when a user completes a task, his score is updated, which means the record in Redis is replaced with the new value (in my case the entire room object is replaced with a new one, even though the room itself has not changed and only the score of a player inside the room has changed).
The thing is, if multiple users complete a task at the same time, each of them sends the new record to Redis at the same time, and only the last write takes effect.
For example:
In the redis cache this is the starting value: { roomId: "...", score:[{ "player1": 0 }, { "player2": 0 }] }
Player 1 completes a task and sends:
{ roomId: "...", score:[{ "player1": 1 }, { "player2": 0 }] }
At the same time Player 2 completes a task and sends:
{ roomId: "...", score:[{ "player1": 0 }, { "player2": 1 }] }
In the Redis cache, say the value received from Player 1 is saved first and then the value from Player 2, which means that the new value in the cache will be:
{ roomId: "...", score:[{ "player1": 0 }, { "player2": 1 }] }
This is wrong, because the correct value would be { roomId: "...", score:[{ "player1": 1 }, { "player2": 1 }] }, where both changes are present.
At the moment I am also using a pub/sub system to keep track of changes so that those are reflected to every server and each user connected to a server.
What can I do to fix this? For reference, consider the following image as the architecture of the system:
The issue appears to be that you are interleaving one read/write set of operations with others, which leads to using stale data while updating keys. Fortunately, the fix is (relatively) easy: just combine your read/write chunk of operations into a single atomic unit, using either a Lua script, a transaction, or, even easier, a single RedisJSON command.
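If you keep plain string values instead of moving to RedisJSON, the transaction route can be implemented with optimistic locking (WATCH/MULTI/EXEC). Here is a minimal redis-py sketch; the key name, score layout, and increment_score helper are assumptions for illustration, not part of the original setup:

import json
import redis

r = redis.Redis()

def increment_score(room_key, player, delta=1):
    # Optimistic locking: WATCH the key, re-read it, modify, then EXEC.
    # If another client writes the key in between, execute() raises
    # WatchError and we retry with fresh data instead of clobbering
    # the concurrent update.
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(room_key)
                room = json.loads(pipe.get(room_key))
                for entry in room['score']:
                    if player in entry:
                        entry[player] += delta
                pipe.multi()
                pipe.set(room_key, json.dumps(room))
                pipe.execute()
                return room
            except redis.WatchError:
                continue  # the key changed under us; retry

Under contention this retries rather than losing updates, which is exactly the failure mode described above.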
Here is an example using RedisJSON. First, prepare the JSON key/document that will hold all the scores for the room, using the JSON.SET command:
> JSON.SET room:foo $ '{ "roomId": "foo", "score": [] }'
OK
After that, use the JSON.ARRAPPEND command once you need to append an item to the score array:
> JSON.ARRAPPEND room:foo $.score '{ "player1": 123 }'
1
...
> JSON.ARRAPPEND room:foo $.score '{ "player2": 456 }'
2
Getting back the whole JSON document is as easy as running:
> JSON.GET room:foo
"{\"roomId\":\"foo\",\"score\":[{\"player1\":123},{\"player2\":456}]}"

Conditional indexing not working in ingest node pipelines

I am trying to implement an index template with data stream enabled and then set a condition in ingest node pipelines, so that I could get metrics with the below-mentioned index format:
.ds-metrics-kubernetesnamespace
I had tried this some time back, doing the things mentioned above, and it was giving metrics in that format, but now when I implement the same it's not changing anything in my index. I cannot see any logs in the OpenShift cluster, so ingest seems to be working fine (when I add a doc and test, it works fine).
PUT _ingest/pipeline/metrics-index
{
  "processors": [
    {
      "set": {
        "field": "_index",
        "value": "metrics-{{kubernetes.namespace}}",
        "if": "ctx.kubernetes?.namespace==\"dev\""
      }
    }
  ]
}
This is the ingest node condition I have used for indexing. Below is my Metricbeat configuration:
metricbeatConfig:
  metricbeat.yml: |
    metricbeat.modules:
      - module: kubernetes
        enabled: true
        metricsets:
          - state_node
          - state_daemonset
          - state_deployment
          - state_replicaset
          - state_statefulset
          - state_pod
          - state_container
          - state_job
          - state_cronjob
          - state_resourcequota
          - state_service
          - state_persistentvolume
          - state_persistentvolumeclaim
          - state_storageclass
          - event
Since you're using Metricbeat, you have another way to do this which is much better.
Simply configure your elasticsearch output like this:
output.elasticsearch:
  hosts: ["http://<host>:<port>"]
  indices:
    - index: "%{[kubernetes.namespace]}"
      mappings:
        dev: "metrics-dev"
      default: "metrics-default"
or like this:
output.elasticsearch:
  hosts: ["http://<host>:<port>"]
  indices:
    - index: "metrics-%{[kubernetes.namespace]}"
      when.equals:
        kubernetes.namespace: "dev"
      default: "metrics-default"
or simply like this, which would also work if you have plenty of different namespaces and don't want to manage different mappings:
output.elasticsearch:
  hosts: ["http://<host>:<port>"]
  index: "metrics-%{[kubernetes.namespace]}"
Steps to create data streams in the Elastic Stack:
Create an ILM policy.
Create an index template whose index pattern matches the index pattern of the metrics/logs (set the number of primary/replica shards and the mapping in the index template; see the sketch after this list).
Set a condition in the ingest pipeline (make sure no index with that name already exists).
If these conditions are met, a data stream is created and the logs/metrics get a backing index starting with .ds-, which is hidden in index management.
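For illustration, a minimal index template that enables a data stream might look like this (the template name, index pattern, and settings below are assumptions):
PUT _index_template/metrics-dev-template
{
  "index_patterns": ["metrics-dev*"],
  "data_stream": {},
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1,
      "index.lifecycle.name": "my-ilm-policy"
    }
  }
}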
In my case the issue was that I did not have enough permissions to create a custom index. When I checked my OpenShift logs I found Metricbeat complaining about the privilege, so I granted superuser permission and then used an ingest node pipeline to set up the conditional indexing:
PUT _ingest/pipeline/metrics-index
{
  "processors": [
    {
      "set": {
        "field": "_index",
        "value": "metrics-{{kubernetes.namespace}}",
        "if": "ctx.kubernetes?.namespace==\"dev\""
      }
    }
  ]
}
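To double-check that the condition actually matches your documents, the simulate API is handy; a sketch with a minimal test doc (field values assumed):
POST _ingest/pipeline/metrics-index/_simulate
{
  "docs": [
    {
      "_source": {
        "kubernetes": {
          "namespace": "dev"
        }
      }
    }
  ]
}
If the set processor fires, the simulated doc's _index shows up as metrics-dev.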

ElasticSearch BulkShardRequest failed due to org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor

I am storing logs in Elasticsearch from my reactive Spring application. I am getting the following error in Elasticsearch:
Elasticsearch exception [type=es_rejected_execution_exception, reason=rejected execution of processing of [129010665][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[logs-dev-2020.11.05][1]] containing [index {[logs-dev-2020.11.05][_doc][0d1478f0-6367-4228-9553-7d16d2993bc2], source[n/a, actual length: [4.1kb], max length: 2kb]}] and a refresh, target allocation id: WwkZtUbPSAapC3C-Jg2z2g, primary term: 1 on EsThreadPoolExecutor[name = 10-110-23-125-common-elasticsearch-apps-dev-v1/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#6599247a[Running, pool size = 2, active threads = 2, queued tasks = 221, completed tasks = 689547]]]
My index settings:
{
  "logs-dev-2020.11.05": {
    "settings": {
      "index": {
        "highlight": {
          "max_analyzed_offset": "5000000"
        },
        "number_of_shards": "3",
        "provided_name": "logs-dev-2020.11.05",
        "creation_date": "1604558592095",
        "number_of_replicas": "2",
        "uuid": "wjIOSfZOSLyBFTt1cT-whQ",
        "version": {
          "created": "7020199"
        }
      }
    }
  }
}
I have gone through this article:
https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster
I thought adjusting the "write" queue size in the thread pool would resolve this, but it is mentioned as not recommended there, as below:
Adjusting the queue sizes is therefore strongly discouraged, as it is like putting a temporary band-aid on the problem rather than actually fixing the underlying issue.
So what else can we do to improve the situation?
Other info:
Elasticsearch version 7.2.1
Cluster health is good and there are 3 nodes in the cluster
Indices are created on a daily basis, with 3 shards per index
While you are right that increasing the thread_pool size is not a permanent solution, you will be glad to know that Elasticsearch itself increased the queue size of the write thread_pool (used for your bulk requests) from 200 to 10k in just a minor version upgrade. Please see the size of 200 in ES 7.8 versus 10k in ES 7.9.
If you are using an ES 7.x version, then you can also increase the queue size to, if not 10k, then at least 1k (to avoid rejecting the requests).
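If you do raise it as a stopgap, note that it is a static node setting that goes into elasticsearch.yml on each node and requires a restart; a sketch (the value is just an example):
thread_pool.write.queue_size: 1000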
If you want a proper fix, you need to do the below things:
Find out whether it's consistent or just a short-duration burst of write requests that gets cleared after some time.
If it's consistent, then figure out whether you have all the write optimizations in place; please refer to my short tips to improve indexing speed.
See whether you have reached the full capacity of your data nodes, and if yes, scale your cluster to handle the increased/legitimate load.

How to delete data from a particular shard

I have got an index with 5 primary shards and no replicas.
One of my shards (shard 1) is in an unassigned state. When I checked the log file, I found the error below:
2obv65.nvd, _2vfjgt.fdx, _3e3109.si, _3dwgm5_Lucene45_0.dvm, _3aks2g_Lucene45_0.dvd, _3d9u9f_76.del, _3e30gm.cfs, _3cvkyl_es090_0.tim, _3e309p.nvd, _3cvkyl_es090_0.blm]]; nested: FileNotFoundException[_101a65.si]; ]]
When I checked the index, I could not find the _101a65.si file for shard 1.
I am unable to locate the missing .si file. I tried a lot but could not get shard 1 assigned again.
Is there any other way to make shard 1 assigned again, or do I need to delete the entire shard 1 data?
Please suggest.
Normally in the stack trace you should see the path to the corrupted shard, something like MMapIndexInput(path="path/to/es/db/nodes/node_number/indices/name_of_index/1/index/some_file") (here the 1 is the shard number).
Normally deleting path/to/es/db/nodes/node_number/indices/name_of_index/1 should help the shard recover. If you still see it unassigned, try sending this command to your cluster (as per the documentation it should normally work, though I'm not sure about the ES 1.x syntax and commands):
POST _cluster/reroute
{
  "commands": [
    {
      "allocate": {
        "index": "myIndexName",
        "shard": 1,
        "node": "myNodeName",
        "allow_primary": true
      }
    }
  ]
}
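For reference, on Elasticsearch 5.x and later the bare allocate command was split up, and force-allocating a primary while accepting data loss looks like this instead (a sketch; verify against your version's documentation):
POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "myIndexName",
        "shard": 1,
        "node": "myNodeName",
        "accept_data_loss": true
      }
    }
  ]
}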

Indexing tuples from storm to elasticsearch with elasticsearch-hadoop library does not work

I want to index documents into Elasticsearch from Storm, but I couldn't get any document to be indexed.
In my topology I have a KafkaSpout that emits JSON like { "tweetId": 1, "text": "hello" } to an EsBolt, a native bolt from the elasticsearch-hadoop library that writes Storm tuples to Elasticsearch (the doc is here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/storm.html).
These are the configs for my EsBolt:
Map conf = new HashMap();
conf.put("es.nodes","127.0.0.1");
conf.put("es.port","9200");
conf.put("es.resource","twitter/tweet");
conf.put("es.index.auto.create","no");
conf.put("es.input.json", "true");
conf.put("es.mapping.id", "tweetId");
EsBolt elasticsearchBolt = new EsBolt("twitter/tweet", conf);
The first two configurations have these values by default, but I chose to set them explicitly. I have also tried without them, getting the same result.
And this is how I build my topology:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(TWEETS_DATA_KAFKA_SPOUT_ID, kafkaSpout, kafkaSpoutParallelism)
.setNumTasks(kafkaSpoutNumberOfTasks);
builder.setBolt(ELASTICSEARCH_BOLT_ID, elasticsearchBolt, elasticsearchBoltParallelism)
.setNumTasks(elasticsearchBoltNumberOfTasks)
.shuffleGrouping(TWEETS_DATA_KAFKA_SPOUT_ID);
return builder.createTopology();
Before I run the topology locally I create the "twitter" index in Elasticsearch and a mapping "tweet" for this index.
This is what I get if I retrieve the mapping for my newly created type (curl -XGET 'http://localhost:9200/twitter/_mapping/tweet'):
{
  "twitter": {
    "mappings": {
      "tweet": {
        "properties": {
          "text": {
            "type": "string"
          },
          "tweetId": {
            "type": "string"
          }
        }
      }
    }
  }
}
I run the topology locally and this is what I get in my console when processing a tuple:
Processing received message FOR 6 TUPLE: source: tweets-data-kafka-spout:9, stream: default, id: {-8010897758788654352=-6240339405307942979}, [{"tweetId":"1","text":"hello"}]
Emitting: elasticsearch-bolt __ack_ack [-8010897758788654352 -6240339405307942979]
TRANSFERING tuple TASK: 2 TUPLE: source: elasticsearch-bolt:6, stream: __ack_ack, id: {}, [-8010897758788654352 -6240339405307942979]
BOLT ack TASK: 6 TIME: TUPLE: source: tweets-data-kafka-spout:9, stream: default, id: {-8010897758788654352=-6240339405307942979}, [{"tweetId":"1","text":"hello"}]
Execute done TUPLE source: tweets-data-kafka-spout:9, stream: default, id: {-8010897758788654352=-6240339405307942979}, [{"tweetId":"1","text":"hello"}] TASK: 6 DELTA:
So the tuples seem to be processed. However, no documents are indexed in Elasticsearch.
I suppose I am doing something wrong when I set the configurations for EsBolt, maybe missing a configuration or something.
Documents will only be indexed once you reach the flush size, specified by es.storm.bolt.flush.entries.size.
Alternatively, you may set a tick frequency that triggers a queue flush:
config.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
By default, es-hadoop flushes on tick, as per the es.storm.bolt.tick.tuple.flush parameter.
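For quick local testing, one option is to shrink the flush size in the same conf map shown in the question, so every tuple is written out immediately (a sketch; the value of 1 is just an example):
conf.put("es.storm.bolt.flush.entries.size", "1");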
I also hit the same issue. When I looked through the es-hadoop documentation, I found it was because I had not set the frequency that triggers a queue flush. I then added a configuration (es.storm.bolt.flush.entries.size) to my Storm topology and it worked fine. But when we set a value for Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, it throws an exception, java.lang.RuntimeException: java.lang.NullPointerException, in the bolt's execute function. Using debug mode to test my topology, I found that the input tuple in the bolt's execute function doesn't contain any entries, yet this empty tuple is still triggered.
That's what confuses me: isn't the tuple supposed to be emitted according to the configured interval, even if it is empty, once Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS is set? I think this is a bug.
For more information see: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/storm.html
