Investigating slow queries in ElasticSearch - performance

We are using elastic search version 5.4.1 in our production environments. The cluster setup is 3 data, 3 query, 3 master nodes. Of late we are observing a lot of slow queries in a particular data node and the [index][shard] present in that are just replicas.
I don't find many deleted docs or memory issues that could directly cause the slowness.
Any pointers on how to go about the investigation here would be helpful.
Thanks!

Many things are happening during one ES query. First, check the took field returned by ElasticSearch.
took – time in milliseconds for Elasticsearch to execute the search
However, the took field is the time that it
took ES to process the query on its side. It doesn't include
serializing the request into JSON on the client
sending the request over the network
deserializing the request from JSON on the server
serializing the response into JSON on the server
sending the response over the network
deserializing the response from JSON on the client
As such, I think you should try to identify the exact step that is slow.
Reference: Query timing: ‘took’ value and what I’m measuring

Related

Elasticsearch + Logstash - Indexing data from different sources when receiving an event

Good day,
I have a gotten into a bit of a headache when working on indexing some data in Elasticsearch and have some questions about a good approach.
As of now, an event is received on a Kafka topic with just a part of the data that should be stored in the document. The rest of the data needs to be collected after the event is received and is available from different APIs. To reduce the amount of work, it seems that Logstash could be a good approach.
Is there a way to configure Logstash to initiate data collection from different APIs and DBs when an event is received, and then assemble the document with the combined date, or am I stuck with writing time consuming custom logic for the indexing? I have searched around a bit, but couldn't find any good answer on the problem.
What you need in logstash is to lookup/enrich you message with data from external api's, right?
You could use logstash's http_filter plugin

How can we get the turnabout time for any Elasticsearch query?

I am trying to find out what is the actual time taken by my ES cluster to return the query result to the client.
I understand from googling and reading that took time which we get in ES response can't be taken as the correct measure. Is there any other way or logs which I can enable , because if I enable the slow logs it also gives to took time.
The reason why I am looking for this is while querying ES from my client I am seeing took time and the time by which I get the response is huge. And I am unable to trace/pinpoint the reason, is it because of lag in between servers or ES is consuming more time to send the response.

Elasticsearch delete_by_query version conflict

According to ES documentation document indexing/deletion happens as follows:
Request received at one of the nodes.
Request forwarded to the document's primary shard.
The operation performed on the primary shard and parallel requests sent to replica nodes.
Primary shard node waits for a response from replica nodes and then send the response to the node where the request was originally received.
Send the response back to the client.
Now in my case, I am sending a create document request to ES at time t and then sending a request to delete the same document (using delete_by_query) at approximately t+800 milliseconds. These requests are sent via a messaging system (internal implementation of kafka) which ensures that the delete request will be sent to ES only after receiving 200 OK response for the indexing operation from ES.
According to ES documentation, delete_by_query throws a 409 version conflict only when the documents present in the delete query have been updated during the time delete_by_query was still executing.
In my case, it is always guaranteed that the delete_by_query request will be sent to ES only when a 200 OK response has been received for all the documents that have to be deleted. Hence there is no possibility of an update/create of a document that has to be deleted during delete_by_query operation.
Please let me know if I am missing something or this is an issue with ES.
Possible reason could be due to the fact that when a document is created, it is not "committed" to the index immediately.
Elasticsearch indices operate on a refresh_interval, which defaults to 1 second.
This documentation around refresh cycles is old, but I cannot for the life of me find anything as descriptive in the more modern ES versions.
A few things you can try:
Send _refresh with your request
Add ?refresh=wait_for or ?refresh=true param
Note that refreshing the index on every indexing request is terrible for performance, which begs the question as to why you are trying to delete a document immediately after indexing it.
add
deleteByQueryRequest.setAbortOnVersionConflict(false);

How A Solr Request Takes Too Much Time

I have 3 nodes Solr data center. I am trying to redirect all queries to node1 using solr http api because i think i have problems with node2, node3. I will replace them. I enabled datastax solr slow query metric. I see two main problem.
Even though i set shard.shuffling.strategy=host
Document says that
host
Shards are selected based on the host that received the query.
and i expect that when request http://node1:8983/solr/.... , the coordinator_ip and node_ip columns in solr_slow_sub_query_log table will be the same. When i get the records, i see 80% percentage is node1. Is not that wrong? I expect 100% request use node1.
When i get records from solr_slow_sub_query_log, i see that the rows coordinator_id=node1 and node_ip=node2ornode3 has too much elapsed_millis such as 1300 seconds even though document says netty_client_request_timeout is 60 seconds.

elastic search index strategies under high traffic

We use ElasticSearch for our tool's real time metrics and analytics part. ElasticSearch is very cool and fast when we are query our data. (statiticial facets and terms facet)
But we have problem when we try to index our hourly data. We collect every our metric data from other services. First we collect data from other services and save them RabbitMQ process. But when queue worker runs our all hourly data not index to ES. Usually %40 of data index in ES and other them lost.
So what is your idea about when index ES under high traffic ?
I've posted answers to other similar questions:
Ways to improve first time indexing in ElasticSearch
Performance issues using Elasticsearch as a time window storage (latter part of my answer applies)
Additionally, instead of a custom 'queue worker' have you considered using a 'river'? For more information see:
http://www.elasticsearch.org/blog/the-river/
http://www.elasticsearch.org/guide/reference/river/

Resources