How do I figure out indexing errors in Elasticsearch?

I am using ES 1.x and having trouble finding the errors while indexing some documents.
Some documents are not getting indexed, and all I see are the lines below in the ES logs.
stop throttling indexing: numMergesInFlight=2, maxNumMerges=3
now throttling indexing: numMergesInFlight=4, maxNumMerges=3
I did a quick Google search and understand these messages at a high level, but would like to understand the following:
Will ES retry the documents that were throttled?
Is there any way to know which documents were throttled by enabling some detailed logging, and if so, in which classes?
I don't see any error message apart from the INFO logs above. Is there a way to enable verbose logging for indexing that shows exactly what is going on during indexing?

The throttling messages you see in the logs are not the issue. Throttling happens in the background so that Elasticsearch can protect itself against segment explosion; see the explanation here: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#segments-and-merging
Throttling does not drop documents; it just slows down indexing, which applies back pressure to the indexers and external queues.
When indexing fails you should get an error response for the index/bulk request. To tell what the issue is, inspect the responses ES provides for index/bulk requests. Logs might not tell the full story, as that depends on the log level configuration, which is per module in ES.
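To make the response inspection concrete, here is a minimal Python sketch that pulls the failed actions out of a `_bulk` response. The payload below is hypothetical, but its shape (an `items` list with one entry per action) matches what ES returns for bulk requests:

```python
import json

# Hypothetical _bulk response payload for illustration.
bulk_response = json.loads("""
{
  "took": 30,
  "errors": true,
  "items": [
    {"index": {"_index": "logs", "_id": "1", "status": 201}},
    {"index": {"_index": "logs", "_id": "2", "status": 429,
               "error": {"type": "es_rejected_execution_exception",
                         "reason": "rejected execution (queue capacity 50)"}}}
  ]
}
""")

def failed_items(response):
    """Return (_id, status, reason) for every action that did not succeed."""
    failures = []
    if not response.get("errors"):
        return failures
    for item in response["items"]:
        # Each item wraps a single action: "index", "create", "update", "delete".
        action = next(iter(item.values()))
        if action.get("status", 200) >= 300:
            reason = action.get("error", {}).get("reason", "unknown")
            failures.append((action["_id"], action["status"], reason))
    return failures

print(failed_items(bulk_response))
```

Checking `errors` first is cheap, and iterating `items` tells you exactly which documents were rejected and why, which the server logs alone will not.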
Another possibility is that you are able to index, but the logs don't have the timestamp you think they do. Check _cat/indices to see whether the doc count increases when you index. If it does, the indexed documents are there and you need to refine your searches.

Elasticsearch does not retry, to the best of my knowledge; that is up to the client (though I haven't used 1.x in quite some time).
Logstash, for example, retries batches that get 503 and 429 responses for exactly these kinds of reasons: https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/master/lib/logstash/outputs/elasticsearch.rb#L55
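A client-side retry loop for the 429/503 case could be sketched like this; the `send` callable and the simulated responses are assumptions for illustration, not an ES client API:

```python
import time

# Status codes worth retrying: 429 (too many requests) and 503 (unavailable).
RETRYABLE = {429, 503}

def send_with_retry(send, payload, max_attempts=5, base_delay=0.5):
    """Call `send(payload)` until it returns a non-retryable status,
    backing off exponentially between attempts."""
    for attempt in range(max_attempts):
        status = send(payload)
        if status not in RETRYABLE:
            return status
        time.sleep(base_delay * (2 ** attempt))
    return status

# Simulated endpoint that throttles the first two attempts, then succeeds.
responses = iter([429, 503, 200])
status = send_with_retry(lambda _: next(responses), {"doc": 1}, base_delay=0)
print(status)  # 200
```

The key point is that the retry decision lives in the client: ES itself reports the rejection and moves on.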

Related

Is there any way to check the relocation progress of shards in elasticsearch?

Yesterday I was adding a node to a production Elasticsearch cluster. Once I added it, I could use the /_cat/health API to check the number of relocating shards, and the /_cat/shards API to check which shards are being relocated. However, is there any way or API to check the live progress of the shard/data movement to the newly added node? Suppose there is a 13 GB shard and we've added a node to the ES cluster: can we check what percentage (or how many GBs, MBs, or KBs) has moved so far, so that we can estimate how much time the relocation will take?
Can this be implemented on our own, or should we suggest it to Elasticsearch? If it can be implemented on our own, how do I proceed and what prerequisites do I need to know?
You have:
GET _cat/recovery?active_only=true&v
GET _cat/recovery?active_only=true&h=index,shard,source_node,target_node,bytes_percent,time
https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-recovery.html
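If you want a rough time estimate on top of what _cat/recovery reports, you can compute one from the bytes_percent and elapsed-time columns. A Python sketch, using a hypothetical output line and assuming a constant transfer rate (real recovery speed varies, so treat this as an estimate only):

```python
# Hypothetical _cat/recovery line for
# h=index,shard,source_node,target_node,bytes_percent,time
# (the real `time` column may use other units; seconds are assumed here).
cat_line = "myindex 0 node-1 node-2 37.5% 214s"

def recovery_eta(bytes_percent, elapsed_seconds):
    """Seconds remaining, assuming the transfer rate so far stays constant."""
    pct = float(bytes_percent.rstrip("%"))
    if pct == 0:
        return None  # no progress yet, no basis for an estimate
    return elapsed_seconds * (100 - pct) / pct

index, shard, src, dst, pct, elapsed = cat_line.split()
eta = recovery_eta(pct, float(elapsed.rstrip("s")))
print(round(eta))  # 357
```

So the answer to "can this be implemented on our own" is yes: the cluster already exposes the raw numbers, and the estimation is simple arithmetic on top of them.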
Take a look at the Task Management API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tasks.html
The task management API returns information about tasks currently executing on one or more nodes in the cluster.
GET /_tasks
You can also see the reasons for the allocation using the allocation explain API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html
GET _cluster/allocation/explain

How to set up Elasticsearch to concurrently handle 20k POST requests?

We are trying to collect performance metrics from about 20k servers and POST the data to Elasticsearch using the curl command below, to analyze the data further:
curl \
  -XPOST "$ELASTICSEARCH_URL/sariovm/sar/" \
  -H 'Content-Type: application/json' \
  -d '{ "#timestamp": "'"$DATE3"'", "cpu": '"$cpu"', "iowait": '"$iowait"', "swapips": '"$swapips"', "swapops": '"$swapops"', "hostname": "'"$HOSTNAME"'" }'
Currently we have tested it with 80+ POST requests to Elasticsearch, and we have set up only a single node to handle the requests. How do we set up Elasticsearch to scale to handle 20k+ POST requests?
Assuming you are tracking metrics from 20k servers, it could well be 20k requests per second: since you want to aggregate without an exact frequency in your use case, all 20k servers could send their CPU usage at the same time, why not.
You need to benchmark, and you should start with the default deployment: 3 nodes, 1 master, green cluster. Read about the Elasticsearch node types, with special attention to data nodes and ingest nodes. In conclusion, start with the default deployment and benchmark, tune, and keep benchmarking, since every use case is special. Yours looks like one Elasticsearch was made for; read about Beats, Logstash, and Kibana.
In my personal opinion, if you don't have much budget and you don't care about hard real-time, there are other ways to handle this, such as storing the 20k metrics per second in Kafka, which is great at handling high write throughput, and then using Logstash to feed Elasticsearch at whatever rate your cluster supports. Obviously this adds Kafka to your list of royal pains: problems we like, because we know there is always a solution and fun times ahead.
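Whatever the transport, batching documents into _bulk requests instead of sending one POST per metric usually lets a cluster go much further. A minimal Python sketch of building an NDJSON bulk body (index and type names are taken from the question; the helper itself is hypothetical, and newer ES versions drop document types entirely):

```python
import json

def bulk_body(index, doc_type, docs):
    """Build an NDJSON _bulk body: one action line plus one source line
    per document. POST it to /_bulk with Content-Type: application/x-ndjson."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

metrics = [
    {"#timestamp": "2017-06-01T12:00:00Z", "cpu": 42.0, "hostname": "srv-001"},
    {"#timestamp": "2017-06-01T12:00:00Z", "cpu": 17.5, "hostname": "srv-002"},
]
body = bulk_body("sariovm", "sar", metrics)
print(body.count("\n"))  # 4 lines: two action lines + two source lines
```

Grouping even a few hundred metrics per request turns 20k documents per second into a few dozen HTTP requests per second, which is a very different load profile.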
It really depends.
20k+ POSTs per what? Per second? Per hour? Per day? You'll need that information.
Also, by using a single node you're ignoring what is, in my opinion, Elasticsearch's biggest advantage (which is, of course, support for scaling out).
It also depends on the size of each POST.
You'll need a lot more information to answer this question, but what I recommend (and what Elastic recommends) is to simply try: start with some nodes, start indexing,
and add resources until you reach your goal.

Connecting NiFi to ElasticSearch

I'm trying to solve a task and would appreciate any help: links to documentation or forums, other FAQs besides https://cwiki.apache.org/confluence/display/NIFI/FAQs, or any meaningful answer in this post =).
So, I have the following task:
The initial part of my system collects data every 5-15 minutes from different DB sources. Then I remove duplicates, remove junk, combine data from the different sources according to my logic, and redirect it to the second part of the system as several streams.
As far as I know, NiFi can do this task in the best way =).
Currently I can successfully get information from InfluxDB with the "GetHTTP" processor. However, I can't configure the same kind of processor for getting information from Elasticsearch with all the necessary options. I'd like to receive data every 5-15 minutes for the time period from "now minus <5-15 minutes>" to "now" (depending on the scheduler period), with several additional filters. If I understand it right, this can be achieved either by subscription to "_index" or by regular requests to the DB at the desired interval.
I know that NiFi has several Processors designed specifically for Elasticsearch (FetchElasticsearch5, FetchElasticsearchHttp, QueryElasticsearchHttp, ScrollElasticsearchHttp), as well as the GetHTTP and PostHTTP Processors. Unfortunately, I lack information, or even better, examples of how to configure their "Properties" for my purposes =(.
What's the difference between FetchElasticsearchHttp and QueryElasticsearchHttp? Which one fits my task better? What's the difference between GetHTTP and QueryElasticsearchHttp besides several specific fields? Will GetHTTP perform the same way if I tune it as needed?
Any advice?
I will be grateful for any help.
The ElasticsearchHttp processors try to make it easier to interact with ES by generating the appropriate REST API call based on the properties you set. If you know the full URL you need, you could use GetHttp or InvokeHttp. However, the ESHttp processors let you put in just the things you're looking for, and they will generate the URL and return the results.
FetchElasticsearch (and its variants) is used to get a particular document when you know the identifier. This is sometimes used after a search/query, to return documents one at a time after you know which ones you want.
QueryElasticsearchHttp is for when you want to do a Lucene-style query of the documents, when you don't necessarily know which documents you want. It will only return up to the value of index.max_result_window for that index. To get more records, you can use ScrollElasticsearchHttp afterwards. NOTE: QueryElasticsearchHttp expects a query that will work as the "q" parameter of the URL. This "mini-language" does not support all fields/operators (see here for more details).
For your use case, you likely need InvokeHttp in order to issue the kind of query you describe. This article describes how to issue a query for the last 15 minutes. Once your results are returned, you might need some combination of EvaluateJsonPath and/or SplitJson to work with the individual documents, see the Elasticsearch REST API documentation (and NiFi processor documentation) for more details.
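As a concrete starting point for the time-window part, the body you would POST via InvokeHttp could be a range query using Elasticsearch date math. A small Python sketch that builds it (the `@timestamp` field name is an assumption; use whatever date field your index has):

```python
import json

def last_minutes_query(field, minutes):
    """Full query-DSL body matching documents whose date `field`
    falls in [now - <minutes>m, now], using ES date math."""
    return {
        "query": {
            "range": {
                field: {"gte": "now-%dm" % minutes, "lte": "now"}
            }
        }
    }

body = json.dumps(last_minutes_query("@timestamp", 15))
print(body)
```

Because this is full query DSL rather than the `q` mini-language, it belongs in a POSTed body (InvokeHttp) rather than in QueryElasticsearchHttp's `q` parameter.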

How can I see what is happening under the covers when an Elasticsearch query is executed?

For Elasticsearch 1.7.5 (or earlier), how can I see what steps Elasticsearch takes to handle my queries?
I attempted to turn debugging on by setting es.logger.level=DEBUG, but while that produced a fair amount of information at startup and shutdown, it doesn't produce anything when queries are executed. Looking at the source code, apparently most of the debug logging for searches is just for exceptional situations.
I am trying to understand query performance. We're seeing Elasticsearch do way more I/O than we expected, on a very simple term query on an unanalyzed field.
With ES 1.7.5 and earlier versions, you can use the ?explain=true URL parameter when sending your query, and you'll get more insight into how the score was computed.
Also, starting with ES 2.2 there is a new Profile API, which you can use to get insight into timing information while the different query components are executed. Simply add "profile": true to the search body payload and you're good to go.
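A sketch of both request bodies side by side; the `status: active` term query is just a placeholder for your own unanalyzed-field query:

```python
import json

# Placeholder term query standing in for the question's real query.
query = {"query": {"term": {"status": "active"}}}

# On 1.7.x: send the query as-is and append ?explain=true to the URL,
# or put "explain": true into the body; scoring detail comes back per hit.
explain_body = dict(query, explain=True)

# On 2.2+: "profile": true in the body returns per-component timings.
profile_body = dict(query, profile=True)

print(json.dumps(profile_body, sort_keys=True))
```

Note the difference in what they measure: explain tells you why a document scored as it did, while profile tells you where the time went, which is what you want for the I/O question.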

How to kill the thread of a search request on an Elasticsearch cluster? Is there an API to do this?

I built an Elasticsearch cluster holding big data, and clients can send search requests to it.
Sometimes the cluster takes a long time to handle a single request.
My question is: is there any API to kill a specific thread that is taking too much time?
I wanted to follow up on this answer now that Elasticsearch 1.0.0 has been released. I am happy to announce that new functionality has been introduced that implements some protection for the heap, called the circuit breaker.
With the current implementation, the circuit breaker tries to anticipate how much data is going to be loaded into the field data cache, and if it's greater than the limit (80% by default) it will trip the circuit breaker and thereby kill your query.
There are two parameters for you to set if you want to modify them:
indices.fielddata.breaker.limit
indices.fielddata.breaker.overhead
The overhead is the constant that is used to estimate how much data will be loaded into the field cache; this is 1.03 by default.
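As a sketch, the corresponding persistent cluster settings update body (PUT to _cluster/settings) might look like the following. The 70% value is an arbitrary example rather than a recommendation, and note that later ES versions rename these settings to the indices.breaker.fielddata.* family:

```python
import json

# Persistent settings body for PUT _cluster/settings, using the
# 1.x-era field data breaker setting names from the answer above.
settings = {
    "persistent": {
        "indices.fielddata.breaker.limit": "70%",   # default is 80%
        "indices.fielddata.breaker.overhead": 1.03  # estimation constant
    }
}
print(json.dumps(settings))
```

Lowering the limit makes the breaker trip earlier, trading more rejected queries for a safer heap.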
This is an exciting development in Elasticsearch and a feature I had been waiting months to see implemented.
This is the pull request if interested in seeing how it was made; thanks to dakrone for getting this done!
https://github.com/elasticsearch/elasticsearch/pull/4261
Hope this helps,
MatthewJ
Currently it is not possible to kill or stop long-running queries, but Elasticsearch is going to add a task management API to do this. The API is likely to be added in Elasticsearch 5.0, maybe in 2016 or later.
See Task management 1 and Task management 2.
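Once that API exists (ES 5.0+), cancelling a search amounts to POSTing to a task's _cancel endpoint. A small sketch of the URLs involved, with a hypothetical task id:

```python
# List currently running search tasks with details:
list_path = "/_tasks?actions=*search*&detailed"

# Hypothetical task id in the <node_id>:<task_number> form
# reported by the listing above.
task_id = "oTUltX4IQMOUUVeiohTt8A:12345"

# POST here to ask the node to cancel that task.
cancel_path = "/_tasks/" + task_id + "/_cancel"
print(cancel_path)
```

Cancellation is cooperative: the task stops at its next cancellation check rather than being killed mid-instruction, which is why there is no thread-kill API.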
