How can I see what is happening under the covers when an Elasticsearch query is executed?

For Elasticsearch 1.7.5 (or earlier), how can I see what steps Elasticsearch takes to handle my queries?
I attempted to turn debugging on by setting es.logger.level=DEBUG, but while that produced a fair amount of information at startup and shutdown, it doesn't produce anything when queries are executed. Looking at the source code, apparently most of the debug logging for searches is just for exceptional situations.
I am trying to understand query performance. We're seeing Elasticsearch do way more I/O than we expected, on a very simple term query on an unanalyzed field.

With ES 1.7.5 and earlier versions, you can use the ?explain=true URL parameter when sending your query and you'll get some more insights into how the score was computed.
Also starting with ES 2.2, there is a new Profile API which you can use to get more insights into timing information while the different query components are being executed. Simply add "profile": true to the search body payload and you're good to go.
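For illustration, both options can be sketched as plain request pieces. This is a minimal sketch; the index name and term field below are made-up examples, not taken from the question:

```python
import json

# Hypothetical index and field names, for illustration only.
query = {"query": {"term": {"status": "published"}}}

# ES 1.x and later: ask for scoring details via the explain URL parameter.
explain_url = "/my-index/_search?explain=true"

# ES 2.2+: add "profile": true to the search body to get per-component timings.
profiled_body = dict(query, profile=True)

print(json.dumps(profiled_body, sort_keys=True))
```

The explain output is per hit (how the score was computed), while the profile output is per query component (where the time went), so the two are complementary rather than interchangeable.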

Multiple index search and PIT index ID reuse

I’m planning to search multiple different sets of multiple indices at once. I’d also like to use search_after with point-in-time indices for deep pagination. I’ve got some general questions about how/if PIT works in this scenario.
Calling the _pit endpoint with multiple indices works fine, but I’m not sure exactly how it works - is the PIT index coupled to the comma-separated set of indices I pass in my call to _pit (e.g. /index-1,index-2/_pit?keep_alive=15m will open a PIT id usable with any search where I want results from index-1,index-2 )? Also, will the implicit _shard_doc tiebreaking work when creating a multi-index PIT index?
The guidance on the elastic blog here re: having a background process create a PIT for use with all search requests (rather than creating one on each search request) seems to contradict the PIT docs which state the following — I must be misunderstanding one of these statements?
The open point in time request and each subsequent search request can return different id; thus always use the most recently received id for the next search request.
Eventually found answers (with the help of a co-worker).
Multi-index PIT indices are supported, though this is not reflected in the docs whatsoever at time of writing.
Tested and confirmed that both opening a PIT and searching against a PIT works with multiple indices, even though it isn't explicitly mentioned in the docs. Also double-checked with the ES search team & they confirmed that this is correct.
Yes, PIT ids are stable for the requested keep_alive period (found in the context of a similar question on sharding).
The answer to your question is yes: the guarantee of a PIT is a certain point in time in the index, regardless of which shard copies are used internally. If that point in time can no longer be found, a "no search context found" error is returned.
But as I said in the previous answer, currently we always return the same PIT id even if we end up using different shard copies.
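Put together, those answers imply a deep-pagination loop like the following sketch. The open_pit/search callables, the index names, the page size, and the timestamp sort field are all assumptions for illustration; no network calls are made here:

```python
def pit_search_pages(open_pit, search):
    """Paginate with search_after, always reusing the most recent PIT id."""
    # The target indices are baked into the PIT id, so later searches
    # pass only the PIT, not an index in the URL.
    pit_id = open_pit("index-1,index-2", keep_alive="15m")["id"]
    search_after = None
    while True:
        body = {
            "size": 2,
            "pit": {"id": pit_id, "keep_alive": "15m"},
            # _shard_doc is added implicitly as a tiebreaker when
            # sorting a search that carries a PIT.
            "sort": [{"timestamp": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after
        resp = search(body)
        # Always adopt the most recently returned id for the next request.
        pit_id = resp.get("pit_id", pit_id)
        hits = resp["hits"]["hits"]
        if not hits:
            break
        yield hits
        search_after = hits[-1]["sort"]
```

This also shows how the two statements are reconciled: the id identifies a stable point in time for the keep_alive period, but you still feed each response's pit_id back into the next request in case it changes.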

How do I figure out indexing errors in Elasticsearch?

I am using ES 1.x and am having trouble finding the errors while indexing some documents.
Some documents are not getting indexed, and all I see are the lines below in the ES logs.
stop throttling indexing: numMergesInFlight=2, maxNumMerges=3
now throttling indexing: numMergesInFlight=4, maxNumMerges=3
A quick Google search gave me a high-level understanding of these messages, but I would like to understand the following:
Will ES retry the documents which were throttled?
Is there any way to know which documents were throttled by enabling some detailed logging, and if so, in which classes?
I don't see any error message apart from the INFO logs above. Is there a way to enable verbose logging for indexing that shows exactly what is going on during indexing?
The throttling messages you see in the logs are not the issue. Throttling happens in the background so that Elasticsearch can protect itself against a segment explosion; see the explanation here: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#segments-and-merging
Throttling does not drop documents; it just slows down indexing, which causes back pressure on the indexers and external queues.
When indexing fails, you should get an error response to the index/bulk request. To tell what the issue is, inspect the responses ES provides for the index/bulk requests. Logs might not tell the full story, as that depends on the log level configuration, which is per module in ES.
Another possibility is that you are indexing successfully, but the documents don't have the timestamps you think they do. Check _cat/indices to see whether the doc count increases when you index. If it does, the indexed documents are there, and you need to refine your searches.
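As a sketch of that inspection: a bulk request can succeed at the HTTP level while individual items fail, so item-level errors must be pulled out of the response body. This assumes the standard bulk response shape (a top-level "errors" flag plus a per-item result keyed by the action):

```python
def failed_bulk_items(bulk_response):
    """Return (position_in_batch, status, error) for every failed bulk item."""
    failures = []
    if not bulk_response.get("errors"):
        return failures  # fast path: no item in the batch failed
    for i, item in enumerate(bulk_response["items"]):
        # Each item is keyed by its action: index, create, update, or delete.
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append((i, result.get("status"), result["error"]))
    return failures
```

The batch position lets a client map each failure back to the source document it submitted, which is exactly the information the logs alone don't give you.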
To the best of my knowledge, Elasticsearch does not do retries; that is up to the client (though I haven't used 1.x in quite some time).
Logstash, for example, retries batches it gets 503 and 429 responses on, exactly for these kinds of reasons: https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/master/lib/logstash/outputs/elasticsearch.rb#L55

Connecting NiFi to ElasticSearch

I'm trying to solve one task and will appreciate any help: links to documentation or forums, other FAQs besides https://cwiki.apache.org/confluence/display/NIFI/FAQs, or any meaningful answer in this post =).
So, I have the following task:
The initial part of my system collects data every 5-15 minutes from different DB sources. Then I remove duplicates, remove junk, combine data from the different sources according to some logic, and redirect it to the second part of the system as several streams.
As far as I know, NiFi can do this task in the best way =).
Currently I can successfully get information from InfluxDB with the GetHTTP processor. However, I can't configure the same kind of processor for getting information from Elasticsearch with all the necessary options. I'd like to receive data every 5-15 minutes, for the time period from now minus 5-15 minutes to now (depending on the scheduler period), with several additional filters. If I understand it right, this can be achieved either by a subscription to "_index" or by regular requests to the DB at the desired interval.
I know that NiFi has several processors designed specifically for Elasticsearch (FetchElasticsearch5, FetchElasticsearchHttp, QueryElasticsearchHttp, ScrollElasticsearchHttp) as well as the GetHTTP and PostHTTP processors. Unfortunately, I haven't found information, let alone examples, on how to configure their Properties for my purposes =(.
What's the difference between FetchElasticsearchHttp and QueryElasticsearchHttp? Which one fits my task better? What's the difference between GetHTTP and QueryElasticsearchHttp besides several specific fields? Will GetHTTP perform the same way if I tune it as needed?
Any advice?
I will be grateful for any help.
The ElasticsearchHttp processors try to make it easier to interact with ES by generating the appropriate REST API call based on the properties you set. If you know the full URL you need, you could use GetHttp or InvokeHttp. However the ESHttp processors let you put in just the stuff you're looking for, and it will generate the URL and return the results.
FetchElasticsearch (and its variants) is used to get a particular document when you know the identifier. This is sometimes used after a search/query, to return documents one at a time after you know which ones you want.
QueryElasticsearchHttp is for when you want to do a Lucene-style query of the documents, when you don't necessarily know which documents you want. It will only return up to the value of index.max_result_window for that index. To get more records, you can use ScrollElasticsearchHttp afterwards. NOTE: QueryElasticsearchHttp expects a query that will work as the "q" parameter of the URL. This "mini-language" does not support all fields/operators (see here for more details).
For your use case, you likely need InvokeHttp in order to issue the kind of query you describe. This article describes how to issue a query for the last 15 minutes. Once your results are returned, you might need some combination of EvaluateJsonPath and/or SplitJson to work with the individual documents, see the Elasticsearch REST API documentation (and NiFi processor documentation) for more details.
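As an illustration of the "last 15 minutes" query itself, here is a sketch of the body such a request could POST to _search, using Elasticsearch date math. The @timestamp field name and the extra term filter are assumptions for illustration:

```python
import json

minutes = 15  # would track the scheduler period (5-15 minutes)
body = {
    "query": {
        "bool": {
            "filter": [
                # Date math: everything from now minus `minutes` up to now.
                {"range": {"@timestamp": {"gte": "now-%dm" % minutes,
                                          "lt": "now"}}},
                # Placeholder for the additional filters mentioned above.
                {"term": {"status": "active"}},
            ]
        }
    }
}
print(json.dumps(body))
```

Because this is a full JSON body rather than a "q" mini-language string, it is the kind of query that fits InvokeHttp (or a processor accepting a full query body) rather than QueryElasticsearchHttp.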

Elasticsearch 5.x Get API issues refresh. If so, is Get a high-cost operation in fact?

I'm using Elasticsearch 2.3, and I know that Get API is realtime, i.e. the API retrieves the very recent document regardless of refresh_interval. This operation is totally independent of refresh.
While reading the ES 5.x documentation, I found the following:
By default, the get API is realtime, and is not affected by the refresh rate of the index (when data will become visible for search). If a document has been updated but is not yet refreshed, the get API will issue a refresh call in-place to make the document visible. This will also make other documents changed since the last refresh visible. In order to disable realtime GET, one can set the realtime parameter to false.
I tested and confirmed that this isn't the case in an ES 2.3 environment; the Get API does not refresh the index, although it certainly gets the updated document.
Does this mean that Get API in ES 5.x actually is a very high-cost operation, because so is refresh?
The change will only affect you if you update a document and then GET it by ID before it has been refreshed. Is this a common scenario in your use case? Then you might want to disable realtime, but the general assumption is that you should not run into that situation frequently.
This has been discussed on the PR of the change (and explains why the change has been made), so you should find that discussion helpful: https://github.com/elastic/elasticsearch/pull/20102
Overall, the GET API in ES 5.x could be more costly, but it will depend on your actual use case.
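Opting out is per request, via the realtime URL parameter the docs mention. A minimal sketch of building such a request URL (the index, type, and id values are made up):

```python
def get_doc_url(index, doc_type, doc_id, realtime=True):
    """Build a 5.x-era GET URL, optionally opting out of realtime GET."""
    url = "/%s/%s/%s" % (index, doc_type, doc_id)
    if not realtime:
        # Skip the realtime guarantee (and any refresh it would trigger);
        # the last refreshed version of the document is returned instead.
        url += "?realtime=false"
    return url

print(get_doc_url("my-index", "my-type", "1", realtime=False))
```

A GET that can tolerate slightly stale results can use realtime=false and avoid paying for a refresh entirely.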

RavenDB Aggressive caching doesn't seem to do anything different

Inside the following block
using (DocumentSession.Advanced.DocumentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(1)))
{
    return session.Query<Camera, Camera_Facets>().Where(...).ToFacets("facets/CameraFacets");
}
I am executing a query and asking for facets. When I watch the call on the Raven server console, it takes 2.5 seconds, and when I run the same query again and again, it still takes the exact same time.
How is this meant to be fast when it returns in roughly the same time every time? Am I missing something here? I am using build 499 and running in client-server mode, talking to Raven on my local machine.
Note: I am running the query against the data store for my domain; the Camera code above is shown for reference purposes.
Faceted queries and aggressive caching currently don't work together. Faceted queries are a new feature and as yet they haven't been made to work with aggressive caching.
Note that regular queries work with aggressive caching just fine, it's only faceted queries that have this issue.
