How does Elasticsearch 6.8 cache the same request the second time?

I am using Elasticsearch 6.8 and I'd like to know: if I send the same request multiple times, will ES perform any optimised operation on it? If yes, is there any document explaining how this works?

Adding more details to the answer given by @fmdaboville.
The caches described there are provided out of the box; below are some options if you want to enable, disable, or fine-tune caching further.
Enabling/fine-tuning more cache options
Query type: if you are using filters in a search query, those are cached by default by Elasticsearch and don't contribute to the score, since their only purpose is to filter out data (see the sketch after the quote below). More info in this official doc:
In a filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g. Does this timestamp fall into the range 2015 to 2016? Is the status field set to "published"? Frequently used filters will be cached automatically by Elasticsearch, to speed up performance.
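For instance, a minimal filter-context query matching the example above might look like this (the index name my_index and the field names are made up for illustration):

GET /my_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "published" } },
        { "range": { "timestamp": { "gte": "2015-01-01", "lte": "2016-12-31" } } }
      ]
    }
  }
}

Both clauses run in filter context, so they yield no score and their results are candidates for caching.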
Using refresh interval: this official doc has a lot more info, but in short, increasing the refresh interval is good if you are OK with getting somewhat stale data and are ready to trade that off in favor of performance. The refresh is what makes new index changes visible to search.
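For example, a sketch of relaxing the refresh interval on one index (my_index and the 30s value are just illustrative; the default is 1s):

PUT /my_index/_settings
{
  "index": { "refresh_interval": "30s" }
}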
Disable the cache on a particular request
By default, the results of heavy searches are cached at the shard level, as explained in this official doc, but you can enable or disable this behavior for individual requests in your API call.
Simply add the query param below to your search request; the linked doc describes several other settings related to it.
request_cache=true/false
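For example (my_index and the aggregation are made up; note that by default the shard request cache only stores the results of size=0 requests such as aggregations):

GET /my_index/_search?request_cache=true
{
  "size": 0,
  "aggs": {
    "by_status": { "terms": { "field": "status" } }
  }
}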

Here is the documentation explaining how search is optimized: Tune for search speed. The part about caching seems to be what you are looking for:
There are multiple caches that can help with search performance, such as the filesystem cache, the request cache or the query cache. Yet all these caches are maintained at the node level, meaning that if you run the same request twice in a row, have 1 replica or more and use round-robin, the default routing algorithm, then those two requests will go to different shard copies, preventing node-level caches from helping.
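The same doc suggests working around this by routing a given user's repeated searches to the same shard copies with the preference parameter; a minimal sketch, where user_123 is a made-up user/session identifier:

GET /my_index/_search?preference=user_123
{
  "query": { "match": { "title": "elasticsearch" } }
}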

Related

Query ElasticSearch after the index operation

I have a service A that executes some text processing. After it, service B has to execute a set of Elasticsearch queries on the document. The connectivity between the services is provided by Kafka. The solution is tightly coupled to ES free-text search capabilities, so I can't query in another way.
Possible solution:
Store the document in ES and query it. The problem is that ES is eventually consistent and I don't know whether the document has already been indexed or not.
Is there some API to ensure that the document is already indexed?
Another option is to publish the message from service A with a delay of X+5 seconds, where X is the refresh interval of the index in which the document is stored. That seems like an unreliable solution to me. What do you think?
Another direction I thought about is some way to run the ES queries while the document is still in memory. For example, if I had some magic way to convert the ES query to the Lucene DSL, I wouldn't need to deal with the eventually consistent behavior of Elasticsearch and could query Lucene directly.
Maybe there are some other solutions?
Take a look at the ?refresh flag, so that an indexing request will only return once a refresh has happened. Otherwise you can use the GET API to check whether the document exists or not.
However, there are no magic options here: Elasticsearch is eventually consistent and you need to factor that in.
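A minimal sketch of the first option (index name, document ID, and body are made up); ?refresh=wait_for blocks the indexing call until the next refresh makes the document searchable:

PUT /my_index/_doc/1?refresh=wait_for
{
  "text": "output of service A"
}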

Check if document is part of Elasticsearch query?

Curious if there is some way to check whether a document ID is part of a large (million+ results) Elasticsearch query/filter.
Essentially I'll have a group of related document IDs and only want to return them if they are part of a larger query. Hoping to do this on the database side. It seemed theoretically possible, since ES has to cache stuff related to large scrolls.
It's an interesting use case, but you need to understand that Elasticsearch (ES) doesn't return all the matching document IDs in the search result; by default it returns only 10 documents in the response, which can be changed with the size parameter.
And if you increase the size param and have millions of matching docs in your query, then ES query performance would be very bad, and it might even bring the entire cluster down if you frequently fire such queries (in the absence of a circuit breaker), so be cautious about it.
You are right that ES caches things, but if you try to cache a huge amount of data that is invalidated very frequently, you will not get the expected performance benefits, so benchmark it first.
You are already on the correct path with the scroll API to iterate over millions of search results; see the points below (and the sketch after them) to improve further.
First get the count of matching documents; this is included in the default search response as hits.total (reported in newer ES versions with an eq or gte relation), which gives you an idea of how many search results you have; based on that you can set the size param for subsequent calls to check whether your ID is present or not.
See if you effectively utilize filter context in your query, which is cached by default by ES.
Benchmark some of your heavy scroll API calls with your data.
Refer to this thread to fine-tune your cluster and index configuration to optimize ES responses further.
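A minimal sketch of the first point, with a made-up index and filter; in the response, hits.total holds the match count:

GET /my_index/_search
{
  "size": 0,
  "query": {
    "bool": { "filter": { "term": { "status": "published" } } }
  }
}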

ElasticSearch - Configuration to Analyse a document on Indexing

In a single request, I want to retrieve documents from a SOR, store them in ElasticSearch, and then search those documents using the ES search API.
There seems to be some lag from the time the document is indexed and the time it is analyzed and ready to be searched.
Is there any way to configure ES to not return from the request to index a document until the analyzer has analyzed it and so that it can immediately be searched?
Elasticsearch is "near real-time" by nature, i.e. all indices are refreshed every second (by default). While that may seem enough in the majority of cases, it might not be, as in your case.
If you need your documents to be available immediately, you need to refresh your indices explicitly by calling
POST /_refresh
or if you only want to refresh one index
POST /my_index/_refresh
The refresh needs to happen after the indexing call returned and before the search call is sent off.
Note that doing this on every document indexing will hurt the performance of your system. It might be better to make your application aware of the near real-time nature of ES and handle this on the client-side.
The refresh API, as suggested in the accepted answer, is heavy in nature, and you may not want to call it after every index operation if you are going to do a significant number of indexing operations.
What happens under the hood is that a refresh writes the in-memory indexing buffer that Elasticsearch maintains (with the translog providing durability) out to a new searchable segment. This operation is best left to the discretion of Elasticsearch; however, there are some configuration parameters you can play around with.
There is an alternative approach you can take; it may or may not suit your specific use case, but here it goes.
Query the index/_stats/refresh API and retrieve the refresh status from there, index your document, and then keep performing the same stats query. If the refresh count has increased since your indexing call, your document is ready to be searched.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
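A minimal sketch of that check (my_index is a placeholder); compare the refresh.total counter in the response before and after your indexing call:

GET /my_index/_stats/refresh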

How to handle pagination when the source data changes frequently

Specifically, I'm using Elasticsearch to do pagination, but this question could apply to any database.
Elasticsearch provides methods to paginate search results with handy from and size parameters.
So I run a query get me the most recent data from result 1 to 10
This works great.
The user clicks "next page" and the query is:
get me the most recent data from result 11 to 20
The problem is that in the time between the two queries, 2 new records have been added to the backing database, which means the paginated results will overlap (the last 2 from the first page show up as first two on the second page).
What's the best solution to avoid this? Right now, I'm adding a filter to the query that tells it to only include results later than the last result of the previous query. But it just seems hackish.
A filter is not a bad option, if you're already indexing a relevant timestamp. You have to track that timestamp on the client side in order to correctly prepare your queries. You also have to know when to get rid of it. But those aren't insurmountable problems.
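A minimal sketch of that filter approach, with made-up field names; the range bound is the created_at value of the last hit on the previous page:

GET /my_index/_search
{
  "size": 10,
  "sort": [ { "created_at": "desc" } ],
  "query": {
    "bool": {
      "filter": { "range": { "created_at": { "lt": "2016-05-01T12:00:00Z" } } }
    }
  }
}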
The Scroll API is a solid option for this, because it effectively snapshots in time on the Elasticsearch side. The intent of the Scroll API is to provide a stable search query for deep pagination, which has to deal with the exact issue of change that you're experiencing.
You begin a Scrolling Search by supplying your query and the scroll parameter, for which Elasticsearch returns a scroll_id. You then make requests to /_search/scroll supplying that ID, each of which return a page of results and a new scroll_id for the next request.
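A minimal sketch of that flow (index name, page size, and keep-alive are made up):

POST /my_index/_search?scroll=1m
{
  "size": 100,
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}

Each scroll response carries the next page of hits along with a fresh _scroll_id for the following request.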
(Note that you don't want the scan search type here. That's used to extract documents en masse, and does not apply any sorting.)
Compared to filtering, you do still have to track a value: the scroll_id for your next page of results. Whether that's easier than tracking a timestamp depends on your app.
There are other potential downsides to consider. Elasticsearch persists the context for your search on a single node within the cluster. Conceivably these could accumulate in your cluster, depending on how heavily you rely on scrolling search. You'll want to test the performance implications there. And if I recall correctly, scrolling searches also do not persist through a node failure or restart.
The ES documentation for the Scroll API provides good details on all of the above.
Bottom line: filtering by timestamp is actually not a bad choice. The Scroll API is another valid option, designed for a similar use case, but is not without its drawbacks.
Realise this is a bit old, but with ElasticSearch 6.3 there's now the search_after feature for the request body, which allows for cursor-type paging:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
It is very similar to the scroll API, but unlike it, the search_after parameter is stateless; it is always resolved against the latest version of the searcher.
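A minimal sketch, with made-up field names; the search_after values are the sort values of the last hit on the previous page, and the second sort field is a tie-breaker so pages stay stable when timestamps collide:

GET /my_index/_search
{
  "size": 10,
  "sort": [
    { "created_at": "desc" },
    { "doc_id": "desc" }
  ],
  "search_after": [1463538857000, "654323"]
}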
You need to use the scan API for this. The scan and scroll APIs let you do point-in-time search and pagination.

What caching strategy for search queries

We are developing a search engine web application that will enable users to search the content of about 200 portals.
Our business partner is taking care of maintaining and feeding a solr/lucene instance that is doing the workhorse job of indexing the data.
Our application queries Solr and presents the results in a human-friendly way. However, we are wondering how we could limit the amount of queries, perhaps using some form of caching. The results could be cached for a few hours.
What we are wondering is: what could be a good strategy for caching the queries results? Obviously we expect the method invocations to vary a lot... Does it make sense at all to do caching?
Is there some caching system that is particularly suitable in this use case? We are using Spring 3 for the development.
I would keep in mind that Solr already has a lot of caching built into it in order to speed up common queries. I'd advise you to look into the inherent capabilities in Solr/Lucene before you go off and reinvent the wheel with your own query cache.
Here is a good place to start.
The simplest solution is to reform your query before it hits Solr.
I created my own QueryBuilder method, which I pass my query string through before hitting Solr.
All this does is explode all of the arguments and then sort them into a predefined group set.
For example, in order to normalize your queries so that they are cacheable, you can sort the arguments alphabetically by key, reform the query string, and then use that to query Solr. (The actual query result will be unchanged.)
Before you actually run the query, you could then create a hash of the Solr query string and check it against an in-memory set of the keys that have been cached so far. If you find yourself approaching millions of query keys, which is quite likely, you might want to look at using a Bloom filter to reduce the keyspace while still maintaining some degree of accuracy on cache hits.
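For illustration, two logically identical queries that would otherwise produce two cache entries (the paths and fields are made up):

/solr/select?q=title:foo AND status:published&rows=10
/solr/select?q=status:published AND title:foo&rows=10

After sorting the arguments, both reduce to the same string, q=status:published AND title:foo&rows=10, whose hash then serves as the single cache key.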
Alternatively, you might want to look at putting a reverse proxy cache between you and Solr. For example, if you were to query Solr as Spring -> Varnish -> Solr, Varnish could do the caching, using the query string as the hash key. You would then be able to set a 2-hour Expires header, so that the results are automatically flushed/cleared/invalidated.
Hopefully this helps.
I have found that caching the results or the rendered content outside Lucene works best: an API search service that fronts a caching tier holding the results from a Lucene index.
If you separate the caching tier out, you can then plug in whatever caching you want: distributed caching (Redis, Azure AppFabric, other cloud caching, etc.). You can also cache partial renderings of the web page (i.e. output caching in ASP.NET) or cache the API calls themselves using RESTful conventions. Things like cache-warming or proactive caching (based on usage) then become easy to do with services.
Your application/index cache can then be "re-used" across more tiers of your app instead of caching only at the index level. This all depends on whether your indexing updates are real-time, whether the queries are data-level secure for each client/user ID, etc. As mentioned above, Solr already does "some" of this for you.
