I am paginating Elasticsearch data using search_after, sorting on _uid. Sorting over a huge amount of data leads to a circuit breaker exception because the fielddata size limit is exceeded. A possible solution suggested in the Elasticsearch documentation for sorting a large dataset is to sort on _doc
(https://www.elastic.co/guide/en/elasticsearch/reference/6.3/search-request-sort.html)
This returns the response quickly and avoids the circuit breaker exception. However, I am concerned about the uniqueness of the value used in search_after, since that value serves as the cursor for fetching the next set of records.
https://www.elastic.co/guide/en/elasticsearch/reference/6.3/search-request-scroll.html
To understand what the _doc value represents, I went through the documentation, but it gives no description.
In my data I see several records with the same _doc value. These documents have different _ids, meaning they are distinct records. Can anyone help me understand what this value represents? Can I use it in search_after? I am using Elasticsearch version 5.4.
Related
Hello Community, I am fairly new to Elasticsearch and have stumbled upon this issue.
My Elasticsearch application requires aggregating data to get top hits. The query works perfectly in almost all use cases, and we get the values in buckets. But in some cases, where the field being aggregated on holds very long text, the aggregation does not happen and the buckets come back as an empty array. So my question is: does Elasticsearch aggregation have a size limit that prevents it from aggregating very long text fields?
For Kibana server decommissioning purposes, I want two lists of index patterns: those that have never had a single document and those that have had documents.
How can I achieve this using Kibana only?
I tried this, but it doesn't give a list based on the document count.
GET /_cat/indices
Also, getting the count for each pattern individually to check whether documents exist is time-consuming.
GET index-pattern*/_count
You can try this. v is for verbose and s stands for sort.
GET /_cat/indices?v&s=store.size:desc
From the docs:
These metrics are retrieved directly from Lucene, which Elasticsearch uses internally to power indexing and search. As a result, all document counts include hidden nested documents.
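To get the empty/non-empty split without checking each pattern by hand, one option is to request the cat output as JSON (for example GET /_cat/indices?format=json&h=index,docs.count) and classify the rows by docs.count. Below is a minimal sketch of that classification step only. The function name split_by_doc_count and the sample rows are made up for illustration, and fetching the rows is left out. Keep in mind that a count of zero only shows the index is empty now, not that it never held documents.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def split_by_doc_count(
    rows: Iterable[Dict[str, Optional[str]]],
) -> Tuple[List[str], List[str]]:
    """Split cat-indices rows into (empty, non_empty) index names.

    Each row is assumed to look like {"index": "name", "docs.count": "12"},
    which is the shape of the JSON cat output (counts arrive as strings).
    """
    empty: List[str] = []
    non_empty: List[str] = []
    for row in rows:
        count = row.get("docs.count")
        if count is None:
            # Closed indices report no document count; skip them.
            continue
        if int(count) == 0:
            empty.append(row["index"])
        else:
            non_empty.append(row["index"])
    return sorted(empty), sorted(non_empty)


if __name__ == "__main__":
    # Hypothetical sample rows, not real cluster output.
    sample = [
        {"index": "logs-2020", "docs.count": "1532"},
        {"index": "tmp-unused", "docs.count": "0"},
        {"index": "closed-one", "docs.count": None},
    ]
    empty, non_empty = split_by_doc_count(sample)
    print("empty:", empty)
    print("non-empty:", non_empty)
```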
I have two different Elasticsearch clusters.
One cluster is Elasticsearch 6.x with the data; the second is a new Elasticsearch 7.7.1 cluster with pre-created indexes.
I reindexed data from Elasticsearch 6.x to Elasticsearch 7.7.1.
Is there any way to get each doc from the source and compare it with the target doc, in order to check that the data is there and has not been altered?
When you perform a reindex, the data is indexed according to the destination index mapping, so if the mapping is the same you should get the same results in search. The _source value will be identical in both indices, but that alone doesn't mean your search results will be the same. If you really want to be sure everything is OK, you should compare the inverted indexes generated by both indices for full-text search. This data can be really big and there is no easy way to retrieve it; you can check this for getting the term-document matrix.
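For the simpler check the question asks about (is each document present, and is its _source unchanged), a rough sketch is to pull hits from both clusters and compare them by _id. This assumes the reindex kept the original _id values and that each hit has the usual "_id" and "_source" keys. The function name diff_sources is made up for illustration, and retrieving the hits from each cluster is left out. As noted above, matching _source says nothing about how the fields were analyzed and indexed.

```python
from typing import Any, Dict, Iterable, List


def diff_sources(
    source_hits: Iterable[Dict[str, Any]],
    target_hits: Iterable[Dict[str, Any]],
) -> Dict[str, List[str]]:
    """Compare two sets of hits by _id and report the differences.

    Returns ids missing from the target, ids only present in the target,
    and ids present in both whose _source differs.
    """
    src = {hit["_id"]: hit["_source"] for hit in source_hits}
    tgt = {hit["_id"]: hit["_source"] for hit in target_hits}
    return {
        "missing": sorted(set(src) - set(tgt)),
        "extra": sorted(set(tgt) - set(src)),
        "changed": sorted(i for i in src.keys() & tgt.keys() if src[i] != tgt[i]),
    }


if __name__ == "__main__":
    # Hypothetical hits standing in for what each cluster would return.
    old = [
        {"_id": "1", "_source": {"name": "a"}},
        {"_id": "2", "_source": {"name": "b"}},
    ]
    new = [
        {"_id": "1", "_source": {"name": "a"}},
        {"_id": "2", "_source": {"name": "B"}},
    ]
    print(diff_sources(old, new))
```

For large indices the hits would have to be compared in batches rather than loaded all at once.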
The documentation and recommendations for using the stored_fields feature in Elasticsearch have been changing. In the latest version (7.9), stored_fields is not recommended - https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
Is there a reason for this?
Whereas in version 7.4.0, there is no such negative comment - https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
What is the guidance on using this feature? Is _source filtering a better option? I ask because another doc says _source filtering kills performance - https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents.
What is the best way to filter fields without killing performance in Elasticsearch?
Source filtering is the recommended way to fetch fields. The blog post is what confuses you, but you seem to be missing the important concept and use case its warning applies to. Please read the statement below carefully.
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, Elasticsearch returns only 10 hits per search, which can be changed with the size parameter. If you want to fetch only a few field values from your search results, source filtering makes perfect sense because it is applied to the final result set (not to all the documents matching the query).
If, on the other hand, you use a script that reads field values from _source in order to filter the search results, the query has to scan the whole index. That is the second part of the statement above (not when processing millions of documents).
Apart from the above, all field values are already stored as part of the _source field, which is enabled by default. So you need not spend extra space by explicitly marking a few fields as stored (store is disabled by default to keep the index small) just to retrieve field values.
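To make the first case concrete, here is a minimal sketch that builds a search request body with source filtering, so that only the listed fields come back for the returned hits. The helper name build_filtered_search and the field names title and price are made up for illustration. Sending the body to the _search endpoint is left out.

```python
import json
from typing import Any, Dict, Sequence


def build_filtered_search(
    query: Dict[str, Any],
    fields: Sequence[str],
    size: int = 10,
) -> Dict[str, Any]:
    """Build a search body that returns only `fields` from _source.

    The filter is applied to the hits that are returned (at most `size`),
    not to every document that matches the query.
    """
    return {
        "size": size,
        "_source": list(fields),
        "query": query,
    }


if __name__ == "__main__":
    # Hypothetical query and field names.
    body = build_filtered_search(
        query={"match": {"title": "elasticsearch"}},
        fields=["title", "price"],
    )
    print(json.dumps(body, indent=2))
```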
I need to make a paginated search call to ES. I am using _doc as a sort key and search_after for getting the next page, since I don't care about ordering as long as it is consistent every time I make a search. However, what I found out is that the returned objects are sorted in a different order on every search request. With pagination this actually causes problems, because when making a call to get the next page I often see the same documents as I saw on a previous page.
Am I misunderstanding how _doc should be used? What are my other alternatives if I want consistent ordering?
I am using ES 5.5
The current documentation recommends using a field with one unique value per document as the tiebreaker. This should be a copy of the _id value stored in another field that has doc values enabled.
See the search_after documentation here, specifically the first note:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html
You can use _uid for sorting, but doing so is expensive in terms of memory usage. Please check this: https://github.com/elastic/kibana/issues/11925
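A minimal sketch of the search_after loop with such a tiebreaker is shown below. It assumes a field named id_copy (a made-up name) that holds a copy of _id with doc values enabled, and a search callable (hypothetical, for example a thin wrapper around your client's search method for one index) that takes a request body and returns the parsed response. Each page passes the sort values of the last hit of the previous page as search_after.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def paginate(
    search: SearchFn,
    sort: List[Dict[str, str]],
    page_size: int = 100,
) -> Iterator[Dict[str, Any]]:
    """Yield every hit, page by page, using search_after as the cursor.

    `sort` must end in a field that is unique per document, otherwise
    documents sharing a sort value can be skipped or repeated.
    """
    after: Optional[List[Any]] = None
    while True:
        body: Dict[str, Any] = {
            "size": page_size,
            "sort": sort,
            "query": {"match_all": {}},
        }
        if after is not None:
            body["search_after"] = after
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The sort values of the last hit become the cursor for the next page.
        after = hits[-1]["sort"]


# Example sort clause; "id_copy" is an assumed field name, not a built-in.
UNIQUE_SORT = [{"id_copy": "asc"}]
```

Because the cursor is built from a unique sort value, the same request always continues from the same place, which is the consistency that sorting on _doc alone does not give you.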