lucene documents order change after number of commits - sorting

I am facing a problem with the ordering of Lucene documents when searching an index. I need the documents in last-in, first-out order: from the most recently indexed back to the first indexed. The order was maintained up to nine commits to the index, but it changes from the tenth commit onwards. I noticed that all the document id numbers change and the insertion order is not maintained; the old document numbers are all reassigned.
Is there any way to maintain insertion order even after many commits to the index?
Up to nine commits the file structure looks like this - no ordering issue found:
After the tenth commit the structure changes, and the order of the docs changes as well

Never use the docId as an ordering reference. It is a Lucene-internal id and may change depending on Lucene's indexing operations.
As an example: if you update a document with docId 1, Lucene internally performs a delete and an insert, which can result in a different docId.
To achieve ordering/sorting by query input, you should add a dedicated field to your index. See the Field Javadoc. There are several field types for this purpose:
SortedDocValuesField: a per-document byte[] value, indexed column-wise for sorting/faceting
SortedSetDocValuesField: a set of byte[] values per document, indexed column-wise for sorting/faceting
NumericDocValuesField: a per-document long value, indexed column-wise for sorting/faceting
SortedNumericDocValuesField: a set of long values per document, indexed column-wise for sorting/faceting
Important: these fields are used for scoring/sorting/faceting only. If you also want the value in a query result, you have to add an additional StoredField for it.
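Lucene itself is Java, but the mechanics are easy to illustrate language-neutrally. Here is a pure-Python sketch (invented names, not Lucene code) of why docIds are unstable across merges, while a dedicated per-document sequence value, which is the role a NumericDocValuesField would play, preserves insertion order:

```python
# Illustration only: simulates Lucene reassigning internal docIds on a merge.
# The "seq" field plays the role of a NumericDocValuesField you add yourself.

def index_docs(titles):
    """Assign each document an insertion sequence number (our own field)."""
    return [{"docid": i, "seq": i, "title": t} for i, t in enumerate(titles)]

def merge_segments(docs):
    """Merges may rewrite segments; internal docIds can be reassigned."""
    reordered = sorted(docs, key=lambda d: d["title"])  # arbitrary new layout
    return [{**d, "docid": i} for i, d in enumerate(reordered)]

docs = index_docs(["banana", "apple", "cherry"])
docs = merge_segments(docs)

# docId order no longer matches insertion order...
by_docid = [d["title"] for d in sorted(docs, key=lambda d: d["docid"])]
# ...but sorting on our own sequence field (descending = newest first) does.
newest_first = [d["title"] for d in sorted(docs, key=lambda d: -d["seq"])]
print(by_docid)       # order depends on the merge, not on insertion
print(newest_first)   # ['cherry', 'apple', 'banana']
```

In real Lucene you would add the sequence (or a timestamp) as a NumericDocValuesField at indexing time and pass a SortField on it to the searcher.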

Related

Does Elasticsearch have a Default Sort Order for Filter Queries?

Does Elasticsearch have a defined default sort order for filter queries if none is specified? Or is it more like an RDBMS without an order by - i.e. nothing is guaranteed?
From my experiments I appear to be getting my documents back in order of their id - which is exactly what I want - I am just wondering if this can be relied on?
When you only have filters (i.e. no scoring) and no explicit sort clause, then the documents are returned in index order, i.e. implicitly sorted by the special field named _doc.
Index order simply means the sequential order in which the documents have been indexed.
If your id is sequential and you've indexed your documents in the same order as your id, then what you observe is correct, but it might not always be the case.
No, the order cannot be relied on (in ES 7.12.1 at least)!
I've tested in a production environment, where we have a cluster with multiple shards and replicas, and even the simplest query like this returns results in a different order every few requests:
POST /my_index/_search
One way to ensure a consistent order is to add a sort on _id, which seems to carry a small performance cost.
Also, I know it's not directly related to this question, but keep in mind: if your query does involve scoring and you still get inconsistent results even after sorting on _id, the cause is that scores can differ between shards in a cluster environment (term statistics are computed per shard). This can be addressed by adding a parameter to your query:
POST /my_index/_search?search_type=dfs_query_then_fetch
More info and possible solutions can be found here:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/consistent-scoring.html
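The shard-merging behaviour described above can be sketched in a few lines of pure Python (an illustration with made-up hits, not the Elasticsearch API): without an explicit sort clause, the merged hit order depends on which shard responds first; sorting on _id makes it deterministic.

```python
# Illustration only: why filter-only results can come back in varying order
# from a multi-shard cluster, and how an explicit sort restores determinism.

shard_a = [{"_id": "3"}, {"_id": "1"}]
shard_b = [{"_id": "2"}, {"_id": "4"}]

def search(shard_responses):
    """Without a sort clause, hit order depends on shard response order."""
    hits = []
    for response in shard_responses:
        hits.extend(response)
    return hits

# Replica/shard timing varies between requests, so the merged order varies too.
run1 = search([shard_a, shard_b])
run2 = search([shard_b, shard_a])

# An explicit sort key (e.g. _id) makes both runs identical.
stable1 = sorted(run1, key=lambda h: h["_id"])
stable2 = sorted(run2, key=lambda h: h["_id"])
```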

How to periodically update ElasticSearch index? UPDATE or REBUILD?

Suppose there are a fixed 1 million article titles to be indexed, and each article has two growing fields with "type": "rank_feature": view_count and favor_count. I have to update the counter values every hour in order to boost hot articles in search results.
Since the UPDATE operation in ES and Lucene is equivalent to search-delete-create, I wonder what the proper solution is in my case. Does the UPDATE operation save the unnecessary ANALYZE step for those fixed titles?
An update does not make your analysis more efficient — it still has to process the entire document again.
If you have 2 fields that change frequently and other fields that are more static, I'd restructure the documents to use parent/child:
The parent contains the static fields
The child has your 2 frequently changing fields
That way you can avoid (re)analysis of documents as much as possible. This comes at the cost of some search-time overhead, but it should be manageable if you only have a single child.
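As a sketch of that split (the doc_relation field and the relation names are my own invention; view_count/favor_count come from the question), the mapping and an hourly counter document could look like this using Elasticsearch's join field, expressed here as Python dicts:

```python
# Sketch, not a definitive mapping: static article fields live on the parent,
# the two frequently changing counters live on a small child document.

article_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},              # static: analyzed once, on the parent
            "view_count": {"type": "rank_feature"},
            "favor_count": {"type": "rank_feature"},
            "doc_relation": {
                "type": "join",
                "relations": {"article": "counters"},  # parent -> child
            },
        }
    }
}

# The hourly update then reindexes only this small child document,
# not the parent with the analyzed title:
counter_doc = {
    "doc_relation": {"name": "counters", "parent": "article-42"},
    "view_count": 1234,
    "favor_count": 56,
}
```

Note that with a join field, child documents must be indexed with routing set to the parent id so that parent and child land on the same shard.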

Elasticsearch index with historical versions of documents

I have an Elasticsearch index that is continuously updated, and I'm creating a second index with the same mappings for offline analytics: I need to store changes to certain fields in order to retrieve the values they had at a specific time in the past. Therefore, in this second index I store multiple versions of the same document (same id but different _id fields).
My objective is to get ranked results for a given query and reference date. I've tried aggregations, but rather than modifying the hits field you get a new aggregations section with unordered results.
Is there any way other than removing duplicates at the client side?
This is similar to, but different from, a previous question: the solution proposed there, a boolean current field, only helps remove duplicates when querying the present.
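The client-side deduplication the question mentions can at least be made cheap. A pure-Python sketch (field names id/ts/score are invented) that keeps, per logical id, the newest version at or before the reference date and then ranks the survivors:

```python
# Client-side sketch: given multiple stored versions per logical id, keep the
# latest version at or before a reference date, then rank the survivors.

versions = [
    {"id": "a", "ts": 1, "score": 0.2},
    {"id": "a", "ts": 5, "score": 0.9},   # too new for ref_date=3, ignored
    {"id": "b", "ts": 2, "score": 0.7},
]

def as_of(hits, ref_date):
    latest = {}
    for hit in hits:
        if hit["ts"] <= ref_date:
            prev = latest.get(hit["id"])
            if prev is None or hit["ts"] > prev["ts"]:
                latest[hit["id"]] = hit
    # Rank the deduplicated documents, e.g. by relevance score.
    return sorted(latest.values(), key=lambda h: -h["score"])

results = as_of(versions, ref_date=3)
# -> b (ts=2) first, then the older a (ts=1); a's ts=5 version is excluded
```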

Bulk read of all documents in an elasticsearch alias

I have the following elasticsearch setup:
4 to 6 small-ish indices (<5 million docs, <5Gb each)
they are unioned through an alias
they all contain the same doc type
they change very infrequently (i.e. >99% of the indexing happens when the index is created)
One of the use cases for my app requires reading all documents for the alias, ordered by a field, doing some magic, and serving the result.
I understand that deep pagination will most likely bring down my cluster, or at the very least perform dismally, so I'm wondering if the scroll API could be the solution. I know the documentation says it is not intended for real-time user queries, but what are the actual reasons for that?
Generally, how are people dealing with having to read through all the documents in an index? Should I look for another way to chunk the data?
When you use the scroll API, Elasticsearch creates a sort of cursor over the current state of the index, so the reason it is not recommended for real-time search is that you will not see any new documents that were inserted after you created the scroll token.
Since your use case indicates that you rarely update or insert new documents into your indices, that may not be an issue for you.
When generating the scroll token you can specify a query with a sort, so if your documents have some kind of timestamp, you could create one scroll context for all documents with timestamp: { lte: "now" } and another scroll (or even a simple query) for the remaining documents that were not included in the first search context, by specifying an appropriate date-range filter.
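The snapshot behaviour described above can be illustrated with a toy stand-in (not the real scroll API; names are invented):

```python
# Illustration: a scroll behaves like a snapshot of the index taken at
# creation time, so documents indexed afterwards are invisible to it.

class FakeScroll:
    """Toy stand-in for a scroll context (not the real API)."""
    def __init__(self, docs, page_size):
        self.snapshot = list(docs)        # frozen view at scroll creation
        self.page_size = page_size
        self.pos = 0

    def next_page(self):
        page = self.snapshot[self.pos:self.pos + self.page_size]
        self.pos += len(page)
        return page

index = [{"id": i} for i in range(5)]
scroll = FakeScroll(index, page_size=2)

index.append({"id": 99})                  # indexed after the scroll was opened

seen = []
while (page := scroll.next_page()):
    seen.extend(page)

# The scroll never returns id 99; a follow-up query with a date-range filter
# (e.g. timestamp > scroll creation time) would pick up the newcomers.
```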

Index type in elasticsearch

I am trying to understand and effectively use the index type available in elasticsearch.
However, I am still not clear on how the _type meta field differs from any regular field of an index in terms of storage/implementation. I do understand avoiding_type_gotchas.
For example, if I have 1 million records (say posts) and each post has a creation_date. How will things play out if one of my index types is creation_date itself (leading to ~ 1 million types)? I don't think it affects the way Lucene stores documents, does it?
In what way would my Elasticsearch query performance be affected if I used creation_date as the index type instead of a namesake type such as 'post'?
I got the answer on elastic forum.
https://discuss.elastic.co/t/index-type-effective-utilization/58706
Pasting the response as is -
"While elasticsearch is scalable in many dimensions there is one where it is limited. This is the metadata about your indices which includes the various indices, doc types and fields they contain.
These "mappings" exist in memory and are updated and shared around all nodes with every change. For this reason it does not make sense to endlessly grow the list of indices, types (and therefore fields) that exist in this cluster state. A type-per-document-creation-date registers a million on the one-to-ten scale of bad design decisions" - Mark_Harwood
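The takeaway can be made concrete with a small sketch (index and field names assumed): keep a single type whose mapping models creation_date as an ordinary date field, and use a range query instead of a type per date, so cluster metadata stays bounded no matter how many dates exist.

```python
# Sketch of the recommended modeling: one mapping with a date field,
# instead of one type (and its mapping entries) per creation date.

post_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "creation_date": {"type": "date"},  # one field, any number of values
        }
    }
}

# A range query over that field replaces "one type per date":
query = {"query": {"range": {"creation_date": {"gte": "2016-01-01"}}}}
```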
