Are Elasticsearch auto-generated IDs sequential?

If we do not specify an ID when inserting a document into Elasticsearch, the ID is generated automatically. I also understand that the IDs are Flake IDs, which have a predictable pattern.
My question is: are these generated Flake IDs sequential enough that I can sort on _id or _uid and be sure the results come back in the same order in which they were inserted?

The autogenerated _id is not sequential. It is a URL-safe, Base64-encoded GUID generated using a modified FlakeID algorithm. FlakeID is a decentralized algorithm that generates k-ordered unique IDs, i.e. roughly time-ordered but not strictly sequential.
Note that Elasticsearch no longer generates the _id using random UUIDs.
For more details, see:
https://github.com/elastic/elasticsearch/issues/5941
https://github.com/elastic/elasticsearch/pull/7531
https://github.com/ppearcy/elasticflake

The Elasticsearch autogenerated _id is random, not sequential, and the same is true for _uid. If you want to sort by insertion order, the easiest step is to enable _timestamp, so that _timestamp holds the time at which each document was indexed.
However, _timestamp is refreshed whenever a document is updated, so you may instead want to create your own date field and set the current time manually. A sketch of that approach follows the link below.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-timestamp-field.html
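A minimal sketch of that manual date-field approach, using request syntax for recent Elasticsearch versions; the index name my-index and the field name created_at are illustrative assumptions, and the client sets the value at index time:

PUT my-index
{
  "mappings": {
    "properties": {
      "created_at": { "type": "date" }
    }
  }
}

PUT my-index/_doc/1
{
  "message": "first document",
  "created_at": "2024-05-01T12:00:00.000Z"
}

GET my-index/_search
{
  "sort": [
    { "created_at": { "order": "asc" } }
  ]
}

Documents indexed within the same millisecond will still tie on such a field, so add a second, unique sort field as a tiebreaker if strict ordering matters.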

Related

Sorting Mongo records on a non-unique created date-time field

I am sorting on the created field, which is in the form 2022-03-26T03:56:13.176+00:00 and is a representation of a java.time.LocalDateTime.
Sorting by created alone is not consistent, as the field is not unique due to batch operations that run quickly enough to produce duplicate values.
I've added a second sort on _id, which is an ObjectId.
It seems the second sort adds quite a bit of time to the query, more so than the first, which is odd to me.
Why does it more than double the response time, and is there a preferred way to ensure the order?
Using MongoTemplate, I sort like this:
query.with(Sort.by(Sort.Order.desc("createdOn"), Sort.Order.desc("_id")));
If your use case is always to sort in descending order on both fields, it is best to create a compound index in the expected sort order, as follows:
db.collection.createIndex({ createdOn:-1,_id:-1 })
But in general the default _id field contains the document's insertion time and is unique across MongoDB processes, so you may be able to sort on _id alone; you most probably don't need the additional sort on createdOn.
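For illustration, a sketch of the _id-only variant in the mongo shell, reusing the generic collection name from above; the first 4 bytes of an ObjectId are a seconds-resolution creation timestamp, which is why sorting on _id alone approximates insertion order:

// newest-first, approximated by the timestamp embedded in the ObjectId
db.collection.find().sort({ _id: -1 }).limit(20)

// the embedded timestamp can be inspected directly
db.collection.findOne()._id.getTimestamp()

Note that ObjectIds are generated client-side, so with several writers the _id order within the same second is not guaranteed to match the true insertion order.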

Are IDs guaranteed to be unique across indices in Elasticsearch 6+?

With mapping types being removed in Elasticsearch 6.0 I wonder if IDs of documents are guaranteed to be unique across indices?
Say I have three indices, all with a "parent" field that contains an ID. Do I need to include which index the ID belongs to or can I just search through all three indices when looking for a document with the given ID?
IDs are not unique across indices.
If you want to refer to a document you need to know both the index name and the ID.
Explicit IDs
If you explicitly set the document ID when indexing, nothing prevents you from using the same ID twice for documents going in different indices.
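For example (index names and the ID are illustrative), the same explicit ID can exist in two indices at once, and a search by ID alone will return both documents:

PUT index-a/_doc/1
{ "name": "doc in index-a" }

PUT index-b/_doc/1
{ "name": "doc in index-b" }

GET index-a,index-b/_search
{
  "query": { "ids": { "values": ["1"] } }
}

The search returns two hits, one from each index, which is why a stored reference has to carry the index name alongside the ID.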
Autogenerated IDs
If you don't set the ID when indexing, ES will generate one before storing the document.
According to the code, the ID is securely generated from a random number, the host MAC address and the current timestamp in ms. Additional work is done to ensure that the timestamp (and thus the ID sequence) increases monotonically.
To generate the same ID twice, the same random number would have to be picked when the JVM starts and the document ID would have to be generated at a specific moment with sub-millisecond precision. So while the chance of a collision exists, it's so small that I wouldn't worry about it (just as I wouldn't worry about collisions when using a hash function to check file integrity).
Final note: as a code comment notes, the implementation is opaque and could change at any time, so what I wrote might not hold true in future versions.

Weird results using search_after in Elasticsearch

I am having issues with the search_after API in Elasticsearch.
Please see this link, where I posted the full description of the problem:
https://discuss.elastic.co/t/weird-results-using-search-after-elastic-search/116609?u=ayshwarya_sree
As per the documentation for search_after:
A field with one unique value per document should be used as the tiebreaker of the sort specification. Otherwise the sort order for documents that have the same sort values would be undefined. The recommended way is to use the field _id which is certain to contain one unique value for each document.
Since you are only passing gender as the sort criterion, on your second request Elasticsearch assumes you want the results after "Female", which will be the results with gender "Male".
Try adding _id as a sort field and as a search_after value too, as in the sketch below.
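A sketch of the request shape, using the gender field from the linked question plus an illustrative index name and _id value; the second request passes the sort values of the last hit of the first page, in the same order as the sort clause:

GET my-index/_search
{
  "size": 10,
  "sort": [
    { "gender": "asc" },
    { "_id": "asc" }
  ]
}

GET my-index/_search
{
  "size": 10,
  "sort": [
    { "gender": "asc" },
    { "_id": "asc" }
  ],
  "search_after": ["female", "AVoLCVXqOIlspqVY7V4q"]
}

(More recent Elasticsearch versions discourage sorting directly on _id; copying the ID into a regular keyword field and sorting on that keeps the same request shape.)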

Lucene: Filter query by doc ID

I want the search response to contain only documents with a specified doc ID. On Stack Overflow I found this question (Lucene filter with docIds), but as far as I understand, the approach there is to create an additional field in the document and then search on that field. Is there another way to deal with it?
Lucene's docids are intended only to be internal keys. You should not be using them as search keys, or storing them for later use. Those ids are subject to change without warning. They will be changed when updating or reindexing documents, and can change at other times, such as segment merges, as well.
If you want your documents to have a unique identifier, you should generate that key separate from the docId, and index it as a field in your document.
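A minimal sketch of that approach with the Lucene Java API (Lucene 8+; field names and values are illustrative): the application-level ID goes into an untokenized StringField and is looked up with a TermQuery, so internal docids never need to be stored anywhere.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LookupByOwnId {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();

        // index a document that carries its own application-level ID
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // StringField is indexed as a single untokenized term, ideal for exact-match IDs
            doc.add(new StringField("id", "order-42", Field.Store.YES));
            doc.add(new TextField("body", "some searchable content", Field.Store.NO));
            writer.addDocument(doc);
        }

        // look the document up by that ID instead of by the internal docid
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("id", "order-42")), 1);
            System.out.println("matches: " + hits.totalHits);
        }
    }
}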

Avoid duplicate documents in Elasticsearch

I parse documents from a JSON file, which will be added as children of a parent document. I just post the items to the index without specifying an ID.
Sometimes there will be updates to the JSON and items will be added to it. So e.g. I parse 2 documents from the JSON, and after a week or two I parse the same JSON again; this time it contains 3 documents.
I found answers like "remove all children and insert all items again", but I doubt this is the solution I'm looking for.
I could compare each item to the children of my target-parent and add new documents, if there is no equal child.
I wondered if there is a way, to let elasticsearch handle duplicates.
Duplication needs to be handled in the ID handling itself.
Choose a key that is unique for a document and use that as the _id. If the key is too large, or it is made up of multiple fields, create a SHA checksum of it and use that as the _id.
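For example (index name, field names and the ID value are illustrative), if the combination of parent ID and item number identifies an item, the client can compute a SHA-1 of that composite key and use it as the _id:

# _id computed client-side, e.g. sha1(parentId + "|" + itemNo); the value below is made up
PUT my-index/_doc/3f9a0c5e7d1b2a4c6e8f0a1b2c3d4e5f67890abc
{
  "parentId": "parent-17",
  "itemNo": 3
}

Re-posting the same item later hits the same _id and becomes an update rather than a new duplicate document.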
If you already have duplicates in the index, you can use a terms aggregation nested with a top_hits aggregation to detect them.
You can read more about this approach here.
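A sketch of that aggregation, assuming the dedupe key is also stored in a regular keyword field (called fingerprint here for illustration); min_doc_count: 2 keeps only keys that occur more than once:

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_keys": {
      "terms": {
        "field": "fingerprint",
        "min_doc_count": 2,
        "size": 100
      },
      "aggs": {
        "duplicate_docs": {
          "top_hits": {
            "size": 5,
            "_source": false
          }
        }
      }
    }
  }
}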
When you index a document with an explicit ID, Elasticsearch first checks whether a document with that ID already exists. If it does, the document is updated instead of a duplicate being added (the _version field is incremented at the same time to track the number of updates). You will therefore need to keep track of your document IDs somehow and reuse the same ID for matching documents to eliminate the possibility of duplicates.
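For example (illustrative index and ID), sending the same indexing request twice shows that behaviour: the first PUT responds with "result": "created" and "_version": 1, the second with "result": "updated" and "_version": 2.

PUT my-index/_doc/1
{ "user": "alice", "order": 42 }

PUT my-index/_doc/1
{ "user": "alice", "order": 42 }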
