Are IDs guaranteed to be unique across indices in Elasticsearch 6+? - elasticsearch

With mapping types being removed in Elasticsearch 6.0 I wonder if IDs of documents are guaranteed to be unique across indices?
Say I have three indices, all with a "parent" field that contains an ID. Do I need to include which index the ID belongs to or can I just search through all three indices when looking for a document with the given ID?

IDs are not unique across indices.
If you want to refer to a document you need to know both the index name and the ID.

Explicit IDs
If you explicitly set the document ID when indexing, nothing prevents you from using the same ID twice for documents going in different indices.
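As a minimal sketch of why the index name has to travel with the ID, the following Python uses a plain dictionary as a stand-in for a cluster. It does not use the Elasticsearch client; the `DocRef` type and the index names `orders` and `invoices` are hypothetical.

```python
from typing import NamedTuple


class DocRef(NamedTuple):
    """Hypothetical reference type: an ID only means something with its index."""
    index: str
    doc_id: str


# Toy stand-in for a cluster: documents are keyed by (index, id), not by id alone.
store = {
    DocRef("orders", "42"): {"kind": "order"},
    DocRef("invoices", "42"): {"kind": "invoice"},  # same ID, different index
}


def find_by_id_only(doc_id):
    """Looking up a bare ID across all indices may match several documents."""
    return [ref for ref in store if ref.doc_id == doc_id]


matches = find_by_id_only("42")            # ambiguous: two hits
parent = store[DocRef("invoices", "42")]   # unambiguous: index + ID
```

Storing the "parent" field as an index/ID pair rather than a bare ID avoids the ambiguity shown by `find_by_id_only`.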
Autogenerated IDs
If you don't set the ID when indexing, ES will generate one before storing the document.
According to the code, the ID is securely generated from a random number, the host MAC address and the current timestamp in ms. Additional work is done to ensure that the timestamp (and thus the ID sequence) increases monotonically.
To generate the same ID twice, a specific random number has to be picked when the JVM starts and the document ID must be generated at a specific moment with sub-millisecond precision. So while the chance exists, it's so small that I wouldn't care about it (just like I wouldn't care about collisions when using a hash function to check file integrity).
Final note: as a code comment notes, the implementation is opaque and could change at any time, so what I wrote might not hold true in future versions.
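To make the ingredients described above concrete, here is a simplified Python sketch of a time-based ID generator. It is not Elasticsearch's code: the byte layout (6 bytes of timestamp, 6 bytes of MAC address, 3 bytes of sequence), the class name, and the use of `uuid.getnode()` are assumptions chosen for illustration.

```python
import base64
import secrets
import threading
import time
import uuid


class TimeBasedIdSketch:
    """Illustrative only: timestamp + MAC address + randomly seeded sequence."""

    def __init__(self):
        self._lock = threading.Lock()
        # The sequence starts at a random value, picked once per process.
        self._sequence = secrets.randbelow(1 << 24)
        self._last_ms = 0
        # uuid.getnode() returns a 48-bit hardware address (or a random one).
        self._node = uuid.getnode().to_bytes(6, "big")

    def next_id(self):
        with self._lock:
            self._sequence = (self._sequence + 1) & 0xFFFFFF
            now_ms = int(time.time() * 1000)
            # Never let the timestamp go backwards, even if the clock does.
            self._last_ms = max(self._last_ms, now_ms)
            if self._sequence == 0:
                # Sequence wrapped around: bump the timestamp to stay unique.
                self._last_ms += 1
            ts, seq = self._last_ms, self._sequence
        raw = ts.to_bytes(6, "big") + self._node + seq.to_bytes(3, "big")
        # 15 bytes encode to 20 URL-safe Base64 characters with no padding.
        return base64.urlsafe_b64encode(raw).decode("ascii")


generator = TimeBasedIdSketch()
sample_ids = [generator.next_id() for _ in range(10_000)]
```

Within one process the (timestamp, sequence) pair never repeats; across processes a collision would need the same MAC address, the same millisecond and the same sequence value, which is the small chance discussed above.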

Related

Lucene: Filter query by doc ID

I want the search response to contain only documents with specified doc IDs. On Stack Overflow I found this question (Lucene filter with docIds), but as far as I understand, it creates an additional field in the document and then searches by that field. Is there another way to deal with it?
Lucene's docids are intended only to be internal keys. You should not be using them as search keys, or storing them for later use. Those ids are subject to change without warning. They will be changed when updating or reindexing documents, and can change at other times, such as segment merges, as well.
If you want your documents to have a unique identifier, you should generate that key separate from the docId, and index it as a field in your document.
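The difference between an internal position and a stable key can be shown with a toy model in Python. This is not Lucene's API: the list stands in for a segment, the list position stands in for the internal docid, and `doc_key`, `add_document`, `compact` and `find_by_key` are hypothetical names.

```python
import uuid


def add_document(segment, fields):
    """Append a document, storing its own stable key as a regular field."""
    doc = dict(fields, doc_key=str(uuid.uuid4()))
    segment.append(doc)
    return doc["doc_key"]


def compact(segment, deleted_positions):
    """Toy 'merge': drop deleted documents, which shifts later positions."""
    return [d for pos, d in enumerate(segment) if pos not in deleted_positions]


def find_by_key(segment, key):
    """Look a document up by the key field, never by its position."""
    return next(d for d in segment if d["doc_key"] == key)


segment = []
key_a = add_document(segment, {"title": "a"})
key_b = add_document(segment, {"title": "b"})
position_of_b_before = 1

segment = compact(segment, {0})  # document "a" is deleted and merged away
position_of_b_after = segment.index(find_by_key(segment, key_b))
```

After the compaction the position of document "b" has changed, but the lookup by `doc_key` still finds it.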

Elasticsearch. Is it possible to make Elasticsearch use only numbers in auto-assigned IDs?

When Elasticsearch indexes a document and you do not provide an ID, it assigns an auto-generated ID consisting of 20 symbols (letters and numbers). I need to use numbers only. Is it possible to change the mask/pattern/type of the auto-assigned ID?
This is currently not supported by ES. You can provide your own numerical ID sequence, though the values will always be converted to strings first when used as the _id.
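A client-side numeric sequence could look like the following Python sketch. The generator name `numeric_ids` is hypothetical, and a single in-process counter is assumed; several writers would need to coordinate the counter themselves.

```python
import itertools


def numeric_ids(start=1):
    """Yield consecutive numbers as strings, the form an _id takes anyway."""
    for n in itertools.count(start):
        yield str(n)


ids = numeric_ids()
first_three = [next(ids) for _ in range(3)]
```

Because the values end up as strings, they compare lexicographically ("10" sorts before "9"), so they are not suitable for numeric ordering without a separate numeric field.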

Sort by a different index's values

Given two indexes, I'm trying to sort the first based on values of the second.
For example, Index 1 ('Products') has fields id, name. Index 2 ('Prices') has fields id, price.
I'm struggling to figure out how to sort 'Products' by 'Prices'.price, assuming the ids match. The reason for this question is that, hypothetically, the 'Products' index becomes very large (with duplicate ids), and updating all documents becomes expensive.
Elasticsearch is a document-based store rather than a column-based store. What you're looking for is a way to JOIN the two indices, but this is not supported in Elasticsearch. The 'Elasticsearch way' of storing these documents is to have one index that contains all relevant data. If you're worried about update procedures taking very long, look into creating an index with an alias. When you need to do a major update, apply it to a new index, and only when you're done switch the alias target to the new index. This will allow you to update your data seamlessly.
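As a sketch of the alias switch, the Python below builds a request body for the `_aliases` endpoint, which accepts a list of `add` and `remove` actions in one request. The alias name `products` and the index names `products_v1` and `products_v2` are hypothetical, and sending the request is left to whichever HTTP client you use.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build a body that moves an alias from the old index to the new one."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Intended to be sent as: POST /_aliases
body = alias_swap_actions("products", "products_v1", "products_v2")
print(json.dumps(body, indent=2))
```

Applications keep querying `products` throughout; only the index behind the alias changes once the new index is fully built.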

Avoid duplicate documents in Elasticsearch

I parse documents from a JSON, which will be added as children of a parent document. I just post the items to the index, without taking care about the id.
Sometimes there will be updates to the JSON and items will be added to it. So e.g. I parsed 2 documents from the JSON and after a week or two I parse the same JSON again. This time the JSON contains 3 documents.
I found answers like: 'remove all children and insert all items again.', but I doubt this is the solution I'm looking for.
I could compare each item to the children of my target-parent and add new documents, if there is no equal child.
I wondered if there is a way, to let elasticsearch handle duplicates.
Duplicates need to be handled through the choice of ID itself.
Choose a key that is unique for a document and use it as the _id. If the key is too large, or it consists of multiple keys, create a SHA checksum from it and use that as the _id.
If you already have duplicates in the index, you can use a terms aggregation with a nested top_hits aggregation to detect them.
You can read more about this approach here.
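A search body for that detection step might look like the Python sketch below. The field name `item_key` and the aggregation names are hypothetical; the field is assumed to be one that supports terms aggregations (for example a keyword field).

```python
import json


def duplicate_detection_body(field, max_buckets=100, docs_per_bucket=5):
    """Group documents by `field` and keep only groups with 2 or more hits."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "aggs": {
            "duplicate_keys": {
                "terms": {
                    "field": field,
                    "min_doc_count": 2,
                    "size": max_buckets,
                },
                "aggs": {
                    "duplicate_docs": {"top_hits": {"size": docs_per_bucket}}
                },
            }
        },
    }


body = duplicate_detection_body("item_key")
print(json.dumps(body, indent=2))
```

Each returned bucket is a key value that occurs more than once, and the nested `top_hits` lists some of the documents sharing it.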
When a new document is added to Elasticsearch with an explicit ID, Elasticsearch first checks whether a document with that ID already exists in the index. If it does, the existing document is updated instead of a duplicate being added (the version field is incremented at the same time to track the number of updates that have occurred). You will therefore need to keep track of your document IDs somehow and use the same ID for matching documents to eliminate the possibility of duplicates.
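The checksum-as-ID idea can be sketched in Python as follows. A dictionary stands in for the index here, so this only models the overwrite-and-increment-version behaviour rather than calling Elasticsearch; the function names and the choice of SHA-1 are assumptions.

```python
import hashlib
import json


def deterministic_id(doc, key_fields):
    """Derive a stable _id from the fields that make a document unique."""
    key = json.dumps([doc[f] for f in key_fields], separators=(",", ":"))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def index_document(index, doc, key_fields):
    """Toy indexing: the same ID overwrites and bumps the version."""
    doc_id = deterministic_id(doc, key_fields)
    previous_version = index[doc_id]["version"] if doc_id in index else 0
    index[doc_id] = {"version": previous_version + 1, "source": doc}
    return doc_id


children = {}
first_parse = [{"parent": "p1", "name": "a"}, {"parent": "p1", "name": "b"}]
second_parse = first_parse + [{"parent": "p1", "name": "c"}]

# Parsing the same JSON again re-sends the old items plus one new item.
for item in first_parse + second_parse:
    index_document(children, item, key_fields=["parent", "name"])
```

After both parses the toy index holds three documents, not five: the two repeated items were overwritten under their existing IDs.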

Is ElasticSearch Auto-Generated Ids sequential?

If we do not specify an ID when inserting a document into Elasticsearch, the ID is generated automatically. I also understand that the IDs are Flake IDs, which have a predictable pattern.
My question is: are these generated Flake IDs sequential enough that I can sort on _id or _uid and be sure the results are in the same order as they were inserted?
The autogenerated _id is not sequential. It is a URL-safe, Base64-encoded GUID generated using a modified FlakeID algorithm. FlakeID is a decentralized algorithm that generates k-ordered unique IDs.
Note that Elasticsearch no longer generates the _id using random UUIDs.
See for more details:
https://github.com/elastic/elasticsearch/issues/5941
https://github.com/elastic/elasticsearch/pull/7531
https://github.com/ppearcy/elasticflake
Elasticsearch's autogenerated _id is random, not sequential, and the same is true of _uid. If you want to sort in insertion order, the easy option is to enable _timestamp, which records the time the document was indexed.
However, _timestamp is updated whenever the document is updated, so you may want to create a new date field and set it to the current time yourself.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-timestamp-field.html
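Setting such a date field on the client side could look like the Python sketch below. The field name `created_at` and the helper `with_created_at` are hypothetical; the sort body uses the standard `sort` clause of a search request.

```python
from datetime import datetime, timezone


def with_created_at(doc, now=None):
    """Add a creation time on first insert and keep it on later updates."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {**doc, "created_at": doc.get("created_at", stamp)}


# Search body fragment that orders results by the manually set field.
sort_by_insertion = {"sort": [{"created_at": {"order": "asc"}}]}

inserted = with_created_at({"title": "first"})
updated = with_created_at({**inserted, "title": "first (edited)"})
```

Because the existing `created_at` value is preserved, an update does not move the document in the insertion order.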
