Sort by a different index's values - sorting

Given two indexes, I'm trying to sort the first based on values of the second.
For example, Index 1 ('Products') has fields id, name. Index 2 ('Prices') has fields id, price.
Struggling to figure out how to sort 'Products' by the 'Prices'.price, assuming the ids match. Reason for this quest is that hypothetically the 'Products' index becomes very large (with duplicate ids), and updating all documents becomes expensive.

Elasticsearch is a document based store, rather than a column based store. What you're looking for is a way to JOIN the two indices, however this is not supported in Elasticsearch. The 'Elasticsearch way' of storing these documents is to have 1 index that contains all relevant data. If you're worried about update procedures taking very long, look into creating an index with an Alias. When you need to do a major update, do it to a new index and only when you're done switch the alias target to the new index, this will allow you to update you data seamlessly

Related

Which is the best way to index the data from relational database table of One to many relationship

Can you please let me know which is the best way to index the records in elastic search for my scenario.
My Scenario is :
1) Need to index around 40 million records from oracle table which has entries having one to many relationship records. And the uniqueness of the records is based on the composite key with 4 columns
2) After indexing , Search should support "full text search" on all the fields
3) Filters and sorting on selected fields needs to be supported.
After going through the official documentation i found couple of options , but want to know which approach would be most useful among below
1) For each record in table create a entry in the elastic index
2) Create a nested json object based on the composite key and then add this elastic index
3)Parent child Relationship mechanism and application side joins are not suitable for my scenario
Thanks
Girish T S
Your question is not particularly clear, here's how I understand it: you have 40M child records in one table, each with a reference to a parent record.
You want to index your records so as to be able to search for a parent record whose children match certain criteria.
There are two solutions here:
Indexing one document per parent, with all children indexed as nested documents within the parent
Indexing each child record as a separate document, with a parent-child relationship in ElasticSearch
The first solution will have better performance, but it means that every time a child is updated, the full parent document must be reindexed with all its children.
In any case you're saying that a parent-child scheme is not suitable for your case, so you're left with only the first solution.

Are IDs guaranteed to be unique across indices in Elasticsearch 6+?

With mapping types being removed in Elasticsearch 6.0 I wonder if IDs of documents are guaranteed to be unique across indices?
Say I have three indices, all with a "parent" field that contains an ID. Do I need to include which index the ID belongs to or can I just search through all three indices when looking for a document with the given ID?
IDs are not unique across indices.
If you want to refer to a document you need to know both the index name and the ID.
Explicit IDs
If you explicitly set the document ID when indexing, nothing prevents you from using the same ID twice for documents going in different indices.
Autogenerated IDs
If you don't set the ID when indexing, ES will generate one before storing the document.
According to the code, the ID is securely generated from a random number, the host MAC address and the current timestamp in ms. Additional work is done to ensure that the timestamp (and thus the ID sequence) increases monotonically.
To generate the same ID, when the JVM starts a specific random number has to be picked and the document ID must be generated in a specific moment with sub-millisecond precision. So while the chance exists, it's so small that I wouldn't care about it. (just like I wouldn't care about collisions when using an hash function to check file integrity)
Final note: as a code comment notes, the implementation is opaque and could change at any time, so what I wrote might not hold true in future versions.

Categorising documents in elasticsearch

I've got a bunch of ES documents that I'd like to put into "collections". Each document has a unique integer as an ID. Each collection also needs to have a unique integer as an ID.
I need to be able to run queries to get a list of docs in a collection, and easily add an existing doc to a collection.
What would be the most efficient and logical way of approaching this:
An index of collections, which each has an array of document IDs, or
For each document have an array of integers (or a single integer) indicating to which collections it belongs?
Thank you.

Avoid duplicate documents in Elasticsearch

I parse documents from a JSON, which will be added as children of a parent document. I just post the items to the index, without taking care about the id.
Sometimes there will be updates to the JSON and items will be added to it. So e.g. I parsed 2 documents from the JSON and after a week or two I parse the same JSON again. This time the JSON contains 3 documents.
I found answers like: 'remove all children and insert all items again.', but I doubt this is the solution I'm looking for.
I could compare each item to the children of my target-parent and add new documents, if there is no equal child.
I wondered if there is a way, to let elasticsearch handle duplicates.
Duplication needs to be handled in ID handling itself.
Choose a key that is unique for a document and make that as the _id. In the the key is too large or it is multiple keys , create a SHAH checksum out of it and make that as the _id.
If you already have dedupes in the database , you can use terms aggregation nested with top_hits aggregation to detect those.
You can read more about this approach here.
When adding a new document to elasticsearch, it first scans the existing documents to see if any of the IDs match. If there is already an existing document with that ID, the document will be updated instead of adding in a duplicate document (the version field will be updated at the same time to track the amount of updates that have occurred). You will therefore need to keep track of your document IDs somehow and maintain the same IDs throughout matching documents to eliminate the possibility of duplicates.

Having a string array field as an index, good or bad?

My gut feel is that setting a string (with array elements) field as an index on a table will be bad for performance (where the bulk of the operations done on a table are inserts and updates - the table holds transactional data and its current size is approximately 20 mil records).
The string extends a type with 4 array elements, where all of them aren’t always populated. I need to justify why not to set this field as one of the indexes. I’ve tried searching for answers, reading Kimberley Tripps blog, going through best practises re indexes on MSDN (which only mentions indexes are best on numerics first, then string fields), etc. But none of these mention indexing the table on a field that is of an array type. What reasons can I give to justify not indexing on the string-array field. And if my gut feel is totally wrong and indexes work well on array fields, why so?
A Memo or Container field cannot be part of an index in AX.
Furthermore, columns consisting of the ntext, text, or image data types cannot be specified as columns for an index in SQL Server.
Let's say you have an extended data type ArrElement with 3 additional array elements ArrElement2, ArrElement3, ArrElement4. Creating an index with a field of the ArrElement type in AX will effectively create an index with 4 fields (ArrElement, ArrElement2, ArrElement3, and ArrElement4 - in that order) in SQL Server. You cannot change the order of the array elements in the index, but in my opinion there's really nothing wrong in having such an index if it really serves your purpose. Hope that answers your question.
As #10p noted adding say Dimension as the only field, will create an index of all the array elements: Dimension, Dimension2_, Dimension3_ (which are the names of the SQL table fields).
The value of such an index will depend on the queries performed. If only Dimension[3] is queried, then the index is of no value because Dimension[1] and Dimension[2] is not known.
This could be solved by creating an index for each of the array elements, for example:
Dim1Idx: Dimension[1] (maybe append more fields)
Dim2Idx: Dimension[2] (maybe append more fields)
Dim3Idx: Dimension[3] (maybe append more fields)
Individual array elements can be selected by using the combo-box on the index field.
The value of such indexes should be weighted against the added cost of insertion (and update, if the array values are changed).

Resources