Categorising documents in elasticsearch - elasticsearch

I've got a bunch of ES documents that I'd like to put into "collections". Each document has a unique integer as an ID. Each collection also needs to have a unique integer as an ID.
I need to be able to run queries to get a list of docs in a collection, and easily add an existing doc to a collection.
What would be the most efficient and logical way of approaching this:
An index of collections, which each has an array of document IDs, or
For each document have an array of integers (or a single integer) indicating to which collections it belongs?
Thank you.

Related

elasticsearch: decide which query should run first

We have a simple web page, where the user can provide some input and query the database. We currently use mongodb but want to migrate to elasticsearch, since the queries are faster.
There are some required search fields, like start and end date, and some optional ones, like a search string to match an entry, or a parent search string, to match parent entries. Parent-child relations are just described through fields containing each entry's ancestors ids.
The question is the following: If both search and parent search string are provided, is there a way to know before executing the queries, which query should be executed first, in order to provide results faster and to be more performant?
For example, it could be that a specific parent search results in only 2 docs/parent entries, and then we can fetch all children matching the search string. In that case we should execute firstly the parent query and then the entry query.
One option would be to get the count of both queries and then execute first the one with the smallest count, but isn't this solution worse, since the queries are going to be executed twice? Once for the count and once for the actual query.
Are there any other options to solve this?
PS. We use elasticsearch v1.7
Example
Let's say the user wants to search for all entries matching the following fields.
searchString: type:BLOCK AND name:test
parentSearchString: name:parentTest AND NOT type:BLOCK
This means that we either have to
fetch all entries (parents) matching the parentSearchString and store their ids. Then, we have to fetch all entries that match the searchString and also have to contain any of the parent ids in the ancestors field.
OR
fetch all entries that match the searchString and store all ancestors ids. Then fetch all entries that match the parentSearchString and their id is one of the ancestors ids.
Just to clarify, both parent and children entries have the exact same structure and reside in the same index. We cannot have different indices since the pare-child relation can be 10 times nested, so an entry can be both a parent and a child. An entry looks more or less like:
{
id: "e32452365321",
name: "name",
type: "type",
ancestors: "id1 id2 id3" // stored in node as an array of ids
}
First of all, I would advise you, to upgrade your Elasticsearch version, if possible. There happened a lot since 1.7 and to be honest, I can't tell if all of what's written in the following article is valid for such an old version (probably it isn't).
But to your actual question: Hopefully I am understanding you correctly, but you try to estimate how costly a query for Elasticsearch is? Well, you don't have to. If you provide all 'queries' in one nested query, Elasticsearch will do that for you: https://www.elastic.co/blog/elasticsearch-query-execution-order
Regarding speed, there is one other thing I can mention: calculating score does take time. So if sorting is not based on the elasticsearch _score, you want to use boolean filter queries. This would also apply, if you want to sort only by _score of parent matches, then you could put the query for children into a filter.
update
Thanks to your example, I now see the problem. Self referencial Parent-Child relations are unfortunately not supported by ElasticSearch, so your approach is probably right. You might want to check out the short chapter of the documentation about application-joins.
So yes, in general, you want to send the second query with the least possible amount of ids/terms. While getting counts for both queries is not as bad as you might think, because the results are most likely still cached, does it actually help? Because if you're going from child to parent, you would have to count the ancestors (field values), and not the actual document count.
I would argue, that the most expensive operation is very often fetching result source from disk. So whichever way you go, you probably should only fetch what you need in the first query. So your options are:
Fetch only the id of parent matches, and then use a terms filter on ancestors in the second query.
Or, fetch only the ancestors field of child matches, and use an id filter in your second query.
Unfortunately, I can't help you more than that, since I don't have enough experience in comparing speed of those approaches. My guess would be, that an id filter might be faster in general. But that's just a guess...

Sort by a different index's values

Given two indexes, I'm trying to sort the first based on values of the second.
For example, Index 1 ('Products') has fields id, name. Index 2 ('Prices') has fields id, price.
Struggling to figure out how to sort 'Products' by the 'Prices'.price, assuming the ids match. Reason for this quest is that hypothetically the 'Products' index becomes very large (with duplicate ids), and updating all documents becomes expensive.
Elasticsearch is a document based store, rather than a column based store. What you're looking for is a way to JOIN the two indices, however this is not supported in Elasticsearch. The 'Elasticsearch way' of storing these documents is to have 1 index that contains all relevant data. If you're worried about update procedures taking very long, look into creating an index with an Alias. When you need to do a major update, do it to a new index and only when you're done switch the alias target to the new index, this will allow you to update you data seamlessly

Avoid duplicate documents in Elasticsearch

I parse documents from a JSON, which will be added as children of a parent document. I just post the items to the index, without taking care about the id.
Sometimes there will be updates to the JSON and items will be added to it. So e.g. I parsed 2 documents from the JSON and after a week or two I parse the same JSON again. This time the JSON contains 3 documents.
I found answers like: 'remove all children and insert all items again.', but I doubt this is the solution I'm looking for.
I could compare each item to the children of my target-parent and add new documents, if there is no equal child.
I wondered if there is a way, to let elasticsearch handle duplicates.
Duplication needs to be handled in ID handling itself.
Choose a key that is unique for a document and make that as the _id. In the the key is too large or it is multiple keys , create a SHAH checksum out of it and make that as the _id.
If you already have dedupes in the database , you can use terms aggregation nested with top_hits aggregation to detect those.
You can read more about this approach here.
When adding a new document to elasticsearch, it first scans the existing documents to see if any of the IDs match. If there is already an existing document with that ID, the document will be updated instead of adding in a duplicate document (the version field will be updated at the same time to track the amount of updates that have occurred). You will therefore need to keep track of your document IDs somehow and maintain the same IDs throughout matching documents to eliminate the possibility of duplicates.

Is there a way to efficiently find all unique values for a field in MongoDB?

Consider a collection of Users:
{ name: 'Jeff' }
{ name: 'Joel' }
Is there a way to efficiently get all the unique values for name?
User.pluck(:name).uniq
To return
[ 'Jeff', 'Joel' ]
I think this would get the whole collection, so it would be inefficient.
However, if there is an index on name, is there a way to get all the unique values without getting all the documents?
Or is there another way to efficiently get the unique names?
As indicated in the comments, you can efficiently get the unique values of a field over all docs in a collection using distinct.
The documentation specifically mentions that indexes are used when possible, and that they can cover the distinct query. This means that only the supporting index needs to be loaded into memory to get the results.
When possible, db.collection.distinct() operations can use indexes.
Indexes can also cover db.collection.distinct() operations. See
Covered Query for more information on queries covered by indexes.
In Ruby, you would perform your distinct query as:
User.distinct(:name)

Having a string array field as an index, good or bad?

My gut feel is that setting a string (with array elements) field as an index on a table will be bad for performance (where the bulk of the operations done on a table are inserts and updates - the table holds transactional data and its current size is approximately 20 mil records).
The string extends a type with 4 array elements, where all of them aren’t always populated. I need to justify why not to set this field as one of the indexes. I’ve tried searching for answers, reading Kimberley Tripps blog, going through best practises re indexes on MSDN (which only mentions indexes are best on numerics first, then string fields), etc. But none of these mention indexing the table on a field that is of an array type. What reasons can I give to justify not indexing on the string-array field. And if my gut feel is totally wrong and indexes work well on array fields, why so?
A Memo or Container field cannot be part of an index in AX.
Furthermore, columns consisting of the ntext, text, or image data types cannot be specified as columns for an index in SQL Server.
Let's say you have an extended data type ArrElement with 3 additional array elements ArrElement2, ArrElement3, ArrElement4. Creating an index with a field of the ArrElement type in AX will effectively create an index with 4 fields (ArrElement, ArrElement2, ArrElement3, and ArrElement4 - in that order) in SQL Server. You cannot change the order of the array elements in the index, but in my opinion there's really nothing wrong in having such an index if it really serves your purpose. Hope that answers your question.
As #10p noted adding say Dimension as the only field, will create an index of all the array elements: Dimension, Dimension2_, Dimension3_ (which are the names of the SQL table fields).
The value of such an index will depend on the queries performed. If only Dimension[3] is queried, then the index is of no value because Dimension[1] and Dimension[2] is not known.
This could be solved by creating an index for each of the array elements, for example:
Dim1Idx: Dimension[1] (maybe append more fields)
Dim2Idx: Dimension[2] (maybe append more fields)
Dim3Idx: Dimension[3] (maybe append more fields)
Individual array elements can be selected by using the combo-box on the index field.
The value of such indexes should be weighted against the added cost of insertion (and update, if the array values are changed).

Resources