Elasticsearch: nested vs. flat indices - elasticsearch

What are the pros and cons of a nested index in Elasticsearch?
I am thinking about temporal data for users or devices. By flat I mean that all entries are stored at the root of the index; by nested I mean that the data is grouped by device/user id, so there is one document per user/device id that contains one nested document per time entry.
As pros I see:
Nested indices offer more querying possibilities.
And as cons:
Writes are more costly.
Index management could be much more difficult (how to expire the data? No easy index rollover; how to easily spread the data across different indices?)

I totally agree with your pros and cons of the ES nested type. I just want to elaborate on the depth of the indexing cost. Keep in mind that nested fields also open up more query facilities.
If you use the nested type and modify documents infrequently, it works well and gives you a broader scope for queries, but if you change documents frequently, the cost is high.
A nested mapping has more impact on indexing than a flat mapping. Lucene has no concept of nested object types; everything is stored as flat documents, so an additional operation is performed at indexing time.
Imagine a large nested document that translates to 100k internal Lucene documents, and compare it with a flat data model in which the 100k parts are indexed as independent documents. Adding a single entry at the deepest nested level adds a single document in the flat model, whereas the nested model ends up reindexing 100k + 1 documents. If, on the other hand, you change something at the root, all documents need to be updated in both cases. So a single change to a document can cost you reindexing every nested object of that document.
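The arithmetic above can be sketched as a small cost model. This is only an illustration under the assumption that cost is proportional to the number of Lucene documents written; the function names are hypothetical and nothing here calls Elasticsearch.

```python
def nested_append_cost(internal_docs, added=1):
    """Lucene docs written when appending to one nested document.

    The root and all of its nested objects are rewritten together,
    so every existing internal doc is written again plus the new ones.
    """
    return internal_docs + added


def flat_append_cost(added=1):
    """Lucene docs written when every entry is an independent document."""
    return added


# The example from the answer: a nested document that already
# translates to 100k internal documents.
print("nested:", nested_append_cost(100_000))
print("flat:  ", flat_append_cost())
```

The model ignores segment merges and deletes, so it only conveys the order of magnitude of the difference.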

Let's go through the pros and cons of flat fields, dynamic, object, flattened and nested.
flat fields
pros: Simple and clear; each field's type is specified in the mappings.
cons: A new field needs a put-mapping call first. The number of fields is limited by the engine. Different fields have no relationship to each other.
dynamic
Controls what happens to new fields or inner objects that have no mapping yet. By default (true), new fields are added to the mapping automatically.
Set to false, new fields are not indexed, but they can still be retrieved from _source.
Set to strict, an error is reported when a document contains an unknown field.
cons: strict is inflexible, while dynamic mapping can lead to a mapping explosion, since Elasticsearch limits the number of fields to 1000 by default; see index.mapping.total_fields.limit: 1000
flattened
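A better fit than dynamic for objects with arbitrary keys; available since 7.3.
pros: No limit on the number of keys, because the whole object is mapped as a single field; everything is flattened, as the name says.
cons: Supports only a few query types, such as term or exists; no highlighting.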
object
object is the default type for a JSON object ({}) in the mappings.
pros: Natively supported, including for inner objects.
cons: Arrays of objects are not handled properly. Elasticsearch actually has no concept of inner objects, so it flattens object hierarchies into a simple list of field names and values, and the association between fields of the same object is lost. See https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#nested-arrays-flattening-objects
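The flattening described in the linked documentation can be imitated in plain Python. This is a sketch of the effect, not of Elasticsearch internals; the helper names are hypothetical and the user.first / user.last data follows the example used in that documentation page.

```python
def flatten_object_array(field, objects):
    """Mimic how an `object` field stores an array of objects:
    one multi-valued list per leaf field name."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_flattened(flat, first, last):
    """AND of two independent field lookups, as on an `object` field."""
    return first in flat["user.first"] and last in flat["user.last"]


def matches_nested(objects, first, last):
    """Both conditions must hold inside the same object, as with `nested`."""
    return any(o["first"] == first and o["last"] == last for o in objects)


users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]
flat = flatten_object_array("user", users)

print(flat)
print(matches_flattened(flat, "Alice", "Smith"))  # True: a false positive
print(matches_nested(users, "Alice", "Smith"))    # False
```

No user is called Alice Smith, yet the flattened lookup matches because the two values come from different objects.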
nested
pros: Supports complex structures such as arrays of objects or objects containing arrays, and keeps each object separate, so the relation between fields inside a single object is preserved. Suppose we have a candidates index in which many candidates have multiple education backgrounds. With nested fields, we can retrieve exactly the candidates who graduated from CMU with a major in CS.
cons: Each nested object is indexed as a separate Lucene document; one document with 100 nested objects creates 101 Lucene documents.
Both the number of nested fields per index and the number of nested objects per document are limited, defaulting to 50 and 10000.
see also
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#_limits_on_nested_mappings_and_objects
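For the candidates example, the request bodies might look roughly like the following, written as Python dicts that serialize to the JSON Elasticsearch expects. The field names (name, education.school, education.major) and the keyword types are assumptions made for illustration; only the nested mapping type and the nested query are taken from the discussion above.

```python
import json

# Hypothetical mapping: each education entry becomes its own nested object.
candidates_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "keyword"},
            "education": {
                "type": "nested",
                "properties": {
                    "school": {"type": "keyword"},
                    "major": {"type": "keyword"},
                },
            },
        }
    }
}

# Both terms must match inside the same nested education object.
cmu_cs_query = {
    "query": {
        "nested": {
            "path": "education",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"education.school": "CMU"}},
                        {"term": {"education.major": "CS"}},
                    ]
                }
            },
        }
    }
}

print(json.dumps(candidates_mapping, indent=2))
print(json.dumps(cmu_cs_query, indent=2))
```

The first body would be sent when creating the index and the second to the search endpoint.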

Related

Performance of nested type in elasticsearch

I just have this simple question.
I've been trying to understand how nested fields work in Elasticsearch. There are not many resources out there that specifically discuss which use cases they are good for. I get that they are useful for querying a document with an array-like field, and that a nested object is stored in the same way as the document itself.
I was wondering about extreme cases of gigantic arrays within a document, like a comment section, or unique "views" for a post. Would a nested field for such arrays speed up the querying? Would it not matter at all?

Why is querying slower for parent/child relationships versus when using nested documents?

Queries on parent/child structures created with the join field are considered slower than queries on nested documents. The general explanation I found is that in the first case the queries are slower because the child is stored separately from the parent. Still, I would like a fuller picture of why this happens. I have read that for nested documents Elasticsearch actually indexes the root object and the nested objects separately, then relates them internally. I have also read that in this case the documents are stored in the same Lucene block and that this makes queries faster. But still, what exactly does the inverted index look like for nested documents versus parent/child?

What are the pros and cons of using Nested mapping in Elasticsearch vs parent child relationship

Currently in my ES document structure there is a field of type 'Object'. This is a JSON object that can have up to 3000 fields inside. The problem is that at times my ES runs out of memory because the document is too large, so I am looking to change my document structure.
The two structures I am looking at are nested mappings and the parent/child relationship. Both structures satisfy my search requirements. Points being considered:
I read that nested queries are much faster than child queries.
Nested mappings too save the nested fields as separate documents.
Two points of confusion that I am facing :
How does nested indexing work? Does ES receive the whole document in one go and analyze it completely at once, or are the requests for nested documents sent individually? In the first case, ES might run out of memory again.
When we say parent/child queries are slower, how much slower do we mean?
Looking for inputs.
Nested documents are faster than parent/child and simpler to manage. With parent/child you can in fact index a child without its parent, so you have to be careful when indexing. Also, when you delete a parent you have to delete all of its children yourself; this is not done automatically.
On the other hand, parent/child is more convenient if you need to change or update entries. With the nested type you cannot change a single nested value; you have to reindex all the nested values in the nested field. With parent/child you can update a single parent or child document on its own.
Nested objects are treated as one atomic unit together with their root document in the index, whereas parent/child is just a separate datatype that keeps the relation between two kinds of documents, parent and child.
You can read kimchy's post, and for the slowness of parent/child the last comment of the discussion, here: https://discuss.elastic.co/t/choosing-parent-child-vs-nested-document/6742
Nested ::
Nested docs are stored in the same Lucene block as each other, which helps read/query performance. Reading a nested doc is faster than the equivalent parent/child.
Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs.
"Cross referencing" nested documents is impossible.
Best suited for data that does not change frequently.
Parent/Child ::
Children are stored separately from the parent, but are routed to the same shard. So parent/children are slightly less performant on read/query than nested.
Parent/child mappings have a bit of extra memory overhead, since ES maintains a "join" list in memory.
Updating a child doc does not affect the parent or any other children, which can potentially save a lot of indexing on large docs.
Sorting/scoring can be difficult with Parent/Child since the Has Child/Has Parent operations can be opaque at times.
The main difference is that nested docs are faster than parent/child, but nested docs require reindexing the parent with all its children, while parent/child allows reindexing, adding or deleting specific children.
For example, a product can have only a few tags but many comments, so keeping tags as nested probably wouldn't be a problem, but (hundreds of) comments on a blog post would be.
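For the blog post and comments case, a parent/child setup with the join field type could be sketched as follows, again as Python dicts that serialize to request bodies. The field name relation, the relation names post and comment, and the text field are assumptions chosen for illustration.

```python
import json

# Hypothetical mapping: one join field declaring post as parent of comment.
join_mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "relation": {"type": "join", "relations": {"post": "comment"}},
        }
    }
}


def comment_doc(text, parent_id):
    """Body of a child document. It has to be indexed with routing set
    to the parent id so that it lands on the same shard as its parent."""
    return {"text": text, "relation": {"name": "comment", "parent": parent_id}}


# Find posts that have at least one comment matching the inner query.
posts_with_matching_comment = {
    "query": {
        "has_child": {
            "type": "comment",
            "query": {"match": {"text": "elasticsearch"}},
        }
    }
}

print(json.dumps(join_mapping, indent=2))
print(json.dumps(comment_doc("Nice article on elasticsearch", "1"), indent=2))
print(json.dumps(posts_with_matching_comment, indent=2))
```

Adding or changing one comment here writes only that child document, which is the trade-off described above.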

Index type in elasticsearch

I am trying to understand and effectively use the index type available in elasticsearch.
However, I am still not clear how _type meta field is different from any regular field of an index in terms of storage/implementation. I do understand avoiding_type_gotchas
For example, if I have 1 million records (say posts) and each post has a creation_date. How will things play out if one of my index types is creation_date itself (leading to ~ 1 million types)? I don't think it affects the way Lucene stores documents, does it?
In what way would my Elasticsearch query performance be affected if I used creation_date as the index type instead of a namesake type, say 'post'?
I got the answer on elastic forum.
https://discuss.elastic.co/t/index-type-effective-utilization/58706
Pasting the response as is -
"While elasticsearch is scalable in many dimensions there is one where it is limited. This is the metadata about your indices which includes the various indices, doc types and fields they contain.
These "mappings" exist in memory and are updated and shared around all nodes with every change. For this reason it does not make sense to endlessly grow the list of indices, types (and therefore fields) that exist in this cluster state. A type-per-document-creation-date registers a million on the one-to-ten scale of bad design decisions" - Mark_Harwood

Scaling with regard to Nested vs Parent/Child Documents

I'm running a proof of concept for us to run nested queries on more "normalised" data in ES.
e.g. with nested
Customer ->
  - name
  - email
  - events ->
    - created
    - type
Now I have a situation where the list of events for a given customer can be moved to another customer, e.g.
Customer A has 50 events
Customer B has 5000 events
I now want to move all events from Customer A into Customer B.
At scale, with millions of customers and queries run against this data for graphs in a UI, is Parent/Child more suitable, or should nested be able to handle it?
What are the pros and cons in my situation?
It's hard to give you even rough performance metrics like "Nested is good enough", but I can give you some details about Nested vs Parent/Child that can help. I'd still recommend working up a few benchmark tests to verify performance is acceptable.
Nested
Nested docs are stored in the same Lucene block as each other, which helps read/query performance. Reading a nested doc is faster than the equivalent parent/child.
Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs
Changing the "parent" means ES will: delete old doc, reindex old doc with less nested data, delete new doc, reindex new doc with new nested data.
Parent/Child
Children are stored separately from the parent, but are routed to the same shard. So parent/children are slightly less performant on read/query than nested
Parent/child mappings have a bit extra memory overhead, since ES maintains a "join" list in memory
Updating a child doc does not affect the parent or any other children, which can potentially save a lot of indexing on large docs
Changing the parent means you will delete the old child document and then index an identical doc under the new parent.
It is possible Nested will work fine, but if you think there is the possibility for a lot of "data shuffling", then Parent/Child may be more suitable. Nested is best suited for instances where the nested data is not updated frequently but read often. Parent/Child is better for arrangements where the data moves around more frequently.
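To get a feel for the "data shuffling" cost in the question's example (moving 50 events from Customer A to Customer B, which already has 5000), here is a rough count of Lucene documents that would be indexed in each model. It is a simplified sketch with hypothetical function names: it counts only documents written, ignores deletes and merges, and is no substitute for the benchmarks recommended above.

```python
def nested_move_cost(src_events, dst_events):
    """Docs indexed when moving all events between two nested customers.

    The source customer is reindexed with no events (1 root doc), and the
    destination is reindexed with all events (1 root + every nested event).
    """
    source_rewrite = 1
    destination_rewrite = 1 + src_events + dst_events
    return source_rewrite + destination_rewrite


def parent_child_move_cost(src_events):
    """Docs indexed when each moved child is deleted and indexed again
    under the new parent; parents and other children are untouched."""
    return src_events


print("nested:      ", nested_move_cost(50, 5000))
print("parent/child:", parent_child_move_cost(50))
```

With nested documents the cost grows with the size of the destination customer, while with parent/child it depends only on the number of events moved.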

Resources