Scaling with regard to Nested vs Parent/Child Documents - elasticsearch

I'm running a proof of concept for us to run nested queries on more "normalised" data in ES.
e.g. with nested:
Customer ->
  - name
  - email
  - events ->
    - created
    - type
Now I have a situation where a list of events for a given customer can be moved to another customer, e.g.
Customer A has 50 events
Customer B has 5000 events
I now want to move all events from Customer A into Customer B.
At scale, with millions of customers and queries running on this data to drive graphs in a UI, is Parent/Child more suitable, or should nested be able to handle it?
What are the pros and cons in my situation?

It's hard to give you even rough performance metrics like "Nested is good enough", but I can give you some details about Nested vs Parent/Child that can help. I'd still recommend working up a few benchmark tests to verify performance is acceptable.
Nested
Nested docs are stored in the same Lucene block as each other, which helps read/query performance. Reading a nested doc is faster than the equivalent parent/child.
Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs.
Changing the "parent" means ES will: delete old doc, reindex old doc with less nested data, delete new doc, reindex new doc with new nested data.
Parent/Child
Children are stored separately from the parent, but are routed to the same shard. So parent/child is slightly less performant on read/query than nested.
Parent/child mappings have a bit of extra memory overhead, since ES maintains a "join" list in memory.
Updating a child doc does not affect the parent or any other children, which can potentially save a lot of indexing on large docs.
Changing the parent means you will delete the old child document and then index an identical doc under the new parent.
It is possible Nested will work fine, but if you think there is the possibility for a lot of "data shuffling", then Parent/Child may be more suitable. Nested is best suited for instances where the nested data is not updated frequently but read often. Parent/Child is better for arrangements where the data moves around more frequently.
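To put rough numbers on the move described in the question, here is a minimal back-of-the-envelope sketch in Python. It only counts Lucene documents written, assuming each customer is one root document plus one hidden document per nested event. The function names are made up for illustration, and this is no substitute for a real benchmark.

```python
def nested_move_cost(source_events, target_events, moved):
    """Lucene docs rewritten when `moved` events go from one nested customer doc to another.

    Assumption: a nested customer is stored as 1 root doc + 1 doc per event,
    and any change rewrites the whole block.
    """
    source_after = 1 + (source_events - moved)   # old customer reindexed with fewer events
    target_after = 1 + (target_events + moved)   # new customer reindexed with all events
    return source_after + target_after


def parent_child_move_cost(moved):
    """Docs indexed when `moved` child events are re-created under a new parent.

    Each child is deleted and indexed again; neither parent is touched.
    """
    return moved


if __name__ == "__main__":
    # Customer A has 50 events, Customer B has 5000, all of A's events move to B.
    print(nested_move_cost(50, 5000, 50))   # 5052
    print(parent_child_move_cost(50))       # 50
```

The gap grows with the size of the receiving customer, which is why the amount of "data shuffling" matters more than the raw read speed here.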

Related

Why is querying slower for parent/child relationships versus when using nested documents?

Queries on parent/child structures that were created using the join_field are considered slower than queries on nested documents. The general explanation that I found was that in the first case the queries are slower because the child is stored separately from the parent. Still, I would like to have a fuller picture of why this is happening. I have read that in the case of nested documents Elasticsearch actually indexes the root object and nested objects separately, then relates them internally. I have also read that in this case the documents are stored in the same Lucene block and this makes queries faster. But still, what exactly does the inverted index look like for nested documents versus parent/child?

Elasticsearch: nested vs flat indices

What are the pros and cons of a nested index in Elasticsearch ?
I am thinking about temporal data for some users or devices. By flat I mean all data is stored at the root of the index, and by nested I mean data is grouped by device/user ID, so there is one document per user/device ID that contains one nested document per time entry.
I see as pros:
Nested indices offer more querying possibilities
And as cons:
Writes are more costly
Index management could be much more difficult (how to expire the data? No easy index rolling; how to easily spread the data across different indices?)
I totally agree with your pros and cons of the ES nested type. I just want to elaborate on the depth of the indexing cost. Keep in mind that a nested field also opens up more query options.
If you use the nested type and modify the data infrequently, it works very well and broadens what you can query, but if you change the data frequently, the cost will be huge.
A nested mapping has more impact on indexing than a flat mapping. Lucene has no concept of nested object types, and everything is stored as flat documents, so an additional operation is performed at indexing time.
Imagine you have a large nested document that translates to 100k internal documents, and compare this to a flat data model where the 100k parts are indexed as independent documents. If we add a single nested document at the deepest nested level, the flat model adds a single document, while the nested model ends up reindexing 100k + 1 documents. If, on the other hand, you change something at the root, all documents need to be updated in both cases. So a single change can cost you a reindex of every nested object in that document.
Let's talk about the pros and cons of flat fields, dynamic, object, flattened, and nested.
flat fields
pros: Simple and clear; the field type is specified in the mappings.
cons: A new field needs a put-mapping call first. The number of fields is limited by the engine. Different fields have no relationship to each other.
dynamic
Applies to new fields or inner objects that have no mapping defined.
set to false: new fields are not indexed, but they can still be retrieved from _source.
set to strict: indexing fails with an error when a doc contains unknown fields.
cons: strict is not flexible, while dynamic can lead to a mapping explosion; Elasticsearch limits the number of fields to 1000 by default, see also index.mapping.total_fields.limit: 1000
flattened
A better type than dynamic, available from 7.3 onwards.
pros: unlimited number of fields; everything is flattened, as its name says.
cons: only supports a limited set of queries, such as term or exists; no highlighting.
object
Object is the other default, used when you write {} in mappings.
pros: natively supported, including for inner objects.
cons: does not preserve arrays of objects. Elasticsearch actually has no concept of inner objects, so it flattens object hierarchies into a simple list of field names and values, see https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#nested-arrays-flattening-objects
nested
pros: Supports complex structures such as arrays of objects and keeps each object separate, so the relation between fields inside a single object is preserved. Suppose we have a candidates index and many candidates have multiple education backgrounds. With nested fields, we can retrieve candidates who graduated from CMU with a major in CS.
cons: Each nested object is indexed as a separate Lucene document, so 1 doc with 100 nested objects creates 101 Lucene documents.
Both the number of nested fields and the number of nested objects per document are limited, defaulting to 50 and 10000 respectively.
see also
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#_limits_on_nested_mappings_and_objects
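The difference between object and nested for arrays of objects can be illustrated with a small Python simulation. This is only a sketch of the matching behaviour, not how Elasticsearch is implemented. The field names (education, school, major) and the sample candidate are assumptions made up for the CMU/CS example above, and NESTED_QUERY shows what a matching nested query body could look like under that assumed mapping.

```python
# A candidate who studied Physics at CMU and CS at MIT (made-up sample data).
alice = {
    "name": "Alice",
    "education": [
        {"school": "CMU", "major": "Physics"},
        {"school": "MIT", "major": "CS"},
    ],
}


def flatten_object_array(doc, field):
    """Mimic `object` indexing: each sub-field becomes one bag of values."""
    flat = {}
    for item in doc[field]:
        for key, value in item.items():
            flat.setdefault(f"{field}.{key}", set()).add(value)
    return flat


def matches_as_object(doc, field, **criteria):
    """Every criterion only has to match somewhere in the flattened bags."""
    flat = flatten_object_array(doc, field)
    return all(
        value in flat.get(f"{field}.{key}", set())
        for key, value in criteria.items()
    )


def matches_as_nested(doc, field, **criteria):
    """All criteria have to match inside the same inner object."""
    return any(
        all(item.get(key) == value for key, value in criteria.items())
        for item in doc[field]
    )


# Query body one could send for the nested case (assumes `education` is
# mapped as type "nested"; field names are hypothetical).
NESTED_QUERY = {
    "query": {
        "nested": {
            "path": "education",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"education.school": "CMU"}},
                        {"match": {"education.major": "CS"}},
                    ]
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(matches_as_object(alice, "education", school="CMU", major="CS"))  # True
    print(matches_as_nested(alice, "education", school="CMU", major="CS"))  # False
```

With object, Alice is wrongly returned as a CMU CS graduate because the link between school and major is lost; with nested, she is not.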

What are the pros and cons of using Nested mapping in Elasticsearch vs parent child relationship

Currently in my ES document structure, there is a field of type 'Object'. This is a JSON object which can have up to 3000 fields inside. The problem is that, at times, my ES runs out of memory because the document is too large. So I am looking to change my document structure.
The two structures that I am looking at are nested mappings and parent/child relationships. Both structures satisfy my search requirements. Points being considered:
I read that nested queries are much faster than child queries.
Nested mappings too save the nested fields as separate documents.
Two points of confusion that I am facing :
How does nested indexing work? Does ES receive the whole document in one go and analyze it completely at once, or are the nested documents sent as individual requests? In the first case, ES might run out of memory again.
When we say parent/child queries are slower, how much slower do we mean?
Looking for inputs.
Nested documents are faster than parent/child and simpler to manage. With parent/child you can in fact index a child without its parent, so you have to be careful when you index. Also, when you delete a parent you have to delete all of its children yourself; this is not done automatically.
On the other hand, parent/child is more convenient if you want to change or update an entry. With the nested type you can't change a single nested value in the nested field; you have to reindex all of the nested values in that field. With parent/child you can update a single parent or child on its own.
Nested documents are treated as one atomic unit of related data in the index, whereas parent/child is just a different datatype that keeps the relation between two documents, the parent and the child.
You can read the kimchy post here, and for the slowness of parent/child you can read the last comment of the discussion https://discuss.elastic.co/t/choosing-parent-child-vs-nested-document/6742
Nested ::
Nested docs are stored in the same Lucene block as each other, which helps read/query performance. Reading a nested doc is faster than the equivalent parent/child.
Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs.
"Cross referencing" nested documents is impossible.
Best suited for data that does not change frequently.
Parent/Child ::
Children are stored separately from the parent, but are routed to the same shard. So parent/child is slightly less performant on read/query than nested.
Parent/child mappings have a bit of extra memory overhead, since ES maintains a "join" list in memory.
Updating a child doc does not affect the parent or any other children, which can potentially save a lot of indexing on large docs.
Sorting/scoring can be difficult with Parent/Child since the Has Child/Has Parent operations can be opaque at times.
The main difference is that nested docs are faster than parent/child, but they require reindexing the parent with all its children, while parent/child allows you to reindex, add, or delete specific children.
For example, a product usually has only a few tags but can have many comments, so keeping tags as nested probably wouldn't be a problem, but keeping (hundreds of) comments on a blog post as nested is a problem.
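For the blog post with many comments, a parent/child setup via the join field could look roughly like the Python sketch below. It only builds request bodies as plain dicts and sends nothing. The index layout and names (relation, post, comment, text) are hypothetical, and the shapes follow the join field type and has_child query as I understand them, so check them against the docs for your ES version.

```python
# Index creation body: one join field relating posts (parents) to comments (children).
JOIN_MAPPING = {
    "mappings": {
        "properties": {
            "relation": {"type": "join", "relations": {"post": "comment"}},
            "text": {"type": "text"},
        }
    }
}


def parent_doc(text):
    """Source of a parent (post) document."""
    return {"text": text, "relation": {"name": "post"}}


def child_doc(text, parent_id):
    """Source of a child (comment) document pointing at its parent."""
    return {"text": text, "relation": {"name": "comment", "parent": parent_id}}


def child_index_request(index, doc_id, parent_id, text):
    """Describe the request that indexes a single comment.

    The routing value is the parent id, so the child lands on the same
    shard as its parent. Only this one document is written; the post and
    its other comments are not reindexed.
    """
    return {
        "method": "PUT",
        "path": f"/{index}/_doc/{doc_id}",
        "params": {"routing": parent_id},
        "body": child_doc(text, parent_id),
    }


def posts_with_comment_matching(term):
    """Search body: find posts that have at least one comment matching `term`."""
    return {
        "query": {
            "has_child": {
                "type": "comment",
                "query": {"match": {"text": term}},
            }
        }
    }


if __name__ == "__main__":
    print(child_index_request("blog", "c1", "p1", "Nice article"))
```

Adding the hundredth comment is a single small write here, whereas the nested layout would rewrite the post together with all existing comments.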

Store a threaded view in ElasticSearch

I am parsing a threaded forum (tree with parent_id joins) and am trying to store the single postings in ElasticSearch while keeping the hierarchy. However I am not quite sure what the best way would be.
parent/child model: The difficulty here is that the root elements don't have parents, and I am not sure whether I can point _parent to its own type.
Also a bonus question on this one: when inserting, do I need to pass the parent as a query param, or can I add it in the data object as well?
nested model: I cannot tell in advance how deep the tree might get, and I don't really want to put useless objects in the mapping.
I feel that this would be not such an uncommon task to do, so any advice would be great!
I wouldn't recommend taking your approach for this purpose.
Using both parent/child and nested you would have to pre-define the maximum depth of your tree, and articulate that with some nasty mapping. (While enumerating each level's field in your search queries.)
With parent/child you'd actually be creating additional indices for each level, which adds unnecessary resource overhead.
Is Elasticsearch your primary datasource? If not, consider simply indexing forum posts as a flat collection of documents with enough information present to be able to reconstruct the thread from your primary. E.g.:
POST
Thread ID
Author ID (perhaps not needed for search?)
Post ID
Parent ID (perhaps not needed for search?)
Post Date
Post Title
Post Body
Then Elasticsearch is reduced to the role of text search / highlighting engine, and will happily give you back snippets and Post IDs/Thread IDs needed to reconstruct the thread from the database.
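Rebuilding the tree from such flat records is straightforward on the application side. Below is a minimal Python sketch, assuming each post fetched from the primary store is a dict with post_id, parent_id (None for the root post) and post_date keys; those key names and the build_thread helper are illustrative, not part of any Elasticsearch API.

```python
from collections import defaultdict


def build_thread(posts):
    """Turn a flat list of posts into a nested reply tree.

    Replies are ordered by post_date. Threads of any depth work without
    declaring the depth anywhere, although very deep threads could hit
    Python's recursion limit.
    """
    children = defaultdict(list)
    for post in sorted(posts, key=lambda p: p["post_date"]):
        children[post["parent_id"]].append(post)

    def attach(post):
        # Copy the post and hang its direct replies underneath it.
        replies = [attach(child) for child in children.get(post["post_id"], [])]
        return {**post, "replies": replies}

    return [attach(root) for root in children.get(None, [])]


if __name__ == "__main__":
    flat_posts = [
        {"post_id": 1, "parent_id": None, "post_date": "2020-01-01"},
        {"post_id": 3, "parent_id": 2, "post_date": "2020-01-03"},
        {"post_id": 2, "parent_id": 1, "post_date": "2020-01-02"},
        {"post_id": 4, "parent_id": 1, "post_date": "2020-01-04"},
    ]
    tree = build_thread(flat_posts)
    print([reply["post_id"] for reply in tree[0]["replies"]])  # [2, 4]
```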
If Elasticsearch is your primary store, then hopefully you've read this thread already. There is a commercial Elasticsearch plugin created by Siren Solutions which enables Elasticsearch to manage truly schemaless, nested data like yours.

elasticsearch - how to store social network connections

I'm indexing social network data on elasticsearch.
It is amazing for content and profiles, but I'm having some trouble with connections...
option 1) index connections nested in the profile document?
option 2) each connection is an independent document in a separate index?
What are the advantages of each option?
What I will need:
Update the connections list, comparing to check who is new
Update the connections list, comparing to check who stopped following
There are three options when modeling relations. The most basic one is inner objects: everything in one document. The problem is that you cannot reliably query the inner objects on multiple properties; you will get a match if one property matches on inner object A and another on inner object B. This can be overcome using nested objects. The disadvantage of nested objects shows up when you have a lot of them and change them often. Everything is stored in one document, so adding a nested object means updating the complete document. This problem can be overcome using parent/child relationships. These are separate documents, so it is cheaper to add or remove children. The disadvantage is that you cannot obtain the parents and their children in one query.
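The "who is new / who stopped following" comparison from the question can be done with plain set differences, whichever storage option is chosen. Here is a minimal Python sketch; the function names and the "profile:other" document ID scheme are assumptions for illustration. If each connection is its own document, the diff translates directly into small per-document index and delete actions instead of a rewrite of the whole profile.

```python
def diff_connections(previous, current):
    """Compare two snapshots of a profile's connection IDs."""
    previous, current = set(previous), set(current)
    return {
        "new": sorted(current - previous),       # started following
        "stopped": sorted(previous - current),   # stopped following
    }


def connection_actions(profile_id, previous, current):
    """Translate the diff into per-connection document actions.

    Assumes one document per connection with a hypothetical ID of the
    form "<profile>:<other>"; unchanged connections produce no writes.
    """
    changes = diff_connections(previous, current)
    actions = [("index", f"{profile_id}:{other}") for other in changes["new"]]
    actions += [("delete", f"{profile_id}:{other}") for other in changes["stopped"]]
    return actions


if __name__ == "__main__":
    before = ["ann", "bob", "cy"]
    after = ["bob", "cy", "dee"]
    print(diff_connections(before, after))
    print(connection_actions("me", before, after))
```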
Hope it helps.
