I'm indexing social network data in Elasticsearch.
It works great for content and profiles, but I'm running into trouble with connections.
Option 1) Index connections nested inside the profile document?
Option 2) Index each connection as an independent document in a separate index?
What are the advantages of each option?
What I will need:
Update the connections list, comparing against the stored list to see who is new
Update the connections list, comparing against the stored list to see who stopped following
There are three options when modeling relations. The most basic one is inner objects: everything lives in one document. The problem is that you cannot reliably query inner objects on multiple properties, because you get a match when one property matches in inner object A and another property matches in inner object B. This can be overcome using nested objects. The disadvantage of nested objects shows up when you have a lot of them and change them often: everything is still stored in one document, so adding a nested object means reindexing the complete document. This problem can be overcome using parent-child relationships. Parents and children are separate documents, so it is cheaper to add or remove children. The disadvantage is that you cannot obtain a parent and its children in one query.
Hope it helps.
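The cross-object match described for inner objects can be reproduced without a cluster. The sketch below imitates, in plain Python, how inner objects are pooled into per-field value lists while nested objects are matched one at a time. The field names (`user`, `status`) and all helper functions are hypothetical, and the pooling is a simplified model of the behaviour, not Elasticsearch's actual implementation.

```python
from collections import defaultdict


def flatten_inner_objects(objects):
    """Model inner objects: the values of each property are pooled per field."""
    flat = defaultdict(set)
    for obj in objects:
        for key, value in obj.items():
            flat[key].add(value)
    return flat


def match_inner(objects, **criteria):
    """Match against pooled values, so criteria may be met by different objects."""
    flat = flatten_inner_objects(objects)
    return all(value in flat[key] for key, value in criteria.items())


def match_nested(objects, **criteria):
    """Match only if one single object satisfies every criterion."""
    return any(
        all(obj.get(key) == value for key, value in criteria.items())
        for obj in objects
    )


connections = [
    {"user": "alice", "status": "following"},
    {"user": "bob", "status": "blocked"},
]

# No single connection is alice/blocked, yet the pooled (inner object) model matches.
inner_hit = match_inner(connections, user="alice", status="blocked")
nested_hit = match_nested(connections, user="alice", status="blocked")
print(inner_hit, nested_hit)
```

The inner-object model reports a hit because "alice" and "blocked" both occur somewhere in the list, while the nested model requires them to occur in the same connection.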
Problem:
I have two types of repositories: one holds documents and the other holds pages. There is a relationship between documents and pages. You can think of it as one document (a book) with one or more pages. In practice I may need to query the pages of a document that matches certain criteria, and vice versa. In other words, I may sometimes query some, if not all, of the pages whose document matches.
Currently, I have created a parent-child relation: the documents are indexed as parents, and the pages are indexed as children with a reference to their document.
But we have performance issues in our setup: search and index requests are becoming very slow as the number of documents increases. I also found out that using a parent-child relationship is not recommended, as it is expensive on the Elasticsearch side.
Is there any other data model that I could use for this problem?
Yes. Index all the information you have about the document into each page object.
To put it another way: do the join at index time, not at search time.
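A minimal sketch of what "join at index time" could look like, assuming the application has the document and its pages in memory before indexing. The field names, the `document_` prefix, and the `denormalize_pages` helper are all hypothetical; the search body uses the standard `bool`, `term`, and `match` queries.

```python
def denormalize_pages(document, pages):
    """Copy the parent document's fields into every page (join at index time)."""
    flat_pages = []
    for page in pages:
        # Prefix document fields so they cannot collide with page fields.
        flat = {f"document_{key}": value for key, value in document.items()}
        flat.update(page)
        flat_pages.append(flat)
    return flat_pages


book = {"id": "doc-1", "title": "Moby Dick", "language": "en"}
pages = [
    {"page_number": 1, "text": "Call me Ishmael."},
    {"page_number": 2, "text": "Some years ago..."},
]

flat_pages = denormalize_pages(book, pages)

# One search body can now filter on document fields and page text together,
# with no has_parent/has_child join at search time.
search_body = {
    "query": {
        "bool": {
            "filter": [{"term": {"document_language": "en"}}],
            "must": [{"match": {"text": "Ishmael"}}],
        }
    }
}
```

The trade-off is that a change to the document's own fields has to be written to every one of its pages.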
Currently, my ES document structure has a field of type 'Object'. This is a JSON object which can have up to 3000 fields inside. The problem is that, at times, ES runs out of memory because the document is too large. So I am looking to change my document structure.
The two structures I am looking at are nested mappings and parent-child relationships. Both satisfy my search requirements. Points being considered:
I read that nested queries are much faster than child queries.
Nested mappings also store the nested fields as separate documents.
Two points of confusion:
How does nested indexing work? Does ES receive the whole document in one go and analyze it all at once, or are the nested documents sent as individual requests? In the first case, ES might run out of memory again.
When we say parent-child queries are slower, how much slower do we mean?
Looking for inputs.
Nested documents are faster than parent/child and simpler to manage. With parent/child you can, in fact, index a child without its parent, so you have to be careful when you index. Also, when you delete a parent you have to delete all of its children yourself; this is not done automatically.
On the other hand, parent/child is more convenient if you need to change or update entries. With the nested type you cannot change a single nested value; you have to reindex all the values in the nested field. With parent/child you can update a single parent or child document on its own.
Nested objects are treated as one atomic unit of related data in the index, whereas parent/child is just a separate datatype that keeps the relation between two kinds of document, parent and child.
You can read kimchy's post, and for the slowness of parent/child see the last comment of this discussion: https://discuss.elastic.co/t/choosing-parent-child-vs-nested-document/6742
Nested ::
Nested docs are stored in the same Lucene block as each other, which helps read/query performance. Reading a nested doc is faster than the equivalent parent/child.
Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs.
"Cross referencing" nested documents is impossible.
Best suited for data that does not change frequently.
Parent/Child ::
Children are stored separately from the parent, but are routed to the same shard. So parent/child is slightly less performant on read/query than nested.
Parent/child mappings have a bit of extra memory overhead, since ES maintains a "join" list in memory.
Updating a child doc does not affect the parent or any other children, which can potentially save a lot of indexing on large docs.
Sorting/scoring can be difficult with parent/child, since the Has Child/Has Parent operations can be opaque at times.
The main difference is that nested docs are faster than parent/child, but nested docs require reindexing the parent with all its children, while parent/child lets you reindex, add, or delete specific children.
For example, a product usually has only a few tags but can have many comments, so keeping tags as nested probably wouldn't be a problem, but keeping (hundreds of) comments on a blog post as nested would be.
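To make the tags-versus-comments example concrete, here is a sketch of the two mappings as Python dicts. It assumes Elasticsearch 7 or later, where the `join` field type took over from the older `_parent` mapping that the answers above refer to. The field names (`tags`, `label`, `post_comment`) and the `comment_document` helper are hypothetical.

```python
# Few, rarely changing tags: keep them nested inside the product document.
nested_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "tags": {
                "type": "nested",
                "properties": {"label": {"type": "keyword"}},
            },
        }
    }
}

# Many, frequently added comments: model them as children of the post.
parent_child_mapping = {
    "mappings": {
        "properties": {
            "body": {"type": "text"},
            "post_comment": {
                "type": "join",
                "relations": {"post": "comment"},
            },
        }
    }
}

# A nested query must name the path of the nested field it searches.
nested_query = {
    "query": {
        "nested": {
            "path": "tags",
            "query": {"term": {"tags.label": "sale"}},
        }
    }
}


def comment_document(post_id, body):
    """Build a child document plus the routing value it must be indexed with."""
    doc = {"body": body, "post_comment": {"name": "comment", "parent": post_id}}
    # Children have to be routed to the parent's shard, hence routing = parent ID.
    return doc, post_id


doc, routing = comment_document("post-1", "Nice article!")
```

Adding a comment in the parent/child layout writes one small document; adding a tag in the nested layout rewrites the whole product document.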
I am parsing a threaded forum (a tree with parent_id joins) and am trying to store the individual posts in Elasticsearch while keeping the hierarchy. However, I am not sure what the best way would be.
parent/child model: The difficulty here is that the root elements don't have parents, and I am not sure whether I can point _parent to its own type.
Also, a bonus question on this one: when inserting, do I need to pass the parent as a query param, or can I add it in the data object as well?
nested model: I cannot tell in advance how deep the tree might get, and I don't really want to put useless objects in the mapping.
I feel that this is not such an uncommon task, so any advice would be great!
I wouldn't recommend taking your approach for this purpose.
With either parent/child or nested you would have to pre-define the maximum depth of your tree and express that with some nasty mapping (while enumerating each level's field in your search queries).
With parent/child you'd actually be creating additional indices for each level, which adds unnecessary resource overhead.
Is Elasticsearch your primary datasource? If not, consider simply indexing forum posts as a flat collection of documents with enough information present to be able to reconstruct the thread from your primary. E.g.:
POST
Thread ID
Author ID (perhaps not needed for search?)
Post ID
Parent ID (perhaps not needed for search?)
Post Date
Post Title
Post Body
Then Elasticsearch is reduced to the role of text search / highlighting engine, and will happily give you back snippets and Post IDs/Thread IDs needed to reconstruct the thread from the database.
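Rebuilding the thread from flat posts can then happen in the application. The sketch below assumes each post carries the fields listed above under the hypothetical names `post_id`, `parent_id`, `post_date`, and `title`, with `parent_id` set to `None` for root posts; `build_thread` is an illustrative helper, not part of any library.

```python
from collections import defaultdict


def build_thread(posts):
    """Group flat posts by parent_id and return the roots with nested replies."""
    children = defaultdict(list)
    # Sort by date first so replies appear in chronological order.
    for post in sorted(posts, key=lambda p: p["post_date"]):
        children[post["parent_id"]].append(post)

    def attach(post):
        return {
            "post_id": post["post_id"],
            "title": post["title"],
            "replies": [attach(child) for child in children[post["post_id"]]],
        }

    # Root posts are the ones without a parent.
    return [attach(root) for root in children[None]]


posts = [
    {"thread_id": 7, "post_id": 3, "parent_id": 1, "post_date": "2020-01-03", "title": "Re: Hello"},
    {"thread_id": 7, "post_id": 1, "parent_id": None, "post_date": "2020-01-01", "title": "Hello"},
    {"thread_id": 7, "post_id": 2, "parent_id": 1, "post_date": "2020-01-02", "title": "Re: Hello (first)"},
    {"thread_id": 7, "post_id": 4, "parent_id": 2, "post_date": "2020-01-04", "title": "Re: Re: Hello"},
]

thread = build_thread(posts)
```

Because the tree is assembled outside Elasticsearch, its depth never has to appear in the mapping.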
If Elasticsearch is your primary store, then hopefully you've read this thread already. There is a commercial Elasticsearch plugin created by Siren Solutions which enables Elasticsearch to manage truly schemaless, nested data like yours.
I want to store an employee hierarchy in Elasticsearch, where the CFO, CTO, COO, etc. report to the CEO, and each employee can have their own reportees.
I think this can be done using an Elasticsearch parent-child relationship. Can we write a query to get all reportees (direct reportees and sub-reportees) in a single call?
For example, if we query for the CEO we should get all employees, and for the CFO we should get the employees in the finance dept.
Something similar exists in RDBMSs, such as SQL Server's CTEs.
Parent-child relations in ES work as follows:
The parent knows nothing about its children.
Children must provide _parent to connect to it and to be routed accordingly.
The parent-child mapping is handled by ES via a mapping held in memory.
Parent and child documents are independent in every other respect.
So there is no easy way to do it (there is also no way to store a normalized form of relational data, because ES is a non-relational DB). Workarounds:
Query documents with has_parent/has_child queries (this only works for one level of relation).
Store documents as nested objects (note that this model reindexes the whole document if any member changes).
Denormalize the data (the most natural way for non-relational stores, IMO).
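For the first workaround, the request bodies might look like the sketch below. The `has_child` and `has_parent` queries and their `type` / `parent_type` parameters are standard; the relation names (`manager`, `report`), the field names, and the two builder functions are hypothetical, and each query only reaches one level up or down.

```python
def managers_with_report_in(department):
    """Parents (managers) that have at least one direct child in a department."""
    return {
        "query": {
            "has_child": {
                "type": "report",
                "query": {"term": {"department": department}},
            }
        }
    }


def reports_of_manager_titled(title):
    """Children (direct reports) whose parent has the given title."""
    return {
        "query": {
            "has_parent": {
                "parent_type": "manager",
                "query": {"term": {"title": title}},
            }
        }
    }


finance_managers = managers_with_report_in("finance")
cfo_reports = reports_of_manager_titled("CFO")
```

Neither body can express "all reportees at any depth", which is why the other two workarounds exist.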
First and foremost, avoid thinking about ES as a relational database. ES isn't well suited to joins/relations, though it can achieve a similar effect via parent/child relations. Don't even think about joins that might involve an undetermined number of levels. A CTE can handle that without much difficulty, but not all relational databases support CTEs AFAIK (MySQL before 8.0 being one).
The parent-child relation is more trouble than it's worth, IMO. Child docs are routed to the shards where their parents reside. In your case of a tree, all documents eventually trace back to the root document, which means all your documents will reside in a single shard. The depth of your tree could be quite large (more than 4 or 5 levels in a not-so-small organization). Also, if you go with this solution, it is quite inconvenient to retrieve (via the GET API) a particular child doc from ES based on its ID, because you have to specify its parent IDs all the way up to its root.
I think it's best to store the PATH from root up to but not including the current employee as a list of IDs. So each employee has a field like:
"superiors": [CEO_ID, CTO_ID, ... , HER_DIRECT_MANAGER_ID],
So it is completely denormalized, and your application has to prepare this list.
With this setup, to get all subordinates of an employee:
Filter out the IDs in this employee's own superiors field plus her own ID, using either a filter agg or a filtered query.
Do a terms agg on the superiors field, and you will have all subordinates of this employee.
I must admit that at least two queries are needed. The first one is a GET request to retrieve the superiors field of this employee and then the second query to do what I described above.
Also, don't worry about the duplications due to denormalization. ES can handle way more data than you can save here.
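A minimal sketch of the superiors-path idea, assuming the application knows each employee's direct manager. Note that it swaps in a simpler lookup than the two-step aggregation described above: a single `term` query on `superiors`, which matches any document whose list contains the given ID (assuming the field is mapped as `keyword`). The IDs and helper functions are hypothetical, and `subordinates` is only a local stand-in for running that query.

```python
def superiors_path(employee_id, manager_of):
    """Return the chain of manager IDs from the root down to the direct manager."""
    path = []
    current = manager_of.get(employee_id)
    while current is not None:
        path.append(current)
        current = manager_of.get(current)
    path.reverse()
    return path


# Hypothetical org chart: employee ID -> direct manager ID (None for the root).
manager_of = {"ceo": None, "cfo": "ceo", "cto": "ceo", "accountant": "cfo", "dev": "cto"}

employees = [
    {"id": emp, "superiors": superiors_path(emp, manager_of)} for emp in manager_of
]


def subordinates_query(employee_id):
    """Search body matching every employee that has employee_id among its superiors."""
    return {"query": {"term": {"superiors": employee_id}}}


def subordinates(employee_id, docs):
    """Local stand-in for the term query, applied to in-memory documents."""
    return sorted(doc["id"] for doc in docs if employee_id in doc["superiors"])


all_staff = subordinates("ceo", employees)
finance_staff = subordinates("cfo", employees)
```

Querying for the CEO returns everyone else, and querying for the CFO returns only the finance branch, which is the behaviour the question asks for.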
I'm running a proof of concept for us to run nested queries on more "normalised" data in ES.
e.g. with nested
Customer ->
  - name
  - email
  - events ->
    - created
    - type
Now I have a situation where the list of events for a given customer can be moved to another customer. E.g.:
Customer A has 50 events
Customer B has 5000 events
I now want to move all events from Customer A into Customer B.
At scale, with millions of customers and queries being run on this data for graphs in a UI, is parent/child more suitable, or should nested be able to handle it?
What are the pros and cons in my situation?
It's hard to give you even rough performance metrics like "Nested is good enough", but I can give you some details about Nested vs Parent/Child that can help. I'd still recommend working up a few benchmark tests to verify performance is acceptable.
Nested
Nested docs are stored in the same Lucene block as each other, which helps read/query performance. Reading a nested doc is faster than the equivalent parent/child.
Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs
Changing the "parent" means ES will: delete old doc, reindex old doc with less nested data, delete new doc, reindex new doc with new nested data.
Parent/Child
Children are stored separately from the parent, but are routed to the same shard. So parent/child is slightly less performant on read/query than nested
Parent/child mappings have a bit extra memory overhead, since ES maintains a "join" list in memory
Updating a child doc does not affect the parent or any other children, which can potentially save a lot of indexing on large docs
Changing the parent means you will delete the old child document and then index an identical doc under the new parent.
It is possible Nested will work fine, but if you think there is the possibility for a lot of "data shuffling", then Parent/Child may be more suitable. Nested is best suited for instances where the nested data is not updated frequently but read often. Parent/Child is better for arrangements where the data moves around more frequently.
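To get a feel for the "data shuffling" cost in the Customer A / Customer B example, here is a deliberately rough back-of-the-envelope model. It only counts how many Lucene documents get rewritten when events move, assuming one root document per customer plus one hidden document per nested event. It ignores deletes, segment merges, and query cost, so it is no substitute for a benchmark. Both functions are hypothetical helpers.

```python
def nested_move_cost(source_events, target_events):
    """Lucene docs rewritten when all events move between two nested customers.

    Both customers are reindexed in full: the source with no events left,
    the target with its own events plus the moved ones.
    """
    source_after = 1  # root doc only
    target_after = 1 + source_events + target_events
    return source_after + target_after


def parent_child_move_cost(source_events, target_events):
    """Child docs rewritten: only the moved events are reindexed under the new parent.

    target_events is accepted for symmetry but does not affect the result.
    """
    return source_events


nested_cost = nested_move_cost(50, 5000)
parent_child_cost = parent_child_move_cost(50, 5000)
print(nested_cost, parent_child_cost)
```

Under these assumptions, moving 50 events into a customer with 5000 rewrites 5052 Lucene documents with nested but only 50 with parent/child, which is the gap the answer is pointing at.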