ElasticSearch - Is Parent Child relationship the best approach?

Problem:
I have two types of repositories: one for documents and another for pages. There is a relationship between documents and pages; you can think of it as one document (a book) with one or more pages. In practice I may need to query the pages of a document that matches certain criteria, and vice versa. In other words, I may sometimes query only some of the pages, not all of them, where the document matches.
Currently, I have created a parent-child relation: the documents are indexed as parents, and the pages are indexed as children with a reference to their document.
But we have performance issues in our setup: search and index queries become very slow as the number of documents increases. I also found out that using the parent-child relationship is not recommended, as it is time-consuming on the Elasticsearch side.
Is there any other data model that I could use for this problem?

Yes. Index all the information you have in the document into each page object.
To put it another way: do the join at index time, not at search time.
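As a rough illustration of joining at index time, the sketch below copies a document's fields into every page before the pages are indexed. It is plain Python with no Elasticsearch client; the function name build_page_docs, the document_ prefix, and the sample field names are assumptions made for this example, not something from the question.

```python
def build_page_docs(document, pages):
    """Copy the parent document's fields into every page (join at index time)."""
    # Prefix the document's fields so they cannot clash with page fields.
    doc_fields = {f"document_{key}": value for key, value in document.items()}
    return [{**doc_fields, **page} for page in pages]


# Hypothetical sample data: one book with two pages.
book = {"id": "book-1", "title": "Moby Dick", "author": "Herman Melville"}
pages = [
    {"page_number": 1, "text": "Call me Ishmael."},
    {"page_number": 2, "text": "It is a way I have of driving off the spleen."},
]

# Each resulting dict is what would be indexed as a single page document.
page_docs = build_page_docs(book, pages)
```

With pages shaped like this, a single query on the page index can filter on both page fields and document fields. The cost is that a change to the document means reindexing all of its pages.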

Related

Elastic Search Number of Document Views

I have a web app that is used to search and view documents in Elastic Search.
The goal now is to maintain two values.
1. How many times the document was fetched in total (life time views)
2. How many times the document was fetched in last 30 days.
Achieving the first is somewhat possible, but the second one seems to be a very hard problem.
The two values need to be part of the document as they will be used for sorting the results.
What is the best way to achieve this?
To maintain expiring data like that you will need to store each view with its timestamp. I suppose you could store them in an array in the ES document, but you're asking for trouble doing it like that, as the update operation that you'd need to call every time the document is viewed will have to delete and recreate the document (that's how ES does updates), and if two views happen at the same time it will be difficult to make sure they both get stored.
There are two ways to store the views, and make use of them in the query:
1. Put them in a separate store (could be a different index in ES if you like), and run a cron job or similar every day to update every item in the main index with the number of views from the last thirty days in the view store. Even with a lot of data it should be possible to make this quite efficient, depending on your choice of store for views.
2. Use the ElasticSearch parent/child datatype to store views in the same index as the main documents, as children. I'm not sure that I'd particularly recommend this approach, but I think it should be possible with aggregations to write a query that sorts primary documents by the number of children (filtered by date). It might be quite slow though.
I doubt there is any other way to do this with current versions of ES, because it doesn't support joining across indices. Either the data must be aggregated in advance onto the document, or it has to be available in the same index.
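The separate-store option could be sketched roughly as below: a daily job counts the views from the last thirty days per document, and those counts would then be written back onto the main documents. This is plain Python over an in-memory list; the function name recent_view_counts and the event layout (document ID, timestamp) are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta


def recent_view_counts(view_events, now, window_days=30):
    """Count views per document within the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    return Counter(
        doc_id for doc_id, viewed_at in view_events if viewed_at >= cutoff
    )


# Hypothetical view store: one (document ID, timestamp) pair per view.
now = datetime(2024, 3, 31)
view_events = [
    ("doc-1", datetime(2024, 3, 30)),
    ("doc-1", datetime(2024, 3, 5)),
    ("doc-1", datetime(2024, 1, 10)),  # older than 30 days, ignored
    ("doc-2", datetime(2024, 2, 1)),   # older than 30 days, ignored
]

# The daily job would write these counts onto the main documents.
counts = recent_view_counts(view_events, now)
```

Documents with no recent views do not appear in the result, so the job would also need to reset their stored count to zero.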

Store a threaded view in ElasticSearch

I am parsing a threaded forum (tree with parent_id joins) and am trying to store the single postings in ElasticSearch while keeping the hierarchy. However I am not quite sure what the best way would be.
parent/child model: The difficulty here is that the root elements don't have parents, and I am not sure whether or not I can point _parent to its own type.
Also a bonus question on this one. When inserting, do I need to pass the parent as a query param, or can I add it in the data object as well?
nested model: I cannot tell in advance how deep the tree might get, and I don't really want to put useless objects in the mapping.
I feel that this would be not such an uncommon task to do, so any advice would be great!
I wouldn't recommend taking your approach for this purpose.
With either parent/child or nested you would have to pre-define the maximum depth of your tree and express that with some nasty mapping, while also enumerating each level's field in your search queries.
With parent/child you'd actually be creating additional indices for each level, which adds unnecessary resource overhead.
Is Elasticsearch your primary datasource? If not, consider simply indexing forum posts as a flat collection of documents with enough information present to be able to reconstruct the thread from your primary. E.g.:
POST
Thread ID
Author ID (perhaps not needed for search?)
Post ID
Parent ID (perhaps not needed for search?)
Post Date
Post Title
Post Body
Then Elasticsearch is reduced to the role of text search / highlighting engine, and will happily give you back snippets and Post IDs/Thread IDs needed to reconstruct the thread from the database.
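Rebuilding the thread from such a flat collection is straightforward once each post carries its parent ID. The sketch below does it in plain Python; the function name build_thread and the field names (post_id, parent_id, post_date, title) are assumptions modelled on the list above, and root posts are assumed to have a parent_id of None.

```python
from collections import defaultdict


def build_thread(posts):
    """Rebuild a reply tree from flat posts that carry a parent_id."""
    # Group posts by parent, oldest first within each group.
    children = defaultdict(list)
    for post in sorted(posts, key=lambda p: p["post_date"]):
        children[post["parent_id"]].append(post)

    def attach(post):
        return {
            "post_id": post["post_id"],
            "title": post["title"],
            "replies": [attach(child) for child in children[post["post_id"]]],
        }

    # Root posts are the ones without a parent.
    return [attach(root) for root in children[None]]


# Hypothetical flat posts, in no particular order.
posts = [
    {"post_id": 3, "parent_id": 1, "post_date": "2024-01-03", "title": "Re: Hello"},
    {"post_id": 1, "parent_id": None, "post_date": "2024-01-01", "title": "Hello"},
    {"post_id": 2, "parent_id": 1, "post_date": "2024-01-02", "title": "Re: Hello"},
    {"post_id": 4, "parent_id": 2, "post_date": "2024-01-04", "title": "Re: Re: Hello"},
]
thread = build_thread(posts)
```

The tree can be as deep as the data requires, because the depth lives in the application code rather than in the Elasticsearch mapping.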
If Elasticsearch is your primary store, then hopefully you've read this thread already. There is a commercial Elasticsearch plugin created by Siren Solutions which enables Elasticsearch to manage truly schemaless, nested data like yours.

getting complete tree from elastic search in single call

I want to store an employee hierarchy in Elasticsearch, where the CFO, CTO, COO etc. report to the CEO, and each employee can have their own reportees.
I think the above can be done using the Elasticsearch parent-child relationship. Can we write a query to get all the reportees (direct reportees and sub-reportees) in a single call?
For example, if we query for the CEO we should get all employees, and for the CFO we should get the employees in the finance dept.
Something similar exists in RDBMSs, like SQL Server's CTE.
Parent-child relations in ES work as follows:
Parent knows nothing about children
Children must provide _parent to connect with it and to be routed accordingly.
Parent-child mapping is handled by ES via mapping in memory.
Parent/child documents are independent in every other aspect.
So, there is no easy way to do it (there's no way to actually store a normal form of any relational data either, because ES is a non-relational DB). Workarounds for this:
query documents with has_parent/has_child queries (only 1 level of relation works for this)
store documents as nested objects (pay attention, that this model reindexes whole document if any of members changes)
denormalize data (most natural way for non-relational storages, IMO)
First and foremost, avoid thinking about ES in a relational database way. ES isn't well suited to joins/relations, though it can achieve a similar effect via parent/child relations. Don't even think about joins that might involve an undetermined depth. A CTE can handle that without much difficulty, but not all relational databases support CTEs AFAIK (MySQL before version 8.0 being one that doesn't).
The parent-child relation is more trouble than it's worth IMO. Child docs are routed to the shards where their parents reside. In your case of a tree, all documents will eventually trace back to the root document, which will result in all your documents residing in a single shard. The depth of your tree could be quite large (more than 4 or 5 in a not-so-small organization). Also, if you go with this solution, it is quite inconvenient to retrieve (via the GET API) a particular child doc from ES based on its ID, because you have to specify its parent IDs all the way up to its root.
I think it's best to store the PATH from root up to but not including the current employee as a list of IDs. So each employee has a field like:
"superiors": [CEO_ID, CTO_ID, ... , HER_DIRECT_MANAGER_ID],
So it is completely denormalized and your application has to prepare for this list.
With this setup, to get all subordinates of an employee:
Filter out the IDs in this employee's own superiors field, plus her own ID, using either a filter agg or a filtered query.
Do a terms agg on the superiors field, and you will have all subordinates of this employee.
I must admit that at least two queries are needed: first a GET request to retrieve the superiors field of this employee, and then a second query to do what I described above.
Also, don't worry about the duplications due to denormalization. ES can handle way more data than you can save here.
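A minimal sketch of the superiors list idea follows, in plain Python rather than Elasticsearch queries. The names manager_of, superiors_of and subordinates_of are hypothetical. Note that the lookup here is a simple membership test on the superiors field, which is an assumption standing in for the two-query approach described above, not a reproduction of it.

```python
def superiors_of(employee_id, manager_of):
    """Return the path of IDs from the root down to the direct manager."""
    path = []
    current = manager_of.get(employee_id)
    while current is not None:
        path.append(current)
        current = manager_of.get(current)
    path.reverse()  # root first, direct manager last
    return path


def subordinates_of(employee_id, employees):
    """Everyone whose superiors list contains the given employee."""
    return sorted(e["id"] for e in employees if employee_id in e["superiors"])


# Hypothetical org chart: employee ID -> direct manager ID.
manager_of = {
    "ceo": None,
    "cto": "ceo",
    "cfo": "ceo",
    "dev": "cto",
    "accountant": "cfo",
}

# The application prepares the denormalized field before indexing.
employees = [
    {"id": emp, "superiors": superiors_of(emp, manager_of)} for emp in manager_of
]
```

Because the full path is stored on every employee, the depth of the hierarchy never has to appear in a mapping or a query.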

Data model for fields that change frequently in ElasticSearch

What is the best way to deal with fields that change frequently inside a document for ElasticSearch? Per their docs about partial updates...
Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described.
In particular, what should be done when the indexing of the document will likely be expensive given the number of indexed field and the size of some of the text fields that have to be analyzed?
As a concrete example, use SO's view and vote counts on questions and answers. It would seem expensive to reindex the text body just to update those values.
Maybe you shouldn't update so frequently. Perhaps things like vote/views should only be periodically updated in ES, while more critical fields like answers/questions be pushed immediately. Consider what's most important and see if you can get away with some level of staleness.
ElasticSearch is great for text search, but I would not consider ES to support SO in its entirety (or similar applications). It could be a useful tool for searching for answers/questions on SO, or for internal applications (like log/event analysis). But perhaps the actual serving of data could be better done with a different solution? Maybe it should be powered by Cassandra instead for the bulk of the work? You get the idea...
If you want to use ES as a solution to your needs, and you MUST update frequently, you could definitely consider the parent/child model mentioned already. Of course, that method will require more memory/disk space, and it will take up more CPU/time when you query for totals. An alternative would be to have the parent store the searchable fields, and let the child hold the metadata (where the child's fields are not analyzed). This will allow you to make frequent updates without having to undergo an expensive re-index, since there is nothing to analyze.
You could also consider what I mentioned above and see if you can get away with some staleness. This can be done in many ways too. You can throttle your requests by type of change, or change the refresh/flush interval, or consider de-duping updates if you are sending updates in bulk. These too have their shortcomings...
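De-duping updates before a bulk send might look roughly like the sketch below: raw view events are buffered and collapsed into one increment per post, so each post is updated once per flush instead of once per view. This is plain Python; the function name coalesce_view_updates and the sample IDs are assumptions for illustration.

```python
from collections import Counter


def coalesce_view_updates(viewed_post_ids):
    """Collapse a buffer of view events into one increment per post."""
    return dict(Counter(viewed_post_ids))


# Hypothetical buffer of post IDs, one entry per page view since the last flush.
pending = ["q1", "q2", "q1", "q1", "q3", "q2"]

# One update per post would be sent, carrying the accumulated increment.
increments = coalesce_view_updates(pending)
```

The longer the buffer is held, the fewer reindex operations are needed, at the price of staler counts in the index.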
I think the best way to handle the change is to split the document (you can use a parent-child relationship, or just store the parent ID) and make each document as small as possible, moving the changeable parts to new types.
Here is a way to accomplish your requirement, taking SO as the example.
You can use multiple types for this. Consider this post (view and vote counts).
Create a type for post, view and vote.
For a post, index a document to the post type (index the post ID, title, description and tags). For every view of that post, index a document to the view type (with the ID of the post). For every vote, index a document to the vote type (with the number of votes, the ID of the post, and any other info you need, such as a positive or negative flag).
So, to get the views for a post, filter on the post ID and get the document count in the view type.
To get the number of votes, use a stats aggregation on the number of votes, or a terms aggregation followed by a stats aggregation to get positive and negative votes separately.
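The two lookups described above could be expressed as request bodies along the following lines. This sketch only builds the bodies as Python dicts and does not call Elasticsearch. The term, terms and stats keys are standard query DSL, while the field names post_id, is_positive and votes, and the function names, are assumptions for illustration.

```python
def view_count_query(post_id):
    """Body for a count request against the view documents of one post."""
    return {"query": {"term": {"post_id": post_id}}}


def vote_stats_query(post_id):
    """Body that splits votes by direction and computes stats for each."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "query": {"term": {"post_id": post_id}},
        "aggs": {
            "by_direction": {
                "terms": {"field": "is_positive"},
                "aggs": {"vote_stats": {"stats": {"field": "votes"}}},
            }
        },
    }


# Hypothetical post ID.
views_body = view_count_query(42)
votes_body = vote_stats_query(42)
```

The sum reported by the stats aggregation in each bucket would give the total positive and negative votes for the post.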
This is the way I think is best, though there may be other opinions too.
Thanks
What I do is use a database like Mongo or MySQL for storing properties that get updated frequently, and use Elasticsearch to store documents for text searching.
Example: I want to keep data about a book and its contents, and I also want to keep the total number of views. Updating and reindexing the document each time a user views it is total overkill.

Elastic search and "databases"

Sorry for the ambiguous title, couldn't think of anything better fitting.
I'm exploring Elastic Search and it looks very cool. My question is conceptual since I'm used to SQL.
In SQL, you have different databases and you store the data for each application there. Does the same concept exist in ES? Or is all data from all my applications going to end up in the same place? In that case, what are the best practices to avoid unwanted results from unfitting data?
Schemaless doesn't mean structureless:
In elastic search you can organize your data into document collections
A top-level document collection is roughly equivalent to a database
You can also hierarchically create new document collections inside top-level collections, which is a very rough equivalent of a database table
When you search documents, you search for documents inside specific document collections (such as search for all posts inside blog1)
Individual documents can be viewed as equivalent to rows in a database table
Also please note that I say roughly equivalent -- data in SQL is often normalized into tables by relations, while documents (in ES) often hold large entities of data. For instance, it generally makes sense to embed all comments inside a blog post document, whereas in SQL you would normalize comments and blogposts into individual tables.
For a nice tutorial, I recommend taking look at "ElasticSearch in 5 minutes" tutorial.
Switching from SQL to a search engine can be challenging at times. Elasticsearch has a concept of an index, which can be roughly mapped to a database, and a type, which can, again very roughly, be mapped to a table. Elasticsearch has a very powerful mechanism for selecting records (rows) of a single type and combining results from different types and indices (union). However, there is no support for joins at the moment. The only relationship that Elasticsearch supports is has_child, but it's not suitable for modeling many-to-many relationships. So, in most cases, you need to be prepared to denormalize your data so that it can be stored in a single table.
