getting complete tree from elastic search in single call - elasticsearch

I want to store employee hierarchy in elastic search. where CFO, CTO, COO etc report to CEO. And each employee can have their own reportees.
I think above can be done using elastic search parent-child relationship. Can we write a query to get the all reportees(direct reportees and sub-reportees) in a single call.
For example if we query for CEO we should get all employees and for CFO we should get employees in finance dept.
Something similar exists in RDMS like SQL server's CTE.

Parent-child relations in ES is:
Parent knows nothing about children
Children must provide _parent to connect with it and to be routed accodringly.
Parent-child mapping is handled by ES via mapping in memory.
Parent/child documents is independent in any other aspect.
So, there is no easy way to do it (there's no way to actually store normal form of any relational data as well, because ES in non-relational DB). Workarounds about this:
query documents with has_parent/has_child queries (only 1 level of relation works for this)
store documents as nested objects (pay attention, that this model reindexes whole document if any of members changes)
denormalize data (most natural way for non-relational storages, IMO)

First and foremost, avoid thinking about ES in a relational database way. ES isn't so suited for joins/relations, though it can achieve similar effect via the parent/child relations. Don't even think about joins that might involve a undetermined number of depths. CTE can handle without much difficulty but not all relational databases support CTE AFAIK (MySQL being one).
The parent-child relations is more trouble than its worth IMMO. Child docs are routed to shards where their parents reside. In your case of a tree, all documents will eventually trace back to the root document, which will result all your documents to reside in a single shard. The depth of your tree could be quite large (more than 4 or 5 in a not-so-small organization). Also, if you go with this solution, it is quite inconvenient to retrieve (via the GET API) a particular child doc from ES based on its ID, because you have to specify its parent IDs all the way up to its root.
I think it's best to store the PATH from root up to but not including the current employee as a list of IDs. So each employee has a field like:
"superiors": [CEO_ID, CTO_ID, ... , HER_DIRECT_MANAGER_ID],
So it is completely denormalized and your application has to prepare for this list.
With this setup, to get all subordinates of an employee:
filtering out IDs in this employee's own superiors field plus her own ID, either using a filter agg or a filtered query.
do a terms agg on the superiors field and you will have all subordinates of this employee.
I must admit that at least two queries are needed. The first one is a GET request to retrieve the superiors field of this employee and then the second query to do what I described above.
Also, don't worry about the duplications due to denormalization. ES can handle way more data than you can save here.

Related

ElasticSearch - Is Parent Child relationship the best approach?

Problem:
I have two types of repositories one is document and another one pages. There is a relationship between document and pages. You can think of them as one document(book) with 1 or more pages. Practically I may need to query pages from a document which matches certain criteria and vice versa. So what I am saying is I may some time query certain pages if not all from pages where the document matches.
Currently, I have created a Parent-Child relation in the Parent I have indexed the documents and in Child, I have indexed the pages with reference to the document.
But we have performance issues in our setup, the search and index queries are becoming very slow as the documents increases. I also found out that using Parent-Child relationship is not recommended as it is time-consuming for the elasticsearch site.
Is there any other Data modelling that I could use for this problem.
Yes. Index in the page object all the informations you have in document.
If I put that in another way: do join at index time and not search time.

I have 2 index as, which were 2 tables in sql as I can perform an inner joins query

how I have 2 index one is called assignment and the other user in sql had a data field fk but I do not know how to perform an inner join in elasticsearch someone can support me
So you have a couple options which might be useful, without knowing your specific use case I'm going to list a potentially useful links.
1)
parent child mapping, really useful when you want to return all documents associated with a specific document. To make the mapping process a bit easier I typically index the data the retrieve the mapping using the /_mapping endpoint, modify the mapping, delete the index, then reingest the data. Sometimes that isn't an option in the case of short lived data.
https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child-mapping.html
after updating the current mapping it's possible to use one of the joining queries.
https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html
2)
When deleting the index and re ingesting the data isn't an option, create a new index, modify the data as described above, but instead of deleting the index use the re index API to get the information to the new index.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
3)
It might also be possible to use an ingest processor to join the tables
https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest-processors.html
4)
possibly the quickest until you get your head wrapped around how elasticsearch works is to either join the information prior to ingesting or write a script joining the tables using one of the sdk's.
https://elasticsearch-py.readthedocs.io/en/master/
https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/index.html
plus a lot more build by the community.

How to filter a huge list of ids from Solr at runtime

I have an index for products is Solr. I need to serve a customized list of products for each customer such that I have to exclude some specific products for each customer.
Currently I am storing this relationship of customer & excluded products in a SQL database and then filtering them in Solr using a terms query. Is there a way I can store this relationship in Solr itself so that I dont have to calculate the exclude list every time from SQL first.
Something very similar to what we can do in elasticsearch using https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
Possible ways I could think of doing in Solr:
Keeping a list of customers in the products index itself, and filter on that. But this will really be a pain if I have to reindex all the documents. Also the list can be huge.
Another way that I could think of is maintaining a separate core for keeping documents per customer and excluded product_id and perform a join using {!join} to filter out products for a customer. Is it a scalable solution.
What should be the ideal approach for storing such kind of data in Solr.
Are there any performance issues with the SQL DB? It is perfectly fine to query the DB and get the IDs, and send them to Solr. You would avoid complexity and data duplication. You would anyway have to do some computation to send those IDs to Solr as well.
But to answer your question, yes, you could store the excluded product IDs per customer indeed in a separate index. You would be using a multi-valued field and update using atomic updates. If you do that, make sure to keep the indexing schema simple with no analyzer used for the IDs (just use the string type without any tokenizer or filter).
You do not need to do a Solr join query. You only have to lookup the product IDs per customer (1st query) and massage them as CSV, and do the terms query with the product IDs retrieved from the index (2nd query).
You need to find the best compromise for you
Best Query Time Performances
You add a field (multi valued) to the product index : allowed_users ( or forbidden_users) depending on the cardinality ( that you want to minimise).
This would require a re-indexing for the first time and an index update for each User permission change.
In order to reduce the network traffic and optimise the updates you could take a look to atomic updates[1] .
Best Index Time Performances
If the previous approach is not feasible in your case or doesn't satisfy you, you could try to optimise the indexing side.
You can index a document in a separate collection :
<Id>
<product_id>
<user_id>
You can use the query time join to filter the collection for the current user and then get back the products to filter them on your query.
So basically, you already thought about both the ideas :)
[1] https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html

Store a threaded view in ElasticSearch

I am parsing a threaded forum (tree with parent_id joins) and am trying to store the single postings in ElasticSearch while keeping the hierarchy. However I am not quite sure what the best way would be.
parent/child model: The difficulty here is, that the root elements don't have parents + I am not sure whether or not I can point _parent to its own type.
Also a bonus question on this one. When inserting, do I need to pass the parent as query param or can I add it in the data-object as well?
nested model: I cannot tell in advance how deep the tree might get and I don't really to put useless objects in the mapping
I feel that this would be not such an uncommon task to do, so any advice would be great!
I wouldn't recommend taking your approach for this purpose.
Using both parent/child and nested you would have to pre-define the maximum depth of your tree, and articulate that with some nasty mapping. (While enumerating each level's field in your search queries.)
With parent/child you'd actually be creating additional indices for each level, which adds unnecessary resource overhead.
Is Elasticsearch your primary datasource? If not, consider simply indexing forum posts as a flat collection of documents with enough information present to be able to reconstruct the thread from your primary. E.g.:
POST
Thread ID
Author ID (perhaps not needed for search?)
Post ID
Parent ID (perhaps not needed for search?)
Post Date
Post Title
Post Body
Then Elasticsearch is reduced to the role of text search / highlighting engine, and will happily give you back snippets and Post IDs/Thread IDs needed to reconstruct the thread from the database.
If Elasticsearch is your primary store, then hopefully you've read this thread already. There is a commercial Elasticsearch plugin created by Siren Solutions which enables Elasticsearch to manage truly schemaless, nested data like yours.

Elastic search and "databases"

Sorry for the ambiguous title, couldn't thing of anything better fitting.
I 'm exploring Elastic Search and it looks very cool. My question is conceptual since I 'm used to sql.
In Sql, you have different databases and you store the data for each application there. Does the same concept exist in ES? Or is all data from all my application going to end up in the same place? In that case, what are the best practices to avoid unwanted results from unfitting data?
Schemaless doesn't mean structureless:
In elastic search you can organize your data into document collections
A top-level document collection is roughly equivalent to a database
You can also hierarchically create new document collections inside top-level collections, which is a very rough equivalent of a database table
When you search documents, you search for documents inside specific document collections (such as search for all posts inside blog1)
Individual documents can be viewed as equivalent to rows in a database table
Also please note that I say roughly equivalent -- data in SQL is often normalized into tables by relations, while documents (in ES) often hold large entities of data. For instance, it generally makes sense to embed all comments inside a blog post document, whereas in SQL you would normalize comments and blogposts into individual tables.
For a nice tutorial, I recommend taking look at "ElasticSearch in 5 minutes" tutorial.
Switching from SQL to a search engine can be challenging at times. Elasticsearch has a concept of index, that can be roughly mapped to a database and type that can, again very roughly, mapped to a table. Elasticsearch has very powerful mechanism of selecting records (rows) of a single type and combining results from different types and indices (union). However, there is no support for joins at the moment. The only relationship that elasticsearch supports is has_child, but it's not suitable for modeling many-to-many relationships. So, in most cases, you need to be prepared to denormalize your data, so it can be stored in a single table.

Resources