How to join indexes in Elasticsearch

I have two indexes: Student

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      }
    }
  }
}
and University, which has a one-to-many relationship with Student.
How do I declare the mappings for University?

While you can use the join field type here, you should be wary of trying to use Elasticsearch as a relational database. It doesn't really support joins without a significant performance tax and without some limitations (parent and child must be indexed into the same shard).
Usually the answer is to de-normalize the relation. For example, in this case, put some of the University fields directly in the Student document, allowing you to search on them directly.
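As a sketch, a denormalized Student mapping might embed the university fields directly (the university field names below are illustrative, not from the question):

```json
{
  "mappings": {
    "properties": {
      "name":            { "type": "text" },
      "university_id":   { "type": "keyword" },
      "university_name": {
        "type": "text",
        "fields": { "raw": { "type": "keyword" } }
      }
    }
  }
}
```

A query such as {"query": {"match": {"university_name": "MIT"}}} then returns matching students directly, with no join at query time; the cost is re-indexing students when university data changes.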

Related

How to model this multi-tenancy usecase?

We are a multi-tenant platform.
The platform has a construct called Entity.
Users can create entities to model any real-life object eg: Customers, Orders, Payment, Inventory, Cart, pretty much anything.
Each entity will have its set of attributes, for example, a customer entity can have: name, email, phone, address (can be another nested entity), etc.
The requirement is to provide query/OLAP capabilities on these entities. For example, find all customers where name = 'john'.
The requirement includes all types of queries such as DATE RANGE, CONTAINS, LIKE, NUMERIC RANGE, FULL-TEXT Queries, etc. We also need Sorting, Aggregation, Pagination features.
Current design
We use elasticsearch to store entity data.
Each tenant is assigned a separate index.
When an entity is created in a tenant, the corresponding mappings are created inside the associated index. The mappings have roughly the following form:
{
  "properties": {
    "Customer": {
      "properties": {
        "name": {
          "values": { "type": "text" }
        },
        "age": {
          "values": { "type": "integer" }
        }
      }
    },
    "Order": {
      "properties": {
        "id": {
          "values": { "type": "text" }
        },
        "eta": {
          "values": { "type": "integer" }
        }
      }
    }
    // ... other entities of this tenant
  }
}
Major Problems with this design
Ever-growing mappings.
Frequent mapping updates keep the nodes busy circulating cluster state updates, leading to search/indexing latency and occasional timeouts.
Existing mappings can't be altered when required; we have to go through an entire re-index procedure.
The current design was able to serve us for a few years until recently when the issues started popping up.
What would be a good design to model the above multi-tenancy requirement? Which database solution and schema modeling will be appropriate?
If you decide to stick with ES, you'll need to give your tenants more than one index, i.e. one index per tenant/entity instead of one index per tenant. The reasons for this are exactly the ones you mentioned, i.e. ever-growing mapping and the difficulty to update existing mappings.
You'll end up with more indexes for sure (N tenants x M entities) and the challenge will be to properly size those indexes in terms of how many primary shards you need for each of their entities. But I think it's even more difficult now that all entities are stored in a single index per tenant, so it'll turn out to be easier to fan out your tenant index into several.
Another option is to come up with a very generic mapping, which only contains typed fields like int_field_1, int_field_2, text_field_1, text_field_2, etc, and you keep a per-tenant mapping between the generic field names and the tenant specific field names:
name -> text_field_1
age -> int_field_1
...
That way you have fewer mappings to manage and more flexibility in what kind of data you can accommodate, but it comes at the price of keeping the above field mapping up to date.
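A sketch of what such a generic mapping could look like (the field names and counts are illustrative):

```json
{
  "properties": {
    "tenant_id":    { "type": "keyword" },
    "entity_type":  { "type": "keyword" },
    "text_field_1": { "type": "text" },
    "text_field_2": { "type": "text" },
    "int_field_1":  { "type": "integer" },
    "int_field_2":  { "type": "integer" },
    "date_field_1": { "type": "date" }
  }
}
```

The per-tenant translation (e.g. Customer.name -> text_field_1, Customer.age -> int_field_1) would live outside Elasticsearch, say in a relational table, and the application rewrites incoming queries through it before sending them to the cluster.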
In any case, you'll need to end up with more indexes for your tenants in order to make their mappings easier to manage and keep their size at bay. It will also allow you to scale better, because it's easier to spread several smaller indices over your data nodes than a few very big ones, especially given the sizing recommendations made by Elastic.

Elasticsearch, join data type: single mapping type for parent and child fields

I want to implement parent/child relationships between two entities X and Y, each with a completely different set of fields, in Elasticsearch 6.3.2. I was going to create two mapping files, one for each of the associations, and define the _parent field on the child side.
But according to ES documentation, starting with 6.x multiple types are no longer supported in a single index.
So with this restriction, should I put all the fields for entity X and Y into a single mapping file? If so, what if I have same field, say name in both entities. Should I name them x.name and y.name? What is the approach here?
Parent child documents reside in the same index.
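The join field itself has to be declared in the index mapping first; following the pattern in the Elasticsearch docs, roughly:

```json
PUT index-name
{
  "mappings": {
    "properties": {
      "my_id": { "type": "keyword" },
      "text":  { "type": "text" },
      "my_join_field": {
        "type": "join",
        "relations": { "question": "answer" }
      }
    }
  }
}
```

Note that child documents must be routed to their parent's shard, e.g. POST index-name/_doc/2?routing=1.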
Example
Parent document:
POST index-name/_doc/1
{
  "my_id": "1",
  "text": "This is a question",
  "my_join_field": "question"
}
Child document (note the routing parameter, required so the child lands on the parent's shard):
POST index-name/_doc/2?routing=1
{
  "my_id": "2",
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}
The examples above have the same fields, but parent and child can have different fields, in which case a field will be null in one document and have a value in the other. The join type is used to distinguish parent and child documents.
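Once indexed, children can be matched from the parent side (and vice versa with has_parent). A minimal sketch, assuming the question/answer relation names above:

```json
GET index-name/_search
{
  "query": {
    "has_child": {
      "type": "answer",
      "query": { "match_all": {} }
    }
  }
}
```

This returns parent (question) documents that have at least one answer child.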

ElasticSearch content ACL Filtering performance

Following is my content model.
Document(s) are associated with user & group acls defining the principals who have access to the document.
The document itself is a bunch of metadata & a large content body (extracted from pdfs/docs etc).
The user performing the search has to be limited to only the set of documents he/she is entitled to (as defined by the acls on the document). He/She could have access to the document owing to user acls or owing to the group the user belongs to.
Both group membership and the ACLs on the document are highly transient, meaning a user's group membership changes quite often, as do the ACLs on the document itself.
Approach 1
Store the acls on the document along with its metadata as a non-stored field. Expand the groups in the ACL to the individual users (since the acl can be a group).
At the time of query, append a filter to the user query which will do a bool filter to include only documents with the userid in the acl field
"filter": {
  "query": {
    "term": {
      "acls": "1234"
    }
  }
}
The problem I see with this approach is that documents need to be re-indexed even though the document metadata/content has not changed:
Every time a user's group membership changes
Every time the ACL on the document changes (permission changed for the document)
I am assuming that this will lead to a large number of segment creation and merges and especially since the document body (one of the fields of the document) is a pretty large text section.
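For reference, a rough sketch of what the approach 1 mapping could look like (field names are illustrative; the acls field holds the expanded list of entitled user ids):

```json
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "body":  { "type": "text" },
      "acls":  { "type": "keyword" }
    }
  }
}
```

Every group-membership or permission change then rewrites the acls array, which means re-indexing the whole document, including the large body field.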
Approach 2:
This is a modification of approach 1. It attempts to limit updates on the document when the updates are strictly ACL-related.
Instead of defining the ACLs on the metadata, this approach entails creating multiple types:
In the Document index:
Document (with metadata & text body) as the parent, with fields:
  id
  text
userschild document (parent id & user ACLs only; one exists for each parent), with fields:
  id
  parentid
  useracls
groupschild document (parent id & group ACLs only; one exists for each parent with group ACLs), with fields:
  id
  parentid
  groupacls
In the Users index:
User, an entry for each user in the system with the groups he/she is associated with, with fields:
  id
  groups
The idea here is that updates are now localized to the different ElasticSearch entities.
In case of user acl changes only the userschild document will get updated (avoiding a potentially costly update on the parent document).
In case of the group acl changes only the groupschild document will get updated (again avoiding a potentially costly update on the parent document).
In case of user group membership changes again only the secondary index will get updated (avoiding the update on the parent document).
The query itself will look as follows.
"filter": {
  "query": {
    "bool": {
      "should": [
        {
          "has_child": {
            "type": "userschild",
            "query": {
              "term": {
                "users": "1234"
              }
            }
          }
        },
        {
          "has_child": {
            "type": "groupschild",
            "query": {
              "terms": {
                "groups": {
                  "index": "users",
                  "type": "user",
                  "id": "1234",
                  "path": "groups"
                }
              }
            }
          }
        }
      ]
    }
  }
}
I have doubts with regard to its scalability owing to the nature of the query involved: it contains two queries, one of which (the terms lookup) has to be built from a separate index. I am considering improving the terms lookup by enabling doc values on the lookup fields.
Will the approach 2 scale? The concerns I have are around the has_child query and its scalability.
Could someone clarify my understanding in this regard?
I think perhaps this is overcomplicated by expanding groups before querying. How about leaving group identifiers intact in the documents index instead?
Typically, I'd represent this in two indices (no parent-child, or any sort of nested relationships at all).
Users Index
(sample doc)
{
  "user_id": 12345,
  "user_name": "Swami PR",
  "user_group_ids": [900, 901, 902]
}
Document Index
(sample doc)
{
  "doc_id": 98765,
  "doc_name": "Lunch Order for Tuesday - Top Secret and Confidential",
  "doc_acl_read_users": [12345, 12346, 12347],
  "doc_acl_write_users": [12345],
  "doc_acl_read_groups": [435, 620],
  "doc_acl_write_groups": []
}
That Users Index could just as easily be in a database... your app just needs "Swami's" user_id and group_ids available when querying for documents.
Then, when you query [Top Secret] documents as Swami PR, (to read), make sure to add:
"should": [
  {
    "term": {
      "doc_acl_read_users": 12345
    }
  },
  {
    "terms": {
      "doc_acl_read_groups": [900, 901, 902]
    }
  }
],
"minimum_should_match": 1
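Put together, a full search request could look roughly like this (the index name and the match clause are illustrative; the ids and ACL fields are the ones from the sample docs above):

```json
GET documents/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "doc_name": "Top Secret" } }
      ],
      "filter": {
        "bool": {
          "should": [
            { "term":  { "doc_acl_read_users": 12345 } },
            { "terms": { "doc_acl_read_groups": [900, 901, 902] } }
          ],
          "minimum_should_match": 1
        }
      }
    }
  }
}
```

Keeping the ACL clauses in filter context means they don't affect scoring and can be cached by Elasticsearch.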
I can see 2 main types of updates that can happen here:
Users or Groups updated on Document := reindex one record in Document Index
User added to/removed from Group := reindex one record in User Index
With an edge case:
User or Group deleted: okay, here you might want to batch through and reindex all documents periodically to clean out stale user/group identifiers... but theoretically, stale user/group identifiers won't exist in the application anymore, so they don't cause issues in the index.
I have implemented approach #2 at my company, and I'm now researching a graph DB to handle the ACLs. Once we crossed 4 million documents for one of our clients, updates and search queries became quite frequent and didn't scale as expected.
My suggestion is to look into graph frameworks to solve this.

Can I use parent-child relationships on Kibana?

On a relational DB, I have two tables connected by a foreign key, on a typical one-to-many relationship. I would like to translate this schema into ElasticSearch, so I researched and found two options: the nested and parent-child. My ultimate goal was to visualize this dataset in Kibana 4.
Parent-child seemed the most adequate one, so I'll describe the steps that I followed, based on the official ES documentation and a few examples I found on the web.
curl -XPUT http://server:port/accident_struct -d '
{
  "mappings": {
    "event": {},
    "details": {
      "_parent": {
        "type": "event"
      },
      "properties": {}
    }
  }
}
';
here I create the index accident_struct, which contains two types (corresponding to the two relational tables): event and details.
Event is the parent, thus each document of details has an event associated to it.
Then I upload the documents using the bulk API. For event:
{"index":{"_index":"accident_struct","_type":"event","_id":"17f14c32-53ca-4671-b959-cf47e81cf55c"}}
{values here...}
And for details:
{"index":{"_index":"accident_struct","_type":"details","_id": "1", "_parent": "039c7e18-a24f-402d-b2c8-e5d36b8ad220" }}
The event does not know anything about children, but each child (details) needs to set its parent. In the ES documentation I see the parent being set using "parent", while in other examples I see it using "_parent". I wonder what is the correct option (although at this point, none works for me).
The requests complete successfully and I can see that the number of documents contained in the index corresponds to the sum of events + details.
I can also query parents for children and children for parents, on ES. For example:
curl -XPOST host:port/accident_struct/details/_search?pretty -d '{
  "query": {
    "has_parent": {
      "type": "event",
      "query": {
        "match_all": {}
      }
    }
  }
}'
After setting the index on Kibana, I am able to list all the fields from parent and child. However, if I go to the "discover" tab, only the parent fields are listed.
If I uncheck a box that reads "hide missing fields", the fields from the child documents are shown greyed out, along with an error message (see image).
Am I doing something wrong or is the parent-child not supported in Kibana4? And if it is not supported, what would be the best alternative to represent this type of relationship?
Per the comment in this discussion on the elastic site, P/C is, like nested objects, at least not supported in visualizations. Le sigh.

Elasticsearch with multiple parent/child relationship

I'm building an application with a complicated model, say Book, User and Review.
A Review contains both Book and User id.
To be able to search for Books that contain at least one review, I've set the Book as Review's parent and have routing as such. However I also need to find Users who wrote reviews that contain certain phrases.
Is it possible to have both the Book and User as Review's parent? Is there a better way to handle such situation?
Note that I'm not able (or willing) to change the way the data is modeled, because the data is transferred to Elasticsearch from a persistence database.
As far as I know you can't have a document with two parents.
My suggestion based on Application-side join chapter of Elasticsearch the definitive guide:
Create a parent/child relationship Book/Review
Be sure you have user_id property in Review mapping which contain the user id who wrote that review.
I think that covers both uses cases you described as follows:
Books that contain at least one review: this can be solved with a has_child filter/query.
Users who wrote reviews that contain certain phrases: this can be solved by querying reviews for the phrase you want to search and performing a cardinality aggregation on the user_id field. If you need user information, you have to query your database (or another Elasticsearch index) with the ids retrieved.
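A sketch of that reviews query (index and field names are assumed from the description; note that a cardinality aggregation returns only the distinct count, so a terms aggregation is added here to actually retrieve the user ids):

```json
POST reviews/_search
{
  "size": 0,
  "query": {
    "match_phrase": { "text_review": "certain phrase" }
  },
  "aggs": {
    "distinct_reviewers": { "cardinality": { "field": "user_id" } },
    "reviewer_ids":       { "terms": { "field": "user_id", "size": 100 } }
  }
}
```

The user_id values in the reviewer_ids buckets can then be used to look up user details in the database.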
Edit: "give me the books that have reviews this month written by a user whose name starts with John"
I recommend you collect all those advanced use cases and denormalize the data you need to achieve them. In this particular case it's enough to denormalize the user name into Review. In any case, the Elasticsearch people have written about managing relations in their blog and in Elasticsearch: The Definitive Guide.
Something like this (just make the Books type the parent of the Users and Reviews types):
.../index/users/_search?pretty" -d '
{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          {
            "has_parent": {
              "parent_type": "books",
              "filter": {
                "has_child": {
                  "type": "Reviews",
                  "query": {
                    "term": {
                      "text_review": "some word"
                    }
                  }
                }
              }
            }
          }
        ]
      }
    }
  }
}
'
You have two options
Elasticsearch Nested Objects
Elasticsearch parent&child
both are compared and evaluated nicely here
