ElasticSearch content ACL filtering performance

Following is my content model.
Documents are associated with user and group ACLs defining the principals who have access to them.
The document itself is a bunch of metadata plus a large content body (extracted from PDFs/DOCs etc.).
The user performing the search has to be limited to only the set of documents he/she is entitled to (as defined by the ACLs on the document). He/she could have access to a document either via a user ACL or via a group he/she belongs to.
Both group membership and ACLs on the document are highly transient: a user's group membership changes quite often, and so do the ACLs on the document itself.
Approach 1:
Store the ACLs on the document along with its metadata as a non-stored field. Expand the groups in the ACL to the individual users (since an ACL entry can be a group).
At query time, append a filter to the user's query: a bool filter that includes only documents with the user id in the acl field.
"filter" : {
        "query" : {
            "term": {
                "acls": "1234"
            }
        }
      }
The problem I see with this approach is that documents need to be re-indexed even though the document metadata/content has not changed:
Every time a user's group membership changes
Every time the ACL on the document changes (permission changed for the document)
I am assuming that this will lead to a large number of segment creations and merges, especially since the document body (one of the fields of the document) is a pretty large text section.
Approach 2:
This is a modification of approach 1. It attempts to limit updates on the document when the updates are strictly ACL related.
Instead of defining the ACLs on the metadata, this approach entails creating multiple types.
In the Document Index:
    Document (with metadata & text body) as the parent
        id
        text
    userschild document (parent id & user acls only); this document will exist for each parent
        id
        parentid
        useracls
    groupschild document (parent id & group acls only); this document will exist for each parent with group acls
        id
        parentid
        groupacls
In the Users Index:
    An entry for each user in the system with the groups he/she is associated with
    User
        id
        groups
The idea here is that updates are now localized to the different Elasticsearch entities.
If a user ACL changes, only the userschild document gets updated (avoiding a potentially costly update on the parent document).
If a group ACL changes, only the groupschild document gets updated (again avoiding a potentially costly update on the parent document).
If a user's group membership changes, only the Users index gets updated (again avoiding the update on the parent document).
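For reference, a sketch of what the parent/child mappings for this layout could look like in the pre-5.x multi-type syntax (the index name "documents" and the exact field types are assumptions, not part of the question):
PUT /documents
{
    "mappings": {
        "document": {
            "properties": {
                "text": { "type": "string" }
            }
        },
        "userschild": {
            "_parent": { "type": "document" },
            "properties": {
                "useracls": { "type": "string", "index": "not_analyzed" }
            }
        },
        "groupschild": {
            "_parent": { "type": "document" },
            "properties": {
                "groupacls": { "type": "string", "index": "not_analyzed" }
            }
        }
    }
}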
The query itself will look as follows.
"filter" : {
"query" : {
"bool": {
"should": [
{
"has_child": {
"type": "userschild",
"query": {
"term": {
"users": "1234"
}
}
}
},{
"has_child": {
"type": "groupschild",
"query": {
"terms" : {
"groups" : {
"index" : "users",
"type" : "user",
"id" : "1234",
"path" : "groups"
}
}
}
}
}
]
}
}
}
I have doubts with regards to its scalability owing to the nature of the query involved. It involves two queries, one of which has to be built from a separate index via a terms lookup. I am considering improving the terms lookup by using fields with doc values enabled.
Will approach 2 scale? The concerns I have are around the has_child query and its scalability.
Could someone clarify my understanding in this regard?

I think perhaps this is overcomplicated by expanding groups before querying. How about leaving group identifiers intact in the documents index instead?
Typically, I'd represent this in two indices (no parent-child, or any sort of nested relationships at all).
Users Index
(sample doc)
{
    "user_id": 12345,
    "user_name": "Swami PR",
    "user_group_ids": [900, 901, 902]
}
Document Index
(sample doc)
{
    "doc_id": 98765,
    "doc_name": "Lunch Order for Tuesday - Top Secret and Confidential",
    "doc_acl_read_users": [12345, 12346, 12347],
    "doc_acl_write_users": [12345],
    "doc_acl_read_groups": [435, 620],
    "doc_acl_write_groups": []
}
That Users Index could just as easily be in a database... your app just needs Swami's user_id and group_ids available when querying for documents.
Then, when you query for [Top Secret] documents as Swami PR (to read), make sure to add:
"should": [
{
"term": {
"doc_acl_read_users": 12345
}
},
{
"terms": {
"doc_acl_read_groups": [900, 901, 902]
}
},
"minimum_should_match": 1
]
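For completeness, a sketch of how that clause could sit inside a full search request (the index name "documents" and the match on doc_name are only for illustration):
GET documents/_search
{
    "query": {
        "bool": {
            "must": [
                { "match": { "doc_name": "lunch order" } }
            ],
            "filter": {
                "bool": {
                    "should": [
                        { "term": { "doc_acl_read_users": 12345 } },
                        { "terms": { "doc_acl_read_groups": [900, 901, 902] } }
                    ],
                    "minimum_should_match": 1
                }
            }
        }
    }
}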
I can see 2 main types of updates that can happen here:
Users or Groups updated on Document := reindex one record in Document Index
User added to/removed from Group := reindex one record in User Index
With an edge case:
User or Group deleted
Okay, here you might want to batch through and re-index all documents periodically to clean out stale user/group identifiers... but theoretically, stale user/group identifiers won't exist in the application anymore, so they don't cause issues in the index.
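A rough sketch of such a periodic cleanup with _update_by_query (the index name "documents", the stale id, and the exact ACL fields touched are assumptions; the script only removes the stale id from the read/write user arrays):
POST documents/_update_by_query
{
    "script": {
        "lang": "painless",
        "source": "ctx._source.doc_acl_read_users.removeIf(id -> id == params.stale_id); ctx._source.doc_acl_write_users.removeIf(id -> id == params.stale_id)",
        "params": { "stale_id": 12346 }
    },
    "query": {
        "bool": {
            "should": [
                { "term": { "doc_acl_read_users": 12346 } },
                { "term": { "doc_acl_write_users": 12346 } }
            ],
            "minimum_should_match": 1
        }
    }
}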

I have implemented approach #2 at my company, and I am now researching a graph DB to handle ACLs. Once we crossed 4 million documents for one of our clients, the updates and search queries became quite frequent and didn't scale as expected.
My suggestion is to look into graph frameworks to solve this.

Related

How to model this multi-tenancy usecase?

We are a multi-tenant platform.
The platform has a construct called Entity.
Users can create entities to model any real-life object eg: Customers, Orders, Payment, Inventory, Cart, pretty much anything.
Each entity will have its set of attributes, for example, a customer entity can have: name, email, phone, address (can be another nested entity), etc.
The requirement is to provide query/OLAP capabilities on these entities. For example, find all customers where name = 'john'.
The requirement includes all types of queries such as DATE RANGE, CONTAINS, LIKE, NUMERIC RANGE, FULL-TEXT Queries, etc. We also need Sorting, Aggregation, Pagination features.
Current design
We use elasticsearch to store entity data.
Each tenant is assigned a separate index.
When an entity is created in a tenant, the corresponding mappings are created inside the associated index. The mappings have roughly the following form:
{
    "properties": {
        "Customer": {
            "properties": {
                "name": {
                    "values": {
                        "type": "text"
                    }
                },
                "age": {
                    "values": {
                        "type": "integer"
                    }
                }
            }
        },
        "Order": {
            "properties": {
                "id": {
                    "values": {
                        "type": "text"
                    }
                },
                "eta": {
                    "values": {
                        "type": "integer"
                    }
                }
            }
        }
        //... other entities of this tenant
    }
}
Major Problems with this design
Ever-growing mappings.
Frequent updates to mappings, which keep the nodes busy circulating cluster state updates, leading to search/indexing latencies and occasional timeouts.
Existing mappings can't be altered if required; we have to go through the entire re-index procedure.
The current design was able to serve us for a few years until recently when the issues started popping up.
What would be a good design to model the above multi-tenancy requirement? Which database solution and schema modeling will be appropriate?
If you decide to stick with ES, you'll need to give your tenants more than one index, i.e. one index per tenant/entity instead of one index per tenant. The reasons for this are exactly the ones you mentioned, i.e. ever-growing mapping and the difficulty to update existing mappings.
You'll end up with more indexes for sure (N tenants x M entities), and the challenge will be to properly size those indexes in terms of how many primary shards each one needs. But I think sizing is even more difficult now that all entities are stored in a single index per tenant, so it will turn out to be easier to fan out each tenant index into several.
Another option is to come up with a very generic mapping, which only contains typed fields like int_field_1, int_field_2, text_field_1, text_field_2, etc, and you keep a per-tenant mapping between the generic field names and the tenant specific field names:
name -> text_field_1
age -> int_field_1
...
That way you have fewer mappings to manage and it's more flexible in terms of what kind of data you can accommodate, but it comes at the price of keeping the above field mapping up to date.
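A minimal sketch of what such a generic mapping could look like (the index name and the exact set of generic fields are assumptions):
PUT tenant-acme-entities
{
    "mappings": {
        "properties": {
            "entity_type": { "type": "keyword" },
            "text_field_1": { "type": "text" },
            "text_field_2": { "type": "text" },
            "int_field_1": { "type": "integer" },
            "int_field_2": { "type": "integer" },
            "date_field_1": { "type": "date" }
        }
    }
}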
In any case, you need to end up with more indexes for your tenants in order to make it easier to manage their mappings and keep their size at bay. It will also allow you to scale better, because it's easier to spread several smaller indices over your data nodes than a few very big ones, especially given the newer sizing recommendations made by Elastic.

Filtering collapsed results in Elasticsearch

I have an elasticsearch index containing documents that represent entities at a given point in time. When an entity changes state, a new document is created with a timestamp. When I need to get the current state of all entities, I can do the following:
GET https://127.0.0.1:9200/myindex/_search
{
    "collapse": {
        "field": "entity_id"
    },
    "sort": [{
        "timestamp": {
            "order": "desc"
        }
    }]
}
However, I would like to further filter the result of the collapse. When entities are deleted I create a new document that includes an is_deleted flag along with the timestamp in a nested metadata field. I would like to extend the above query to entirely filter out those entities that have been deleted. Using a term filter on entity_metadata.is_deleted: true obviously does not work, because then my result just includes the last document with that entity_id before it got marked as deleted. How can I filter my results after the collapse is done to exclude any tombstoned entities?
What I would suggest is that instead of adding an is_deleted flag, you could add a deleted_date field with the date of the deletion to all documents of that entity; then, when you view a document, given its date and the deleted_date, you'd know whether the document was LIVE or deleted at that date.
In addition, it would allow you to consider:
all documents that don't have a deleted_date field (i.e. not deleted) and
all documents that have a deleted_date before/after a given date.
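A minimal sketch of the resulting query, assuming the deleted_date field has been written to every document of a deleted entity:
GET myindex/_search
{
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "deleted_date"
                }
            }
        }
    },
    "collapse": {
        "field": "entity_id"
    },
    "sort": [{
        "timestamp": {
            "order": "desc"
        }
    }]
}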

Logstash -> Elasticsearch - update denormalized data

Use case explanation
We have a relational database with data about our day-to-day operations. The goal is to allow users to search the important data with a full-text search engine. The data is normalized and thus not in the best form to make full-text queries, so the idea was to denormalize a subset of the data and copy it in real-time to Elasticsearch, which allows us to create a fast and accurate search application.
We already have a system in place that enables Event Sourcing of our database operations (inserts, updates, deletes). The events only contain the changed columns and primary keys (on an update we don't get the whole row). Logstash already gets notified for each event, so this part is already handled.
Actual problem
Now we are getting to our problem. Since the plan is to denormalize our data we will have to make sure updates on parent objects are propagated to the denormalized child objects in Elasticsearch. How can we configure logstash to do this?
Example
Let's say we maintain a list of Employees in Elasticsearch. Each Employee is assigned to a Company. Since the data is denormalized (for the purpose of faster search), each Employee also carries the name and address of the Company. An update changes the name of a Company - how can we configure Logstash to update the company name in all Employees assigned to that Company?
Additional explanation
#Darth_Vader:
The problem we are facing is, that we get an event that a Company has changed, but we want to modify documents of type Employee in Elasticsearch, because they carry the data about the company in itself. Your answer expects that we will get an event for every Employee, which is not the case.
Maybe this will make it clearer. We have 3 employees in Elasticsearch:
{type:'employee',id:'1',name:'Person 1',company.cmp_id:'1',company.name:'Company A'}
{type:'employee',id:'2',name:'Person 2',company.cmp_id:'1',company.name:'Company A'}
{type:'employee',id:'3',name:'Person 3',company.cmp_id:'2',company.name:'Company B'}
Then an update happens in the source DB.
UPDATE company SET name = 'Company NEW' WHERE cmp_id = 1;
We get an event in logstash, where it says something like this:
{type:'company',cmp_id:'1',old.name:'Company A',new.name:'Company NEW'}
This should then be propagated to Elasticsearch, so that the resulting employees are:
{type:'employee',id:'1',name:'Person 1',company.cmp_id:'1',company.name:'Company NEW'}
{type:'employee',id:'2',name:'Person 2',company.cmp_id:'1',company.name:'Company NEW'}
{type:'employee',id:'3',name:'Person 3',company.cmp_id:'2',company.name:'Company B'}
Notice that the field company.name changed.
I suggest a similar solution to what I've posted here, i.e. to use the http output plugin in order to issue an update by query call to the Employee index. The query would need to look like this:
POST employees/_update_by_query
{
    "script": {
        "source": "ctx._source.company.name = params.name",
        "lang": "painless",
        "params": {
            "name": "Company NEW"
        }
    },
    "query": {
        "term": {
            "company.cmp_id": "1"
        }
    }
}
So your Logstash config should look like this:
input {
    ...
}
filter {
    mutate {
        add_field => {
            "[script][lang]" => "painless"
            "[script][source]" => "ctx._source.company.name = params.name"
            "[script][params][name]" => "%{new.name}"
            "[query][term][company.cmp_id]" => "%{cmp_id}"
        }
        remove_field => ["host", "@version", "@timestamp", "type", "cmp_id", "old.name", "new.name"]
    }
}
output {
    http {
        url => "http://localhost:9200/employees/_update_by_query"
        http_method => "post"
        format => "json"
    }
}

How to exclude large number of IDs from an Elastic Search query

I'm working on an app similar to Tinder. In Elasticsearch I have a collection of about half a million users and their locations. Whenever a user opens the app to search for nearby users, I run an Elasticsearch query over that collection. The query is fairly complex: it takes into consideration not only the location but also how active the user is or how many photos he has.
What I struggle with is how to exclude from the query those users the current user has already swiped through. A naive way to implement this would probably be to maintain a nested array of user IDs as part of every user document in the index and exclude based on that. But as every user does tens of thousands of swipes, that array could potentially grow super big, so it's not a scalable solution.
Is there a way to exclude a large number of entities from an Elasticsearch query based on their IDs which does not hurt performance?
Use the lookup feature of the Terms query: Terms lookup mechanism
When it’s needed to specify a terms filter with a lot of terms it can be beneficial to fetch those term values from a document in an index. A concrete example would be to filter tweets tweeted by your followers. Potentially the amount of user ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
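For illustration, a sketch of a terms lookup used for this exclusion (the "swipes" index holding one document per user with a swiped_user_ids field, and the user_id field on the user documents, are all assumptions; the pre-2.x filtered syntax matches the snippet below):
{
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must_not": [
                        {
                            "terms": {
                                "user_id": {
                                    "index": "swipes",
                                    "type": "swipe",
                                    "id": "current_user_id",
                                    "path": "swiped_user_ids"
                                }
                            }
                        }
                    ]
                }
            }
        }
    }
}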
You can try adding the ids filter into a bool/must_not clause of your complex query and see how it behaves.
{
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        ... <--- your other "must" constraints
                    ],
                    "must_not": [
                        {
                            "ids": {
                                "values": [ "id1", "id2", "id3" ]   <--- your list of ids to exclude
                            }
                        }
                    ]
                }
            }
        }
    }
}

Elasticsearch with multiple parent/child relationship

I'm building an application with a complicated model, say Book, User and Review.
A Review contains both Book and User id.
To be able to search for Books that contain at least one review, I've set the Book as the Review's parent and have routing set up accordingly. However, I also need to find Users who wrote reviews that contain certain phrases.
Is it possible to have both the Book and User as Review's parent? Is there a better way to handle such situation?
Note that I'm not able (or willing) to change the way the data is modeled, because the data is transferred to Elasticsearch from a persistence database.
As far as I know you can't have a document with two parents.
My suggestion based on Application-side join chapter of Elasticsearch the definitive guide:
Create a parent/child relationship Book/Review
Be sure you have a user_id property in the Review mapping which contains the id of the user who wrote that review.
I think that covers both use cases you described, as follows:
Books that contain at least one review: can be solved with a has_child filter/query.
Users who wrote reviews that contain certain phrases: can be solved by querying reviews for the phrase you want and aggregating on the user_id field (see the sketch below). If you need user information you have to query your database (or another Elasticsearch index) with the ids retrieved.
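A sketch of that second case (the reviews index name and the review_text field are assumptions; a terms aggregation is used here because it returns the distinct reviewer ids, whereas a cardinality aggregation would only give their count):
GET reviews/_search
{
    "size": 0,
    "query": {
        "match_phrase": {
            "review_text": "certain phrase"
        }
    },
    "aggs": {
        "reviewers": {
            "terms": { "field": "user_id", "size": 1000 }
        }
    }
}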
Edit: "give me the books that have reviews this month written by user whose name started with John"
I recommend you collect all those advanced use cases and denormalize the data you need to achieve them. In this particular case it's enough to denormalize the user name into Review. In any case, the Elasticsearch people have written about managing relations in their blog and in Elasticsearch: The Definitive Guide.
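As a sketch of that use case once the user name is denormalized into Review (the Review type name and the created_at/user_name fields are assumptions):
GET /index/books/_search
{
    "query": {
        "has_child": {
            "type": "Review",
            "query": {
                "bool": {
                    "must": [
                        { "range": { "created_at": { "gte": "now-1M" } } },
                        { "prefix": { "user_name": "john" } }
                    ]
                }
            }
        }
    }
}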
Something like this (just make the Books type the parent of the Users and Reviews types):
.../index/users/_search?pretty" -d '
{
    "query": {
        "filtered": {
            "filter": {
                "and": [
                    {
                        "has_parent": {
                            "parent_type": "books",
                            "filter": {
                                "has_child": {
                                    "type": "Reviews",
                                    "query": {
                                        "term": {
                                            "text_review": "some word"
                                        }
                                    }
                                }
                            }
                        }
                    }
                ]
            }
        }
    }
}
'
You have two options
Elasticsearch Nested Objects
Elasticsearch parent&child
Both are compared and evaluated nicely here.
