Elastic search per user access control to document - elasticsearch

I'm using ElasticSearch 7.1.1 as a full-text search engine. At the beginning all the documents are accessible to every user. I want to give users the possibility to edit documents. The modified version of the document will be accessible only to the editor and everyone else will only be able to see the default document.
To do this I will add two arrays to every document:
An array of users excluded from seeing the document
An array containing the only user who can see the document
Every time someone edits a document I will:
Add the user who made the edit to the excluded users list
Create a document containing the edit, available only to that user.
This way in the index I'll have three types of documents:
Documents accessible to everyone
Documents accessible to everyone except some users
Documents accessible only to a specific user
I use ElasticSearch not only to fetch documents but also to calculate live aggregations (e.g. sums of some field), so at query time I need to be able to fetch the user-specific documents.
I don't expect a lot of edits, less than 1% of the total documents.
Is there a smarter, less query-intensive way to obtain the same result?
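For reference, the visibility rule described above could be expressed as a single bool filter. This is only a minimal sketch: the field names `excluded_users` and `only_users` are hypothetical, both are assumed to be mapped as keyword fields, and a document with an empty or missing `only_users` array is assumed to mean "no restriction".

```python
from typing import Any, Dict


def visibility_filter(user_id: str) -> Dict[str, Any]:
    """Build a bool filter for the documents `user_id` is allowed to see.

    `excluded_users` and `only_users` are hypothetical keyword fields.
    """
    return {
        "bool": {
            # Hide the default document from users who have their own edited copy.
            "must_not": [{"term": {"excluded_users": user_id}}],
            "should": [
                # Private copy that belongs to this user ...
                {"term": {"only_users": user_id}},
                # ... or a document with no `only_users` restriction at all.
                {"bool": {"must_not": [{"exists": {"field": "only_users"}}]}},
            ],
            "minimum_should_match": 1,
        }
    }


def search_body(user_id: str, text_query: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a scoring query so the visibility rule runs in filter context."""
    return {
        "query": {
            "bool": {
                "must": [text_query],
                "filter": [visibility_filter(user_id)],
            }
        }
    }
```

Because aggregations are computed over the documents matched by the query, adding an `aggs` section to the same body would aggregate only over the documents visible to that user.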

You could implement document level security.
With that you can define roles that restrict read access to the documents that match a query (e.g. a query on the document ID).
So instead of updating the documents each time via your proposed array solution, you would update the role and grant it to the particular users. This would of course require that every user of your application has a corresponding Elasticsearch user.
As far as I know, this feature is the only way Elasticsearch offers "out of the box" to fulfill your requirements.
I hope this helps.
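As a rough illustration, the body of such a role might be built like this. It is a sketch under assumptions: the index name and document IDs are made up, the role is assumed to be created through the security role API (`PUT /_security/role/<role name>`), and an `ids` query is assumed to be acceptable as the role query.

```python
import json
from typing import Any, Dict, List


def role_body(index: str, doc_ids: List[str]) -> Dict[str, Any]:
    """Build a role definition granting read access to specific documents only.

    The index name and IDs are illustrative, not taken from the question.
    """
    # The role query is passed here as a JSON-encoded string.
    role_query = json.dumps({"ids": {"values": doc_ids}})
    return {
        "indices": [
            {
                "names": [index],
                "privileges": ["read"],
                "query": role_query,
            }
        ]
    }


if __name__ == "__main__":
    # Print the body that would be sent when creating or updating the role.
    print(json.dumps(role_body("documents", ["doc-1", "doc-2"]), indent=2))
```

Each edit would then mean updating the list of IDs in the editor's role rather than touching the documents themselves.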

Related

How to model shared folders in ElasticSearch or SOLR?

Popular search engines are quite performant at full-text search and in many other respects. However, I am not sure how to map the security policies of the main document storage system to ES and/or SOLR.
Consider Google Drive and its folders. Users can share any folder, and then the files and folders below it are shared as well. Content management systems use something similar.
But how to map that to the external search engines (that is, not built-in to application's content management system), especially, if there are millions of documents in many tens of thousands of folders, tens of thousands of users? Will it help if, for example, depth (nestedness) of the folders is limited to some small number?
I know ES has user roles, but I can't see how they can help here, because access is granted more or less arbitrarily. Another approach is to somehow materialize user access in the documents (folders and documents) themselves, but then a change in users' roles local to some folder will result in changing many thousands of documents.
Also, searches can be quite arbitrary and return long result lists, so pagination is needed. That means fetching "everything" and then sorting out user access on the application side is not an option.
I believe the scenario described is quite common, but I can't find any hints on how to implement it.
I have used Solr as a search engine, with Solr's Data Import Handler (DIH) feature to import data from the database into Solr.
I would suggest indexing the ACLs along with the documents.
I took this approach and it has been working fine so far.
I agree that you have to re-index data on the Solr side whenever folder access or document-level access changes. But we already need to re-index a document when its metadata or content changes; similarly, we can update the documents on the Solr side for any change in the ACL (Access Control List).
Why index the ACL along with the document information?
Because whenever a user searches for a document, you can pass the user's ACL as part of the query, in the form of a filter query, and get back only the documents that are accessible to that user.
I feel this removes the complexity of applying the ACL logic on the back-end side.
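To make the filter-query idea concrete, here is a minimal sketch of how the request parameters could be assembled. The multivalued field name `acl` and the principal format (`user:42`, `group:sales`) are assumptions for illustration, not something prescribed by Solr.

```python
from typing import Dict, Iterable


def _quote(value: str) -> str:
    """Wrap a principal in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def acl_search_params(
    user_query: str, principals: Iterable[str], start: int = 0, rows: int = 10
) -> Dict[str, str]:
    """Build Solr request parameters with the user's principals as a filter query.

    `acl` is a hypothetical multivalued field holding the users and groups
    that are allowed to see each document.
    """
    tokens = [_quote(p) for p in principals]
    if not tokens:
        # A user without any principal must not fall back to an unfiltered search.
        raise ValueError("at least one principal is required")
    return {
        "q": user_query,
        "fq": "acl:(" + " OR ".join(tokens) + ")",
        # Paging happens after filtering, so every page contains only visible hits.
        "start": str(start),
        "rows": str(rows),
    }
```

Since the filter is applied inside Solr, `start` and `rows` page over documents the user can actually see, which is what the question asks for.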
If you don't index the ACL in Solr, then you have to filter the documents after retrieving them from Solr, by checking each document ID against whatever ACL logic applies.
The last option would be to index the documents without ACLs and let the user search all of them. When the user tries to perform an action on a document, you check the permission and either allow the action or deny it with a message saying they don't have sufficient permission to access the document.
An action could be View, Download, Update, etc.
You need to decide which approach suits your case.

How to show only results that user has access to

The database structure of my Python application is very similar to Instagram's. I have users and posts, and users can follow each other. There are public and private accounts.
I am indexing this data in ElasticSearch and searching works fine so far. However, the search returns all posts, without filtering on whether the user has access to them (e.g. a post created by a user with a private account whom the current user isn't following).
My data in ElasticSearch is indexed simply across several indexes in a flat format, one index for users, one for posts.
I can post-process the results that ElasticSearch returns and remove the posts that the current user doesn't have access to, but this introduces an additional database query to retrieve that user's followers list, and possibly the blocklist (I don't want to show posts to users who have blocked each other either).
I can also add list of follower IDs for each user to ElasticSearch upon indexing and then match against them, but in case where user has thousands of followers, these lists will be huge, and I am not sure how convenient it will be to keep them in ElasticSearch.
How can I efficiently do this? My stack is backend Python + Flask, PostgreSQL database and ElasticSearch as search index.
Maybe you already found a solution...
Elasticsearch's "terms lookup" can solve this problem if you have an index holding the list of followers to filter on, as you suggested here:
"I can also add list of follower IDs for each user to ElasticSearch upon indexing and then match against them, but in case where user has thousands of followers, these lists will be huge, and I am not sure how convenient it will be to keep them in ElasticSearch."
More details in the doc:
https://www.elastic.co/guide/en/elasticsearch/reference/7.5/query-dsl-terms-query.html#query-dsl-terms-lookup
Note that there is a default limit of 65,536 terms (it can be overridden with the index.max_terms_count index setting), so unless your service has millions of users the default limit will be fine.
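A minimal sketch of such a query follows. The names are assumptions: a hypothetical `follows` index holding one document per user (with the user ID as document ID and a `following` array of author IDs), and hypothetical `author_id` and `author_is_private` fields on each post.

```python
from typing import Any, Dict


def visible_posts_query(user_id: str, text_query: Dict[str, Any]) -> Dict[str, Any]:
    """Search posts that are public or written by someone `user_id` follows.

    Index and field names are hypothetical.
    """
    # Terms lookup: Elasticsearch fetches the `following` array from the
    # document with ID `user_id` in the `follows` index at query time.
    followed_authors = {
        "terms": {
            "author_id": {"index": "follows", "id": user_id, "path": "following"}
        }
    }
    public_posts = {"term": {"author_is_private": False}}
    return {
        "query": {
            "bool": {
                "must": [text_query],
                "filter": [
                    {
                        "bool": {
                            "should": [public_posts, followed_authors],
                            "minimum_should_match": 1,
                        }
                    }
                ]
            }
        }
    }
```

With this layout the follower list lives in one small document per user instead of being copied onto every post, so a follow or unfollow updates a single document.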

separating data access with elasticsearch

I'm just getting to know elasticsearch and I'm wondering if it suits my case at all:
Consider a system where companies (with multiple employees) can register, administer their clients, and send documents to their clients.
Now, I want to enable companies to search their documents - but ONLY theirs, not the documents of other companies. In other words: how to separate the data of those companies for searches? How can this be implemented with elasticsearch?
Is this separation to be handled by elasticsearch itself? I.e. there is some mapping between the companies in my system and a related user for elasticsearch.
Or is this to be handled by the backend of my system? I.e. the backend somehow decides (how?) to show only search results for that particular company. So there would be just one user, namely the backend of my system, that accesses and filters the results of elasticsearch. But is this sensible?
I'm sure there is a wealth of information about this out there. Please just give me a hint, because I don't know what to search for. Searches for elasticsearch authentication/authorization, for example, only yield results about who gains access to the search system in general - not about a pattern to solve this separation.
Thanks in advance!
Elasticsearch on its own does not support authorization and authentication; you need to add this via plugins, of which there are two that I know of. Shield is the official solution; it is part of X-Pack, and you need to pay Elastic if you want to use it. SearchGuard is an open source alternative with enterprise upgrades that you can buy.
Both of these enable you to define fine grained access rights for different users. What you'd probably want to do is give every company an index of their own for their documents and then restrict their user to only be able to read/write that index. Or if you absolutely want all documents in one index, you can add document level restrictions as well, so that everybody queries the same index but only gets results returned for their company. Depending on how many companies you expect to service this might make more sense in order to not have too many indices and shards, but I'd suspect that an index per company would be the best way to go.
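The two layouts can be sketched from the backend's point of view as follows. This is only an illustration: the `documents-` index prefix and the `company_id` field are hypothetical, and the access restrictions themselves would still have to be configured in whichever security plugin is used.

```python
import re
from typing import Any, Dict


def company_index(company_id: str) -> str:
    """Derive a per-company index name (index names must be lowercase)."""
    slug = re.sub(r"[^a-z0-9_-]", "-", company_id.lower())
    return f"documents-{slug}"


def company_scoped_query(company_id: str, user_query: Dict[str, Any]) -> Dict[str, Any]:
    """Single shared index: restrict every search to one company's documents.

    `company_id` is a hypothetical keyword field stored on each document.
    """
    return {
        "query": {
            "bool": {
                "must": [user_query],
                # The backend always adds this filter; it never comes from the client.
                "filter": [{"term": {"company_id": company_id}}],
            }
        }
    }
```

In the index-per-company layout the backend searches only `company_index(...)`; in the shared-index layout it sends `company_scoped_query(...)` for every request.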
Without these plugins you would need to resort to something on the HTTP layer, for example an nginx reverse proxy that filters requests based on the index names contained in the URLs, but I'd strongly advise against this; lots of pain lies that way!

Searching through user favorites with Elasticsearch

We're using Elasticsearch to index about 100,000 documents. Users can favorite items and currently we're using Django Haystack's RelatedSearchQuerySet to look through the favorites table, but this executes a lot of SQL queries to filter a subset of those documents, which makes searching through a user's favorites obscenely slow.
To speed things up, I thought about adding a multivalue field to each document (e.g. favorited_by) and storing users' primary keys in it, but since users can favorite thousands of items, popular documents will become large.
Searching through user's favorites seems like a solved problem. How is it done?
In Elasticsearch I would store each favorite together with the username it belongs to, then add a filter on the username to the query. You could even make the username a multivalue field, though I'm not sure that would scale with thousands of users. The advantage would be a single favorite document per website when multiple users create the same favorite. But I think I would keep one favorite per user.
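A sketch of the "one favorite per user" variant is below. The field names `username`, `item_id` and `title` are hypothetical, and `username` is assumed to be a keyword field.

```python
from typing import Any, Dict


def favorite_doc(username: str, item_id: str, title: str) -> Dict[str, Any]:
    """One small document per (user, item) pair; field names are hypothetical."""
    return {"username": username, "item_id": item_id, "title": title}


def favorites_query(username: str, text: str) -> Dict[str, Any]:
    """Full-text search restricted to a single user's favorites."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                # Filter context: no scoring, and the result is cacheable per user.
                "filter": [{"term": {"username": username}}],
            }
        }
    }
```

The cost of this layout is that the searchable fields of an item are duplicated into every favorite document, in exchange for never growing a `favorited_by` list on popular items.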

ElasticSearch separate index per user

I'm wondering if having thousands of different indexes is a bad idea?
I'm adding a search page to my web app based on ElasticSearch. The search page lets users search for other users on the site by filtering on a number of different indexed criteria (name, location, gender etc). This is fairly straightforward and will require just one index that contains a document for every user of the site.
However, I want to also create a page where users can see a list of all of the other users they follow. I want this page to have the same filtering options that are available on the search page. I'm wondering if a good way to go about this would be to create a separate index for each user containing documents for each user they follow?
While you can certainly create thousands of indices in elasticsearch, I don't really see the need for it in your use case. I think you can use one index. Simply create an additional child type followers for the main user record. Every time user A follows user B, create a child record of B with the following content: {"followed_by" : "A"}. To get the list of users that the current user is following, you can simply add a Has Child Filter to your query.
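As a rough sketch, the request body for that lookup could be built as shown below, written here with the `has_child` query. The child type name `followers` and the `followed_by` field come from the answer above; the extra filters are whatever the search page already uses. How the parent/child relation is declared in the mapping depends on the Elasticsearch version, so that part is left out.

```python
from typing import Any, Dict, List


def following_query(current_user: str, page_filters: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Find user records that have a `followers` child created by `current_user`."""
    followed_by_me = {
        "has_child": {
            "type": "followers",
            "query": {"term": {"followed_by": current_user}},
        }
    }
    return {
        "query": {
            "bool": {
                # Same filters as the general search page, plus the follow relation.
                "filter": [*page_filters, followed_by_me]
            }
        }
    }
```

For example, `following_query("A", [{"term": {"location": "Berlin"}}])` would list the users in Berlin that A follows, all from the one shared index.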
I would like to add to Igor's answer that creating thousands of indexes on a tiny cluster (one or two nodes) can have some drawbacks.
Each shard of an index is a full Lucene instance. That means you will have many open files (probably too many open files) if you have a single node (or a cluster that is small in terms of nodes).
That's one of the major reasons why I would not define too many indices...
See also the File descriptors section of the installation guide.

Resources