How to model shared folders in ElasticSearch or SOLR? - elasticsearch

Popular search engines perform well at full-text search and in many other respects. However, I am not sure how to map the security policies of the main document storage system onto ES and/or SOLR.
Consider Google Drive and its folders. Users can share any folder, and then the files and folders below it are also shared. Content management systems use something similar.
But how do you map that onto an external search engine (that is, one not built into the application's content management system), especially if there are millions of documents in many tens of thousands of folders, and tens of thousands of users? Will it help if, for example, the depth (nestedness) of the folders is limited to some small number?
I know ES has user roles, but I can't see how they can help here, because access is granted more or less arbitrarily. Another approach is to somehow materialize user access in the documents (folders and documents) themselves, but then a change in users' roles, local to some folder, will result in changing many thousands of documents.
Also, searches can be quite arbitrary and lengthy, so pagination is needed. That means, for example, that fetching "everything" and then sorting out user access on the application side is not an option.
I believe the scenario described is quite common, but I can't find any hints on how to implement it.

I have used Solr as the search engine, with Solr's Data Import Handler (DIH) feature for importing the data from the database into Solr.
I would suggest going with the approach of indexing the ACLs along with the documents.
I have taken this approach and it has been working fine so far.
I agree that you have to re-index the data on the Solr side whenever folder access or a document's access level changes. But we already need to re-index a document when its metadata or its content changes. Similarly, we can update the documents on the Solr side for any change in the ACL (Access Control List).
Why index the ACL along with the document information?
The reason is that whenever a user searches for a document, you can pass the user's ACL as part of the query, in the form of a filter query, and get back only the documents that are accessible to that user.
I feel this removes the complexity of applying the ACL logic on the back-end side.
If you don't index the ACL in Solr, then you have to filter out documents after you retrieve them from Solr, by checking each document ID against whatever ACL logic applies.
The last option is to index the documents without ACLs and let the user search all of them. When the user tries to perform an action on one of those documents, you check the permission and either allow the action or deny it with a message saying the user does not have enough permission to access the document.
Actions could be View, Download, Update, etc.
You need to decide which approach suits your case.
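As a rough illustration of the filter-query approach above, here is a minimal Python sketch that turns a user's principals into a Solr `fq` value. The field name `acl` and the `user:` / `group:` principal format are assumptions made for this example; they are not from the answer.

```python
def build_acl_filter(principals, field="acl"):
    """Build a Solr filter query matching documents whose ACL field
    contains at least one of the given principals.

    `field` and the principal naming scheme are hypothetical.
    """
    if not principals:
        # Refuse to build a filter that would not restrict anything.
        raise ValueError("user has no principals")
    quoted = []
    for principal in principals:
        # Quote each value so characters such as ':' are taken literally.
        escaped = principal.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{field}:({' OR '.join(quoted)})"


# Request parameters for a paginated search restricted to what the user may see.
params = {
    "q": "budget report",
    "fq": build_acl_filter(["user:alice", "group:finance"]),
    "start": 0,
    "rows": 20,
}
```

Because the restriction is applied inside the engine, `start` and `rows` paginate over permitted documents only, which is what the question asks for.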

Related

How to implement search over shared documents

In our application we allow users to like and share documents. We would like to be able to perform an Elasticsearch query over all documents relevant to a specific user (liked and shared documents) but without storing any authorization fields in Elasticsearch (since updating these would be pretty slow).
Are there any standard ways of architecting a solution for this?

Elastic search per user access control to document

I'm using ElasticSearch 7.1.1 as a full-text search engine. At the beginning all the documents are accessible to every user. I want to give users the possibility to edit documents. The modified version of the document will be accessible only to the editor and everyone else will only be able to see the default document.
To do this I will add two arrays to every document:
An array of users excluded from seeing the doc
An array with the only user that can see this doc
Every time someone edits a document I will:
Add the user that made the edit to the excluded users list
Create a document containing the edit, available only to that user.
This way I'll have three types of documents in the index:
Documents accessible to everyone
Documents accessible to everyone except some users
Documents accessible only to a specific user
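The visibility rule implied by these three document types can be sketched as an Elasticsearch `bool` filter. The field names `excluded_users` and `only_user` are hypothetical names for the two arrays; `is_visible` is a plain-Python restatement of the same rule, included only to make the logic easy to check.

```python
def visibility_filter(user_id):
    """Query DSL fragment selecting the documents `user_id` may see.

    Field names are assumptions for this sketch.
    """
    return {
        "bool": {
            "should": [
                # Private copy created for this user.
                {"term": {"only_user": user_id}},
                # Shared document: no owner set and the user is not excluded.
                {
                    "bool": {
                        "must_not": [
                            {"exists": {"field": "only_user"}},
                            {"term": {"excluded_users": user_id}},
                        ]
                    }
                },
            ],
            "minimum_should_match": 1,
        }
    }


def is_visible(doc, user_id):
    """Same rule evaluated on a document represented as a dict."""
    only_user = doc.get("only_user")
    if only_user:
        return user_id in only_user
    return user_id not in doc.get("excluded_users", [])
```

Placing this fragment in the `filter` clause of the search request would apply it to both the hits and any aggregations computed over them.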
I use ElasticSearch not only to fetch documents but also to calculate live aggregations (e.g. sums of some field), so I need to be able to select the user-specific documents at query time.
I don't expect a lot of edits, less than 1% of the total documents.
Is there a smarter, and less query intensive, way to obtain the same results?
You could implement document level security.
With that you can define roles that restrict read access to the documents that match a query (e.g. you could use the id of the document).
So instead of updating the documents each time via your proposed array solution, you would update the role, or grant the role to the particular users. This would of course require that every user has an Elasticsearch user.
As far as I know, this feature is the only way to fulfill your requirements that Elasticsearch brings to the table "out of the box".
I hope I could help you.
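For illustration, a role restricted by a query might be built like the sketch below. It assumes the role is created through the security role API (`PUT /_security/role/<name>` in the 7.x line) and that document level security is available in your installation; the index name and document ids are made up.

```python
import json


def dls_role_body(index, doc_ids):
    """Build a role definition granting read access only to the listed ids.

    `index` and `doc_ids` are hypothetical inputs for this sketch.
    """
    query = {"ids": {"values": sorted(doc_ids)}}
    return {
        "indices": [
            {
                "names": [index],
                "privileges": ["read"],
                # The restricting query is passed as a JSON string.
                "query": json.dumps(query),
            }
        ]
    }


body = dls_role_body("documents", {"doc-42", "doc-7"})
```

Each edit would then mean rewriting one role rather than touching the documents, at the cost of managing one Elasticsearch user and role per application user.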

separating data access with elasticsearch

I'm just getting to know elasticsearch and I'm wondering if it suits my case at all:
Consider a system where companies (with multiple employees) can register, administer their clients, and send documents to their clients.
Now, I want to enable companies to search their documents - but ONLY theirs, not the documents of other companies. In other words: how to separate the data of those companies for searches? How can this be implemented with elasticsearch?
Is this separation to be handled by elasticsearch itself? I.e. there is some mapping between the companies in my system and a related user for elasticsearch.
Or is this to be handled by the backend of my system? I.e. the backend somehow decides (how?) to show only search results for that particular company. So there would be just one user, namely the backend of my system, that accesses and filters the results of elasticsearch. But is this sensible?
I'm sure there is a wealth of information about this out there. Please just give me a hint, because I don't know what to search for. Searches for elasticsearch authentication/authorization, for example, only yield results about who gains access to the search system in general - not about a pattern to solve this separation.
Thanks in advance!
Elasticsearch on its own does not support authorization and authentication; you need to add these via plugins, of which there are two that I know of. Shield is the official solution, which is part of the X-Pack, and you need to pay Elastic if you want to use it. SearchGuard is an open source alternative with enterprise upgrades that you can buy.
Both of these let you define fine-grained access rights for different users. What you'd probably want to do is give every company an index of its own for its documents and then restrict its user to reading and writing only that index. Or, if you absolutely want all documents in one index, you can add document level restrictions as well, so that everybody queries the same index but only gets results for their own company. Depending on how many companies you expect to serve, a single index might make more sense in order not to have too many indices and shards, but I'd suspect that an index per company would be the best way to go.
Without these plugins you would need to resort to something on the HTTP layer, for example an nginx reverse proxy that filters requests based on the index names contained in the URLs. I'd strongly advise against this; lots of pain lies that way!
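The other option raised in the question, where the backend is the only Elasticsearch client and scopes every search itself, can be sketched as follows. The field name `company_id` is an assumption; the important point is that its value comes from the authenticated session, never from the search form.

```python
def scoped_search(company_id, user_query, page=0, size=20):
    """Wrap an arbitrary user query so it can only match one company's documents.

    `company_id` is a hypothetical keyword field stored on every document.
    """
    return {
        "query": {
            "bool": {
                "must": [user_query],
                # Filter clauses do not affect scoring and can be cached.
                "filter": [{"term": {"company_id": company_id}}],
            }
        },
        "from": page * size,
        "size": size,
    }


request_body = scoped_search("acme", {"match": {"content": "invoice"}}, page=2)
```

With an index per company instead, the same helper would choose the index name from the session and drop the `filter` clause.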

Is Elasticsearch suitable as a final storage solution?

I'm currently learning Elasticsearch, and I have noticed that a lot of operations for modifying indices require reindexing all documents. Adding a field to all documents is one example, which from my understanding means retrieving each document, performing the desired operation, deleting the original document from the index and reindexing it. This seems somewhat dangerous, and a backup of the original index seems advisable before performing it (obviously).
This made me wonder whether Elasticsearch is actually suitable as a final storage solution at all, or whether I should keep the raw documents that make up an index stored separately, to be able to recreate the index from scratch if necessary. Or is a regular backup of the index safe enough?
You are talking about two issues here:
Deleting old documents and re-indexing on schema change: You don't always have to delete old documents when you add new fields. There are various options to change the schema. Have a look at this blog which explains changing the schema without any downtime.
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
Also, look at the Update API which gives you the ability to add/remove fields.
The update API allows you to update a document based on a provided script. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and indexes back the result (it can also delete the document or ignore the operation). It uses versioning to make sure no updates have happened between the "get" and the "reindex".
Note that this operation still means a full reindex of the document; it just removes some network roundtrips and reduces the chance of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.
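A scripted update that adds a field might use a request body like the one built below. This is a sketch assuming a recent Elasticsearch version with the Painless scripting language; the field name and value are made up, and the body would be sent to the update endpoint of the document in question.

```python
def add_field_update(field, value):
    """Build an Update API body that sets `field` to `value` on a document.

    Passing the name and value as params keeps the script source constant.
    """
    return {
        "script": {
            "lang": "painless",
            "source": "ctx._source[params.field] = params.value",
            "params": {"field": field, "value": value},
        }
    }


update_body = add_field_update("reviewed", False)
```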
Using Elasticsearch as a final storage solution at all: It depends on how you intend to use Elasticsearch as storage. Do you need an RDBMS, a key-value store, a column-based datastore or a document store like MongoDB? Elasticsearch is definitely well suited when you need a distributed document store (JSON, HTML, XML etc.) with Lucene-based advanced search capabilities. Have a look at the various use cases for ES, especially the usage at The Guardian: http://www.elasticsearch.org/case-study/guardian/
I'm pretty sure that search engines shouldn't be viewed as a storage solution, because of the nature of these applications. I've never heard of backing up a search engine's index as a practice.
The usual setup when you use ElasticSearch, Solr or any other search engine is:
You have some kind of datasource (it could be a database, a legacy mainframe, Excel sheets, some REST service with data, or whatever).
You have a search engine that indexes this datasource to add search capability to your system. When the datasource changes, you can reindex it, or index only the changed part with the help of incremental indexing.
If something happens to the search engine index, you can easily reindex all your data.
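Rebuilding the index from the datasource can be sketched as generating a body for the Elasticsearch bulk API from source rows. The row shape, the `id` column and the index name are assumptions; reading from the real datasource and sending the request are left out.

```python
import json


def bulk_index_body(index, rows, id_field="id"):
    """Render rows from the primary datasource as newline-delimited JSON.

    Each row becomes an action line followed by the document itself.
    Reusing the datasource id as _id makes a re-run overwrite, not duplicate.
    """
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index, "_id": str(row[id_field])}}))
        lines.append(json.dumps(row))
    # Every line of a bulk body, including the last, ends with a newline.
    return "".join(line + "\n" for line in lines)


rows = [{"id": 1, "title": "Q1 report"}, {"id": 2, "title": "Q2 report"}]
payload = bulk_index_body("documents", rows)
```

For incremental indexing, the same function could be fed only the rows changed since the last run.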

ElasticSearch separate index per user

I'm wondering if having thousands of different indexes is a bad idea?
I'm adding a search page to my web app based on ElasticSearch. The search page lets users search for other users on the site by filtering on a number of different indexed criteria (name, location, gender etc). This is fairly straightforward and will require just one index that contains a document for every user of the site.
However, I want to also create a page where users can see a list of all of the other users they follow. I want this page to have the same filtering options that are available on the search page. I'm wondering if a good way to go about this would be to create a separate index for each user containing documents for each user they follow?
While you can certainly create thousands of indices in elasticsearch, I don't really see the need for it in your use case. I think you can use one index. Simply create an additional child type followers for the main user record. Every time user A follows user B, create a child record of B with the following content: {"followed_by" : "A"}. To get the list of users that the current user is following, you can simply add a Has Child Filter to your query.
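The request described above might look like the following sketch. It uses the `has_child` query form; the child type `followers` and the field `followed_by` come from the answer, while the parent/child mapping itself and the extra filter are assumed.

```python
def following_query(current_user, extra_filters=()):
    """Find user records that have a follower child written by `current_user`.

    `extra_filters` carries the same criteria as the general search page.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "has_child": {
                            "type": "followers",
                            "query": {"term": {"followed_by": current_user}},
                        }
                    },
                    *extra_filters,
                ]
            }
        }
    }


body = following_query("A", [{"term": {"location": "berlin"}}])
```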
I would like to add to Igor's answer that creating thousands of indexes on a tiny cluster (one or two nodes) has some drawbacks.
Each shard of an index is a full Lucene instance. As a result, you will have many open files (probably too many) if you have a single node or a cluster with only a few nodes.
That's one of the major reasons why I would not define too many indices...
See also the File descriptors section of the installation guide.
