How to implement search over shared documents

How to implement search over shared documents - elasticsearch

In our application we allow users to like and share documents. We would like to be able to perform an Elasticsearch query over all documents relevant to a specific user (liked and shared documents) but without storing any authorization fields in Elasticsearch (since updating these would be pretty slow).
Are there any standard ways of architecting a solution for this?

Related

How to model shared folders in ElasticSearch or SOLR?

Popular search engines are quite performant when it comes to full text searches and many other aspects, however, I am not sure how to map the main document storage system security policies to ES and/or SOLR?
Consider Google Drive and it's folders. Users can share any folder - then files and folders below are also shared. Content management systems use something similar.
But how to map that to the external search engines (that is, not built-in to application's content management system), especially, if there are millions of documents in many tens of thousands of folders, tens of thousands of users? Will it help if, for example, depth (nestedness) of the folders is limited to some small number?
I know ES has user roles, but I can't see it can help here, because accesses are given more or less arbitrary. Another approach is to somehow materialize user access in the documents (folders and documents) themselves, but then changes in users' roles, local to some folder, will result in changing many thousands of documents.
Also, searches can be quite arbitrary and lengthy, so it is desired to have pagination, so, for example, fetching "everything" and then sorting out user access on application side is not an option.
I believe the scenario described is quite common, but I can't find any hints how to implement it.

I had used solr as search engine and solr's Data Import Handler (DIH) feature for importing the data from database to Solr.
I would suggest you to go with the approach of indexing the acl's along with the documents.
I had done the same approach and its working fine till now.
I agree that you have re-index the data on the solr side when there is any changes on folder access or change in the access of level of documents. We do need to re-index the document if the metadata of the document is changes or the content of the document is changes. Similarly we can also update the documents on the solr side for any changes in the ACL(Access Control List).
Why to index the ACL along with Document information.
The reason is whenever user search for a document, you can pass the user acl as part of the query in the form of filter query and get the documents which are accessible to user.
I feel this removes the complexity of applying the acl logic at the back end side.
If you dont index the ACL in solr, then you have to filter out the documents after you retrieve from solr by checking the document is and whatever the acl logic applies.
Or the last option could be index the document without acls. Let the user search all the documents. When he tries to perform any action on those documents then you can check the permission and allow the user to perform the action or deny the user saying you dont have enough permission to access the document.
Action could be like View, Download, Update etc..
You need to decide whichever approach suits and works out in your case.

Using elastic search for a UI dashboard behind a proxy

I am working on a search dashboard with full text search capabilities, backed by ES. The search would initially be consumed by a UI dashboard. I am planning to have an application web service (WS) api layer between the UI dashboard and ES which will route the business search to ES.
There can be multiple clients to WS going forward, each with its own business use cases, and complex data requirements (basically response fields). There are many entities and huge number of fields across them. Each client would need to specify what fields entities it wants to return with what fields.
To support this dynamically changing requirement, one approach could be to have the WS be a pass through to the ES (with pre validations like access control and post transformations to the response from ES). The WS APIs will look exactly like the ES APIs, the UI should build ES queries through JS client and send it to WS, which after access control will get data from ES.
I am new to ES and skeptic of this approach. Can there be any particular challenges in this approach. One of my colleague has worked on ES before but always with a backend Java client, so he's not too sure.
I looked up a ES Js client and there's an official one here.
Some Context here:
We have around 4 different entities (can increase in future) with both full text and keyword type fields. A typical search could have multiple filters and search terms and would want to specify the result fields. Also, some searches would be across entities and some to individual ones. We are maintaining a separate entity for each entity.

What I understand from your post is, below is what you want to achieve at high level.
There can be multiple clients to WS going forward, each with its own
business use cases, and complex data requirements (basically response
fields)
And as you are not sure, how to do this, you are thinking to build Elasticsearch queries from Javascript in your front-end only. I am not a very big fan of this approach as it exposes, how you are building queries and if some hacker knows crucial like below information, then can bring your entire ES cluster to its knees:
Knows what types of wildcard queries.
Knows index names and ES cluster details(although you may have access control but still you are exposing the crucial info).
How you are building your search queries.
Above are just a few examples and will add more info.
Right approach
As you already have a backend, where you would be checking the access, there only build the Elasticsearch queries and you even have the advantage of your teammates who knows it.
For building complex response field, you can use the source filtering, using which you can specify in your search request, what all fields you want to return in your search result.

Elastic search per user access control to document

I'm using ElasticSearch 7.1.1 as a full-text search engine. At the beginning all the documents are accessible to every user. I want to give users the possibility to edit documents. The modified version of the document will be accessible only to the editor and everyone else will only be able to see the default document.
To do this I will add two array to every document:
An array of users excluded from seeing the doc
An array with the only user that can see the this doc
Every time someone edit a document I will:
Add to the excluded users list the user that made the edit
Create document containing the edit available only to that user.
This way in the index I'll have three types of documents:
Documents accessible to everyone
Documents accessible to everyone except some users
Documents accessible only to a specific users
I use ElasticSearch not only to fetch documents but also to calculate live aggregations (e.g. sums of some field) so query-time I will be able to fetch user specific documents.
I don't expect a lot of edits, less than 1% of the total documents.
Is there a smarter, and less query intensive, way to obtain the same results?

You could implement a document level security.
With that you can define roles that restrict the read-access to certain documents that match a query (e.g. you could use the id of the document).
So instead of updating the documents each time via your proposed array-solution, you would instead update the role respectively granting the roles to the particular users. This would of course require that every user has an elasticsearch user.
This feature is the only workaround to fulfill your requirements that Elasticsearch brings on the table "out of the box" as far as I know.
I hope I could help you.

separating data access with elasticsearch

I'm just getting to know elasticsearch and I'm wondering if it suits my case at all:
Considering a system where companies (with multiple employees) can register and administer their clients, and send documents to their clients.
Now, I want to enable companies to search their documents - but ONLY theirs, not the documents of other companies. In other words: how to separate the data of those companies for searches? How can this be implemented with elasticsearch?
Is this separation to be handled by elasticsearch itself? I.e. there is some mapping between the companies in my system and a related user for elasticsearch.
Or is this to be handled by the backend of my system? I.e. the backend somehow decides (how?) to show only search results for that particular company. So there would be just one user, namely the backend of my system, that accesses and filters the results of elasticsearch. But is this sensible?
I'm sure there is a wealth of information about this out there. Please just give me a hint, because I don't know what to search for. Searches for elasticsearch authentication/authorization, for example, only yield results about who gains access to the search system in general - not about a pattern to solve this separation.
Thanks in advance!

Elasticsearch on its own does not support Authorization and Authentication, you need to add this via plugins, of which there are two that I know of. Shield is the official solution, which is part of the X-Pack and you need to pay Elastic if you want to use it. SearchGuard is an open source alternative with enterprise upgrades that you can buy.
Both of these enable you to define fine grained access rights for different users. What you'd probably want to do is give every company an index of their own for their documents and then restrict their user to only be able to read/write that index. Or if you absolutely want all documents in one index, you can add document level restrictions as well, so that everybody queries the same index but only gets results returned for their company. Depending on how many companies you expect to service this might make more sense in order to not have too many indices and shards, but I'd suspect that an index per company would be the best way to go.
Without these plugins you would need to resort to something on the http-layer, for example an nginx reverse proxy that filters requests based on the index names contained in the urls or something, but I'd severely advise against this, lots of pain lies that way!

Is it safe to expose the Elasticsearch Search API directly through your application's API?

I am developing an AngularJS app with a Java/Spring Boot API. It uses Spring Data Elasticsearch to provide access to Elasticsearch's Search API for searching. Here is an example:
Page<Address> page = addressSearchRepository.search(simpleQueryStringQuery(query), pageable);
The variable query is a user's search string. pageable is an object that specifies page number, page size, and sorting. I can use QueryBuilders to build other Elasticsearch queries and expose them as different API endpoints.
Another option is to use QueryBuilders.wrapperQuery and send Elasticsearch queries directly from JavaScript. Here is an example where jsonQuery is a string containing a full Elasticsearch query:
Page<Address> page = addressSearchRepository.search(wrapperQuery(jsonQuery), pageable);
This would be a secure endpoint that only authenticated users can access. This seems to be equivalent to exposing an Elasticsearch index's Search API directly. Assuming that any data in the index is safe to show the user, would this be a security risk?
In my research so far I've found that it may be possible to crash Elasticsearch using a query, but it isn't that big of a problem in newer versions: https://www.elastic.co/blog/found-crash-elasticsearch#arbitrary-large-size-parameter
Maybe limiting the page size or using the scan and scroll API when the page size is very large would mitigate this.
I know that script fields should be avoided at all costs, but they are disabled by default (as of v1.4.3).

You can still crash Elasticsearch if you know how to do it. For example, if you start building a 10 deep nested aggregations, you might very well go and take a break. It will either take a lot of time, or be very expensive, use a lot of memory, make the JVM do a lot of garbage collection (which basically freezes all other threads running in the JVM), reclaim back small amounts of memory. It can make the cluster unresponsive in this way.
I'm not saying that whatever aggregations you take and create a 10 deep nested aggregations you'll cripple the cluster, but under normal circumstances a cluster built for a certain SLA and deal with a certain amount of data, given some heavy aggregations (for example terms on analyzed string fields), will be very highly computational for the nodes.
Maybe the nodes will not run out of memory, but the nodes will barely be responsive.
Elastic's team is trying to implement other circuit breakers and to add default limits to certain types of queries and aggregations (a huge task). But if your aim is for your users not to crash ES, while they have full access to all queries, I think there are ways to crash it. I, personally, wouldn't expose ES and let my users do whatever they want with whatever queries they create.
Depending on how your wrapper is configured, I'd only allow my users certain types of queries/aggregations and for those I'd impose some limits (applicable for those queries/aggs that accept limits).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio