Separating data access with Elasticsearch

I'm just getting to know elasticsearch and I'm wondering if it suits my case at all:
Consider a system where companies (with multiple employees) can register and administer their clients, and send documents to those clients.
Now, I want to enable companies to search their documents - but ONLY theirs, not the documents of other companies. In other words: how to separate the data of those companies for searches? How can this be implemented with elasticsearch?
Is this separation to be handled by Elasticsearch itself? I.e., is there some mapping between the companies in my system and a corresponding Elasticsearch user?
Or is this to be handled by the backend of my system? I.e., the backend somehow decides (how?) to show only the search results for that particular company. There would then be just one user, namely my system's backend, that accesses Elasticsearch and filters the results. But is this sensible?
I'm sure there is a wealth of information about this out there. Please just give me a hint, because I don't know what to search for. Searches for elasticsearch authentication/authorization, for example, only yield results about who gains access to the search system in general - not about a pattern to solve this separation.
Thanks in advance!

Elasticsearch on its own does not support authorization and authentication; you need to add this via plugins, of which there are two that I know of. Shield is the official solution; it is part of X-Pack, and you need to pay Elastic if you want to use it. SearchGuard is an open-source alternative with enterprise upgrades that you can buy.
Both of these let you define fine-grained access rights for different users. What you'd probably want to do is give every company an index of its own for its documents, and then restrict its user to only be able to read/write that index. Or, if you absolutely want all documents in one index, you can add document-level restrictions as well, so that everybody queries the same index but only gets back results for their own company. Depending on how many companies you expect to serve, that can make more sense, to avoid ending up with too many indices and shards, but I'd suspect that an index per company is the best way to go.
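For illustration, a minimal sketch of both approaches against the plain REST API using Python's requests. The cluster address, index names, and the company_id field are assumptions, and in practice the security plugin, not the client, must enforce which names a tenant's credentials may touch:

```python
import requests

ES = "http://localhost:9200"   # hypothetical cluster address
company_id = "acme"            # illustrative tenant

# Option A: one index per company; the tenant's credentials are
# restricted (via the security plugin) to exactly this index.
requests.put(f"{ES}/docs-{company_id}")

# Option B: one shared index plus a filtered alias per company, so a
# tenant querying its alias only ever sees its own documents.
requests.put(f"{ES}/docs-shared")
requests.post(f"{ES}/_aliases", json={
    "actions": [{
        "add": {
            "index": "docs-shared",
            "alias": f"docs-{company_id}",
            # Assumes each document carries a company_id keyword field.
            "filter": {"term": {"company_id": company_id}},
        }
    }]
})

# Either way, the application only ever searches the tenant-scoped name:
r = requests.post(f"{ES}/docs-{company_id}/_search",
                  json={"query": {"match": {"body": "invoice"}}})
print(r.json())
```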
Without these plugins you would need to resort to something on the HTTP layer, for example an nginx reverse proxy that filters requests based on the index names contained in the URLs, but I'd strongly advise against this; lots of pain lies that way!

Related

How to implement search over shared documents

In our application we allow users to like and share documents. We would like to be able to perform an Elasticsearch query over all documents relevant to a specific user (liked and shared documents) but without storing any authorization fields in Elasticsearch (since updating these would be pretty slow).
Are there any standard ways of architecting a solution for this?
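The thread leaves this open, but one common pattern matching the stated constraint (no authorization data in Elasticsearch) is to resolve the user's liked/shared document IDs in the application database and pass them into the query at search time. A hedged sketch, assuming the documents are indexed under their database IDs and an illustrative index name:

```python
import requests

ES = "http://localhost:9200"   # hypothetical cluster address
INDEX = "documents"            # hypothetical index name

def search_user_documents(user_doc_ids, text):
    """Search only the documents this user has liked or shared.

    user_doc_ids comes from the application database, so no
    authorization data is stored (or updated) in Elasticsearch."""
    query = {
        "query": {
            "bool": {
                "must": {"match": {"body": text}},
                # Restrict the search to the user's documents by _id.
                "filter": {"ids": {"values": user_doc_ids}},
            }
        }
    }
    return requests.post(f"{ES}/{INDEX}/_search", json=query).json()

# e.g. search_user_documents(["42", "97", "103"], "quarterly report")
```

The trade-off is request size: very long ID lists bloat every query and slow it down.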

How to model shared folders in Elasticsearch or Solr?

Popular search engines are quite performant when it comes to full-text search and many other aspects; however, I am not sure how to map the main document storage system's security policies to ES and/or Solr.
Consider Google Drive and its folders. Users can share any folder; the files and folders below it are then also shared. Content management systems use something similar.
But how do you map that to an external search engine (that is, one not built into the application's content management system), especially if there are millions of documents in many tens of thousands of folders and tens of thousands of users? Will it help if, for example, the depth (nesting) of the folders is limited to some small number?
I know ES has user roles, but I can't see how they help here, because access is granted more or less arbitrarily. Another approach is to somehow materialize user access in the items (folders and documents) themselves, but then changes to users' roles, local to some folder, will result in changing many thousands of documents.
Also, searches can be quite arbitrary and lengthy, so pagination is desired; fetching "everything" and then sorting out user access on the application side is therefore not an option.
I believe the scenario described is quite common, but I can't find any hints how to implement it.
I have used Solr as a search engine, with Solr's Data Import Handler (DIH) feature for importing data from a database into Solr.
I would suggest you go with the approach of indexing the ACLs along with the documents.
I have taken the same approach and it is working fine so far.
I agree that you have to re-index the data on the Solr side whenever folder access or the access level of documents changes. We already need to re-index a document whenever its metadata or content changes; similarly, we can update the documents on the Solr side for any changes in the ACL (Access Control List).
Why index the ACL along with the document information?
The reason is that whenever a user searches for a document, you can pass the user's ACL as part of the query, in the form of a filter query, and get back only the documents that are accessible to that user.
I feel this removes the complexity of applying the ACL logic on the backend side.
If you don't index the ACL in Solr, then you have to filter the documents after you retrieve them from Solr, checking each document's ID against whatever ACL logic applies.
The last option would be to index the documents without ACLs and let users search all of them. When a user tries to perform an action on a document, you check the permission and either allow the action or deny it, telling the user they don't have sufficient permission to access the document.
An action could be View, Download, Update, etc.
You need to decide which approach suits and works best in your case.
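To make the filter-query idea concrete, here is a minimal sketch using Python's requests against Solr's select handler; the core name and the single multi-valued acl field are assumptions, not from the answer above:

```python
import requests

SOLR = "http://localhost:8983/solr/documents"  # hypothetical core

def search_with_acl(text, user, groups):
    """Search, keeping only documents whose indexed acl field
    contains the user's principal or one of their groups."""
    principals = [user] + list(groups)
    acl_clause = " OR ".join(f'acl:"{p}"' for p in principals)
    params = {
        "q": text,
        "fq": acl_clause,  # filter query: applied before scoring, cacheable
        "rows": 20,
    }
    return requests.get(f"{SOLR}/select", params=params).json()

# e.g. search_with_acl("annual report", "user:alice", ["group:finance"])
```

Note that Solr caches each distinct fq, so highly user-specific ACL clauses can reduce filter-cache effectiveness.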

Elasticsearch Best Practices Flow

I am using Elasticsearch for product filtering. We have complex product-availability logic, and I can see two options:
Use Elasticsearch to store only product-specific data, keeping the availability logic in the web server: first filter the data with Elasticsearch, then check the availability conditions against that result set.
Or flatten the data and store it in Elasticsearch, though in that case there will be duplicated data.
My concern is whether it is good practice to call the Elasticsearch endpoint from the browser, since it has no auth system by default and every query and response is visible in the network log. I believe the call should be made from the web server to Elasticsearch, with the frontend talking only to the web server, unaware of Elasticsearch's existence.
Any best-practice insight would be helpful.
Simply create an authenticated endpoint in your backend and send the queries to that endpoint. Do make sure there are some enforced limits, such as:
size -- you don't want to let anybody download your whole index, and
aggregation depth -- you don't want anyone to run summaries over your whole index/indices to gain a competitive advantage.
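A hedged sketch of such an endpoint (Flask; the products index name is an assumption, and a real version would authenticate the caller and validate the query shape rather than just capping size and rejecting aggregations):

```python
from flask import Flask, abort, jsonify, request
import requests

app = Flask(__name__)
ES = "http://localhost:9200"   # hypothetical; never exposed to browsers
MAX_SIZE = 50                  # cap on page size

@app.route("/api/search", methods=["POST"])
def search():
    # A real version would authenticate the caller first.
    payload = request.get_json(force=True) or {}
    # Enforce the size limit regardless of what the client asked for.
    payload["size"] = min(int(payload.get("size", 10)), MAX_SIZE)
    # Reject aggregations outright rather than trying to bound their depth.
    if "aggs" in payload or "aggregations" in payload:
        abort(400, "aggregations are not allowed")
    r = requests.post(f"{ES}/products/_search", json=payload)  # assumed index
    return jsonify(r.json())
```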
Regarding the duplicates: I wouldn't worry too much about the storage aspect (many NoSQL approaches will probably have some duplication to facilitate fast queries) but keep in mind that aggregations might yield "wrong" counts and sums. You'd typically perform those aggregations to get, say, the totals in your product categories and you want to make sure they are representative of your warehouse state.
More cannot really be said right now based on the limited information you've provided.

Are there conventions for naming/organizing Elasticsearch indexes which store log data?

I'm in the process of setting up Elasticsearch and Kibana as a centralized logging platform in our office.
We have a number of custom utilities and plug-ins whose usage I would like to track, along with any errors users encounter. There are also servers and scheduled jobs I would like to keep track of.
So if I have a number of different sources of log data all going to the same Elasticsearch cluster, what are the conventions or best practices for organizing this into indexes and document types?
The default index value used by Logstash is "logstash-%{+YYYY.MM.dd}". So it seems like it's best to suffix any index names with the current date, as this makes it easy to purge old data.
However, Kibana allows for adding multiple "index patterns" that can be selected from in the UI. Yet all the tutorials I've read only mention creating a single pattern like logstash-*.
How are multiple index patterns used in practice? Would I just give names for all the sources for my data? Such as:
BackupUtility-%{+YYYY.MM.dd}
UserTracker-%{+YYYY.MM.dd}
ApacheServer-%{+YYYY.MM.dd}
I'm using NLog in a number of my tools, and it has an Elasticsearch target. The convention for NLog and other similar logging frameworks is to have a "logger" for each class in the source code. Should these loggers translate to indexes in Elasticsearch?
MyCompany.CustomTool.FooClass-%{+YYYY.MM.dd}
MyCompany.CustomTool.BarClass-%{+YYYY.MM.dd}
MyCompany.OtherTool.BazClass-%{+YYYY.MM.dd}
Or is this too granular for Elasticsearch index names, and would it be better to stick to a single dated index per application?
CustomTool-%{+YYYY.MM.dd}
In my environment we're working through a similar question. We have a mix of system logs, metric alerts from Prometheus, and application logs from both client and server applications. In addition, we have some shared variables between the client and server apps that let us correlate the two (e.g., we know what server logs match some operation on the client that made requests to said server). We're experimenting with the following scheme to help Kibana answer questions for us:
logs-system-{date}
logs-iis-{date}
logs-prometheus-{date}
logs-app-{applicationName}-{date}
Where:
{applicationName} is the unique name of some application we wrote (these could be client or server side)
{date} is whatever date-based scheme you use for indexes
This way we can set up Kibana searches against logs-app-* and quickly search for logs among any of our applications. This is still new for us, but we started without this type of scheme and are already regretting it. It makes searching for correlated logs across applications much harder than it should be.
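For what it's worth, a scheme like this is easy to produce from a log shipper or directly from application code. A small sketch, assuming a typeless (ES 7+) cluster and illustrative field names:

```python
import datetime
import requests

ES = "http://localhost:9200"   # hypothetical cluster address

def index_app_log(application_name, doc):
    """Write a log document using the logs-app-{applicationName}-{date}
    scheme described above."""
    date = datetime.date.today().strftime("%Y.%m.%d")
    index = f"logs-app-{application_name}-{date}"
    # POST to /{index}/_doc auto-creates the dated index on first write.
    return requests.post(f"{ES}/{index}/_doc", json=doc)

index_app_log("checkout", {"level": "ERROR", "message": "payment timeout"})
```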
In my company we have worked a lot on this topic. We agreed on the following convention:
Customer
-- Product
--- Application
---- Date
In any case, it is necessary to review both how the data is organized and how it is queried inside the organization.
Kind Regards
Dario Rodriguez
I am not aware of such conventions, but in my environment we used to create two different types of indexes, logstash-* and logstash-shortlived-*, depending on the severity level. In my case, I create the index pattern logstash-*, as it will match both kinds of indices.
As these indices are stored in Elasticsearch and Kibana reads them, it should give you the option of creating index patterns for the different naming schemes.
Give it a try on your local machine. Try logstash-XYZ if you want more granularity; otherwise, you can always create indices with your own custom names.

Elasticsearch for multiple sites (sources)

We have a large number of websites, and since this is our first time using Elasticsearch I don't know how we should configure ES:
We want to use ES as the only search engine for these sites. Should we set up a separate ES instance for each site? (I imagine this would take far more resources than just one ES instance; on the other hand, will putting the documents from all sites into a single ES instance create too much overhead for each search?)
When we do a search, however, we will search for documents within one specific site only (and it would be nice to somehow prevent sites from searching documents that don't belong to them).
Nice to have, but not a must, is the ability to search for documents on ALL sites if possible.
You should set up one cluster for all websites. The biggest advantage of Elasticsearch is that it scales pretty well, so handling requests from several different clients shouldn't be a problem for your cluster if you scale it big enough. Each site should have its own index (an index is like a container that holds your documents). Elasticsearch allows you to search over one or several indices, which means you don't have to search all documents if you don't want to: each site can search its own index, or over all indices if you wish.
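A minimal sketch of what that looks like at query time (Python requests; the index names are illustrative), with the backend choosing which index or indices each request may hit:

```python
import requests

ES = "http://localhost:9200"   # hypothetical cluster address
query = {"query": {"match": {"content": "hello"}}}

# Search one site only: the backend chooses the index, so one site
# can never read another site's documents.
requests.post(f"{ES}/site-alpha/_search", json=query)

# Search several or all sites, via a comma-separated list or a wildcard.
requests.post(f"{ES}/site-alpha,site-beta/_search", json=query)
requests.post(f"{ES}/site-*/_search", json=query)
```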
I am new to ES, but I think you can get what you want with one ES instance and routing: http://www.elasticsearch.org/blog/customizing-your-document-routing/
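To make the routing idea concrete, a hedged sketch (index and field names are assumptions). Note that routing only controls which shard is addressed; it does not filter results by itself, so a per-site filter is still needed:

```python
import requests

ES = "http://localhost:9200"   # hypothetical cluster address

# Index a document with a routing value so that all of one site's
# documents land on the same shard.
requests.post(f"{ES}/sites/_doc?routing=site-alpha",
              json={"site": "site-alpha", "content": "hello"})

# Search with the same routing value to query only that shard.
# Routing narrows which shards are hit; it does NOT filter results,
# so the per-site filter below is still required.
requests.post(f"{ES}/sites/_search?routing=site-alpha", json={
    "query": {"bool": {
        "must": {"match": {"content": "hello"}},
        "filter": {"term": {"site": "site-alpha"}},
    }},
})
```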
