Multi tenancy in Elastic Search - elasticsearch

We are planning to introduce Elastic search(AWS) for our Multi tenancy application. We have below options,
Using One Index Per Tenant
Using One Type Per Tenant
All Tenants Share One Index with Custom routing
As per this blog https://www.elastic.co/blog/found-multi-tenancy the first option would give memory issue. But not clear about other options.
It seems if we are using the third option then there is no data segregation. Not sure about security.
I believe second option would be better option as data would be segregated.
Help me to identify best option to proceed elastic search with Multi tenancy.
Please note that we would leverage AWS infrastructure.

We are considering the same question right now, and the following set of articles by Elasticsearch was very helpful.
Start here: https://www.elastic.co/guide/en/elasticsearch/guide/current/scale.html
And read through each subsequent article until you hit this one: https://www.elastic.co/guide/en/elasticsearch/guide/current/finite-scale.html
The following two were very eye-opening for me:
https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/one-big-user.html
The basic takeaway:
Alias per customer
Shard routing
Now you can have indexes for big customers, shared indexes for little customers, and they all appear to be separate indices

This is a too important link not to be mentioned here:
http://www.bigeng.io/elasticsearch-scaling-multitenant/
Good architecture dilemmas, and great performance analysis / reasoning.
tldr; they had index groups that are built around shard allocation filtering to segregate load across nodes in the cluster

To sum up accepted answer and other articles,
Use a shared index using custom routing using an alias
1.1) Special case: Big client can have dedicated index, only if needed.
Following article covers many use cases for detailed explanation.
https://www.elastic.co/blog/found-multi-tenancy
Following is the conclusion on how you can do it (link source: accepted answer)
https://www.elastic.co/guide/en/elasticsearch/guide/current/faking-it.html

Related

Strategy for Laravel scout with million+ records and multiple interchangeable drivers (i.e. TNT & Algolia)

I have used Algolia a bit and it is an awesome service. I have also used the TNT-search driver for scout and it is also pretty good, but not really a touch on the features, speed and ease that you get with Algolia.
Unfortunately, Algolia gets very expensive when dealing with a lot of records - for example one of our apps has over 10 million searchable rows, which would be thousands of $$ monthly!
Has anyone had any success in using both together? i.e. I would like use Algolia for recent records or categories where I need more advanced search capabilities (100k of records) and then use TNT search for the remainder.
EDIT:
Elasticsearch was the answer. It is a little harder to setup but has such great flexibility. I would highly recommend https://github.com/matchish/laravel-scout-elasticsearch to connect it up with Laravel Scout seamlessly.
Well ,AWS Elasticsearch very good and cheap .. you might use it and it's very easy to use and configure with laravel AWS Elasticsearch pricing
You may start with r3.large.elasticsearch it will costs you about 180$ per month and if you want to more r3.xlarge.elasticsearch will be amazing too and it will serve you need.
to configure AWS Elasticsearch with laravel you may read this artical How-to integrate-your-Laravel-app-with-Elasticsearch
Elasticsearch is already suggested in another answer as a cost effective alternative. But if you are looking for something similar to Algolia but open source, check out Typesense. It's must easier to set-up and manage and also features like typo correction etc. just works out of the box.
This answer might be late but if there is anyone who still has this problem, Laravel scout has a searchableUsing() method that determines the search engine to use, you can override that to configure different search drivers for different models. This blogpost is a detailed step by step on how to do this

Both ElasticSearch and Redis, overkill usecase?

I'm currently designing the architecture of my project or atleast try to figure it out what will be useful in my case.
** Simple use case
I will have several thousands of profiles in a backend and I to need implement a fast search engine. So elasticsearch look perfect in that case. Everytime a profile is updated, the index will be updated by an asynchronous task.
My question now is : If I want to implement a cache system for the detail of a profile. Should I stick with elasticsearch and put these data in my index ? Or use Redis and do something like profil_id => data ?
I think both sounds good the problem is whenever a profile is updated, I will have to flush it after the reindexing in elasticsearch. If I want to see the change in my backend.
So what can I do ? Thank you so much !
You should consider using RediSearch. Using RediSearch can provide you a solution for your needs, getting both Redis performance and a full-text support.
Elasticsearch and redis are basically meant to solve two different problems, As one does indexing while other does caching.
Redis is meant to return already requested data as fast as possible whereas as
Elasticsearch is a search and analytics engine, it would perfectly fit a use-case where you have to implement a fast search engine and it will be more performant than any in-memory data structure store or cache such as redis(Assuming your searches will be complex, will involve some aggregation/filters).
The problem comes profile updates Since your profile updates are not that frequent you could actually do partial updates to the ES index rather doing reindex.So whenever a person updates its profile get the changeling set(changed data) and do a partial update to the particular document in ES Index. You can see how its done here partial update.
This one particular stackoverflow answer will help you cache vs indexing

separating data access with elasticsearch

I'm just getting to know elasticsearch and I'm wondering if it suits my case at all:
Considering a system where companies (with multiple employees) can register and administer their clients, and send documents to their clients.
Now, I want to enable companies to search their documents - but ONLY theirs, not the documents of other companies. In other words: how to separate the data of those companies for searches? How can this be implemented with elasticsearch?
Is this separation to be handled by elasticsearch itself? I.e. there is some mapping between the companies in my system and a related user for elasticsearch.
Or is this to be handled by the backend of my system? I.e. the backend somehow decides (how?) to show only search results for that particular company. So there would be just one user, namely the backend of my system, that accesses and filters the results of elasticsearch. But is this sensible?
I'm sure there is a wealth of information about this out there. Please just give me a hint, because I don't know what to search for. Searches for elasticsearch authentication/authorization, for example, only yield results about who gains access to the search system in general - not about a pattern to solve this separation.
Thanks in advance!
Elasticsearch on its own does not support Authorization and Authentication, you need to add this via plugins, of which there are two that I know of. Shield is the official solution, which is part of the X-Pack and you need to pay Elastic if you want to use it. SearchGuard is an open source alternative with enterprise upgrades that you can buy.
Both of these enable you to define fine grained access rights for different users. What you'd probably want to do is give every company an index of their own for their documents and then restrict their user to only be able to read/write that index. Or if you absolutely want all documents in one index, you can add document level restrictions as well, so that everybody queries the same index but only gets results returned for their company. Depending on how many companies you expect to service this might make more sense in order to not have too many indices and shards, but I'd suspect that an index per company would be the best way to go.
Without these plugins you would need to resort to something on the http-layer, for example an nginx reverse proxy that filters requests based on the index names contained in the urls or something, but I'd severely advise against this, lots of pain lies that way!

Are there conventions for naming/organizing Elasticsearch indexes which store log data?

I'm in the process of setting up Elasticsearch and Kibana as a centralized logging platform in our office.
We have a number of custom utilities and plug-ins which I would like to track the usage of and if users are encountering any errors. Not to mention there are servers, and scheduled jobs I would like to keep track of as well.
So if I have a number of different sources for log data all going to the same elasticsearch cluster what are the conventions or best practices for how this is organized into indexes and document types?
The default index value used by Logstash is "logstash-%{+YYYY.MM.dd}". So it seems like it's best to suffix any index names with the current date, as this makes it easy to purge old data.
However, Kibana allows for adding multiple "index patterns" that can be selected from in the UI. Yet all the tutorials I've read only mention creating a single pattern like logstash-*.
How are multiple index patterns used in practice? Would I just give names for all the sources for my data? Such as:
BackupUtility-%{+YYYY.MM.dd}
UserTracker-%{+YYYY.MM.dd}
ApacheServer-%{+YYYY.MM.dd}
I'm using nLog in a number of my tools which has an elastic search target. The convention for nLog and other similar logging frameworks is to have a "logger" for each class in the source code. Should these logger translate to indexes in elastic search?
MyCompany.CustomTool.FooClass-%{+YYYY.MM.dd}
MyCompany.CustomTool.BarClass-%{+YYYY.MM.dd}
MyCompany.OtherTool.BazClass-%{+YYYY.MM.dd}
Or is this too granular for elasticsearch index names, and it would be better to stick to just to a single dated index for the application?
CustomTool-%{+YYYY.MM.dd}
In my environment we're working through a similar question. We have a mix of system logs, metric alerts from Prometheus, and application logs from both client and server applications. In addition, we have some shared variables between the client and server apps that let us correlate the two (e.g., we know what server logs match some operation on the client that made requests to said server). We're experimenting with the following scheme to help Kibana answer questions for us:
logs-system-{date}
logs-iis-{date}
logs-prometheus-{date}
logs-app-{applicationName}-{date}
Where:
{applicationName} is the unique name of some application we wrote (these could be client or server side)
{date} is whatever date-based scheme you use for indexes
This way we can set up Kibana searches against logs-app-* and quickly search for logs among any of our applications. This is still new for us, but we started without this type of scheme and are already regretting it. It makes searching for correlated logs across applications much harder than it should be.
In my company we have worked lot about this topic. We agree the following convention:
Customer
-- Product
--- Application
---- Date
In any case, it is neccesary to review both how the data is organized and how the data is consulted inside the organization
Kind Regards
Dario Rodriguez
I am not aware of such conventions, but for my environment, we used to create two different type of indexes logstash-* and logstash-shortlived-*depending on the severity level. In my case, I create index pattern logstash-* as it will satisfy both kind of indices.
As these indices will be stored at Elasticsearch and Kibana will read them, I guess it should give you the options of creating the indices of different patterns.
Give it a try on your local machine. Why don't you try logstash-XYZ if you want more granularity otherwise you can always create indices with your custom name.

Internal data storage mechanism of elasticsearch

I have been working with elasticsearch for the past 2 months. I have used both REST approach and API support in different languages to index, get and search data. I also read a lot about elasticsearch and found out it is not a good option to use it as a data store. Why is this? And I'm also curious about how elasticsearch internally stores the indexed data. Any good link or explanation??
Elastic Search is built on top of Apache Lucene - here's a reference doc on the Lucene index file structure:
http://lucene.apache.org/core/4_7_2/core/org/apache/lucene/codecs/lucene46/package-summary.html#package_description
Regarding whether or not it's a good option as a data store I think that's more individual opinion and specific use cases than a fact that can be proved. It does not have the transaction support that something like MySQL does if that's what you are looking for. In that case it's somewhat on a par with other NoSQL solutions. This is a pretty decent writeup on the trade-offs and issues: https://www.found.no/foundation/elasticsearch-as-nosql/
In the end it depends on what you are doing with your data and what level of robustness you require.

Resources