Elasticsearch: Several independent nodes on the same machine

Our current software solution uses a local ES installation (1 cluster and 1 node) to store documents so that the user can later search them. Documents are not ingested continuously but roughly once a month, using bulk requests. The document set isn't huge and the documents are small. This solution has been working correctly without problems on normal laptop PCs (i5 with 8 GB RAM), since the use case does not require high performance.
Now we're facing 2 new requirements for our software solution:
It should be brandable for other customers
The same end user (on the same machine) should be able to work with several instances of our solution (from different customers)
With these 2 new requirements the current solution cannot be used, because all documents would be indexed in the same node using the same index, and searches would then return documents from different customers.
A first approach to solve this issue was to index documents per customer, that is, to create one index per customer and index/search documents on the corresponding index. However, we're thinking about another solution that would allow us the following:
ES indexed information must be easily removable from the system (e.g. by removing the data folder)
Each customer may want to use a newer version of our solution (e.g. one which uses ES 7) whereas others will remain on older versions (e.g. ES 6)
Based on this, I think the solution would be to have several ES installations on the same PC, each one with its own customer-dependent configuration:
Different cluster
Different node name and port
Different ES version
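A minimal sketch of what this per-installation configuration could look like, via each installation's own elasticsearch.yml. All names, ports and paths here are illustrative assumptions, not taken from the question:

```yaml
# Installation A (e.g. ES 7.x) -- <install-a>/config/elasticsearch.yml
cluster.name: customer-a-cluster
node.name: customer-a-node
http.port: 9200
transport.port: 9300            # in 6.x this setting is transport.tcp.port
path.data: /opt/customer-a/es-data   # deleting this folder removes the indexed data

# Installation B (e.g. ES 6.x) -- <install-b>/config/elasticsearch.yml
cluster.name: customer-b-cluster
node.name: customer-b-node
http.port: 9210
transport.tcp.port: 9310
path.data: /opt/customer-b/es-data
```

Giving each installation a different cluster.name also prevents the nodes from accidentally discovering each other and forming one cluster.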
My questions then are: has anyone faced a similar use case? Would there be performance issues from installing several ES instances and letting their services run continuously at the same time? Which problems could arise from this configuration?
Any help would be appreciated.
UPDATE
Based on the answer received and for possible future answers, I would like to clarify a bit more about the architecture of our solution + ES:
Our solution is a desktop application executed on normal laptop PCs
Single user
Even if more than one customer specific solution is installed in the PC, only 1 will be active at a time
Searches will be executed sporadically when the user wants to search for a specific document (as if someone opens Wikipedia to search for an article)
So topics as ...
Infrastructure failure
Data replication
Performance at high search demand
... are not critical

You can run multiple installations of ES on the same machine in production, but it has a lot of disadvantages.
Ideally, you should have at least 1 replica of your shard, and it should be present on another physical machine (node) so that the cluster can recover from an infrastructure failure; this improves the resiliency of your system.
In production it's common to come across a use case where a single shard is not enough and you need to break your index into multiple primary shards to make it horizontally scalable, but if you have just 1 physical server, having multiple shards will not help you.
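To make the shard/replica trade-off concrete, here is a hedged sketch of the settings body one would send when creating an index; the shard counts are illustrative, not a recommendation:

```python
import json

def index_settings(primary_shards: int, replicas: int) -> dict:
    """Build the settings body for an ES create-index request."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }

# On a single laptop there is no second node to hold replica copies,
# so asking for replicas would just leave the cluster in "yellow" health.
single_node = index_settings(primary_shards=1, replicas=0)

# A production cluster would typically spread several primaries plus
# replicas over multiple physical machines.
production = index_settings(primary_shards=3, replicas=1)

print(json.dumps(single_node))
```
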
Having multiple installations also doesn't help in the case where one installation receives a lot of traffic and consumes all the physical resources (RAM, CPU, disk), bringing the other installations down as well. It also becomes difficult to isolate the root cause and quickly fix the issue, as an ES installation is not stateless: you cannot just start the same installation on another machine without moving all its data and configuration.
Basically, yours is a truly tenant-based SaaS application, and looking at your requirements, you should design your system considering the following:
Upgrading the ES version is sometimes not straightforward and can involve a lot of breaking changes in your application code as well, so just running one cluster on the latest version will not solve the problem. Hence your application should expose a tenant (your customer) registration API which also records which ES version the customer wants to use, and your code handles that accordingly.
ES indexed information must be easily removed from the system: I don't see the issue here; you can simply delete an index using the ES delete index API, which is the recommended way of doing that, instead of doing it manually.
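As a sketch of the API-based cleanup, the standard delete index endpoint is a plain HTTP DELETE on the index name. The host, the one-index-per-customer naming scheme and the helper names below are assumptions for illustration:

```python
import urllib.request

def customer_index(customer: str) -> str:
    # Hypothetical naming convention: one index per customer.
    return f"docs-{customer.lower()}"

def delete_customer_data(host: str, customer: str) -> int:
    # DELETE /<index> removes the index and all its documents,
    # without touching the ES data folder by hand.
    url = f"http://{host}/{customer_index(customer)}"
    req = urllib.request.Request(url, method="DELETE")
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Against a local node this would be something like:
# delete_customer_data("localhost:9200", "acme")
print(customer_index("Acme"))
```
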
Hope my answer is clear to you; let me know if I missed any of your requirements or you need further clarification.
Based on the update on the question I am adding below points:
As the OP mentioned, it's a very small desktop application and not a server-side application, so it's very important not to mix the content of different customers in one store. Anybody can install an ES web admin plugin like https://github.com/lmenezes/cerebro and read the data of other customers.
The best solution in your case is to have a single installation of ES, on the version specified by the customer, with just 1 index pertaining to the customer running the desktop application. And you can easily use the delete API as I mentioned earlier.
There is no need to have multiple installations at all. Even though they won't be active, they still consume local disk space (which matters even more for a desktop app) and can cause further issues; it's not a clean design to store unnecessary information in a desktop app, and it is also a security concern, which is the much bigger issue in general.

Related

Encapsulate an Elasticsearch cluster in an off-the-shelf product

I need to provide a product which is installed on up to 10-20 servers and encapsulates an Elastic Search cluster internally as part of the product.
The product must be self managed and work 100% of time without any human intervention.
I'm concerned whether such an ES cluster can be maintained purely programmatically, recovering from hardware failures, power outages and other unexpected events,
or whether such a cluster must have a human administrator "keeping it up".
I'll appreciate input from people who have done such a thing, or who maintain a cluster and think it isn't possible.
You can basically automate this via the cluster and cat APIs.
The other question is how much you need to do, and what action should be taken if, for instance, your app is growing and 10 servers are not enough, or you are running out of space.
Also, what to do with software updates, and so on. So you can automate a lot, but I still think that sometimes you need a human.
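A toy sketch of what such automation on top of the _cluster/health API could decide; the thresholds and action names are hypothetical, and the part that actually fetches the health JSON from a node is elided:

```python
def pick_action(health: dict) -> str:
    """Map a _cluster/health response body to a remediation step."""
    status = health.get("status")
    if status == "green":
        return "nothing"
    if status == "yellow":
        # Unassigned replicas: often resolves itself, so just keep watching.
        return "monitor"
    # "red": primary shards are missing. This is exactly the kind of event
    # where, per the answer above, a human may still be needed.
    return "alert-operator"

print(pick_action({"status": "yellow"}))
```
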

Camunda: architecture and decisions for a high-performance, dynamic multi tenant app

I'm in charge of building a complex system for my company, and after some research decided that Camunda fits most of my requirements. But some of my requirements are not common, and after reading the user guide I realized there are many ways of doing the same thing, so I hope this question will clarify my thoughts and also serve as a base question for everyone else looking to build something similar.
First of all, I'm planning to build a specific app on top of Camunda BPM. It will use workflow and BPM features, but not necessarily everything BPM/Camunda provides. This means I don't plan to use most of the web apps that come bundled with Camunda (tasks, modeler...), at least not for end users. And to make things more complicated, it must support multiple tenants... dynamically.
So, I will try to specify all of my requirements and then hopefully someone with more experience than me could explain which is the best architecture/solution to make this work.
Here we go:
Single App built on top of Camunda BPM
High-performance
Workload (10k new process instances/day after a few months).
Users (starting with 1k, expected to be ~ 50k).
Multiple tenants (starting with 10, expected to be ~ 1k)
Tenants dynamically managed (creation, deploy of process definitions)
It will be deployed on cluster
PostgreSQL
WildFly 8.1 preferably
After some research, these are my thoughts:
One Process Application
One Process Engine per tenant
Multi tenancy data isolation: schema or table level.
Clustering (2 nodes) at first for high availability, and adding more nodes when amount of tenants and workload start to rise.
Doubts
Should I let camunda manage my users/groups, or better manage this on my app? In this case, can I say to Camunda “User X completed Task Y”, even if camunda does not know about the existence of user X?
What about dynamic multi tenancy? Is it possible to create tenants on the fly and make those tenants persist over time even after restarting the application server? What about re-deployment of processes after restarting?
After which point should I think on partitioning of engines on nodes? It’s hard to figure out how I’m going to do this with dynamic multi tenancy, but moreover... Is this the right way to deal with high workload and growing number of tenants?
With my setup of just one process application, should I take care of something else in a cluster environment?
I'm not ruling out using only one tenant, one process engine and handle everything related to tenants logically within my app, but I understand that this can be very (VERY!) cumbersome.
All answers are welcome, hopefully we'll achieve a good approach to this problem.
1. Should I let camunda manage my users/groups, or better manage this on my app? In this case, can I say to Camunda “User X completed Task Y”, even if camunda does not know about the existence of user X?
Yes, you can have your app manage the users and tell Camunda that a task was completed by a user Camunda doesn't know about. In the same way, you can make Camunda assign tasks to users it doesn't know at all. This is done by implementing their org.camunda.bpm.engine.impl.identity.ReadOnlyIdentityProvider interface and letting the engine configuration know about your implementation.
PS: If you don't need all the applications that come with Camunda, I would even suggest embedding the Camunda engine in your app. It can be done easily, and they have good documentation for their Java APIs.
2. What about dynamic multi tenancy? Is it possible to create tenants on the fly and make those tenants persist over time even after restarting the application server? What about re-deployment of processes after restarting?
Yes, it's possible to add tenants dynamically. When restarting the engine or your application, you can either choose to redeploy or just use the already deployed processes. Even when you redeploy a process, if you want Camunda to create a new version of the process only when the process has actually changed, that's also possible; see the enableDuplicateFiltering property of their DeploymentBuilder.
3. After which point should I think on partitioning of engines on nodes? It’s hard to figure out how I’m going to do this with dynamic multi tenancy, but moreover... Is this the right way to deal with high workload and growing number of tenants?
In my experience it is possible. You need to keep track of various parameters here, like memory, number of requests being served, number of open connections available, etc., and accordingly add or remove nodes. With AWS this is much easier, as they already provide some of these tools for dynamically scaling nodes in/out. That said, I have only done this with Camunda as an embedded engine application.

Using elasticsearch as central data repository

We are currently using elasticsearch to index and perform searches on about 10M documents. It works fine and we are happy with its performance. My colleague who initiated the use of elasticsearch is convinced that it can be used as the central data repository and other data systems (e.g. SQL Server, Hadoop/Hive) can have data pushed to them. I didn't have any arguments against it because my knowledge of both is too limited. However, I am concerned.
I do know that data in Elasticsearch is stored in a manner that is efficient for text searching. Hadoop stores data just as a file system would, but in a manner that is efficient for scaling/replicating blocks over multiple data nodes. Therefore, in my mind it seems more beneficial to use Hadoop (as it is more agnostic in its view of data) as the central data repository, and then push data from Hadoop to SQL, Elasticsearch, etc...
I've read a few articles on Hadoop and elasticsearch use cases and it seems conventional to use Hadoop as the central data repository. However, I can't find anything that would suggest that elasticsearch wouldn't be a decent alternative.
Please Help!
As is the case with all database deployments, it really depends on your specific application.
Elasticsearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to basically function just like a schema-less JSON datastore that can be accessed using both search-specific methods and regular database CRUD-like commands.
Despite all the advantages Elasticsearch brings, there are still some main disadvantages:
Security - Elasticsearch does not provide any authentication or access control functionality out of the box. This has been addressed since they introduced Shield.
Transactions - There is no support for transactions or transactional processing on data manipulation. Nowadays, data manipulation pipelines are commonly handled with Logstash.
Durability - ES is distributed and fairly stable but backups and durability are not as high priority as in other data stores.
Maturity of tools - ES is still relatively new and has not had time to develop mature client libraries and 3rd-party tools, which can make development much harder. We can consider it quite mature now, with a variety of connectors and tools around it, like Kibana.
Large computations - The commands for searching data are not suited to "large" scans of data or advanced computation on the DB side.
Data Availability - ES makes data available in "near real-time", which may require additional considerations in your application (e.g. on a comments page, when a user adds a new comment, refreshing the page might not actually show the new post because the index is still updating).
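For the near-real-time point, one common write-side workaround is the `refresh=wait_for` query parameter on the index call, which blocks until the document is visible to searches. The sketch below only builds the request path (the `_doc` endpoint is the modern ES REST form; the index and id values are made up):

```python
def index_url(index: str, doc_id: str, wait_for_refresh: bool = False) -> str:
    """Build the REST path for indexing one document."""
    url = f"/{index}/_doc/{doc_id}"
    if wait_for_refresh:
        # Block the indexing call until the next refresh makes the
        # document searchable, at the cost of slower writes.
        url += "?refresh=wait_for"
    return url

# e.g. the comments-page scenario from the answer above:
print(index_url("comments", "42", wait_for_refresh=True))
```
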
If you can deal with these issues then there's certainly no reason why you can't use Elasticsearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data but again this depends on your specific use case.
As always, weigh the benefits, do some experimentation and see what works best for you.
DISCLAIMER: This answer was written a while ago, for the Elasticsearch 1.x series. These criticisms still largely stand for the 2.x series, but Elastic is working on them: the 2.x series comes with more mature tools, APIs and plugins, for example, security-wise, Shield, or clients like Logstash or Beats, etc.
I'd highly discourage most users from using Elasticsearch as their primary datastore. It will work great until your cluster melts down due to a network partition. Even settings such as minimum_master_nodes, which the ES pros always set, won't save you. See this excellent analysis by Aphyr in his Call Me Maybe series:
http://aphyr.com/posts/317-call-me-maybe-elasticsearch
eliasah is right, it depends on your use case, but if your data (and job) is important to you, stay away.
Keep the golden record of your data in something really focused on persistence, and sync your data out to search from there. It adds extra complexity and resources, but will result in a better night's rest :)
There are plenty of ways to go about this, and if Elasticsearch does everything you need, you can look into Kafka for persisting all the events going into the cluster, which would allow replaying them if things go wrong. I like this approach as it provides an async ingestion pipeline into Elasticsearch that also handles the persistence.
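The ingestion side of such a pipeline can be sketched as a pure transformation from Kafka events to the newline-delimited body the ES _bulk API expects. The event shape and index name are assumptions, and the Kafka consumer itself is elided:

```python
import json

def events_to_bulk(events, index="docs"):
    """Render events as the NDJSON body of an ES _bulk request.

    Each event is assumed (hypothetically) to be a dict with an "id"
    and a "body" payload. Because the events stay in Kafka, the same
    batch can be replayed to rebuild the index if things go wrong.
    """
    lines = []
    for ev in events:
        lines.append(json.dumps({"index": {"_index": index, "_id": ev["id"]}}))
        lines.append(json.dumps(ev["body"]))
    return "\n".join(lines) + "\n"

payload = events_to_bulk([{"id": "1", "body": {"title": "hello"}}])
print(payload)
```
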

using elasticsearch replications in django

The current setup is composed of 3 Elasticsearch servers, of which one is the master and the other 2 are slaves, or at least that is how they define themselves.
It might happen that the master goes down for any kind of problem, which means that Elasticsearch is going to elect a new eligible master and switch to it.
Currently the problem is that my application on the frontend servers is totally unaware of this, so it will keep making queries to the same backend, which of course takes down my whole website because that node will not answer. I had a look around but was not able to find anything related to switching backends on the fly, even in the new Haystack 2.x.
Any suggestion?
Many thanks in advance
It doesn't seem necessary to me to leave this to your application layer. Most probably you access ES through HTTP REST requests, which means you can put any HTTP load balancer, like Nginx, in front of your ES servers; it will switch to another node if one times out.
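A minimal sketch of that Nginx setup; the hostnames, ports and failure thresholds are illustrative assumptions:

```nginx
# Round-robin over the three ES nodes; a node that fails 3 times in a
# row is taken out of rotation for 30 s, so queries move to the others.
upstream elasticsearch {
    server es1.internal:9200 max_fails=3 fail_timeout=30s;
    server es2.internal:9200 max_fails=3 fail_timeout=30s;
    server es3.internal:9200 max_fails=3 fail_timeout=30s;
}

server {
    listen 8080;
    location / {
        proxy_pass http://elasticsearch;
    }
}
```

The frontend application then points at the load balancer (port 8080 here) instead of any individual ES node, so a master re-election is invisible to it.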

Lucene.NET + Azure + AzureDirectory or something else?

Good morning.
I am currently working on a project which was originally going to be hosted on a physical server with SQL 2008 R2, but it looks like we are moving towards the cloud and Azure... Since SQL Azure does not currently support full-text indexing, I have been looking at Lucene.NET with the AzureDirectory project for back-end storage. The way this will work is that updates will come in and be queued. Once processed, they will be placed in a ToIndex queue, which will kick off Lucene.NET indexing. I am just wondering if there would be a better way of doing this? We don't need to use Azure for this project, so if there is a better solution somewhere, please tell us... the main hosting requirement is that it is in Europe (Azure and Amazon data centers in Dublin are handy; RackSpace in the US is not so handy).
Thanks.
I haven't used that project, but it looks promising. From what I understand, the basic issue is that Lucene requires a file system. I see 2 other possible solutions (basically just doing what the library does):
Use Azure Drive Storage and a worker role
Use Drive storage, but use a VM (if there are config issues with using a worker role)
http://go.microsoft.com/?linkid=9710117
SQLite also has full text search available, but it has the same basic issue - it requires a filesystem:
http://www.sqlite.org/fts3.html
I have another solution for you, but it's a bit more radical and a bit more conceptual.
You could create your own indexes using Azure Table Storage. Create partitions based on each word in your documents; as all tables are indexed on the partition key, per-word search should be fast, and you just do in-memory joins for multi-word searches.
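The idea above is essentially an inverted index. A toy sketch, with Azure Table Storage replaced by a plain in-memory dict (one "partition" per word, document ids as rows, multi-word search as a set intersection):

```python
from collections import defaultdict

index = defaultdict(set)  # word (partition key) -> set of document ids

def add_document(doc_id, text):
    """Add one row per word, keyed by that word's partition."""
    for word in text.lower().split():
        index[word].add(doc_id)

def search(*words):
    """Fetch each word's partition, then join them in memory."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

add_document("d1", "azure table storage")
add_document("d2", "azure queue storage")
print(sorted(search("azure", "storage")))  # both documents match
```

In real Table Storage each `index[word]` lookup would be a query on one partition, which is the cheap access path the answer relies on.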
You could host it as an Azure Website as long as your Lucene index is less than 1GB.
I did this recently when I rewrote Ask Jon Skeet to be hosted as a self-contained Azure Website. It uses WebBackgrounder to poll the Stack Overflow API for changes before updating the Lucene index.
