How to know Elasticsearch cluster capacity?

Recently I've been working on a project that requires figuring out Elasticsearch capacity, since we will be pushing many more messages per second into the ES system.
We have 3 types of nodes in the ES cluster: master, data, and client.
How do we know the maximum insert count per second our cluster can handle? Do we need to care about the bandwidth of the client nodes?

As per the comments above, you need to benchmark your cluster hardware and settings with your proposed data structure, using a tool like Rally: https://esrally.readthedocs.io/en/stable/
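If you just want a rough baseline before setting up Rally, you can measure single-client bulk throughput with the official Python client. A minimal sketch, assuming a locally reachable cluster and a throwaway index called capacity-test (both assumptions; swap in your real message shape):

    import time
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # assumed endpoint

    def actions(n):
        # Stand-in documents; use your real message shape here.
        for i in range(n):
            yield {"_index": "capacity-test", "_source": {"msg": f"payload {i}"}}

    N = 100_000
    start = time.monotonic()
    helpers.bulk(es, actions(N), chunk_size=5_000)
    elapsed = time.monotonic() - start
    print(f"{N / elapsed:,.0f} docs/sec from a single client")

This only measures one client against the cluster; Rally adds proper load generation, warmup, and metrics, so treat this as a sanity check rather than a capacity figure.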

Related

Elasticsearch throughput on a single node

I am building a streaming + analytics application using Kafka and Elasticsearch. Using Kafka Streams apps I am continuously pushing data into Elasticsearch. Can a single-node Elasticsearch setup with 16GB RAM handle a write load of 5,000 msgs/sec? The message size is 10KB.
There are many other conditions to consider, like cluster memory, network latency, and concurrent read operations. Write operations in Elasticsearch are relatively slow. Note the raw volume, too: 5,000 msgs/sec at 10KB each is roughly 50MB/s, or over 4TB/day before replication and indexing overhead, so the indexes will grow quickly and performance may degrade over time, at which point you'll need to scale vertically.
That said, I think this could work with enough RAM and a queue where pending items wait to be indexed when the cluster is slow.
Adding more nodes should help with uptime, which is normally a concern with user-facing production apps.
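A minimal sketch of that queueing idea, assuming a bounded in-memory queue between the Kafka consumer and Elasticsearch (the index name, queue size, and batch size are illustrative; in practice the Kafka topic itself can serve as the durable queue):

    import queue
    import threading
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # assumed endpoint
    pending = queue.Queue(maxsize=50_000)  # bounded: producer blocks when ES lags

    def indexer():
        while True:
            batch = [pending.get()]
            while len(batch) < 1_000:
                try:
                    batch.append(pending.get_nowait())
                except queue.Empty:
                    break
            try:
                helpers.bulk(es, ({"_index": "events", "_source": d} for d in batch))
            except Exception:
                for d in batch:
                    pending.put(d)  # requeue and retry later; use a deterministic
                                    # _id so retried documents stay idempotent

    threading.Thread(target=indexer, daemon=True).start()

    # The Kafka consumer loop then just calls pending.put(message), which
    # blocks while the queue is full and so applies backpressure upstream.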

Fastest way to index huge data in elastic

I have been asked to index more than 3*10^12 documents into an Elasticsearch cluster. The cluster has 50 nodes, each with 40 cores and 128GB of memory.
I was able to do it with the _bulk API in Python (multi-threaded), but I could not reach more than 50,000 records per second for one node.
So I want to know:
What is the fastest way to index data?
As far as I know, I can index data through each data node; does throughput grow linearly? I mean, can I get 50,000 records per second for each node?
Per your question:
Balance your resources. Both Elasticsearch and your application should aim to run at 60-80% of server utilization in order to achieve the best performance. On the application side, you can reach that utilization by using multiprocessing in Python, or Unix xargs, together with the Elasticsearch _bulk API.
In my experience, Elasticsearch performance grows almost linearly (within ~99%) if you have a correct design for your cluster and index/shard settings, so 50,000 records/second for each node is possible.
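A hedged sketch of that multi-threaded _bulk approach, using parallel_bulk from the official Python client (the index name, document source, and thread/chunk sizes are assumptions to tune against your own cluster):

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # point this at your cluster

    def gen_actions(docs):
        for doc in docs:
            yield {"_index": "docs", "_source": doc}

    docs = ({"n": i} for i in range(1_000_000))  # stand-in for your real data

    # parallel_bulk fans bulk requests out over a thread pool; tune thread_count
    # and chunk_size until the data nodes sit at ~60-80% utilization.
    for ok, item in helpers.parallel_bulk(es, gen_actions(docs),
                                          thread_count=8, chunk_size=5_000):
        if not ok:
            print("failed:", item)

To scale out the client side, run several such processes and spread them across the HTTP endpoints of different nodes.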
Here are some useful links that would help:
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
https://qbox.io/support/article/choosing-a-size-for-nodes
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/modules-threadpool.html (for monitoring your cluster during work loads)
It's recommended to do performance testing and then monitor your clusters + application servers closely during workloads. (I used Unix htop + New Relic combined :D)

Elasticsearch - general architecture and Elastic Cloud questions

Background
We're designing the architecture of a new system around Elasticsearch, and we plan to use Elastic Cloud, based on reviews contrasting their service with AWS's offering and with self-hosting on an EC2 instance. As we design the system, I'm trying to learn from a small test project my team deployed to Elastic Cloud 6 months ago. While I've spent a lot of time reading the Elasticsearch docs, Elasticsearch: The Definitive Guide, and the Elastic Cloud docs, there are some concepts here that I'm still not understanding.
Our Test Project's issues
Our test project uses the default of 5 primary shards and 1 replica shard per primary. It was configured using the default deployment options on Elastic Cloud with a single node, currently with 2GB of memory. Because there is only one node, and because replica shards are never assigned to the same node as their primary shard (reason 2), none of the replicas are getting assigned. Also, this project uses time-based data, and is creating one index per account per day, resulting in about 10 indexes per day (or 100 shards), and over time, the proverbial Kagillion Shards. This system was only ever meant to have several months of data on it at a time, so the solution has been to manually delete old data when memory on this deployment runs out.
The New System
Our new system is meant to hold five years' worth of time-based data, which is projected to grow to 250GB in size. The current implementation uses a single index for the time-based data, with 6 primary shards and 1 replica per primary. This decision was made based on reading that a single shard should aim for a maximum of 30GB in size.
Questions
Our old system had one node with too many indexes (over 100) and too many shards (over 1,000), and it seems like our new one is being designed with too few (one index for 5+ years of data). It seems a better indexing strategy according to the time-based data recommendations would be to create one index per week or month? That being said, according to another answer on SO the optimal number of indexes per node is 1, so what is the utility in creating multiple indices for time-based data in the first place if we're only running on one node?
How does one add a node to an ES deployment in Elastic Cloud? Currently all of the replica shards in the test project are unassigned, because the deployment only has one node. There is a slider which allows you to easily choose the memory of each node in a deployment (between 1GB and 250GB), however I see no way to add multiple nodes, which is confusing because it seems like basic functionality for Elasticsearch.
Our test project's node has restarted several times, always when there is lots of old data on the node, and therefore memory pressure. The solution has been to delete old data (as the test project was only meant to have several months of data at a time), but it appears the node didn't lose data when it restarted. Why would this be?
Our test project has taken no snapshots, which are supposed to happen automatically on Elastic Cloud every 30 minutes. I've asked their support about this, but just curious to see if anyone knows what could cause this and how to resolve it?
Our test project uses the default of 5 primary shards and 1 replica shard per primary. It was configured using the default deployment options on Elastic Cloud with a single node
Clearly, on a single node, you cannot have replicas. So your index should have been configured with 0 replicas, and you can change that dynamically to get your cluster back to green (PUT index/_settings {"index.number_of_replicas": 0}), simple as that.
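If you prefer to do that from code, a one-liner with the Python client looks something like this (endpoint and index name assumed; older client versions take body= instead of settings=):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed endpoint
    # number_of_replicas is a dynamic setting, so this takes effect immediately.
    es.indices.put_settings(index="index", settings={"index.number_of_replicas": 0})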
Also, this project uses time-based data, and is creating one index per account per day, resulting in about 10 indexes per day (or 100 shards)
I cannot tell whether 50 new primary shards (10 indexes) per day was reasonable or not, because you don't give any information regarding the volume of data in your test project. But it's probably too many.
It seems a better indexing strategy according to the time-based data recommendations would be to create one index per week or month?
Having five years' worth of data in a single index is perfectly possible; it doesn't really depend on how old the data is, but on how big it grows. You mention 250GB, and you also know a shard shouldn't grow over 30GB (which again depends on the spec of the hardware underneath, more on that later). Since you have only 6 shards for that index, each shard will grow over 40GB (which is OK according to this), but to be on the safe side you should probably increase to 8-9 shards, or split your data into yearly/monthly indices.
The 30GB-ish limit per shard also depends on how much heap your nodes have. If your nodes have 2GB of heap, then 30GB shards are clearly too big. Since you're on ES Cloud and plan to store 250GB of data, you must have chosen a node capacity of 16GB heap + 384GB storage (or bigger). With 16GB of heap, 30GB shards are reasonable, but in my opinion you'll still need several nodes. You can verify how many nodes you have using GET _cat/nodes?v.
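The arithmetic behind that recommendation, as a quick sanity check (250GB is the projection from the question; 30GB is the rule-of-thumb ceiling per shard):

    import math

    total_gb = 250       # projected data volume from the question
    max_shard_gb = 30    # rule-of-thumb ceiling per shard

    print(total_gb / 6)                        # ~41.7GB per shard with 6 primaries
    print(math.ceil(total_gb / max_shard_gb))  # 9 primaries keep shards under 30GB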
That being said, according to another answer on SO the optimal number of indexes per node is 1...
What Chris is describing is a theoretical/ideal setting that is almost never possible/advisable/desired in reality. You do want several shards in your index, and the reason is that when your data grows, you want to be able to scale out to more than one node; that's the whole point of ES, otherwise you'd be better off embedding the Lucene library directly in your project.
..., so what is the utility in creating multiple indices for time-based data in the first place if we're only running on one node?
First check how many nodes you have in your cluster using GET _cat/nodes?v, but clearly, if you've been assigned a single node for 250GB of data split over 6-8 shards, a single node is indeed not ideal.
How does one add a node to an ES deployment in Elastic Cloud?
Right now, you can't. However, at the last Elastic{ON} conference, Elastic announced that it will be possible to pick the number of nodes or the kind of deployment (hot/warm, etc) you want to set up.
Currently all of the replica shards in the test project are unassigned, because the deployment only has one node.
You don't really need replicas in a test project, right?
The solution has been to delete old data (as the test project was only meant to have several months of data at a time), but it appears the node didn't lose data when it restarted. Why would this be?
How did you delete the data? Between the time you deleted the data and before the node restarted, did you witness that the data was indeed gone?
Our test project has taken no snapshots, which are supposed to happen automatically on Elastic Cloud every 30 minutes.
This is weird, since on ES Cloud your cluster generally gets snapshotted every 30 minutes. What do you see under Deployments > cluster-id > Elasticsearch > Snapshots? What does ES Cloud support say about it? And what do you get when running GET _cat/repositories?v and GET _cat/snapshots/found-snapshots?v? (Update your question with the results.)

Elasticsearch architecture

Is there a way to sync multiple ES clusters with each other? The ES docs discourage having a single cluster span multiple data centers, so to avoid that I'd run a distinct ES cluster in each data center. I also need the same data indexed in each cluster.
One way to achieve that would be to send each document to every cluster, but issuing 'n' write requests seems unnecessary. Additionally, if some write requests fail, the clusters could potentially go out of sync.
Is there a way for a cluster to "subscribe" to changes in another cluster? Or to send the writes to a master cluster (whichever one is closest to the data source) and let it eventually replicate to the other ones?
Edit: I've read about tribe nodes. The docs say they work just for reads and have some limitations. Is that something that would let me do this?
You can set up a custom routing/allocation strategy based on a data center id [1]. This will ensure that one copy of each shard goes into each data center. Example:
cluster.routing.allocation.awareness.force.dc.values: dc1,dc2
cluster.routing.allocation.awareness.attributes: dc
[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.6/modules-cluster.html
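Note that awareness only takes effect if every node also declares which data center it lives in via a custom node attribute in elasticsearch.yml. A sketch using the 1.x attribute syntax from the linked docs (modern versions spell it node.attr.dc):

node.dc: dc1    # on each node in data center 1; use dc2 on the others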

MongoDB capacity planning

I have an Oracle database with around 7 million records/day and I want to switch to MongoDB (~300GB).
To set up a POC, I'd like to know how many nodes I need. I'm thinking 2 shards, each a 3-node replica set, will be enough, but I'd like to know what you think about it :)
I'd like to have an HA setup :)
Thanks in advance!
For MongoDB to work efficiently, you need to know your working set size. You need to know how much data 7 million records/day amounts to, because this is the active data that will need to stay in RAM for high performance.
Also, be very sure WHY you are migrating to Mongo. I'm guessing that in your case it's scalability, but know your data well before doing so.
For your POC, keeping two shards means roughly 150GB on each. If you have that much disk available, no problem.
Give some consideration to your shard keys: which fields does it make sense to shard your data set on? This will affect the decision of how many shards to deploy versus the capacity of each shard. You might go with relatively few shards (maybe two or three big, deep shards if your data can be easily segmented into halves or thirds), or several lighter, thinner shards if you can shard on a more diverse key.
It is relatively straightforward to upgrade from a MongoDB replica set configuration to a sharded cluster (each shard is actually a replica set), so rather than predetermining that sharding is the right solution to start with, I would think about what your reasons for sharding are (e.g. will your application's requirements outgrow the resources of a single machine; how much of your data set will be the active working set for queries, etc.).
It would be worth starting with replica sets and benchmarking this as part of planning your architecture and POC.
Some notes to get you started:
MongoDB's journaling, which is enabled by default as of 1.9.2, provides crash recovery and durability in the storage engine.
Replica sets are the building block for high availability, automatic failover, and data redundancy. Each replica set needs a minimum of three nodes (for example, three data nodes or two data nodes and an arbiter) to enable failover to a new primary via an election.
Sharding is useful for horizontal scaling once your data or writes exceed the resources of a single server.
Other considerations include planning your documents based on your application usage: for example, if your documents will be updated frequently and grow in size over time, you may want to consider manual padding to prevent excessive document moves.
If this is your first MongoDB project you should definitely read the FAQs on Replica Sets and Sharding with MongoDB, as well as the FAQ for Application Developers.
Note that choosing a good shard key for your use case is an important consideration. A poor choice of shard key can lead to "hot spots" for data writes, or unbalanced shards if you plan to delete large amounts of data.
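A minimal sketch of wiring up the sharded POC once the mongos router and both shard replica sets are running (the database, collection, and shard key names are hypothetical; pick a key that matches your query patterns):

    from pymongo import MongoClient

    # Assumption: a mongos router on localhost fronting the two shards.
    client = MongoClient("mongodb://localhost:27017")

    # Enable sharding for the database, then shard the collection.
    client.admin.command("enableSharding", "poc")
    # A hashed key on a high-cardinality field spreads writes evenly and
    # avoids the write hot spots mentioned above.
    client.admin.command("shardCollection", "poc.records",
                         key={"account_id": "hashed"})

Hashed sharding trades away efficient range queries on the key, so if your access pattern is range-based (e.g. by date), a compound ranged key may fit better.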
