ELK Stack and scaling - elasticsearch

Bear with me here. I have spent the last week or so familiarising myself with the ELK Stack.
I have a working single box solution running the ELK stack, and I have the basics down on how to forward more than one type of log, and how to put them into different ES indexes.
This is all working pretty well, and I would like to expand operations.
My question is more how to scale the solution out to cover more data needs/requirements.
The current solution is handling a smaller subset of data, and working fine, but I would like to aggregate a lot more data. For example I am currently pushing message tracking logs from 4 mailbox servers, I want to do the same but for 40 mailbox servers, and much, much busier ones.
I would also like to push over IIS Log files from the Client Access servers, there are 18 CAS servers, and around 30 mins of IIS logs per server during peak time were 120MB in size, with almost 1 million records.
This volume of data would most likely collapse a single box running ELK.
I haven't really looked into it, but I read that ES allows for some form of clustering to add more instances. Does the same apply to Logstash as well? Should Kibana be run on more than one server, or on a different server from both Logstash and ES?
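To put the IIS numbers from the question into perspective, here is a rough back-of-the-envelope sketch in Python. The function name is made up for illustration, and it assumes all 18 CAS servers peak at the same time with the quoted 30-minute volume.

```python
def peak_rates(servers, records_per_window, mb_per_window, window_minutes):
    """Return (events_per_second, megabytes_per_second) summed over all servers."""
    window_seconds = window_minutes * 60
    events_per_second = servers * records_per_window / window_seconds
    mb_per_second = servers * mb_per_window / window_seconds
    return events_per_second, mb_per_second


# Figures from the question: 18 CAS servers, about 1 million records and
# 120 MB per server in a 30-minute peak window.
events, megabytes = peak_rates(18, 1_000_000, 120, 30)
print(f"{events:.0f} events/s, {megabytes:.1f} MB/s at peak")
```

Under those assumptions the IIS logs alone come to roughly 10,000 events per second at peak, before the message tracking logs are added.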

You will hit limits with logstash if you're doing a lot of processing on the records - groks, conditionals, etc. Watch the cpu utilization of the machine for hints.
For elasticsearch itself, it's about RAM and disk IO. Having more nodes in a cluster should provide both.
With two elasticsearch nodes, you'll get redundancy (a copy on both machines). Add a third, and you can start to realize an IO benefit (writing two copies to three machines spreads the IO).
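The IO argument can be sketched with a little arithmetic. This is a simplification that assumes shards are spread evenly across the nodes; the function name is only illustrative.

```python
def write_share_per_node(copies, nodes):
    """Average fraction of all indexed documents that each node has to write.

    copies = 1 primary + number of replicas. Assumes an even shard spread.
    """
    if nodes < copies:
        raise ValueError("need at least one node per copy")
    return copies / nodes


# Two copies of the data (primary + one replica) on 2, 3 and 4 nodes.
for n in (2, 3, 4):
    print(n, "nodes:", round(write_share_per_node(2, n), 2))
```

With two nodes every node writes every document (share 1.0); with three nodes each one writes about two thirds of them, which is where the IO benefit starts.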
The ultimate data node will have 64GB of RAM on the machine, with 31GB allocated to elasticsearch.
You'll probably want to add non-data nodes, which handle the routing of data to be indexed and the 'reduce' phase when running queries. Put two of them behind a load balancer.

As Alain mentioned, adding more ES nodes will improve performance (and give you redundancy).
On the logstash front, we have two logstash servers feeding into ES - at the moment we just direct different servers to log to the different logstash servers, but we're likely to be adding a HA-Proxy layer in front to do this automatically, and again provide redundancy.
With Kibana, I wouldn't worry too much - as far as I'm aware most of the processing is done in the client browser, and whatever isn't depends more on the performance of the ES cluster.

Related

Elasticsearch throughput on a single node

I am building a streaming + analytics application using kafka and elasticsearch. Using kafka streams apps I am continuously pushing data into elasticsearch. Can a single node elasticsearch setup with 16GB RAM handle a write load of 5000 msgs/sec? The message size is 10KB.
There are many other conditions to consider, like cluster memory, network latency and read operations. Writing operations in Elasticsearch are slow. Also it seems like the indexes could grow quickly so performance might start to degrade over time and you'll need to scale vertically.
That said, I think this could work with enough RAM and a queue where pending items wait to be indexed when the cluster is slow.
Adding more nodes should help with uptime, which is normally a concern with user-facing production apps.
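The "queue where pending items wait" idea could look something like the sketch below. It is not tied to any particular client library: `send_batch` is a hypothetical callable you would supply (for example a wrapper around a bulk indexing request), and the batch size and queue limit are arbitrary placeholder values.

```python
from collections import deque
from itertools import islice


class IndexBuffer:
    """Bounded in-memory queue that hands documents to a sender in batches.

    `send_batch` is a hypothetical callable: it receives a list of documents
    and returns True if the cluster accepted them, False otherwise.
    """

    def __init__(self, send_batch, batch_size=500, max_pending=100_000):
        self.send_batch = send_batch
        self.batch_size = batch_size
        self.max_pending = max_pending
        self.pending = deque()

    def add(self, doc):
        """Queue one document; return False if the buffer is full."""
        if len(self.pending) >= self.max_pending:
            return False  # caller should apply back-pressure, e.g. pause consuming
        self.pending.append(doc)
        return True

    def flush(self):
        """Send queued documents in batches; stop at the first rejected batch."""
        sent = 0
        while self.pending:
            batch = list(islice(self.pending, self.batch_size))
            if not self.send_batch(batch):
                break  # cluster is slow or down: keep the batch queued for later
            for _ in batch:
                self.pending.popleft()
            sent += len(batch)
        return sent
```

Documents are only removed from the queue once a batch has been accepted, so a slow cluster just makes the backlog grow until `add` starts returning False. With Kafka in front, pausing the consumer at that point is one way to let Kafka itself hold the backlog.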

Best way to configure load balancer before elastic search cluster

Currently we have a 3 node cluster running 6.2 version in GCP with no dedicated master nodes. All are identical in terms of configuration (4 vCPU, 15 GB RAM nodes - 7GB set as xmx for ES) and settings. We use org.elasticsearch.client.RestClient to access ES cluster. This setup has been running fine for about 2 months now. This morning we experienced some issues on our app server and on checking the logs, I could see that all operations (index and search) on ES were slow. Application that mainly creates data in the cluster accesses one of the nodes (say N1) and application that mainly searches (we have several instances of this application running on a cluster of 2-16 nodes) accesses another node (say N2). Thread dumps showed many threads waiting to get a connection to ES :
at org.elasticsearch.client.RestClientBuilder.createHttpClient(RestClientBuilder.java:202)
This raised doubts on whether one single node is being overburdened with all connection requests coming in. Hence I changed some of the search nodes to connect to third node (say N3) instead of N2. After this change, the situation improved and accessing data from ES became fast. I am not sure if this was the only reason or whether the load on our application server had reduced drastically by the time I figured this out and made the change. I feel this change would have made quite a difference. Hence I feel setting up a LB to distribute the load will be better.
I have read several posts in this forum as well as on elastic forum about the necessity of load balancer in front of ES cluster and I see different responses in different posts :
a) Some say LB is not necessary
b) Some recommend to setup a LB and include only master nodes
c) Some recommend to setup two LBs - one for writing/indexing documents in ES and include only data nodes in this LB and another for queries and include only client nodes in this LB
What is the recommended way of setting up LB?

Optimal way to set up ELK stack on three servers

I am looking to set up an ELK stack and have three servers to do so. While I have found plenty of documentation and tutorials about how to actually install, and configure elasticsearch, logstash, and kibana, I have found less information about how I should set up the software across my servers to maximize performance. For example, would it be better to set up elasticsearch, logstash, and kibana on all three instances, or perhaps install elasticsearch on two instances and logstash and kibana on the third?
Related to that question, if I have multiple elasticsearch servers in my cluster, will I need a load balancer to spread requests to them, or can I send the data to one server and have it distribute the data accordingly?
The size of your machines would also be important. Three machines with 8GB of RAM is much different than three with 64GB or more...
Kibana takes very few resources. Logstash is more CPU-heavy. Elasticsearch is more RAM heavy.
With an elasticsearch cluster, you usually want a replica of each shard for redundancy. That's usually done with two servers. If you have a third elasticsearch server, then you'll get an IO boost (writing two copies of the data to three servers lowers the load). Also, an even number of servers can get confused as to which is the master, so three will help prevent "split brain" problems.
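The "split brain" point comes down to majority arithmetic, which a few lines of Python can illustrate. The function names are made up for this sketch; in older Elasticsearch versions the resulting quorum is the value usually given to `discovery.zen.minimum_master_nodes`.

```python
def master_quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - master_quorum(master_eligible_nodes)


for n in (2, 3, 5):
    print(n, "nodes: quorum", master_quorum(n), "tolerates", tolerated_failures(n))
```

Two nodes need both present to form a majority, so they tolerate no failure; three nodes still need only two, so one can drop out without the cluster splitting.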
Those two or three nodes would be "data" nodes, so if you throw queries or indexing requests at them, they may need to move the request to a different server (the one with the data, etc). A request also has a "reduce" phase, where the data from each node is combined before being returned. Having a smaller "client" node - where queries and index requests go - helps with that. Of course, you'd want two, to make them redundant.
Logstash is best run multithreaded, so having multiple cpus that you can dedicate is nice. Having a redundant/load-balanced logstash machine is also nice. Kibana could run on these machines as well.
So, we're quickly up to 7 machines. Not what you wanted to hear, right?
If you're firmly limited to 3 machines, you'd want to run elasticsearch on all three as mentioned above. You need to shoehorn in the rest.
Logstash on two, kibana on one? Then you have a single point of failure for kibana.
How about logstash on all three and kibana on all three? The load would be distributed around, so hopefully would be a small increment for each server. And, if the machines are beefy enough, it should be OK.
I have machines in one cluster that run logstash,
The general recommendation is to allocate 1/2 the system RAM (up to ~31GB) to elasticsearch, leaving the rest to the operating system. If you were going to run logstash and kibana on the same machines, you'd want to lower that (to maybe 40%?), give logstash some (15%?) and leave the rest to the OS.
Clearly, the size of your machines is important here.
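As a rough illustration of the memory split described above, here is a small Python sketch. The 50% / 40% / 15% figures and the ~31GB cap come from the answer; the function name and the returned keys are invented for the example, and the percentages are rules of thumb rather than hard limits.

```python
def memory_plan(total_gb, colocated=False, heap_cap_gb=31):
    """Split a machine's RAM between elasticsearch, logstash and the OS.

    colocated=False: elasticsearch gets half the RAM (capped), the OS the rest.
    colocated=True:  elasticsearch ~40%, logstash ~15%, the OS the rest.
    """
    es_fraction, logstash_fraction = (0.40, 0.15) if colocated else (0.50, 0.0)
    es_heap = min(total_gb * es_fraction, heap_cap_gb)
    logstash = total_gb * logstash_fraction
    return {
        "elasticsearch_heap_gb": es_heap,
        "logstash_gb": logstash,
        "os_and_cache_gb": total_gb - es_heap - logstash,
    }


print(memory_plan(64))                 # dedicated 64GB data node
print(memory_plan(8, colocated=True))  # small box running everything
```

On a small 8GB machine that runs everything, elasticsearch ends up with only a little over 3GB of heap, which shows why the size of the machines matters so much.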

Why do I need a broker for my production ELK stack + machine specs?

I've recently stood up a test ELK stack Ubuntu box to test the functionality and have been very happy with it. My use case for production would involve ingesting at least 100GB of logs per day. I want to be as scalable as possible, as this 100GB/day can quickly rise as we add more log sources.
I read some articles on ELK production, including the fantastic Logz.io ELK Deployment. While I have a general idea of what I need to do, I am unsure about some core concepts: how many machines I need for such a large amount of data, and whether I need a broker like Redis included in my architecture.
What is the point of a broker like Redis? In my test instance, I have multiple log sources sending logs over TCP,syslog, and logstash forwarder to my Logstash directly on my ELK server (which also has Elasticsearch, Nginx, and Kibana installed configured with SSL).
In order to have a highly available, state of the art production cluster, what machines+specs do I need for at least 100GB of data per day, likely scaling toward 150GB or more in the future? I am planning on using my own servers. From what I've researched, the starting point should look something like this (assuming I include Redis):
2/3 servers with a Redis+Logstash(indexer) instance for each server. For specs, I am thinking 32GB RAM, fast I/O disk 500GB maybe SSD, 8 cores (i7)
3 servers for Elasticsearch (this is the one I am most unsure about) -- I know I need at least 3 master nodes and 2 data nodes, so 2 servers will have 1 master/1 data each -- these will be beefy 64GB RAM, 20TB, 8 cores. The other remaining master node can be on a low spec machine, as it is not handling data.
2 servers for Nginx/Kibana -- these should be low spec machines, as they are just the web server and UI. Is a load balancer necessary here?
EDIT: Planning on keeping the logs for 60 days.
As for Redis, it acts as a buffer in case logstash and/or elasticsearch are down or slow. If you're using the full logstash or logstash-forwarder as a shipper, it will detect when logstash is unavailable and stop sending logs (remembering where it left off, at least for a while).
So, in a pure logstash/logstash-forwarder environment, I see little reason to use a broker like redis.
When it becomes important is for sources that don't care about logstash's status and don't buffer in their side. syslog, snmptrap, and others fall into this category. Since your sources include syslog, I would bring up brokers in your setup.
Redis is a RAM-intensive app, and the amount of memory that you have will dictate how long of a logstash outage you can withstand. On a 32GB server (shared with logstash), how much of the memory would you give to redis? How large is your average document size? How many documents would it take to fill the memory? How long does it take to generate that many documents? In my experience, redis fails horribly when the memory fills, but that could just have been me.
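Those questions chain together into a simple estimate. The sketch below is only that: the function name is invented, the example numbers (16GB for redis, 1KB documents, 2,000 documents per second) are assumptions and not figures from the question, and redis' own per-entry overhead is ignored, so the real window will be shorter.

```python
def outage_window_seconds(redis_memory_gb, avg_doc_bytes, docs_per_second):
    """Rough time until the redis buffer fills if nothing is draining it."""
    capacity_docs = int(redis_memory_gb * 1024**3) // avg_doc_bytes
    return capacity_docs / docs_per_second


# Assumed example values, purely for illustration.
seconds = outage_window_seconds(redis_memory_gb=16, avg_doc_bytes=1024, docs_per_second=2000)
print(f"buffer lasts about {seconds / 3600:.1f} hours")
```

Plug in your own document size and event rate to see how long an indexer outage the broker could absorb.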
Logstash is a CPU-intensive process as all the filters get executed.
As for the size of the elasticsearch cluster, @magnus already pointed you to some information that might help. Starting with 64GB machines is great; then scale horizontally as needed.
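For the disk side of the sizing, a quick estimate from the numbers in the question (100GB to 150GB per day, 60 days of retention) might look like this. The function name is illustrative, one replica is assumed, and `overhead` is a placeholder factor for how much bigger or smaller the index is than the raw logs, which you would have to measure on your own data.

```python
def storage_needed_tb(gb_per_day, retention_days, replicas=1, overhead=1.0):
    """Total disk needed across the cluster, in TB (1 TB = 1000 GB)."""
    total_gb = gb_per_day * retention_days * (1 + replicas) * overhead
    return total_gb / 1000


print(storage_needed_tb(100, 60))  # today's volume
print(storage_needed_tb(150, 60))  # expected growth
```

With one replica and an index the same size as the raw logs, that is about 12TB today and 18TB at 150GB/day, to compare against the disk planned for the data nodes.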
You should have two client (non-data) nodes that are used as the access point for inserts (efficiently dispatching the requests to the correct data node) and searches (handling the 'reduce' phase with data returned from the data nodes). Two of these in a failover config would be a good start.
Two kibana machines will give you redundancy. Putting them in a failover config is also good. nginx was more used with kibana3, I believe. I don't know if people are using it with kibana4 or have moved to 'shield'.
Hope that helps.

How to setup ElasticSearch cluster with auto-scaling on Amazon EC2?

There is a great tutorial elasticsearch on ec2 about configuring ES on Amazon EC2. I studied it and applied all recommendations.
Now I have AMI and can run any number of nodes in the cluster from this AMI. Auto-discovery is configured and the nodes join the cluster as they really should.
The question is: how do I configure the cluster in a way that lets me automatically launch/terminate nodes depending on cluster load?
For example I want to have only 1 node running when we don't have any load and 12 nodes running on peak load. But wait, if I terminate 11 nodes in cluster what would happen with shards and replicas? How to make sure I don't lose any data in cluster if I terminate 11 nodes out of 12 nodes?
I might want to configure S3 Gateway for this. But all the gateways except for local are deprecated.
There is an article in the manual about shards allocation. Maybe I'm missing something very basic, but I should admit I failed to figure out if it is possible to configure one node to always hold all the shard copies. My goal is to make sure that if this would be the only node running in the cluster we still don't lose any data.
The only solution I can imagine now is to configure the index to have 12 shards and 12 replicas. Then when up to 12 nodes are launched every node would have a copy of every shard. But I don't like this solution because I would have to reconfigure the cluster if I ever wanted more than 12 nodes on peak load.
Auto scaling doesn't make a lot of sense with ElasticSearch.
Shard moving and re-allocation is not a light process, especially if you have a lot of data. It stresses IO and network, and can degrade the performance of ElasticSearch badly. (If you want to limit the effect you should throttle cluster recovery using settings like cluster.routing.allocation.cluster_concurrent_rebalance, indices.recovery.concurrent_streams, indices.recovery.max_size_per_sec . This will limit the impact but will also slow the re-balancing and recovery).
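As a sketch of how those throttles could be applied, the snippet below builds a request body for the cluster settings API (`PUT /_cluster/settings`). The setting names are copied from the answer above and may differ between Elasticsearch versions, so check the documentation for your release; the values are placeholders, not recommendations.

```python
import json


def recovery_throttle_settings(concurrent_rebalance=1,
                               concurrent_streams=2,
                               max_size_per_sec="20mb"):
    """Build a transient cluster-settings body that throttles rebalancing and recovery.

    Setting names are taken from the answer; the default values are assumptions.
    """
    return {
        "transient": {
            "cluster.routing.allocation.cluster_concurrent_rebalance": concurrent_rebalance,
            "indices.recovery.concurrent_streams": concurrent_streams,
            "indices.recovery.max_size_per_sec": max_size_per_sec,
        }
    }


print(json.dumps(recovery_throttle_settings(), indent=2))
```

Transient settings are lost on a full cluster restart, which is convenient if you only want the throttle in place while nodes are being added or removed.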
Also, if you care about your data you don't want to have only 1 node ever. You need your data to be replicated, so you will need at least 2 nodes (or more if you feel safer with a higher replication level).
Another thing to remember is that while you can change the number of replicas, you can't change the number of shards. This is configured when you create your index and cannot be changed (if you want more shards you need to create another index and reindex all your data). So your number of shards should take into account the data size and the cluster size, considering the higher number of nodes you want but also your minimal setup (can fewer nodes hold all the shards and serve the estimated traffic?).
So theoretically, if you want to have 2 nodes at low time and 12 nodes on peak, you can set your index to have 6 shards with 1 replica. So on low times you have 2 nodes that hold 6 shards each, and on peak you have 12 nodes that hold 1 shard each.
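The arithmetic behind that example can be checked with a short Python sketch. It assumes an even spread of shard copies and more nodes than replicas (a replica is never placed on the same node as its primary); the function name is just for illustration.

```python
def copies_per_node(primary_shards, replicas, nodes):
    """Number of shard copies each node holds when spread as evenly as possible."""
    if nodes <= replicas:
        raise ValueError("need more nodes than replicas to assign every copy")
    total_copies = primary_shards * (1 + replicas)
    base, extra = divmod(total_copies, nodes)
    return [base + 1] * extra + [base] * (nodes - extra)


print(copies_per_node(6, 1, 2))   # low time: 2 nodes
print(copies_per_node(6, 1, 12))  # peak: 12 nodes
```

Six primaries with one replica give 12 shard copies in total: six per node on two nodes, one per node on twelve. Past twelve nodes the extra nodes would hold nothing for this index, which is why the shard count has to be chosen with the peak cluster size in mind.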
But again, I strongly suggest rethinking this and testing the impact of shard moving on your cluster performance.
In cases where the elasticity of your application is driven by a variable query load, you could set up ES nodes configured to not store any data (node.data = false, http.enabled = true) and then put them in for auto scaling. These nodes could offload all the HTTP and result conflation processing from your main data nodes (freeing them up for more indexing and searching).
Since these nodes wouldn't have shards allocated to them bringing them up and down dynamically shouldn't be a problem and the auto-discovery should allow them to join the cluster.
I think this is a concern in general when it comes to employing an auto-scalable architecture to meet temporary demands while data still needs to be saved. I think there is a solution that leverages EBS:
Map shards to specific EBS volumes. Let's say we need 15 shards; we will need 15 EBS volumes.
Amazon allows you to mount multiple volumes, so we can start with a few instances that have multiple volumes attached to them.
As load increases, we can spin up additional instances, up to 15.
The above solution is only advised if you know your max capacity requirements.
I can give you an alternative approach using the AWS Elasticsearch service (it will cost a little bit more than running elasticsearch on normal EC2). Write a simple script which continuously monitors the load on the service (through the API/CLI) and, if the load goes beyond a threshold, programmatically increases the number of nodes in your AWS Elasticsearch service cluster. The advantage here is that AWS will take care of the scaling (as per the documentation, they take a snapshot and launch a completely new cluster). This will work for scaling down as well.
Regarding the auto-scaling approach, there are some challenges: shard movement has an impact on the existing cluster, and we need to be more vigilant while scaling down. You can find a good article on scaling down here which I have tested. If you can do some kind of intelligent automation of the steps in the above link through scripting (python, shell) or through automation tools like Ansible, then scaling in/out is achievable. But again, you need to start scaling up well before the normal limits, since the scale-up activities can have an impact on the existing cluster.
Question: is it possible to configure one node to always hold all the shard copies?
Answer: Yes, it's possible by explicit shard routing. More details here
I would be tempted to suggest solving this a different way in AWS. I don't know what ES data this is or how it's updated, etc. Making a lot of assumptions, I would put the ES instance behind an ALB (app load balancer). I would have a scheduled process that creates updated AMIs regularly (if you do it often then it will be quick to do); then, based on the load of your single server, I would trigger more instances to be created from the latest AMI you have available. Add the new instances to the ALB to share some of the load. As things quiet down, I would trigger the termination of the temp instances. If you go this route, here are a couple more things to consider:
Use spot instances, since they are cheaper, if that fits your use case.
The "T" instances don't fit well here since they need time to build up credits.
Use lambdas for the task of turning things on and off; if you want to be fancy you can trigger them based on a webhook to the AWS gateway.
Making more assumptions about your use case, consider putting a Varnish server in front of your ES machine so that you can more cheaply provide scale based on a cache strategy (lots of assumptions here). Based on the stress, you can dial in the right TTL for cache eviction. Check out the soft-purge feature; for our ES stuff we have gotten a lot of good value from it.
If you do any of what I suggest here, make sure your spawned ES instances report any logs back to a central addressable place on the persistent ES machine so you don't lose logs when the machines die.

Resources