Elasticsearch throughput on a single node

I am building a streaming + analytics application using Kafka and Elasticsearch. Using Kafka Streams apps, I am continuously pushing data into Elasticsearch. Can a single-node Elasticsearch setup with 16GB RAM handle a write load of 5,000 msgs/sec? The message size is 10KB.

There are many other conditions to consider, like cluster memory, network latency, and read operations. Write operations in Elasticsearch are relatively slow. It also seems like the indices could grow quickly, so performance might start to degrade over time and you'd need to scale vertically.
That said, I think this could work with enough RAM and a queue where pending items wait to be indexed when the cluster is slow.
Adding more nodes should help with uptime, which is normally a concern with user-facing production apps.
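For a rough sense of scale, 5,000 msgs/sec at 10KB each is on the order of 50 MB/s of raw document data before replication and indexing overhead, so batching writes through the _bulk API with a buffer in front is the usual approach. A minimal sketch of that queue-plus-bulk pattern, assuming the official elasticsearch-py client, a local node URL, and illustrative index/batch names:

```python
import queue
import threading

from elasticsearch import Elasticsearch      # official Python client (assumed)
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")  # 8.x-style URL; adjust for your client version
pending = queue.Queue(maxsize=50_000)        # holds documents while ES is slow or busy


def drain(batch_size: int = 2_000) -> None:
    """Pull documents off the queue and send one _bulk request per batch."""
    while True:
        batch = [pending.get()]
        while len(batch) < batch_size and not pending.empty():
            batch.append(pending.get())
        actions = [{"_index": "messages", "_source": doc} for doc in batch]
        bulk(es, actions)                    # index the whole batch in one request


threading.Thread(target=drain, daemon=True).start()

# The Kafka Streams consumer side then only has to enqueue documents:
# pending.put({"payload": "...10KB message..."})
```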

Related

What are the general guidelines for Elasticsearch cluster configuration for instance size, data nodes and sharding?

We regularly run into a handful of issues with Elasticsearch. They tend to be the following:
Out of disk space
Slow query evaluation time
Slow/throttled data write times
Timeouts on queries
There are various areas of an Elasticsearch cluster that can be configured:
Cluster disk space
Instance type/size
Number of data nodes
Sharding
It can sometimes be confusing which of these areas you should be tuning for which of the problems outlined above.
Increasing the ES cluster's total disk space is easy enough. Boosting the ES instance type seems to help when we experience slow data write times and slow query response times. Adjusting sharding seems to work best when one particular ES index is extremely large. But it's never quite clear when we should increase the number of data nodes versus boosting the instance size.
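One way to narrow that decision down is to look at which resource is actually under pressure before tuning anything. A rough triage sketch using the _cluster and _cat APIs (the endpoints are standard Elasticsearch APIs; the host and the symptom-to-remedy mapping here are assumptions for illustration):

```python
import requests

ES = "http://localhost:9200"                      # assumed cluster endpoint

# Overall health and unassigned shards (timeouts, red/yellow status)
health = requests.get(f"{ES}/_cluster/health").json()
print(health["status"], "unassigned shards:", health["unassigned_shards"])

# Disk pressure per node -> more disk or more data nodes
print(requests.get(f"{ES}/_cat/allocation?v&h=node,disk.percent,shards").text)

# Heap/CPU pressure per node -> bigger instances (or more of them)
print(requests.get(f"{ES}/_cat/nodes?v&h=name,heap.percent,cpu,load_1m").text)

# One oversized index -> revisit sharding for that index
print(requests.get(f"{ES}/_cat/indices?v&h=index,pri,rep,store.size&s=store.size:desc").text)
```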

How to determine Elasticsearch cluster capacity?

Recently I've been working on a project which requires figuring out Elasticsearch capacity, as we will be pushing many more messages per second into the ES system.
We have 3 types of nodes in the ES cluster: master, data, and client.
How do we know the maximum insert count per second our client nodes can handle? Do we need to care about the bandwidth of the client nodes?
As per the above comments, you need to benchmark your cluster hardware and settings, with your proposed data structure, using a tool like https://esrally.readthedocs.io/en/stable/

Fastest way to index huge data in Elasticsearch

I am asked to index more than 3*10^12 documents into an Elasticsearch cluster; the cluster has 50 nodes with 40 cores and 128GB of memory.
I was able to do it with _bulk in Python (multi-threaded), but I could not reach more than 50,000 records per second for one node.
So I want to know:
What is the fastest way to index data?
As far as I know, I can index data through each data node, but does it scale linearly? I mean, can I get 50,000 records per second for each node?
Per your question:
Balance your resources. Both Elasticsearch and your application should aim to run at 60-80% of server utilization in order to achieve the best performance. You can reach that utilization on the application side by using multiprocessing in Python, or Unix xargs, together with the Elasticsearch _bulk API.
In my experience, Elasticsearch performance scales almost linearly (around 99%) if you have a correct design for your cluster and index/shard settings. 50,000 records/second for each node is possible.
Here are some useful links that would help:
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
https://qbox.io/support/article/choosing-a-size-for-nodes
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/modules-threadpool.html (for monitoring your cluster during workloads)
It's recommended to do performance testing and then monitor your clusters and application servers closely during workloads. (I used Unix htop + New Relic combined :D)
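As a concrete illustration of the "multiple workers + _bulk" advice above, here is a hedged sketch using elasticsearch-py's parallel_bulk, with the common bulk-load tuning of disabling refresh and replicas while loading. The index name, shard count, batch sizes, and the 8.x-style client calls are assumptions for illustration, not settings from the original answer:

```python
import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")

# Create the index with bulk-load-friendly settings (8.x client style;
# older clients pass these via body=). One primary per data node is an
# illustrative choice for a 50-node cluster, not a sizing recommendation.
es.indices.create(index="bigdata",
                  settings={"number_of_shards": 50,
                            "number_of_replicas": 0,
                            "refresh_interval": "-1"})

def actions(docs):
    for doc in docs:
        yield {"_index": "bigdata", "_source": doc}

docs = ({"n": i} for i in range(1_000_000))       # stand-in for the real documents

start = time.time()
indexed = 0
for ok, _ in parallel_bulk(es, actions(docs), thread_count=8, chunk_size=5_000):
    indexed += ok                                 # count successfully indexed docs
print(f"{indexed / (time.time() - start):,.0f} docs/sec")

# Restore normal settings once the load is finished.
es.indices.put_settings(index="bigdata",
                        settings={"number_of_replicas": 1,
                                  "refresh_interval": "1s"})
```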

Why do I need a broker for my production ELK stack + machine specs?

I've recently stood up a test ELK stack on an Ubuntu box to test the functionality and have been very happy with it. My use case for production would involve ingesting at least 100GB of logs per day. I want to be as scalable as possible, as this 100GB/day can quickly rise as we add more log sources.
I read some articles on ELK production, including the fantastic Logz.io ELK Deployment. While I have a general idea of what I need to do, I am unsure about some core concepts, how many machines I need for such a large amount of data, and whether I need a broker like Redis in my architecture.
What is the point of a broker like Redis? In my test instance, I have multiple log sources sending logs over TCP, syslog, and logstash-forwarder directly to Logstash on my ELK server (which also has Elasticsearch, Nginx, and Kibana installed and configured with SSL).
In order to maintain a highly available, state-of-the-art production cluster, what machines and specs do I need for at least 100GB of data per day, likely scaling toward 150GB or more in the future? I am planning on using my own servers. From what I've researched, the starting point should look something like this (assuming I include Redis):
2/3 servers with a Redis + Logstash (indexer) instance each. For specs, I am thinking 32GB RAM, a fast I/O disk of 500GB (maybe SSD), and 8 cores (i7).
3 servers for Elasticsearch (this is the one I am most unsure about) -- I know I need at least 3 master nodes and 2 data nodes, so 2 servers will have 1 master/1 data each -- these will be beefy 64GB RAM, 20TB, 8 cores. The other remaining master node can be on a low spec machine, as it is not handling data.
2 servers for Nginx/Kibana -- these should be low spec machines, as they are just the web server and UI. Is a load balancer necessary here?
EDIT: Planning on keeping the logs for 60 days.
As for Redis, it acts as a buffer in case logstash and/or elasticsearch are down or slow. If you're using the full logstash or logstash-forwarder as a shipper, it will detect when logstash is unavailable and stop sending logs (remembering where it left off, at least for a while).
So, in a pure logstash/logstash-forwarder environment, I see little reason to use a broker like redis.
Where it becomes important is for sources that don't care about logstash's status and don't buffer on their side. Syslog, snmptrap, and others fall into this category. Since your sources include syslog, I would bring up brokers in your setup.
Redis is a RAM-intensive app, and the amount of memory that you have will dictate how long a logstash outage you can withstand. On a 32GB server (shared with logstash), how much of that memory would you give to redis? How large is your average document size? How many documents would it take to fill the memory? How long does it take to generate that many documents? In my experience, redis fails horribly when the memory fills, but that could just have been me.
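Those questions boil down to simple arithmetic. A back-of-the-envelope sketch with assumed numbers (plug in your own measurements):

```python
# All figures here are assumptions for illustration.
redis_memory_gb = 16          # memory reserved for redis on the 32GB box
avg_event_kb    = 1.0         # average serialized log event size
events_per_sec  = 1200        # roughly 100GB/day of 1KB events

events_that_fit = redis_memory_gb * 1024**2 / avg_event_kb
outage_seconds  = events_that_fit / events_per_sec
print(f"~{events_that_fit:,.0f} events buffered, "
      f"~{outage_seconds / 3600:.1f} hours of downstream outage absorbed")
```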
Logstash is a CPU-intensive process as all the filters get executed.
As for the size of the elasticsearch cluster, #magnus already pointed you to some information that might help. Starting with 64GB machines is great, and then scale horizontally as needed.
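For the disk side of that sizing, a rough calculation for 100GB/day kept for 60 days; the replica count and overhead factor are assumptions, so measure your real index-size-to-raw-log ratio:

```python
raw_gb_per_day = 100
retention_days = 60
replicas       = 1            # one replica copy of every primary shard
overhead       = 1.2          # index size vs. raw log size (varies widely)

total_gb = raw_gb_per_day * retention_days * (1 + replicas) * overhead
print(f"~{total_gb / 1024:.1f} TB of cluster storage")   # ~14 TB with these numbers
```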
You should have two client (non-data) nodes that are used as the access point for inserts (efficiently dispatching the requests to the correct data node) and searches (handling the 'reduce' phase with data returned from the data nodes). Two of these in a failover config would be a good start.
Two kibana machines will give you redundancy. Putting them in a failover config is also good. nginx was more used with kibana3, I believe. I don't know if people are using it with kibana4 or have moved to 'shield'.
Hope that helps.

ELK Stack and scaling

Bear with me here. I have spent the last week or so familiarising myself with the ELK Stack.
I have a working single box solution running the ELK stack, and I have the basics down on how to forward more than one type of log, and how to put them into different ES indexes.
This is all working pretty well, and I would like to expand operations.
My question is more about how to scale the solution out to cover more data needs/requirements.
The current solution is handling a smaller subset of data and working fine, but I would like to aggregate a lot more data. For example, I am currently pushing message tracking logs from 4 mailbox servers; I want to do the same for 40 mailbox servers, and much, much busier ones.
I would also like to push over IIS log files from the Client Access servers. There are 18 CAS servers, and around 30 minutes of IIS logs per server during peak time came to 120MB in size, with almost 1 million records.
This volume of data would most likely collapse a single box running ELK.
I haven't really looked into it, but I read that ES allows for some form of clustering to add more instances. Does the same apply to Logstash as well? Should Kibana be run on more than one server, or on a different server from both Logstash and ES?
You will hit limits with logstash if you're doing a lot of processing on the records - groks, conditionals, etc. Watch the cpu utilization of the machine for hints.
For elasticsearch itself, it's about RAM and disk IO. Having more nodes in a cluster should provide both.
With two elasticsearch nodes, you'll get redundancy (a copy on both machines). Add a third, and you can start to realize an IO benefit (writing two copies to three machines spreads the IO).
The ultimate data node will have 64GB of RAM on the machine, with 31GB allocated to elasticsearch (staying just under the JVM's compressed object pointers limit).
You'll probably want to add non-data nodes, which handle the routing of data to be indexed and the 'reduce' phase when running queries. Put two of them behind a load balancer.
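As a minimal sketch of how shards and replicas spread over a small cluster like that (index name and counts are illustrative assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # or the load-balanced client nodes

es.indices.create(
    index="logs-sample",
    settings={
        "number_of_shards": 3,      # one primary per data node in a 3-node cluster
        "number_of_replicas": 1,    # each primary also gets a copy on another node
    },
)
```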
As Alain mentioned, adding more ES nodes will improve performance (and give you redundancy).
On the logstash front, we have two logstash servers feeding into ES - at the moment we just direct different servers to log to the different logstash servers, but we're likely to be adding a HA-Proxy layer in front to do this automatically, and again provide redundancy.
With Kibana, I wouldn't worry too much - as far as I'm aware most of the processing is done in the client browser, and the part that isn't depends more on the performance of the ES cluster.
