Fastest way to index huge data in elastic - elasticsearch

I have been asked to index more than 3*10^12 documents into an Elasticsearch cluster. The cluster has 50 nodes with 40 cores and 128 GB of memory.
I was able to do it with the _bulk API from Python (multi-threaded), but I could not get past 50,000 records per second for one node.
So I want to know:
What is the fastest way to index data?
As far as I know, I can send data to each data node directly. Does throughput grow linearly? That is, can I get 50,000 records per second on each node?

Per your question:
Balance your resources. Both Elasticsearch and your application should run at 60-80% server utilization to achieve the best performance. On the application side, you can reach this utilization with multiprocessing in Python, or with Unix xargs plus the Elasticsearch _bulk API.
In my experience, Elasticsearch indexing performance scales almost linearly (about 99%) with the number of nodes, provided your cluster and index/shard settings are designed correctly. 50,000 records/second per node is possible.
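To put the scaling claim in numbers, here is a rough back-of-envelope sketch. It takes the document count, node count and per-node rate from the question and assumes the per-node rate really does scale linearly; the `efficiency` parameter is a hypothetical knob for modelling less-than-perfect scaling, not something Elasticsearch reports.

```python
# Back-of-envelope ingest time, assuming linear scaling across nodes.
TOTAL_DOCS = 3 * 10**12          # from the question
NODES = 50                       # from the question
DOCS_PER_SEC_PER_NODE = 50_000   # rate observed on a single node

SECONDS_PER_DAY = 86_400


def ingest_days(total_docs, nodes, rate_per_node, efficiency=1.0):
    """Days needed to index total_docs at the given per-node rate.

    efficiency is a hypothetical scaling factor (1.0 = perfectly linear).
    """
    cluster_rate = nodes * rate_per_node * efficiency  # docs per second
    return total_docs / cluster_rate / SECONDS_PER_DAY


days = ingest_days(TOTAL_DOCS, NODES, DOCS_PER_SEC_PER_NODE)
print(f"{days:.1f} days")  # 13.9 days
```

So even with perfectly linear scaling, the full load would take roughly two weeks of sustained indexing at that rate.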
Here are some useful links that would help:
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
https://qbox.io/support/article/choosing-a-size-for-nodes
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/modules-threadpool.html (for monitoring your cluster during workloads)
It's recommended to do performance testing and then monitor your cluster and application servers closely during workloads (I used Unix htop and New Relic combined :D).
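As a minimal sketch of the application side, the helpers below split a stream of documents into batches and build the newline-delimited body that the _bulk API expects (one action line followed by one source line per document, with a trailing newline). The index name `my-index`, the batch size and the sample documents are assumptions for illustration; sending each body (for example from several worker processes) is left out. Older Elasticsearch versions that still use mapping types may also need a `_type` in the action line or URL.

```python
import json
from itertools import islice


def chunked(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_body(index, batch):
    """Build an NDJSON _bulk request body that indexes every doc in batch."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # source line
    return "\n".join(lines) + "\n"  # the body must end with a newline


# Hypothetical sample data; in practice this would be your real document stream.
docs = ({"id": i, "msg": f"record {i}"} for i in range(5))
bodies = [bulk_body("my-index", batch) for batch in chunked(docs, 2)]
print(len(bodies))  # 3
```

Each body would then be POSTed to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header. The batch size is worth tuning during performance testing rather than fixing up front.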

Related

What are the general guidelines for Elasticsearch cluster configuration for instance size, data nodes and sharding?

We regularly encounter several issues that crop up from time to time with Elasticsearch. They seem to be as follows:
Out of disk space
Slow query evaluation time
Slow/throttled data write times
Timeouts on queries
There are various areas of an Elasticsearch cluster that can be configured:
Cluster disk space
Instance type/size
Number of data nodes
Sharding
It can sometimes be confusing which areas of the cluster you should be tuning depending on the problems outlined above.
Increasing the ES cluster total disk space is easy enough. Boosting the ES instance type seems to help when we experience slow data write times and slow query response times. Implementing sharding seems to be best when one particular ES index is extremely large. But it's never quite clear when we should boost the number of data nodes vs boosting the instance size.

How to determine Elasticsearch cluster capacity?

Recently I have been working on a project that requires figuring out Elasticsearch capacity, because we will significantly increase the number of messages per second sent to the ES system.
We have 3 types of nodes in the ES cluster: master, data, client.
How do we know the maximum insert count per second our client can handle? Do we need to care about the bandwidth of the client nodes?
As per the comments, you need to benchmark your cluster hardware and settings with your proposed data structure using a tool like Rally: https://esrally.readthedocs.io/en/stable/

Configure an Elasticsearch cluster with 3 Master nodes and 33 Data nodes on physical servers

I'm using Elasticsearch to handle 10T of data. I have worked out how many shards and how much RAM, CPU and disk to use, but as I try to configure these nodes I am confused by the number of features involved and why each one is needed. Are there guidelines or recommendations for a standard configuration and best practices on this subject, and do I need to configure other node types?
It heavily depends on your use case: is it indexing-heavy or search-heavy, what is the document schema, and what search queries are you going to run? For example, n-gram tokens might easily inflate the resources needed 10x.
There are a few general rules, though.
You want your shards to be between 20-50 GB
You want fewer than 20k shards in your cluster
You want shards to be distributed evenly across machines
You want ~30 GB heap
You want your heap to take ~50% of RAM
You want as much CPU as you can eat
You want local (not network-attached) SSDs
Or, if you want the least hassle possible, you can go with Elastic Cloud which will take some of the hardware concerns away in exchange for a fee.
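The shard-size and heap rules above can be turned into a quick sizing check. This is only a sketch: it reads the question's "10T" as 10 TB (10,000 GB) of primary data, takes the 33 data nodes from the question title, and ignores replicas and indexing overhead. The function names are made up for illustration.

```python
import math

DATA_GB = 10_000   # assumption: "10T" read as 10 TB of primary data
DATA_NODES = 33    # from the question title


def shard_range(data_gb, min_shard_gb=20, max_shard_gb=50):
    """Return (fewest, most) primary shards that keep shards within 20-50 GB."""
    fewest = math.ceil(data_gb / max_shard_gb)
    most = math.floor(data_gb / min_shard_gb)
    return fewest, most


def heap_gb(ram_gb, cap_gb=30):
    """Heap of about 50% of RAM, capped at roughly 30 GB."""
    return min(ram_gb // 2, cap_gb)


fewest, most = shard_range(DATA_GB)
print(fewest, most)                # 200 500
print(fewest / DATA_NODES, most / DATA_NODES)  # primary shards per data node
print(heap_gb(64), heap_gb(128))   # 30 30
```

Under these assumptions the cluster would hold somewhere between 200 and 500 primary shards, far below the 20k ceiling, and any machine with 64 GB of RAM or more would end up at the ~30 GB heap cap.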

Elasticsearch throughput on a single node

I am building a streaming + analytics application using Kafka and Elasticsearch. Using Kafka Streams apps, I am continuously pushing data into Elasticsearch. Can a single-node Elasticsearch setup with 16 GB RAM handle a write load of 5000 msgs/sec? The message size is 10 KB.
There are many other conditions to consider, like cluster memory, network latency and read operations. Writing operations in Elasticsearch are slow. Also it seems like the indexes could grow quickly so performance might start to degrade over time and you'll need to scale vertically.
That said, I think this could work with enough RAM and a queue where pending items wait to be indexed when the cluster is slow.
Adding more nodes should help with uptime, which is normally a concern with user-facing production apps.
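To see why index growth matters here, a quick calculation of the raw write volume helps. The sketch below uses the figures from the question and assumes 1 KB = 1000 bytes; it counts only the raw message payload, before replicas or any index overhead.

```python
MSGS_PER_SEC = 5_000       # from the question
MSG_BYTES = 10 * 1000      # 10 KB per message, assuming 1 KB = 1000 bytes
SECONDS_PER_DAY = 86_400

bytes_per_sec = MSGS_PER_SEC * MSG_BYTES
mb_per_sec = bytes_per_sec / 1e6                      # sustained write rate
tb_per_day = bytes_per_sec * SECONDS_PER_DAY / 1e12   # raw payload per day

print(f"{mb_per_sec:.0f} MB/s, {tb_per_day:.2f} TB/day")  # 50 MB/s, 4.32 TB/day
```

At roughly 4 TB of raw data per day, disk capacity and shard growth on a single node would likely become a constraint quickly, independent of whether the node can keep up with the write rate.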

Indexing multiple indexes in elastic search at the same time

I am using Logstash for ETL and have 3 indexes in Elasticsearch. Can I insert documents into my 3 indexes through 3 different Logstash processes at the same time to improve parallelization, or should I insert documents into 1 index at a time?
My Elasticsearch cluster configuration looks like:
3 data nodes - 64 GB RAM, SSD disk
1 client node - 8 GB RAM
Shards - 20
Replicas - 1
Thanks
As always, it depends. The distribution concept of Elasticsearch is based on shards. Since the shards of an index live on different nodes, you are automatically spreading the load.
However, if Logstash is your bottleneck, you might gain performance from running multiple processes. It is doubtful, though, whether running multiple Logstash processes on a single machine would have a positive impact.
Short answer: Parallelising over 3 indexes won't make much sense, but if Logstash is your bottleneck, it might make sense to run those in parallel (on different machines).
PS: The biggest performance improvement generally is batching requests together, but Logstash does that by default.
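To illustrate how shards spread the load, here is a small sketch that counts shard copies per data node for the configuration in the question. It assumes the 20 shards are primaries, that 1 replica means one extra copy of each, and that copies are balanced evenly across data nodes; the function name is hypothetical.

```python
def copies_per_node(primaries, replicas, data_nodes):
    """Evenly distribute all shard copies (primaries + replicas) over nodes."""
    total = primaries * (1 + replicas)
    base, extra = divmod(total, data_nodes)
    # The first `extra` nodes hold one more copy than the rest.
    return [base + 1 if i < extra else base for i in range(data_nodes)]


print(copies_per_node(20, 1, 3))  # [14, 13, 13]
```

With 40 shard copies on 3 data nodes, every node already takes part in indexing for any of the indexes, which is why writing to the 3 indexes one at a time or in parallel should make little difference on the Elasticsearch side.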

Resources