We are using the ES-Hadoop connector to push data from a Hadoop HBase table into an Elasticsearch cluster. Below are the cluster details.
Elasticsearch version: 2.3.5
Data nodes: 3
Master nodes: 3
Client node: 1
The data nodes are master-eligible as well.
Data/master node heap: 20 GB
Client node heap: 3 GB
Primary shards per index: 5
Replica shards per index: 1
When we run Spark jobs, the stages that push data from Hadoop into Elasticsearch start failing after some time with Elasticsearch "Bailing out" errors.
We suspect that the Spark executors are exceeding the number of concurrent bulk API connections Elasticsearch can handle, and that once that limit is reached Elasticsearch starts rejecting the write requests.
How can we determine how many concurrent bulk API connections the Elasticsearch client node can handle while still writing the data successfully, and what should the maximum number of documents per bulk request be?
What parameters should we look at to optimise the Elasticsearch cluster for write operations, given that we need to index 80-90 GB of data per hour?
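One way to confirm the suspicion is to watch the bulk thread pool on the data nodes while the Spark stage runs: in ES 2.x each node has a fixed-size bulk thread pool (one thread per core, queue of 50 by default), and a growing rejected counter is what surfaces as the bailing-out error on the es-hadoop side. The sketch below is a minimal example, assuming the cluster is reachable on localhost:9200; the job class and jar names are placeholders, and the es-hadoop batch values are illustrative starting points rather than tuned recommendations.

# Bulk thread pool usage and rejections per node (ES 2.x):
curl 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'

# es-hadoop bulk sizing applies per Spark task, so the total number of concurrent
# bulk requests is roughly the number of executor cores writing at once.
# Keys prefixed with "spark." are picked up by the connector:
spark-submit \
  --conf spark.es.batch.size.entries=1000 \
  --conf spark.es.batch.size.bytes=5mb \
  --conf spark.es.batch.write.retry.count=6 \
  --conf spark.es.batch.write.retry.wait=30s \
  --class com.example.HBaseToEs hbase-to-es.jar   # placeholder class/jar

If the rejected counter climbs, reducing the number of tasks writing concurrently or the batch sizes usually helps more than only enlarging the server-side queue.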
I am using AWS Elasticsearch 6.8 with 9 nodes, and I configured the index with 1 primary shard and 5 replica shards. That is only 6 shards in total, which means only 6 nodes are needed to hold them.
Can one shard only exist on one node? If so, there should be 3 idle nodes in the cluster; how can I find out which 3 nodes are idle?
If there are 3 idle nodes, does that impact search? When the cluster receives a request, will it ever send the request to one of the idle nodes?
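For the "which nodes are idle" part, the _cat APIs show the shard count per node directly; a minimal check, assuming you can reach the cluster's HTTP endpoint (the hostname below is a placeholder):

# One row per node with the number of shards it currently holds;
# nodes showing 0 shards are the idle ones:
curl 'https://<your-aws-es-endpoint>/_cat/allocation?v'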
Yes, you are right, but in your case we need some more information, for example how many of the 9 nodes are dedicated master nodes (shards, which actually hold the data, are not allocated on dedicated master nodes).
If all 9 are data nodes and you have just one index with 1 primary and 5 replica shards, then the 6 shards will certainly all be allocated on different nodes (except in some rare edge case).
Using the Cerebro Elasticsearch cluster-admin tool, you can quickly point it at your Elasticsearch cluster and see which nodes are in the cluster and how the shards (primaries and replicas) are allocated on them.
Below is a sample of how the nodes and an index look in my own AWS Elasticsearch cluster: you can clearly see that my index has 1 primary shard and 0 replica shards, the cluster has just 1 node, and that shard is allocated on that node.
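If you would rather not install a UI, the same allocation picture is available from the _cat APIs; a quick sketch, with the endpoint as a placeholder:

# Every shard with its index, whether it is a primary (p) or replica (r),
# its state and the node it lives on:
curl 'https://<your-aws-es-endpoint>/_cat/shards?v'

# The nodes and their roles (m = master-eligible, d = data, i = ingest):
curl 'https://<your-aws-es-endpoint>/_cat/nodes?v'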
If I have 3 data nodes and run a query with a lot of aggregations, is the search distributed across all the data nodes in the cluster?
Or does Elasticsearch pick one node to query and aggregate the data, acting as a load balancer rather than as a distributed map/reduce?
If the index you're querying contains more than one shard (whether primary or replica), then those shards will be located on different nodes, hence the query will be distributed to each node that hosts a shard of the index you're querying.
One data node will receive your request and act as the coordinating node. It will check the cluster state to figure out where the shards are located, then it will forward the request to each node hosting a shard, gather the results and send them back to the client.
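If you want to see that routing for yourself, the _search_shards API lists the shard copies a search on an index can be spread over; a minimal example, assuming a local cluster and an index named my_index (both placeholders):

# Which shard copies of the index exist and which nodes can serve a search on it:
curl 'localhost:9200/my_index/_search_shards?pretty'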
I have an Elasticsearch cluster with 11 nodes. Five of these are data nodes and the others are client nodes through which I add and retrieve documents.
I am using the standard Elasticsearch configuration. Each index has 5 shards plus replicas. The cluster holds 55 indices and roughly 150 GB of data.
The cluster is very slow. With the Kopf plugin I can see the stats of each node, and one single data node (not the master) is permanently overloaded: heap, disk and CPU are fine, but its load is almost always at 100%. I have noticed that every shard on it is a primary shard, whereas all the other data nodes hold a mix of primary shards and replicas. When I shut that node down and start it again, the same problem appears on another data node.
I don't know why this happens or how to solve it. I thought the client nodes and the master node distributed requests evenly? Why is one data node always overloaded?
Try the following settings:
cluster.routing.rebalance.enable:
Enable or disable rebalancing for specific kinds of shards:
all - (default) Allows shard balancing for all kinds of shards.
primaries - Allows shard balancing only for primary shards.
replicas - Allows shard balancing only for replica shards.
none - No shard balancing of any kind is allowed for any indices.
cluster.routing.allocation.allow_rebalance:
Specify when shard rebalancing is allowed:
always - Always allow rebalancing.
indices_primaries_active - Only when all primaries in the cluster are allocated.
indices_all_active - (default) Only when all shards (primaries and replicas) in the cluster are allocated.
cluster.routing.allocation.cluster_concurrent_rebalance:
Controls how many concurrent shard rebalances are allowed cluster-wide.
Defaults to 2.
Sample curl to apply desired settings:
curl -XPUT <elasticsearchserver>:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.rebalance.enable" : "all"
  }
}'
You can replace transient with persistent if you want your settings to persist across restarts.
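To check what the cluster is currently using, the overrides can be read back; a small sketch assuming the same placeholder host:

# Shows all transient and persistent overrides currently applied:
curl -XGET <elasticsearchserver>:9200/_cluster/settings?pretty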
We have an Elasticsearch cluster of 3 nodes with the following configuration:
CPU cores: 36
Memory: 244 GB
Disk: 48,000 GB
IO performance: very high
The machines are in 3 different availability zones, namely eu-west-1c, eu-west-1a and eu-west-1b.
Each Elasticsearch instance is allocated 30 GB of heap space.
We use this cluster for running aggregations only. The cluster has a replication factor of 1, all string fields are not_analyzed, and doc_values is enabled for all fields.
We are pumping data into this cluster with 6 instances of Logstash running in parallel (each with a batch size of 1000).
When more Logstash instances are started one by one, the nodes of the Elasticsearch cluster start throwing out-of-memory errors.
What optimizations could speed up the bulk indexing rate on the cluster? Would having the cluster nodes in the same zone improve bulk indexing? Would adding more nodes to the cluster help?
A couple of steps taken so far:
Increased the bulk queue size from 50 to 1000.
Increased the refresh interval from 1 second to 2 minutes.
Changed segment merge throttling to none (https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html).
We cannot set the replication factor to 0 because of the inconsistency involved if one of the nodes goes down.
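Beyond those steps, two knobs that often matter for heavy bulk loads on clusters of this era are the shared indexing buffer and how often the translog forces a flush; a minimal sketch, assuming a node reachable on localhost:9200 and an index called my_index (both placeholders), with illustrative values rather than recommendations:

# elasticsearch.yml (node-level, needs a restart): give the indexing buffer more
# of the 30 GB heap than the 10% default
indices.memory.index_buffer_size: 30%

# Per index, at runtime: flush the translog less often during the load
# ("my_index" is a placeholder)
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.translog.flush_threshold_size" : "1gb"
}'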
I query Elasticsearch from Spark. As the documentation says, Spark creates tasks according to the number of Elasticsearch shards (e.g. for 32 shards there will be 32 Spark tasks). Each task connects to and retrieves data from a separate Elasticsearch shard.
There is also a description of the fetch phase (from Elasticsearch: The Definitive Guide, "Distributed Search Execution » Fetch Phase"):
The distributed phase consists of the following steps:
1. The coordinating node identifies which documents need to be fetched and issues a multi GET request to the relevant shards.
2. Each shard loads the documents and enriches them, if required, and then returns the documents to the coordinating node.
3. Once all documents have been fetched, the coordinating node returns the results to the client.
In the Elasticsearch-Spark solution the algorithm is different, since there is no coordinating node:
1. The shard loads the documents and enriches them, if required.
2. Elasticsearch returns the shard results to the client (the Spark task).
My question is as follows:
I am looking at the elapsed time of the fetch phase in the slow log. Does this elapsed time include the transfer of all the data from the shard to the client (the Spark task), or does it only include the time needed to retrieve the data from the filesystem?
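For reference, the fetch timings in question come from the shard-level search slow log, which is enabled per index with threshold settings like the ones below; a minimal sketch, assuming a local cluster, a placeholder index name, and arbitrary thresholds:

# Enable fetch-phase slow logging for the index ("my_index" is a placeholder):
curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.search.slowlog.threshold.fetch.warn" : "500ms",
  "index.search.slowlog.threshold.fetch.info" : "200ms"
}'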