How much heap memory is required for Elasticsearch?

I have an 8-node cluster (3 master + 3 data + 2 coordinating). Each data node has a 10 GB heap and 441 GB of disk, about 1.2 TB in total.
Each day I ingest 32.73 GB of data, and each day 26 shards are created across 11 indices. Suppose the retention period is 30 days: by the 30th day the cluster would hold 982 GB of data and 780 shards, i.e. 260 shards per data node, with an average shard size of roughly 260 MB. I read in the documentation that a node with a 30 GB heap can handle about 600 shards. So the question is: can a 10 GB heap handle 260 shards?

The article you read can be taken as a good general recommendation, but many factors affect it: the size of your indices, the size of your shards, the size of your documents, the type of disk, the current load on the system, and so on. In the same document you will notice that the recommended shard size is between 10 GB and 50 GB, while your shards are very small (260 MB, as you mentioned). Based on that, a 10 GB heap can easily handle 260 shards in your case. You should still benchmark your cluster and read up on how Elasticsearch stores and searches data internally so that it is easier for you to fine-tune it.
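As a rough sanity check, here is a minimal Python sketch of the arithmetic behind the question, using the commonly cited guideline of keeping at most about 20 shards per GB of heap (the same guideline that yields the 30 GB heap / 600 shard figure). The numbers are the ones from the question; treat the output as an estimate, not a capacity guarantee.

# Rough shard/heap estimate for the cluster described in the question.
# The 20-shards-per-GB-of-heap figure is a general guideline, not a hard limit.
daily_shards = 26          # shards created per day across 11 indices
retention_days = 30
data_nodes = 3
heap_per_node_gb = 10
shards_per_gb_heap = 20    # common rule of thumb

total_shards = daily_shards * retention_days                 # 780 shards
shards_per_node = total_shards / data_nodes                  # 260 shards per data node
guideline_ceiling = heap_per_node_gb * shards_per_gb_heap    # 200 shards per node

print(f"Shards per data node after {retention_days} days: {shards_per_node:.0f}")
print(f"Guideline ceiling for a {heap_per_node_gb} GB heap: {guideline_ceiling}")

With the question's numbers the per-node shard count lands somewhat above the guideline ceiling, which is why the very small shard size (and a benchmark of the actual cluster) matters for the answer above.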

Related

Elasticsearch divide cluster or not

I have a cluster consisting of 4 data nodes and 1 master node.
Each node has 4 CPUs and 16 GB of RAM.
We have one large index of 215 GB (1 primary and 1 replica shard).
On peak days this index is very heavily loaded (we use aggregation queries, since the index is used to store and send notifications to users), which negatively affects the operation of the entire cluster.
The developers propose allocating a separate cluster for this index. I propose adding 3 more machines to the existing cluster and using shard allocation awareness only for this index, splitting it into 3 primary shards with 1 replica each.
Do you think this is the right approach, or what would be the best way to do it in your opinion?
215 gigabytes in a single primary shard is clearly not optimal and most probably the cause of your issues.
The official recommendation is to keep between 10 GB and 50 GB per shard, so you should split your index into at least 4 shards.
Then, if your index is loaded because of writes, make sure each primary shard is allocated to a different data node so that all writes can happen in parallel across all four data nodes. If the index is loaded because of searches, it doesn't really matter where the primaries and replicas reside, but with big indices it's always best to spread the load as evenly as possible across all the computing power you have.
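If you keep a single cluster and only need more primary shards, here is a minimal sketch, assuming a recent Elasticsearch version, of the split index API using Python's requests library. The index names and endpoint are hypothetical; the source index has to be made read-only first, and the new shard count must be a multiple of the old one (trivially true when starting from 1 primary). Reindexing into a new index with more shards is the other common option.

import requests

ES = "http://localhost:9200"   # assumed cluster endpoint

# 1. The source index must be blocked for writes before it can be split.
#    "notifications" / "notifications-4shards" are hypothetical names.
requests.put(f"{ES}/notifications/_settings",
             json={"index.blocks.write": True})

# 2. Split into 4 primary shards, keeping 1 replica as in the original setup.
requests.post(f"{ES}/notifications/_split/notifications-4shards",
              json={"settings": {"index.number_of_shards": 4,
                                 "index.number_of_replicas": 1}})

# 3. Re-enable writes on the new index once it is green.
requests.put(f"{ES}/notifications-4shards/_settings",
             json={"index.blocks.write": None})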

Elasticsearch Memory Usage increases over time and is at 100%

Indexing performance in Elasticsearch has degraded over time. Memory usage has slowly increased until it reached 100%, and in that state I cannot index any more data. I have the default shard settings: 5 primaries and 1 replica. My indices are time based, with a new index created every hour to store coral service logs from various teams. An index is about 3 GB across its 5 primary shards, about 6 GB including the replica, and about 1.7 GB with a single shard and 0 replicas.
I am using EC2 i2.2xlarge hosts, which offer 1.6 TB of storage, 61 GB of RAM and 8 cores.
I have set the heap size to 30 GB.
Here are the node statistics:
https://jpst.it/1eznd
Could you please help me fix this? My whole cluster came down and I had to delete all the indices.
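As a starting point for diagnosing this kind of issue, here is a minimal sketch, assuming a locally reachable cluster, that pulls per-node JVM heap and segment memory figures from the nodes stats API with Python's requests library (the field names follow the 5.x/6.x-era stats format and may differ in newer versions).

import requests

ES = "http://localhost:9200"   # assumed cluster endpoint

stats = requests.get(f"{ES}/_nodes/stats/jvm,indices").json()

for node_id, node in stats["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    seg_mb = node["indices"]["segments"]["memory_in_bytes"] / 1024 ** 2
    print(f"{node['name']}: heap used {heap_pct}%, segment memory {seg_mb:.0f} MB")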

How can I save disk space in Elasticsearch?

I have a three-node cluster, each node with 114 GB of disk capacity. I am pushing syslog events from around 190 devices to this cluster, at a rate of about 100k events per 15 minutes. The index has 5 shards and uses the best_compression codec. In spite of this the disk fills up quickly, so I was forced to remove the replica for this index. The index size is 170 GB and each shard is 34.1 GB. Now, if I get additional disk space and reindex this data into a new index with 3 shards and a replica, will it save disk space?
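The reindex-into-fewer-shards approach described in the question would look roughly like the sketch below, using Python's requests library against a hypothetical cluster; "syslog-v1" and "syslog-v2" are made-up index names. Note that best_compression is set at index creation time on the destination index, just as on the original.

import requests

ES = "http://localhost:9200"   # assumed cluster endpoint

# New index with 3 primary shards, 1 replica and the best_compression codec.
requests.put(f"{ES}/syslog-v2", json={
    "settings": {
        "index.number_of_shards": 3,
        "index.number_of_replicas": 1,
        "index.codec": "best_compression",
    }
})

# Reindex in the background; wait_for_completion=false returns a task id
# that can be polled, which is preferable for a 170 GB index.
requests.post(f"{ES}/_reindex?wait_for_completion=false", json={
    "source": {"index": "syslog-v1"},
    "dest": {"index": "syslog-v2"},
})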

Elasticsearch sizing of shards

I am new to Elasticsearch. Suppose we have a two-node cluster with a single index configured for 2 primary shards and one replica, so node 1 holds P0 and R1 and node 2 holds P1 and R0. Now suppose I later reduce the number of replicas to 0. Will the shards P0 and P1 automatically resize themselves to occupy the disk space vacated by the replicas, giving me more disk space for indexing than I had with replicas?
A replica shard takes more or less the same space as its primary, since both contain the same documents. So say you have indexed 1 million documents into your index: each primary shard then contains roughly half that amount, i.e. 500K documents, and each replica contains the same number of documents as well.
If each document weighs 1KB, then:
The primary shard P0 has 500K documents weighing 500MB
The replica shard R0 has 500K documents weighing 500MB
The primary shard P1 has 500K documents weighing 500MB
The replica shard R1 has 500K documents weighing 500MB
Which means that your index occupies 2GB of disk space on your node. If you later reduce the number of replicas to 0, then that will free up 1GB of space that your primary shards will be able to occupy, indeed.
However, note that by doing so you certainly gain disk space, but you won't have any redundancy anymore, and you lose the ability to serve each shard from two nodes, which is the main idea behind replicas to begin with.
The other thing is that the size of a shard is bounded by a practical limit it cannot cross. That limit depends on many factors, among them the amount of heap and the total physical memory you have. If you have 2 GB of heap and 50 GB of disk space, you cannot expect to index 50 GB of data into your index; that won't work, or will be very slow and unstable.
So disk space alone should not be the main driver for sizing your shards. Having enough disk space is a necessary condition but not a sufficient one; you also need to look at the RAM and the heap allocated to your ES nodes.
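For completeness, a minimal sketch of the replica-removal step discussed above, using Python's requests library; "my-index" and the endpoint are hypothetical. Dropping the replica count deletes the replica copies and frees their disk space; the primary shards themselves do not change size.

import requests

ES = "http://localhost:9200"   # assumed cluster endpoint

# Remove all replicas for the index.
requests.put(f"{ES}/my-index/_settings",
             json={"index": {"number_of_replicas": 0}})

# List the remaining shards with their on-disk store size.
print(requests.get(f"{ES}/_cat/shards/my-index?v&h=shard,prirep,store,node").text)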

How to improve percolator performance in ElasticSearch?

Summary
We need to increase percolator performance (throughput).
The most likely approach is scaling out to multiple servers.
Questions
How to do scaling out right?
1) Would increasing the number of shards in the underlying index allow running more percolate requests in parallel?
2) How much memory does an Elasticsearch server need if it only does percolation?
Is it better to have 2 servers with 4GB RAM or one server with 16GB RAM?
3) Would having an SSD meaningfully help percolator performance, or is it better to increase RAM and/or the number of nodes?
Our current situation
We have 200,000 queries (job search alerts) in our job index.
We are able to run 4 parallel queues that call the percolator.
Each queue is able to percolate a batch of 50 jobs in about 35 seconds, so we can percolate about:
4 queues * 50 jobs per batch / 35 seconds * 60 seconds per minute ≈ 343
jobs per minute
We need more.
Our jobs index has 4 shards and we are using the .percolator type on top of that jobs index.
Hardware: a server with 2 processors and 32 cores in total, and 32GB RAM.
We allocated 8GB of RAM to Elasticsearch.
When the percolator is working, the 4 percolation queues mentioned above consume about 50% of the CPU.
When we tried to increase the number of parallel percolation queues from 4 to 6, CPU utilization jumped to 75%+.
What is worse, the percolator started to fail with NoShardAvailableActionException:
[2015-03-04 09:46:22,221][DEBUG][action.percolate ] [Cletus
Kasady] [jobs][3] Shard multi percolate failure
org.elasticsearch.action.NoShardAvailableActionException: [jobs][3]
null
That error seems to suggest that we should increase the number of shards and eventually add a dedicated Elasticsearch server (and later increase the number of nodes).
Related:
How to Optimize elasticsearch percolator index Memory Performance
Answers
How to do scaling out right?
Q: 1) Would increasing the number of shards in the underlying index allow running more percolate requests in parallel?
A: No. Sharding is only really useful when creating a cluster. Additional shards on a single instance may in fact worsen performance. In general the number of shards should equal the number of nodes for optimal performance.
Q: 2) How much memory does an Elasticsearch server need if it only does percolation?
Is it better to have 2 servers with 4GB RAM or one server with 16GB RAM?
A: Percolator indices reside entirely in memory, so the answer is: a lot. It depends entirely on the size of your index. In my experience, 200,000 searches would require a 50MB index. In memory this index would occupy around 500MB of heap. Therefore 4GB of RAM should be enough if this is all you're running. I would suggest more nodes in your case. However, as the size of your index grows, you will need to add RAM.
Q: 3) Would having an SSD meaningfully help percolator performance, or is it better to increase RAM and/or the number of nodes?
A: I doubt it. As I said before, percolators reside in memory, so disk performance isn't much of a bottleneck.
EDIT: Don't take my word on those memory estimates. Check out the site plugins on the main ES site. I found BigDesk particularly helpful for watching performance counters for scaling and planning purposes. This should give you more valuable info on estimating your specific requirements.
EDIT in response to the comment from @DennisGorelik below:
I got those numbers purely from observation but on reflection they make sense.
200K queries to 50MB on disk: this ratio means the average query occupies about 250 bytes when serialized to disk.
50MB index to 500MB on heap: rather than serialized objects on disk, we are dealing with in-memory Java objects. Think about deserializing XML (or any data format, really): you generally end up with in-memory objects roughly 10x larger.
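A minimal sketch of that back-of-the-envelope estimate in Python, using the figures from the answer; the 10x disk-to-heap expansion factor is the answer's rough observation, not a fixed constant.

# Back-of-the-envelope percolator memory estimate using the figures above.
queries = 200_000
index_size_on_disk_mb = 50
heap_expansion_factor = 10     # rough observation: in-memory objects ~10x disk size

# Treating 1 MB as 10^6 bytes, which matches the ~250 bytes per query in the answer.
bytes_per_query = index_size_on_disk_mb * 1_000_000 / queries
estimated_heap_mb = index_size_on_disk_mb * heap_expansion_factor

print(f"~{bytes_per_query:.0f} bytes per serialized query on disk")   # ~250
print(f"~{estimated_heap_mb} MB of heap for the percolator index")    # ~500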
