Elasticsearch is used as a cache for a PostgreSQL database, to avoid a lot of joins and speed up my application's selects.
Initially everything is hosted on a single large server (32 GB RAM): webapp, nginx, PostgreSQL, Celery, Elasticsearch.
Now I have 2 additional smaller nodes which are not used at all (only for additional storage with nbd-server).
So I have:
- 1 large node with ES. About 12-16 GB of RAM is available for ES.
- 2 small nodes with 8 GB RAM each. All of it is free for ES.
All 3 nodes have SSDs and the same CPU.
Later I will add more 8GB nodes (as storage + ES).
What would be the best way to build an ES cluster on these 3 nodes? Should all of them be data/master nodes, or would it be better to use the large node as a master and the 2 small ones as data nodes?
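For concreteness, the two layouts I am weighing would look roughly like this in elasticsearch.yml (a sketch only; it assumes a pre-8.x release where the node.master / node.data flags are still accepted):

    # Layout A: all three nodes are master-eligible data nodes
    # (these are the 7.x defaults, shown explicitly)
    node.master: true
    node.data: true

    # Layout B: large node as a dedicated master, small nodes as data-only
    # on the large node:
    node.master: true
    node.data: false
    # on each 8 GB node:
    node.master: false
    node.data: true

Note that layout B leaves only one master-eligible node, so losing the large node would leave the cluster without a master; that trade-off is part of what I am asking about.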
My setup:
two zones: fast and slow, with 5 nodes each.
The fast nodes have ephemeral storage, whereas the slow nodes are NFS-based.
Running Elasticsearch OSS v7.7.1. (I have no control over the version)
I have the following cluster setting: cluster.routing.allocation.awareness.attributes: zone
My index has 2 replicas, so 3 shard instances (1x primary, 2x replica)
I am trying to ensure the following:
1 of the 3 shard instances to be located in zone fast.
2 of the 3 shard instances to be located in zone slow (because it has persistent storage)
Queries to be run on the shard in zone fast where available.
Inserts to only return as written once they have been replicated.
Is this setup possible?
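For context, the zone wiring this relies on is just a node attribute on each node plus the cluster setting quoted above; roughly:

    # elasticsearch.yml on a node in the fast zone
    node.attr.zone: fast

    # elasticsearch.yml on a node in the slow zone
    node.attr.zone: slow

    # cluster-level setting (already applied, repeated for completeness)
    cluster.routing.allocation.awareness.attributes: zone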
Link to a related question: How do I control where my primary and replica shards are located?
EDIT to add extra information:
Both fast and slow nodes run on a PaaS offering where we are not in control of hardware restarts, meaning there can technically be non-graceful shutdowns/restarts at any point.
I'm worried about unflushed data and/or index corruption, so I am looking for multiple replicas to be on the slow-zone nodes backed by NFS to reduce the likelihood of data loss, despite the fact that this will "overload" the slow zone with redundant data.
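For the write acknowledgement requirement and the unflushed-data worry, the index-level settings I believe are relevant look roughly like this (the index name and values are placeholders, not my real config):

    # require at least 2 active shard copies before a write proceeds;
    # ES then replicates synchronously to all in-sync copies before acknowledging.
    # "request" translog durability (the default) fsyncs the translog before each ack.
    curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
    {
      "index.write.wait_for_active_shards": "2",
      "index.translog.durability": "request"
    }'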
We have been using a 3-node Elasticsearch (v7.6) cluster running in Docker containers. I have been experiencing very high CPU usage on 2 nodes (97%) and moderate CPU load on the other node (55%). The hardware used is m5.xlarge servers.
There are 5 indices with 6 shards and 1 replica. Update operations take around 10 seconds even for updating a single field; it is a similar case with deletes. Querying, however, is quite fast. Is this because of the high CPU load?
2 out of the 5 indices continuously undergo update and write operations as they listen to a Kafka stream. The sizes of these indices are 15 GB and 2 GB, and the rest are around 100 MB.
You need to provide more information to find the root cause:
Are all the ES nodes running in different Docker containers on the same host, or on different hosts?
Do you have resource limits on your ES Docker containers?
How much heap is assigned to ES, and is it 50% of the host machine's RAM?
Do the nodes with high CPU hold the 2 write-heavy indices you mentioned?
What is the refresh interval of the indices that receive heavy indexing?
What are the segment sizes of your 15 GB index? Use https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-segments.html to get this info.
What have you debugged so far, and is there any interesting info you want to share that might help find the issue?
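Most of this can be read straight off the cluster; for example (the index name below is a placeholder):

    # heap and CPU per node
    curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,cpu,load_1m'

    # segment breakdown of the 15 GB index
    curl -s 'localhost:9200/_cat/segments/my-write-heavy-index?v'

    # effective refresh interval of that index
    curl -s 'localhost:9200/my-write-heavy-index/_settings?include_defaults=true&filter_path=**.refresh_interval'

    # what the busy nodes are actually spending CPU on
    curl -s 'localhost:9200/_nodes/hot_threads'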
I am using Logstash for ETL purposes and have 3 indexes in Elasticsearch. Can I insert documents into my 3 indexes through 3 different Logstash processes at the same time to improve parallelization, or should I insert documents into 1 index at a time?
My Elasticsearch cluster configuration looks like:
3 data nodes - 64 GB RAM, SSD disks
1 client node - 8 GB RAM
Shards - 20
Replicas - 1
Thanks
As always, it depends. The distribution concept of Elasticsearch is based on shards. Since the shards of an index live on different nodes, you are automatically spreading the load.
However, if Logstash is your bottleneck, you might gain performance from running multiple processes. Though whether running multiple Logstash processes on a single machine will make a positive impact is doubtful.
Short answer: Parallelising over 3 indexes won't make much sense, but if Logstash is your bottleneck, it might make sense to run those in parallel (on different machines).
PS: The biggest performance improvement generally is batching requests together, but Logstash does that by default.
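If you do want more parallelism without juggling separate OS processes, note that a single Logstash instance can also run multiple pipelines via pipelines.yml; a rough sketch (the ids and paths are made up):

    # config/pipelines.yml -- one pipeline per index, all in one Logstash process
    - pipeline.id: index-a
      path.config: "/etc/logstash/conf.d/index-a.conf"
      pipeline.workers: 2
      pipeline.batch.size: 500
    - pipeline.id: index-b
      path.config: "/etc/logstash/conf.d/index-b.conf"
      pipeline.workers: 2
      pipeline.batch.size: 500

Each pipeline keeps its own workers and batch size, so the batching mentioned above still applies per pipeline.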
I am setting up a Spark cluster. I have HDFS data nodes and Spark master nodes on the same instances.
The current setup is:
1 master (Spark and HDFS)
6 Spark workers and HDFS data nodes
All instances are the same: 16 GB RAM, dual-core (unfortunately).
I have 3 more machines, again same specs.
Now I have two options:
1. Just deploy ES on these 3 machines. The cluster will look like:
1 master (Spark and HDFS)
6 Spark workers and HDFS data nodes
3 Elasticsearch nodes
2. Deploy an ES master on 1 machine, and extend Spark, HDFS, and ES onto all the others. The cluster will look like:
1 master (Spark and HDFS)
1 Elasticsearch master
8 Spark workers, HDFS data nodes, and ES data nodes
My application heavily uses Spark for joins, ML, etc., but we are looking for search capabilities. Search definitely does not need to be real-time, and a refresh interval of up to 30 minutes is even fine with us.
At the same time, the Spark cluster has other long-running tasks apart from ES indexing.
The solution need not be one of the above; I am open to experimenting if someone suggests something. It would also be handy for other devs once concluded.
Also, I am trying the es-hadoop / es-spark project, but I felt ingestion is very slow if I use 3 dedicated nodes: it's around 0.6 million records/minute.
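Since a refresh interval of up to 30 minutes is fine for us, I am assuming the target index could simply be relaxed along these lines during bulk loads (the index name is made up, and I have not benchmarked this yet):

    curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
    {
      "index.refresh_interval": "30m"
    }'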
In my opinion, the optimal approach here mostly depends on your network bandwidth and whether or not it's the bottleneck in your operation.
I would first check whether the network links are saturated via, say, iftop -i any or similar. If you see data rates close to the physical capacity of your network, then you could try to run HDFS + Spark on the same machines that run ES to save the network round trip and speed things up.
If network turns out not to be the bottleneck here, I would look into the way Spark and HDFS are deployed next.
Are you using all the RAM available (Java Xmx set high enough? Spark memory limits? YARN memory limits if Spark is deployed via YARN?)
Also, you should check whether ES or Spark is the bottleneck here; in all likelihood it's ES. Maybe you could spawn additional ES instances: 3 ES nodes feeding 6 Spark workers seems very sub-optimal.
If anything, I'd probably try to invert that ratio: fewer Spark executors and more ES capacity. ES is likely a lot slower at ingesting the data than HDFS is at providing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce Spark worker count
Check if your network links are saturated; if so, put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
Adding more ES nodes is the better bet of the two things you can do :)
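On the "is ES or Spark the bottleneck" point: one quick check is whether the ES write thread pools are queuing or rejecting bulk requests; if they are, ES ingest capacity is the limiter. Something along these lines (older releases call this pool bulk rather than write):

    # active / queued / rejected write tasks per node
    curl -s 'localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected'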
Is it possible to use Ceph as a storage backend for Elasticsearch?
It seems to me that Elasticsearch only supports local disk writes, but I want my Ceph cluster to be used as the storage backend for the Elasticsearch cluster.
We have been using it at my company. We have 3 Elasticsearch nodes with Ceph as storage. It usually ingests about 20,000 records per second (5-6 million per 5 minutes), with load averages of around 1.5 - 2.5 on all nodes.
You need to make sure you have a very fast local network, though.
My setup:
Number of primary shards: 3
Number of replicas: 1
https://nayarweb.com/blog/2017/high-load-on-one-of-elasticsearch-node-on-ceph/
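Elasticsearch itself only sees a mounted filesystem: the Ceph-backed volume (RBD or CephFS) is mounted on each node and pointed at via path.data. A minimal sketch, with a made-up mount point:

    # elasticsearch.yml -- /mnt/ceph-es is a placeholder for wherever
    # the Ceph-backed volume is mounted on this node
    path.data: /mnt/ceph-es/data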