Elastic search kubernetes Sudden rise in data disk usage - elasticsearch

Deployed elastic search Kubernetes in GKE. With 2GB memory and 1GB persistence disk.
We got an error out of storage exception. After that, we have Increased to 2GB on the next day itself it reached 2GB, but we haven’t run any big queries. Then again we have increased the persistence disk size to 10 GB. After that, there is no increase in the data persistence disk storage.
On further analysis, we have found total Indices take 20MB of memory unable to what are the data in the disk.
Used elastic search nodes stats API to get the details on disk and node statistics.
I am unable to find the exact reason why memory exceeds and what are the data in the disk. Also, suggest ways to prevent this future.

It is continuously receiving data and based on your config it creates multiple copies of indices and may create a new index daily. Check the config file.
if the elasticsearch cluster fails each time it creates a backup of data so you may need to delete old backups before restarting the cluster.

Related

ElasticSearch High Search response times on new Machine

I have 4 EC2 machines in ElasticSearch Cluster.
Configuration: c5d.large, Memory: 3.5G, Data Disk: 50GB NVME instance storage.
ElasticSearch Version: 6.8.21
I added the 5th machine with the same configuration c5d.large, Memory: 3.5G, Data Disk: 50GB NVME instance storage. After that, Search requests are taking more time than earlier. I enabled slow logs, which shows only shards that are present on the 5th node are taking more time for search. Also, I can see high disk Read IO happening on new node when I trigger search requests. The iowait% increases by the number of search requests and goes up to 90-95%. All old nodes do not show any read spikes.
I checked elasticsearch.yml, jvm.options and even sysctl -A configurations. there is no diff between config on new nodes vs old nodes.
What could be the issue here?

Elasticsearch throughput on a single node

I am building a steaming + analytics application using kafka and elasticsearch. Using kafka streams apps I am continuously pushing data into elasticsearch. Can a single node elasticsearch with 16GB RAM setup handle a write load of 5000 msgs/sec? The message size is 10KB
There are many other conditions to consider, like cluster memory, network latency and read operations. Writing operations in Elasticsearch are slow. Also it seems like the indexes could grow quickly so performance might start to degrade over time and you'll need to scale vertically.
That said, I think this could work with enough RAM and a queue where pending items wait to be indexed when the cluster is slow.
Adding more nodes should help with uptime, which is a normally a concern with user-facing production apps.

Changing AWS Elasticsearch properties (without elasticsearch.yml) like threadpool queue size

I would like to change my AWS Elasticsearch thread_pool.write.queue_size setting. I see that the recommended technique is to update the elasticsearch.yml file as it can't be done dynamically by the API in the newer versions.
However, since I am using AWS's Elasticsearch service, as far as I'm aware, I don't have access to that file. Is there anyway to make this change? I don't see it referenced for version 6.3 here so I don't know how to do it with AWS.
You do not have a lot of flexibility with AWS ES. In your case, scale your data node instance type to a bigger instance and that should provide you higher thread pool queue size. A note on increasing the number of shards - do not do it unless really required as it may cause performance issues while searching, aggregating etc. A shard can easily hold upto 50 GB of data, so if you have a lot of shards with very less data then think about shrinking the shards. Each shard in itself consumes resources (cpu, memory) etc and shard configuration should be proportional to the heap memory available on the node.

Elasticsearch: What if size of index is larger than available RAM?

Assuming a single machine system with an in-memory indexing schema.
I am not able to find this info in ES docs. Does ES start swapping out the overflowing data, loads it when needed and continue working or it gives an error?
In-memory indices provide better performance at the cost of limiting the index size to the amount of available physical memory.
Via the 1.7 documentation. Memory stores are no longer available in 2.0+.
Under the hood it uses the Lucene RAMDirectory, which will just consume RAM (and eventually swap) until either you hit Java heap limits and ES crashes with out-of-memory errors, or the system gives up and oomkills the Elasticsearch process. Don't use in-memory indexes for large indexes, or for any situation where persistence is important.

ScaleOut Software In Memory DataGrid Using Hadoop

I have been doing some reading on real time processing using hadoop and stumbled upon this http://www.scaleoutsoftware.com/hserver/
From what the documentation says, it looks like they implemented an in memory data grid using the hadoop worker/slave nodes. I have couple of questions here
From my understanding, if i have a data of size 100 GB, i would atleast need 100GB of ram across all nodes on my cluster just for the data + additional ram for task tracker, data node daemons + additional ram for the hServer service that would run on all these nodes. Is my understanding correct?
The software claims they can do real-time data processing by improving the latency issues in hadoop. Is it because, it allows us to write data to the in-memory grid instead of HDFS?
I am new to Big Data technologies. Apologize if some of the questions are naive.
[Full disclosure: I work at ScaleOut Software, the company which created ScaleOut hServer.]
In-memory data grids create a replica for every object to ensure high availability in case of failures.The aggregate amount of memory that is required is the memory used to store the objects with the addition of the memory used to store object replicas. In your example, you will need 200 GB of total memory: 100 GB for objects and 100 GB for replicas. For example, in a four-server cluster, each server needs 50 GB of memory available to the ScaleOut hServer service.
With the current release, ScaleOut hServer takes the first step in enabling real-time analytics by speeding up data access. It does this in two ways, which are implemented using different input/output formats. The first mode of operation uses the grid as a cache for HDFS, and the second uses the grid as the primary storage for a data set, providing support for fast-changing, memory-based data. Accessing data using an in-memory data grid reduces latency by eliminating disk I/O and minimizing network overhead. Also, caching HDFS data provides an additional performance boost by storing keys and values generated by the record reader instead of raw HDFS files in the grid.

Resources