Writing Elasticsearch shards to multiple SSDs in parallel - performance

I'm testing Elasticsearch performance.
I read that I/O performance can be improved through the path.data setting.
So I expected that pointing path.data at multiple SSD paths would increase indexing performance.
But the result was the same as with a single-path config.
Both shards were written to the same SSD, so nothing improved.
Should I set other config options to use multiple SSDs in parallel?
Or should I run multiple instances on a single machine to use multiple SSDs in parallel?
Any help is appreciated.
Thanks.

Related

AWS ElasticSearch Java Process Limit

AWS documentation makes clear the following:
Java Process Limit
Amazon ES limits Java processes to a heap size of 32 GB. Advanced users can specify the percentage of the heap used for field data. For more information, see Configuring Advanced Options and JVM OutOfMemoryError.
Elasticsearch instance types go right up to 500 GB of memory, so my question (as a Java / JVM amateur) is: how many Java processes does Elasticsearch run? I assume a 500 GB Elasticsearch instance (r4.16xlarge.elasticsearch) is somehow going to make use of more than 32 GB plus any host system overhead?
Elasticsearch uses one Java process per node.
Indeed, as quoted, it is advised not to go over a 32 GB heap for performance reasons: above that size the JVM can no longer use compressed object pointers and has to fall back to full 64-bit pointers, which wastes memory and hurts performance.
Another recommendation is to leave memory for the file system cache, which Lucene uses heavily to load doc values and other data from disk into memory.
Depending on your workload, it can be better to run multiple VMs on a single 500 GB server. Use VMs of 64 GB to 128 GB each, with 31 GB given to the Elasticsearch heap and the rest left for the file system cache.
Running multiple VMs on a server means that each VM is an Elasticsearch node.
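As a rough illustration of the sizing advice in the answer above, here is a minimal sketch of the arithmetic. The function name plan_nodes and its defaults (64 GB VMs, 31 GB heap) are assumptions chosen to match the numbers in the answer, not anything provided by Elasticsearch or AWS.

```python
def plan_nodes(host_ram_gb, vm_ram_gb=64, heap_gb=31):
    """Split a large host into VMs, each running one Elasticsearch node.

    Hypothetical helper: it only does the arithmetic from the answer above.
    """
    if heap_gb >= 32:
        raise ValueError("keep the heap below 32 GB so compressed pointers stay enabled")
    if vm_ram_gb <= heap_gb:
        raise ValueError("each VM needs RAM left over for the file system cache")
    node_count = host_ram_gb // vm_ram_gb
    return {
        "nodes": node_count,
        "heap_gb_per_node": heap_gb,
        # Whatever the heap does not take is left to the OS file system cache.
        "fs_cache_gb_per_node": vm_ram_gb - heap_gb,
        "unallocated_gb": host_ram_gb - node_count * vm_ram_gb,
    }


plan = plan_nodes(500)
print(plan)
```

With 64 GB VMs, a 500 GB host fits 7 nodes, each with a 31 GB heap and 33 GB left for the file system cache. Larger VMs trade node count for more cache per node.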

Changing AWS Elasticsearch properties (without elasticsearch.yml) like threadpool queue size

I would like to change my AWS Elasticsearch thread_pool.write.queue_size setting. I see that the recommended technique is to update the elasticsearch.yml file as it can't be done dynamically by the API in the newer versions.
However, since I am using AWS's Elasticsearch service, as far as I'm aware, I don't have access to that file. Is there any way to make this change? I don't see it referenced for version 6.3 here so I don't know how to do it with AWS.
You do not have a lot of flexibility with AWS ES. In your case, scale your data nodes to a bigger instance type, which should give you a larger thread pool queue size. A note on increasing the number of shards: do not do it unless really required, as it may cause performance issues while searching, aggregating, etc. A shard can easily hold up to 50 GB of data, so if you have a lot of shards with very little data, consider consolidating them into fewer shards. Each shard consumes resources (CPU, memory), and the shard configuration should be proportional to the heap memory available on the node.
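To make the 50 GB rule of thumb concrete, here is a small sketch. The helper shard_count is hypothetical, and the 50 GB ceiling is just the figure from the answer above, not a hard limit.

```python
import math


def shard_count(total_data_gb, max_shard_gb=50):
    """Smallest number of primary shards that keeps each shard under max_shard_gb."""
    if total_data_gb < 0 or max_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    # An index always has at least one primary shard.
    return max(1, math.ceil(total_data_gb / max_shard_gb))


for size_gb in (10, 120, 500):
    print(size_gb, "GB ->", shard_count(size_gb), "shard(s)")
```

An index holding 10 GB needs only one shard under this rule, so spreading it across many shards mostly adds overhead.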

How to limit memory usage of Elasticsearch on Ubuntu 17.10?

My Elasticsearch service is consuming around 1 GB.
My total memory is 2 GB. The Elasticsearch service keeps getting shut down. I guess the reason is the high memory consumption. How can I limit the usage to just 512 MB?
This is the memory before starting Elasticsearch
After running sudo service elasticsearch start the memory consumption jumps
I appreciate any help! Thanks!
From the official doc
The default installation of Elasticsearch is configured with a 1 GB heap. For just about every deployment, this number is usually too small. If you are using the default heap values, your cluster is probably configured incorrectly.
So you can change it like this:
There are two ways to change the heap size in Elasticsearch. The easiest is to set an environment variable called ES_HEAP_SIZE. When the server process starts, it will read this environment variable and set the heap accordingly. As an example, you can set it via the command line as follows: export ES_HEAP_SIZE=512m
But it's not recommended. You just can't run Elasticsearch optimally with so little RAM available.
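One caveat on the quoted passage: as far as I know, ES_HEAP_SIZE only applies to older releases, and from Elasticsearch 5.0 onward the heap is set with the standard JVM flags -Xms and -Xmx, either in the jvm.options file or through the ES_JAVA_OPTS environment variable. The sketch below just builds those flags; heap_flags is a hypothetical helper, and which mechanism applies depends on the version you have installed.

```python
import re


def heap_flags(size):
    """Return JVM flags that pin the minimum and maximum heap to the same size."""
    # Accept sizes such as "512m" or "1g"; reject anything else early.
    if not re.fullmatch(r"[1-9][0-9]*[kKmMgG]", size):
        raise ValueError(f"unrecognised heap size: {size!r}")
    return f"-Xms{size} -Xmx{size}"


flags = heap_flags("512m")
# Value to export as ES_JAVA_OPTS, or to split into two lines in jvm.options.
print(flags)
```

Setting both flags to the same value avoids the heap being resized at runtime.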

MemSQL performance issues

I have a single-node MemSQL install with one master aggregator and two leaves (all on a single box). The machine has 2 cores and 16 GB RAM, and the MemSQL columnstore data is ~7 GB (coming from a 21 GB CSV). When running queries on the data, memory usage caps at ~2150 MB (11 GB sitting free). I've configured both leaves to have maximum_memory = 7000 in the memsql.cnf files for both nodes (memsql-optimize does similar). During query execution, the master aggregator sits at 100% CPU, with the leaves at 0-8% CPU.
This does not seem like an efficient use of system resources, but I'm not sure what I can do to configure the system or MemSQL to make more efficient use of CPU or memory. Any help would be greatly appreciated!
If your machine is at 100% CPU (on all cores) during query execution, it doesn't really matter which MemSQL node is using it: your workload throughput is still bottlenecked on CPU. However, for most queries you wouldn't expect most of the CPU use to be on the aggregator, so you may want to look at the EXPLAIN or PROFILE output for your queries.
Columnstore data is cached in memory as part of the OS file cache - it isn't counted as memory reserved by MemSQL, which is why your memory usage is less than the size of the columnstore data.
My database was coming from somewhere other than the current MemSQL install (perhaps an older cluster configuration), despite there being only a single MemSQL cluster on the machine. The Databases section in the Web UI displayed no databases or tables, yet my queries succeeded with the expected answers.
Dropping the database and reloading it from CSV remedied the situation. All cores are now used during queries.

How to increase Aerospark read performance?

I am using the latest Aerospark connector to work with Aerospike and Spark ML. But after inserting around 60M records into Aerospike, read operations take far too long. For example, fetching around 500K records from a set that contains 60M records takes Aerospark ~30 minutes. In the htop output, Aerospike uses only 7% of CPU.
Each record contains around 1k of data. Aerospike and Spark are hosted on the same node. The data is filtered by a secondary index.
How can I speed up read operations? It seems Aerospark is working on only one thread; how can I parallelize this job? Any suggestions?
Aerospike conf:
memory-size 8G
default-ttl 30d
storage-engine device {
    file /vol/rmla.data
    filesize 900G
}
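For scale, the figures in the question work out to a very low effective read rate. This is only a back-of-the-envelope sketch, and it assumes "1k" means roughly 1,000 bytes per record.

```python
records = 500_000            # records fetched, from the question
record_size_bytes = 1_000    # assumption: "1k" per record taken as ~1,000 bytes
elapsed_seconds = 30 * 60    # ~30 minutes, from the question

records_per_second = records / elapsed_seconds
megabytes_per_second = records * record_size_bytes / elapsed_seconds / 1_000_000

print(f"{records_per_second:.0f} records/s, {megabytes_per_second:.2f} MB/s")
```

That is a few hundred records per second, which gives a baseline to compare against after any tuning.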
Without knowing anything about your server, and with just a snippet of config, I'll stick to some generic recommendations that should improve your experience.
Disk IO
You are clearly bound by the read speed from your storage media, which you declared to be a file. If you're storing the data on disk, you can either use file or device in the storage-engine device config block.
There is a big difference in read and write latency between a file on an HDD and raw device access to an SSD. Typically Aerospike is used with data stored on enterprise-grade SSD devices. Read the section in the operations manual about initializing and setting up the drive. Declaring multiple devices for the namespace will give you a linear performance boost (two drives will have double the read and write throughput of one of the same kind).
In Amazon EC2 you could use the c3, i2, r3, or i3 instance families for this purpose. The ephemeral SSD devices of EC2 instances don't need to be over-provisioned, have their RAID turned off, etc. They only need to be initialized before they're first used. Do not use EBS drives for primary storage, as they're too slow.
Cluster Configuration
The Spark connector uses lots of scan operations. Make sure that you've configured scan-threads under your service config block to the number of cores. If you don't know how many cores you have, do cat /proc/cpuinfo. If Spark is the only client using the Aerospike cluster, you can tune the scan threads higher.
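If you would rather not read /proc/cpuinfo by hand, the core count can be obtained programmatically. The sketch below prints a line in the form the answer describes; scan_threads_line is a hypothetical helper, and the scan-threads setting is taken from the answer as-is, so check the configuration reference for your Aerospike version before relying on it.

```python
import os


def scan_threads_line(cores=None):
    """Build a 'scan-threads N' config line, defaulting N to the local core count."""
    # os.cpu_count() can return None on unusual platforms, so fall back to 1.
    cores = cores or os.cpu_count() or 1
    return f"scan-threads {cores}"


print(scan_threads_line())
```

Run it on the Aerospike host itself, since the value should reflect that machine's cores.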
Connector Configuration
You can modify the connector config options for lower write latency. Optionally set aerospike.commitLevel to CommitLevel.COMMIT_MASTER.
Upgrade Version
As of November 28, 2016, aerospike/aerospark supports Spark 2.0. Make sure you're using the latest code.
Note: See the new tutorial for Aerospark on the Aerospike website.

Resources