Advice on ElasticSearch Cluster Configuration - elasticsearch

Brand new to Elasticsearch. I've been doing tons of reading, but I am hoping that the experts on SO might be able to weigh in on my cluster configuration to see if there is something that I am missing.
Currently I am using ES (1.7.3) to index some very large text files (~700 million lines) per file and looking for one index per file. I am using logstash (V2.1) as my method of choice for indexing the files. Config file is here for my first index:
input {
file {
path => "L:/news/data/*.csv"
start_position => "beginning"
sincedb_path => "C:/logstash-2.1.0/since_db_news.txt"
}
}
filter {
csv {
separator => "|"
columns => ["NewsText", "Place", "Subject", "Time"]
}
mutate {
strip => ["NewsText"]
lowercase => ["NewsText"]
}
}
output {
elasticsearch {
action => "index"
hosts => ["xxx.xxx.x.xxx", "xxx.xxx.x.xxx"]
index => "news"
workers => 2
flush_size => 5000
}
stdout {}
}
My cluster contains 3 boxes running on Windows 10 with each running a single node. ES is not installed as a service and I am only standing up one master node:
Master node: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7
Data Node #1: 8GB RAM, ES_HEAP_SIZE = 3500m, Single Core i7
This node is currently running the logstash instance with LS_HEAP_SIZE= 3000m
Data Node #2: 16GB RAM, ES_HEAP_SIZE = 8000m, Single Core i7
I have ES currently configured at the default 5 shards + 1 duplicate per index.
At present, each node is configured to write data to an external HD and logs to another.
In my test run, I am averaging 10K events per second with Logstash. My main goal is to optimize the speed at which these files are loaded into ES. I am thinking that I should be closer to 80K based on what I have read.
I have played around with changing the number of workers and flush size, but can't seem to get beyond this threshold. I think I may be missing something fundamental.
My questions are two fold:
1) Is there anything that jumps out as fishy about my cluster configuration or some advice that may improve the process?
2) Would it help if I ran an instance of logstash on each data node indexing separate files?
Thanks so much for any and all help in advance and for taking the time to read.
-Zinga

i'd first have a look at whether logstash or es is the bottleneck in your setup. try ingesting the file without the es output. what throughtput are you getting from plain/naked logstash.
if this is considerably higher then you can continue on the es side of things. A good starter might be:
https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
If plain logstash doesn't yield a significant increase in throughput you can can try increasing / parallelising logstash across your machines.
hope that helps

Related

Performance when using NFS storage for Elasticsearch

I have a server with 32 cores, 62 GB of RAM but we have NFS storage and I think it's starting to bottleneck our daily work. In our Kibana errors like queue_size are appearing more frequently. We just got a new (same) server to use it as a replica and share the load, will this help? What other recomendations you have? We have multiple dashboards with like 20 different variables each, will they be evenly distributed between the primary node and the replica? Unfortunately, local storage is not an option.
Are you actively indexing data on these nodes? If yes you can increase refresh_interval
PUT /myindex/_settings
{
"index" : {
"refresh_interval" : "30s"
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html to make system less demanding for IO. You can completely disable refresh functionality and trigger it manually.
PUT /myindex/_settings
{
"index" : {
"refresh_interval" : "-1"
}
}
POST /myindex/_refresh
Take a look on Bulk API it significantly decrease load on indexing stage.
Adding new servers to cluster helps too. Elasticsearch designed to scale horizontally. From my experience you can run 6-8 virtual nodes on server you have described. Put more shards to evenly distribute load.
Do you see what is your bottleneck (Lan, IO, CPU)?

Best practices for ElasticSearch initial bulk import

I'm running a docker setup with ElasticSearch, Logstash, Filebeat and Kibana inspired by the Elastic Docker Compose. I need to initial load 15 GB og logfiles into the system (Filebeat->Logstash->ElasticSearch) but I'm having some issues with performance.
It seems that Filebeat/Logstash is outputting too much work for ElasticSearch. After some time I begin to see a bunch of errors in ElasticSearch like this:
[INFO ][o.e.i.IndexingMemoryController] [f8kc50d] now throttling indexing for shard [log-2017.06.30]: segment writing can't keep up
I've found this old documentation article on how to disable merge throttling: https://www.elastic.co/guide/en/elasticsearch/guide/master/indexing-performance.html#segments-and-merging.
PUT /_cluster/settings
{
"transient" : {
"indices.store.throttle.type" : "none"
}
}
But in current version (ElasticSearch 6) it gives me this error:
{
"error": {
"root_cause": [
{
"type": "illegal_argument_exception",
"reason": "transient setting [indices.store.throttle.type], not dynamically updateable"
}
],
"type": "illegal_argument_exception",
"reason": "transient setting [indices.store.throttle.type], not dynamically updateable"
},
"status": 400
}
How can I solve the above issue?
The VM has 4 CPU cores (Intel Xeon E5-2650) and ElasticSearch is assigned 4GB of RAM, Logstash and Kibana 1GB each. Swapping is disabled using "swapoff -a". X-pack and monitoring is enabled. I only have one ES node for this log server. Do I need to have multiple node for this initial bulk import?
EDIT1:
Changing the number_of_replicas and refresh_interval seems to make it perform better. Still testing.
PUT /log-*/_settings
{
"index.number_of_replicas" : "0",
"index.refresh_interval" : "-1"
}
Most likely the bottleneck is IO (you can confirm this running iostat, also it would be useful if you post ES monitoring screenshots), so you need to reduce pressure on it.
Default ES configuration causes generation of many index segments during a bulk load. To fix this, for the bulk load, increase index.refresh_interval (or set it to -1) - see doc. The default value is 1 sec, which causes new segment to be created every 1 second, also try to increase batch size and see if it helps.
Also if you use spinning disks,set index.merge.scheduler.max_thread_count to 1. This will allow only one thread to perform segments merging and will reduce contention for IO between segments merging and indexing.

Many Logstash instances reading from Redis

I have one Logstash process running inside one node consuming from a Redis list, but I'm afraid that just one process cannot handle the data throughput without a great delay.
I was wondering if I run one more process for Logstash inside this same machine will perform a little better, but I'm not certain about that. I know that my ES index is not a bottleneck.
Would Logstash duplicate my data, if I consume the same list? This approach seems to be a right thing to do?
Thanks!
Here my input configuration:
input {
redis {
data_type => "list"
batch_count => 300
key => "flight_pricing_stats"
host => "my-redis-host"
}
}
You could try adjusting logstash input threads, if you are going to run another logstash process in the same machine. Default is 1.
input {
redis {
data_type => "list"
batch_count => 300
key => "flight_pricing_stats"
host => "my-redis-host"
threads => 2
}
}
You could run more than one logstash against the same redis, events should not get duplicated. But I'm not sure that would help.
If you're not certain whats going on, I recommend the logstash monitoring API. It can help you narrow down your real bottlenck.
And also an interesting post from elastic on the subject: Logstash Lines Introducing a benchmarking tool for Logstash

How to add dynamic hosts in Elasticsearch and logstash

I have prototype working for me with Devices sending logs and then logstash parsing it and putting into elasticsearch.
Logstash output code :-
output{
if [type] == "json" {
elasticsearch {
hosts => ["host1:9200","host2:9200","host3:9200"]
index => "index-metrics-%{+xxxx.ww}"
}
}
}
Now My Question is :
I will be producing this solution. For simplicity assume that I have one Cluster and I have right now 5 nodes inside that cluster.
So I know I can give array of 5 nodes IP / Hostname in elasticsearch output plugin and then it will round robin to distribute data.
How can I avoid putting all my node IP / hostnames into logstash config file.
As system goes into production I don't want to manually go into each logstash instance and update these hosts.
What are the best practices one should follow in this case ?
My requirement is :
I want to run my ES cluster and I want to add / remove / update any number of node at any time. I need all of my logstash instances send data irrespective of changes at ES side.
Thanks.
If you want to add/remove/update you will need to run sed or some kind of string replacement before the service startup. Logstash configs are "compiled" and cannot be changed that way.
hosts => [$HOSTS]
...
$ HOSTS="\"host1:9200\",\"host2:9200\""
$ sed "s/\$HOSTS/$HOSTS/g" $config
Your other option is to use environment variables for the dynamic portion, but that won't allow you to use a dynamic amount of hosts.

logstash with elasticsearch_http

Apparently logstash OnDemand account does not work when I wanted to post an issue.
Anyways, I have a logstash setup with redis, elasticsearch, and kibana. My logstash are collecting logs from several files and putting in redis just fine.
Logstash version 1.3.3
Elasticsearch version 1.0.1
The only thing I have in elasticsearch_http for logstash is the host name. This all setup seems to glue together just fine.
The problem is that the elasticsearch_http is not consuming the redis entries as they come. What I have seen by running it in debug mode is that it flush about 100 entries after every 1 min (flush_size and idle_flush_time's default values). The documentation however states, from what I understand is, that it will force a flush in case the 100 flush_size is not satisfied (for example we had 10 messages in last 1 min). But it seems to work the other way. Its flushing about 100 messages every 1 min only. I changed the size to 2000 and it flush 2000 every min or so.
Here is my logstash-indexer.conf
input {
redis {
host => "1xx.xxx.xxx.93"
data_type => "list"
key => "testlogs"
codec => json
}
}
output {
elasticsearch_http {
host => "1xx.xxx.xxx.93"
}
}
Here is my elasticsearch.yml
cluster.name: logger
node.name: "logstash"
transport.tcp.port: 9300
http.port: 9200
discovery.zen.ping.unicast.hosts: ["1xx.xxx.xxx.93:9300"]
discovery.zen.ping.multicast.enabled: false
#discovery.zen.ping.unicast.enabled: true
network.bind_host: 1xx.xxx.xxx.93
network.publish_host: 1xx.xxx.xxx.93
The indexer, elasticsearch, redis, and kibana are on same server. The log collection from file is done on another server.
So I'm going to suggest a couple of different approaches to solve you problem. Logstash as you are discovering can be a bit quirky so I've found a these approaches useful in dealing with unexpected behavior from logstash.
Use the elasticsearch output instead of elasticsearch_http. You
can get the same functionality by using elasticsearch output with
protocol set to http. The elasticsearch output is more mature
(milestone 2 vs milestone 3) and I've seen this change make a
difference before.
Set the defaults for idle_flush_time and flush_size. There have
been issues with Logstash defaults previously, I've found it to be a
lot safer to set them explicitly. idle_flush_time is in seconds,
flush_size is the number of records to flush.
Upgrade to more recent versions of logstash. There is
enough of a change in how logstash is deployed with version 1.4.X
(http://logstash.net/docs/1.4.1/release-notes) that I'd that I'd
bite the bullet and upgrade. It's also significantly easier to get
attention if you still have a problem with the most recent stable
major release.
Make certain your Redis version matches those support by your
logstash version.
Experiment with setting the batch, batch_events and batch_timeout
values for the Redis output. You are using the list data_type.
list supports various batch options and as with some other
parameters it's best not to assume the defaults are always being set
correctly.
Do all of the above. In addition to trying the first set of
suggestions, I'd try all of them together in various combinations.
Keep careful records of each test run. Seems obvious but between all
the variations above it's easy to lose track - I'd keep careful
records and try to change only one variation at a time.

Resources