Bad indexing performance of Elasticsearch

Currently, I'm using Elasticsearch to store and query some logs. We set up a five-node Elasticsearch cluster: two indexing nodes and three query nodes. Each of the two indexing servers runs Redis, Logstash, and Elasticsearch, and Elasticsearch uses NFS storage as its data store. Our requirement is to index 300 log entries/second, but the best performance I can get out of Elasticsearch is only 25 log entries/second!
The Elasticsearch JVM heap (-Xmx) is 16 GB.
Version of each component:
Redis: 2.8.12
Logstash: 1.4.2
Elasticsearch: 1.5.0
Our current index settings are like this:
{
  "userlog" : {
    "settings" : {
      "index" : {
        "index" : {
          "store" : {
            "type" : "mmapfs"
          },
          "translog" : {
            "flush_threshold_ops" : "50000"
          }
        },
        "number_of_replicas" : "1",
        "translog" : {
          "flush_threshold_size" : "1G",
          "durability" : "async"
        },
        "merge" : {
          "scheduler" : {
            "max_thread_count" : "1"
          }
        },
        "indexing" : {
          "slowlog" : {
            "threshold" : {
              "index" : {
                "trace" : "2s",
                "info" : "5s"
              }
            }
          }
        },
        "memory" : {
          "index_buffer_size" : "3G"
        },
        "refresh_interval" : "30s",
        "version" : {
          "created" : "1050099"
        },
        "creation_date" : "1447730702943",
        "search" : {
          "slowlog" : {
            "threshold" : {
              "fetch" : {
                "debug" : "500ms"
              },
              "query" : {
                "warn" : "10s",
                "trace" : "1s"
              }
            }
          }
        },
        "indices" : {
          "memory" : {
            "index_buffer_size" : "30%"
          }
        },
        "uuid" : "E1ttme3fSxKVD5kRHEr_MA",
        "index_concurrency" : "32",
        "number_of_shards" : "5"
      }
    }
  }
}
Here's my logstash config:
input {
  redis {
    host => "eanprduserreporedis01.eao.abn-iad.ea.com"
    port => "6379"
    type => "redis-input"
    data_type => "list"
    key => "userLog"
    threads => 15
  }
  # Second redis block begin
  redis {
    host => "eanprduserreporedis02.eao.abn-iad.ea.com"
    port => "6379"
    type => "redis-input"
    data_type => "list"
    key => "userLog"
    threads => 15
  }
  # Second redis block end
}
output {
  elasticsearch {
    cluster => "customizedlog_prod"
    index => "userlog"
    workers => 30
  }
  stdout {}
}
A very strange thing is that although the current indexing speed is only ~20/s, the I/O wait is very high, almost 70%, and it is mostly read traffic. According to nfsiostat, the current read speed is about 200 Mbps! So essentially, indexing each log entry reads about 10 Mbits of data (200 Mbps / ~20 entries per second), which is insane because the average length of our log entries is less than 10 KB.
So I took a jstack dump of Elasticsearch; here's one RUNNABLE thread:
"elasticsearch[somestupidhostname][bulk][T#3]" daemon prio=10 tid=0x00007f230c109800 nid=0x79f6 runnable [0x00007f1ba85f0000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:730)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:715)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
at org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock(SegmentTermsEnumFrame.java:152)
at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:506)
at org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.lookup(PerThreadIDAndVersionLookup.java:104)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:150)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:161)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1002)
at org.elasticsearch.index.engine.InternalEngine.innerCreate(InternalEngine.java:277)
- locked <0x00000005fc76b938> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:256)
at org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:455)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:437)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:149)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:515)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:422)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Can anyone tell me what Elasticsearch is doing and why indexing is so slow? And is it possible to improve it?

It may not be entirely responsible for your poor performance, but check out the batch_count (batch size) option for the Redis input. I'll bet it'll get better if you're pulling more than one document from Redis at a time.
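For illustration, a rough sketch of one of your Redis inputs with batching turned on (this assumes your Logstash version's redis input supports the batch_count option; the value 100 is just an example):
input {
  redis {
    host => "eanprduserreporedis01.eao.abn-iad.ea.com"
    port => "6379"
    data_type => "list"
    key => "userLog"
    threads => 15
    # pull up to 100 events per Redis round trip instead of one at a time
    batch_count => 100
  }
}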

Related

Why doesn't the auto-delete index policy work in Elasticsearch?

I created an index policy in Kibana to delete indices older than 7 days; the configuration is below.
I have indices that use this policy, but none of them get deleted. Below is the settings of one of the indices; it already specifies the policy to use, metrics-log-retention. Is there anything I missed?
{
  "aws-logs-2022-02-01" : {
    "settings" : {
      "index" : {
        "lifecycle" : {
          "name" : "metrics-log-retention"
        },
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "aws-logs-2022-02-01",
        "creation_date" : "1643673636747",
        "priority" : "100",
        "number_of_replicas" : "1",
        "uuid" : "lLmO753nRpuw6bauKIJI2Q",
        "version" : {
          "created" : "7150299"
        }
      }
    }
  }
}
Below is the hot phase: I have disabled all options under hot, as shown in the screenshot, but it still doesn't work.
Below is the raw data for the index policy:
{
  "metrics-log-retention" : {
    "version" : 4,
    "modified_date" : "2022-02-10T22:24:14.492Z",
    "policy" : {
      "phases" : {
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "50gb",
              "max_primary_shard_size" : "50gb",
              "max_age" : "1d"
            }
          }
        },
        "delete" : {
          "min_age" : "6d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        }
      }
    },
    "in_use_by" : {
      "indices" : [
        "aws-logs-2022-02-01",
        "aws-logs-2022-02-04",
        "aws-logs-2022-02-05",
        "aws-logs-2022-02-02",
        "aws-logs-2022-02-03",
        "aws-metrics-2022-02-01",
        "aws-metrics-2022-02-07",
        "aws-logs-2022-02-08",
        "aws-metrics-2022-02-06",
        "aws-logs-2022-02-09",
        "aws-logs-2022-02-06",
        "aws-metrics-2022-02-09",
        "aws-logs-2022-02-07",
        "aws-metrics-2022-02-08",
        "aws-metrics-2022-02-03",
        "aws-metrics-2022-02-02",
        "aws-metrics-2022-02-05",
        "aws-metrics-2022-02-04",
        "aws-logs-2022-02-11",
        "aws-logs-2022-02-12",
        "aws-logs-2022-02-10",
        "aws-logs-2022-02-13",
        "aws-metrics-2022-02-10",
        "aws-metrics-2022-02-12",
        "aws-metrics-2022-02-11",
        "aws-metrics-2022-02-13"
      ],
      "data_streams" : [ ],
      "composable_templates" : [ ]
    }
  }
}
As you can see in the hot phase advanced settings, the default rollover settings are 30 days or 50 GB, so your indices will stay in the hot phase for 30 days unless they grow over 50 GB before that.
Once an index gets out of the hot phase it enters the delete phase, and if you hover over the (i) icon, you can see that the 7 days are calculated AFTER the rollover from the hot phase.
So if you really want your indices to be deleted after 7 days, you need to:
configure the hot phase to be shorter (say 6 days)
configure the delete phase to kick in 1 day after rollover
That way, an index will be created, stay six days in the hot phase, and then be deleted one day after rollover; a policy sketch along those lines follows below.
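For illustration, a minimal policy sketch along those lines (illustrative values, not the exact policy from the question; adjust max_size to your needs):
PUT _ilm/policy/metrics-log-retention
{
  "policy" : {
    "phases" : {
      "hot" : {
        "actions" : {
          "rollover" : {
            "max_age" : "6d",
            "max_size" : "50gb"
          }
        }
      },
      "delete" : {
        "min_age" : "1d",
        "actions" : {
          "delete" : { }
        }
      }
    }
  }
}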
Alternatively, just add this to the crontab on the ES host and it will delete old indices automatically (the date format matches the aws-logs-YYYY-MM-DD index names):
0 7 * * * curl -u LOGIN:PASSWORD -XDELETE http://localhost:9200/aws-logs-$(date --date="7 days ago" +"%Y-%m-%d")

Creating a data stream through Logstash

I have installed an Elasticsearch cluster v7.14, and I have created an ILM policy and an index template. However, the data stream parameters mentioned in the Logstash pipeline file are giving an error.
ILM policy -
{
  "testpolicy" : {
    "version" : 1,
    "modified_date" : "2021-08-28T02:58:25.942Z",
    "policy" : {
      "phases" : {
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_primary_shard_size" : "900mb",
              "max_age" : "2d"
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "delete" : {
          "min_age" : "2d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        }
      }
    },
    "in_use_by" : {
      "indices" : [ ],
      "data_streams" : [ ],
      "composable_templates" : [ ]
    }
  }
}
Index template -
{
  "index_templates" : [
    {
      "name" : "access_template",
      "index_template" : {
        "index_patterns" : [
          "test-data-stream*"
        ],
        "template" : {
          "settings" : {
            "index" : {
              "number_of_shards" : "1",
              "number_of_replicas" : "0"
            }
          },
          "mappings" : {
            "_routing" : {
              "required" : false
            },
            "dynamic_date_formats" : [
              "strict_date_optional_time",
              "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
            ],
            "numeric_detection" : true,
            "_source" : {
              "excludes" : [ ],
              "includes" : [ ],
              "enabled" : true
            },
            "dynamic" : true,
            "dynamic_templates" : [ ],
            "date_detection" : true
          }
        },
        "composed_of" : [ ],
        "priority" : 500,
        "version" : 1,
        "data_stream" : {
          "hidden" : false
        }
      }
    }
  ]
}
logstash pipeline config file -
input {
  beats {
    port => 5044
  }
}
filter {
  if [log_type] == "access_server" and [app_id] == "pa" {
    grok {
      match => {
        "message" => "%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:%{MINUTE}(?::?%{SECOND})\| %{USERNAME:exchangeId}\| %{DATA:trackingId}\| %{NUMBER:RoundTrip:int}%{SPACE}ms\| %{NUMBER:ProxyRoundTrip:int}%{SPACE}ms\| %{NUMBER:UserInfoRoundTrip:int}%{SPACE}ms\| %{DATA:Resource}\| %{DATA:subject}\| %{DATA:authmech}\| %{DATA:scopes}\| %{IPV4:Client}\| %{WORD:method}\| %{DATA:Request_URI}\| %{INT:response_code}\| %{DATA:failedRuleType}\| %{DATA:failedRuleName}\| %{DATA:APP_Name}\| %{DATA:Resource_Name}\| %{DATA:Path_Prefix}"
      }
    }
    mutate {
      replace => {
        "[type]" => "access_server"
      }
    }
  }
}
output {
  if [log_type] == "access_server" {
    elasticsearch {
      hosts => ['http://10.10.10.76:9200']
      user => elastic
      password => xxx
      data_stream => "true"
      data_stream_type => "logs"
      data_stream_dataset => "access"
      data_stream_namespace => "default"
      ilm_rollover_alias => "access"
      ilm_pattern => "000001"
      ilm_policy => "testpolicy"
      template => "/tmp/access_template"
      template_name => "access_template"
    }
  }
  elasticsearch {
    hosts => ['http://10.10.10.76:9200']
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    user => elastic
    password => xxx
  }
}
After the deployment was done, I can only see system indices; the data stream is not created.
[2021-08-28T12:42:50,103][ERROR][logstash.outputs.elasticsearch][main] Invalid data stream configuration, following parameters are not supported: {"template"=>"/tmp/pingaccess_template", "ilm_pattern"=>"000001", "template_name"=>"pingaccess_template", "ilm_rollover_alias"=>"pingaccess", "ilm_policy"=>"testpolicy"}
[2021-08-28T12:42:50,547][ERROR][logstash.javapipeline ][main] Pipeline error {:pipeline_id=>"main", :exception=>#<LogStash::ConfigurationError: Invalid data stream configuration: ["template", "ilm_pattern", "template_name", "ilm_rollover_alias", "ilm_policy"]>, :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-elasticsearch-11.0.2-java/lib/logstash/outputs/elasticsearch/data_stream_support.rb:57:in `check_data_stream_config!'"
[2021-08-28T12:42:50,702][ERROR][logstash.agent ] Failed to execute action {:id=>:main, :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: PipelineAction::Create<main>, action_result: false", :backtrace=>nil}
The error says that parameters like template => "/tmp/pingaccess_template", ilm_pattern => "000001", template_name => "pingaccess_template", ilm_rollover_alias => "pingaccess", and ilm_policy => "testpolicy" are not valid, but they are mentioned in the link below:
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-data-streams
The solution is to use Logstash without it being "aware" of data_stream.
FIRST of all (before running Logstash), create your ILM policy and index template, BUT add "index.lifecycle.name" to the settings. That way, you are linking the template and the ILM policy. Also, don't forget the data_stream section in the index template.
{
  "index_templates" : [
    {
      "name" : "access_template",
      "index_template" : {
        "index_patterns" : [
          "test-data-stream*"
        ],
        "template" : {
          "settings" : {
            "index" : {
              "number_of_shards" : "1",
              "number_of_replicas" : "0",
              "index.lifecycle.name" : "testpolicy"
            }
          },
          "mappings" : {
            ...
          }
        },
        "composed_of" : [ ],
        "priority" : 500,
        "version" : 1,
        "data_stream" : {
          "hidden" : false
        }
      }
    }
  ]
}
Keep the Logstash output as if data_stream didn't exist, but add action => "create". This is because you can't use the "index" API with data streams; you need the _create API call.
output {
  elasticsearch {
    hosts => ['http://10.10.10.76:9200']
    index => "test-data-stream"
    user => elastic
    password => xxx
    action => "create"
  }
}
That way, Logstash will output to ES, but the index template will be applied automatically (because of the pattern match), and the ILM policy and data_stream will be applied as well.
Important: to make it work, you need to start from scratch. If the index "test-data-stream" already exists in ES (as a traditional index), then the data stream will NOT be created. Test with another index name to make sure it works.
The documentation is unclear, but the plugin does not support those options when data stream output is enabled. The plugin is logging the options returned by the invalid_data_stream_params function, which allows action, routing, data_stream, anything else that starts with data_stream_, the shared options defined by the mixin, and the common options defined by the output plugin base. A data-stream-only output sketch follows below.
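For illustration, a sketch of the output block reduced to parameters that pass that check (hosts, credentials, and data_stream_* values are taken from the question; the template and ilm_* options are simply removed):
output {
  elasticsearch {
    hosts => ['http://10.10.10.76:9200']
    user => elastic
    password => xxx
    # only data-stream-compatible options; no template/ilm_* settings
    data_stream => "true"
    data_stream_type => "logs"
    data_stream_dataset => "access"
    data_stream_namespace => "default"
  }
}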

Elasticsearch - moving from multiple servers to one server

I have a cluster of 5 servers for Elasticsearch, all with the same version of Elasticsearch.
I need to move all data from servers 2, 3, 4, 5 to server 1.
How can I do it?
How can I know which server has data at all?
After changing _cluster/settings with:
PUT _cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.require._host" : "server1"
  }
}
I get the following for curl -XGET http://localhost:9200/_cat/allocation?v:
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
6 54.5gb 170.1gb 1.9tb 2.1tb 7 *.*.*.* *.*.*.* node-5
6 50.4gb 167.4gb 1.9tb 2.1tb 7 *.*.*.* *.*.*.* node-3
6 22.6gb 139.8gb 2tb 2.1tb 6 *.*.*.* *.*.*.* node-2
6 49.8gb 166.6gb 1.9tb 2.1tb 7 *.*.*.* *.*.*.* node-4
6 54.8gb 172.1gb 1.9tb 2.1tb 7 *.*.*.* *.*.*.* node-1
and the following for GET _cluster/settings?include_defaults:
#! Deprecation: [node.max_local_storage_nodes] setting was deprecated in Elasticsearch and will be removed in a future release!
{
  "persistent" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "require" : {
            "_host" : "server1"
          }
        }
      }
    }
  },
  "transient" : { },
  "defaults" : {
    "cluster" : {
      "max_voting_config_exclusions" : "10",
      "auto_shrink_voting_configuration" : "true",
      "election" : {
        "duration" : "500ms",
        "initial_timeout" : "100ms",
        "max_timeout" : "10s",
        "back_off_time" : "100ms",
        "strategy" : "supports_voting_only"
      },
      "no_master_block" : "write",
      "persistent_tasks" : {
        "allocation" : {
          "enable" : "all",
          "recheck_interval" : "30s"
        }
      },
      "blocks" : {
        "read_only_allow_delete" : "false",
        "read_only" : "false"
      },
      "remote" : {
        "node" : {
          "attr" : ""
        },
        "initial_connect_timeout" : "30s",
        "connect" : "true",
        "connections_per_cluster" : "3"
      },
      "follower_lag" : {
        "timeout" : "90000ms"
      },
      "routing" : {
        "use_adaptive_replica_selection" : "true",
        "rebalance" : {
          "enable" : "all"
        },
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "2",
          "node_initial_primaries_recoveries" : "4",
          "same_shard" : {
            "host" : "false"
          },
          "total_shards_per_node" : "-1",
          "shard_state" : {
            "reroute" : {
              "priority" : "NORMAL"
            }
          },
          "type" : "balanced",
          "disk" : {
            "threshold_enabled" : "true",
            "watermark" : {
              "low" : "85%",
              "flood_stage" : "95%",
              "high" : "90%"
            },
            "include_relocations" : "true",
            "reroute_interval" : "60s"
          },
          "awareness" : {
            "attributes" : [ ]
          },
          "balance" : {
            "index" : "0.55",
            "threshold" : "1.0",
            "shard" : "0.45"
          },
          "enable" : "all",
          "node_concurrent_outgoing_recoveries" : "2",
          "allow_rebalance" : "indices_all_active",
          "cluster_concurrent_rebalance" : "2",
          "node_concurrent_recoveries" : "2"
        }
      },
      ...
      "nodes" : {
        "reconnect_interval" : "10s"
      },
      "service" : {
        "slow_master_task_logging_threshold" : "10s",
        "slow_task_logging_threshold" : "30s"
      },
      ...
      "name" : "cluster01",
      ...
      "max_shards_per_node" : "1000",
      "initial_master_nodes" : [ ],
      "info" : {
        "update" : {
          "interval" : "30s",
          "timeout" : "15s"
        }
      }
    },
    ...
You can use shard allocation filtering to move all your data to server 1.
Simply run this:
PUT _cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.require._name" : "node-1",
    "cluster.routing.allocation.exclude._name" : "node-2,node-3,node-4,node-5"
  }
}
Instead of _name you can also use _ip or _host depending on what is more practical for you.
After running this command, all primary shards will migrate to server1 (the replicas will be unassigned). You just need to make sure that server1 has enough storage space to store all the primary shards.
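If you want to watch the shards move, something like the cat shards API should show the relocation progress (a standard API; the column list here is just an example):
GET _cat/shards?v&h=index,shard,prirep,state,node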
If you want to get rid of the unassigned replicas (and get back to green state), simply run this:
PUT _all/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}

ILM using Logstash Elasticsearch output plugin doesn't work

I'm trying to implement ILM for an index to make proper use of the hardware, using the Elasticsearch output plugin. It looks like I misunderstand how Logstash manages ILM.
I have an ELK stack, version 7.1.0, in Docker. X-Pack is activated with a trial license.
The index template is managed by Logstash Elasticsearch output plugin and the index lifecycle policy was created using Kibana.
Here is the output section of Logstash pipeline:
output {
  elasticsearch {
    hosts => ["http://eshost:9200"]
    user => "logstash_writer"
    password => "pass"
    template => "/usr/share/logstash/es_templates/ilm-template.json"
    template_name => "ilm-template"
    template_overwrite => true
    ilm_enabled => true
    ilm_rollover_alias => "ilm-index"
    ilm_pattern => "000001"
    ilm_policy => "base-policy"
  }
}
The logstash_writer user has the default logstash_writer role, with permissions for ILM management.
Elasticsearch index template ilm-template.json:
{
  "settings" : {
    "index.number_of_replicas" : "1",
    "index.number_of_shards" : "1",
    "index.refresh_interval" : "5s"
  }
}
Elasticsearch index template _template/ilm-template that was actually created by Logstash:
{
  "ilm-template" : {
    "order" : 0,
    "index_patterns" : [
      "ilm-index-*"
    ],
    "settings" : {
      "index" : {
        "lifecycle" : {
          "name" : "base-policy",
          "rollover_alias" : "ilm-index"
        },
        "refresh_interval" : "5s",
        "number_of_shards" : "1",
        "number_of_replicas" : "1"
      }
    },
    "mappings" : { },
    "aliases" : { }
  }
}
Policy base-policy created using Kibana:
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "100mb",
            "max_docs": 100000
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "delete": {
        "min_age": "2d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
I expect a set of ilm-index-* indices, but only ilm-index-000001 is created and it keeps growing, despite the limits set in base-policy. In Kibana I only see one index (ilm-index-000001) associated with base-policy.
The provided configuration is correct. The problem is in the interpretation of the max_size and max_docs parameters when they have small values: Elasticsearch doesn't roll an index over the moment its pri.store.size and docs.count exceed max_size and max_docs, because the rollover conditions are only checked periodically.
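To check what ILM thinks about the index, the explain API shows the current phase, action, and step (a standard API, using the index name from the question):
GET ilm-index-000001/_ilm/explain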

Elasticsearch TTL not working

I use Elasticsearch for logs. I don't want to use a daily index and delete indices with a cron job; I want to use the TTL instead. I've activated and set the TTL with the value 30s. I get a successful response when I send this operation, and I can see the TTL value (in milliseconds) when I request the mapping.
All seems good, but documents are not being deleted...
_mapping :
{
  "logs" : {
    "webservers" : {
      "_ttl" : {
        "default" : 30000
      },
      "properties" : {
        "@timestamp" : {
          "type" : "date",
          "format" : "dateOptionalTime"
        }
      }
    }
  }
}
I guess you just need to enable _ttl for your type, which is disabled by default. Have a look here.
{
  "webservers" : {
    "_ttl" : { "enabled" : true, "default" : "30s" }
  }
}
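For example, assuming a pre-5.x Elasticsearch (where _ttl still exists) and the index named logs as in the question, the mapping could be updated with the put-mapping API (a sketch; note that the TTL purge also runs only periodically, so expired documents are not deleted instantly):
curl -XPUT 'http://localhost:9200/logs/_mapping/webservers' -d '
{
  "webservers" : {
    "_ttl" : { "enabled" : true, "default" : "30s" }
  }
}'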
