creating data stream through logstash - elasticsearch

I have installed elasticsearch cluster v 7.14.
I have created ILM policy and Index template. However data stream parameters mentioned under logstash pipeline file are giving error.
ILM policy -
{
"testpolicy" : {
"version" : 1,
"modified_date" : "2021-08-28T02:58:25.942Z",
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_primary_shard_size" : "900mb",
"max_age" : "2d"
},
"set_priority" : {
"priority" : 100
}
}
},
"delete" : {
"min_age" : "2d",
"actions" : {
"delete" : {
"delete_searchable_snapshot" : true
}
}
}
}
},
"in_use_by" : {
"indices" : [ ],
"data_streams" : [ ],
"composable_templates" : [ ]
}
}
}
Index temaplate -
{
"index_templates" : [
{
"name" : "access_template",
"index_template" : {
"index_patterns" : [
"test-data-stream*"
],
"template" : {
"settings" : {
"index" : {
"number_of_shards" : "1",
"number_of_replicas" : "0"
}
},
"mappings" : {
"_routing" : {
"required" : false
},
"dynamic_date_formats" : [
"strict_date_optional_time",
"yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
],
"numeric_detection" : true,
"_source" : {
"excludes" : [ ],
"includes" : [ ],
"enabled" : true
},
"dynamic" : true,
"dynamic_templates" : [ ],
"date_detection" : true
}
},
"composed_of" : [ ],
"priority" : 500,
"version" : 1,
"data_stream" : {
"hidden" : false
}
}
}
]
}
logstash pipeline config file -
input {
beats {
port => 5044
}
}
filter {
if [log_type] == "access_server" and [app_id] == "pa"
{
grok {
match => {
"message" => "%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:%{MINUTE}(?::?%{SECOND})\| %{USERNAME:exchangeId}\| %{DATA:trackingId}\| %{NUMBER:RoundTrip:int}%{SPACE}ms\| %{NUMBER:ProxyRoundTrip:int}%{SPACE}ms\| %{NUMBER:UserInfoRoundTrip:int}%{SPACE}ms\| %{DATA:Resource}\| %{DATA:subject}\| %{DATA:authmech}\| %{DATA:scopes}\| %{IPV4:Client}\| %{WORD:method}\| %{DATA:Request_URI}\| %{INT:response_code}\| %{DATA:failedRuleType}\| %{DATA:failedRuleName}\| %{DATA:APP_Name}\| %{DATA:Resource_Name}\| %{DATA:Path_Prefix}"
}
}
mutate {
replace => {
"[type]" => "access_server"
}
}
}
}
output {
if [log_type] == "access_server" {
elasticsearch {
hosts => ['http://10.10.10.76:9200']
user => elastic
password => xxx
data_stream => "true"
data_stream_type => "logs"
data_stream_dataset => "access"
data_stream_namespace => "default"
ilm_rollover_alias => "access"
ilm_pattern => "000001"
ilm_policy => "testpolicy"
template => "/tmp/access_template"
template_name => "access_template"
}
}
elasticsearch {
hosts => ['http://10.10.10.76:9200']
index => "%{[#metadata][beat]}-%{[#metadata][version]}-%{+YYYY.MM.dd}"
user => elastic
password => xxx
}
}
After all deployment done, can only see system indices but data stream is not created.
[2021-08-28T12:42:50,103][ERROR][logstash.outputs.elasticsearch][main] Invalid data stream configuration, following parameters are not supported: {"template"=>"/tmp/pingaccess_template", "ilm_pattern"=>"000001", "template_name"=>"pingaccess_template", "ilm_rollover_alias"=>"pingaccess", "ilm_policy"=>"testpolicy"}
[2021-08-28T12:42:50,547][ERROR][logstash.javapipeline ][main] Pipeline error {:pipeline_id=>"main", :exception=>#<LogStash::ConfigurationError: Invalid data stream configuration: ["template", "ilm_pattern", "template_name", "ilm_rollover_alias", "ilm_policy"]>, :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-elasticsearch-11.0.2-java/lib/logstash/outputs/elasticsearch/data_stream_support.rb:57:in `check_data_stream_config!'"
[2021-08-28T12:42:50,702][ERROR][logstash.agent ] Failed to execute action {:id=>:main, :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: PipelineAction::Create<main>, action_result: false", :backtrace=>nil}
error is saying parameters like template"=>"/tmp/pingaccess_template", "ilm_pattern"=>"000001", "template_name"=>"pingaccess_template", "ilm_rollover_alias"=>"pingaccess", "ilm_policy"=>"testpolicy" are not valid but in below link they are mentioned.
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-data-streams

The solution is to use logstash without be "aware" of data_stream.
FIRST of all (before running logstash) create your ILM and index_template BUT adding the "index.lifecycle.name" in the settings. That way, you are linking the template and ILM. Also, don't forget the data_stream in the index template.
{
"index_templates" : [
{
"name" : "access_template",
"index_template" : {
"index_patterns" : [
"test-data-stream*"
],
"template" : {
"settings" : {
"index" : {
"number_of_shards" : "1",
"number_of_replicas" : "0",
"index.lifecycle.name": "testpolicy"
}
},
"mappings" : {
...
}
},
"composed_of" : [ ],
"priority" : 500,
"version" : 1,
"data_stream" : {
"hidden" : false
}
}
}
]
}
Keep Logstash output like if data_stream doesn't exist but add action => create. This is because you can't use "index" API with data streams. Need the _create API call.
output { elasticsearch {
hosts => ['http://10.10.10.76:9200']
index => "test-data-stream"
user => elastic
password => xxx
action => "create"
}
That way, logstash will output to ES, but the index template will be applied automatically (because of pattern match) and also the ILM and data_stream will be applied.
Important: To make it work, you need to start from scratch. If the index "test-data-stream" already exists in ES (as a traditional index), then data_stream will NOT be created. Make the test with another index name to make sure it works.

The documentation is unclear, but the plugin does not support those options when datastream output is enabled. The plugin is logging the options returned by the invalid_data_stream_params function, which allows action, routing, data_stream, anything else that starts with data_stream_, the shared options defined by the mixin, and the common options defined by the output plugin base.

Related

Elastic Pipelines: Skip Import On Failure

Individual processors in an Elastic pipelines have an on_failure attribute. This allows you to handle a failure/error in a pipeline. The example in the docs show setting an additional field on your document.
{
"description" : "my first pipeline with handled exceptions",
"processors" : [
{
"rename" : {
"field" : "foo",
"to" : "bar",
"on_failure" : [
{
"set" : {
"field" : "error.message",
"value" : "{{ _ingest.on_failure_message }}"
}
}
]
}
}
]
}
Is it possible to tell the pipeline to SKIP importing a document if any processors in the pipeline fail?
You can "hijack" the drop processor to skip either directly in the on_failure step (no need for _ingest.on_failure_message if you're aborting anyways):
{
"description": "my first pipeline with handled exceptions",
"processors": [
{
"rename" : {
"field" : "foo",
"target_field": "bar",
"on_failure" : [
{
"drop" : {
"if" : "true"
}
}
]
}
}
]
}
or use it as a separate processor, perhaps at the very end, after ctx.error has been set by any of the processors' on_failure handlers:
{
"description": "my first pipeline with handled exceptions",
"processors": [
{
"rename" : {
"field" : "foo",
"to" : "bar",
"on_failure" : [
{
"set" : {
"field" : "error.message",
"value" : "{{ _ingest.on_failure_message }}"
}
}
]
}
},
{
"drop": {
"if": "ctx.error.size() != null"
}
}
]
}
Both of these will result in a noop when the pipeline is applied.

ILM using Logstash Elasticsearch output plugin doesn't work

I'm trying to implement ILM for an index to properly use hardware, using the Elasticsearch output plugin. Looks like I misunderstand how Logstash manages ILM.
I have ELK stack version 7.1.0 in docker. X-Pack is activated by trial license.
The index template is managed by Logstash Elasticsearch output plugin and the index lifecycle policy was created using Kibana.
Here is the output section of Logstash pipeline:
output {
elasticsearch {
hosts => ["http://eshost:9200"]
user => "logstash_writer"
password => "pass"
template => "/usr/share/logstash/es_templates/ilm-template.json"
template_name => "ilm-template"
template_overwrite => true
ilm_enabled => true
ilm_rollover_alias => "ilm-index"
ilm_pattern => "000001"
ilm_policy => "base-policy"
}
}
User logstash_writer has default role logstash_writer with permissions for ILM management.
Elasticsearch index template ilm-template.json:
{
"settings" : {
"index.number_of_replicas" : "1",
"index.number_of_shards" : "1",
"index.refresh_interval" : "5s"
}
}
Elasticsearch index template _template/ilm-template that was actually created by Logstash:
{
"ilm-template" : {
"order" : 0,
"index_patterns" : [
"ilm-index-*"
],
"settings" : {
"index" : {
"lifecycle" : {
"name" : "base-policy",
"rollover_alias" : "ilm-index"
},
"refresh_interval" : "5s",
"number_of_shards" : "1",
"number_of_replicas" : "1"
}
},
"mappings" : { },
"aliases" : { }
}
}
Policy base-policy created using Kibana:
{
"policy": {
"phases": {
"hot": {
"min_age": "0ms",
"actions": {
"rollover": {
"max_size": "100mb",
"max_docs": 100000
},
"set_priority": {
"priority": 100
}
}
},
"delete": {
"min_age": "2d",
"actions": {
"delete": {}
}
}
}
}
}
I expect the set of indices ilm-index-*, but only ilm-index-000001 is created and constantly growing, despite the limitations of base-policy. So I only see in Kibana one index (ilm-index-000001) associated with base-policy.
The provided configuration is correct. The problem is in interpretation of max_size and max_docs parameters when they have small value. Elasticsearch doesn't rollover indices when it's pri.store.size and docs.count become larger than set in max_size and max_docs.

Logstash not storing data in correct format on ES

My setup is following:
Application is writing logs in a file : file.json.
Each line in the file contains a single json record which I want to store in ES index.
I have configured logstash to read from this file and store the data on ES.
But logstash is not storing the complete json record as a source in ES. Instead its putting the complete json in a single field.
More details:
Logstash Configuration:
input {
file {
type => "json"
path => "file.json"
start_position => beginning
}
}
output {
elasticsearch {
hosts => [ "10.1.1.1:9200" ]
index => "index-name"
}
}
file.json:
{"name":"John", "age":31, "city":"Abc"}
Stored record in ES:
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "index_name",
"_type" : "doc",
"_id" : "eHD8KGUBy7HkgIAhGQln",
"_score" : 1.0,
"_source" : {
"type" : "json",
"host" : "qa-box",
"#version" : "1",
"message" : "{name:John,age:31,city:Abc}",
"#timestamp" : "2018-08-11T12:35:34.184Z",
"path" : "file.json"
}
}
]
}
I think you need a filter to treat it like JSON:
input {
file {
type => "json"
path => "file.json"
start_position => beginning
}
}
filter {
json {
source => "message"
}
}
output {
elasticsearch {
hosts => [ "10.1.1.1:9200" ]
index => "index-name"
}
}
Logstash treats the input as plain text by default and stores it in message field.
You should use json filter as #dwjv said.
Also if you want to remove original messagefield (and you may want to remove extra fields generated by logstash like path and host) you can use mutate filter too as follows:
filter {
json {
source => "message"
}
mutate {
remove_field => [ "message", "path", "host" ]
}
}

CSV geodata into elasticsearch as a geo_point type using logstash

Below is a reproducible example of the problem I am having using to most recent versions of logstash and elasticsearch.
I am using logstash to input geospatial data from a csv into elasticsearch as geo_points.
The CSV looks like the following:
$ head simple_base_map.csv
"lon","lat"
-1.7841,50.7408
-1.7841,50.7408
-1.78411,50.7408
-1.78412,50.7408
-1.78413,50.7408
-1.78414,50.7408
-1.78415,50.7408
-1.78416,50.7408
-1.78416,50.7408
I have create a mapping template that looks like the following:
$ cat simple_base_map_template.json
{
"template": "base_map_template",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"node_points" : {
"properties" : {
"location" : { "type" : "geo_point" }
}
}
}
}
and have a logstash config file that looks like the following:
$ cat simple_base_map.conf
input {
stdin {}
}
filter {
csv {
columns => [
"lon", "lat"
]
}
if [lon] == "lon" {
drop { }
} else {
mutate {
remove_field => [ "message", "host", "#timestamp", "#version" ]
}
mutate {
convert => { "lon" => "float" }
convert => { "lat" => "float" }
}
mutate {
rename => {
"lon" => "[location][lon]"
"lat" => "[location][lat]"
}
}
}
}
output {
stdout { codec => dots }
elasticsearch {
index => "base_map_simple"
template => "simple_base_map_template.json"
document_type => "node_points"
}
}
I then run the following:
$cat simple_base_map.csv | logstash-2.1.3/bin/logstash -f simple_base_map.conf
Settings: Default filter workers: 16
Logstash startup completed
....................................................................................................Logstash shutdown completed
However when looking at the index base_map_simple, it suggests the documents would not have a location: geo_point type in it...and rather it would be two doubles of lat and lon.
$ curl -XGET 'localhost:9200/base_map_simple?pretty'
{
"base_map_simple" : {
"aliases" : { },
"mappings" : {
"node_points" : {
"properties" : {
"location" : {
"properties" : {
"lat" : {
"type" : "double"
},
"lon" : {
"type" : "double"
}
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1457355015883",
"uuid" : "luWGyfB3ToKTObSrbBbcbw",
"number_of_replicas" : "1",
"number_of_shards" : "5",
"version" : {
"created" : "2020099"
}
}
},
"warmers" : { }
}
}
How would i need to change any of the above files to ensure that it goes into elastic search as a geo_point type?
Finally, I would like to be able to carry out a nearest neighbour search on the geo_points by using a command such as the following:
curl -XGET 'localhost:9200/base_map_simple/_search?pretty' -d'
{
"size": 1,
"sort": {
"_geo_distance" : {
"location" : {
"lat" : 50,
"lon" : -1
},
"order" : "asc",
"unit": "m"
}
}
}'
Thanks
The problem is that in your elasticsearch output you named the index base_map_simple while in your template the template property is base_map_template, hence the template is not being applied when creating the new index. The template property needs to somehow match the name of the index being created in order for the template to kick in.
It will work if you simply change the latter to base_map_*, i.e. as in:
{
"template": "base_map_*", <--- change this
"order": 1,
"settings": {
"index.number_of_shards": 1
},
"mappings": {
"node_points": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
UPDATE
Make sure to delete the current index as well as the template first., i.e.
curl -XDELETE localhost:9200/base_map_simple
curl -XDELETE localhost:9200/_template/logstash

Bad indexing performance of elasticsearch

Currently, I'm using elastic search to store and query some logs. We set up a five node elastic search cluster. Among them two indexing nodes and three query nodes. In the indexing node, we have redis, logstash and elasticsearch on both two servers. The elasticsearch uses NFS storage as data store. Our requirement is to index 300 log entries/second. But the best performance I can get from elasticsearch is only 25 log entries/second!
The XMX of the elasticsearch is 16G.
Version of each component:
Redis: 2.8.12
logstash: 1.4.2
elasticsearch: 1.5.0
Our current index settings are like this:
{
"userlog" : {
"settings" : {
"index" : {
"index" : {
"store" : {
"type" : "mmapfs"
},
"translog" : {
"flush_threshold_ops" : "50000"
}
},
"number_of_replicas" : "1",
"translog" : {
"flush_threshold_size" : "1G",
"durability" : "async"
},
"merge" : {
"scheduler" : {
"max_thread_count" : "1"
}
},
"indexing" : {
"slowlog" : {
"threshold" : {
"index" : {
"trace" : "2s",
"info" : "5s"
}
}
}
},
"memory" : {
"index_buffer_size" : "3G"
},
"refresh_interval" : "30s",
"version" : {
"created" : "1050099"
},
"creation_date" : "1447730702943",
"search" : {
"slowlog" : {
"threshold" : {
"fetch" : {
"debug" : "500ms"
},
"query" : {
"warn" : "10s",
"trace" : "1s"
}
}
}
},
"indices" : {
"memory" : {
"index_buffer_size" : "30%"
}
},
"uuid" : "E1ttme3fSxKVD5kRHEr_MA",
"index_currency" : "32",
"number_of_shards" : "5"
}
}
}
}
Here's my logstash config:
input {
redis {
host => "eanprduserreporedis01.eao.abn-iad.ea.com"
port => "6379"
type => "redis-input"
data_type => "list"
key => "userLog"
threads => 15
}
# Second reids block begin
redis {
host => "eanprduserreporedis02.eao.abn-iad.ea.com"
port => "6379"
type => "redis-input"
data_type => "list"
key => "userLog"
threads => 15
}
# Second reids block end
}
output {
elasticsearch {
cluster => "customizedlog_prod"
index => "userlog"
workers => 30
}
stdout{}
}
A very strange thing is although currently the indexing speed is only ~20/s, the IO wait is very high almost 70%. And mostly are read traffic. Through nfsiostat, current read speed is about 200Mbps! So basically, to index every log entry, it will read about 10Mbits of data which is insane because the average length of our log entry is less than 10K.
So, I took a jstack dump of the elastic search, here's the result of one RUNNING thread:
"elasticsearch[somestupidhostname][bulk][T#3]" daemon prio=10 tid=0x00007f230c109800 nid=0x79f6 runnable [0x00007f1ba85f0000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:730)
at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:715)
at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
at org.apache.lucene.store.DataInput.readVInt(DataInput.java:122)
at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock(SegmentTermsEnumFrame.java:152)
at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:506)
at org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.lookup(PerThreadIDAndVersionLookup.java:104)
at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:150)
at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:161)
at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1002)
at org.elasticsearch.index.engine.InternalEngine.innerCreate(InternalEngine.java:277)
- locked <0x00000005fc76b938> (a java.lang.Object)
at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:256)
at org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:455)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:437)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:149)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:515)
at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:422)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Can anyone tell me what is elastic search doing and why the indexing is so slow? And is it possible to improve it?
It may not be entirely responsible for your poor performance, but check out the batch_size option for redis. I'll bet it'll get better if you're pulling more than 1 document from redis at a time.

Resources