Logstash parsing a different line than the 1st line as header - elasticsearch
I have some sample data:
employee_name,user_id,O,C,E,A,N
Yvette Vivien Donovan,YVD0093,38,19,29,15,36
Troy Alvin Craig,TAC0118,34,40,24,15,34
Eden Jocelyn Mcclain,EJM0952,20,37,48,35,34
Alexa Emma Wood,AEW0655,25,20,18,40,38
Celeste Maris Griffith,CMG0936,36,13,18,50,29
Tanek Orson Griffin,TOG0025,40,36,24,19,26
Colton James Lowery,CJL0436,39,41,27,25,28
Baxter Flynn Mcknight,BFM0761,42,32,28,17,22
Olivia Calista Hodges,OCH0195,37,36,39,38,32
Price Zachery Maldonado,PZM0602,24,46,30,18,29
Daryl Delilah Atkinson,DDA0185,17,43,33,18,25
And a Logstash config file as follows:
input {
  file {
    path => "/path/psychometric_data.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    autodetect_column_names => true
    autogenerate_column_names => true
  }
}
output {
  amazon_es {
    hosts => [ "https://xxx-xxx-es-xxx.xx-xx-1.es.amazonaws.com:443" ]
    ssl => true
    region => "ap-south-1"
    index => "psychometric_data"
  }
}
I am expecting the 1st row (i.e. employee_name,user_id,O,C,E,A,N) to be used as the Elasticsearch field names (header), but I am getting the 3rd row (i.e. Troy Alvin Craig,TAC0118,34,40,24,15,34) as the header instead, as follows.
{
  "_index": "psychometric_data",
  "_type": "_doc",
  "_id": "md4hm3YB8",
  "_score": 1,
  "_source": {
    "15": "21",
    "24": "17",
    "34": "39",
    "40": "37",
    "@version": "1",
    "@timestamp": "2020-12-25T18:20:00.759Z",
    "message": "Ishmael Mannix Velazquez,IMV0086,22,37,17,21,39\r",
    "path": "/path/psychometric_data.csv",
    "Troy Alvin Craig": "Ishmael Mannix Velazquez",
    "host": "xx-ThinkPad-xx",
    "TAC0118": "IMV0086"
  }
}
What might be the reason for it?
If you set autodetect_column_names to true, the filter interprets the first line that it sees as the column names. If pipeline.workers is set to more than one, it is a race to see which worker thread sets the column names first. Since different workers process different lines, the line used may not be the first line of the file. You must set pipeline.workers to 1.
In addition to that, the Java execution engine (enabled by default) does not always preserve the order of events. There is a setting, pipeline.ordered, in logstash.yml that controls this. In 7.9 it keeps event order if and only if pipeline.workers is set to 1.
You do not say which version you are running. For anything from 7.0 (when java_execution became the default) to 7.6, the fix is to disable the Java engine using either pipeline.java_execution: false in logstash.yml or --java-execution false on the command line. For any 7.x release from 7.7 onwards, make sure pipeline.ordered is set to auto or true (auto is the default in 7.x). In future releases (8.x perhaps) pipeline.ordered will default to false.
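For reference, a minimal logstash.yml sketch for this situation (treat it as an assumption about your setup, not the only valid combination of settings):

# logstash.yml
# one worker so the csv filter is guaranteed to see the header line first
pipeline.workers: 1
# 7.7+ only: keep events in order (auto orders the pipeline when workers == 1)
pipeline.ordered: auto
# 7.0-7.6 alternative: fall back to the Ruby execution engine instead
# pipeline.java_execution: false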
Related
Create a Kibana graph from logstash logs
I need to create a graph in Kibana according to a specific value. Here is my raw log from Logstash:

2016-03-14T15:01:21.061Z Accueil-PC 14-03-2016 16:01:19.926 [pool-3-thread-1] INFO com.github.vspiewak.loggenerator.SearchRequest - id=300,ip=84.102.53.31,brand=Apple,name=iPhone 5S,model=iPhone 5S - Gris sideral - Disque 64Go,category=Mobile,color=Gris sideral,options=Disque 64Go,price=899.0

In this log line I have the id information "id=300". In order to create graphics in Kibana using the id value, I want a new field. So I have a specific grok configuration:

grok {
  match => ["message", "(?<mycustomnewfield>id=%{INT}+)"]
}

With this transformation I get the following JSON:

{
  "_index": "metrics-2016.03.14",
  "_type": "logs",
  "_id": "AVN1k-cJcXxORIbORG7w",
  "_score": null,
  "_source": {
    "message": "{\"message\":\"14-03-2016 15:42:18.739 [pool-1950-thread-1] INFO com.github.vspiewak.loggenerator.SellRequest - id=300,ip=54.226.24.77,email=client951@gmail.com,sex=F,brand=Apple,name=iPad R\\\\xE9tina,model=iPad R\\\\xE9tina - Noir,category=Tablette,color=Noir,price=509.0\\\\r\",\"@version\":\"1\",\"@timestamp\":\"2016-03-14T14:42:19.040Z\",\"path\":\"D:\\\\LogStash\\\\logstash-2.2.2\\\\logstash-2.2.2\\\\bin\\\\logs.logs.txt\",\"host\":\"Accueil-PC\",\"type\":\"metrics-type\",\"mycustomnewfield\":\"300\"}",
    "@version": "1",
    "@timestamp": "2016-03-14T14:42:19.803Z",
    "host": "127.0.0.1",
    "port": 57867
  },
  "fields": {
    "@timestamp": [
      1457966539803
    ]
  },
  "sort": [
    1457966539803
  ]
}

A new field was actually created (the field 'mycustomnewfield'), but inside the message field! As a result I can't see it in Kibana when I try to create a graph. I tried to create a "scripted field" in Kibana, but only numeric fields can be accessed. Should I create an index in Elasticsearch with a specific mapping to create the new field?
There was actually something wrong with my configuration; I should have pasted the whole configuration with my question. In fact I'm using Logstash both as a shipper and as a log server. On the server side, I modified the configuration:

input {
  tcp {
    port => "yyyy"
    host => "x.x.x.x"
    mode => "server"
    codec => json # I forgot this option
  }
}

Because the Logstash shipper is actually sending JSON, I need to tell the server about this. Now I no longer have a message field within a message field, and my new field is inserted at the right place.
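A sketch of the corrected server side with a quick way to verify the result (the stdout/rubydebug output is my addition for debugging, not part of the original setup); with the json codec in place, mycustomnewfield arrives from the shipper as a top-level field:

input {
  tcp {
    port => "yyyy"
    host => "x.x.x.x"
    mode => "server"
    codec => json   # decode the shipper's JSON instead of wrapping it in a message field
  }
}
output {
  # temporary check: rubydebug prints each event with all decoded fields
  stdout { codec => rubydebug }
}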
Multiple Logstash outputs depending on collectd
I'm facing a configuration failure which I can't solve on my own; I tried to find the solution in the documentation, but without luck.

I have a few different hosts which send their metrics via collectd to Logstash. Inside the Logstash configuration I'd like to separate each host and pipe it into its own ES index. When I try to configtest my settings, Logstash throws a failure. Maybe someone can help me. The separation should be triggered by the hostname collectd delivers:

[This is an old raw JSON output, so please don't mind the wrong set index]

{
  "_index": "wv-metrics",
  "_type": "logs",
  "_id": "AVHyJunyGanLcfwDBAon",
  "_score": null,
  "_source": {
    "host": "somefqdn.com",
    "@timestamp": "2015-12-30T09:10:15.211Z",
    "plugin": "disk",
    "plugin_instance": "dm-5",
    "collectd_type": "disk_merged",
    "read": 0,
    "write": 0,
    "@version": "1"
  },
  "fields": {
    "@timestamp": [
      1451466615211
    ]
  },
  "sort": [
    1451466615211
  ]
}

Please see my config.

Input config (working so far):

input {
  udp {
    port => 25826
    buffer_size => 1452
    codec => collectd { }
  }
}

Output config file:

filter {
  if [host] == "somefqdn.com" {
    output {
      elasticsearch {
        hosts => "someip:someport"
        user => logstash
        password => averystrongpassword
        index => "somefqdn.com"
      }
    }
  }
}

Error which is thrown:

root@test-collectd1:/home/username# service logstash configtest
Error: Expected one of #, => at line 21, column 17 (byte 314) after filter { if [host] == "somefqdn.com" { output { elasticsearch

I understand that there's a character possibly missing in my config, but I can't locate it. Thanks in advance!
I spot two errors in a quick scan:

First, your output stanza should not be wrapped in a filter {} block.

Second, your output stanza should start with output {}, with the conditional inside:

output {
  if [host] == "somefqdn.com" {
    elasticsearch {
      ...
    }
  }
}
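A fuller sketch of the corrected output file, filling the elided elasticsearch options back in from the question (the else branch and the %{host} index are my additions to show how other hosts could be routed, not part of the original answer):

output {
  if [host] == "somefqdn.com" {
    elasticsearch {
      hosts => "someip:someport"
      user => "logstash"
      password => "averystrongpassword"
      index => "somefqdn.com"
    }
  } else {
    # assumption: any other host goes to an index named after its FQDN
    elasticsearch {
      hosts => "someip:someport"
      index => "%{host}"
    }
  }
}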
_grokparsefailure without Filters
I have some simple Logstash configuration:

input {
  syslog {
    port => 5140
    type => "fortigate"
  }
}
output {
  elasticsearch {
    cluster => "logging"
    node_name => "logstash-logging-03"
    bind_host => "10.100.19.77"
  }
}

That's it. The problem is that the documents that end up in Elasticsearch do contain a _grokparsefailure tag:

{
  "_index": "logstash-2014.12.19",
  ...
  "_source": {
    "message": "...",
    ...
    "tags": [
      "_grokparsefailure"
    ],
    ...
  },
  ...
}

How come? There are no (grok) filters...
OK: The syslog input obviously makes use of grok internally. Therefore, if some log format other than syslog hits the input, a _grokparsefailure will occur. Instead, I just used the tcp and udp inputs to achieve the required result (I was not aware of them before). Cheers
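A sketch of that replacement, keeping the original port and type (assumption: the devices can send to plain TCP/UDP listeners, and any parsing is added later as explicit filters):

input {
  udp {
    port => 5140
    type => "fortigate"
  }
  tcp {
    port => 5140
    type => "fortigate"
  }
}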
Remove Duplicate Fields Used for document_id Before Elasticsearch in Logstash
I wrote my own filter for Logstash and I'm trying to calculate my own document_id, something like this:

docIdClean = "%d %s %s %s" % [ event["@timestamp"].to_f * 1000, event["type"], event["message"] ]
event["docId"] = Digest::MD5.hexdigest(docIdClean)

And the Logstash configuration looks like this:

output {
  elasticsearch {
    ...
    index => "analysis-%{+YYYY.MM.dd}"
    document_id => "%{docId}"
    template_name => "logstash_per_index"
  }
}

The more or less minor downside is that all documents in Elasticsearch contain _id and docId holding the same value. Since docId is completely pointless (nobody searches for an MD5 hash), I want to remove it, but I don't know how. The docId has to exist when the event hits the output, otherwise the output can't refer to it, so I can't remove it beforehand. Since I can't remove it afterwards, the docId sits there occupying space.

I tried to set the event field _id, but that only causes an exception in Elasticsearch because the id of the document is different.

Maybe for explanation, here is one document:

{
  "_index": "analysis-2014.09.16",
  "_type": "access",
  "_id": "022d9055423cdd0756b6cfa06886f866",
  "_score": 1,
  "_source": {
    "@timestamp": "2014-09-16T19:36:31.000+02:00",
    "type": "access",
    "tags": [
      "personalized"
    ],
    "importDate": "2014/09/17",
    "docId": "022d9055423cdd0756b6cfa06886f866"
  }
}

EDIT: This is about Logstash 1.3
There's nothing you can do about this in Logstash 1.4. In Logstash 1.5, you can use @metadata fields, which are not passed to Elasticsearch.
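A sketch of the 1.5+ approach, moving the hash under @metadata so it reaches the output but is never indexed (the ruby filter body mirrors the hand-rolled hashing from the question, and the field name docId is just illustrative):

filter {
  ruby {
    code => '
      require "digest/md5"
      # build the seed from timestamp, type and message, then stash the hash
      # under @metadata so document_id can use it without it being stored
      seed = "%s %s %s" % [ (event["@timestamp"].to_f * 1000).to_i, event["type"], event["message"] ]
      event["[@metadata][docId]"] = Digest::MD5.hexdigest(seed)
    '
  }
}
output {
  elasticsearch {
    index => "analysis-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][docId]}"
    template_name => "logstash_per_index"
  }
}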
How to stop logstash from creating a default mapping in ElasticSearch
I am using Logstash to feed logs into Elasticsearch. I am configuring the Logstash output as:

input {
  file {
    path => "/tmp/foo.log"
    codec => plain {
      format => "%{message}"
    }
  }
}
output {
  elasticsearch {
    #host => localhost
    codec => json {}
    manage_template => false
    index => "4glogs"
  }
}

I notice that as soon as I start Logstash it creates a mapping (logs) in ES as below:

{
  "4glogs": {
    "mappings": {
      "logs": {
        "properties": {
          "@timestamp": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "@version": {
            "type": "string"
          },
          "message": {
            "type": "string"
          }
        }
      }
    }
  }
}

How can I prevent Logstash from creating this mapping?

UPDATE: I have now resolved this error too: "object mapping for [logs] tried to parse as object, but got EOF, has a concrete value been provided to it?" As John Petrone has stated below, once you define a mapping, you have to ensure that your documents conform to the mapping. In my case, I had defined a mapping of "type: nested" but the output from Logstash was a string. So I removed all codecs (whether json or plain) from my Logstash config, and that allowed the JSON document to pass through without changes. Here is my new Logstash config (with some additional filters for multiline logs):

input {
  kafka {
    zk_connect => "localhost:2181"
    group_id => "logstash_group"
    topic_id => "platform-logger"
    reset_beginning => false
    consumer_threads => 1
    queue_size => 2000
    consumer_id => "logstash-1"
    fetch_message_max_bytes => 1048576
  }
  file {
    path => "/tmp/foo.log"
  }
}
filter {
  multiline {
    pattern => "^\s"
    what => "previous"
  }
  multiline {
    pattern => "[0-9]+$"
    what => "previous"
  }
  multiline {
    pattern => "^$"
    what => "previous"
  }
  mutate {
    remove_field => ["kafka"]
    remove_field => ["@version"]
    remove_field => ["@timestamp"]
    remove_tag => ["multiline"]
  }
}
output {
  elasticsearch {
    manage_template => false
    index => "4glogs"
  }
}
You will need a mapping to store data in Elasticsearch and to search on it - that's how ES knows how to index and search those content types. You can either let Logstash create it dynamically or you can prevent it from doing so and instead create it manually.

Keep in mind you cannot change existing mappings (although you can add to them). So first off you will need to delete the existing index. You would then modify your settings to prevent dynamic mapping creation. At the same time you will want to create your own mapping.

For example, this will create the mappings for the Logstash data but also restrict any dynamic mapping creation via "strict":

$ curl -XPUT 'http://localhost:9200/4glogs/logs/_mapping' -d '
{
  "logs" : {
    "dynamic": "strict",
    "properties" : {
      "@timestamp": {
        "type": "date",
        "format": "dateOptionalTime"
      },
      "@version": {
        "type": "string"
      },
      "message": {
        "type": "string"
      }
    }
  }
}
'

Keep in mind that the index name "4glogs" and the type "logs" need to match what is coming from Logstash.

For my production systems I generally prefer to turn off dynamic mapping as it avoids accidental mapping creation.

The following links should be useful if you want to make adjustments to your dynamic mappings:

https://www.elastic.co/guide/en/elasticsearch/guide/current/dynamic-mapping.html
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/dynamic-mapping.html
logs in this case is the index_type. If you don't want to create it as logs, specify some other index_type on your elasticsearch element. Every record in Elasticsearch is required to have an index and a type. Logstash defaults to logs if you haven't specified it.

There's always an implicit mapping created when you insert records into Elasticsearch, so you can't prevent it from being created. You can create the mapping yourself before you insert anything (via, say, a template mapping).

The setting manage_template of false just prevents it from creating the template mapping for the index you've specified. You can delete the existing template, if it's already been created, by using something like:

curl -XDELETE http://localhost:9200/_template/logstash?pretty
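If you only want the records to land under a type other than logs, a sketch of the output (assumption: index_type is the option name in the Logstash 1.x elasticsearch output of this era; later versions renamed it to document_type):

output {
  elasticsearch {
    manage_template => false
    index => "4glogs"
    index_type => "applogs"   # hypothetical type name instead of the default "logs"
  }
}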
Index templates can help you. Please see this jira for more details. You can create index templates with wildcard support to match an index name and put your default mappings.
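As a sketch, a legacy index template that matches the index by wildcard and supplies default mappings might look like this (the template name, pattern, and strict setting are my choices; the API shape is the pre-5.x template API matching the era of this question):

curl -XPUT 'http://localhost:9200/_template/4glogs_template' -d '
{
  "template": "4glogs*",
  "mappings": {
    "logs": {
      "dynamic": "strict",
      "properties": {
        "@timestamp": { "type": "date", "format": "dateOptionalTime" },
        "@version": { "type": "string" },
        "message": { "type": "string" }
      }
    }
  }
}
'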