Logstash date parsing error (date field doesn't have any time) - elasticsearch

My data has dates in the format yyyy-MM-dd, e.g. "2015-10-12".
My Logstash date filter is as below:
input {
  file {
    path => "/etc/logstash/immport.csv"
    codec => multiline {
      pattern => "^S*"
      negate => true
      what => "previous"
    }
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    autodetect_column_names => true
    skip_empty_columns => true
  }
  date {
    match => ["start_date", "yyyy-MM-dd"]
    target => "start_date"
  }
  mutate {
    rename => {"start_date" => "[study][startDate]"}
  }
}
output {
  elasticsearch {
    action => "index"
    hosts => ["elasticsearch-5-6:9200"]
    index => "immport12"
    document_type => "dataset"
    template => "/etc/logstash/immport-mapping.json"
    template_name => "mapping_template"
    template_overwrite => true
  }
  stdout { codec => rubydebug }
}
However, my ES instance is not able to parse it and I'm getting the following error:
"error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [study.startDate]", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Invalid format: \"2012-04-17T00:00:00.000Z\" is malformed at \"T00:00:00.000Z\""}}}}}
Sample Data Row
][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"immport_2017_12_02", :_type=>"dataset", :_routing=>nil}, 2017-12-20T08:55:45.367Z 878192e51991 SDY816,HEPSV_COHORT: Participants that received Heplisav,,,2012-04-17,,10.0,Systems Biology Analysis of the response to Licensed Hepatitis B Vaccine (HEPLISAV) in specific cell subsets (see companion studies SDY299 and SDY690),Interventional,http://www.immport.org/immport-open/public/study/study/displayStudyDetail/SDY816,,Interventional,Vaccine Response,Homo sapiens,Cell,DNA microarray], :response=>{"index"=>{"_index"=>"immport_2017_12_02", "_type"=>"dataset", "_id"=>"AWBzIsBPov62ZQtaldxQ", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"failed to parse [study.startDate]", "caused_by"=>{"type"=>"illegal_argument_exception", "reason"=>"Invalid format: \"2012-04-17T00:00:00.000Z\" is malformed at \"T00:00:00.000Z\""}}}}}
I want Logstash to output the date in the format yyyy-MM-dd, without a timestamp.
Mapping template:
"startDate": {
  "type": "date",
  "format": "yyyy-MM-dd"
},
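For reference, a minimal sketch of how this fragment might sit inside /etc/logstash/immport-mapping.json for Elasticsearch 5.6; the index pattern and the nesting under study are assumptions based on the rename to [study][startDate], not taken from the original template:

{
  "template": "immport*",
  "mappings": {
    "dataset": {
      "properties": {
        "study": {
          "properties": {
            "startDate": {
              "type": "date",
              "format": "yyyy-MM-dd"
            }
          }
        }
      }
    }
  }
}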

I tried this on my machine with reference to your Logstash conf file and it worked fine.
My Logstash conf file:
input {
  file {
    path => "D:\testdata\stack.csv"
    codec => multiline {
      pattern => "^S*"
      negate => true
      what => "previous"
    }
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    autodetect_column_names => true
    skip_empty_columns => true
  }
  date {
    match => ["dob", "yyyy-MM-dd"]
    target => "dob"
  }
  mutate {
    rename => {"dob" => "[study][dob]"}
  }
}
output {
  elasticsearch {
    action => "index"
    hosts => ["localhost:9200"]
    index => "stack"
  }
  stdout { codec => rubydebug }
}
CSV file:
id,name,rollno,dob,age,gender,comments
1,hatim,88,1992-07-30,25,male,qsdsdadasd asdas das dasd asd asd asd as dd sa d
2,hatim,89,1992-07-30,25,male,qsdsdadasd asdas das dasd asd asd asd as dd sa d
Elasticsearch document after indexing:
{
  "_index": "stack",
  "_type": "doc",
  "_id": "wuBTeGABQ7gwBQSQTX1q",
  "_score": 1,
  "_source": {
    "path": """D:\testdata\stack.csv""",
    "study": {
      "dob": "1992-07-29T18:30:00.000Z"
    },
    "@timestamp": "2017-12-21T09:06:52.465Z",
    "comments": "qsdsdadasd asdas das dasd asd asd asd as dd sa d",
    "gender": "male",
    "@version": "1",
    "host": "INMUCHPC03284",
    "name": "hatim",
    "rollno": "88",
    "id": "1",
    "message": "1,hatim,88,1992-07-30,25,male,qsdsdadasd asdas das dasd asd asd asd as dd sa d\r",
    "age": "25"
  }
}
And everything worked perfectly. See if this example helps you.

The issue was that I had changed the Logstash mapping template to a new name but didn't delete the old template, so the index was still using the old template.
Once I deleted the old template
curl -XDELETE 'http://localhost:9200/_template/test_template'
it worked. So whenever we use a new template, it's required to delete the old template first and then process the records.
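If you are not sure which templates are still registered, you can list them before deleting (standard Elasticsearch template API; the specific template name below is the one from the Logstash config above and is only illustrative):

# list every registered index template
curl -XGET 'http://localhost:9200/_template?pretty'
# inspect the one installed by Logstash (template_name => "mapping_template")
curl -XGET 'http://localhost:9200/_template/mapping_template?pretty'

Once the stale template is identified, delete it as shown above and re-run Logstash, so that template_overwrite => true installs the new mapping.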

Related

How can I create JSON format log when entering into Elasticsearch by logstash

I have been told that by using a Logstash pipeline I can re-create the log format (i.e. JSON) when entering into Elasticsearch, but I don't understand how to do it.
Current Logstash configuration (I took the below from Google, not for any particular reason):
/etc/logstash/conf.d/metrics-pipeline.conf
input {
  beats {
    port => 5044
    client_inactivity_timeout => "3600"
  }
}
filter {
  if [message] =~ />/ {
    dissect {
      mapping => {
        "message" => "%{start_of_message}>%{content}"
      }
    }
    kv {
      source => "content"
      value_split => ":"
      field_split => ","
      trim_key => "\[\]"
      trim_value => "\[\]"
      target => "message"
    }
    mutate {
      remove_field => ["content","start_of_message"]
    }
  }
}
filter {
  if [system][process] {
    if [system][process][cmdline] {
      grok {
        match => {
          "[system][process][cmdline]" => "^%{PATH:[system][process][cmdline_path]}"
        }
        remove_field => "[system][process][cmdline]"
      }
    }
  }
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
output {
  elasticsearch {
    hosts => "1.2.1.1:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
  }
}
I have a couple of log files located at
/root/logs/File.log
/root/logs/File2.log
The log format there is:
08:26:51,753 DEBUG [ABC] (default-threads - 78) (1.2.3.4)(368)>[TIMESTAMP:Wed Sep 11 08:26:51 UTC 2019],[IMEI:03537],[COMMAND:INFO],[GPS STATUS:true],[INFO:true],[SIGNAL:false],[ENGINE:0],[DOOR:0],[LON:90.43],[LAT:23],[SPEED:0.0],[HEADING:192.0],[BATTERY:100.0%],[CHARGING:1],[O&E:CONNECTED],[GSM_SIGNAL:100],[GPS_SATS:5],[GPS POS:true],[FUEL:0.0V/0.0%],[ALARM:NONE][SERIAL:01EE]
In Kibana, by default, it shows like this:
https://imgshare.io/image/stackflow.I0u7S
https://imgshare.io/image/jsonlog.IHQhp
"message": "21:33:42,004 DEBUG [LOG] (default-threads - 100) (1.2.3.4)(410)>[TIMESTAMP:Sat Sep 07 21:33:42 UTC 2019],[TEST:123456],[CMD:INFO],[STATUS:true],[INFO:true],[SIGNAL:false],[ABC:0],[DEF:0],[GHK:1111],[SERIAL:0006]"
but I want to get it like below:
"message": {
"TIMESTAMP": "Sat Sep 07 21:33:42 UTC 2019",
"TEST": "123456",
"CMD":INFO,
"STATUS":true,
"INFO":true,
"SIGNAL":false,
"ABC":0,
"DEF":0,
"GHK":0,
"GHK":1111
}
Can this be done? If yes, how?
Thanks
With the if [message] =~ />/, the filters will only apply to messages containing a >. The dissect filter will split the message at the >. The kv filter will apply a key-value transformation on the second part of the message, removing the []. The mutate.remove_field removes any extra fields.
filter {
  if [message] =~ />/ {
    dissect {
      mapping => {
        "message" => "%{start_of_message}>%{content}"
      }
    }
    kv {
      source => "content"
      value_split => ":"
      field_split => ","
      trim_key => "\[\]"
      trim_value => "\[\]"
      target => "message"
    }
    mutate {
      remove_field => ["content","start_of_message"]
    }
  }
}
Result, using the provided log line:
{
  "@version": "1",
  "host": "YOUR_MACHINE_NAME",
  "message": {
    "DEF": "0",
    "TIMESTAMP": "Sat Sep 07 21:33:42 UTC 2019",
    "TEST": "123456",
    "CMD": "INFO",
    "SERIAL": "0006]\r",
    "GHK": "1111",
    "INFO": "true",
    "STATUS": "true",
    "ABC": "0",
    "SIGNAL": "false"
  },
  "@timestamp": "2019-09-10T09:21:16.422Z"
}
In addition to doing the filtering with if [message] =~ />/, you can also do the comparison on the path field, which is set by the file input plugin. Also, if you have multiple file inputs, you can set the type field and use that one instead; see https://stackoverflow.com/a/20562031/6113627.
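For example, a hedged sketch of both approaches; the paths and type names below are assumptions based on the files mentioned in the question:

input {
  file {
    path => "/root/logs/File.log"
    type => "gps_log"      # tag events from this input
  }
  file {
    path => "/root/logs/File2.log"
    type => "other_log"
  }
}
filter {
  # variant 1: compare on the path field set by the file input
  if [path] =~ /File\.log/ {
    mutate { add_tag => ["file1"] }   # the dissect/kv filters shown above would go here
  }
  # variant 2: compare on the type field set per input
  if [type] == "gps_log" {
    mutate { add_tag => ["gps"] }     # the dissect/kv filters shown above would go here
  }
}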

How to preprocess a document before indexation?

I'm using Logstash and Elasticsearch to collect tweets using the Twitter plugin. My problem is that I receive a document from Twitter and I would like to do some preprocessing before indexing my document. Let's say that I have this as a document result from Twitter:
{
  "tweet": {
    "tweetId": 1025,
    "tweetContent": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch",
    "hashtags": ["stackOverflow", "elasticsearch"],
    "publishedAt": "2017 23 August",
    "analytics": {
      "likeNumber": 400,
      "shareNumber": 100
    }
  },
  "author": {
    "authorId": 819744,
    "authorAt": "the_expert",
    "authorName": "John Smith",
    "description": "Haha it's a fake description"
  }
}
Now, out of this document that Twitter is sending me, I would like to generate two documents:
The first one will be indexed in twitter/tweet/1025:
# The id for this document should be the one from tweetId `"tweetId": 1025`
{
  "content": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch", # this field has been renamed
  "hashtags": ["stackOverflow", "elasticsearch"],
  "date": "2017/08/23", # the date has been formatted
  "shareNumber": 100 # this field has been flattened
}
The second one will be indexed in twitter/author/819744:
# The id for this document should be the one from authorId `"authorId": 819744 `
{
  "authorAt": "the_expert",
  "description": "Haha it's a fake description"
}
I have defined my output as follows:
output {
  stdout { codec => dots }
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "twitter"
    document_type => "tweet"
  }
}
How can I process the information from twitter?
EDIT:
So my full config file should look like:
input {
  twitter {
    consumer_key => "consumer_key"
    consumer_secret => "consumer_secret"
    oauth_token => "access_token"
    oauth_token_secret => "access_token_secret"
    keywords => [ "random", "word"]
    full_tweet => true
    type => "tweet"
  }
}
filter {
  clone {
    clones => ["author"]
  }
  if([type] == "tweet") {
    mutate {
      remove_field => ["authorId", "authorAt"]
    }
  } else {
    mutate {
      remove_field => ["tweetId", "tweetContent"]
    }
  }
}
output {
  stdout { codec => dots }
  if [type] == "tweet" {
    elasticsearch {
      hosts => [ "localhost:9200" ]
      index => "twitter"
      document_type => "tweet"
      document_id => "%{[tweetId]}"
    }
  } else {
    elasticsearch {
      hosts => [ "localhost:9200" ]
      index => "twitter"
      document_type => "author"
      document_id => "%{[authorId]}"
    }
  }
}
You could use the clone filter plugin in Logstash.
Here is a sample Logstash configuration file that takes JSON input from stdin and simply shows the output on stdout:
input {
  stdin {
    codec => json
    type => "tweet"
  }
}
filter {
  mutate {
    add_field => {
      "tweetId" => "%{[tweet][tweetId]}"
      "content" => "%{[tweet][tweetContent]}"
      "date" => "%{[tweet][publishedAt]}"
      "shareNumber" => "%{[tweet][analytics][shareNumber]}"
      "authorId" => "%{[author][authorId]}"
      "authorAt" => "%{[author][authorAt]}"
      "description" => "%{[author][description]}"
    }
  }
  date {
    match => ["date", "yyyy dd MMMM"]
    target => "date"
  }
  ruby {
    code => '
      event.set("hashtags", event.get("[tweet][hashtags]"))
    '
  }
  clone {
    clones => ["author"]
  }
  mutate {
    remove_field => ["author", "tweet", "message"]
  }
  if([type] == "tweet") {
    mutate {
      remove_field => ["authorId", "authorAt", "description"]
    }
  } else {
    mutate {
      remove_field => ["tweetId", "content", "hashtags", "date", "shareNumber"]
    }
  }
}
output {
  stdout {
    codec => rubydebug
  }
}
Using as input:
{"tweet": { "tweetId": 1025, "tweetContent": "Hey this is a fake document", "hashtags": ["stackOverflow", "elasticsearch"], "publishedAt": "2017 23 August","analytics": { "likeNumber": 400, "shareNumber": 100 } }, "author":{ "authorId": 819744, "authorAt": "the_expert", "authorName": "John Smith", "description": "fake description" } }
You would get these two documents:
{
  "date" => 2017-08-23T00:00:00.000Z,
  "hashtags" => [
    [0] "stackOverflow",
    [1] "elasticsearch"
  ],
  "type" => "tweet",
  "tweetId" => "1025",
  "content" => "Hey this is a fake document",
  "shareNumber" => "100",
  "@timestamp" => 2017-08-23T20:36:53.795Z,
  "@version" => "1",
  "host" => "my-host"
}
{
  "description" => "fake description",
  "type" => "author",
  "authorId" => "819744",
  "@timestamp" => 2017-08-23T20:36:53.795Z,
  "authorAt" => "the_expert",
  "@version" => "1",
  "host" => "my-host"
}
You could alternatively use a ruby script to flatten the fields, and then use rename in mutate where necessary.
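For example, a hedged sketch of that alternative; the ruby block below is an illustration of the flattening idea, not taken from the answer above:

filter {
  # copy the nested tweet fields to the top level
  ruby {
    code => '
      tweet = event.get("tweet")
      if tweet.is_a?(Hash)
        tweet.each { |k, v| event.set(k, v) unless v.is_a?(Hash) }
      end
    '
  }
  # then rename where needed
  mutate {
    rename => { "tweetContent" => "content" }
  }
}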
If you want Elasticsearch to use authorId and tweetId instead of the default ID, you can configure the elasticsearch output with document_id.
output {
  stdout { codec => dots }
  if [type] == "tweet" {
    elasticsearch {
      hosts => [ "localhost:9200" ]
      index => "twitter"
      document_type => "tweet"
      document_id => "%{[tweetId]}"
    }
  } else {
    elasticsearch {
      hosts => [ "localhost:9200" ]
      index => "twitter"
      document_type => "tweet"
      document_id => "%{[authorId]}"
    }
  }
}

insert data into elasticsearch using logstash and visualize in kibana

I have the following CSV file
tstp,voltage_A_real,voltage_B_real,voltage_C_real #header not present in actual file
2000-01-01 00:00:00,2535.53,-1065.7,-575.754
2000-01-01 01:00:00,2528.31,-1068.67,-576.866
2000-01-01 02:00:00,2528.76,-1068.49,-576.796
2000-01-01 03:00:00,2530.12,-1067.93,-576.586
2000-01-01 04:00:00,2531.02,-1067.56,-576.446
2000-01-01 05:00:00,2533.28,-1066.63,-576.099
2000-01-01 06:00:00,2535.53,-1065.7,-575.754
2000-01-01 07:00:00,2535.53,-1065.7,-575.754
....
I am trying to insert the data into Elasticsearch through Logstash and have the following Logstash config:
input {
  file {
    path => "path_to_csv_file"
    sincedb_path => "/dev/null"
    start_position => beginning
  }
}
filter {
  csv {
    columns => [
      "tstp",
      "Voltage_A_real",
      "Voltage_B_real",
      "Voltage_C_real"
    ]
    separator => ","
  }
  date {
    match => [ "tstp", "yyyy-MM-dd HH:mm:ss"]
  }
  mutate {
    convert => ["Voltage_A_real", "float"]
    convert => ["Voltage_B_real", "float"]
    convert => ["Voltage_C_real", "float"]
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => ["localhost:9200"]
    action => "index"
    index => "temp_load_index"
  }
}
My output from rubydebug when I run logstash -f conf_file -v is
{
  "message" => "2000-02-18 16:00:00,2532.38,-1067,-576.238",
  "@version" => "1",
  "@timestamp" => "2000-02-18T21:00:00.000Z",
  "path" => "path_to_csv",
  "host" => "myhost",
  "tstp" => "2000-02-18 16:00:00",
  "Voltage_A_real" => 2532.38,
  "Voltage_B_real" => -1067.0,
  "Voltage_C_real" => -576.238
}
However, I see only 2 events in Kibana when I look at the dashboard, and both have the current datetime stamp and not one from the year 2000, which is the range of my data. Could someone please help me figure out what is happening?
A sample kibana object is as follows
{
  "_index": "temp_load_index",
  "_type": "logs",
  "_id": "myid",
  "_score": null,
  "_source": {
    "message": "2000-04-02 02:00:00,2528.76,-1068.49,-576.796",
    "@version": "1",
    "@timestamp": "2016-09-27T05:15:29.753Z",
    "path": "path_to_csv",
    "host": "myhost",
    "tstp": "2000-04-02 02:00:00",
    "Voltage_A_real": 2528.76,
    "Voltage_B_real": -1068.49,
    "Voltage_C_real": -576.796,
    "tags": [
      "_dateparsefailure"
    ]
  },
  "fields": {
    "@timestamp": [
      1474953329753
    ]
  },
  "sort": [
    1474953329753
  ]
}
When you open Kibana, it usually shows you only events from the last 15 minutes, according to the @timestamp field. So you need to set the time filter to the appropriate time range (cf. the documentation), in your case using the absolute option and starting at 2000-01-01.
Or you can put the parsed timestamp in another field (for example original_tst), so that the @timestamp added by Logstash will be kept:
date {
  match => [ "tstp", "yyyy-MM-dd HH:mm:ss"]
  target => "original_tst"
}

load array data mysql to ElasticSearch using logstash jdbc

Hi, I am new to ES and I'm trying to load data from MySQL into Elasticsearch.
I am getting the below error when trying to load data in array format. Any help?
Here is the MySQL data; I need array data for the new and hex value columns.
cid  color     new   hex      create          modified
1    100 euro  abcd  #86c67c  5/5/2016 15:48  5/13/2016 14:15
1    100 euro  1234  #fdf8ff  5/5/2016 15:48  5/13/2016 14:15
Here is the Logstash config:
input {
  jdbc {
    jdbc_driver_library => "/etc/logstash/mysql/mysql-connector-java-5.1.39-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/test"
    jdbc_user => "root"
    jdbc_password => "*****"
    schedule => "* * * * *"
    statement => "select cid,color, new as 'cvalue.new',hexa_value as 'cvalue.hexa',created,modified from colors_hex_test order by cid"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "50000"
  }
}
output {
  elasticsearch {
    index => "colors_hexa"
    document_type => "colors"
    document_id => "%{cid}"
    hosts => "localhost:9200"
  }
}
I need array data for cvalue (new, hexa), like:
{
  "_index": "colors_hexa",
  "_type": "colors",
  "_id": "1",
  "_version": 218,
  "found": true,
  "_source": {
    "cid": 1,
    "color": "100 euro",
    "cvalue": {
      "new": "1234",
      "hexa_value": "#fdf8ff"
    },
    "created": "2016-05-05T10:18:51.000Z",
    "modified": "2016-05-13T08:45:30.000Z",
    "@version": "1",
    "@timestamp": "2016-05-14T01:30:00.059Z"
  }
}
This is the error I'm getting while running Logstash:
"status"=>400, "error"=>{"type"=>"mapper_parsing_exception",
"reason"=>"Field name [cvalue.hexa] cannot contain '.'"}}}, :level=>:warn}
You can't give a field a name containing a '.'. But you can try to add:
filter {
  mutate {
    rename => { "new" => "[cvalue][new]" }
    rename => { "hexa" => "[cvalue][hexa]" }
  }
}
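If the SQL statement keeps the dotted aliases ('cvalue.new', 'cvalue.hexa'), a variant of the same idea is to rename those literal dotted field names instead. This is a hedged sketch, not part of the original answer, and it assumes the jdbc input delivers flat top-level fields literally named cvalue.new and cvalue.hexa:

filter {
  mutate {
    # assumption: the jdbc aliases arrive as flat fields whose names contain a dot
    rename => { "cvalue.new"  => "[cvalue][new]" }
    rename => { "cvalue.hexa" => "[cvalue][hexa]" }
  }
}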

logstash multiline codec with java stack trace

I am trying to parse a log file with grok. The configuration I use allows me to parse a single-line event but not a multiline one (with a Java stack trace).
# What I get in Kibana for a single line:
{
  "_index": "logstash-2015.02.05",
  "_type": "logs",
  "_id": "mluzA57TnCpH-XBRbeg",
  "_score": null,
  "_source": {
    "message": " - 2014-01-14 11:09:35,962 [main] INFO (api.batch.ThreadPoolWorker) user.country=US",
    "@version": "1",
    "@timestamp": "2015-02-05T09:38:21.310Z",
    "path": "/root/test2.log",
    "time": "2014-01-14 11:09:35,962",
    "main": "main",
    "loglevel": "INFO",
    "class": "api.batch.ThreadPoolWorker",
    "mydata": " user.country=US"
  },
  "sort": [
    1423129101310,
    1423129101310
  ]
}
# What I get for a multiline event with a stack trace:
{
  "_index": "logstash-2015.02.05",
  "_type": "logs",
  "_id": "9G6LsSO-aSpsas_jOw",
  "_score": null,
  "_source": {
    "message": "\tat oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:20)",
    "@version": "1",
    "@timestamp": "2015-02-05T09:38:21.380Z",
    "path": "/root/test2.log",
    "tags": [
      "_grokparsefailure"
    ]
  },
  "sort": [
    1423129101380,
    1423129101380
  ]
}
input {
  file {
    path => "/root/test2.log"
    start_position => "beginning"
    codec => multiline {
      pattern => "^ - %{TIMESTAMP_ISO8601} "
      negate => true
      what => "previous"
    }
  }
}
filter {
  grok {
    match => [ "message", " -%{SPACE}%{SPACE}%{TIMESTAMP_ISO8601:time} \[%{WORD:main}\] %{LOGLEVEL:loglevel}%{SPACE}%{SPACE}\(%{JAVACLASS:class}\) %{GREEDYDATA:mydata} %{JAVASTACKTRACEPART}"]
  }
  date {
    match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}
output {
  elasticsearch {
    host => "194.3.227.23"
  }
  # stdout { codec => rubydebug}
}
Can anyone please tell me what I'm doing wrong in my configuration file? Thanks.
Here's a sample of my log file:
- 2014-01-14 11:09:36,447 [main] INFO (support.context.ContextFactory) Creating default context
- 2014-01-14 11:09:38,623 [main] ERROR (support.context.ContextFactory) Error getting connection to database jdbc:oracle:thin:@HAL9000:1521:DEVPRINT, with user cisuser and driver oracle.jdbc.driver.OracleDriver
java.sql.SQLException: ORA-28001: the password has expired
at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:70)
at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:131)
EDIT: here's the latest configuration I'm using:
https://gist.github.com/anonymous/9afe80ad604f9a3d3c00#file-output-L1
First point: when repeating tests with the file input, be sure to use sincedb_path => "/dev/null" so that the file is read from the beginning on every run.
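For instance, a minimal sketch based on the file input from the question; only the sincedb_path line is added here:

input {
  file {
    path => "/root/test2.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"   # forget read positions so the file is re-read on every run
    codec => multiline {
      pattern => "^ - %{TIMESTAMP_ISO8601} "
      negate => true
      what => "previous"
    }
  }
}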
About multiline, there must be something wrong either with your question content or with your multiline pattern, because none of the events have the multiline tag that is added by the multiline codec or filter when aggregating the lines.
Your message field should contain all lines separated by line feed characters \n (\r\n in my case, being on Windows). Here is the expected output from your input configuration:
{
  "@timestamp" => "2015-02-10T11:03:33.298Z",
  "message" => " - 2014-01-14 11:09:35,962 [main] INFO (api.batch.ThreadPoolWorker) user.country=US\r\n\tat oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:20\r",
  "@version" => "1",
  "tags" => [
    [0] "multiline"
  ],
  "host" => "localhost",
  "path" => "/root/test.file"
}
About grok: since you want to match a multiline string, you should use a pattern like this.
filter {
  grok {
    match => {"message" => [
      "(?m)^ -%{SPACE}%{TIMESTAMP_ISO8601:time} \[%{WORD:main}\] %{LOGLEVEL:loglevel}%{SPACE}\(%{JAVACLASS:class}\) %{DATA:mydata}\n%{GREEDYDATA:stack}",
      "^ -%{SPACE}%{TIMESTAMP_ISO8601:time} \[%{WORD:main}\] %{LOGLEVEL:loglevel}%{SPACE}\(%{JAVACLASS:class}\) %{GREEDYDATA:mydata}"]
    }
  }
}
The (?m) prefix instructs the regex engine to do multiline matching.
And then you get an event like
{
  "@timestamp" => "2015-02-10T10:47:20.078Z",
  "message" => " - 2014-01-14 11:09:35,962 [main] INFO (api.batch.ThreadPoolWorker) user.country=US\r\n\tat oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:20\r",
  "@version" => "1",
  "tags" => [
    [0] "multiline"
  ],
  "host" => "localhost",
  "path" => "/root/test.file",
  "time" => "2014-01-14 11:09:35,962",
  "main" => "main",
  "loglevel" => "INFO",
  "class" => "api.batch.ThreadPoolWorker",
  "mydata" => " user.country=US\r",
  "stack" => "\tat oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:20\r"
}
You can build and validate your multiline patterns with this online tool http://grokconstructor.appspot.com/do/match
A final warning: there is currently a bug in the Logstash file input with the multiline codec that mixes up content from several files if you use a list or wildcard in the path setting. The only workaround is to use the multiline filter.
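A hedged sketch of that workaround, reusing the pattern from the question with the multiline filter instead of the codec:

filter {
  multiline {
    # same pattern/negate/what options as the codec in the question
    pattern => "^ - %{TIMESTAMP_ISO8601} "
    negate => true
    what => "previous"
  }
}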
HTH
EDIT: I was focusing on the multiline strings; you also need to add a similar pattern for single-line strings.
