how to map an input document field to the elasticsearch _id field? - elasticsearch

I have a fairly simple pipeline for taking json messages from Kafka and sending them to Elasticsearch:
input {
kafka {
bootstrap_servers => "kafka04-prod01.messagehub.services.eu-de.bluemix.net:9093,kafka05-prod01.messagehub.services.eu-de.bluemix.net:9093,kafka01-prod01.messagehub.services.eu-de.bluemix.net:9093,kafka03-prod01.messagehub.services.eu-de.bluemix.net:9093,kafka02-prod01.messagehub.services.eu-de.bluemix.net:9093"
topics => [ "transactions_load" ]
}
}
filter {
json {
source => "message"
}
mutate{
remove_field => ["kafka"]
remove_field => ["#version"]
remove_field => ["#timestamp"]
remove_field => ["message"]
remove_tag => ["multiline"]
}
}
output {
elasticsearch {
hosts => [
"xxxxx.ibm-343.composedb.com:16915",
"xxxxx.ibm-343.composedb.com:16915"
]
ssl => true
user => "logstash_kafka"
password => "*****"
index => "pos_transactions"
}
}
The json records have a TransactionID field that uniquely identifies each record:
{"TransactionID": "5440772161", "InvoiceNo": 5440772, "StockCode": 22294, "Description": "HEART FILIGREE DOVE SMALL", "Quantity": 4, "InvoiceDate": 1507777440000, "UnitPrice": 1.25, "CustomerID": 14825, "Country": "United Kingdom", "LineNo": 16, "InvoiceTime": "03:04:00", "StoreID": 1}
{"TransactionID": "5440772191", "InvoiceNo": 5440772, "StockCode": 21733, "Description": "RED HANGING HEART T-LIGHT HOLDER", "Quantity": 4, "InvoiceDate": 1507777440000, "UnitPrice": 2.95, "CustomerID": 14825, "Country": "United Kingdom", "LineNo": 19, "InvoiceTime": "03:04:00", "StoreID": 1}
Can I configure logstash to use the TransactionID as the _id field so that if I process duplicate records for the same transaction, these updates are idempotent?

I figured the answer out myself. Posting here because it may be useful for others:
output {
elasticsearch {
hosts => [
"xxxxx.ibm-343.composedb.com:16915",
"xxxxx.ibm-343.composedb.com:16915"
]
ssl => true
user => "logstash_kafka"
password => "*****"
index => "pos_transactions"
document_id => "%{TransactionID}"
}
}
The document_id => "%{TransactionID}" configuration entry uses the incoming document TransactionID field for the elasticsearch _id

Related

Incorrect document_id for Logstash elastic search output

I'm using Logstash to read json messages from Solace queue and write it to elastic Search.I'm using the doc_as_upsert => true along with the document_id parameters in the output.This is how my logstash configuration looks like
logstash.conf
input
{
jms {
include_header => false
include_properties => false
include_body => true
use_jms_timestamp => false
destination => 'SpringBatchTestQueue'
pub_sub => false
jndi_name => '/JMS/CF/MDM'
jndi_context => {
'java.naming.factory.initial' => 'com.solacesystems.jndi.SolJNDIInitialContextFactory'
'java.naming.security.principal' => 'EDM_Test_User#NovartisDevVPN'
'java.naming.provider.url' => 'tcp://localhost:55555'
'java.naming.security.credentials' => 'EDM_Test_User'
}
require_jars=> ['/app/elasticsearch/jms/commons-lang-2.6.jar',
'/app/elasticsearch/jms/sol-jms-10.10.0.jar',
'/app/elasticsearch/jms/geronimo-jms_1.1_spec-1.1.1.jar']
}
}
output
{
elasticsearch
{
hosts => ["https://glchbs-sd220240.eu.novartis.net:9200/"]
index => "test-%{+YYYY.MM.dd}"
document_id => "%{customerId}"
doc_as_upsert => true
ssl => true
ssl_certificate_verification => true
cacert => "/app/elasticsearch/config/ssl/Novartis_Silver_Three_Chain.pem"
}
}
Json Message:
{
"customerId": "N-CA-Z9II2YJ1YJ",
"name": "Alan Birch",
"customerRecordType": "Health Care Professional",
"country": "CA",
"language": "EN",
"privacyLawStatus": false,
"salutation": "Mr.",
"firstName": "Alan",
"lastName": "Birch",
"customerType": "Non Prescriber",
"hcpType": "Pharmacist Assistant",
"isMedicalExpert": false,
"customerAddresses": [
{
"addressType": "Primary Address",
"addressLine1": "4001 Leslie Street"
},
{
"addressType": "Other",
"addressLine1": "3004 Center St"
}
],
"meansOfContact": [
{
"type": "Email1",
"value": "alab#noname.com",
"status": "Active"
},
{
"type": "Email2",
"value": "balan#gmail.com",
"status": "Active"
}
],
"specialities": [
{
"specialtyType": "Primary Specialty",
"specialty": "Pharmacy Technician",
"status": "Active"
}
]
}
As you can see, I'm trying to use the customerId field of the JSON message as the document id for elasticsearch. But this is what a document inserted into Elasticsearch looks like:
As you can see document_id field should be mapped to customerId field but this is not case..Document is being inserted as %{customerId}
How to fix this?Appreciate your help
That is telling you that the [customerId] field does not exist on that event. If the [message] field is JSON then you should add a json filter to parse it. That will create the [customerId] field, which you can then use as the document_id.
json { source => "message" }

geoip.location is defined as an object in mapping [doc] but this name is already used for a field in other types

I'm getting this error:
Could not index event to Elasticsearch. {:status=>400,
:action=>["index", {:_id=>nil, :_index=>"nginx-access-2018-06-15",
:_type=>"doc", :_routing=>nil}, #],
:response=>{"index"=>{"_index"=>"nginx-access-2018-06-15",
"_type"=>"doc", "_id"=>"jo-rfGQBDK_ao1ZhmI8B", "status"=>400,
"error"=>{"type"=>"illegal_argument_exception",
"reason"=>"[geoip.location] is defined as an object in mapping [doc]
but this name is already used for a field in other types"}}}}
I'm getting the above error but don't understand why, this is loading into a brand new ES instance with no data. This is the first record that is inserted. Why am I getting this error? Here is the config:
input {
file {
type => "nginx-access"
start_position => "beginning"
path => [ "/var/log/nginx-archived/access.log.small"]
start_position => "beginning"
sincedb_path => "/dev/null"
}
}
filter {
if [type] == "nginx-access" {
grok {
patterns_dir => "/etc/logstash/patterns"
match => { "message" => "%{NGINX_ACCESS}" }
remove_tag => ["_grokparsefailure"]
}
geoip {
source => "visitor_ip"
}
date {
# 11/Jun/2018:06:23:45 +0000
match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
target => "#request_time"
}
if "_grokparsefailure" not in [tags] {
ruby {
code => "
thetime = event.get('#request_time').time
event.set('index_date', 'nginx-access-' + thetime.strftime('%Y-%m-%d'))
"
}
}
if "_grokparsefailure" in [tags] {
ruby {
code => "
event.set('index_date', 'nginx-access-error')
"
}
}
}
}
output {
elasticsearch {
hosts => "elasticsearch:9200"
index => "%{index_date}"
template => "/etc/logstash/templates/nginx-access.json"
template_overwrite => true
manage_template => true
template_name => "nginx-access"
}
stdout { }
}
Here's a sample record:
{
"method" => "GET",
"#version" => "1",
"geoip" => {
"continent_code" => "AS",
"latitude" => 39.9289,
"country_name" => "China",
"ip" => "220.181.108.103",
"location" => {
"lon" => 116.3883,
"lat" => 39.9289
},
"region_code" => "11",
"region_name" => "Beijing",
"longitude" => 116.3883,
"timezone" => "Asia/Shanghai",
"city_name" => "Beijing",
"country_code2" => "CN",
"country_code3" => "CN"
},
"index_date" => "nginx-access-2018-06-15",
"ignore" => "\"-\"",
"bytes" => "2723",
"request" => "/wp-login.php",
"#request_time" => 2018-06-15T06:29:40.000Z,
"message" => "220.181.108.103 - - [15/Jun/2018:06:29:40 +0000] \"GET /wp-login.php HTTP/1.1\" 200 2723 \"-\" \"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)\"",
"path" => "/var/log/nginx-archived/access.log.small",
"#timestamp" => 2018-07-09T01:32:56.952Z,
"host" => "ab1526efddec",
"visitor_ip" => "220.181.108.103",
"timestamp" => "15/Jun/2018:06:29:40 +0000",
"response" => "200",
"referrer" => "\"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)\"",
"httpversion" => "1.1",
"type" => "nginx-access"
}
Figured out the answer, based on this:
https://www.elastic.co/guide/en/elasticsearch/reference/6.x/removal-of-types.html#_schedule_for_removal_of_mapping_types
The basic problem is that for each Elasticsearch index, each field must be the same type, even if the records are different types.
That is, if I have a person { "status": "A" } stored as text I cannot have a record for a car { "status": 23 } stored as a number in the same index. Based on the info in the link above, I'm storing one "type" per index.
My output section for Logstash looks like this:
output {
elasticsearch {
hosts => "elasticsearch:9200"
index => "%{index_date}"
# Can test loading this with:
# curl -XPUT -H 'Content-Type: application/json' -d#/docker-elk/logstash/templates/nginx-access.json http://localhost:9200/_template/nginx-access
template => "/etc/logstash/templates/nginx-access.json"
template_overwrite => true
manage_template => true
template_name => "nginx-access"
}
stdout { }
}
My template looks like this:
{
"index_patterns": ["nginx-access*"],
"settings": {
},
"mappings": {
"doc": {
"_source": {
"enabled": true
},
"properties": {
"type" : { "type": "keyword" },
"response_time": { "type": "float" },
"geoip" : {
"properties" : {
"location": {
"type": "geo_point"
}
}
}
}
}
}
}
I'm also using the one type per index pattern described in the link above.

How to preprocess a document before indexation?

I'm using logstash and elasticsearch to collect tweet using the Twitter plug in. My problem is that I receive a document from twitter and I would like to make some preprocessing before indexing my document. Let's say that I have this as a document result from twitter:
{
"tweet": {
"tweetId": 1025,
"tweetContent": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch",
"hashtags": ["stackOverflow", "elasticsearch"],
"publishedAt": "2017 23 August",
"analytics": {
"likeNumber": 400,
"shareNumber": 100,
}
},
"author":{
"authorId": 819744,
"authorAt": "the_expert",
"authorName": "John Smith",
"description": "Haha it's a fake description"
}
}
Now out of this document that twitter is sending me I would like to generate two documents:
the first one will be indexed in twitter/tweet/1025 :
# The id for this document should be the one from tweetId `"tweetId": 1025`
{
"content": "Hey this is a fake document for stackoverflow #stackOverflow #elasticsearch", # this field has been renamed
"hashtags": ["stackOverflow", "elasticsearch"],
"date": "2017/08/23", # the date has been formated
"shareNumber": 100 # This field has been flattened
}
The second one will be indexed in twitter/author/819744:
# The id for this document should be the one from authorId `"authorId": 819744 `
{
"authorAt": "the_expert",
"description": "Haha it's a fake description"
}
I have defined my output as follow:
output {
stdout { codec => dots }
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
}
}
How can I process the information from twitter?
EDIT:
So my full config file should look like:
input {
twitter {
consumer_key => "consumer_key"
consumer_secret => "consumer_secret"
oauth_token => "access_token"
oauth_token_secret => "access_token_secret"
keywords => [ "random", "word"]
full_tweet => true
type => "tweet"
}
}
filter {
clone {
clones => ["author"]
}
if([type] == "tweet") {
mutate {
remove_field => ["authorId", "authorAt"]
}
} else {
mutate {
remove_field => ["tweetId", "tweetContent"]
}
}
}
output {
stdout { codec => dots }
if [type] == "tweet" {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
document_id => "%{[tweetId]}"
}
} else {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "author"
document_id => "%{[authorId]}"
}
}
}
You could use the clone filter plugin on logstash.
With a sample logstash configuration file that takes a JSON input from stdin and simply shows the output on stdout:
input {
stdin {
codec => json
type => "tweet"
}
}
filter {
mutate {
add_field => {
"tweetId" => "%{[tweet][tweetId]}"
"content" => "%{[tweet][tweetContent]}"
"date" => "%{[tweet][publishedAt]}"
"shareNumber" => "%{[tweet][analytics][shareNumber]}"
"authorId" => "%{[author][authorId]}"
"authorAt" => "%{[author][authorAt]}"
"description" => "%{[author][description]}"
}
}
date {
match => ["date", "yyyy dd MMMM"]
target => "date"
}
ruby {
code => '
event.set("hashtags", event.get("[tweet][hashtags]"))
'
}
clone {
clones => ["author"]
}
mutate {
remove_field => ["author", "tweet", "message"]
}
if([type] == "tweet") {
mutate {
remove_field => ["authorId", "authorAt", "description"]
}
} else {
mutate {
remove_field => ["tweetId", "content", "hashtags", "date", "shareNumber"]
}
}
}
output {
stdout {
codec => rubydebug
}
}
Using as input:
{"tweet": { "tweetId": 1025, "tweetContent": "Hey this is a fake document", "hashtags": ["stackOverflow", "elasticsearch"], "publishedAt": "2017 23 August","analytics": { "likeNumber": 400, "shareNumber": 100 } }, "author":{ "authorId": 819744, "authorAt": "the_expert", "authorName": "John Smith", "description": "fake description" } }
You would get these two documents:
{
"date" => 2017-08-23T00:00:00.000Z,
"hashtags" => [
[0] "stackOverflow",
[1] "elasticsearch"
],
"type" => "tweet",
"tweetId" => "1025",
"content" => "Hey this is a fake document",
"shareNumber" => "100",
"#timestamp" => 2017-08-23T20:36:53.795Z,
"#version" => "1",
"host" => "my-host"
}
{
"description" => "fake description",
"type" => "author",
"authorId" => "819744",
"#timestamp" => 2017-08-23T20:36:53.795Z,
"authorAt" => "the_expert",
"#version" => "1",
"host" => "my-host"
}
You could alternatively use a ruby script to flatten the fields, and then use rename on mutate, when necessary.
If you want elasticsearch to use authorId and tweetId, instead of default ID, you could probably configure elasticsearch output with document_id.
output {
stdout { codec => dots }
if [type] == "tweet" {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
document_id => "%{[tweetId]}"
}
} else {
elasticsearch {
hosts => [ "localhost:9200" ]
index => "twitter"
document_type => "tweet"
document_id => "%{[authorId]}"
}
}
}

logstash splits event field values and assign to #metadata field

I have a logstash event, which has the following field
{
"_index": "logstash-2016.08.09",
"_type": "log",
"_id": "AVZvz2ix",
"_score": null,
"_source": {
"message": "function_name~execute||line_no~128||debug_message~id was not found",
"#version": "1",
"#timestamp": "2016-08-09T14:57:00.147Z",
"beat": {
"hostname": "coredev",
"name": "coredev"
},
"count": 1,
"fields": null,
"input_type": "log",
"offset": 22299196,
"source": "/project_root/project_1/log/core.log",
"type": "log",
"host": "coredev",
"tags": [
"beats_input_codec_plain_applied"
]
},
"fields": {
"#timestamp": [
1470754620147
]
},
"sort": [
1470754620147
]
}
I am wondering how to use filter (kv maybe?) to extract core.log from "source": "/project_root/project_1/log/core.log", and put it in e.g. [#metadata][log_type], and so later on, I can use log_type in output to create an unique index, composing of hostname + logtype + timestamp, e.g.
output {
elasticsearch {
hosts => "localhost:9200"
manage_template => false
index => "%{[#metadata][_source][host]}-%{[#metadata][log_type]}-%{+YYYY.MM.dd}"
document_type => "%{[#metadata][type]}"
}
stdout { codec => rubydebug }
}
You can leverage the mutate/gsub filter in order to achieve this:
filter {
# add the log_type metadata field
mutate {
add_field => {"[#metadata][log_type]" => "%{source}"}
}
# remove everything up to the last slash
mutate {
gsub => [ "[#metadata][log_type]", "^.*\/", "" ]
}
}
Then you can modify your elasticsearch output like this:
output {
elasticsearch {
hosts => ["localhost:9200"]
manage_template => false
index => "%{host}-%{[#metadata][log_type]}-%{+YYYY.MM.dd}"
document_type => "%{[#metadata][type]}"
}
stdout { codec => rubydebug }
}

Logstash elapsed filter

I am trying to use the elapsed.rb filter in the ELK stack and cant seem to figure it out. I am not very familiar with grok and I believe that is where my issue lives. Can anyone help?
Example Log Files:
{
"application_name": "Application.exe",
"machine_name": "Machine1",
"user_name": "testuser",
"entry_date": "2015-03-12T18:12:23.5187552Z",
"chef_environment_name": "chefenvironment1",
"chef_logging_cookbook_version": "0.1.9",
"logging_level": "INFO",
"performance": {
"process_name": "account_search",
"process_id": "Machine1|1|635617555435187552",
"event_type": "enter"
},
"thread_name": "1",
"logger_name": "TestLogger",
"#version": "1",
"#timestamp": "2015-03-12T18:18:48.918Z",
"type": "rabbit",
"log_from": "rabbit"
}
{
"application_name": "Application.exe",
"machine_name": "Machine1",
"user_name": "testuser",
"entry_date": "2015-03-12T18:12:23.7527462Z",
"chef_environment_name": "chefenvironment1",
"chef_logging_cookbook_version": "0.1.9",
"logging_level": "INFO",
"performance": {
"process_name": "account_search",
"process_id": "Machine1|1|635617555435187552",
"event_type": "exit"
},
"thread_name": "1",
"logger_name": "TestLogger",
"#version": "1",
"#timestamp": "2015-03-12T18:18:48.920Z",
"type": "rabbit",
"log_from": "rabbit"
}
Example .conf file
input {
rabbitmq {
host => "SERVERNAME"
add_field => ["log_from", "rabbit"]
type => "rabbit"
user => "testuser"
password => "testuser"
durable => "true"
exchange => "Logging"
queue => "testqueue"
codec => "json"
exclusive => "false"
passive => "true"
}
}
filter {
grok {
match => ["message", "%{TIMESTAMP_ISO8601} START id: (?<process_id>.*)"]
add_tag => [ "taskStarted" ]
}
grok {
match => ["message", "%{TIMESTAMP_ISO8601} END id: (?<process_id>.*)"]
add_tag => [ "taskTerminated"]
}
elapsed {
start_tag => "taskStarted"
end_tag => "taskTerminated"
unique_id_field => "process_id"
timeout => 10000
new_event_on_match => false
}
}
output {
file {
codec => json { charset => "UTF-8" }
path => "test.log"
}
}
You would not need to use a grok filter because your input is already in json format. You'd need to do something like this:
if [performance][event_type] == "enter" {
mutate { add_tag => ["taskStarted"] }
} else if [performance][event_type] == "exit" {
mutate { add_tag => ["taskTerminated"] }
}
elapsed {
start_tag => "taskStarted"
end_tag => "taskTerminated"
unique_id_field => "performance.process_id"
timeout => 10000
new_event_on_match => false
}
I'm not positive on that unique_id_field -- I think it should work, but if it doesn't you could just change it to process_id only and add_field => { "process_id" => "%{[performance][process_id]}" }

Resources