Incorrect document_id for Logstash Elasticsearch output - elasticsearch

I'm using Logstash to read JSON messages from a Solace queue and write them to Elasticsearch. I'm using doc_as_upsert => true along with the document_id parameter in the output. This is what my Logstash configuration looks like:
logstash.conf
input
{
jms {
include_header => false
include_properties => false
include_body => true
use_jms_timestamp => false
destination => 'SpringBatchTestQueue'
pub_sub => false
jndi_name => '/JMS/CF/MDM'
jndi_context => {
'java.naming.factory.initial' => 'com.solacesystems.jndi.SolJNDIInitialContextFactory'
'java.naming.security.principal' => 'EDM_Test_User@NovartisDevVPN'
'java.naming.provider.url' => 'tcp://localhost:55555'
'java.naming.security.credentials' => 'EDM_Test_User'
}
require_jars => ['/app/elasticsearch/jms/commons-lang-2.6.jar',
'/app/elasticsearch/jms/sol-jms-10.10.0.jar',
'/app/elasticsearch/jms/geronimo-jms_1.1_spec-1.1.1.jar']
}
}
output
{
elasticsearch
{
hosts => ["https://glchbs-sd220240.eu.novartis.net:9200/"]
index => "test-%{+YYYY.MM.dd}"
document_id => "%{customerId}"
doc_as_upsert => true
ssl => true
ssl_certificate_verification => true
cacert => "/app/elasticsearch/config/ssl/Novartis_Silver_Three_Chain.pem"
}
}
JSON message:
{
"customerId": "N-CA-Z9II2YJ1YJ",
"name": "Alan Birch",
"customerRecordType": "Health Care Professional",
"country": "CA",
"language": "EN",
"privacyLawStatus": false,
"salutation": "Mr.",
"firstName": "Alan",
"lastName": "Birch",
"customerType": "Non Prescriber",
"hcpType": "Pharmacist Assistant",
"isMedicalExpert": false,
"customerAddresses": [
{
"addressType": "Primary Address",
"addressLine1": "4001 Leslie Street"
},
{
"addressType": "Other",
"addressLine1": "3004 Center St"
}
],
"meansOfContact": [
{
"type": "Email1",
"value": "alab#noname.com",
"status": "Active"
},
{
"type": "Email2",
"value": "balan#gmail.com",
"status": "Active"
}
],
"specialities": [
{
"specialtyType": "Primary Specialty",
"specialty": "Pharmacy Technician",
"status": "Active"
}
]
}
As you can see, I'm trying to use the customerId field of the JSON message as the document id for Elasticsearch. But this is what an inserted document looks like:
The document_id should be mapped to the customerId field, but that is not the case: the document is being indexed with a literal _id of %{customerId}.
How can I fix this? Any help is appreciated.

That is telling you that the [customerId] field does not exist on that event. If the [message] field is JSON then you should add a json filter to parse it. That will create the [customerId] field, which you can then use as the document_id.
json { source => "message" }

Related

Beat input in Logstash is losing fields

I have the following infrastructure:
ELK installed as Docker containers, each component in its own container. On a virtual machine running CentOS, I installed the nginx web server and Filebeat to collect its logs.
I enabled the nginx module in filebeat.
> filebeat modules enable nginx
Before starting Filebeat, I set it up with Elasticsearch and installed its dashboards in Kibana.
config file (I have removed unnecessary comments from the file):
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
setup.kibana:
  host: "172.17.0.1:5601"
output.elasticsearch:
  hosts: ["172.17.0.1:9200"]
Then, to set it up in Elasticsearch and Kibana:
> filebeat setup -e --dashboards
This works fine. In fact, if I keep it this way everything works perfectly: I can use the collected logs in Kibana and the nginx dashboards installed with the command above.
However, I want to pass the logs through Logstash.
My Logstash configuration uses the following pipelines:
- pipeline.id: filebeat
  path.config: "config/filebeat.conf"
filebeat.conf:
input {
beats {
port => 5044
}
}
#filter {
# mutate {
# add_tag => ["filebeat"]
# }
#}
output {
elasticsearch {
hosts => ["elasticsearch0:9200"]
index => "%{[#metadata][beat]}-%{[#metadata][version]}-%{+YYYY.MM.dd}"
}
stdout { }
}
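For the beats input above to receive events, Filebeat's own output is presumably switched from Elasticsearch to Logstash, roughly like this (an assumed companion change; host is illustrative, the port matches the beats input above):
output.logstash:
  hosts: ["172.17.0.1:5044"]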
When the logs go through Logstash, the resulting event is just:
{
"offset" => 6655,
"#version" => "1",
"#timestamp" => 2019-02-20T13:34:06.886Z,
"message" => "10.0.2.2 - - [20/Feb/2019:08:33:58 -0500] \"GET / HTTP/1.1\" 304 0 \"-\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.98 Chrome/71.0.3578.98 Safari/537.36\" \"-\"",
"beat" => {
"version" => "6.5.4",
"name" => "localhost.localdomain",
"hostname" => "localhost.localdomain"
},
"source" => "/var/log/nginx/access.log",
"host" => {
"os" => {
"version" => "7 (Core)",
"codename" => "Core",
"family" => "redhat",
"platform" => "centos"
},
"name" => "localhost.localdomain",
"id" => "18e7cb2506624fb6ae2dc3891d5d7172",
"containerized" => true,
"architecture" => "x86_64"
},
"fileset" => {
"name" => "access",
"module" => "nginx"
},
"tags" => [
[0] "beats_input_codec_plain_applied"
],
"input" => {
"type" => "log"
},
"prospector" => {
"type" => "log"
}
}
A lot of fields are missing from my object. There should be much more structured information.
UPDATE: This is what I'm expecting instead:
{
"_index": "filebeat-6.5.4-2019.02.20",
"_type": "doc",
"_id": "ssJPC2kBLsya0HU-3uwW",
"_version": 1,
"_score": null,
"_source": {
"offset": 9639,
"nginx": {
"access": {
"referrer": "-",
"response_code": "404",
"remote_ip": "10.0.2.2",
"method": "GET",
"user_name": "-",
"http_version": "1.1",
"body_sent": {
"bytes": "3650"
},
"remote_ip_list": [
"10.0.2.2"
],
"url": "/access",
"user_agent": {
"patch": "3578",
"original": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.98 Chrome/71.0.3578.98 Safari/537.36",
"major": "71",
"minor": "0",
"os": "Ubuntu",
"name": "Chromium",
"os_name": "Ubuntu",
"device": "Other"
}
}
},
"prospector": {
"type": "log"
},
"read_timestamp": "2019-02-20T14:29:36.393Z",
"source": "/var/log/nginx/access.log",
"fileset": {
"module": "nginx",
"name": "access"
},
"input": {
"type": "log"
},
"#timestamp": "2019-02-20T14:29:32.000Z",
"host": {
"os": {
"codename": "Core",
"family": "redhat",
"version": "7 (Core)",
"platform": "centos"
},
"containerized": true,
"name": "localhost.localdomain",
"id": "18e7cb2506624fb6ae2dc3891d5d7172",
"architecture": "x86_64"
},
"beat": {
"hostname": "localhost.localdomain",
"name": "localhost.localdomain",
"version": "6.5.4"
}
},
"fields": {
"#timestamp": [
"2019-02-20T14:29:32.000Z"
]
},
"sort": [
1550672972000
]
}
The answer provided by @baudsp was mostly correct, but it was incomplete. I had exactly the same problem, and I also had exactly the same filter mentioned in the documentation (and in @baudsp's answer), but documents in Elasticsearch still did not contain any of the expected fields.
I finally found the problem: because I had Filebeat configured to send nginx logs via the Nginx module and not the Log input, the data coming from Filebeat didn't quite match what the example Logstash filter was expecting.
The conditional in the example is if [fileset][module] == "nginx", which is correct if Filebeat is sending data from a Log input. However, since the log data is coming from the Nginx module, the fileset property doesn't contain a module property.
To make the filter work with data coming from the Nginx module, the conditional needs to look for something else. I found [event][module] to work in place of [fileset][module].
The working filter:
filter {
if [event][module] == "nginx" {
if [fileset][name] == "access" {
grok {
match => { "message" => ["%{IPORHOST:[nginx][access][remote_ip]} - %{DATA:[nginx][access][user_name]} \[%{HTTPDATE:[nginx][access][time]}\] \"%{WORD:[nginx][access][method]} %{DATA:[nginx][access][url]} HTTP/%{NUMBER:[nginx][access][http_version]}\" %{NUMBER:[nginx][access][response_code]} %{NUMBER:[nginx][access][body_sent][bytes]} \"%{DATA:[nginx][access][referrer]}\" \"%{DATA:[nginx][access][agent]}\""] }
remove_field => "message"
}
mutate {
add_field => { "read_timestamp" => "%{#timestamp}" }
}
date {
match => [ "[nginx][access][time]", "dd/MMM/YYYY:H:m:s Z" ]
remove_field => "[nginx][access][time]"
}
useragent {
source => "[nginx][access][agent]"
target => "[nginx][access][user_agent]"
remove_field => "[nginx][access][agent]"
}
geoip {
source => "[nginx][access][remote_ip]"
target => "[nginx][access][geoip]"
}
}
else if [fileset][name] == "error" {
grok {
match => { "message" => ["%{DATA:[nginx][error][time]} \[%{DATA:[nginx][error][level]}\] %{NUMBER:[nginx][error][pid]}#%{NUMBER:[nginx][error][tid]}: (\*%{NUMBER:[nginx][error][connection_id]} )?%{GREEDYDATA:[nginx][error][message]}"] }
remove_field => "message"
}
mutate {
rename => { "#timestamp" => "read_timestamp" }
}
date {
match => [ "[nginx][error][time]", "YYYY/MM/dd H:m:s" ]
remove_field => "[nginx][error][time]"
}
}
}
}
Now, documents in Elasticsearch have all of the expected fields:
Note: You'll have the same problem with other Filebeat modules, too. Just use [event][module] in place of [fileset][module].
From your logstash configuration, it doesn't look like you are parsing the log message.
There's an example in the logstash documentation on how to parse nginx logs:
Nginx Logs
The Logstash pipeline configuration in this example shows how to ship and parse access and error logs collected by the nginx Filebeat module.
input {
beats {
port => 5044
host => "0.0.0.0"
}
}
filter {
if [fileset][module] == "nginx" {
if [fileset][name] == "access" {
grok {
match => { "message" => ["%{IPORHOST:[nginx][access][remote_ip]} - %{DATA:[nginx][access][user_name]} \[%{HTTPDATE:[nginx][access][time]}\] \"%{WORD:[nginx][access][method]} %{DATA:[nginx][access][url]} HTTP/%{NUMBER:[nginx][access][http_version]}\" %{NUMBER:[nginx][access][response_code]} %{NUMBER:[nginx][access][body_sent][bytes]} \"%{DATA:[nginx][access][referrer]}\" \"%{DATA:[nginx][access][agent]}\""] }
remove_field => "message"
}
mutate {
add_field => { "read_timestamp" => "%{#timestamp}" }
}
date {
match => [ "[nginx][access][time]", "dd/MMM/YYYY:H:m:s Z" ]
remove_field => "[nginx][access][time]"
}
useragent {
source => "[nginx][access][agent]"
target => "[nginx][access][user_agent]"
remove_field => "[nginx][access][agent]"
}
geoip {
source => "[nginx][access][remote_ip]"
target => "[nginx][access][geoip]"
}
}
else if [fileset][name] == "error" {
grok {
match => { "message" => ["%{DATA:[nginx][error][time]} \[%{DATA:[nginx][error][level]}\] %{NUMBER:[nginx][error][pid]}#%{NUMBER:[nginx][error][tid]}: (\*%{NUMBER:[nginx][error][connection_id]} )?%{GREEDYDATA:[nginx][error][message]}"] }
remove_field => "message"
}
mutate {
rename => { "#timestamp" => "read_timestamp" }
}
date {
match => [ "[nginx][error][time]", "YYYY/MM/dd H:m:s" ]
remove_field => "[nginx][error][time]"
}
}
}
}
I know this doesn't address why Filebeat doesn't send the full object to Logstash, but it should give you a start on how to parse the nginx logs in Logstash.

How to map an input document field to the Elasticsearch _id field?

I have a fairly simple pipeline for taking JSON messages from Kafka and sending them to Elasticsearch:
input {
kafka {
bootstrap_servers => "kafka04-prod01.messagehub.services.eu-de.bluemix.net:9093,kafka05-prod01.messagehub.services.eu-de.bluemix.net:9093,kafka01-prod01.messagehub.services.eu-de.bluemix.net:9093,kafka03-prod01.messagehub.services.eu-de.bluemix.net:9093,kafka02-prod01.messagehub.services.eu-de.bluemix.net:9093"
topics => [ "transactions_load" ]
}
}
filter {
json {
source => "message"
}
mutate{
remove_field => ["kafka"]
remove_field => ["#version"]
remove_field => ["#timestamp"]
remove_field => ["message"]
remove_tag => ["multiline"]
}
}
output {
elasticsearch {
hosts => [
"xxxxx.ibm-343.composedb.com:16915",
"xxxxx.ibm-343.composedb.com:16915"
]
ssl => true
user => "logstash_kafka"
password => "*****"
index => "pos_transactions"
}
}
The JSON records have a TransactionID field that uniquely identifies each record:
{"TransactionID": "5440772161", "InvoiceNo": 5440772, "StockCode": 22294, "Description": "HEART FILIGREE DOVE SMALL", "Quantity": 4, "InvoiceDate": 1507777440000, "UnitPrice": 1.25, "CustomerID": 14825, "Country": "United Kingdom", "LineNo": 16, "InvoiceTime": "03:04:00", "StoreID": 1}
{"TransactionID": "5440772191", "InvoiceNo": 5440772, "StockCode": 21733, "Description": "RED HANGING HEART T-LIGHT HOLDER", "Quantity": 4, "InvoiceDate": 1507777440000, "UnitPrice": 2.95, "CustomerID": 14825, "Country": "United Kingdom", "LineNo": 19, "InvoiceTime": "03:04:00", "StoreID": 1}
Can I configure logstash to use the TransactionID as the _id field so that if I process duplicate records for the same transaction, these updates are idempotent?
I figured the answer out myself. Posting here because it may be useful for others:
output {
elasticsearch {
hosts => [
"xxxxx.ibm-343.composedb.com:16915",
"xxxxx.ibm-343.composedb.com:16915"
]
ssl => true
user => "logstash_kafka"
password => "*****"
index => "pos_transactions"
document_id => "%{TransactionID}"
}
}
The document_id => "%{TransactionID}" configuration entry uses the incoming document TransactionID field for the elasticsearch _id

logstash: split event field values and assign to the @metadata field

I have a Logstash event which has the following fields:
{
"_index": "logstash-2016.08.09",
"_type": "log",
"_id": "AVZvz2ix",
"_score": null,
"_source": {
"message": "function_name~execute||line_no~128||debug_message~id was not found",
"#version": "1",
"#timestamp": "2016-08-09T14:57:00.147Z",
"beat": {
"hostname": "coredev",
"name": "coredev"
},
"count": 1,
"fields": null,
"input_type": "log",
"offset": 22299196,
"source": "/project_root/project_1/log/core.log",
"type": "log",
"host": "coredev",
"tags": [
"beats_input_codec_plain_applied"
]
},
"fields": {
"#timestamp": [
1470754620147
]
},
"sort": [
1470754620147
]
}
I am wondering how to use a filter (kv maybe?) to extract core.log from "source": "/project_root/project_1/log/core.log" and put it in e.g. [@metadata][log_type], so that later on I can use log_type in the output to create a unique index composed of hostname + log_type + timestamp, e.g.
output {
elasticsearch {
hosts => "localhost:9200"
manage_template => false
index => "%{[#metadata][_source][host]}-%{[#metadata][log_type]}-%{+YYYY.MM.dd}"
document_type => "%{[#metadata][type]}"
}
stdout { codec => rubydebug }
}
You can leverage the mutate/gsub filter in order to achieve this:
filter {
# add the log_type metadata field
mutate {
add_field => {"[#metadata][log_type]" => "%{source}"}
}
# remove everything up to the last slash
mutate {
gsub => [ "[#metadata][log_type]", "^.*\/", "" ]
}
}
Then you can modify your elasticsearch output like this:
output {
elasticsearch {
hosts => ["localhost:9200"]
manage_template => false
index => "%{host}-%{[#metadata][log_type]}-%{+YYYY.MM.dd}"
document_type => "%{[#metadata][type]}"
}
stdout { codec => rubydebug }
}
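An alternative sketch (not in the original answer) that extracts the file name in a single grok step, using grok's nested-field capture syntax; because GREEDYDATA is greedy, the first pattern consumes everything up to the last slash:
filter {
  grok {
    # capture whatever follows the last "/" of the source path, e.g. core.log
    match => { "source" => "%{GREEDYDATA}/%{GREEDYDATA:[@metadata][log_type]}" }
  }
}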

Show location points in a tile map with kibi

I'm using Logstash 2.3.1, Elasticsearch 2.3.1 and Kibi 0.3.2. I have problems visualizing locations on a map with Kibi.
I have the following configuration in Logstash:
input {
file {
path => "/opt/logstash-2.3.1/logTest/Dades.csv"
type => "Dades"
start_position => "beginning"
}
}
filter {
csv {
columns => ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10", "c11", "c12", "c13", "c14", "c15", "c16", "c17", "c18", "c19", "c20", "c21", "c22", "c23"]
separator => ";"
}
ruby {
code => "
temp = event['c17']
event['c17'] = temp[0..1].to_f+ (temp[2..8].to_f/60)
temp = event['c19']
event['c19'] = temp[0..2].to_f+ (temp[3..8].to_f/60)
"
}
mutate {
convert => {
"c3" => "float"
"c5" => "float"
"c7" => "float"
"c9" => "float"
"c11" => "float"
"c13" => "float"
"c15" => "float"
"c21" => "float"
"c23" => "float"
}
}
date {
match => [ "c1", "dd/MM/YYYY HH:mm:ss.SSS", "ISO8601"]
target => "ts_date"
}
mutate {
rename => [ "c17", "[location][lat]",
"c19", "[location][lon]" ]
}
}
output {
elasticsearch {
hosts => localhost
index => "tram3"
manage_template => false
template => "tram3_template.json"
template_name => "tram3"
template_overwrite => "true"
}
stdout {
codec => rubydebug
}
}
The mapping configuration file (tram3_template.json) is like this:
{
"template": "tram3",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"tram3": {
"_all": {
"enabled": false
},
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
When I import the CSV file into Elasticsearch, everything seems to work OK. The output is something like this:
{
"message" => "26/02/2016 00:00:22.984;Total;4231.143555;Trac1;26.547932;Trac2;-338.939697;AA1;-364.611511;AA2;3968.135010;Reo1;0.000000;Reo2;0.000000;Latitud;4125.1846;Longitud;00213.5219;Speed;0.000000;CVS;3873.429443;\r",
"#version" => "1",
"#timestamp" => "2016-04-25T14:02:52.901Z",
"path" => "/opt/logstash-2.3.1/logTest/Dades.csv",
"host" => "ubuntu",
"type" => "Dades",
"c1" => "26/02/2016 00:00:22.984",
"c2" => "Total",
"c3" => 4231.143555,
"c4" => "Trac1",
"c5" => 26.547932,
"c6" => "Trac2",
"c7" => -338.939697,
"c8" => "AA1",
"c9" => -364.611511,
"c10" => "AA2",
"c11" => 3968.13501,
"c12" => "Reo1",
"c13" => 0.0,
"c14" => "Reo2",
"c15" => 0.0,
"c16" => "Latitud",
"c18" => "Longitud",
"c20" => "Speed",
"c21" => 0.0,
"c22" => "CVS",
"c23" => 3873.429443,
"column24" => nil,
"ts_date" => "2016-02-25T23:00:22.984Z",
"location" => {
"lat" => 41.41974333333334,
"lon" => 2.22535
}
}
But when I try to visualize the location field on a map, it doesn't show any results.
I don't know what I'm doing wrong. Why don't the location points appear on the map?
In your ES mapping file, you probably need to enable the storage of the geohash sub-field (defaults to false) as the geohash aggregation cannot work without it.
{
"template": "tram3",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"tram3": {
"_all": {
"enabled": false
},
"properties": {
"location": {
"type": "geo_point",
"geohash": true, <-- add this
"geohash_prefix": true <-- add this
}
}
}
}
}
Then you can build a geohash aggregation on the location.geohash field.
Note that if you want to also index all geohash prefixes, you can add "geohash_prefix": true to your field mapping.
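For reference, the kind of query such a tile-map visualization runs is a geohash grid aggregation; a rough sketch against the tram3 index (the exact field targeted and the precision depend on the Kibi/Elasticsearch version):
POST /tram3/_search
{
  "size": 0,
  "aggs": {
    "location_grid": {
      "geohash_grid": {
        "field": "location",
        "precision": 5
      }
    }
  }
}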
UPDATE
After reproducing the case, here are some more fixes to do:
You need to change the type in your file input, as it will be used as the document type, and your mapping specifies that the mapping type is named dades2, not Dades:
file {
path => "/opt/logstash-2.3.1/logTest/Dades.csv"
type => "dades2"
start_position => "beginning"
sincedb_path => "/dev/null"
}
Your elasticsearch output should look like below; namely, manage_template should be true and template should use the full path to your dades2_template.json file (make sure to replace /full/path/to with the actual path).
elasticsearch {
hosts => localhost
index => "dades2"
manage_template => true
template => "/full/path/to/dades2_template.json"
template_name => "dades2"
template_overwrite => "true"
}
The new dades2_template.json file should look like this:
{
"template": "dades2",
"order": 1,
"settings": {
"number_of_shards": 1
},
"mappings": {
"dades2": {
"_all": {
"enabled": false
},
"properties": {
"location": {
"type": "geo_point",
"geohash": true,
"geohash_prefix": true
}
}
}
}
}
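After re-running the pipeline, it is worth checking that the template was actually applied and that location is mapped as a geo_point; something like the following (host illustrative):
> curl -XGET 'http://localhost:9200/dades2/_mapping?pretty'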

logstash - geoip in Kibana cannot show any information using the IP addresses

I want to display the number of users accessing my app on a world map using Elasticsearch, Kibana and Logstash.
Here is my log (JSON format):
{
"device": "",
"public_ip": "70.90.17.210",
"mac": "00:01:02:03:04:05",
"ip": "192.16.1.10",
"event": {
"timestamp": "2014-08-15T00:00:00.000Z",
"source": "system",
"name": "status"
},
"status": {
"channel": "channelname",
"section": "pictures",
"downlink": 1362930,
"network": "Wi-Fi"
}
}
And this is my config file:
input {
file {
path => ["/mnt/logs/stb.events"]
codec => "json"
type => "event"
}
}
filter {
date {
match => [ "timestamp", "yyyy-MM-dd HH:mm:ss", "ISO8601" ]
}
}
filter {
mutate {
convert => [ "downlink", "integer" ]
}
}
filter {
geoip {
add_tag => [ "geoip" ]
database => "/opt/logstash/vendor/geoip/GeoLiteCity.dat"
source => "public_ip"
target => "geoip"
add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
}
mutate {
convert => [ "[geoip][coordinates]", "float" ]
}
}
output {
elasticsearch {
host => localhost
}
}
In the end, in Kibana I see only an empty geoip tag.
Can someone help me and point out where my mistake is?
Since Logstash 1.3.0 you can use the geoip.location field that is created automatically instead of creating the coordinates field and converting it to float manually.
One curly bracket seems to be missing from your log; I guess this is the correct format:
{
"device": {
"public_ip": "70.90.17.210",
"mac": "00:01:02:03:04:05",
"ip": "192.16.1.10"
},
"event": {
"timestamp": "2014-08-15T00:00:00.000Z",
"source": "system",
"name": "status"
},
"status": {
"channel": "channelname",
"section": "pictures",
"downlink": 1362930,
"network": "Wi-Fi"
}
}
In this case, I would suggest trying the following filter configuration (without the mutate):
filter {
geoip {
source => "[device][public_ip]"
}
}
Then you should be able to use geoip.location in your map. I did quite a bit of research and debugging to find out that, in order to be resolved correctly, nested fields should be wrapped in [ ] when used as the source.
