Set _id as update key in Logstash Elasticsearch

I have an index like the one below:
{
  "_index": "mydata",
  "_type": "_doc",
  "_id": "PuhnbG0B1IIlyY9-ArdR",
  "_score": 1,
  "_source": {
    "age": 9,
    "@version": "1",
    "updated_on": "2019-01-01T00:00:00.000Z",
    "id": 4,
    "name": "Emma",
    "@timestamp": "2019-09-26T07:09:11.947Z"
  }
}
My Logstash conf for updating the data is:
input {
  jdbc {
    jdbc_connection_string => "***"
    jdbc_driver_class => "***"
    jdbc_driver_library => "***"
    jdbc_user => ***
    statement => "SELECT * from agedata WHERE updated_on > :sql_last_value ORDER BY updated_on"
    use_column_value => true
    tracking_column => "updated_on"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "mydata"
    action => "update"
    document_id => "%{_id}"
    doc_as_upsert => true
  }
  stdout { codec => rubydebug }
}
So when I run this after updating the same row, I expect it to update the existing document with that _id for any changes I made in the row.
But Elasticsearch indexes it as a new document, where the _id is taken literally as a string:
"_index": "agesep",
"_type": "_doc",
"_id": ***"%{_id}"***
A duplicate also occurs when I use document_id => "%{id}":
actual:
{
  "_index": "mydata",
  "_type": "_doc",
  "_id": "BuilbG0B1IIlyY9-4P7t",
  "_score": 1,
  "_source": {
    "id": 1,
    "age": 13,
    "name": "Greg",
    "updated_on": "2019-09-26T08:11:00.000Z",
    "@timestamp": "2019-09-26T08:17:52.974Z",
    "@version": "1"
  }
}
duplicate:
{
  "_index": "mydata",
  "_type": "_doc",
  "_id": "1",
  "_score": 1,
  "_source": {
    "age": 56,
    "@version": "1",
    "id": 1,
    "name": "Greg",
    "updated_on": "2019-09-26T08:18:00.000Z",
    "@timestamp": "2019-09-26T08:20:14.561Z"
  }
}
How do I get it to use the existing _id and not create a duplicate document when I make updates in ES?
My expectation is to update the data in the index based on the _id, not to create a new document for each update.

I suggest using id instead of _id. The auto-generated _id exists only inside Elasticsearch; it is not a field on the event produced by the jdbc input, so the sprintf reference %{_id} can never be resolved and gets indexed as a literal string (which is exactly what you are seeing). The id column from your table is part of the event, so use that as the document id:
document_id => "%{id}"

Related

I have implemented Kafka with a Logstash input and an Elasticsearch output, and it is working fine in Kibana. I want to filter the data based on the status code.

This is the Kibana dashboard JSON data. I have to filter based on the response statusCode inside the message JSON field:
{
  "_index": "rand-topic",
  "_type": "_doc",
  "_id": "ulF8uH0BK9MbBSR7DPEw",
  "_version": 1,
  "_score": null,
  "fields": {
    "@timestamp": [
      "2021-12-14T10:27:56.956Z"
    ],
    "@version": [
      "1"
    ],
    "@version.keyword": [
      "1"
    ],
    "message": [
      "{\"requestMethod\":\"GET\",\"headers\":{\"content-type\":\"application/json\",\"user-agent\":\"PostmanRuntime/7.28.4\",\"accept\":\"*/*\",\"postman-token\":\"977fc94b-38c8-4df4-ad73-814871a32eca\",\"host\":\"localhost:5600\",\"accept-encoding\":\"gzip, deflate, br\",\"connection\":\"keep-alive\",\"content-length\":\"44\"},\"body\":{\"category\":\"CAT\",\"noise\":\"purr\"},\"query\":{},\"requestUrl\":\"http://localhost:5600/kafka\",\"protocol\":\"HTTP/1.1\",\"remoteIp\":\"1\",\"requestSize\":302,\"userAgent\":\"PostmanRuntime/7.28.4\",\"statusCode\":200,\"response\":{\"success\":true,\"message\":\"Kafka Details are added\",\"data\":{\"kafkaData\":{\"_id\":\"61b871ac69be37078a9c1a79\",\"category\":\"DOG\",\"noise\":\"bark\",\"__v\":0},\"postData\":{\"category\":\"DOG\",\"noise\":\"bark\"}}},\"latency\":{\"seconds\":0,\"nanos\":61000000},\"responseSize\":193}"
    ]
  },
  "sort": [1639477676956]
}
The expected output is like this, with a statusCode field extracted from the message field:
{
  "_index": "rand-topic",
  "_type": "_doc",
  "_id": "ulF8uH0BK9MbBSR7DPEw",
  "_version": 1,
  "_score": null,
  "fields": {
    "@timestamp": [
      "2021-12-14T10:27:56.956Z"
    ],
    "@version": [
      "1"
    ],
    "@version.keyword": [
      "1"
    ],
    "statusCode": [
      200
    ],
    "message": [
      "{\"requestMethod\":\"GET\",\"headers\":{\"content-type\":\"application/json\",\"user-agent\":\"PostmanRuntime/7.28.4\",\"accept\":\"*/*\",\"postman-token\":\"977fc94b-38c8-4df4-ad73-814871a32eca\",\"host\":\"localhost:5600\",\"accept-encoding\":\"gzip, deflate, br\",\"connection\":\"keep-alive\",\"content-length\":\"44\"},\"body\":{\"category\":\"CAT\",\"noise\":\"purr\"},\"query\":{},\"requestUrl\":\"http://localhost:5600/kafka\",\"protocol\":\"HTTP/1.1\",\"remoteIp\":\"1\",\"requestSize\":302,\"userAgent\":\"PostmanRuntime/7.28.4\",\"statusCode\":200,\"response\":{\"success\":true,\"message\":\"Kafka Details are added\",\"data\":{\"kafkaData\":{\"_id\":\"61b871ac69be37078a9c1a79\",\"category\":\"DOG\",\"noise\":\"bark\",\"__v\":0},\"postData\":{\"category\":\"DOG\",\"noise\":\"bark\"}}},\"latency\":{\"seconds\":0,\"nanos\":61000000},\"responseSize\":193}"
    ]
  },
  "sort": [1639477676956]
}
Please help me configure the Logstash filter for statusCode. My current config:
input {
kafka {
topics => ["randtopic"]
bootstrap_servers => "192.168.29.138:9092"
}
}
filter{
mutate {
add_field => {
"statusCode" => "%{[status]}"
}
}
}
output {
elasticsearch {
hosts => ["192.168.29.138:9200"]
index => "rand-topic"
workers => 1
}
}
output {
  if [message][0][statusCode] == "200" {
    # do something ...
    stdout { codec => "" }
  }
}
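One possible approach, sketched under the assumption that the message field always carries the JSON string shown above (the parsed target name is illustrative, and this is untested): parse the string with the json filter, then copy the status code up to a top-level field:
filter {
  json {
    source => "message"            # parse the JSON string in [message]
    target => "parsed"             # into a structured [parsed] object
    skip_on_invalid_json => true   # leave events alone if message is not JSON
  }
  if [parsed][statusCode] {
    mutate {
      copy => { "[parsed][statusCode]" => "statusCode" }
    }
  }
}
Since the json filter preserves the number type, a later conditional should compare against the number, e.g. if [statusCode] == 200 { ... }, rather than the string "200".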

ElasticSearch query with conditions on multiple documents

I have data of this format in Elasticsearch, each one in a separate document:
{ 'pid': 1, 'nm' : 'tom'}, { 'pid': 1, 'nm' : 'dick'}, { 'pid': 1, 'nm' : 'harry'}, { 'pid': 2, 'nm' : 'tom'}, { 'pid': 2, 'nm' : 'harry'}, { 'pid': 3, 'nm' : 'dick'}, { 'pid': 3, 'nm' : 'harry'}, { 'pid': 4, 'nm' : 'harry'}
{
"took": 137,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 8,
"max_score": null,
"hits": [
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KS86AaDUbQTYUmwY",
"_score": null,
"_source": {
"pid": 1,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KJ9BAaDUbQTYUmwW",
"_score": null,
"_source": {
"pid": 1,
"nm": "Tom"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KRlbAaDUbQTYUmwX",
"_score": null,
"_source": {
"pid": 1,
"nm": "Dick"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KYnKAaDUbQTYUmwa",
"_score": null,
"_source": {
"pid": 2,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KXL5AaDUbQTYUmwZ",
"_score": null,
"_source": {
"pid": 2,
"nm": "Tom"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KbcpAaDUbQTYUmwb",
"_score": null,
"_source": {
"pid": 3,
"nm": "Dick"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9Kdy5AaDUbQTYUmwc",
"_score": null,
"_source": {
"pid": 3,
"nm": "Harry"
}
},
{
"_index": "query_test",
"_type": "user",
"_id": "AVj9KetLAaDUbQTYUmwd",
"_score": null,
"_source": {
"pid": 4,
"nm": "Harry"
}
}
]
}
}
And I need to find the pids which have 'harry' and do not have 'tom'; in the above example those are 3 and 4. Essentially this means: look at the documents sharing the same pid, and keep a pid only if none of its documents has nm with value 'tom' but at least one has nm with value 'harry'.
How do I query that?
EDIT: Using Elasticsearch version 5
You could use bool in a POST request body that looks something like this:
POST _search
{
"query": {
"bool" : {
"must" : {
"term" : { "nm" : "harry" }
},
"must_not" : {
"term" : { "nm" : "tom" }
}
}
}
}
I am relatively new to Elasticsearch, so I might be wrong, but I have never seen such a query. Simple filters cannot be used here, since they are applied per document (not per aggregation), which is not what you want. What you want is a "group by" query with a "having" clause (in SQL terms). But group-by queries normally involve some metric aggregation (like avg, max, or min of a field) which is then used in the "having" clause; essentially you use a reducer to post-process the aggregation results. For queries like that, the Bucket Selector aggregation can be used. Read this.
But your case is different. You do not want to apply a "having" clause to a metric aggregation; you want to check whether some value is present in a field (or column) of your grouped data. In SQL terms, that is a "where" condition evaluated across a "group by" bucket, which is something I have never seen. You can also read this.
However, at the application level, you can easily do this by breaking the query in two. First, find the unique pids that have nm = harry using a terms aggregation; then find which of those pids also have nm = tom, and remove them from the list (see the sketch below).
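A rough sketch of the two requests, using the index and fields from the question (the aggregation names and size values are illustrative, and the pid list in the second request is filled in from the first response):
POST query_test/_search
{
  "size": 0,
  "query": { "term": { "nm": "harry" } },
  "aggs": {
    "harry_pids": { "terms": { "field": "pid", "size": 10000 } }
  }
}

POST query_test/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": { "terms": { "pid": [1, 2, 3, 4] } },
      "must": { "term": { "nm": "tom" } }
    }
  },
  "aggs": {
    "tom_pids": { "terms": { "field": "pid", "size": 10000 } }
  }
}
The first request returns the pids that have at least one harry doc (1, 2, 3, 4 in the example); the tom_pids buckets from the second request (1 and 2) are then subtracted in application code, leaving 3 and 4.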
P.S. I am very new to ES, and I will be very happy if anyone contradicts me and shows a way to do this in one query; I will learn from that too.

How to filter a query using another query in ElasticSearch

Given the example user and product docs below:
{
  "_id": "1",
  "_type": "user",
  "_source": {
    "id": "1",
    "following": ["2", "3", ... , "10000"]
  }
}
{
  "_id": "1",
  "_type": "product",
  "_source": {
    "id": "1",
    "owner_id": "2"
  }
}
{
  "_id": "2",
  "_type": "product",
  "_source": {
    "id": "2",
    "owner_id": "10001"
  }
}
I want to get the products that belong to the users who are followed by the user with id=1. I don't want to make two separate queries (first to get the users followed by user id=1, then a second to get their products), since user id=1 is following ~10000 users.
Is there any way of getting the result using only one query?
I think you're looking for the terms query:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-dsl-terms-query.html
Check out the documentation; the example there is very similar to your case:
GET /product/_search
{
"query" : {
"terms" : {
"owner_id": {
"index": "<your_index>",
"type": "user",
"id": "1",
"path": "following"
}
}
}
}
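With this terms lookup, Elasticsearch fetches the user document with id 1, reads the values under its following path, and uses them as the list of terms that owner_id must match, all within a single search request.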

How do I replicate the _id and _type of elasticsearch index when dumping data through Logstash

I have an "Index":samcorp with "type":"sam".
One of them looks like the below :
{
"_index": "samcorp",
"_type": "sam",
"_id": "1236",
"_version": 1,
"_score": 1,
"_source": {
"name": "Sam Smith",
"age": 22,
"confirmed": true,
"join_date": "2014-06-01"
}
}
I want to replicate the same data into a different index named "jamcorp", with the same type and the same id.
I am using Logstash to do it. With the configuration below, I end up with the wrong ids and type:
input {
elasticsearch {
hosts => ["127.0.0.1:9200"]
index => "samcorp"
}
}
filter {
mutate {
remove_field => [ "#version", "#timestamp" ]
}
}
output {
elasticsearch {
hosts => ["127.0.0.1:9200"]
manage_template => false
index => "jamcorp"
document_type => "%{_type}"
document_id => "%{_id}"
}
}
I've tried all possible combinations and I get the following output:
Output:
{
"_index": "jamcorp",
"_type": "%{_type}",
"_id": "%{_id}",
"_version": 4,
"_score": 1,
"_source": {
"name": "Sam Smith",
"age": 22,
"confirmed": true,
"join_date": "2014-06-01"
}
}
The output I require is:
{
"_index": "jamcorp",
"_type": "sam",
"_id": "1236",
"_version": 4,
"_score": 1,
"_source": {
"name": "Sam Smith",
"age": 22,
"confirmed": true,
"join_date": "2014-06-01"
}
}
Any help would be appreciated. :) Thanks
In your elasticsearch input, you need to set the docinfo parameter to true
input {
elasticsearch {
hosts => ["127.0.0.1:9200"]
index => "samcorp"
docinfo => true <--- add this
}
}
As a result, the @metadata field will be populated with the _index, _type, and _id of each document, and you can reuse those in your filters and outputs:
output {
elasticsearch {
hosts => ["127.0.0.1:9200"]
manage_template => false
index => "jamcorp"
document_type => "%{[#metadata][_type]}" <--- use #metadata
document_id => "%{[#metadata][_id]}" <--- use #metadata
}
}
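If you want to verify what actually lands in @metadata while testing, note that @metadata is normally hidden from outputs; the rubydebug codec can be asked to print it:
output {
  stdout { codec => rubydebug { metadata => true } }   # show @metadata in the debug output
}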

How to extract feature from the Elasticsearch _source to index

I have used Logstash, Elasticsearch, and Kibana to collect logs.
The log file is JSON, like this:
{"_id":{"$oid":"5540afc2cec7c68fc1248d78"},"agentId":"0000000BAB39A520","handler":"SUSIControl","sensorId":"/GPIO/GPIO00/Level","ts":{"$date":"2015-04-29T09:00:00.846Z"},"vHour":1}
{"_id":{"$oid":"5540afc2cec7c68fc1248d79"},"agentId":"0000000BAB39A520","handler":"SUSIControl","sensorId":"/GPIO/GPIO00/Dir","ts":{"$date":"2015-04-29T09:00:00.846Z"},"vHour":0}
And this is the config I used in Logstash:
input {
file {
type => "log"
path => ["/home/data/1/1.json"]
start_position => "beginning"
}
}
filter {
json{
source => "message"
}
}
output {
elasticsearch { embedded => true }
stdout { codec => rubydebug }
}
Then the output in Elasticsearch is:
{
  "_index": "logstash-2015.06.29",
  "_type": "log",
  "_id": "AU5AG7KahwyA2bfnpJO0",
  "_version": 1,
  "_score": 1,
  "_source": {
    "message": "{\"_id\":{\"$oid\":\"5540afc2cec7c68fc1248d7c\"},\"agentId\":\"0000000BAB39A520\",\"handler\":\"SUSIControl\",\"sensorId\":\"/GPIO/GPIO05/Dir\",\"ts\":{\"$date\":\"2015-04-29T09:00:00.846Z\"},\"vHour\":1}",
    "@version": "1",
    "@timestamp": "2015-06-29T16:17:03.040Z",
    "type": "log",
    "host": "song-Lenovo-IdeaPad",
    "path": "/home/song/soft/data/1/Average.json",
    "_id": {
      "$oid": "5540afc2cec7c68fc1248d7c"
    },
    "agentId": "0000000BAB39A520",
    "handler": "SUSIControl",
    "sensorId": "/GPIO/GPIO05/Dir",
    "ts": {
      "$date": "2015-04-29T09:00:00.846Z"
    },
    "vHour": 1
  }
}
But the information from the JSON file is all in _source, and I can't use Kibana to analyze it: Kibana shows that "Analysis is not available for object fields", since fields such as _id and ts are object fields.
How do I solve this problem?
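One possible direction, sketched here as an assumption rather than a verified fix: the unanalyzable fields are the nested Mongo-style objects (_id.$oid and ts.$date), so they can be hoisted to flat top-level fields in the filter block (the names doc_id and event_ts are illustrative):
filter {
  json {
    source => "message"
  }
  mutate {
    # hoist the nested Mongo-style values into flat, analyzable fields
    rename => { "[_id][$oid]" => "doc_id" }
    rename => { "[ts][$date]" => "event_ts" }
    remove_field => [ "message" ]
  }
  date {
    # use the record's own timestamp as the event time (@timestamp)
    match => [ "event_ts", "ISO8601" ]
  }
}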
