Combine two index into third index in elastic search using logstash - elasticsearch

I have two index
employee_data
{"code":1, "name":xyz, "city":"Mumbai" }
transaction_data
{"code":1, "Month":June", payment:78000 }
I want third index like this
3)join_index
{"code":1, "name":xyz, "city":"Mumbai", "Month":June", payment:78000 }
How it's possible??
i am trying in logstash
input {
elasticsearch {
hosts => "localost"
index => "employees_data,transaction_data"
query => '{ "query": { "match": { "code": 1} } }'
scroll => "5m"
docinfo => true
}
}
output {
elasticsearch {
hosts => ["localhost"]
index => "join1"
}
}

You can use elasticsearch input on employees_data
In your filters, use the elasticsearch filter on transaction_data
input {
elasticsearch {
hosts => "localost"
index => "employees_data"
query => '{ "query": { "match_all": { } } }'
sort => "code:desc"
scroll => "5m"
docinfo => true
}
}
filter {
elasticsearch {
hosts => "localhost"
index => "transaction_data"
query => "(code:\"%{[code]}\"
fields => {
"Month" => "Month",
"payment" => "payment"
}
}
}
output {
elasticsearch {
hosts => ["localhost"]
index => "join1"
}
}
And send your new document to your third index with the elasticsearch output
You'll have 3 elastic search connection and the result can be a little slow.
But it works.

You don't need Logstash to do this, Elasticsearch itself supports that by leveraging the enrich processor.
First, you need to create an enrich policy (use the smallest index, let's say it's employees_data ):
PUT /_enrich/policy/employee-policy
{
"match": {
"indices": "employees_data",
"match_field": "code",
"enrich_fields": ["name", "city"]
}
}
Then you can execute that policy in order to create an enrichment index
POST /_enrich/policy/employee-policy/_execute
When the enrichment index has been created and populated, the next step requires you to create an ingest pipeline that uses the above enrich policy/index:
PUT /_ingest/pipeline/employee_lookup
{
"description" : "Enriching transactions with employee data",
"processors" : [
{
"enrich" : {
"policy_name": "employee-policy",
"field" : "code",
"target_field": "tmp",
"max_matches": "1"
}
},
{
"script": {
"if": "ctx.tmp != null",
"source": "ctx.putAll(ctx.tmp); ctx.remove('tmp');"
}
}
]
}
Finally, you're now ready to create your target index with the joined data. Simply leverage the _reindex API combined with the ingest pipeline we've just created:
POST _reindex
{
"source": {
"index": "transaction_data"
},
"dest": {
"index": "join1",
"pipeline": "employee_lookup"
}
}
After running this, the join1 index will contain exactly what you need, for instance:
{
"_index" : "join1",
"_type" : "_doc",
"_id" : "0uA8dXMBU9tMsBeoajlw",
"_score" : 1.0,
"_source" : {
"code":1,
"name": "xyz",
"city": "Mumbai",
"Month": "June",
"payment": 78000
}
}

As long as I know, this can not be happened just using elasticsearch APIs. To handle this, you need to set a unique ID for documents that are relevant. For example, the code that you mentioned in your question can be a good ID for documents. So you can reindex the first index to the third one and use UPDATE API to update them by reading documents from the second index and update them by their IDs into the third index. I hope I could help.

Related

Enriching the Data in Elastic Search

We will be ingesting data into an Index (Index1), however one of the fields in the document(field1) is an ENUM value, which needs to be converted into a value (string) using a lookup through a rest api call.
the rest api call gives a JSON in response like this which has string values for all the ENUMS.
{
values : {
"ENUMVALUE1" : "StringValue1",
"ENUMVALUE2" : "StringValue2"
}
}
I am thinking of making an index from this response document and use that for the lookup.
The incoming document has field1 as ENUMVALUE1 or ENUMVALUE2 (only one of them) and we want to eventually save StringValue1 or StringValue2 in the document under field1 and not ENUMVALUE1.
I went through the documentation of enrichment processor however I am not sure if that is the correct approach to handle this scenario.
While forming the match enrich policy I am not sure how match_field and enrich_fields should be configured.
Could you please advise if this can be done in Elastic and if yes what possible options do I have if the above one is not an optimal approach.
OK, 150-200 enums might not be enough to use an enrich index, but here is a potential solution.
You first need to build the source index containing all enum mappings, it would look like this:
POST enums/_doc/_bulk
{"index":{}}
{"enum_id": "ENUMVALUE1", "string_value": "StringValue1"}
{"index":{}}
{"enum_id": "ENUMVALUE2", "string_value": "StringValue2"}
Then you need to create an enrich policy out of this index:
PUT /_enrich/policy/enum-policy
{
"match": {
"indices": "enums",
"match_field": "enum_id",
"enrich_fields": [
"string_value"
]
}
}
POST /_enrich/policy/enum-policy/_execute
Once it's built (with 200 values it should take a few seconds), you can start building your ingest pipeline using an ingest processor:
PUT _ingest/pipeline/enum-pipeline
{
"description": "Enum enriching pipeline",
"processors": [
{
"enrich" : {
"policy_name": "enum-policy",
"field" : "field1",
"target_field": "tmp"
}
},
{
"set": {
"if": "ctx.tmp != null",
"field": "field1",
"value": "{{tmp.string_value}}"
}
},
{
"remove": {
"if": "ctx.tmp != null",
"field": "tmp"
}
}
]
}
Testing this pipeline, we get this:
POST _ingest/pipeline/enum-pipeline/_simulate
{
"docs": [
{
"_source": {
"field1": "ENUMVALUE1"
}
},
{
"_source": {
"field1": "ENUMVALUE4"
}
}
]
}
Results =>
{
"docs" : [
{
"doc" : {
"_source" : {
"field1" : "StringValue1" <--- value has been replaced
}
}
},
{
"doc" : {
"_source" : {
"field1" : "ENUMVALUE4" <--- value has NOT been replaced
}
}
}
]
}
For the sake of completeness, I'm sharing the other solution without enrich index, so you can test both and use whichever makes most sense for you.
In this second option, we're simply going to use an ingest pipeline with a script processor whose parameters contain a map of your enums. field1 will be replaced by whatever value is mapped to the enum value it contains, or will keep its value if there's no corresponding enum value.
PUT _ingest/pipeline/enum-pipeline
{
"description": "Enum enriching pipeline",
"processors": [
{
"script": {
"source": """
ctx.field1 = params.getOrDefault(ctx.field1, ctx.field1);
""",
"params": {
"ENUMVALUE1": "StringValue1",
"ENUMVALUE2": "StringValue2",
... // add all your enums here
}
}
}
]
}
Testing this pipeline, we get this
POST _ingest/pipeline/enum-pipeline/_simulate
{
"docs": [
{
"_source": {
"field1": "ENUMVALUE1"
}
},
{
"_source": {
"field1": "ENUMVALUE4"
}
}
]
}
Results =>
{
"docs" : [
{
"doc" : {
"_source" : {
"field1" : "StringValue1" <--- value has been replaced
}
}
},
{
"doc" : {
"_source" : {
"field1" : "ENUMVALUE4" <--- value has NOT been replaced
}
}
}
]
}
So both solutions would work for your case, you just need to pick up the one that is the best fit. Just know that in the first option, if your enums change, you'll need to rebuild your source index and enrich policy, while in the second case, you just need to modify the parameters map of your pipeline.

Elasticsearch - Reindex documents with stored / excluded fields

Im having an index mapping with the following configuration:
"mappings" : {
"_source" : {
"excludes" : [
"special_field"
]
},
"properties" : {
"special_field" : {
"type" : "text",
"store" : true
},
}
}
So, when A new document is indexed using this mapping a got de following result:
{
"_index": "********-2021",
"_id": "************",
"_source": {
...
},
"fields": {
"special_field": [
"my special text"
]
}
}
If a _search query is perfomed, special_field is not returned inside _source as its excluded.
With the following _search query, special_field data is returned perfectly:
GET ********-2021/_search
{
"stored_fields": [ "special_field" ],
"_source": true
}
Right now im trying to reindex all documents inside that index, but im loosing the info stored in special_field and only _source field is getting reindexed.
Is there a way to put that special_field back inside _source field?
Is there a way to reindex that documents without loosing special_field data?
How could these documents be migrated to another cluster without loosing special_field data?
Thank you all.
Thx Hamid Bayat, I finally got it using a small logstash pipeline.
I will share it:
input {
elasticsearch {
hosts => "my-first-cluster:9200"
index => "my-index-pattern-*"
user => "****"
password => "****"
query => '{ "stored_fields": [ "special_field" ], "_source": true }'
size => 500
scroll => "5m"
docinfo => true
docinfo_fields => ["_index", "_type", "_id", "fields"]
}
}
filter {
if [#metadata][fields][special_field]{
mutate {
add_field => { "special_field" => "%{[#metadata][fields][special_field]}" }
}
}
}
output {
elasticsearch {
hosts => ["http://my-second-cluster:9200"]
password => "****"
user => "****"
index => "%{[#metadata][_index]}"
document_id => "%{[#metadata][_id]}"
template => "/usr/share/logstash/config/index_template.json"
template_name => "template-name"
template_overwrite => true
}
}
I had to add fields into docinfo_fields => ["_index", "_type", "_id", "fields"] elasticsearch input plugin and all my stored_fields were on [#metadata][fields] event field.
As the #metadata field is not indexed i had to add a new field at root level with [#metadata][fields][special_field] value.
Its working like a charm.

Replica and shard settings not applied in elasticsearch template

I've added a template like this:
curl -X PUT "e.f.g.h:9200/_template/impression-template" -H 'Content-Type: application/json' -d'
{
"index_patterns": ["impression-%{+YYYY.MM.dd}"],
"settings": {
"number_of_shards": 2,
"number_of_replicas": 2
},
"mappings": {
"_doc": {
"_source": {
"enabled": false
},
"dynamic": false,
"properties": {
"message": {
"type": "object",
"properties": {
...
And I've logstash instance that read events from kafka on write them to ES. Here is my logstash config:
input {
kafka {
topics => ["impression"]
bootstrap_servers => "a.b.c.d:9092"
}
}
filter {
json {
source => "message"
target => "message"
}
}
output {
elasticsearch {
hosts => ["e.f.g.h:9200"]
index => "impression-%{+YYYY.MM.dd}"
template_name => "impression-template"
}
}
But each day I get index with 5 shard and 1 replica (which is default config of ES). How I could fix that so I could get 2 replica and 2 shard?
Not sure you can add index_pattern as my_index-%{+YYYY.MM.dd}, because when you create it and PUT my_index-2019.03.10 it will have empty mapping because it's not recognized. I had same issue, and workaround for this was to set index_pattern as my_index-* and add year suffix to indices which should look like my_index-2017, my_index-2018...
{
"my_index_template" : {
"order" : 0,
"index_patterns" : [
"my_index-*"
],
"settings" : {
"index" : {
"number_of_shards" : "5",
"number_of_replicas" : "1"
}
},...
I took year part from timestamp field (YYYY-MM-dd) to generate year and add it to the end of index name by logstash
grok {
match => [
"timestamp", "(?<index_year>%{YEAR})"
]
}
mutate {
add_field => {
"[#metadata][index_year]" => "%{index_year}"
}
}
mutate {
remove_field => [ "index_year", "#version" ]
}
}
output{
elasticsearch {
hosts => ["localhost:9200"]
index => "my_index-%{[#metadata][index_year]}"
document_id => "%{some_field}"
}
}
After logstash was completed, I've managed to get my_index-2017, my_index-2018 and my_index-2019 indices with 5 shards, and 1 replica and correct mapping as I predefined in my template.

Understanding ELK Analyzer

Am newby to ELK 5.1.1 stack and I have a few questions just for my understanding.
I have setup this stack basicaly with standard analyzers / filters and everything works great.
My data source is a MySQL backend that I index using Logstash.
I would like to deal with queries containing accents and hopefully asciifolding token filter can help achieve this.
First I learned out how to create custom analyzer and save as template.
Right now when I query this url http://localhost:9200/_template?pretty I have 2 templates: the logstash default template named logstash and my custom template which settings are:
"custom_template" : {
"order" : 1,
"template" : "doo*",
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"myCustomAnalyzer" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"refresh_interval" : "5s"
}
},
"mappings" : { },
"aliases" : { }
}
Searching for the keyword Yaoundé returns 70 hits but when I search for Yaounde I keep having no hit.
Below is my query for the second case
{
"query": {
"query_string": {
"query": "yaounde",
"fields": [
"title"
]
}
},
"from": 0,
"size": 10
}
Please can somebody help me guess what am doing wrong here?
Also knowing that my data is analyzed by Logstash during the index process do I really have to specify that the analyzer myCustomAnalyzer should be applied during the research as per this second query ?
{
"query": {
"query_string": {
"query": "yaounde",
"fields": [
"title"
],
"analyzer": "myCustomAnalyzer"
}
},
"from": 0,
"size": 10
}
Here is a sample of the output part of my logstash config file
output {
stdout { codec => json_lines }
if [type] == "announces" {
elasticsearch {
hosts => "localhost:9200"
document_id => "%{job_id}"
index => "dooone"
document_type => "%{type}"
}
} else {
elasticsearch {
hosts => "localhost:9200"
document_id => "%{uid}"
index => "dootwo"
document_type => "%{type}"
}
}
}
Thank You
A good place to start is with the index template documentation of elasticsearch:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html
An example for your scenario that could work for the title field:
"custom_template" : {
"order" : 1,
"template" : "doo*",
"settings" : {
"index" : {
"analysis" : {
"analyzer" : {
"myCustomAnalyzer" : {
"filter" : [
"standard",
"lowercase",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"refresh_interval" : "5s"
}
},
"mappings" : {
"your_type": {
"properties": {
"title": {
"type": "text",
"analyzer": "myCustomAnalyzer"
}
}
}
},
"aliases" : { }
}
An alternative would be to change the dynamic mapping. You can find a good example right here for strings.
https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html#match-mapping-type
Can you show the mapping of your document ?
(GET /my_index/my_doc/_mapping )
The analyser you provide as argument in your query does only apply at search time, not at indexation time. So if you haven't set this analyser in your mapping, the string is still indexed with "default" analyzer, so it will not match your results.
The analyzer you provide at search time will apply on your query string, but then it will look into indexed data, which is indexed as "Yaoundé", not "yaounde".

logstash output to elasticsearch index and mapping

I'm trying to have logstash output to elasticsearch but I'm not sure how to use the mapping I defined in elasticsearch...
In Kibana, I did this:
Created an index and mapping like this:
PUT /kafkajmx2
{
"mappings": {
"kafka_mbeans": {
"properties": {
"#timestamp": {
"type": "date"
},
"#version": {
"type": "integer"
},
"host": {
"type": "keyword"
},
"metric_path": {
"type": "text"
},
"type": {
"type": "keyword"
},
"path": {
"type": "text"
},
"metric_value_string": {
"type": "keyword"
},
"metric_value_number": {
"type": "float"
}
}
}
}
}
Can write data to it like this:
POST /kafkajmx2/kafka_mbeans
{
"metric_value_number":159.03478490788203,
"path":"/home/usrxxx/logstash-5.2.0/bin/jmxconf",
"#timestamp":"2017-02-12T23:08:40.934Z",
"#version":"1","host":"localhost",
"metric_path":"node1.kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec.FifteenMinuteRate",
"type":null
}
now my logstash output looks like this:
input {
kafka {
kafka details here
}
}
output {
elasticsearch {
hosts => "http://elasticsearch:9050"
index => "kafkajmx2"
}
}
and it just writes it to the kafkajmx2 index but doesn't use the map, when I query it like this in kibana:
get /kafkajmx2/kafka_mbeans/_search?q=*
{
}
I get this back:
{
"_index": "kafkajmx2",
"_type": "logs",
"_id": "AVo34xF_j-lM6k7wBavd",
"_score": 1,
"_source": {
"#timestamp": "2017-02-13T14:31:53.337Z",
"#version": "1",
"message": """
{"metric_value_number":0,"path":"/home/usrxxx/logstash-5.2.0/bin/jmxconf","#timestamp":"2017-02-13T14:31:52.654Z","#version":"1","host":"localhost","metric_path":"node1.kafka.server:type=SessionExpireListener,name=ZooKeeperAuthFailuresPerSec.Count","type":null}
"""
}
}
how do I tell it to use the map kafka_mbeans in the logstash output?
-----EDIT-----
I tried my output like this but still get the same results:
output {
elasticsearch {
hosts => "http://10.204.93.209:9050"
index => "kafkajmx2"
template_name => "kafka_mbeans"
codec => plain {
format => "%{message}"
}
}
}
the data in elastic search should look like this:
{
"#timestamp": "2017-02-13T14:31:52.654Z",
"#version": "1",
"host": "localhost",
"metric_path": "node1.kafka.server:type=SessionExpireListener,name=ZooKeeperAuthFailuresPerSec.Count",
"metric_value_number": 0,
"path": "/home/usrxxx/logstash-5.2.0/bin/jmxconf",
"type": null
}
--------EDIT 2--------------
I atleast got the message to parse into json by adding a filter like this:
input {
kafka {
...kafka details....
}
}
filter {
json {
source => "message"
remove_field => ["message"]
}
}
output {
elasticsearch {
hosts => "http://node1:9050"
index => "kafkajmx2"
template_name => "kafka_mbeans"
}
}
It doesn't use the template still but this atleast parses the json correctly...so now I get this:
{
"_index": "kafkajmx2",
"_type": "logs",
"_id": "AVo4a2Hzj-lM6k7wBcMS",
"_score": 1,
"_source": {
"metric_value_number": 0.9967205071482902,
"path": "/home/usrxxx/logstash-5.2.0/bin/jmxconf",
"#timestamp": "2017-02-13T16:54:16.701Z",
"#version": "1",
"host": "localhost",
"metric_path": "kafka1.kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent.Value",
"type": null
}
}
What you need to change is very simple. First use the json codec in your kafka input. No need for the json filter, you can remove it.
kafka {
...kafka details....
codec => "json"
}
Then in your elasticsearch output you're missing the mapping type (parameter document_type below), which is important otherwise it defaults to logs (as you can see) and that doesn't match your kafka_mbeans mapping type. Moreover, you don't really need to use template since your index already exists. Make the following modification:
elasticsearch {
hosts => "http://node1:9050"
index => "kafkajmx2"
document_type => "kafka_mbeans"
}
This is defined with the template_name parameter on the elasticsearch output.
elasticsearch {
hosts => "http://elasticsearch:9050"
index => "kafkajmx2"
template_name => "kafka_mbeans"
}
One warning, though. If you want to start creating indexes that are boxed on time, such as one index a week, you will have to take a few more steps to ensure your mapping stays with each. You have a couple of options there:
Create an elasticsearch template, and define it to apply to indexes using a glob. Such as kafkajmx2-*
Use the template parameter on the output, which specifies a JSON file that defines your mapping that will be used with all indexes created through that output.

Resources