Elasticsearch - Reindex documents with stored / excluded fields

I have an index mapping with the following configuration:
"mappings" : {
"_source" : {
"excludes" : [
"special_field"
]
},
"properties" : {
"special_field" : {
"type" : "text",
"store" : true
},
}
}
So, when a new document is indexed using this mapping, I get the following result:
{
  "_index": "********-2021",
  "_id": "************",
  "_source": {
    ...
  },
  "fields": {
    "special_field": [
      "my special text"
    ]
  }
}
If a _search query is performed, special_field is not returned inside _source, as it is excluded.
With the following _search query, special_field data is returned perfectly:
GET ********-2021/_search
{
  "stored_fields": [ "special_field" ],
  "_source": true
}
Right now I'm trying to reindex all documents inside that index, but I'm losing the info stored in special_field and only the _source field is getting reindexed.
Is there a way to put that special_field back inside the _source field?
Is there a way to reindex those documents without losing the special_field data?
How could these documents be migrated to another cluster without losing the special_field data?
Thank you all.

Thanks Hamid Bayat, I finally got it working using a small Logstash pipeline.
I will share it:
input {
  elasticsearch {
    hosts => "my-first-cluster:9200"
    index => "my-index-pattern-*"
    user => "****"
    password => "****"
    query => '{ "stored_fields": [ "special_field" ], "_source": true }'
    size => 500
    scroll => "5m"
    docinfo => true
    docinfo_fields => ["_index", "_type", "_id", "fields"]
  }
}
filter {
  if [@metadata][fields][special_field] {
    mutate {
      add_field => { "special_field" => "%{[@metadata][fields][special_field]}" }
    }
  }
}
output {
  elasticsearch {
    hosts => ["http://my-second-cluster:9200"]
    password => "****"
    user => "****"
    index => "%{[@metadata][_index]}"
    document_id => "%{[@metadata][_id]}"
    template => "/usr/share/logstash/config/index_template.json"
    template_name => "template-name"
    template_overwrite => true
  }
}
I had to add "fields" to docinfo_fields => ["_index", "_type", "_id", "fields"] in the elasticsearch input plugin, and all my stored_fields ended up in the [@metadata][fields] event field.
As the @metadata field is not indexed, I had to add a new field at the root level with the [@metadata][fields][special_field] value.
It's working like a charm.
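To double-check the migration, a quick search on the second cluster should now show special_field inside _source. A minimal sketch, assuming a concrete index name like my-index-pattern-2021 created from the pattern above:
GET my-index-pattern-2021/_search
{
  "_source": [ "special_field" ],
  "size": 1
}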

Related

Combine two indices into a third index in Elasticsearch using Logstash

I have two indices:
employee_data
{"code": 1, "name": "xyz", "city": "Mumbai"}
transaction_data
{"code": 1, "Month": "June", "payment": 78000}
I want a third index like this:
join_index
{"code": 1, "name": "xyz", "city": "Mumbai", "Month": "June", "payment": 78000}
How is this possible?
I am trying this in Logstash:
input {
  elasticsearch {
    hosts => "localhost"
    index => "employees_data,transaction_data"
    query => '{ "query": { "match": { "code": 1 } } }'
    scroll => "5m"
    docinfo => true
  }
}
output {
  elasticsearch {
    hosts => ["localhost"]
    index => "join1"
  }
}
You can use the elasticsearch input on employees_data.
In your filters, use the elasticsearch filter on transaction_data:
input {
  elasticsearch {
    hosts => "localhost"
    index => "employees_data"
    query => '{ "query": { "match_all": { } } }'
    sort => "code:desc"
    scroll => "5m"
    docinfo => true
  }
}
filter {
  elasticsearch {
    hosts => "localhost"
    index => "transaction_data"
    query => "code:\"%{[code]}\""
    fields => {
      "Month" => "Month",
      "payment" => "payment"
    }
  }
}
output {
  elasticsearch {
    hosts => ["localhost"]
    index => "join1"
  }
}
And send your new document to your third index with the elasticsearch output.
You'll have 3 Elasticsearch connections and the result can be a little slow.
But it works.
You don't need Logstash to do this; Elasticsearch itself supports it by leveraging the enrich processor.
First, you need to create an enrich policy (use the smallest index; let's say it's employees_data):
PUT /_enrich/policy/employee-policy
{
  "match": {
    "indices": "employees_data",
    "match_field": "code",
    "enrich_fields": ["name", "city"]
  }
}
Then you can execute that policy in order to create an enrichment index
POST /_enrich/policy/employee-policy/_execute
When the enrichment index has been created and populated, the next step requires you to create an ingest pipeline that uses the above enrich policy/index:
PUT /_ingest/pipeline/employee_lookup
{
  "description": "Enriching transactions with employee data",
  "processors": [
    {
      "enrich": {
        "policy_name": "employee-policy",
        "field": "code",
        "target_field": "tmp",
        "max_matches": "1"
      }
    },
    {
      "script": {
        "if": "ctx.tmp != null",
        "source": "ctx.putAll(ctx.tmp); ctx.remove('tmp');"
      }
    }
  ]
}
Finally, you're now ready to create your target index with the joined data. Simply leverage the _reindex API combined with the ingest pipeline we've just created:
POST _reindex
{
  "source": {
    "index": "transaction_data"
  },
  "dest": {
    "index": "join1",
    "pipeline": "employee_lookup"
  }
}
After running this, the join1 index will contain exactly what you need, for instance:
{
  "_index": "join1",
  "_type": "_doc",
  "_id": "0uA8dXMBU9tMsBeoajlw",
  "_score": 1.0,
  "_source": {
    "code": 1,
    "name": "xyz",
    "city": "Mumbai",
    "Month": "June",
    "payment": 78000
  }
}
As far as I know, this cannot be done just using the Elasticsearch APIs. To handle it, you need to set a unique ID for documents that are related. For example, the code that you mentioned in your question can be a good ID for the documents. So you can reindex the first index into the third one, then use the Update API to read documents from the second index and update them by their IDs in the third index. I hope I could help.
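A minimal sketch of that idea, assuming the index names from the question, a recent Elasticsearch version, and code used as the document ID (the reindex script and the single hand-written update are only illustrations):
POST _reindex
{
  "source": { "index": "employee_data" },
  "dest": { "index": "join_index" },
  "script": { "source": "ctx._id = ctx._source.code.toString()" }
}

POST join_index/_update/1
{
  "doc": { "Month": "June", "payment": 78000 }
}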

Replica and shard settings not applied in elasticsearch template

I've added a template like this:
curl -X PUT "e.f.g.h:9200/_template/impression-template" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["impression-%{+YYYY.MM.dd}"],
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 2
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      },
      "dynamic": false,
      "properties": {
        "message": {
          "type": "object",
          "properties": {
            ...
And I have a Logstash instance that reads events from Kafka and writes them to ES. Here is my Logstash config:
input {
  kafka {
    topics => ["impression"]
    bootstrap_servers => "a.b.c.d:9092"
  }
}
filter {
  json {
    source => "message"
    target => "message"
  }
}
output {
  elasticsearch {
    hosts => ["e.f.g.h:9200"]
    index => "impression-%{+YYYY.MM.dd}"
    template_name => "impression-template"
  }
}
But each day I get an index with 5 shards and 1 replica (which is the default config of ES). How can I fix that so I get 2 replicas and 2 shards?
I'm not sure you can set index_patterns to my_index-%{+YYYY.MM.dd}, because when Logstash creates my_index-2019.03.10 it will get an empty mapping, since the pattern is not recognized as matching that name. I had the same issue, and my workaround was to set index_patterns to my_index-* and add a year suffix to the indices, so they look like my_index-2017, my_index-2018, and so on.
{
  "my_index_template": {
    "order": 0,
    "index_patterns": [
      "my_index-*"
    ],
    "settings": {
      "index": {
        "number_of_shards": "5",
        "number_of_replicas": "1"
      }
    },...
I took the year part from the timestamp field (YYYY-MM-dd) and appended it to the end of the index name in Logstash:
filter {
  grok {
    match => [
      "timestamp", "(?<index_year>%{YEAR})"
    ]
  }
  mutate {
    add_field => {
      "[@metadata][index_year]" => "%{index_year}"
    }
  }
  mutate {
    remove_field => [ "index_year", "@version" ]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_index-%{[@metadata][index_year]}"
    document_id => "%{some_field}"
  }
}
After Logstash finished, I ended up with my_index-2017, my_index-2018 and my_index-2019 indices with 5 shards, 1 replica, and the correct mapping as predefined in my template.
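Applying the same idea to the daily impression-* indices from the original question, a plain wildcard in index_patterns should be enough for the template to match. A minimal sketch with only the settings from the question (mappings omitted):
PUT _template/impression-template
{
  "index_patterns": ["impression-*"],
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 2
  }
}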

How to transfer data to Elasticsearch via Logstash and use an analyzer?

I have the Logstash config file below. Elastic is reading my data as "a b" whereas I want it to read it as "ab". I found I need to use not_analyzed for my Sscat field, and max_shingle_size / min_shingle_size for Products, to get the best result.
Should I use not_analyzed for the Products field as well? Will that give a better result?
How should I fill in my_id_analyzer to actually use the analyzer on different fields?
How should I connect the template with the Logstash config file?
input {
  file {
    path => "path"
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    columns => ["Index", "Category", "Scat", "Sscat", "Products", "Measure", "Price", "Description", "Gst"]
  }
  mutate { convert => ["Index", "float"] }
  mutate { convert => ["Price", "float"] }
  mutate { convert => ["Gst", "float"] }
}
output {
  elasticsearch {
    hosts => "host"
    user => "elastic"
    password => "pass"
    index => "masterdb"
  }
}
I also have a template that can do it for all the future files that I upload:
curl -u user:pass -XPUT "host:9200/_template/logstash-id" -H 'Content-Type: application/json' -d '{
  "template": "logstash-*",
  "settings": {
    "analysis": {
      "analyzer": {
        "my_id_analyzer": {
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": { "type": "string", "analyzer": "my_id_analyzer" }
    }
  }
}'
You can use "ignore_above:" to restrict to a max length along with "not_analyzed" while creating mapping so that text doesn't get analyzed.
Declaring type as keyword instead of text will be other alternative for you.
Regarding the connecting template with logstash, why you need this? Once you have template created on elasticsearch, you can create your index which will follow the created template definition and you can start indexing.
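For example, a template mapping using keyword with ignore_above could look like the sketch below; the Sscat field name and the 256 limit are just placeholders, and older Elasticsearch versions may additionally require a document type level inside mappings:
PUT _template/logstash-id
{
  "template": "logstash-*",
  "mappings": {
    "properties": {
      "Sscat": { "type": "keyword", "ignore_above": 256 }
    }
  }
}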

ElasticSearch 5.0.0 - error about object name is already in use

I am learning Elasticsearch and have hit a block. I am trying to use Logstash to load a simple CSV into Elasticsearch. This is the data; each row is a postcode, longitude, latitude:
ZE1 0BH,-1.136758103355,60.150855671143
ZE1 0NW,-1.15526666950369,60.1532197533966
I am using the following Logstash conf file to filter the CSV and create a "location" field:
input {
  file {
    path => "postcodes.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    columns => ["postcode", "lat", "lon"]
    separator => ","
  }
  mutate { convert => {"lat" => "float"} }
  mutate { convert => {"lon" => "float"} }
  mutate { rename => {"lat" => "[location][lat]"} }
  mutate { rename => {"lon" => "[location][lon]"} }
  mutate { convert => { "[location]" => "float" } }
}
output {
  elasticsearch {
    action => "index"
    hosts => "localhost"
    index => "postcodes"
  }
  stdout { codec => rubydebug }
}
And I have added the mapping to Elasticsearch using the console in Kibana:
PUT postcodes
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "feature": {
      "_all": { "enabled": true },
      "properties": {
        "postcode": {"type": "text"},
        "location": {"type": "geo_point"}
      }
    }
  }
}
I check the mappings for the index using:
GET postcodes/_mapping
{
  "postcodes": {
    "mappings": {
      "feature": {
        "_all": {
          "enabled": true
        },
        "properties": {
          "location": {
            "type": "geo_point"
          },
          "postcode": {
            "type": "text"
          }
        }
      }
    }
  }
}
So this all seems to be correct, having looked at the documentation and the other questions posted.
However, when I run
bin/logstash -f postcodes.conf
I get an error:
[location] is defined as an object in mapping [logs] but this name is already used for a field in other types
I have tried a number of alternative methods:
Deleted the index, created a template.json, and changed my conf file to have the extra settings:
manage_template => true
template => "postcode_template.json"
template_name => "open_names"
template_overwrite => true
and this gets the same error.
I have managed to get the data loaded by not supplying a template; however, the data never gets loaded as a geo_point, so you cannot use the Kibana Tile Map to visualise the data.
Can anyone explain why I am receiving that error and what method I should use?
Your problem is that you don't have document_type => "feature" on your elasticsearch output. Without that, it's going to create the object under the type logs, which is why you are getting this conflict.
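In other words, the output section from the question would become something like this sketch, where only the document_type line is new:
output {
  elasticsearch {
    action => "index"
    hosts => "localhost"
    index => "postcodes"
    document_type => "feature"
  }
  stdout { codec => rubydebug }
}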

Logstash couchdb_changes doesn't correctly propagate document deletion to Elasticsearch

I am trying to use the couchdb_changes Logstash plugin to detect my CouchDB changes and update the Elasticsearch index accordingly.
Document creations/updates work fine, but somehow deletions do not work.
Here is my Logstash configuration:
input {
  couchdb_changes {
    host => "localhost"
    db => "products"
    sequence_path => ".couchdb_products_seq"
    type => "product"
    tags => ["product"]
    keep_revision => true
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "products"
    # Pass the CouchDB document ID to Elastic, otherwise it is lost and Elastic generates a new one
    document_id => "%{[@metadata][_id]}"
  }
  # Debug
  stdout {
    codec => rubydebug {
      metadata => true
    }
  }
}
I came across this link but the "protocol" parameter no longer exists in the elasticsearch Logstash plugin, and I would expect such a huge bug to be fixed by now.
In my Logstash console I see this when I delete a CouchDB document (from Futon):
{
  "@version" => "1",
  "@timestamp" => "2016-05-13T14:06:55.734Z",
  "type" => "product",
  "tags" => [
    [0] "product"
  ],
  "@metadata" => {
    "_id" => "15d6f519d6827a2f28de4df1d40082d5",
    "action" => "delete",
    "seq" => 10020
  }
}
So instead of deleting the document with id "15d6f519d6827a2f28de4df1d40082d5", it replaces its content. Here is the document "15d6f519d6827a2f28de4df1d40082d5" after the deletion, in Elasticsearch:
curl -XGET 'localhost:9200/products/product/15d6f519d6827a2f28de4df1d40082d5?pretty'
{
  "_index" : "products",
  "_type" : "product",
  "_id" : "15d6f519d6827a2f28de4df1d40082d5",
  "_version" : 3,
  "found" : true,
  "_source" : {
    "@version" : "1",
    "@timestamp" : "2016-05-13T14:06:55.734Z",
    "type" : "product",
    "tags" : [ "product" ]
  }
}
Any idea why the deletion doesn't work? Is this a bug in the couchdb_changes plugin? In the elasticsearch plugin?
For information, here are my app versions:
Elasticsearch 2.3.2
Logstash 2.3.2
Apache CouchDB 1.6.1
I think I found the problem.
I had to manually add this line to the Logstash elasticsearch output configuration:
action => "%{[@metadata][action]}"
in order to pass the "delete" action from the metadata to Elasticsearch.
Now there is another issue with upsert, but it's tracked in a GitHub ticket.
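For reference, with that single line added, the elasticsearch output from the original configuration would look roughly like this (this is the state before the upsert workaround described in the edit below):
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "products"
    document_id => "%{[@metadata][_id]}"
    action => "%{[@metadata][action]}"
  }
}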
Edit: To bypass the upsert issue, I actually changed my configuration to this (mainly, adding a field to store whether the action is a delete):
input {
  couchdb_changes {
    host => "localhost"
    db => "products"
    sequence_path => ".couchdb_products_seq"
    type => "product"
    tags => ["product"]
    keep_revision => true
  }
}
filter {
  if [@metadata][action] == "delete" {
    mutate {
      add_field => { "elastic_action" => "delete" }
    }
  } else {
    mutate {
      add_field => { "elastic_action" => "index" }
    }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "products"
    document_id => "%{[@metadata][_id]}"
    action => "%{elastic_action}"
  }
  # Debug
  stdout {
    codec => rubydebug {
      metadata => true
    }
  }
}
I am nowhere near an expert in Logstash/Elasticsearch, but this seems to work for the moment.
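With this configuration in place, deleting a document in CouchDB should also remove it from Elasticsearch, so re-running the earlier lookup should report it as missing. A sketch of the expected result, reusing the document ID from above:
curl -XGET 'localhost:9200/products/product/15d6f519d6827a2f28de4df1d40082d5?pretty'
{
  "_index" : "products",
  "_type" : "product",
  "_id" : "15d6f519d6827a2f28de4df1d40082d5",
  "found" : false
}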
