I'm trying to reindex data by using elasticsearch as the input and sending it back to elasticsearch as the output. The script runs fine, but the indexing goes on indefinitely. The script is below:
input {
  elasticsearch {
    host => "10.0.0.11"
    index => "logstash-2015.02.05"
  }
}
output {
  elasticsearch {
    host => "10.0.0.11"
    protocol => "http"
    cluster => "logstash"
    node_name => "logindexer"
    index => "logstash-2015.02.05_new"
  }
}
This means that if I have 200 docs under the logstash-2015.02.05 index, it creates duplicate records in logstash-2015.02.05_new and keeps going until I stop the logstash agent. Is there a way to restrict the new index to exactly the same documents as the old index? Please help.
Related
There is a problem with our mappings for elasticsearch 1.7. I am fixing it by creating a new index with the correct mappings. I understand that since I am creating a new index, I will have to reindex the existing data from the old index to the new index I have just created. The problem is I have googled around and can't find a way to reindex from old to new. It seems the reindex API was introduced in ES 2.3 and is not supported for 1.7.
My question is how do I reindex my data from old to new after fixing my mappings. Alternatively, what is the best practice for making mapping changes in ES 1.7?
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html will not work for me because we're on an old version of ES (1.7)
https://www.elastic.co/blog/changing-mapping-with-zero-downtime
I initially went down that path but got stuck; I need a way to reindex from the old index to the new one.
Late for your use case, but wanted to put it out there for others. This is an excellent step-by-step guide on how to reindex an Elasticsearch index using Logstash version 1.5 while maintaining the integrity of the original data: http://david.pilato.fr/blog/2015/05/20/reindex-elasticsearch-with-logstash/
This is the logstash-simple.conf the author creates:
input {
  # We read from the "old" cluster
  elasticsearch {
    hosts => [ "localhost" ]
    port => "9200"
    index => "index"
    size => 500
    scroll => "5m"
    docinfo => true
  }
}
filter {
  mutate {
    remove_field => [ "@timestamp", "@version" ]
  }
}
output {
  # We write to the "new" cluster
  elasticsearch {
    host => "localhost"
    port => "9200"
    protocol => "http"
    index => "new_index"
    index_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
  }
  # We print dots to see it in action
  stdout {
    codec => "dots"
  }
}
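One detail worth making explicit: the target index should already exist with the corrected mappings before Logstash starts writing to it, otherwise dynamic mapping will simply recreate the old problem. A minimal sketch using the elasticsearch-py client; the host, index name, type name and field mapping are placeholders, not part of the original guide:

# Sketch: create the target index with the corrected mappings up front.
# On ES 1.x the properties sit under a mapping type ("my_type" is assumed here).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

es.indices.create(
    index="new_index",
    body={
        "mappings": {
            "my_type": {
                "properties": {
                    "some_field": {"type": "string", "index": "not_analyzed"}
                }
            }
        }
    },
)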
There are a few options for you:
Use Logstash - it's very easy to create a reindex config in logstash and use that to reindex your documents. For example:
input {
  elasticsearch {
    hosts => [ "localhost" ]
    port => "9200"
    index => "index1"
    size => 1000
    scroll => "5m"
    docinfo => true
  }
}
output {
  elasticsearch {
    host => "localhost"
    port => "9200"
    protocol => "http"
    index => "index2"
    index_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
  }
}
The problem with this approach is that it'll be relatively slow, since you'll have only a single machine performing the reindexing process.
Another option: use this tool. It'll be faster than logstash, but you'll have to provide segmentation logic for all your documents to speed up the processing. For example, if you have a numeric field whose values range from 1 to 100, you could segment the queries in the tool into, say, 10 intervals (1-10, 11-20, ..., 91-100), so the tool will spawn 10 indexers that work in parallel reindexing your old index.
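To illustrate the segmentation idea independently of any particular tool, here is a rough sketch using the elasticsearch-py client: each numeric interval is scanned separately and bulk-indexed into the new index by its own worker thread. The field name, interval boundaries, thread count and index names are assumptions for the example, not a definitive implementation:

# Sketch of segmented, parallel reindexing with scan + bulk.
# Assumes a numeric field "my_numeric_field" with values 1-100,
# source index "index1", target index "index2", cluster on localhost:9200.
from concurrent.futures import ThreadPoolExecutor
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def reindex_segment(lower, upper):
    # Scroll over only the documents whose field falls inside [lower, upper].
    query = {"query": {"range": {"my_numeric_field": {"gte": lower, "lte": upper}}}}
    hits = helpers.scan(es, query=query, index="index1", scroll="5m")
    actions = (
        {
            "_index": "index2",
            "_id": hit["_id"],
            "_source": hit["_source"],
            # clusters with mapping types (1.x/2.x) may also need "_type" here
        }
        for hit in hits
    )
    return helpers.bulk(es, actions)

# 10 intervals (1-10, 11-20, ..., 91-100), one worker per interval.
segments = [(i, i + 9) for i in range(1, 101, 10)]
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(lambda seg: reindex_segment(*seg), segments))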
Using the elasticsearch output in logstash, how can I update only the @timestamp for a log message if it is newer?
I don't want to reindex the whole document, nor have the same log message indexed twice.
Also, if the @timestamp is older, it must not update/replace the current version.
Currently, I'm doing this:
filter {
  if ("cloned" in [tags]) {
    fingerprint {
      add_tag => [ "lastlogin" ]
      key => "lastlogin"
      method => "SHA1"
    }
  }
}
output {
  if ("cloned" in [tags]) {
    elasticsearch {
      action => "update"
      doc_as_upsert => true
      document_id => "%{fingerprint}"
      index => "lastlogin-%{+YYYY.MM}"
      sniffing => true
      template_overwrite => true
    }
  }
}
It is similar to How to deduplicate documents while indexing into elasticsearch from logstash, but I do not want to always update the message field; only if the @timestamp field is more recent.
You can't decide at the Logstash level whether a document needs to be updated or left alone; this needs to be decided at the Elasticsearch level, which means you need to experiment and test with the _update API.
I suggest looking at https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html#upserts. Meaning, if the document exists, the script is executed (where you can check the @timestamp if you want); otherwise the content of upsert is indexed as a new document.
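As a rough illustration of that upsert idea (not the exact solution: it assumes a recent Elasticsearch with Painless, a 7.x-style elasticsearch-py client, and @timestamp stored as an ISO-8601 string, which compares chronologically), the update call could look roughly like this:

# Sketch: only move @timestamp forward; otherwise leave the document untouched.
# Index name, document id and event dict are placeholders for the example.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def upsert_last_login(doc_id, event):
    body = {
        "script": {
            "lang": "painless",
            # ISO-8601 strings sort chronologically, so compareTo is enough here.
            "source": (
                "if (ctx._source['@timestamp'] == null"
                " || params.ts.compareTo(ctx._source['@timestamp']) > 0) {"
                "  ctx._source['@timestamp'] = params.ts;"
                "} else {"
                "  ctx.op = 'none';"  # skip the update entirely
                "}"
            ),
            "params": {"ts": event["@timestamp"]},
        },
        # Used as-is when the document does not exist yet.
        "upsert": event,
    }
    es.update(index="lastlogin-2017.01", id=doc_id, body=body)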
I'm running a logstash instance that reads some records from kafka and inserts them into elasticsearch.
I had a problem with the elasticsearch configuration and new records weren't being inserted into elasticsearch.
Eventually I was able to fix the elasticsearch output. But even though the elasticsearch output wasn't able to write the records, logstash didn't stop reading more data from kafka.
So when I restarted logstash, it didn't pick up from the offset of the last record that was successfully written to elasticsearch. Basically I lost all the records that arrived after the elasticsearch output stopped writing.
How can I prevent that from happening again? Is there a way to stop the whole pipeline when there is an error in the output?
My simplified config file:
input {
  kafka {
    zk_connect => "zk01:2181,zk02:2181,zk03:2181/kafka"
    topic_id => "my-topic"
    auto_offset_reset => "smallest"
    group_id => "logstash-es"
    codec => "json"
  }
}
output {
  elasticsearch {
    index => "index-%{+YYYY-MM-dd}"
    document_type => "dev"
    hosts => ["elasticsearch01","elasticsearch02","elasticsearch03","elasticsearch04","elasticsearch05","elasticsearch06"]
    template => "/my-template.json"
    template_overwrite => true
    manage_template => true
  }
}
Now I have a question. My logstash configuration file is as follows:
input {
  redis {
    host => "127.0.0.1"
    port => 6379
    db => 10
    data_type => "list"
    key => "local_tag_del"
  }
}
filter {
}
output {
  elasticsearch {
    action => "delete"
    hosts => ["127.0.0.1:9200"]
    codec => "json"
    index => "mbd-data"
    document_type => "localtag"
    document_id => "%{album_id}"
  }
  file {
    path => "/data/elasticsearch/result.json"
  }
  stdout {}
}
I want to read IDs from Redis with Logstash and have Elasticsearch delete the corresponding documents.
Excuse me, my English is poor; I hope someone can help me.
Thanks.
I can't help you particularly, because your problem is spelled out in your error message - logstash couldn't connect to your elasticsearch instance.
That usually means one of:
elasticsearch isn't running
elasticsearch isn't bound to localhost
That has nothing to do with your logstash config. Using logstash to delete documents is a bit unusual though, so I'm not entirely sure this isn't an XY problem.
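For the two causes above, a quick way to confirm whether Elasticsearch is reachable on localhost is to hit the root endpoint (default host and port assumed here):

# Quick reachability check: a running node answers on port 9200 with
# cluster and version info; a connection error points at the causes above.
import requests

try:
    resp = requests.get("http://127.0.0.1:9200", timeout=5)
    print(resp.json())  # cluster name, version, etc.
except requests.ConnectionError:
    print("Could not connect: Elasticsearch is not running or not bound to localhost")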
I have a ready-built Apache index on one machine that I would like to clone to another machine using logstash. Fairly easy, I thought:
input {
  elasticsearch {
    host => "xxx.xxx.xxx.xxx"
    index => "logs"
  }
}
filter {
}
output {
  elasticsearch {
    cluster => "Loa"
    host => "127.0.0.1"
    protocol => http
    index => "logs"
    index_type => "apache_access"
  }
}
That pulls over the docs, but it doesn't stop, since it uses the default query "*" (the original index has ~50,000 docs, and I killed the former run when the new index was over 600,000 docs and rising).
Next I tried to make sure the docs would get updated instead of duplicated, but this commit hasn't made it in yet, so I don't have a primary...
Then I remembered sincedb, but I don't seem to be able to use that in the query (or is that possible?).
Any advice? Maybe a completely different approach? Thanks a lot!
Assuming that the elasticsearch input creates a logstash event with the document id (I assume it will be _id or something similar), try setting the elasticsearch output the following way:
output {
  elasticsearch {
    cluster => "Loa"
    host => "127.0.0.1"
    protocol => http
    index => "logs"
    index_type => "apache_access"
    document_id => "%{_id}"
  }
}
That way, even if the elasticsearch input, for whatever reason, continues to push the same documents indefinitely, elasticsearch will merely update the existing documents instead of creating new documents with new ids.
Once you reach 50,000, you can stop.
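If you would rather not watch the count by hand, a small polling loop can tell you when the target index has caught up with the source; only then do you stop the Logstash agent. The hosts and index names below are placeholders taken from the question (a sketch with the elasticsearch-py client):

# Sketch: poll the target cluster until its "logs" index holds as many
# documents as the source, then it is safe to stop Logstash.
import time
from elasticsearch import Elasticsearch

source = Elasticsearch(["http://xxx.xxx.xxx.xxx:9200"])  # original machine (placeholder)
target = Elasticsearch(["http://127.0.0.1:9200"])        # the new "Loa" cluster

expected = source.count(index="logs")["count"]           # ~50,000 in this case
while target.count(index="logs")["count"] < expected:
    time.sleep(10)
print("Target index has caught up (%d docs); stop the Logstash agent." % expected)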