Reindexing in Elasticsearch 1.7 - elasticsearch

There is a problem with our mappings for elasticsearch 1.7. I am fixing the problem by creating a new index with the correct mappings. I understand that since I am creating a new index I will have to reindex from old index with existing data to the new index I have just created. Problem is I have googled around and can't find a way to reindex from old to new. Seems like the reindex API was introduced in ES 2.3 and not supported for 1.7.
My question is how do I reindex my data from old to new after fixing my mappings. Alternatively, what is the best practice for making mapping changes in ES 1.7?
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html will not work for me because we're on an old version of ES (1.7)
https://www.elastic.co/blog/changing-mapping-with-zero-downtime
Initially went down that path but got stuck, need a way to reindex the old to the new

Late for your use case, but wanted to put it out there for others. This is an excellent step-by-step guide on how to reindex an Elasticsearch index using Logstash version 1.5 while maintaining the integrity of the original data: http://david.pilato.fr/blog/2015/05/20/reindex-elasticsearch-with-logstash/
This is the logstash-simple.conf the author creates:
Input {
# We read from the "old" cluster
elasticsearch {
hosts => [ "localhost" ]
port => "9200"
index => "index"
size => 500
scroll => "5m"
docinfo => true
}
}
filter {
mutate {
remove_field => [ "#timestamp", "#version" ]
}
}
output {
# We write to the "new" cluster
elasticsearch {
host => "localhost"
port => "9200"
protocol => "http"
index => "new_index"
index_type => "%{[#metadata][_type]}"
document_id => "%{[#metadata][_id]}"
}
# We print dots to see it in action
stdout {
codec => "dots"
}

There are a few options for you:
use logstash - it's very easy to create a reindex config in logstash and use that to reindex your documents. for example:
input {
elasticsearch {
hosts => [ "localhost" ]
port => "9200"
index => "index1"
size => 1000
scroll => "5m"
docinfo => true
}
}
output {
elasticsearch {
host => "localhost"
port => "9200"
protocol => "http"
index => "index2"
index_type => "%{[#metadata][_type]}"
document_id => "%{[#metadata][_id]}"
}
}
The problem with this approach that it'll be relatively slow since you'll have only a single machine that peforms the reindexing process.
another option, use this tool. It'll be faster than logstash but you'll have to provide a segmentation logic for all your documents to speed up the processing. For example, if you have a numeric fields whose values range from 1 - 100, then you could segment the queries in the tool for, maybe, 10 intervals (1 - 10, 11 - 20, ... 91 - 100), so the tool will spawn up 10 indexers that will work in parallel reindexing your old index.

Related

Logstash: Missing data after migration

I have been migrating one of the indexes in self-hosted Elasticsearch to amazon-elasticsearch using Logstash. we have around 1812 documents in our self-hosted Elasticsearch but in amazon-elasticsearch, we have only about 637 documents. Half of the documents are missing after migration.
Our logstash config file
input {
elasticsearch {
hosts => ["https://staing-example.com:443"]
user => "userName"
password => "password"
index => "testingindex"
size => 100
scroll => "1m"
}
}
filter {
}
output {
amazon_es {
hosts => ["https://example.us-east-1.es.amazonaws.com:443"]
region => "us-east-1"
aws_access_key_id => "access_key_id"
aws_secret_access_key => "access_key_id"
index => "testingindex"
}
stdout{
codec => rubydebug
}
}
We have tried for some of the other indexes as well but it still migrating only half of the documents.
Make sure to compare apples to apples by running GET index/_count on your index on both sides.
You might see more or less documents depending on where you look (Elasticsearch HEAD plugin, Kibana, Cerebro, etc) and if replicas are taken into account in the count or not.
In your case you had more replicas in your local environment than in your AWS Elasticsearch service, hence the different count.

ElasticSearch assign own IDs while indexing with LogStash

I am indexing a large corpora of information and I have a string-key that I know is unique. I would like to avoid using the search and rather access documents by this artificial identifier.
Since the Path directive is discontinued in ES 1.5, anyone know a workaround to this problem!?
My data look like:
{unique-string},val1, val2, val3...
{unique-string2},val4, val5, val6...
I am using logstash to index the files and would prefer to fetch the documents through a direct get, rather than through an exact-match.
In your elasticsearch output plugin, just specify the document_id setting with a reference to the field you want to use as id, i.e. the one named 1 in your csv filter.
input {
file {...}
}
filter {
csv{
columns=>["1","2","3"]
separator => ","
}
}
output {
elasticsearch {
action => "index"
host => "localhost"
port => "9200"
index => "index-name"
document_id => "%{1}" <--- add this line
workers => 2
cluster => "elasticsearch-cluster"
protocol => "http"
}
}

How to efficiently move data from elasticsearch index (with one shard) to another index (with 5 shards)?

I have an elasticsearch index which contains around 5 GB of data on a single node in a single shard. Now I have created another index with same settings as older one, but with number_of_shards as 5 instead of 1.
I am looking for the most efficient approach to copy the data from older index to newer index without any downtime.
I would suggest using Logstash for this. You could use the following configuration. Make sure to replace the source and target hosts, as well as the index and type names to match your local environment.
File: reindex.conf
input {
elasticsearch {
hosts => "localhost:9200" <---- your source host
index => "my_source_index"
}
}
filter {
mutate {
remove_field => [ "#version", "#timestamp" ]
}
}
output {
elasticsearch {
host => "localhost" <--- your target host
port => 9200
protocol => "http"
manage_template => false
index => "my_target_index"
document_type => "my_type"
workers => 5
}
}
And then you can simply launch it with
bin/logstash -f reindex.conf

Logstash reindex elasticsearch Issue

I'm trying to reindex the data by having elasticsearch as input and sending back to elasticsearch as output. The script is running fine but the indexing is going indefinitely. The script is as below
input {
elasticsearch {
host => "10.0.0.11"
index => "logstash-2015.02.05"
}
}
output {
elasticsearch {
host => "10.0.0.11"
protocol => "http"
cluster => "logstash"
node_name => "logindexer"
index => "logstash-2015.02.05_new"
}
}
This means if I have 200 docs under logstash-2015.02.05 index then it creates duplicate records in logstash-2015.02.05_new and keeps going until I stop the logstash agent. Is there a way to just restrict the documents in new index to have exactly the same as the old index? Pls help.

Copy ElasticSearch-Index with Logstash

I have an ready-build Apache-Index on one machine, that I would like to clone to another machine using logstash. Fairly easy i thought
input {
elasticsearch {
host => "xxx.xxx.xxx.xxx"
index => "logs"
}
}
filter {
}
output {
elasticsearch {
cluster => "Loa"
host => "127.0.0.1"
protocol => http
index => "logs"
index_type => "apache_access"
}
}
that pulls over the docs, but doesn't stop as it uses the default query "*" (the original index has ~50.000 docs and I killed the former script, when the new index was over 600.000 docs and rising)
Next I tried to make sure the docs would get updated instead of duplicated, but this commit hasn't made it yet, so i don't have a primary..
Then I remembered sincedb but don't seem to be able to use that in the query (or is that possible)
Any advice? Maybe a complete different approach? Thanks a lot!
Assuming that the elasticsearch input creates a logstash event with the document id ( I assume it will be _id or something similar), try setting the elastic search output the following way:
output {
elasticsearch {
cluster => "Loa"
host => "127.0.0.1"
protocol => http
index => "logs"
index_type => "apache_access"
document_id => "%{_id}"
}
}
That way, even if the elasticsearch input, for whatever reason, continues to push the same documents indefinitely, elasticsearch will merely updated the existing documents, instead of creating new documents with new ids.
Once you reach 50,000, you can stop.

Resources