ElasticSearch - How to make a 1-to-1 copy of an existing index - elasticsearch

I'm using Elasticsearch 2.3.3 and trying to make an exact copy of an existing index. (using the reindex plugin bundled with Elasticsearch installation)
The problem is that the data is copied but settings such as the mapping and the analyzer are left out.
What is the best way to make an exact copy of an existing index, including all of its settings?
My main goal is to create a copy, change the copy and only if all went well switch an alias to the copy. (Zero downtime backup and restore)

In my opinion, the best way to achieve this would be to leverage index templates. Index templates allow you to store a specification of your index, including settings (hence analyzers) and mappings. Then whenever you create a new index which matches your template, ES will create the index for you using the settings and mappings present in the template.
So, first create an index template called index_template with the template pattern myindex-*:
PUT /_template/index_template
{
"template": "myindex-*",
"settings": {
... your settings ...
},
"mappings": {
"type1": {
"properties": {
... your mapping ...
}
}
}
}
What will happen next is that whenever you want to index a new document in any index whose name matches myindex-*, ES will use this template (+settings and mappings) to create the new index.
So say your current index is called myindex-1 and you want to reindex it into a new index called myindex-2. You'd send a reindex query like this one
POST /_reindex
{
"source": {
"index": "myindex-1"
},
"dest": {
"index": "myindex-2"
}
}
myindex-2 doesn't exist yet, but it will be created in the process using the settings and mappings of index_template because the name myindex-2 matches the myindex-* pattern.
Simple as that.

The following seems to achieve exactly what I wanted:
Using Snapshot And Restore I was able to restore to a different index:
POST /_snapshot/index_backup/snapshot_1/_restore
{
"indices": "original_index",
"ignore_unavailable": true,
"include_global_state": false,
"rename_pattern": "original_index",
"rename_replacement": "replica_index"
}
As far as I can currently tell, it has accomplished exactly what I needed.
A 1-to-1 copy of my original index.
I also suspect this operation has better performance than re-indexing for my purposes.

I'm facing the same issue when using the reindex API.
Basically I'm merging daily, weekly, monthly indices to reduce shards.
We have a lot of indices with different data inputs, and maintaining a template for all cases is not an option. Thus we use dynamic mapping.
Due to dynamic mapping the reindex process can produce conflicts if your data is complicated, say json stored in a string field, and the reindexed field can end up as something else.
Sollution:
Copy the mapping of your source index
Create a new index, applying the mapping
Disable dynamic mapping
Start the reindex process.
A script can be created, and should of course have error checking in place.
Abbreviated scripts below.
Create a new empty index with the mapping from an original index.:
#!/bin/bash
SRC=$1
DST=$2
# Create a temporary file for holding the SRC mapping
TMPF=$(mktemp)
# Extract the SRC mapping, use `jq` to get the first record
# write to TMPF
curl -f -s "${URL:?}/${SRC}/_mapping | jq -M -c 'first(.[])' > ${TMPF:?}
# Create the new index
curl -s -H 'Content-Type: application/json' -XPUT ${URL:?}/${DST} -d #${TMPF:?}
# Disable dynamic mapping
curl -s -H 'Content-Type: application/json' -XPUT \
${URL:?}/${DST}/_mapping -d '{ "dynamic": false }'
Start reindexing
curl -s -XPOST "${URL:?}" -H 'Content-Type: application/json' -d'
{
"conflicts": "proceed",
"source": {
"index": "'${SRC}'"
},
"dest": {
"index": "'${DST}'",
"op_type": "create"
}
}'

Related

ElasticSearch - How to merge indexes into one index?

My cluster has an index for each day since a few months ago,
5 shards each index (the default),
and I can't run queries on the whole cluster because there are too many shards (over 1000).
The document IDs are automatically generated.
How can I combine the indexes into one index, deal with conflicting ids (if conflicts are even possible), and change the types?
I am using ES version 5.2.1
Common problem that is visible only after few months of using ELK stack with filebeat creating indices day by day. There is a few options to fix the performance issue here.
_forcemerge
First you can use _forcemerge to limit the numer of segments inside Lucene index. Operation won't limit or merge indices but will improve the performance of Elasticsearch.
curl -XPOST 'localhost:9200/logstash-2017.07*/_forcemerge?max_num_segments=1'
This will run through the whole month indices and force merge segments. When done for every month, it should improve the Elasticsearch performance a lot. In my case CPU usage went down from 100% to 2.7%.
Unfortunately this won't solve the shards problem.
_reindex
Please read the _reindex documentation and backup your database before continue.
As tomas mentioned. If you want to limit number of shards or indices there is no other option than use _reindex to merge few indices into one. This can take a while depending on the number and size of indices you have.
Destination index
You can create the destination index beforehand and specify number of shards it should contain. This will ensure your final index will have the number of shards you need.
curl -XPUT 'localhost:9200/new-logstash-2017.07.01?pretty' -H 'Content-Type: application/json' -d'
{
"settings" : {
"index" : {
"number_of_shards" : 1
}
}
}
'
Limiting number of shards
If you want to limit number of shards per index you can run _reindex one to one. In this case there should be no entries dropped as it will be exact copy but with smaller number of shards.
curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
"conflicts": "proceed",
"source": {
"index": "logstash-2017.07.01"
},
"dest": {
"index": "logstash-v2-2017.07.01",
"op_type": "create"
}
}
'
After this operation you can remove old index and use new one. Unfortunately if you want to use old name you need to _reindex one more time with new name. If you decide to do that
DON'T FORGET TO SPECIFY NUMBER OF SHARDS FOR THE NEW INDEX! By default it will fall back to 5.
Merging multiple indices and limiting number of shards
curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
"conflicts": "proceed",
"source": {
"index": "logstash-2017.07*"
},
"dest": {
"index": "logstash-2017.07",
"op_type": "create"
}
}
'
When done you should have all entries from logstash-2017.07.01 to logstash-2017.07.31 merged into logstash-2017.07. Note that the old indices must be deleted manually.
Some of the entries can be overwritten or merged, depending which conflicts and op_type option you choose.
Further steps
Create new indices with one shard
You can set up index template that will be used every time new logstash index is created.
curl -XPUT 'localhost:9200/_template/template_logstash?pretty' -H 'Content-Type: application/json' -d'
{
"template" : "logstash-*",
"settings" : {
"number_of_shards" : 1
}
}
'
This will ensure every new index created that match logstash- in name to have only one shard.
Group logs by month
If you don't stream too many logs you can set up your logstash to group logs by month.
// file: /etc/logstash/conf.d/30-output.conf
output {
elasticsearch {
hosts => ["localhost"]
manage_template => false
index => "%{[#metadata][beat]}-%{+YYYY.MM}"
document_type => "%{[#metadata][type]}"
}
}
Final thoughts
It's not easy to fix initial misconfiguration! Good luck with optimising your Elastic search!
You can use the reindex api.
POST _reindex
{
"conflicts": "proceed",
"source": {
"index": ["twitter", "blog"],
"type": ["tweet", "post"]
},
"dest": {
"index": "all_together"
}
}

Rebuild index with zero downtime

Currently working on something and needed some help. I will have an elastic index populated from a sql database. There will be an initial full reindex from the sql database then there will be nightly job which will update / delete / insert updates.
In the event of a major failure I may need to do full reindex. Ideally i want zero downtime. I did find some articles about creating aliases etc however this sees to be more updates to field mappings. My situation is a full reindex of the data from my source db. Can i just get that data push the docs to elastic and elastic will just update the existing index as ids will be same? Or do i need to do something else?
Regards
Ismail
For zero downtime you can create a new index, populate it from your database, and use the alias to switch from the old index to the new one. Steps:
Call your main index something like main_index_1 (or whatever you like)
Create an alias for that index called main_index
curl -XPUT 'localhost:9200/main_index_1/_alias/main_index?pretty
Set up your application to point to this alias
Create a new index called main_index_2 and index it from your database
Switch the alias to point to the new index
curl -XPOST 'localhost:9200/_aliases?pretty' -H 'Content-Type: application/json' -d
{
"actions": [
{ "remove": { "index": "main_index_1", "alias": "main_index" }},
{ "add": { "index": "main_index_2", "alias": "main_index" }}
]
}

Rolling indices in Elasticsearch

I see a lot of topics on how to create rolling indices in Elasticsearch using logstash.
But is there a way to achieve the same i.e create indices on daily basis in elasticsearch without logstash?
I came a cross a post which says to run cron job to create the indices as date rolls, but that is a manual job I have to do, I was looking for out of the box options if available in elasticsearch
Yes, use index templates (which is what Logstash uses internally to achieve the creation of rolling indices)
Simply create a template with a name pattern like this and then everytime you index a document in an index whose name matches that pattern, ES will create the index for you:
curl -XPUT localhost:9200/_template/my_template -d '{
"template" : "logstash-*",
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"my_type" : {
"properties": {
...
}
}
}
}'

Updating the default index number_of_replicas setting for new indices

I've tried updating the number of replicas as follows, according to the documentation
curl -XPUT 'localhost:9200/_settings' -d '
{ "index" : { "number_of_replicas" : 4 } }'
This correctly changes the replica count for existing nodes. However, when logstash creates a new index the following day, number_of_replicas is set to the old value.
Is there a way to permanently change the default value for this setting without updating all the elasticsearch.yml files in the cluster and restarting the services?
FWIW I've also tried
curl -XPUT 'localhost:9200/logstash-*/_settings' -d '
{ "index" : { "number_of_replicas" : 4 } }'
to no avail.
Yes, you can use index templates. Index templates are a great way to set default settings (including mappings) for new indices created in a cluster.
Index Templates
Index templates allow to define templates that will automatically be
applied to new indices created. The templates include both settings
and mappings, and a simple pattern template that controls if the
template will be applied to the index created.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-templates.html
For your example:
curl -XPUT 'localhost:9200/_template/logstash_template' -d '
{
"template" : "logstash-*",
"settings" : {"number_of_replicas" : 4 }
} '
This will set the default number of replicas to 4 for all new indexes that match the name "logstash-*". Note that this will not change existing indexes, only newly created ones.

Is there a way to "escape" ElasticSearch stop words?

I am fairly new to ElasticSearch and have a question on stop words. I have an index that contains state names for the USA....ex: New York/NY, California/CA,Oregon/OR. I believe Oregon's abbreviation, 'OR' is a stop word, so when I insert the state data into the index, I cannot search on 'OR'. Is there a way I can set up custom stopwords for this or am I doing something wrong?
Here is how I am building the index:
curl -XPUT http://localhost:9200/test/state/1 -d '{"stateName": ["California","CA"]}'
curl -XPUT http://localhost:9200/test/state/2 -d '{"stateName": ["New York","NY"]}'
curl -XPUT http://localhost:9200/test/state/3 -d '{"stateName": ["Oregon","OR"]}'
A search for 'NY', works fine. Ex:
curl -XGET 'http://localhost:9200/test/state/_search?pretty=1' -d '
{
"query" : {
"match" : {
"stateName" : "NY"
}
}
}'
But a search for 'OR', returns zero hits:
curl -XGET 'http://localhost:9200/test/state/_search?pretty=1' -d '
{
"query" : {
"match" : {
"stateName" : "OR"
}
}
}'
I believe this search returns no results because OR is stop word, but I don't know how to work around this. Thanks for you help.
You can (and definitely should) control the way you index data by modifying your mapping according to your data and the way you want to search against it.
In your case I would disable stopwords for that specific field rather than modifying the stopword list, but you could do the latter too if you wish to. The point is that you're using the default mapping which is great to start with, but as you can see you need to tweak it depending on your needs.
For each field, you can specify what analyzer to use. An analyzer defines the way you split your text into tokens (tokenizer) that will be indexed and also additional changes you can make to each token (even remove or add new ones) using token filters.
You can specify your mapping either while creating your index or update it afterwards using the put mapping api (as long as the changes you make are backwards compatible).

Resources