Moving data from one Elasticsearch index to another with a higher number of shards, or increasing the shard count of an existing index - elasticsearch

I am new to Elasticsearch and I have been reading the documentation to find a way of increasing the number of shards that my index consists of. Currently my index looks like this:
country_data 0 p STARTED 227 100.7kb 192.168.0.115 $HOSTNAME
country_data 0 r STARTED 227 100.7kb 192.168.0.116 $HOSTNAME
I wanted to increase the number of shards to 5, but I was unable to find a proper way of doing it. I learnt from another Stack Overflow question that I should be able to do it like this:
POST _reindex?slices=5
{
  "source": {
    "index": "country_data"
  },
  "dest": {
    "index": "country_data_new"
  }
}
However, when I did that I got a copy of my country_data index with the same number of shards and replicas (1 and 1). I tried to learn more about it in the documentation, but all I found is this: https://www.elastic.co/guide/en/elasticsearch/client/curator/current/option_slices.html
I couldn't find anything in the documentation about increasing the number of shards in an existing index, or about how to move data to a new index that has more shards. I would be grateful for any insights into this problem, or at least a pointer to where I could learn how to do it.

This can be done in either of the ways mentioned below.
1st Option: You can use the Elasticsearch Split Index API.
I suggest you go through the documentation once before proceeding with this method; a rough sketch is shown below.
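For illustration, a minimal sketch of the split flow, assuming a recent Elasticsearch version and a hypothetical target name country_data_split (the source index must be made read-only first, and the target's primary shard count must be a multiple of the source's, which is trivially satisfied here since the source has 1 primary):
# make the source index read-only before splitting
PUT /country_data/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

# split into a new index with 5 primary shards
POST /country_data/_split/country_data_split
{
  "settings": {
    "index.number_of_shards": 5
  }
}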
2nd Option: Create a new index with the same mappings and the required number of shards in its settings, then use the Reindex API to copy data from the source index to the destination index.
To create the new Index:
PUT /<NEW_INDEX_NAME>
{
  "settings": {
    "number_of_shards": <REQUIRED_NUMBER_OF_SHARDS>
  },
  "mappings": { <MAPPINGS_OF_SOURCE_INDEX> }
}
If you don't specify the number of shards in the settings while creating an index, it is created with one primary and one replica shard by default.
To reindex from the source to the newly created index:
POST _reindex
{
  "source": {
    "index": "<SOURCE_INDEX_NAME>"
  },
  "dest": {
    "index": "<NEW_INDEX_NAME>"
  }
}
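Once the reindex finishes, you can verify that the new index has the expected number of primary shards, for example with the cat shards API (the same kind of output shown in the question):
GET _cat/shards/<NEW_INDEX_NAME>?v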

Related

Reindexing more than 10k documents in Elasticsearch

Let's say I have an index, A. It contains 26k documents. Now I want to change a field, status, to the Keyword type. As I can't change the type of the existing status field in A, I will create a new index, B, with my desired mapping.
I followed the Reindex API:
POST _reindex
{
  "source": {
    "index": "A",
    "size": 10000
  },
  "dest": {
    "index": "B",
    "version_type": "external"
  }
}
But the problem is that this way I can migrate only 10k docs. How do I copy the rest?
How can I copy all the docs without losing any?
Delete the size: 10000 and the problem will be solved.
By the way, the size field in the Reindex API request sets the batch size that Elasticsearch uses to fetch and reindex docs on each pass; by default the batch size is 1000. (You thought it means how many documents you want to reindex.)
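In other words, the same request without the size field should copy everything over, using the index names A and B from the question:
POST _reindex
{
  "source": {
    "index": "A"
  },
  "dest": {
    "index": "B",
    "version_type": "external"
  }
}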

Index policy or index template for Elasticsearch

I have an Elasticsearch cluster for storing logs, and I have indices like this:
logs-2021.01.01
logs-2021.01.02
logs-2021.01.03 ...etc
So indices are created on a daily basis, and I have an index template for these indices:
PUT _index_template/template_1
{
  "index_patterns": ["logs*"],
  "template": {
    "settings": {
      "number_of_shards": 6,
      "number_of_replicas": 1
    }
  }
}
But I want to make sure that indices older than 1 day have 0 replicas to save disk space, while indices younger than 1 day keep 1 replica (so that in case of a server loss, I still have today's data).
How can I do this the Elasticsearch way? I have thought about a bash script executed by cron that gets all the indices older than 1 day and sets them to 0 replicas, but I don't want to use external scripts for that.
Thank you for your help.
You can use the ILM (index lifecycle management) feature of Elasticsearch.
With it, you can create a policy with different phases and perform some action in each phase.
You can give the condition for when an index moves to the next phase, based on your scenario.
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "warm": {
        "actions": {
          "allocate": {
            "number_of_replicas": 0
          }
        }
      }
    }
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-allocate.html
This is not a foolproof policy, but you can use this concept for your scenario.
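For instance, a minimal sketch building on the policy above and the template_1 from the question: adding "min_age": "1d" to the warm phase makes the replica drop happen one day after index creation (when rollover is not used, min_age is measured from creation time), and referencing the policy from the index template makes every new logs-* index pick it up:
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0
          }
        }
      }
    }
  }
}

PUT _index_template/template_1
{
  "index_patterns": ["logs*"],
  "template": {
    "settings": {
      "number_of_shards": 6,
      "number_of_replicas": 1,
      "index.lifecycle.name": "my_policy"
    }
  }
}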

AWS elasticsearch disable replication of all indices

I am using a single-node AWS ES cluster. Currently, its health status is showing yellow, which is obvious because there is no other node to which Amazon ES can assign a replica. I want to set the replication of all my current and upcoming indices to 0. I have indices created in this pattern:
app-one-2021.02.10
app-two-2021.01.11
and so on...
These indices currently have number_of_replicas set to 1. To disable replication for all of them, I am sending a PUT request to the index pattern:
PUT /app-one-*/_settings
{
  "index" : {
    "number_of_replicas": 0
  }
}
Since I am using a wildcard here, it should set number_of_replicas to 0 in all the matching indices, which it is doing successfully.
But if any new index is created in the future, let's say app-one-2021.03.10, then number_of_replicas is again set to 1 in that index.
Every time, I have to run a PUT request to set number_of_replicas to 0, which is tedious. Why are new indices not automatically getting number_of_replicas set to 0, even though I am using a wildcard (*) in my PUT request?
Is there any way to set replication (number_of_replicas) to 0 permanently, no matter whether it's a new index or an old one? How can I achieve this?
Yes, the way is to define index templates.
Before Elasticsearch v7.8, you could only use the _template API (see docs). E.g., in your case, you can create a template matching all the app-* indices:
PUT _template/app_settings
{
  "index_patterns": ["app-*"],
  "settings": {
    "number_of_replicas": 0
  }
}
Since Elasticsearch v7.8, the old API is still supported but deprecated, and you can use the _index_template API instead (see docs).
PUT _index_template/app_settings
{
  "index_patterns": ["app-*"],
  "template": {
    "settings": {
      "number_of_replicas": 0
    }
  }
}
Update: added code snippets for both the _template and _index_template APIs.
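If you are on a version with composable templates, you can also preview which settings a future index would get with the simulate index API; a quick sketch, using the upcoming index name from the question:
POST _index_template/_simulate_index/app-one-2021.03.10
Keep in mind that templates are only applied at index creation time, so the already existing indices still need the one-off wildcard _settings request shown in the question.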

ElasticSearch - How to merge indexes into one index?

My cluster has an index for each day since a few months ago,
5 shards per index (the default),
and I can't run queries on the whole cluster because there are too many shards (over 1000).
The document IDs are automatically generated.
How can I combine the indexes into one index, deal with conflicting ids (if conflicts are even possible), and change the types?
I am using ES version 5.2.1
This is a common problem that becomes visible only after a few months of using the ELK stack with Filebeat creating indices day by day. There are a few options to fix the performance issue here.
_forcemerge
First, you can use _forcemerge to limit the number of segments inside each Lucene index. The operation won't limit or merge indices, but it will improve the performance of Elasticsearch.
curl -XPOST 'localhost:9200/logstash-2017.07*/_forcemerge?max_num_segments=1'
This will run through all the indices of the month and force merge their segments. When done for every month, it should improve Elasticsearch performance a lot. In my case CPU usage went down from 100% to 2.7%.
Unfortunately this won't solve the shards problem.
_reindex
Please read the _reindex documentation and backup your database before continue.
As tomas mentioned, if you want to limit the number of shards or indices there is no other option than using _reindex to merge a few indices into one. This can take a while depending on the number and size of the indices you have.
Destination index
You can create the destination index beforehand and specify the number of shards it should contain. This ensures your final index will have the number of shards you need.
curl -XPUT 'localhost:9200/new-logstash-2017.07.01?pretty' -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "index" : {
      "number_of_shards" : 1
    }
  }
}
'
Limiting number of shards
If you want to limit the number of shards per index, you can run _reindex one to one. In this case no entries should be dropped, as it will be an exact copy but with a smaller number of shards.
curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
  "conflicts": "proceed",
  "source": {
    "index": "logstash-2017.07.01"
  },
  "dest": {
    "index": "logstash-v2-2017.07.01",
    "op_type": "create"
  }
}
'
After this operation you can remove the old index and use the new one. Unfortunately, if you want to keep the old name, you need to _reindex one more time into an index with that name. If you decide to do that,
DON'T FORGET TO SPECIFY THE NUMBER OF SHARDS FOR THE NEW INDEX! By default it will fall back to 5.
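A hedged alternative to that second reindex: once the new index is verified, you could delete the old one and point an alias with the old name at logstash-v2-2017.07.01, so queries against the old name keep working (sketch only; make sure the copy is complete before deleting anything):
curl -XDELETE 'localhost:9200/logstash-2017.07.01?pretty'

curl -XPOST 'localhost:9200/_aliases?pretty' -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "logstash-v2-2017.07.01", "alias": "logstash-2017.07.01" } }
  ]
}
'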
Merging multiple indices and limiting number of shards
curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
"conflicts": "proceed",
"source": {
"index": "logstash-2017.07*"
},
"dest": {
"index": "logstash-2017.07",
"op_type": "create"
}
}
'
When done, you should have all entries from logstash-2017.07.01 to logstash-2017.07.31 merged into logstash-2017.07. Note that the old indices must be deleted manually.
Some of the entries can be overwritten or merged, depending on which conflicts and op_type options you choose.
Further steps
Create new indices with one shard
You can set up an index template that will be used every time a new logstash index is created.
curl -XPUT 'localhost:9200/_template/template_logstash?pretty' -H 'Content-Type: application/json' -d'
{
  "template" : "logstash-*",
  "settings" : {
    "number_of_shards" : 1
  }
}
'
This will ensure that every new index whose name matches logstash-* has only one shard.
Group logs by month
If you don't stream too many logs, you can set up your Logstash to group logs by month.
# file: /etc/logstash/conf.d/30-output.conf
output {
  elasticsearch {
    hosts => ["localhost"]
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM}"
    document_type => "%{[@metadata][type]}"
  }
}
Final thoughts
It's not easy to fix an initial misconfiguration! Good luck with optimising your Elasticsearch!
You can use the Reindex API:
POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": ["twitter", "blog"],
    "type": ["tweet", "post"]
  },
  "dest": {
    "index": "all_together"
  }
}

ElasticSearch + Kibana - Unique count using pre-computed hashes

Update: added the ElasticSearch query and stack trace below.
I want to perform unique count on my ElasticSearch cluster.
The cluster contains about 50 million records.
I've tried the following methods:
First method
Mentioned in this section:
Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory.
Second method
Mentioned in this section:
Unless you configure Elasticsearch to use doc_values as the field data format, the use of aggregations and facets is very demanding on heap space.
My property mapping
"my_prop": {
"index": "not_analyzed",
"fielddata": {
"format": "doc_values"
},
"doc_values": true,
"type": "string",
"fields": {
"hash": {
"type": "murmur3"
}
}
}
The problem
When I use unique count on my_prop.hash in Kibana I receive the following error:
Data too large, data for [my_prop.hash] would be larger than limit
ElasticSearch has 2g heap size.
The above also fails for a single index with 4 million records.
My questions
Am I missing something in my configurations?
Should I scale up my machine? This does not seem like a scalable solution.
ElasticSearch query
Was generated by Kibana:
http://pastebin.com/hf1yNLhE
ElasticSearch Stack trace
http://pastebin.com/BFTYUsVg
That error says you don't have enough memory (more specifically, memory for fielddata) to store all the values of hash, so you need to take them out of the heap and put them on disk, which means using doc_values.
Since you are already using doc_values for my_prop, I suggest doing the same for my_prop.hash (and no, the settings of the main field are not inherited by its sub-fields): "hash": { "type": "murmur3", "index": "no", "doc_values": true }.
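Putting the question's mapping and this suggestion together, the property might look like the sketch below (my_index and my_type are placeholder names; changing doc_values on an existing field generally requires creating a new index and reindexing). The unique count then becomes a cardinality aggregation on my_prop.hash:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_prop": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true,
          "fielddata": { "format": "doc_values" },
          "fields": {
            "hash": {
              "type": "murmur3",
              "index": "no",
              "doc_values": true
            }
          }
        }
      }
    }
  }
}

POST my_index/_search
{
  "size": 0,
  "aggs": {
    "unique_my_prop": {
      "cardinality": {
        "field": "my_prop.hash"
      }
    }
  }
}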
