Migrate from a standalone Elasticsearch instance to a cluster - elasticsearch

I have a one-node Elasticsearch cluster
with Elasticsearch, Kibana, and APM Server installed.
I have about 5 TB of indices.
My Elasticsearch version is:
{
  "name" : "elk-old",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "qLR6jhtgS627KCq7Ls-dxQ",
  "version" : {
    "number" : "7.5.0",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "e9ccaed468e2fac2275a3761849cbee64b39519f",
    "build_date" : "2019-11-26T01:06:52.518245Z",
    "build_snapshot" : false,
    "lucene_version" : "8.3.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
Now I want to add 3 more nodes whose hardware differs from the old elk instance:
old-elk: 10 CPU, 64 GB RAM, 10 TB disk
new nodes: 10 CPU, 64 GB RAM, 5 TB disk each
My goal is to add the three new nodes to the cluster and store all new indices on them.
Old indices should remain available for reading and stay on the old node.
Old indices are automatically deleted after 1 month,
and after that month I need to exclude the old node from the cluster and shut it down.
What is the right way to achieve this?

First up, upgrade: 7.5 is EOL and 7.15 is the latest release.
Then implement ILM, which will handle allocation of indices as well as rollover and retention.
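For the allocation and retention pieces, here is a minimal sketch of one common approach. It assumes the new nodes are tagged with a custom attribute; the attribute name box_type, the template name, the index pattern logs-*, and the policy name are illustrative assumptions, not taken from the question:

# elasticsearch.yml on each of the three new nodes (assumed attribute):
# node.attr.box_type: new

# A minimal ILM policy that deletes indices 30 days after creation
curl -XPUT 'localhost:9200/_ilm/policy/delete-after-30d?pretty' -H 'Content-Type: application/json' -d '{
  "policy": {
    "phases": {
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}'

# Pin all newly created indices matching an assumed pattern to the new nodes
curl -XPUT 'localhost:9200/_template/new-indices?pretty' -H 'Content-Type: application/json' -d '{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.routing.allocation.require.box_type": "new",
    "index.lifecycle.name": "delete-after-30d"
  }
}'

# Once the old indices have expired, drain the old node before shutting it down
# (the node name elk-old comes from the version output above)
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{
  "transient": { "cluster.routing.allocation.exclude._name": "elk-old" }
}'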

Related

Elasticsearch random slow queries troubleshooting

I moved my data from one Elasticsearch cluster to another with more powerful hardware
(4 nodes, 2 CPUs and 8 GB RAM each, with 4 GB JVM heap per machine; the old cluster had 3 nodes with 1 CPU each and 2 GB JVM heap per machine),
but I am randomly experiencing some very slow query responses that I didn't have on the old cluster.
Both clusters run the same ES version, 6.8.14, and hold the same number of documents and shards
(62 GB of data / 106 million documents).
Launching the queries from Kibana with the profiler on the new cluster
shows that most of the time is spent in this phase:
"collector" : [
{
"name" : "CancellableCollector",
"reason" : "search_cancelled",
"time_in_nanos" : 5777066437,
"children" : [
{
"name" : "MultiCollector",
"reason" : "search_multi",
"time_in_nanos" : 5756203807,
"children" : [
{
"name" : "SimpleTopScoreDocCollector",
"reason" : "search_top_hits",
"time_in_nanos" : 747674917
},
{
"name" : "MultiBucketCollector: [[min_price, max_price, in_stock, out_of_stock, category, agg_h, mdata, agg_att, agg_f]]",
"reason" : "aggregation",
"time_in_nanos" : 4966026553
}
]
}
]
}
]
CPU usage per node is quite low (5-10%) and load average is fine (0.50).
Running the same query again shortly afterwards brings the response time down from 8 s to 0.4 s (ES caching, I guess), and removing the "aggs" part of the query also seems to fix the issue, so the poor performance is actually in the aggregation phase.
Still, I don't understand why this slowdown occurs only on the new, "better" cluster. How can I optimize the performance or troubleshoot this further?
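For anyone trying to reproduce this, the collector breakdown above is what the search Profile API returns; a minimal sketch of such a request (the index name, field, and aggregation here are assumptions standing in for the real query) is:

curl -XGET 'localhost:9200/products/_search?pretty' -H 'Content-Type: application/json' -d '{
  "profile": true,
  "query": { "match_all": {} },
  "aggs": {
    "category": { "terms": { "field": "category" } }
  }
}'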

Elasticsearch timeout doesn't work when searching

Elasticsearch version (bin/elasticsearch --version): 5.2.2
JVM version (java -version): 1.8.0_121
OS version (uname -a if on a Unix-like system): openSUSE
I search with: curl -XGET 'localhost:9200/_search?pretty&timeout=1ms'
Part of the response is:
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 208,
    "successful" : 208,
    "failed" : 0
  },
  "hits" : {
    "total" : 104429,
    "max_score" : 1.0,
    "hits" :
    ...
The took time is 5 ms and the timeout setting is 1 ms, so why is "timed_out" false rather than true?
Thanks
The timeout is applied per searched shard (208 of them in your case), while took is for the entire query. On a per-shard level you are within the limit. The documentation has some additional information on when you will hit timed_out, plus further caveats.
Try a more expensive query (leading wildcard, fuzziness, ...); I expect you will then hit the (per-shard) limit.
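A quick sketch of such a test, keeping the 1 ms timeout but using a leading-wildcard query (the field name message and the pattern are assumptions):

curl -XGET 'localhost:9200/_search?pretty&timeout=1ms' -H 'Content-Type: application/json' -d '{
  "query": {
    "wildcard": { "message": "*error*" }
  }
}'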

Elasticsearch: Inconsistent number of shards in stats & cluster APIs

I uploaded the data to my single-node cluster and named the index 'gequest'.
When I GET http://localhost:9200/_cluster/stats?human&pretty, I get:
"cluster_name" : "elasticsearch",
"status" : "yellow",
"indices" : {
"count" : 1,
"shards" : {
"total" : 5,
"primaries" : 5,
"replication" : 0.0,
"index" : {
"shards" : {
"min" : 5,
"max" : 5,
"avg" : 5.0
},
"primaries" : {
"min" : 5,
"max" : 5,
"avg" : 5.0
},
"replication" : {
"min" : 0.0,
"max" : 0.0,
"avg" : 0.0
}
}
}
When I do GET on http://localhost:9200/_stats?pretty=true
"_shards" : {
"total" : 10,
"successful" : 5,
"failed" : 0
}
Why is the total number of shards inconsistent between the two reports? Why does the stats API report 10 total shards? How do I track down the other 5?
From the results it is likely that you have a single Elasticsearch node running and created an index with default settings (5 primary shards, each with one replica). Since only one node is running, Elasticsearch cannot assign the replica shards anywhere (it never assigns a primary and its replica to the same node).
The _cluster/stats API gives information about the cluster, including its current state. Your result shows the cluster status is "yellow", meaning all primary shards are allocated but not all replicas have been allocated/initialized, so it counts only the 5 allocated shards.
The _stats API gives information about the indices in the cluster, including how many shards and replicas each index is configured with. Since your index wants 10 shard copies in total (5 primaries and 5 replicas), the stats report a total of 10, of which only the 5 allocated primaries count as successful; the 5 replicas remain unassigned because there is no second node to hold them.
Use http://localhost:9200/_cat/shards to see the overall shard status
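If the yellow status itself is a concern on a single-node setup, one common fix is to drop the replica count to zero so the 5 unassigned copies go away (a sketch; the index name gequest comes from the question):

curl -XPUT 'localhost:9200/gequest/_settings?pretty' -H 'Content-Type: application/json' -d '{
  "index" : { "number_of_replicas" : 0 }
}'

# Afterwards every shard should show as STARTED
curl 'localhost:9200/_cat/shards?v'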

How to remove nodes from an ES cluster

Wowee... how does one remove nodes from ES?
I had 4 nodes and wanted to remove three.
On the node I wanted to keep, I ran the following:
curl -XPUT localhost:9200/_cluster/settings -d '{"transient" : {"cluster.routing.allocation.exclude._ip" : "172.31.6.204"}}';echo
curl -XPUT localhost:9200/_cluster/settings -d '{"transient" : {"cluster.routing.allocation.exclude._ip" : "172.31.6.205"}}';echo
curl -XPUT localhost:9200/_cluster/settings -d '{"transient" : {"cluster.routing.allocation.exclude._ip" : "172.31.6.206"}}';echo
Now my cluster health looks like this:
curl localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 2,
  "active_shards" : 2,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 8
How do I fix this? What is the proper method for removing nodes?
Thank god this was in dev...
Thanks
Your decommissioning approach with cluster.routing.allocation.exclude._ip was good!
The problem is that you didn't give the shards time to move to the remaining nodes between each step.
You can replay the same commands, one by one, but this time watch the migrating shards with a monitoring plugin such as ElasticHQ or Kopf before removing the next node.
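For reference, the exclusion setting also accepts a comma-separated list, so all three nodes can be drained in one call and the relocation watched before anything is shut down (the IPs are the ones from the question):

curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "172.31.6.204,172.31.6.205,172.31.6.206"
  }
}'

# Wait until no shards remain on the excluded nodes, then shut them down
curl 'localhost:9200/_cat/shards?v'
curl 'localhost:9200/_cluster/health?pretty'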

Elasticsearch jdbc river eats up entire memory

I am trying to index 16 million docs (47 GB) from a MySQL table into an Elasticsearch index. I am using jprante's elasticsearch-jdbc river to do this. But after creating the river and waiting about 15 minutes, the entire heap memory gets consumed with no sign of the river running or docs getting indexed. The river used to run fine when I had around 10-12 million records to index. I have tried running the river 3-4 times, but in vain.
Heap memory preallocated to the ES process = 10 GB
elasticsearch.yml
cluster.name: test_cluster
index.cache.field.type: soft
index.cache.field.max_size: 50000
index.cache.field.expire: 2h
cloud.aws.access_key: BBNYJC25Dij8JO7YM23I(fake)
cloud.aws.secret_key: GqE6y009ZnkO/+D1KKzd6M5Mrl9/tIN2zc/acEzY(fake)
cloud.aws.region: us-west-1
discovery.type: ec2
discovery.ec2.groups: sg-s3s3c2fc(fake)
discovery.ec2.any_group: false
discovery.zen.ping.timeout: 3m
gateway.recover_after_nodes: 1
gateway.recover_after_time: 1m
bootstrap.mlockall: true
network.host: 10.111.222.33(fake)
river.sh
curl -XPUT 'http://--address--:9200/_river/myriver/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "driver" : "com.mysql.jdbc.Driver",
    "url" : "jdbc:mysql://--address--:3306/mydatabase",
    "user" : "USER",
    "password" : "PASSWORD",
    "sql" : "select * from mytable order by creation_time desc",
    "poll" : "5d",
    "versioning" : false
  },
  "index" : {
    "index" : "myindex",
    "type" : "mytype",
    "bulk_size" : 500,
    "bulk_timeout" : "240s"
  }
}'
System properties:
16 GB RAM
200 GB disk space
Depending on your version of elasticsearch-river-jdbc (find out with ls -lrt plugins/river-jdbc/), this bug might already be fixed (https://github.com/jprante/elasticsearch-river-jdbc/issues/45).
Otherwise, file a bug report on GitHub.
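If it turns out not to be that bug, another usual suspect is the MySQL driver buffering all 16 million rows in memory before the river can index anything. A heavily hedged sketch follows: whether your river version honors a fetchsize option at all is an assumption, and with the MySQL driver a fetch size of Integer.MIN_VALUE is what enables row streaming:

# NOTE: "fetchsize" support (and a "min" shortcut for Integer.MIN_VALUE) is an
# assumption about the river-jdbc plugin, not a confirmed option for your version.
curl -XPUT 'http://--address--:9200/_river/myriver/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "driver" : "com.mysql.jdbc.Driver",
    "url" : "jdbc:mysql://--address--:3306/mydatabase",
    "user" : "USER",
    "password" : "PASSWORD",
    "sql" : "select * from mytable order by creation_time desc",
    "fetchsize" : "min",
    "poll" : "5d",
    "versioning" : false
  },
  "index" : {
    "index" : "myindex",
    "type" : "mytype",
    "bulk_size" : 500,
    "bulk_timeout" : "240s"
  }
}'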
