Elasticsearch: how many clusters and indexes do I need for 8 applications?

I have an ELK Stack set up and accepting log data from 2 of my applications, and everything is working OK. It's been running for 25 days and I have nearly 4GB of data/documents on a 25GB server.
My question
I have 8 applications in total that I would like to hook up to my ELK Stack.
Is the one cluster OK for this, or do I need to add more clusters, say a cluster for each application's data? If so, how do I do that without having to re-index my data?
Why does cluster health say "yellow (244 of 488)"?
Should I index each application into its own index rather than the default "logstash-{todays-date}", like my-app-1-{todays-date}, my-app-2-{todays-date}, etc.?
Your help is greatly appreciated.
G

Your cluster is yellow because your logstash-* indices are configured with 1 replica and you probably have a single node. "244 of 488" means that there are 488 shards across all your indices, but only 244 are assigned on your single node; the other 244 replica shards remain unassigned until new nodes join. This is not a problem per se, but if your current node were to fail for some reason you'd probably lose some data, whereas with 2+ nodes the data would be replicated on other nodes, your cluster would be green (you'd see 488 of 488) and you'd have a lower risk of losing data.
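You can confirm this on your own cluster; both calls below are standard Elasticsearch APIs, and localhost:9200 is an assumed address:
# Overall cluster status plus the assigned/unassigned shard counts
curl -XGET 'localhost:9200/_cluster/health?pretty'
# List every shard and keep only the ones that have no node to live on
curl -XGET 'localhost:9200/_cat/shards?v' | grep UNASSIGNED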
As for your second question, nothing prevents you from storing all the logs from your eight applications in the same daily logstash indices. You just need to make sure that your logstash configuration accounts for each of the different apps and adds a field with the application name (e.g. app: app1, app: app2, etc.) to the indexed log events, so that you can then distinguish within Kibana which app each log event came from.
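Once that field is in place, you can scope searches (and Kibana filters) to a single application. A minimal sketch, assuming the field is called app as in the example above and the cluster is on localhost:9200:
# Return only log events emitted by app1 across all daily logstash indices
curl -XGET 'localhost:9200/logstash-*/_search?pretty' -H 'Content-Type: application/json' -d '
{ "query": { "term": { "app": "app1" } } }'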

I have only used Elasticsearch and not the complete ELK stack, but I can give some ideas and guess what is going on. 488 = 2 x 244, so I guess there are unassigned replica shards in the single-machine cluster. You can update this setting ad hoc and set it to zero:
# Drop replicas on an existing index so all shards can be assigned on a single node
curl -XPUT 'localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "number_of_replicas": 0 } }'
You should update the logstash index template not to use replicas when you are running just a single machine. Also, your shards seem to be only about 20 MB in size, so I'd recommend each index use just one shard instead of five; each shard consumes extra resources. Having multiple shards increases indexing speed but slows down queries, so you should check whether one is sufficient or not.
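A minimal template sketch for a single-node logging cluster: one primary shard and no replicas. The template name is made up, and the legacy _template endpoint with the "template" key matches older Elasticsearch versions (newer ones use index_patterns instead):
curl -XPUT 'localhost:9200/_template/logstash_single_node' -H 'Content-Type: application/json' -d '
{
  "template": "logstash-*",
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}'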
An index per application per day would speed up querying if dashboards are mostly application-specific, and you can create a day-specific alias to be used by cross-application queries.
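A sketch of such a day-specific alias, assuming per-application daily index names like those mentioned in the question:
curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "add": { "index": "my-app-1-2015.09.01", "alias": "all-apps-2015.09.01" } },
    { "add": { "index": "my-app-2-2015.09.01", "alias": "all-apps-2015.09.01" } }
  ]
}'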

Related

Elasticsearch reindex gets stuck

Context
We have two Elasticsearch clusters, with 6 and 3 nodes each. The 6-node cluster is the one we use in the production environment and we use the 3-node one for testing purposes. (We have the same problem in both clusters.) All the nodes have the following characteristics:
Elasticsearch 7.4.2
1TB HDD disk
8 GB RAM
In our case, we need to reindex some of the indexes. Those indexes have billions of documents and a size between 50GB and 250GB.
Problem
Whenever we start reindexing, internally or from a remote source, the task starts working correctly but it reaches a point where it stops reindexing, for no apparent reason. We can't see anything in the logs. The task is not cancelled or anything; it just stops reindexing documents, as if the task were stuck. We tried changing GC strategies, we used CMS and Shenandoah, but nothing changed.
Has anyone run into the same problem?
It's difficult to find the root cause of these issues without debugging them, and with the little information you provided (cluster and index configuration, index slow logs, Elasticsearch error logs, and Elasticsearch hot threads are all missing, to name a few).
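One low-effort check is whether the reindex task is still registered and making progress; the Tasks API reports per-task status (created/updated document counts) for running reindex operations. A sketch, with localhost:9200 assumed:
# Show detailed status for any running reindex tasks
curl -XGET 'localhost:9200/_tasks?detailed=true&actions=*reindex&pretty'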

Limiting Elasticsearch data retention below disk space

Scenario:
We use Elasticsearch & Logstash to do application logging for a moderately high-traffic system
This system generates ~200gb of logs every single day
We use 4 sharded instances and want to retain roughly the last 3 days' worth of logs
So, we implemented a "cleanup" system, running daily, which removes all data older than 3 days
So far so good. However, a few days ago, some subsystem generated a persistent spike of log data, filling up all available disk space within a few hours, which turned the cluster red. This also meant that the cleanup system wasn't able to connect to ES, as the entire cluster was down on account of the disk being full. This is extremely problematic, as it limits our visibility into what's going on and blocks our ability to see what caused this in the first place.
Doing root cause analysis here, a few questions pop out:
How can we look at the system in eg Kibana when the cluster status is red?
How can we tell ES to throw away (oldest-first) logs if there is no more space, rather than going status=red?
In what ways can we make sure this does not happen ever again?
Date-based index patterns are tricky with spiky loads. There are two things to combine for a smooth setup that doesn't need manual intervention:
Switch to rollover indices (see the sketch below). You can then define that you want to create a new index once your existing one has reached X GB. Then you don't care about the log volume per day any more; you can simply keep as many indices around as you have disk space (and leave some buffer / fine-tune the watermarks).
To automate the rollover, removal of indices, and optionally setting of an alias, we have Elastic Curator:
Example for rollover
Example for delete index, but you want to combine this with the count filtertype
PS: There will be another solution soon, called Index Lifecycle Management. It's built into Elasticsearch directly and can be configured through Kibana, but it's still just around the corner at the moment.
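The rollover call itself looks like this; the alias name and thresholds are illustrative, and the max_size condition needs Elasticsearch 6.1 or later:
# Roll the index behind the logs_write alias once it exceeds 30 GB or one day of age
curl -XPOST 'localhost:9200/logs_write/_rollover' -H 'Content-Type: application/json' -d '
{
  "conditions": { "max_size": "30gb", "max_age": "1d" }
}'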
How can we look at the system in eg Kibana when the cluster status is red?
Kibana can't connect to ES if it's already down. It's best to poll the Cluster Health API directly to get the cluster's current state.
How can we tell ES to throw away (oldest-first) logs if there is no more space, rather than going status=red?
This option is not built into Elasticsearch. The best way is to monitor disk space using Watcher or some other tool, and have your monitoring send out an alert plus trigger a job that cleans up old logs if the disk usage goes above a specified threshold.
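A rough sketch of such a cleanup job; the 85% threshold, the host, and the index name being deleted are all assumptions:
# Highest disk usage (in percent) across the data nodes
USED=$(curl -s 'localhost:9200/_cat/allocation?h=disk.percent' | sort -n | tail -1)
if [ "${USED:-0}" -ge 85 ]; then
  # Delete the oldest daily index first (name is illustrative)
  curl -XDELETE 'localhost:9200/logstash-2018.01.01'
fi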
In what ways can we make sure this does not happen ever again?
Monitor the disk space of your cluster nodes.

Elasticsearch reindex store sizes vary greatly

I am running Elasticsearch 6.2.4. I have a program that will automatically create an index for me as well as the mappings necessary for my data. For this issue, I created an index called "landsat" but it needs to actually be named "landsat_8", so I chose to reindex. The original "landsat" index has 2 shards and 0 read replicas. The store size is ~13.4gb with ~6.6gb per shard and the index holds just over 515k documents.
I created a new index called "landsat_8" with 5 shards, 1 read replica, and started a reindex with no special options. On a very small Elastic Cloud cluster (4GB RAM), it finished in 8 minutes. It was interesting to see that the final store size was only 4.2gb, yet it still held all 515k documents.
After it was finished, I realized that I had failed to create my mappings before reindexing, so I blew it away and started over. I was shocked to find that after an hour, the _cat/indices endpoint showed that only 7.5gb of data and 154,800 documents had been reindexed. 4 hours later, the entire job seemed to have died at 13.1gb, but it only showed 254,000 documents had been reindexed.
On this small 4gb cluster, this reindex operation was maxing out CPU. I increased the cluster to the biggest one Elastic Cloud offered (64gb RAM), 5 shards, 0 RR, and started the job again. This time, I set the refresh_interval on the new index to -1 and changed the size for the reindex operation to 2000. Long story short, this job ended somewhere between 1h10m and 1h19m. However, this time I ended up with a total store size of 25gb, where each shard held ~5gb.
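For reference, those two tweaks map to calls like the following; this is only a sketch of the steps described above, with the index names taken from the question:
# Turn off periodic refresh on the destination index while bulk loading
curl -XPUT 'localhost:9200/landsat_8/_settings' -H 'Content-Type: application/json' -d '
{ "index": { "refresh_interval": "-1" } }'
# Reindex with a larger per-batch scroll size
curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": { "index": "landsat", "size": 2000 },
  "dest": { "index": "landsat_8" }
}'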
I'm very confused as to why the reindex operation causes such wildly different results in store size and reindex performance. Why, when I don't explicitly define any mappings and let ES automatically create mappings, is the store size so much smaller? And why, when I use the exact same mappings as the original index, is the store so much bigger?
Any advice would be greatly appreciated. Thank you!
UPDATE 1:
Here are the only differences in mappings:
The left image is "landsat" and the right image is "landsat_8". There is a root level "type" field and a nested "properties.type" field in the original "landsat" index. I forgot one of my goals was to remove the field "properties.type" from the data during the reindex. I seem to have been successful in doing so, but at the same time, accidentally renamed the root-level "type" field mapping to "provider", thus "landsat_8" has an unused "provider" mapping and an auto-created "type" mapping.
So there are some problems here, but I wouldn't think this would nearly double my store size...

How can I route ElasticSearch requests to a few shards

My ES cluster has 12 servers, but when I created my index I only specified 3 shards. So should I use the routing parameter each time I write and read, to make the latency shorter?
If you want to control shard allocation, there are a few options.
One option is to set a node attribute in the config yml file, e.g. node.rack: rack1.
Then, when you create/update the index:
PUT test/_settings
{
  "index.routing.allocation.include.rack": "rack1"
}
In addition, it depends on the size of your index. For instance, in my app I am using different types of indexes: some of them have 1 shard (they are settings indexes), others have 3 shards and 1 replica, and I don't care about allocation because it's super fast. So if you care about latency, then maybe it's better to think about upgrading the network.
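For completeness, the routing parameter the question asks about is passed per request: documents indexed with a given routing value land on a single shard, and searches that pass the same value only query that shard. A sketch in the same console style as above, with the index, document, and routing value purely illustrative (URL paths follow recent Elasticsearch versions):
PUT test/_doc/1?routing=user1
{
  "message": "routed document"
}
GET test/_search?routing=user1
{
  "query": { "match_all": {} }
}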

If you create a table with 32 shards on one server, when you add more servers will those shards rebalance?

When you have a one-node cluster and you create a table with 32 shards, and then you add, say, 7 more nodes to the cluster, will those shards automatically migrate to the rest of the cluster so I have 4 shards per node?
Is manual intervention required for this?
How about the replicas created on one node? Do those migrate to other nodes as well?
Nothing will be automatically redistributed. In current versions of RethinkDB, changing the number/distribution of replicas or changing shard boundaries will cause a loss of availability, so you have to explicitly ask for it to happen (either in the web UI or with the command line administration tool).