what is a good indexing strategy when dealing with AWS/CloudWatch logs? - elasticsearch

I am new to the Elasticsearch world and I'm working on a project that uses Amazon Elasticsearch Service (Elasticsearch and Kibana) to provide a log analytics system for all the CloudWatch logs from different AWS accounts. Setting up the stack and routing the CloudWatch logs is the easy part. But I've noticed that a good indexing strategy comes into play especially when you have immutable data in a time-series fashion (logs, in this case).
My first approach was to create one daily index for each log group and use an index policy to move/expire old indices based on my requirements. But I figured that I would end up dealing with a lot of tiny indices in my Elasticsearch cluster.
Then I considered indexing all the CloudWatch log groups from each AWS account into a single daily index. The problem is that this exceeds the mapping limit (1000 fields), mostly because of CloudTrail and VPC Flow Logs, and I don't think it is a good idea to increase this limit.
So I've decided to group my logs into a limited number of index types (e.g. CloudTrail logs, VPC Flow Logs, and other logs). Basically I would have three daily indices for each AWS account, which are relatively larger indices, and I won't have to increase the mapping limit.
I'm sharing this to see if anybody else has implemented something similar and what their thoughts are. I'm still in the initial phase of the project and I am eagerly looking for suggestions and recommendations.

A good indexing strategy is very subjective and depends on a lot of factors, like the size of each index and how often you are going to query it.
Since we are talking about CloudWatch logs here, you should keep your focus on avoiding lots of small indices. Apart from combining logs of different types, you can also look at combining older indices into weekly or monthly indices; for example, reindex one week's data into a weekly index at the end of the week. Also, make sure you have a retention period defined and are clearing off any older indices.
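As an illustration, a weekly rollup like that could look roughly like the sketch below, using the Python elasticsearch client. The endpoint, index naming scheme, and cloudtrail- prefix are assumptions made for the example, not part of the setup described above.

from datetime import date, timedelta
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-domain.example.com:443")  # hypothetical endpoint

# Assumed naming scheme: daily cloudtrail-YYYY.MM.DD indices rolled into a weekly cloudtrail-YYYY.WW index.
week_start = date.today() - timedelta(days=7)
daily_indices = ",".join(
    f"cloudtrail-{(week_start + timedelta(days=i)).strftime('%Y.%m.%d')}" for i in range(7)
)
weekly_index = f"cloudtrail-{week_start.strftime('%Y.%W')}"

# Copy one week's worth of daily data into a single weekly index.
resp = es.reindex(
    body={"source": {"index": daily_indices}, "dest": {"index": weekly_index}},
    wait_for_completion=True,
)

# Only clear off the daily indices once the reindex reported no failures.
if not resp.get("failures"):
    es.indices.delete(index=daily_indices)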
You can also consider looking at UltraWarm nodes in Amazon Elasticsearch, which provide a hot-warm storage architecture that works really well for read-only data like logs.

Related

Figure out the problematic index in ES cluster?

I have an Elasticsearch cluster which hosts more than 15 indices, and I have a Datadog integration which shows me the below view of my Elasticsearch cluster.
We have an alert integration with DD (Datadog) which alerts us if overall CPU usage goes beyond 60%, and we also start getting alerts in our application when the Elasticsearch cluster is under stress, since in that case our response time increases beyond a configured threshold.
Now my problem is how to know which indices are consuming the most ES cluster resources, so that we can either throttle the requests from those indices or optimize their requests.
Some things which we did:
Looked at the slow query log: this doesn't point to a culprit, because under heavy load or high CPU usage we get slow query logs from almost all the big indices.
In the DD dashboard there is a spike in the bulk queue, but this is an overall number and not specific to a particular ES index.
So my problem is very simple: all I want is some metric from DD or Elastic which can easily tell me which indices are consuming the most resources on an Elasticsearch cluster.
Unfortunately I cannot propose an exact solution/workaround, but you might have a look at the following documentation/APIs:
Indices Stats API
Cluster Stats API
Nodes Stats API
CPU usage is not included in the exported fields, but maybe you can derive high-CPU behaviour from the other fields.
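For example, a rough way to rank indices by how much query and indexing work they are doing is to pull the Indices Stats API and sort by the cumulative timers. The sketch below is only an illustration using the Python elasticsearch client and a placeholder endpoint; the same numbers are available from GET /_stats directly.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Per-index search and indexing statistics.
stats = es.indices.stats(metric="search,indexing")

rows = []
for name, data in stats["indices"].items():
    total = data["total"]
    rows.append((
        name,
        total["search"]["query_time_in_millis"],
        total["indexing"]["index_time_in_millis"],
    ))

# Indices with the largest cumulative query/index time are the likeliest resource hogs.
for name, query_ms, index_ms in sorted(rows, key=lambda r: r[1] + r[2], reverse=True)[:10]:
    print(f"{name}: query={query_ms}ms index={index_ms}ms")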
I hope I could help you in some way.

ElasticSearch/Logstash/Kibana How to deal with spikes in log traffic

What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?
We use a standard ELK (ElasticSearch/Logstash/Kibana) set-up in AWS for our website's logging needs.
We have an autoscaling group of Logstash instances behind a load balancer, which logs to an autoscaling group of ElasticSearch instances behind another load balancer. We then have a single instance serving Kibana.
For day-to-day business we run 2 Logstash instances and 2 ElasticSearch instances.
Our website experiences short periods of high traffic during events - our traffic increases by about 2000% during these events. We know about these events well in advance.
Currently we just increase the number of ElasticSearch instances temporarily during the event. However, we have had issues where we have subsequently scaled down too quickly, meaning we have lost shards and corrupted our indexes.
I've been thinking of setting the auto_expand_replicas setting to "1-all" to ensure each node has a copy of all the data, so we don't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently only keep about 2 weeks of log data - this works out at around 50 GB in all.
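For reference, applying the setting being considered here would look roughly like the following sketch (Python elasticsearch client; the endpoint and index pattern are placeholders, not taken from the question):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Expand replicas so every data node ends up holding a full copy of each index.
es.indices.put_settings(
    index="logstash-*",  # assumed index pattern
    body={"index": {"auto_expand_replicas": "1-all"}},
)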
I've also seen people mention using a separate autoscaling group of non-data nodes to deal with increases in search traffic, while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the event I previously mentioned?
My Advice
Your best bet is to use Redis as a broker between Logstash and Elasticsearch.
This is described in some old Logstash docs but is still pretty relevant.
Yes, you will see some delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, as the latency between Redis and Logstash is relatively small. In my experience Logstash tends to work through the backlog on Redis pretty quickly.
This kind of setup is also more robust: even if Logstash goes down, you're still accepting the events through Redis.
Just scaling Elasticsearch
As to your question on whether or not extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're seeing lots of searches (reads) being performed, as they delegate the search to all the data nodes, and then aggregate the results before sending them back to the client. They take away the load of aggregating the results from the data nodes.
Writes will always involve your data nodes.
I don't think adding and removing nodes is a great way to cater for this.
You can try to tweak the thread pools and queues in your peak periods. Let's say normally you have the following:
threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 30
    queue_size: 1000
So you have an equal number of search and index threads available. Just before your peak time, you can change the settings on the fly to the following:
threadpool:
  index:
    type: fixed
    size: 50
    queue_size: 2000
  search:
    type: fixed
    size: 10
    queue_size: 500
Now you have a lot more threads doing indexing, allowing for faster indexing throughput, while search is put on the back burner. For good measure I've also increased the queue_size to allow more of a backlog to build up. This might not work as expected, though, and experimentation and tweaking are recommended.
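If you are on an older Elasticsearch release where thread pool settings were still dynamic, the on-the-fly change could be pushed as a transient cluster settings update, roughly as sketched below with the Python client. Note that in recent Elasticsearch versions thread pool settings are static node settings and have to go into elasticsearch.yml instead, so treat this purely as an illustration.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

# Transient settings are reset after a full cluster restart, which suits a temporary peak.
es.cluster.put_settings(
    body={
        "transient": {
            "threadpool.index.size": 50,
            "threadpool.index.queue_size": 2000,
            "threadpool.search.size": 10,
            "threadpool.search.queue_size": 500,
        }
    }
)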

Elastic search runtime metrics

My question is more research related.
We have Elasticsearch handling various tasks, including taking log entries from remote clients. The problem is that there are times when the clients overload Elasticsearch.
Is there a way to query ES for runtime metrics like the number of queries in the last n minutes, and so on? I'm hoping we can use these to throttle the client logging as load increases.
Data on the number of search and get requests per second can be obtained by querying the indices stats.
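As an illustration, one way to get a "requests in the last n minutes" figure is to sample the indices stats twice and take the difference. The sketch below uses the Python elasticsearch client with a placeholder endpoint and a one-minute window, both of which are assumptions.

import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

def totals():
    # _all/total aggregates the search and get counters across every index.
    s = es.indices.stats(metric="search,get")["_all"]["total"]
    return s["search"]["query_total"], s["get"]["total"]

queries_before, gets_before = totals()
time.sleep(60)  # sampling window; adjust to the n minutes you care about
queries_after, gets_after = totals()

print(f"searches/min: {queries_after - queries_before}, gets/min: {gets_after - gets_before}")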
There are multiple tools that provide elasticsearch monitoring, most of them open-source. Having a look at their source code may be helpful.
Please also note that throttling requests client-side based on Elasticsearch stats may not be the optimal solution, as it is hard to coordinate across a variable number of clients. Using circuit breakers that trigger on request timeouts may be more robust.
Another option is to set up a reverse proxy in front of Elasticsearch. Moreover, some problems related to a high volume of indexing requests can be solved by throttling I/O for merge operations in Elasticsearch itself, as is discussed here.
Try using LucidWorks SiLK instead - it uses Solr and that's more scalable. Download it from here: http://www.lucidworks.com/lucidworks-silk

elasticsearch vs hbase/hadoop for realtime statistics

I'm logging millions of small log documents weekly in order to:
run ad hoc queries for data mining
join, compare, filter and calculate values
run many, many full-text searches with Python
run these operations on all the millions of docs, sometimes every day
My first thought was to put all the docs in HBase/HDFS and run Hadoop jobs to generate the stats results.
The problem is that some of the results must be near real-time.
So, after some research, I discovered Elasticsearch, and now I'm thinking about transferring all the millions of documents and using DSL queries to generate the stats results.
Is this a good idea? Elasticsearch seems to handle millions/billions of documents easily.
For real-time search analytics, Elasticsearch is a good choice.
It is definitely easier to set up and handle than Hadoop/HBase/HDFS.
A good Elasticsearch vs. HBase comparison: http://db-engines.com/en/system/Elasticsearch%3BHBase

Elasticsearch Multiple Cluster Search

I just watched Rafal Kuc's presentation and would like to use it as the basis of an Elasticsearch question.
If I added 50 million documents per day to a cluster where each day creates a new index (the time-based data design pattern), that would eventually get pretty big. For example's sake, we'll put the average document at 15 KB.
Now let's say I needed to do that for 10 years. Eventually, I would need to create multiple clusters. Can I create multiple clusters in ES and search them all simultaneously? Could I use an alias for something like this, or is it not possible?
No, I think a search via the API or your client of choice (Java/Python/etc.) is going to be against a single cluster.
Your client could make multiple requests, one to each cluster, perhaps if you organized your clusters by year?
In theory a cluster could just grow forever, although at some point I would think the overhead of scattering and gathering a query across N nodes (where N is very, very large) would cause problems.
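To make the one-request-per-cluster idea concrete, a client-side fan-out could look roughly like the sketch below. The yearly cluster endpoints, index pattern, and query are hypothetical, and the merge is a simple score re-sort rather than a proper distributed search.

from elasticsearch import Elasticsearch

# Hypothetical per-year clusters, as suggested above.
clusters = {
    "logs-2013": Elasticsearch("http://cluster-2013.example.com:9200"),
    "logs-2014": Elasticsearch("http://cluster-2014.example.com:9200"),
}

query = {"query": {"match": {"message": "error"}}}

# One request per cluster; collect the hits client-side.
hits = []
for name, client in clusters.items():
    resp = client.search(index="logs-*", body=query, size=10)
    hits.extend(resp["hits"]["hits"])

# Re-sort the combined hits by score so the merged list behaves like one result set.
hits.sort(key=lambda h: h["_score"] or 0, reverse=True)
for hit in hits[:10]:
    print(hit["_index"], hit["_score"])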
