Is there a way to find out which metrics cost the most and determine the following factors:
who owns it?
the upstream calls related to it
anything else you can think of?
It's a big project containing multiple clusters. Is there a way to break down the cost per metric so I can identify the most expensive ones?
Thanks!
In my app I collect a lot of metrics: hardware/native system metrics (such as CPU load, available memory, swap memory, network IO in terms of packets and bytes sent/received, etc.), JVM metrics (garbage collections, heap size, thread utilization, etc.), and app-level metrics (instrumentations that only have meaning to my app, e.g. # orders per minute, etc.).
Throughout the week, month, and year I see trends/patterns in these metrics. For instance, when cron jobs all kick off at midnight I see CPU and disk thrashing as reports are being generated, etc.
I'm looking for a way to assess/evaluate metrics as healthy/normal vs. unhealthy/abnormal in a way that takes these patterns into consideration. For instance, if CPU spikes around midnight (+/- 5 minutes) each night, that should be considered "normal" and not set off alerts. But if CPU pins during a "low tide" in the day, say between 11:00 AM and noon, that should definitely raise some red flags.
I have the ability to store my metrics in a time-series database, if that helps kickstart this analytical process at all, but I don't have the foggiest clue as to what algorithms, methods and strategies I could leverage to establish these cyclical "baselines" that act as a function of time. Obviously, such a system would need to be pre-seeded or even trained with historical data that was mapped to normal/abnormal values (which is why I'm leaning towards a time-series DB as the underlying store), but this is new territory for me and I don't even know what to begin Googling to get back meaningful/relevant/educated solution candidates in the search results. Any ideas?
You could categorize each metric (CPU load, available memory, swap memory, network IO) by day and time as good or bad for each metric.
Come up with a data set for a given time frame containing the metric values and whether each one is good or bad. Train a model using 70% of the data, with the good/bad answers included.
Then test the trained model on the other 30% of the data, without the answers, to see whether it predicts the right results (good/bad). You could use a classification algorithm.
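As a minimal sketch of that train/test idea (every name here is hypothetical, the features are just time-of-day plus the metric value, and a plain 1-nearest-neighbour classifier stands in for whatever algorithm you end up choosing):

import java.util.*;

class LabeledSample {
    final double minuteOfDay;   // 0..1439, captures the time-of-day pattern
    final double value;         // the metric value, e.g. CPU load
    final boolean good;         // the label you assigned to historical data
    LabeledSample(double minuteOfDay, double value, boolean good) {
        this.minuteOfDay = minuteOfDay;
        this.value = value;
        this.good = good;
    }
}

public class BaselineClassifier {
    public static void main(String[] args) {
        List<LabeledSample> samples = loadLabeledHistory();
        Collections.shuffle(samples, new Random(42));
        int split = (int) (samples.size() * 0.7);
        List<LabeledSample> train = samples.subList(0, split);
        List<LabeledSample> test = samples.subList(split, samples.size());

        int correct = 0;
        for (LabeledSample s : test) {
            if (predict(train, s.minuteOfDay, s.value) == s.good) correct++;
        }
        System.out.printf("accuracy on the held-out 30%%: %.2f%n",
                test.isEmpty() ? 0.0 : (double) correct / test.size());
    }

    // 1-nearest-neighbour: copy the label of the most similar historical sample
    static boolean predict(List<LabeledSample> train, double minuteOfDay, double value) {
        LabeledSample best = null;
        double bestDistance = Double.MAX_VALUE;
        for (LabeledSample s : train) {
            double dTime = (s.minuteOfDay - minuteOfDay) / 1440.0; // normalise to roughly 0..1
            double dValue = (s.value - value) / 100.0;             // assumes values roughly 0..100
            double distance = dTime * dTime + dValue * dValue;
            if (distance < bestDistance) { bestDistance = distance; best = s; }
        }
        return best != null && best.good;
    }

    // stub: in practice, read (timestamp, value, good/bad) rows out of the time-series DB
    static List<LabeledSample> loadLabeledHistory() {
        return new ArrayList<>();
    }
}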
I am using Codahale metrics for monitoring purposes. Let's say there is a spike in latency at some point, and later no values are reported because there is no traffic; the value in the graph stays as is (I am using a histogram). At times it gives the impression that the spike persists and we might need to address it, but it actually means that no values were reported after that point and hence the graph doesn't decay. Am I missing a config parameter in this case, or is this behaviour expected?
The way we update the metrics is
metrics.processingTime.update(processingTime);
So, when there is no traffic, we don't update this metric.
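Presumably the histogram was created with something like the following (the class and registry names here are guesses); the default registry call backs the histogram with an exponentially decaying reservoir, which is what produces the behaviour described in the answer below:

import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;

public class Metrics {
    private final MetricRegistry registry = new MetricRegistry();

    // registry.histogram(...) uses an ExponentiallyDecayingReservoir by default,
    // so with no new updates the snapshot keeps reflecting the old samples
    public final Histogram processingTime = registry.histogram("processingTime");
}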
I know that the histogram takes into consideration datapoints from the past (over an irregular period of time) in order to display a statistical picture of the data.
When there are no new datapoints, only the old samples (including the outlier) are taken into consideration, and they keep being averaged over and over.
The meters have the same behaviour, displaying the data through moving averages over 1, 5 and 15 minutes.
The solution in the histogram case is to use HdrHistogram and flush it periodically.
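A minimal sketch of that flush-periodically idea, using HdrHistogram's Recorder directly instead of a Codahale reservoir (the 10-second interval, the metric values and the printing are placeholders for however you actually record and report):

import org.HdrHistogram.Histogram;
import org.HdrHistogram.Recorder;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ProcessingTimeReporter {
    // 3 significant decimal digits of precision for recorded values
    private final Recorder recorder = new Recorder(3);
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // every 10 seconds, grab the interval histogram and reset the recorder;
        // an interval with no traffic then reports an empty histogram instead of the stale spike
        scheduler.scheduleAtFixedRate(this::flush, 10, 10, TimeUnit.SECONDS);
    }

    public void update(long processingTimeMillis) {
        recorder.recordValue(processingTimeMillis);
    }

    private void flush() {
        Histogram interval = recorder.getIntervalHistogram();
        // report these numbers to the graphing backend instead of printing them
        System.out.printf("count=%d p99=%d max=%d%n",
                interval.getTotalCount(),
                interval.getValueAtPercentile(99.0),
                interval.getMaxValue());
    }
}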
With Elasticsearch I know I can do some nice time-series data queries and get the mean/max, etc.:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-facets-statistical-facet.html
Is it possible, though, to only include the 90th percentile in that calculation, and in Kibana in particular?
Any thoughts on how this could be done?
Elasticsearch doesn't currently support percentiles (including median).
Percentiles are much harder to compute than simple statistics in a distributed environment. Let's assume you have 2 shards. If you ask both of them for the sum of their values and the number of values, you can compute the global average: ($sum1 + $sum2) / ($value_count1 + $value_count2).
On the other hand, if you want to compute the median, the only way to compute it accurately is to get all values from both shards, sort them and take the middle one. This would require lots of memory and network bandwidth.
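A toy illustration of that difference (the shard contents are made up): the mean needs only a (sum, count) pair from each shard, whereas an exact median needs every value shipped to one place.

import java.util.stream.LongStream;

public class ShardStats {
    public static void main(String[] args) {
        // pretend these values live on two different shards
        long[] shard1 = {3, 9, 12};
        long[] shard2 = {1, 5};

        // mean: each shard only has to send back two numbers
        long sum1 = LongStream.of(shard1).sum();
        long sum2 = LongStream.of(shard2).sum();
        double mean = (double) (sum1 + sum2) / (shard1.length + shard2.length);

        // exact median: every single value has to travel to the coordinating node
        long[] all = LongStream.concat(LongStream.of(shard1), LongStream.of(shard2))
                .sorted().toArray();
        long median = all[all.length / 2];

        System.out.println("mean=" + mean + " median=" + median);
    }
}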
Fortunately there are algorithms that allow computing good approximations of percentiles with limited memory usage, and we are in particular looking into t-digest, so it is quite likely that (approximate) percentiles will be supported in a future release of Elasticsearch.
I have been exploring the Graphite graphing tool for showing metrics from multiple servers, and it seems that the 'recommended' way is to send all metrics data to StatsD first. StatsD aggregates the data and sends it to graphite (or rather, Carbon).
In my case, I want to do simple aggregations like sum and average on metrics across servers and plot that in graphite. Graphite comes with a Carbon aggregator which can do this.
StatsD does not even provide aggregation of the kind I am talking about.
My question is - should I use statsd at all for my use case? Anything I am missing here?
StatsD operates over UDP, which removes the risk of carbon-aggregator.py being slow to respond and introducing latency in your application. In other words, loose coupling.
StatsD supports sampling of inbound metrics, which is useful when you don't want your aggregator to take 100% of all data points to compute descriptive statistics. For high-volume code sections, it is common to use 0.5%-1% sample rates so as to not overload StatsD.
StatsD has broad client-side support.
tldr: you will probably want statsd (or carbon-c-relay) if you ever want to look at the server-specific sums or averages.
carbon aggregator is designed to aggregate values from multiple metrics together into a single output metric, typically to increase graph rendering performance. statsd is designed to aggregate multiple data points in a single metric, because otherwise graphite only stores the last value reported in the minimum storage resolution.
statsd example:
assume that your graphite storage-schemas.conf file has a minimum retention of 10 seconds (the default) and your application is sending approximately 100 data points every 10 seconds to services.login.server1.count with a value of 1. without statsd, graphite would only store the last count received in each 10-second bucket. after the 100th message is received, the other 99 data points would have been thrown out. however, if you put statsd between your application and graphite, then it will sum all 100 datapoints together before sending the total to graphite. so, without statsd your graph only indicates whether a login occurred during the 10-second interval. with statsd, it indicates how many logins occurred during that interval.
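For reference, the application side of that example could look roughly like this; the sketch hand-rolls the StatsD plaintext protocol over UDP (the host, port and sample rate are assumptions, and in practice you would normally use one of the client libraries):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class StatsdCounterExample {
    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            InetAddress statsdHost = InetAddress.getByName("localhost"); // assumed host
            int statsdPort = 8125;                                       // statsd's default port

            // plain counter increment: statsd sums these and flushes one total per interval to graphite
            send(socket, statsdHost, statsdPort, "services.login.server1.count:1|c");

            // sampled counter: only sent for ~1% of logins; statsd scales the total back up by 1/0.01
            send(socket, statsdHost, statsdPort, "services.login.server1.count:1|c|@0.01");
        }
    }

    static void send(DatagramSocket socket, InetAddress host, int port, String payload) throws Exception {
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        socket.send(new DatagramPacket(bytes, bytes.length, host, port));
    }
}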
carbon aggregator example: assume you have 200 different servers reporting 200 separate metrics (services.login.server1.response.time, services.login.server2.response.time, etcetera). on your operations dashboard you show a graph of the average across all servers using this graphite query: weightedAverage(services.login.server*.response.time, services.login.server*.response.count, 2). unfortunately, rendering this graph takes 10 seconds. to solve this problem, you can add a carbon aggregator rule to pre-calculate the average across all your servers and store the value in a new metric. now you can update your dashboard to simply pull a single metric (e.g. services.login.response.time). the new metric renders almost instantly.
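for that example, the carbon aggregator rules could look something like this (the output metric names and the 10-second frequency are just illustrative; the format is output_template (frequency) = method input_pattern):

<svc>.login.response.time (10) = avg <svc>.login.*.response.time
<svc>.login.response.count (10) = sum <svc>.login.*.response.count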
side notes:
the aggregation rules in storage-aggregation.conf apply to all storage intervals in storage-schemas.conf except the first (smallest) retention period for each retention string. it is possible to use carbon-aggregator to aggregate data points within a metric for that first retention period. unfortunately, aggregation-rules.conf uses "glob" patterns rather than regex patterns, so you need to add a separate aggregation-rules.conf entry for every path depth and aggregation type. the advantage of statsd is that the client sending the metric can specify the aggregation type rather than encoding it in the metric path. that gives you the flexibility to add a new metric on the fly regardless of metric path depth. if you wanted to configure carbon-aggregator to do statsd-like aggregation automatically when you add a new metric, your aggregation-rules.conf file would look something like this:
<n1>.avg (10)= avg <n1>.avg$
<n1>.count (10)= sum <n1>.count$
<n1>.<n2>.avg (10)= avg <n1>.<n2>.avg$
<n1>.<n2>.count (10)= sum <n1>.<n2>.count$
<n1>.<n2>.<n3>.avg (10)= avg <n1>.<n2>.<n3>.avg$
<n1>.<n2>.<n3>.count (10)= sum <n1>.<n2>.<n3>.count$
...
<n1>.<n2>.<n3> ... <n99>.count (10)= sum <n1>.<n2>.<n3> ... <n99>.count$
notes: the trailing "$" is not needed in graphite 0.10+ (currently pre-release); see the relevant patch on github and the standard documentation on aggregation rules.
the weightedAverage function is new in graphite 0.10, but generally the averageSeries function will give a very similar number as long as your load is evenly balanced. if you have some servers that are both slower and service fewer requests or you are just a stickler for precision, then you can still calculate a weighted average with graphite 0.9. you just need to build a more complex query like this:
divideSeries(sumSeries(multiplySeries(a.time,a.count), multiplySeries(b.time,b.count)),sumSeries(a.count, b.count))
if statsd is run on the client box this also reduces network load. although, in theory, you could run carbon-aggregator on the client side too. however, if you use one of the statsd client libraries, you can also use sampling to reduce the load on your application machine's cpu (e.g. creating loopback udp packets). furthermore, statsd can automatically perform multiple different aggregations on a single input metric (sum, mean, min, max, etcetera)
if you use statsd on each app server to aggregate response times, and then re-aggregate those values on the graphite server using carbon aggregator, you end up with an average response time weighted by app server rather than by request. obviously, this only matters when aggregating with a mean or top_90 aggregation rule, not min, max or sum. also, for the mean it only matters if your load is unbalanced. as an example: assume you have a cluster of 100 servers, and suddenly 1 server is sent 99% of the traffic. consequently, the response times quadruple on that 1 server, but remain steady on the other 99 servers. if you use client-side aggregation, your overall metric would only go up about 3%. but if you do all your aggregation in a single server-side carbon aggregator, then your overall metric would go up by about 300%.
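spelling out that arithmetic: call the steady response time T. weighting by app server gives (99*T + 1*4T)/100 = 1.03T, i.e. roughly +3%; weighting by request gives 0.99*4T + 0.01*T = 3.97T, i.e. roughly +300%.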
carbon-c-relay is essentially a drop-in replacement for carbon-aggregator written in c. it has improved performance and regex-based matching rules. the upshot being that you can do both statsd-style datapoint aggregation and carbon-relay style metric aggregation and other neat stuff like multi-layered aggregation all in the same simple regex-based config file.
if you use the cyanite back-end instead of carbon-cache, then cyanite will do the intra-metric averaging for you in memory (as of version 0.5.1) or at read time (in the version <0.1.3 architecture).
If the Carbon aggregator offers everything you need, there is no reason not to use it. It has two basic aggregation functions (sum and average), and indeed these are not covered by StatsD. (I'm not sure about the history, but maybe the Carbon aggregator already existed and the StatsD authors did not want to duplicate features?) Receiving data via UDP is also supported by Carbon, so the only thing you would miss would be the sampling, which does not matter if you aggregate by averaging.
StatsD supports different metric types by adding extra aggregate values (e.g. for timers: mean, lower, upper and upper Xth percentile, ...). I like them, but if you don't need them, the Carbon aggregator is a good way to go too.
I have been looking at the source code of the Carbon aggregator and StatsD (and Bucky, a StatsD implementation in Python), and they are all so simple, that I would not worry about resource usage or performance for either choice.
It looks like carbon aggregator and statsd support disjoint sets of features:
statsd supports rate calculation and summation but does not support averaging values
carbon aggregator supports averaging but does not support rate calculation
Because Graphite has a minimum resolution, you cannot save two different values for the same metric within the defined interval. StatsD solves this problem by pre-aggregating them: instead of saying "1 user registered now" and "1 user registered now", it says "2 users registered".
The other reason is performance:
You send data to StatsD via UDP, which is a fire-and-forget protocol: stateless and much faster.
Etsy's StatsD implementation is in NodeJS, which also increases performance a lot.