Performace impact of using setStatsSampleRate/topology.stats.sample.rate - apache-storm

What is the performance impact of setting topology.stats.sample.rate: 1.0 in yaml?
How this works?

topology.stats.sample.rate configures the rate at which a Storm topology statistics would be calculated.
Default value in defaults.yaml is 0.05. This means only five out of 100 events are taken into account.
The value of 1 means each tuple's statistics is going to be calculated.
Is this going to decrease performance? Most likely many will say yes but since each environment is different, I would say it is better to measure it yourself. Increase and decrease the value and measure the throughput of your topology.

Related

Differnces between __execute-count value and values gathered by the Metrics Reporting API v2

I have run a topology, and I used the Meter type in metric Reporting API v2. In the execute method I mark this metric. So it will mark an event whenever the execute method is called. But when I compare this value with the __execute-count, I see huge differences. Does anyone know why this happens?
These are the values from my log which are gathered at the same time:
9:v7 __execute-count {v0:v7=44500}
9:v7 tuple_inRate.count 664129
Update:
When I use the mark method on the Meter metric, I will get different results in comparison with the Counter metric. But still, I do not understand why the values from the counter metric (tuple counter) are not the same as the __execute-count.
As given in this answer, Storms Internal Metrics are just estimated by a percentage of the real data flow. Initially, it uses 5% of incoming tuples to make those estimations. This may lead to inaccuracies for extreme high or low throughputs.
EDIT: The documentation describes the following:
In general all of these tuple count metrics are randomly sub-sampled unless otherwise stated. This means that the counts you see both on the UI and from the built in metrics are not necessarily exact. In fact by default we sample only 5% of the events and estimate the total number of events from that. The sampling percentage is configurable per topology through the topology.stats.sample.rate config. Setting it to 1.0 will make the counts exact, but be aware that the more events we sample the slower your topology will run (as the metrics are counted in the same code path as tuples are processed). This is why we have a 5% sample rate as the default.
EDIT 2 In this post, there is more information about the estimation:
The way it works is that if you choose a sampling rate of 0.05, it will pick a random element of the next 20 events in which to increase the count by 20. So if you have 20 tasks for that bolt, your stats could be off by +-380.
By the way, execute_count is just an increasing number, while your tuple_inRate.count is a rate, isn`t it?

system_cpu_usage value much less than expected

I am monitoring a spring boot application in promethus with metrics generated by micrometer.
For CPU usage, there is metrics 'system_cpu_usage'.
I observe that its value is mostly under 1. Is it expected? Same application when monitored in VisualVM, CPU graph is always above 15 percent range.
Do I need to multiple the value by 100?
Yes. Micrometer system_cpu_usage is reported between 0.0-1.0 (1.0 being cpu completely occupied). Hence you need to multiple the value by 100 to get in percentage terms.

Influxdb(single node) scaling to ~200 writes per second

What is the maximum number of points that can be written to influxdb (single node) per second? Is it feasible to scale influxdb without going for the paid cluster? And should I consider elasticsearch instead of influxdb for time series data (~3000 bytes/sec/user) if I am expecting around 60 concurrent users?
Depends on hardware.
Limiting factors are
Cardinality of series in the DB (total unique series)
WAL disk throughput (this could be put on tmpfs if you don't have SSD)
Data disk throughput (use SSD for best results)
RAM (more is better)
CPU for ingestion, indexing and queries
How far a single node can go largely depends on these and on the workload.
For write-heavy workloads of low cardinality, CPU generally tends to run out faster than anything else, assuming SSDs are used and disk I/O has been optimised accordingly.
After that, cardinality is the biggest limiting factor. Schema design plays a huge role, much bigger than number of nodes.
From some benchmarks I have done, a single node easily scales to ~70K series per second, with CPU being the limiting factor. This was on an old version though, likely higher than that now. Again, largely depends on data and schema design.
It is feasible to scale it without paid cluster by adding separate nodes, but not if you want to keep a homogeneous view (single source of all your data). Scaling vertically (more CPU, RAM) works only as long as cardinality remains consistent, meaning more data points for roughly same number of series.
InfluxDB suggest up to 250K writes / second with 25 queries per second on up to 1M unique queries is feasible on a single node. See hardware guidelines.
For the amount of data you have single node is more than enough - size of data does not matter, number of series does. Avoid elasticsearch for time series data - needs much more infrastructure to handle same amount of data.

Specific Cache Hit Rate calculation

Scenario:
Suppose we have infinite cache memory size. Caching is just limited by timeout, value of this timeout is half an hour. Cache is initially empty.
Problem:
We have 50,000 distinct request. Our system is querying, randomly, at the rate of 15 request/second i.e. 27,000 request in half an hour . What kind of curve or average value of cache hit rate could we expect for first 5 hours?
Note: This scenario is fixed. I need an approach to find out hit rate. If you think tag is wrong, please suggest appropriate tag.
I think you're right and this is a math question (certainly not a programming
problem).
One approach is to consider the extremes -- what is the hit rate for the
first query when the the system starts running? For the second query?
After one second? After 10? After a minute? And what is the likelyhood
that any random query will be found in the cache once the system has been
running a long time?
These are few specific values, and together they give you a curve.
I don't think great numeric precision is necessary; the long-term average
and the shape of the curve is more interesting.

Metrics doesn't decay when no values are reported

I am using codahale metrics for monitoring purposes. Lets say there is a spike in latency at some point and later there are no values reported due to attribute that there are no traffic, the value in the graph stays as is(I am using a histogram). At times it gives a notion that the spike remains and we might need to address it, but it actually means that no values are reported after that and hence the graph doesn't decay. Am I missing any config parameter in this case or is the behaviour expected?
The way we update the metrics is
metrics.processingTime.update(processingTime);
So, when there is no traffic, we don't update this metric.
I know that the histogram takes into consideration datapoints from the past (for an irregular period of time) in order to display a statistical image of the data.
When there are no new datapoints, only the outlier is taken into consideration and averaged on and on.
The meters have the same behavior, displaying the data through moving averages of 1,5,15 minutes.
The solution in the histogram case is to use HDRhistogram and flush it periodically.

Resources