How to measure invocations over time with micrometer - spring-boot

I have a Spring Boot Application which I'm migrating to Micrometer right now.
What I'd like to achieve is to count invocations over time for specific objects.
Let's assume I have a function which creates cars of certain Brands. Then I'd like to measure how many Ford, Skoda, VW and so on I created in the past minute.
In particular, if no Skoda was created between now()-1 and now(), the metric should return 0.
The docs state that I shouldn't use a Counter, since the number of created cars can grow indefinitely while the app is running. A Timer doesn't really fit either, since I'd only be starting it before the constructor invocation and stopping it right after.
I tried a gauge, but this also only gives me absolute numbers:
Arrays.stream(brand).forEach(brand -> metricNames.stream().forEach(name -> {
String id = METRIC_PREFIX + METRIC_SEPARATOR + brand + name;
AtomicInteger summary = Metrics.gauge(id, new AtomicInteger(0));
summary.getAndIncrement();
}));
In Dropwizard there were Meters, but what is the equivalent in Micrometer?

You need a counter:
MeterRegistry metrics...
private final Counter nikeBrandCounter = metrics.counter("brands", "brand", "Nike");
Arrays.stream(brand).forEach(brand -> metricNames.stream().forEach(name -> {
if ("Nike".equals(name)) { // equals() compares contents; == would only compare references
nikeBrandCounter.increment();
}
}));
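The reason a counter is enough here: its value only ever grows, and the per-minute figure is derived from the difference between two samples, which is typically what a Prometheus-style backend computes at query time. The sketch below is a minimal, Micrometer-free illustration of that idea; `BrandCounters` and its methods are hypothetical names, not part of any library.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for per-brand tagged counters; this is NOT Micrometer's API.
final class BrandCounters {
    // Monotonically increasing totals, one per brand.
    private final Map<String, LongAdder> totals = new ConcurrentHashMap<>();
    // The total seen at the previous sample, one per brand.
    private final Map<String, Long> lastSample = new ConcurrentHashMap<>();

    // Called on every car creation; the total never decreases.
    void increment(String brand) {
        totals.computeIfAbsent(brand, b -> new LongAdder()).increment();
    }

    // The absolute value a counter would expose.
    long total(String brand) {
        LongAdder adder = totals.get(brand);
        return adder == null ? 0L : adder.sum();
    }

    // What a backend derives between two samples: current total minus previous total.
    long increaseSinceLastSample(String brand) {
        long now = total(brand);
        Long previous = lastSample.put(brand, now);
        return now - (previous == null ? 0L : previous);
    }
}
```

If no Skoda was created between two samples, the difference is 0, which is the behaviour asked for in the question, even though the counter itself keeps growing.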

Related

KStream to KStream Join- Output record post a configurable time in event of no matching record within the window

Need some opinion/help around one use case of KStream/KTable usage.
Scenario:
I have 2 topics with common key--requestId.
input_time(requestId,StartTime)
completion_time(requestId,EndTime)
The data in input_time is populated at time t, and the data in completion_time is populated at time t+n (n being the time the process takes to complete).
Objective
To compare the time taken for a request by joining data from the two topics, and to raise an alert if a threshold time is breached.
The process may fail, in which case no data arrives on the completion_time topic for the request at all.
In that case we intend to check whether the current time is past a specific threshold (let's say 5s) since the start time.
input_time(req1,100) completion_time(req1,104) --> no alert to be raised as 104-100 < 5(configured value)
input_time(req2,100) completion_time(req2,108) --> alert to be raised with req2,108 as 108-100 >5
input_time(req3,100) completion_time no record--> if current Time is beyond 105 raise an alert with req3,currentSysTime as currentSysTime - 100 > 5
Options Tried.
1) Tried both KTable-KTable and KStream-Kstream outer joins but the third case always fails.
final KTable<String,Long> startTimeTable = builder.table("input_time",Consumed.with(Serdes.String(),Serdes.Long()));
final KTable<String,Long> completionTimeTable = builder.table("completion_time",Consumed.with(Serdes.String(),Serdes.Long()));
KTable<String,Long> thresholdBreached =startTimeTable .outerJoin(completionTimeTable,
new MyValueJoiner());
thresholdBreached.toStream().filter((k,v)->v!=null)
.to("finalTopic",Produced.with(Serdes.String(),Serdes.Long()));
Joiner
public Long apply(Long startTime,Long endTime){
// if input record itself is not available then we cant use any alerting.
if (null==startTime){
log.info("AlertValueJoiner check: the start time itself is null so returning null");
return null;
}
// current processing time is the time used.
long currentTime= System.currentTimeMillis();
log.info("Checking startTime {} end time {} sysTime {}",startTime,endTime,currentTime);
if(null==endTime && currentTime-startTime>5000){
log.info("Alert:No corresponding record from file completion yet currentTime {} startTime {}"
,currentTime,startTime);
return currentTime-startTime;
}else if(null !=endTime && endTime-startTime>5000){
log.info("Alert: threshold breach for file completion startTime {} endTime {}"
,startTime,endTime);
return endTime-startTime;
}
return null;
}
2) Tried the custom logic approach recommended as per the thread
How to manage Kafka KStream to Kstream windowed join?
-- This approach stopped working for scenarios 2 and 3.
Is there any way of handling all three scenarios using the DSL or Processors?
I'm not sure if we can use some kind of punctuator to detect when the window changes, check the stream records in the current window and, if no matching record is found, produce a result with the system time.
Due to the nature of the logic involved, this had to be done with a combination of the DSL and the Processor API:
* Used a custom transformer and a state store to compare against the configured values (cases 1 and 2).
* Added a wall-clock punctuator to handle the third case.
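A rough, Kafka-free sketch of that combination is shown here. A plain map stands in for the state store and `punctuate` stands in for the wall-clock punctuator; `CompletionMonitor` and all of its method names are hypothetical and only illustrate the logic, not the Kafka Streams API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical model of the transformer logic; no Kafka classes are involved.
final class CompletionMonitor {
    private final long thresholdMs;
    // Plays the role of the state store: requestId -> startTime of unfinished requests.
    private final Map<String, Long> pendingStarts = new HashMap<>();
    // Collected alerts in the form "requestId,time".
    private final List<String> alerts = new ArrayList<>();

    CompletionMonitor(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    // A record arrived on input_time.
    void onStart(String requestId, long startTime) {
        pendingStarts.put(requestId, startTime);
    }

    // A record arrived on completion_time (cases 1 and 2).
    void onCompletion(String requestId, long endTime) {
        Long start = pendingStarts.remove(requestId);
        if (start != null && endTime - start > thresholdMs) {
            alerts.add(requestId + "," + endTime);
        }
    }

    // Periodic wall-clock check (case 3): alert on requests that never completed.
    void punctuate(long now) {
        Iterator<Map.Entry<String, Long>> it = pendingStarts.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (now - entry.getValue() > thresholdMs) {
                alerts.add(entry.getKey() + "," + now);
                it.remove(); // alert only once per request
            }
        }
    }

    List<String> alerts() {
        return alerts;
    }
}
```

With a threshold of 5 and the three requests from the question, req1 produces no alert, req2 is reported on completion, and req3 is reported by the first periodic check that runs after time 105.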

Could using changelogs cause a bottleneck for the app itself?

I have a Spring Cloud Kafka Streams application that rekeys incoming data to be able to join two topics, then applies selectKey and mapValues and aggregates the data. Over time the consumer lag seems to increase, and scaling out by adding multiple instances of the app doesn't help at all. With every instance the consumer lag seems to keep increasing.
I scaled the instances up and down between 1 and 18, but no big difference is noticeable. The number of messages it lags behind keeps increasing every 5 seconds, independent of the number of instances.
KStream<String, MappedOriginalSensorData> flattenedOriginalData = originalData
.flatMap(flattenOriginalData())
.through("atl-mapped-original-sensor-data-repartition", Produced.with(Serdes.String(), new MappedOriginalSensorDataSerde()));
//#2. Save modelid and algorithm parts of the key of the errorscore topic and reduce the key
// to installationId:assetId:tagName
//Repartition ahead of time avoiding multiple repartition topics and thereby duplicating data
KStream<String, MappedErrorScoreData> enrichedErrorData = errorScoreData
.map(enrichWithModelAndAlgorithmAndReduceKey())
.through("atl-mapped-error-score-data-repartition", Produced.with(Serdes.String(), new MappedErrorScoreDataSerde()));
return enrichedErrorData
//#3. Join
.join(flattenedOriginalData, join(),
JoinWindows.of(
// allow messages within one second to be joined together based on their timestamp
Duration.ofMillis(1000).toMillis())
// configure the retention period of the local state store involved in this join
.until(Long.parseLong(retention)),
Joined.with(
Serdes.String(),
new MappedErrorScoreDataSerde(),
new MappedOriginalSensorDataSerde()))
//#4. Set instalation:assetid:modelinstance:algorithm::tag key back
.selectKey((k,v) -> v.getOriginalKey())
//#5. Map to ErrorScore (basically removing the originalKey field)
.mapValues(removeOriginalKeyField())
.through("atl-joined-data-repartition");
then the aggregation part:
Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
.as(localStore.getStoreName());
// Set retention of changelog topic
materialized.withLoggingEnabled(topicConfig);
// Configure how windows looks like and how long data will be retained in local stores
TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
localStore.getTimeUnit(), Long.parseLong(topicConfig.get(RETENTION_MS)));
// Processing description:
// 2. With the groupByKey we group the data on the new key
// 3. With windowedBy we split up the data in time intervals depending on the provided LocalStore enum
// 4. With reduce we determine the maximum value in the time window
// 5. Materialized will make it stored in a table
stream.groupByKey()
.windowedBy(configuredTimeWindows)
.reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), materialized);
}
private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long retentionMs) {
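// Use the instance returned by until(): in Kafka Streams versions where TimeWindows
// is immutable, discarding the return value drops the retention setting.
return TimeWindows.of(windowSizeMs).until(retentionMs);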
}
I would expect that increasing the number of instances would reduce the consumer lag considerably.
So in this setup there are multiple topics involved such as:
* original-sensor-data
* error-score
* kstream-joinother
* kstream-jointhis
* atl-mapped-original-sensor-data-repartition
* atl-mapped-error-score-data-repartition
* atl-joined-data-repartition
The idea is to join original-sensor-data with error-score. The rekeying requires the atl-mapped-* topics. The join then uses the kstream-* topics, and its result is written to atl-joined-data-repartition. The aggregation after that also creates topics, but I'm leaving those out of scope for now.
original-sensor-data -> atl-mapped-original-sensor-data-repartition -> kstream-jointhis -> atl-joined-data-repartition
error-score -> atl-mapped-error-score-data-repartition -> kstream-joinother -> atl-joined-data-repartition
Since I introduced the join and the atl-mapped topics, increasing the number of instances no longer seems to have much effect, so I'm wondering whether this topology could become its own bottleneck. Judging by the consumer lag, the original-sensor-data and error-score topics have a much smaller lag compared to, for instance, the atl-mapped-* topics. Is there a way to cope with this by removing these changelogs, or would that make it impossible to scale?

gauge-like calculation of DistributionSummary with micrometer and prometheus

I want to get a DistributionSummary over some domain data that does not change very frequently. So it is not about monitoring requests or sth like that.
Let's take number of seats in an office as example. The value for each office can change from time to time and there can be new offices and also offices get removed.
So now I need the current DistributionSummary over all offices, which needs to be calculated every time I think (similar to a Gauge).
I have a Spring Boot 2 app with micrometer and collect the metrics with prometheus and display them in grafana.
What I tried so far:
When I register a DistributionSummary, I can record all the values once during startup... this gives me the distribution, but calculated values like max get lost over time and I cannot update the DistributionSummary (recording new offices would work, but not changing existing ones)
// during startup
seatsInOffice = DistributionSummary.builder("office.seats")
.publishPercentileHistogram()
.sla(1, 5, 20, 50)
.register(meterRegistry);
officeService.getAllOffices().forEach(o -> seatsInOffice.record(o.getNumberOfSeats()));
I also tried to use a @Scheduled task to remove and completely rebuild the DistributionSummary. This seems to work, but feels wrong somehow. Would that be a recommended approach? It would probably also need some synchronisation so that the metrics are not collected between removing and recalculating the distribution.
@Scheduled(fixedRate = 5 * 60 * 1000)
public void recalculateMetrics() {
if (seatsInOffice != null) {
meterRegistry.remove(seatsInOffice);
}
seatsInOffice = DistributionSummary.builder("office.seats")
.publishPercentileHistogram()
.sla(1, 5, 20, 50)
.register(meterRegistry);
officeService.getAllOffices().forEach(o -> seatsInOffice.record(o.getNumberOfSeats()));
}
Another problem I just noticed with this approach: the /actuator/prometheus endpoint still returns the values for the old (removed) metrics, so everything is there multiple times.
For sth like sla borders I could also use some gauges to provide the values (by calculating them myself), but that would not give me quantiles. Is it possible to create a new DistributionSummary without registering it and just provide the values it collected somehow?
meterRegistry.gauge("office.seats", Tags.of("le", "1"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(1).size());
meterRegistry.gauge("office.seats", Tags.of("le", "5"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(5).size());
meterRegistry.gauge("office.seats", Tags.of("le", "20"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(20).size());
meterRegistry.gauge("office.seats", Tags.of("le", "50"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(50).size());
I would like to have a DistributionSummary that takes a lambda or sth like that to get the values. But maybe these tools are not made for this usecase and I should use sth else. Can you recommend sth?
DistributionSummary has a distributionStatisticExpiry setting that controls how often the recorded data is rotated, which could serve as a workaround.
However, PrometheusDistributionSummary does not use this setting:
case Prometheus:
histogram = new TimeWindowFixedBoundaryHistogram(clock, DistributionStatisticConfig.builder()
.expiry(Duration.ofDays(1825)) // effectively never roll over
.bufferLength(1)
.build()
.merge(distributionStatisticConfig), true);
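The gauge-per-bucket idea from the question can be extended so that maximum and quantiles are also recomputed from the current data on every read. The following is only a sketch under that assumption; `SnapshotDistribution` is a hypothetical helper, not a Micrometer class, and it counts values less than or equal to the bound, which is how an `le` label is normally read.

```java
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical helper: every call recomputes its figure from the current values,
// so changed or removed offices are reflected immediately (gauge-like behaviour).
final class SnapshotDistribution {
    private final Supplier<List<Integer>> currentValues;

    SnapshotDistribution(Supplier<List<Integer>> currentValues) {
        this.currentValues = currentValues;
    }

    // Cumulative bucket: number of values less than or equal to the bound.
    long countAtMost(int bound) {
        return currentValues.get().stream().filter(v -> v <= bound).count();
    }

    // Largest current value, or 0 when there are no values.
    int max() {
        return currentValues.get().stream().mapToInt(Integer::intValue).max().orElse(0);
    }

    // Nearest-rank quantile for 0 < q <= 1, or 0 when there are no values.
    int quantile(double q) {
        List<Integer> sorted = currentValues.get().stream().sorted().collect(Collectors.toList());
        if (sorted.isEmpty()) {
            return 0;
        }
        int rank = (int) Math.ceil(q * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }
}
```

Each method could back one gauge registered the same way as the `le` gauges in the question, with the supplier reading the seat counts from officeService. The trade-off is that every scrape recomputes the figures, which is only reasonable for small, slowly changing data sets like this one.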

How do I add items with a score above x to goodItems for precision metric in Lenskit 3.0?

I'd like to add the precision metric and use only items with a rating higher than 4.0 as 'goodItems'.
In Lenskit 2 this could be done by:
metric precision {
listSize 10
candidates ItemSelectors.addNRandom(ItemSelectors.testItems(), 100)
exclude ItemSelectors.trainingItems()
goodItems ItemSelectors.testRatingMatches(Matchers.greaterThanOrEqualTo(4.0d))
}
Now I'm trying to do the same in Lenskit 3 with gradle, but obviously
metric('pr') {
goodItems 'ItemSelectors.testRatingMatches(Matchers.greaterThanOrEqualTo(4.0d))'
}
doesn't work, since there is no ItemSelectors class in Lenskit 3.0.
How can I link the goodItems with the appropriate items and discard low-rated items in order to achieve a correct precision value?
As told by Mr. Ekstrand, you can select the good items by adding the following line to the gradle build file.
goodItems 'user.testHistory.findAll({ it instanceof org.lenskit.data.ratings.Rating && it.value >= 4 })*.itemId'
However, this returns an Object. In Itemselector.class the result is cast to a Set, which doesn't work because the returned object is of type ArrayList. If I'm correct, this means the Object needs to be cast to an ArrayList and then copied into a Set. I did this by copying the Itemselector class and replacing:
Set<Long> set = (Set<Long>) script.run();
by:
Set<Long> set = new HashSet<Long>((ArrayList<Long>)script.run());
This returns the correct items from my test-set, rated above 4.0
This goodItems should work:
user.testHistory.findAll({ it instanceof org.lenskit.data.ratings.Rating && it.value >= 4 })*.itemId.toSet()
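Both fixes address the same problem: the expression yields a list, and a list cannot be cast to a Set. The plain-Java sketch below reproduces that failure and the copy-into-a-set fix; `GoodItemsDemo` and `scriptResult` are hypothetical stand-ins for the evaluated script, not Lenskit code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class GoodItemsDemo {
    // Stand-in for the script result: an ArrayList of item IDs typed as Object.
    static Object scriptResult() {
        return new ArrayList<>(Arrays.asList(10L, 20L, 20L));
    }

    // Casting the list directly to Set fails at runtime.
    static boolean directCastFails() {
        try {
            @SuppressWarnings("unchecked")
            Set<Long> set = (Set<Long>) scriptResult();
            return false; // not reached: the cast above throws
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Copying the list into a HashSet works and also removes duplicates.
    static Set<Long> copyIntoSet() {
        @SuppressWarnings("unchecked")
        List<Long> list = (List<Long>) scriptResult();
        return new HashSet<>(list);
    }
}
```

Calling toSet() inside the expression does the same conversion on the script side, so the evaluator class does not need to be copied or patched.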

What is the difference between TYPE_STEP_COUNT_DELTA and AGGREGATE_STEP_COUNT_DELTA data type in Google Fit Android Api?

The Google Fit API describes two of these data types of the Sensor API. However both seem to be returning the same data. Can anyone explain the difference?
TYPE_STEP_COUNT_DELTA:
In the com.google.step_count.delta data type, each data point represents the number of steps taken since the last reading.
AGGREGATE_STEP_COUNT_DELTA:
Aggregate number of steps during a time interval.
You can see more here:
https://developers.google.com/android/reference/com/google/android/gms/fitness/data/DataType
// Setting a start and end date using a range of 1 week before this moment.
Calendar cal = Calendar.getInstance();
Date now = new Date();
cal.setTime(now);
long endTime = cal.getTimeInMillis();
cal.add(Calendar.WEEK_OF_YEAR, -1);
long startTime = cal.getTimeInMillis();
java.text.DateFormat dateFormat = getDateInstance();
Log.i(TAG, "Range Start: " + dateFormat.format(startTime));
Log.i(TAG, "Range End: " + dateFormat.format(endTime));
DataReadRequest readRequest = new DataReadRequest.Builder()
// The data request can specify multiple data types to return, effectively
// combining multiple data queries into one call.
// In this example, it's very unlikely that the request is for several hundred
// datapoints each consisting of a few steps and a timestamp. The more likely
// scenario is wanting to see how many steps were walked per day, for 7 days.
.aggregate(DataType.TYPE_STEP_COUNT_DELTA, DataType.AGGREGATE_STEP_COUNT_DELTA)
// Analogous to a "Group By" in SQL, defines how data should be aggregated.
// bucketByTime allows for a time span, whereas bucketBySession would allow
// bucketing by "sessions", which would need to be defined in code.
.bucketByTime(1, TimeUnit.DAYS)
.setTimeRange(startTime, endTime, TimeUnit.MILLISECONDS)
.build();
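Going by the two quoted descriptions, a delta data point holds the steps since the previous reading, while an aggregate data point holds the total for a whole interval. The sketch below shows that relationship in plain Java by summing delta points into one-day buckets. `StepBuckets` is a hypothetical class that does not use the Google Fit API, and it aligns buckets to whole days since the epoch as a simplification.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.TimeUnit;

final class StepBuckets {
    // Each row of deltas is {timestampMillis, stepsSinceLastReading}.
    // Returns bucketStartMillis -> total steps in that one-day bucket.
    static Map<Long, Long> aggregateByDay(long[][] deltas) {
        long dayMs = TimeUnit.DAYS.toMillis(1);
        Map<Long, Long> totals = new TreeMap<>();
        for (long[] point : deltas) {
            // Start of the day bucket this reading falls into.
            long bucketStart = Math.floorDiv(point[0], dayMs) * dayMs;
            totals.merge(bucketStart, point[1], Long::sum);
        }
        return totals;
    }
}
```

When a single day is requested and only one delta reading falls into it, both values are identical, which may be why the two types appear to return the same data.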

Resources