I have a simple Spring Boot app with the following config (the project is available here on GitHub):
management:
  metrics:
    export:
      simple:
        mode: step
  endpoints:
    web:
      exposure:
        include: "*"
The above config creates a SimpleMeterRegistry and configures its metrics to be step-based, with a 60-second step. I have one script that sends 50-100 requests per second to the service's dummy endpoint, and another script that polls the data from /actuator/metrics/http.server.requests every X seconds. When the latter script runs every 60 seconds everything works as expected, but when it runs every 120 seconds, the response always contains zeros for the TOTAL_TIME and COUNT metrics.
Can anyone explain this behavior?
I have read the documentation here. The picture below could indicate that a registry will try to aggregate the data for the previous interval only if pollAsRate is called during the current interval. That would explain why it does not work for a 120-second interval. But this is just my assumption; does anyone know what is really happening here?
Spring boot version: 2.1.7.RELEASE
UPDATE
I did a similar test with management.metrics.export.simple.step=10s: it works fine when the polling interval is 10s and does not work when it is 20s. With a 15s interval it works only sporadically. So it's definitely related to the step size and the polling frequency.
MAX, TOTAL_TIME and COUNT are properties of Statistic.
DistributionStatisticConfig has .expiry(Duration.ofMinutes(2)), which resets the measurement to 0 if no request has been made for the last 2 minutes (120 seconds).
Methods such as public TimeWindowMax(Clock clock, ...) and private void rotate() implement exactly this behavior. You may see the implementation here.
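That expiry comes from the default DistributionStatisticConfig and can be overridden with a MeterFilter. A minimal sketch, assuming registry is the application's MeterRegistry and using an arbitrary 5-minute value (not from the original question):

// Sketch: widen the distribution-statistic expiry for one meter via a MeterFilter.
// The meter name and the 5-minute value are illustrative.
registry.config().meterFilter(new MeterFilter() {
    @Override
    public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
        if ("http.server.requests".equals(id.getName())) {
            return DistributionStatisticConfig.builder()
                    .expiry(Duration.ofMinutes(5)) // keep MAX samples around longer than the 2-minute default
                    .build()
                    .merge(config);
        }
        return config;
    }
});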
More Detailed Answer
Finally figured out what is happening.
On every request to /actuator/metrics, MetricsEndpoint merges the measurements (see here). That is done by collecting the values of all meters with measurement.getValue(). StepMeasurement.getValue() does not simply return the value: it updates the current and previous intervals and counts, and rolls the count (see here and here).
StepMeasurement.getValue
public double getValue() {
    double absoluteCount = (Double)this.f.get();                       // cumulative value reported by the underlying meter
    double inc = Math.max(0.0D, absoluteCount - this.lastCount.sum()); // delta accumulated since the last call
    this.lastCount.add(inc);
    this.value.getCurrent().add(inc);                                  // add the delta to the current step interval
    return this.value.poll();                                          // roll the intervals and report the previous one
}
StepDouble.poll
public double poll() {
    rollCount(clock.wallTime()); // rotate current/previous if a step boundary has been crossed
    return previous;             // only the completed (previous) step is reported
}
How is this related to the polling interval? If you do not poll the /actuator/metrics endpoint, the current and previous intervals are not updated, so the current interval is not up to date and the metrics end up being recorded for the "wrong" interval.
Related
I deployed an Apache Beam pipeline to GCP Dataflow in a DEV environment and everything worked well. Then I deployed it to production in the Europe environment (to be specific: job region europe-west1, worker location europe-west1-d), where we get high data velocity, and things started to get complicated.
I am using a session window to group events into sessions. The session key is the tenantId/visitorId and its gap is 30 minutes. I am also using a trigger to emit events every 30 seconds, to release events sooner than the end of the session (writing them to BigQuery).
The problem appears to happen in EventToSession/GroupPairsByKey. In this step there are thousands of events under the droppedDueToLateness counter, and dataFreshness keeps increasing (it has been increasing ever since I deployed the job). All steps before this one operate well, and all steps after it are affected by it but don't seem to have any other problems.
I looked into some metrics and saw that the EventToSession/GroupPairsByKey step is processing between 100K and 200K keys per second (depending on the time of day), which seems like quite a lot to me. The CPU utilization doesn't go over 70% and I am using Streaming Engine. The number of workers is 2 most of the time. The max worker memory capacity is 32GB while the max worker memory usage currently stands at 23GB. I am using the e2-standard-8 machine type.
I don't have any hot keys since each session contains at most a few dozen events.
My biggest suspicion is the huge number of keys being processed in the EventToSession/GroupPairsByKey step. But on the other hand, a session is usually related to a single customer, so Google should expect to handle this number of keys per second, no?
I would like to get suggestions on how to solve the dataFreshness and droppedDueToLateness issues.
Adding the piece of code that generates the sessions:
input = input.apply("SetEventTimestamp", WithTimestamps.of(event -> Instant.parse(getEventTimestamp(event))
.withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
.apply("SetKeyForRow", WithKeys.of(event -> getSessionKey(event))).setCoder(KvCoder.of(StringUtf8Coder.of(), input.getCoder()))
.apply("CreatingWindow", Window.<KV<String, TableRow>>into(Sessions.withGapDuration(Duration.standardMinutes(30)))
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
.discardingFiredPanes()
.withAllowedLateness(Duration.standardDays(30)))
.apply("GroupPairsByKey", GroupByKey.create())
.apply("CreateCollectionOfValuesOnly", Values.create())
.apply("FlattenTheValues", Flatten.iterables());
After doing some research I found the following:
regarding constantly increasing data freshness: as long as late data is allowed to arrive at a session window, that specific window will persist in memory. This means that allowing data to be 30 days late will keep every session in memory for at least 30 days, which obviously can overload the system. Moreover, I found we had some ever-lasting sessions created by bots visiting and taking actions on the websites we monitor. These bots can hold sessions forever, which can also overload the system. The solution was decreasing the allowed lateness to 2 days and using bounded sessions (look for "bounded sessions").
regarding events dropped due to lateness: these are events that, at the time of arrival, belong to an expired window, i.e. a window whose end the watermark has already passed (see the documentation for droppedDueToLateness here). These events are dropped in the first GroupByKey after the session window function and can't be processed later. We didn't want to drop any late data, so the solution was to check each event's timestamp before it enters the session part and to stream into the session part only events that won't be dropped, i.e. events that meet this condition: event_timestamp >= event_arrival_time - (gap_duration + allowed_lateness), as sketched below. The rest are written to BigQuery without the session data. (Apparently Apache Beam drops an event if its timestamp is before event_arrival_time - (gap_duration + allowed_lateness), even if there is a live session the event belongs to...)
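For illustration, a rough sketch of that pre-filter, assuming the same getEventTimestamp helper as in the pipeline above and that GAP_DURATION and ALLOWED_LATENESS are constants defined elsewhere (the transform name and constants are illustrative, not from the original pipeline):

PCollection<TableRow> sessionable = input.apply("KeepOnlySessionableEvents",
        Filter.by((TableRow event) -> {
            // keep the event only if: event_timestamp >= arrival_time - (gap_duration + allowed_lateness)
            long eventMillis = Instant.parse(getEventTimestamp(event)).getMillis();
            long cutoffMillis = Instant.now().getMillis()
                    - GAP_DURATION.getMillis() - ALLOWED_LATENESS.getMillis();
            return eventMillis >= cutoffMillis;
        }));
// events failing the check are routed to the plain BigQuery write instead of the session part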
p.s - in the bounded-sessions part, where the author demonstrates how to implement a time-bounded session, I believe there is a bug that allows a session to grow beyond the provided max size. Once a session has exceeded the max size, one can send late data that intersects the session and precedes it, moving the session's start time earlier and thereby expanding the session. Furthermore, once a session has exceeded the max size, events that belong to it but don't extend it can no longer be added.
To fix that, I switched the order of the current window span and the if-statement, and edited the if-statement (the one checking the session max size) in the mergeWindows function of the window-spanning part, so that a session can't exceed the max size and only data that doesn't extend it beyond the max size can be added. This is my implementation:
public void mergeWindows(MergeContext c) throws Exception {
    List<IntervalWindow> sortedWindows = new ArrayList<>();
    for (IntervalWindow window : c.windows()) {
        sortedWindows.add(window);
    }
    Collections.sort(sortedWindows);
    List<MergeCandidate> merges = new ArrayList<>();
    MergeCandidate current = new MergeCandidate();
    for (IntervalWindow window : sortedWindows) {
        MergeCandidate next = new MergeCandidate(window);
        if (current.intersects(window)) {
            // only merge the window in if the resulting span stays within maxSize + gapDuration
            if (current.union == null
                    || new Duration(current.union.start(), window.end()).getMillis() <= maxSize.plus(gapDuration).getMillis()) {
                current.add(window);
                continue;
            }
        }
        merges.add(current);
        current = next;
    }
    merges.add(current);
    for (MergeCandidate merge : merges) {
        merge.apply(c);
    }
}
I am currently using Spring Boot Reactive (WebFlux) to develop a microservice. In it, I implement a kind of elapsed-time calculation to determine how long it took for the process to run.
Basically, when the process starts, I capture the current timestamp to mark the start of the process, as follows:
...
metrics.setStartMillis(System.currentTimeMillis());
...
and then it prints whether the process succeeded or not, along with the elapsed time, in doOnSuccess() and onErrorResume() respectively:
...
metrics.setStartMillis(System.currentTimeMillis());
return webclientAdapter.getResponse(request)
        .doOnSuccess(success -> {
            metrics.info(true); // prints a metric log for a successful call, including the elapsed time
        })
        .onErrorResume(error -> {
            metrics.info(false); // prints a metric log for a failed call, including the elapsed time
            return Mono.error(error); // re-emit the error so downstream still sees the failure
        });
...
When testing the service by mocking the backend call with a 100ms delay using cURL, the elapsed time is printed correctly (~100ms). However, during a load test using JMeter, the printed elapsed time becomes very short (~0-20ms), although the service is configured to call the mock backend with a 100ms delay.
Does this have to do with the nature of Reactive being in an event loop, and if so, how can I ensure that the elapsed time of the calling process is calculated properly?
Pardon if there is any confusion; feel free to ask for additional information.
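For illustration only (this is not from the original post): a minimal sketch that captures the start timestamp per subscription with Mono.defer, assuming webclientAdapter.getResponse(request) returns a Mono and using a plain log call in place of the metrics helper:

return Mono.defer(() -> {
    long start = System.currentTimeMillis(); // captured when this particular call is subscribed to
    return webclientAdapter.getResponse(request)
            .doOnSuccess(success ->
                    log.info("success, elapsed={} ms", System.currentTimeMillis() - start))
            .doOnError(error ->
                    log.info("failure, elapsed={} ms", System.currentTimeMillis() - start));
});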
I got the following metrics using Prometheus and WebClient:
http_client_requests_seconds_count{clientName="aaa.com", ..., uri="/test"} 5
http_client_requests_seconds_max{clientName="aaa.com", ..., uri="/test"} 0
http_client_requests_seconds_sum{clientName="aaa.com", ..., uri="/test"} 10
I want to know what each metric means.
And the time unit: is 'http_client_requests_seconds_sum' in milliseconds, nanoseconds, or seconds?
Does 'http_client_requests_seconds_max' mean the longest time?
Please help me!
http_client_requests_seconds_count is the total number of requests your application made to this endpoint (don’t worry about the fact that the name contains the word seconds).
http_client_requests_seconds_sum is the sum of the duration of every request your application made to this endpoint.
http_client_requests_seconds_max is the maximum request duration during a time window. The value resets to 0 when a new time window starts. The default time window is 2 minutes.
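The values are expressed in seconds, as the metric name suggests. In the example from the question, 5 requests took 10 seconds in total, so the average request duration is 10 / 5 = 2 seconds.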
Reference: Spring Boot default metrics
In our project we need to retrieve prices from a remote FTP server. During office hours this works fine: prices are retrieved and successfully processed. After office hours there are no new prices published on the FTP server, so as expected we don't find anything new.
Our problem is that after a few hours of not finding new prices, the poller just stops polling. There is no error in the log files (even when running org.springframework.integration at debug level) and no exceptions. We are now using a separate TaskExecutor to isolate the issue, but the poller still just stops. In the meantime we adjusted the cron expression to match these hours, to limit resource use, but the poller still just stops when it is supposed to run.
Any help to troubleshoot this issue is very much appreciated!
We use an @InboundChannelAdapter on an FtpStreamingMessageSource, which is configured like this:
@Bean
@InboundChannelAdapter(
    value = FTP_PRICES_INBOUND,
    poller = [Poller(
        maxMessagesPerPoll = "\${ftp.fetch.size}",
        cron = "\${ftp.poll.cron}",
        taskExecutor = "ftpTaskExecutor"
    )],
    autoStartup = "\${ftp.fetch.enabled:false}"
)
fun ftpInboundFlow(
    @Value("\${ftp.remote.prices.dir}") pricesDir: String,
    @Value("\${ftp.remote.prices.file.pattern}") remoteFilePattern: String,
    @Value("\${ftp.fetch.size}") fetchSize: Int,
    @Value("\${ftp.fetch.enabled:false}") fetchEnabled: Boolean,
    clock: Clock,
    remoteFileTemplate: RemoteFileTemplate<FTPFile>,
    priceParseService: PriceParseService,
    ftpFilterOnlyFilesFromMaxDurationAgo: FtpFilterOnlyFilesFromMaxDurationAgo
): FtpStreamingMessageSource {
    val messageSource = FtpStreamingMessageSource(remoteFileTemplate, null)
    messageSource.setRemoteDirectory(pricesDir)
    messageSource.maxFetchSize = fetchSize
    messageSource.setFilter(
        inboundFilters(
            remoteFilePattern,
            ftpFilterOnlyFilesFromMaxDurationAgo
        )
    )
    return messageSource
}
The property values are:
poll.cron: "*/30 * 4-20 * * MON-FRI"
fetch.size: 10
fetch.enabled: true
Before we limited the poll.cron, we used to retrieve every minute.
In the related DefaultFtpSessionFactory, the timeouts are set to 60 seconds to override the default value of -1 (which means no timeout at all):
sessionFactory.setDataTimeout(timeOut)
sessionFactory.setConnectTimeout(timeOut)
sessionFactory.setDefaultTimeout(timeOut)
Maybe my answer seems a bit too easy, but is it because your cron expression states that it should schedule the job only between 4:00 and 20:00? After 8:00 PM it will not schedule the job anymore, and it will start polling again at 4:00 AM.
It turned out that the processing took longer than the scheduled interval, so a new task was already being executed while the previous one was still processing. Eventually multiple tasks were trying to accomplish the same thing.
We solved this by using a fixedDelay on the poller instead of a fixedRate.
The difference is that fixedRate schedules runs at a regular interval regardless of whether the previous task has finished, while fixedDelay schedules the next run only after a delay that starts when the task finishes, as illustrated below.
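The same semantics can be seen with plain JDK scheduling; a small illustration (not the Spring Integration poller itself):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedRateVsFixedDelay {

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

        Runnable slowTask = () -> {
            System.out.println("task started at " + System.currentTimeMillis());
            try {
                Thread.sleep(2_000); // the task takes longer than the 1-second interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        // fixed rate: a run is due every second no matter how long the task takes,
        // so overdue runs fire back-to-back as soon as the previous one finishes
        scheduler.scheduleAtFixedRate(slowTask, 0, 1, TimeUnit.SECONDS);

        // fixed delay: the next run starts 1 second after the previous run has finished,
        // so runs never pile up on each other
        scheduler.scheduleWithFixedDelay(slowTask, 0, 1, TimeUnit.SECONDS);
    }
}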
Trying to understand (new to Kafka) how the poll event loop in Kafka works.
Use Case : 25 records on the topic, max poll size is set to 5.
max.poll.interval.ms = 5000 // 5 seconds by default
max.poll.records = 5
Sequence of tasks
Poll the records from the topic.
Process the records in a for loop.
Some processing logic where the record would either pass or fail.
If the logic passes, the record (with its offset) will be added to a map.
Then it will be committed using a commitSync call.
If it fails, the loop breaks and whatever was successful before this point is committed. The problem starts after this.
The next poll just keeps moving in batches of 5 even after the error; is that expected?
What we basically expect is that the loop breaks, the offsets up to the last successfully processed message get committed, and the next poll continues from the failed message.
Example: in the first batch, 5 messages are polled, offsets 1 and 2 succeed and are committed, and then the 3rd fails. The poll call nevertheless keeps moving to the next batches (5-10, 10-15, ...). If there is an error in between, we expect it to stop at that point: the next poll should start from offset 3 in the first case, or, if it fails at offset 8 in the second batch, the next poll should start from offset 8, not from the next batch dictated by the max poll settings (5 in this case). If it matters, this is a Spring Boot project and enable.auto.commit is false.
I have tried finding this in the documentation, but without success.
I also tried tweaking max.poll.interval.ms, but it did not help.
EDIT: I did not accept the answer because there is no direct solution for a custom consumer. Keeping this for informational purposes.
max.poll.interval.ms is in milliseconds, not seconds, so it should be 5000.
Once the records have been returned by the poll (and offsets not committed), they won't be returned again unless you restart the consumer or perform seek() operations on the consumer to reset the offset to the unprocessed ones.
The Spring for Apache Kafka project provides a SeekToCurrentErrorHandler to perform this task for you.
If you are using the consumer yourself (which it sounds like), you must do the seeks.
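A minimal sketch of wiring the SeekToCurrentErrorHandler mentioned above with Spring for Apache Kafka (bean name and generics are illustrative; the rest of the factory configuration is omitted):

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // on a listener exception, re-seek the failed record and the remaining records of the poll
    // so they are redelivered on the next poll instead of being skipped
    factory.setErrorHandler(new SeekToCurrentErrorHandler());
    return factory;
}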
You can manually seek to the starting offset of the poll for all the assigned partitions on failure. I am not sure how to do that with the Spring consumer.
Sample code for seeking offset to beginning for normal consumer.
In the code below I am getting the records list per partition and then getting the offset of the first record to seek to.
import scala.collection.JavaConverters._

def seekBack(records: ConsumerRecords[String, String]): Unit = {
  // for every partition in the failed poll, seek back to the offset of its first record
  // so the whole batch is redelivered on the next poll
  records.partitions().asScala.foreach { partition =>
    val partitionedRecords = records.records(partition)
    val offset = partitionedRecords.get(0).offset()
    consumer.seek(partition, offset)
  }
}
One problem: doing this in production is risky, since you don't want to seek back all the time, only in cases where you have a transient error; otherwise you will end up retrying infinitely.