Could using changelogs cause a bottleneck for the app itself? - apache-kafka-streams

I have a spring cloud kafka streams application that rekeys incoming data to be able to join two topics, selectkeys, mapvalues and aggregate data. Over time the consumer lag seems to increase and scaling by adding multiple instances of the app doesn't help a bit. With every instance the consumer lag seems to be increasing.
I scaled up and down the instances from 1 to 18 but no big difference is noticed. The number of messages it lags behind, keeps increasing every 5 seconds independent of the number of instances
KStream<String, MappedOriginalSensorData> flattenedOriginalData = originalData
.flatMap(flattenOriginalData())
.through("atl-mapped-original-sensor-data-repartition", Produced.with(Serdes.String(), new MappedOriginalSensorDataSerde()));
//#2. Save modelid and algorithm parts of the key of the errorscore topic and reduce the key
// to installationId:assetId:tagName
//Repartition ahead of time avoiding multiple repartition topics and thereby duplicating data
KStream<String, MappedErrorScoreData> enrichedErrorData = errorScoreData
.map(enrichWithModelAndAlgorithmAndReduceKey())
.through("atl-mapped-error-score-data-repartition", Produced.with(Serdes.String(), new MappedErrorScoreDataSerde()));
return enrichedErrorData
//#3. Join
.join(flattenedOriginalData, join(),
JoinWindows.of(
// allow messages within one second to be joined together based on their timestamp
Duration.ofMillis(1000).toMillis())
// configure the retention period of the local state store involved in this join
.until(Long.parseLong(retention)),
Joined.with(
Serdes.String(),
new MappedErrorScoreDataSerde(),
new MappedOriginalSensorDataSerde()))
//#4. Set instalation:assetid:modelinstance:algorithm::tag key back
.selectKey((k,v) -> v.getOriginalKey())
//#5. Map to ErrorScore (basically removing the originalKey field)
.mapValues(removeOriginalKeyField())
.through("atl-joined-data-repartition");
then the aggregation part:
Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
.as(localStore.getStoreName());
// Set retention of changelog topic
materialized.withLoggingEnabled(topicConfig);
// Configure how windows looks like and how long data will be retained in local stores
TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
localStore.getTimeUnit(), Long.parseLong(topicConfig.get(RETENTION_MS)));
// Processing description:
// 2. With the groupByKey we group the data on the new key
// 3. With windowedBy we split up the data in time intervals depending on the provided LocalStore enum
// 4. With reduce we determine the maximum value in the time window
// 5. Materialized will make it stored in a table
stream.groupByKey()
.windowedBy(configuredTimeWindows)
.reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), materialized);
}
private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long retentionMs) {
TimeWindows timeWindows = TimeWindows.of(windowSizeMs);
timeWindows.until(retentionMs);
return timeWindows;
}
I would expect that increasing the number of instances would decrease the consumer lag tremendous.
So in this setup there are multiple topics involved such as:
* original-sensor-data
* error-score
* kstream-joinother
* kstream-jointhis
* atl-mapped-original-sensor-data-repartition
* atl-mapped-error-score-data-repartition
* atl-joined-data-repartition
the idea is to join the original-sensor-data with the error-score. The rekeying requires the atl-mapped-* topics. then the join will use the kstream* topics and in the end as a result of the join the atl-joined-data-repartition is filled. After that the aggregation also creates topics but I leave this out of scope now.
original-sensor-data
\
\
\ atl-mapped-original-sensor-data-repartition-- kstream-jointhis -\
/ atl-mapped-error-score-data-repartition -- kstream-joinother -\
/ \
error-score atl-joined-data-repartition
As it seems that increasing the number of instances doesn't seem to have much of affect anymore since I introduced the join and the atl-mapped topics, I'm wondering if it is possible that this topology would become its own bottleneck. From the consumer lag it seems that the original-sensor-data and error-score topic have a much smaller consumer lag compare to for instance the atl-mapped-* topics. Is there a way to cope with this by removing these changelogs or does this result in not being able to scale?

Related

Add field to existing documents over million records

Scenario
We have over 5 million document in a bucket and all of it has nested JSON with a simple uuid key. We want to add one extra field to ALL of the documents.
Example
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
"field1":"data1",
"field2":{
"nested_field1":"data2"
}
}
After adding extra field
ee6ae656-6e07-4aa2-951e-ea788e24856a
{
"field1":"data1",
"field3":"data3",
"field2":{
"nested_field1":"data2"
}
}
It has only one Primary Index: CREATE PRIMARY INDEX idx FOR bucket.
Problem
It takes ages. We tried it with n1ql, UPDATE bucket SET field3 = data3. Also sub-document mutation. But all of it takes hours. It's written in Go so we could put it into a goroutine, but it's still too much time.
Question
Is there any solution to reduce that time?
As you need to add new field, not modifying any existing field it is better to use SDKs SUBDOC API vs N1QL UPDATE (It is whole document update and require fetch the document).
The Best option will be Use N1QL get the document keys then use
SDK SUBDOC API to add the field you need. You can use reactive API(asynchronously)
You have 5M documents and have primary index use following
val = ""
In loop
SELECT RAW META().id FROM mybucket WHERE META().id > $val LIMIT 10000;
SDK SUBDOC update
val = last value from the SELECT
https://blog.couchbase.com/offset-keyset-pagination-n1ql-query-couchbase/
The Eventing Service can be quite performant for these sort of enrichment tasks. Even a low end system should be able to do 5M rows in under two (2) minutes.
// Note src_bkt is an alias to the source bucket for your handler
// in read+write mode supported for version 6.5.1+, this uses DCP
// and can be 100X more performant than N1QL.
function OnUpdate(doc, meta) {
// optional filter to be more selective
// if (!doc.type && doc.type !== "mytype") return;
// test if we already have the field we want to add
if (doc.field3) return;
doc.field3 = "data3";
src_bkt[meta.id] = doc;
}
For more details on Eventing refer to https://docs.couchbase.com/server/current/eventing/eventing-overview.html I typically enrich 3/4 of a billion documents. The Eventing function will also run faster (enrich more documents per second) if you increase the number of workers in your Eventing function's setting from say 3 to 16 provided you have 8+ physical cores on your Eventing node.
I tested the above Eventing function and it enriches 5M documents (modeled on your example) on my non-MDS single node couchbase test system (12 cores at 2.2GHz) in just 72 seconds. Obviously if you have a real multi node cluster it will be faster (maybe all 5M docs in just 5 seconds).

Kafka Streams: Add Sequence to each message within a group of message

Set Up
Kafka 2.5
Apache KStreams 2.4
Deployment to Openshift(Containerized)
Objective
Group a set of messages from a topic using a set of value attributes & assign a unique group identifier
-- This can be achieved by using selectKey and groupByKey
originalStreamFromTopic
.selectKey((k,v)-> String.join("|",v.attribute1,v.attribute2))
.groupByKey()
groupedStream.mapValues((k,v)->
{
v.setGroupKey(k);
return v;
});
For each message within a specific group , create a new message with an itemCount number as one of the attributes
e.g. A group with key "keypart1|keyPart2" can have 10 messages and each of the message should have an incremental id from 1 through 10.
aggregate?
count and some additional StateStore based implementation.
One of the options (that i listed above), can make use of a couple of state stores
state store 1-> Mapping of each groupId and individual Item (KTable)
state store 2 -> Count per groupId (KTable)
A join of these 2 tables to stamp a sequence on the message as they get published to the final topic.
Other statistics:
Average number of messages per group would be in some 1000s except for an outlier case where it can go upto 500k.
In general the candidates for a group should be made available on the source within a span of 15 mins max.
Following points are of concern from the optimum solution perspective .
I am still not clear how i would be able to stamp a sequence number on the messages unless some kind of state store is used for keeping track of messages published within a group.
Use of KTable and state stores (either explicit usage or implicitly by the use of KTable) , would add to the state store size considerably.
Given the problem involves some kind of tasteful processing , the state store cant be avoided but any possible optimizations might be useful.
Any thoughts or references to similar patterns would be helpful.
You can use one state store with which you maintain the ID for each composite key. When you get a message you select a new composite key and then you lookup the next ID for the composite key in the state store. You stamp the message with the new ID that you just looked up. Finally, you increase the ID and write it back to the state store.
Code-wise, it would be something like:
// create state store
StoreBuilder<KeyValueStore<String,String>> keyValueStoreBuilder = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("idMaintainer"),
Serdes.String(),
Serdes.Long()
);
// add store
builder.addStateStore(keyValueStoreBuilder);
originalStreamFromTopic
.selectKey((k,v)-> String.join("|",v.attribute1,v.attribute2))
.repartition()
.transformValues(() -> new ValueTransformer() {
private StateStore state;
void init(ProcessorContext context) {
state = context.getStateStore("idMaintainer");
}
NewValueType transform(V value) {
// your logic to:
// - get the ID for the new composite key,
// - stamp the record
// - increase the ID
// - write the ID back to the state store
// - return the stamped record
}
void close() {
}
}, "idMaintainer")
.to("output-topic");
You do not need to worry about concurrent access to the state store because in Kafka Streams same keys are processed by one single task and tasks do not share state stores. That means, your new composite keys with the same value will be processed by one single task that exclusively maintains the IDs for the composite keys in its state store.

KStream to KStream Join- Output record post a configurable time in event of no matching record within the window

Need some opinion/help around one use case of KStream/KTable usage.
Scenario:
I have 2 topics with common key--requestId.
input_time(requestId,StartTime)
completion_time(requestId,EndTime)
The data in input_time is populated at time t1 and the data in completion_time is populated at t+n.(n being the time taken for a process to complete).
Objective
To compare the time taken for a request by joining data from the topics and raised alert in case of breach of a threshold time.
It may happen that the process may fail and the data may not arrive on the completion_time topic at all for the request.
In that case we intend to use a check that if the currentTime is well past a specific(lets say 5s) threshold since the start time.
input_time(req1,100) completion_time(req1,104) --> no alert to be raised as 104-100 < 5(configured value)
input_time(req2,100) completion_time(req2,108) --> alert to be raised with req2,108 as 108-100 >5
input_time(req3,100) completion_time no record--> if current Time is beyond 105 raise an alert with req3,currentSysTime as currentSysTime - 100 > 5
Options Tried.
1) Tried both KTable-KTable and KStream-Kstream outer joins but the third case always fails.
final KTable<String,Long> startTimeTable = builder.table("input_time",Consumed.with(Serdes.String(),Serdes.Long()));
final KTable<String,Long> completionTimeTable = builder.table("completion_time",Consumed.with(Serdes.String(),Serdes.Long()));
KTable<String,Long> thresholdBreached =startTimeTable .outerJoin(completionTimeTable,
new MyValueJoiner());
thresholdBreached.toStream().filter((k,v)->v!=null)
.to("finalTopic",Produced.with(Serdes.String(),Serdes.Long()));
Joiner
public Long apply(Long startTime,Long endTime){
// if input record itself is not available then we cant use any alerting.
if (null==startTime){
log.info("AlertValueJoiner check: the start time itself is null so returning null");
return null;
}
// current processing time is the time used.
long currentTime= System.currentTimeMillis();
log.info("Checking startTime {} end time {} sysTime {}",startTime,endTime,currentTime);
if(null==endTime && currentTime-startTime>5000){
log.info("Alert:No corresponding record from file completion yet currentTime {} startTime {}"
,currentTime,startTime);
return currentTime-startTime;
}else if(null !=endTime && endTime-startTime>5000){
log.info("Alert: threshold breach for file completion startTime {} endTime {}"
,startTime,endTime);
return endTime-startTime;
}
return null;
}
2) Tried the custom logic approach recommended as per the thread
How to manage Kafka KStream to Kstream windowed join?
-- This approach stopped working for scenarios 2 and 3.
Is there any case of handling all three scenarios using DSL or Processors?
Not sure of we can use some kind of punctuator to listen to when the window changes and check for the stream records in current window and if there is no matching records found,produce a result with systime.?
Due to the nature of the logic involve it surely had to be done with combination of DSL and processor API.
Used a custom transformer and state store to compare with configured
values.(case 1 &2)
Added a punctuator based on wall clock for
handling the 3rd case

gauge-like calculation of DistributionSummary with micrometer and prometheus

I want to get a DistributionSummary over some domain data that does not change very frequently. So it is not about monitoring requests or sth like that.
Let's take number of seats in an office as example. The value for each office can change from time to time and there can be new offices and also offices get removed.
So now I need the current DistributionSummary over all offices, which needs to be calculated every time I think (similar to a Gauge).
I have a Spring Boot 2 app with micrometer and collect the metrics with prometheus and display them in grafana.
What I tried so far:
When I register a DistributionSummary, I can record all the values once during startup... this gives me the distribution, but calculated values like max get lost over time and I cannot update the DistributionSummary (recording new offices would work, but not changing existing ones)
// during startup
seatsInOffice = DistributionSummary.builder("office.seats")
.publishPercentileHistogram()
.sla(1, 5, 20, 50)
.register(meterRegistry);
officeService.getAllOffices().forEach(p -> seatsInOffice.record(o.getNumberOfSeats()));
I also tried to use a #Scheduled task to remove and completely rebuild the DistributionSummary. This seems to work, but feels wrong somehow. Would that be a recommended approach? That would also probably need some synchronisation to not collect the metrics between removing and recalculating distribution.
#Scheduled(fixedRate = 5 * 60 * 1000)
public void recalculateMetrics() {
if (seatsInOffice != null) {
meterRegistry.remove(seatsInOffice);
}
seatsInOffice = DistributionSummary.builder("office.seats")
.publishPercentileHistogram()
.sla(1, 5, 20, 50)
.register(meterRegistry);
officeService.getAllOffices().forEach(p -> seatsInOffice.record(o.getNumberOfSeats()));
}
Another problem I just recognized with this approach: the /actuator/prometheus endpoint still returns the values for the old (removed) metrics, so everything is there mutiple times.
For sth like sla borders I could also use some gauges to provide the values (by calculating them myself), but that would not give me quantiles. Is it possible to create a new DistributionSummary without registering it and just provide the values it collected somehow?
meterRegistry.gauge("office.seats", Tags.of("le", "1"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(1).size());
meterRegistry.gauge("office.seats", Tags.of("le", "5"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(5).size());
meterRegistry.gauge("office.seats", Tags.of("le", "20"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(20).size());
meterRegistry.gauge("office.seats", Tags.of("le", "50"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(50).size());
I would like to have a DistributionSummary that takes a lambda or sth like that to get the values. But maybe these tools are not made for this usecase and I should use sth else. Can you recommend sth?
DistributionSummary has a config distributionStatisticExpiry could control rotate data. It's a workaround
But PrometheusDistributionSummary won't use this field
case Prometheus:
histogram = new TimeWindowFixedBoundaryHistogram(clock, DistributionStatisticConfig.builder()
.expiry(Duration.ofDays(1825)) // effectively never roll over
.bufferLength(1)
.build()
.merge(distributionStatisticConfig), true);

union not happening with Spark transform

I have a Spark stream in which records are flowing in. And the interval size is 1 second.
I want to union all the data in the stream. So i have created an empty RDD , and then using transform method, doing union of RDD (in the stream) with this empty RDD.
I am expecting this empty RDD to have all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me if my logic is correct.
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();
JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
.transform(rdd -> rdd.union(records));
transformedMessages.foreachRDD(record -> {
System.out.println("Aman" +record.count());
StructType schema = DataTypes.createStructType(fields);
Dataset ds = ss.createDataFrame(records, schema);
ds.createOrReplaceTempView("tempTable");
ds.show();
});
Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so we have: transformedMessages = messages (obviating the flatmap function which is not relevant for the discussion)
Later on, when we do Dataset ds = ss.createDataFrame(records, schema); records
is still empty. That does not change in the flow of the program, so it will remain empty as an invariant over time.
I think what we want to do is, instead of
.transform(rdd -> rdd.union(records));
we should do:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
That said, please note that as this process iteratively adds to the lineage of the 'records' RDD and also will accumulate all data over time. This is not a job that can run stable for a long period of time as, eventually, given enough data, it will grow beyond the limits of the system.
There's no information about the usecase behind this question, but the current approach does not seem to be scalable nor sustainable.

Resources