scale spring-kafka consumer app horizontally

I'm wondering what would be a good approach to configure the number of partitions in relation to the maximum number of horizontally scaled instances.
Suppose I have one topic with 6 partitions.
I have one application that uses the ConcurrentKafkaListenerContainerFactory with setConcurrency of 6.
That would mean I will have 6 KafkaMessageListenerContainers, each using one thread and consuming messages from my partitions, spread out evenly.
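For reference, a minimal sketch of that setup (bean names are illustrative):

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // 6 listener threads -> 6 KafkaMessageListenerContainers, one per partition of the 6-partition topic
    factory.setConcurrency(6);
    return factory;
}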
If the above is correct, then I was wondering what would happen if I scale the app horizontally by adding another instance?
If the new instance has the same configuration, a concurrency of 6, and of course the same consumer group, I believe the 2nd instance will not be consuming any messages, because no rebalance will happen: each existing consumer already has one partition assigned to it.
But what if we go back to the first example and have 6 partitions, with one instance having a concurrency of 3? Then each consumer thread/KafkaMessageListenerContainer will have 2 partitions assigned.
If we scale this app (same consumer group id and also a concurrency of 3), I believe that a rebalance will happen and both instances will individually be consuming from 3 partitions.
Are these assumptions correct, and if not, how should you handle such a case?

In general your assumptions are correct for the default behavior, which is based on the RangeAssignor:
/**
* <p>The range assignor works on a per-topic basis. For each topic, we lay out the available partitions in numeric order
* and the consumers in lexicographic order. We then divide the number of partitions by the total number of
* consumers to determine the number of partitions to assign to each consumer. If it does not evenly
* divide, then the first few consumers will have one extra partition.
*
* <p>For example, suppose there are two consumers <code>C0</code> and <code>C1</code>, two topics <code>t0</code> and
* <code>t1</code>, and each topic has 3 partitions, resulting in partitions <code>t0p0</code>, <code>t0p1</code>,
* <code>t0p2</code>, <code>t1p0</code>, <code>t1p1</code>, and <code>t1p2</code>.
*
* <p>The assignment will be:
* <ul>
* <li><code>C0: [t0p0, t0p1, t1p0, t1p1]</code></li>
* <li><code>C1: [t0p2, t1p2]</code></li>
* </ul>
*
* Since the introduction of static membership, we could leverage <code>group.instance.id</code> to make the assignment behavior more sticky.
* For the above example, after one rolling bounce, group coordinator will attempt to assign new <code>member.id</code> towards consumers,
* for example <code>C0</code> -> <code>C3</code> <code>C1</code> -> <code>C2</code>.
*
* <p>The assignment could be completely shuffled to:
* <ul>
* <li><code>C3 (was C0): [t0p2, t1p2] (before was [t0p0, t0p1, t1p0, t1p1])</code>
* <li><code>C2 (was C1): [t0p0, t0p1, t1p0, t1p1] (before was [t0p2, t1p2])</code>
* </ul>
*
* The assignment change was caused by the change of <code>member.id</code> relative order, and
* can be avoided by setting the group.instance.id.
* Consumers will have individual instance ids <code>I1</code>, <code>I2</code>. As long as
* 1. Number of members remain the same across generation
* 2. Static members' identities persist across generation
* 3. Subscription pattern doesn't change for any member
*
* <p>The assignment will always be:
* <ul>
* <li><code>I0: [t0p0, t0p1, t1p0, t1p1]</code>
* <li><code>I1: [t0p2, t1p2]</code>
* </ul>
*/
public class RangeAssignor extends AbstractPartitionAssignor {
However, you can plug in any ConsumerPartitionAssignor via the partition.assignment.strategy consumer property: https://kafka.apache.org/documentation/#consumerconfigs_partition.assignment.strategy
See also the ConsumerPartitionAssignor JavaDocs and its implementations for more info, to make a choice for your use case.
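As a minimal sketch (class, bean and property values are illustrative), the strategy can be set on the consumer properties used by the Spring Kafka consumer factory, for example switching to the CooperativeStickyAssignor:

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class ConsumerAssignmentConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        // plug in any ConsumerPartitionAssignor implementation by class name
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        return new DefaultKafkaConsumerFactory<>(props);
    }
}

The container factory built on top of this consumer factory then uses the configured assignor for every KafkaMessageListenerContainer it creates.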

Related

Apache Kafka with multiple partitions distributed deployment

I have a kafka topic with 10 partitions. I plan to deploy two applications on different servers. One application will read from partitions 0 to 4, while the other will read from partitions 5 to 9.
Deployment 1
@KafkaListener(topicPartitions =
    { @TopicPartition(topic = "testpartition", partitions = { "0", "1", "2", "3", "4" }) })
public void receive(ConsumerRecord record) {
    System.out.println(String.format("Listener 1 - Topic - %s, Partition - %d, Value: %s", kafkaTopic, record.partition(), record.value()));
}
Deployment 2
@KafkaListener(topicPartitions =
    { @TopicPartition(topic = "testpartition", partitions = { "5", "6", "7", "8", "9" }) })
public void receive(ConsumerRecord record) {
    System.out.println(String.format("Listener 2 - Topic - %s, Partition - %d, Value: %s", kafkaTopic, record.partition(), record.value()));
}
So we will have two consumer groups, as the application is deployed separately on different servers.
As each application is consuming from different partitions:
Will this lead to unwanted replication of messages on the Kafka topic?
Will all the messages get replicated twice? Also, if this is the case, will there be message duplication?
Is this the right way to deploy the consumer application in a distributed environment, or is there a better way?
Since you are manually assigning the partitions, no, there will be no duplication and each instance will only receive records from its assigned partitions.
When you say "replicated", that depends on the replication factor set when the topic is created. Replicas are used to ensure there are multiple copies on different broker instances in order to handle server failures. Replication is not the same as duplication.
But even though records are replicated in that way, there is only one logical instance of each record.
It is possible to get duplicate records in certain (rare) failure scenarios, unless you enable exactly once semantics.
The other way to deploy it is to use Kafka Group Management and let Kafka distribute the partitions across the instances, either using its default algorithm or using a custom ConsumerPartitionAssignor.
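For example, a minimal sketch of the group-management variant (the group id is illustrative); both deployments use the same listener and Kafka assigns the partitions:

@KafkaListener(topics = "testpartition", groupId = "testpartition-group")
public void receive(ConsumerRecord<String, String> record) {
    System.out.println(String.format("Topic - %s, Partition - %d, Value: %s",
            record.topic(), record.partition(), record.value()));
}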

Kafka Streams: Add Sequence to each message within a group of messages

Set Up
Kafka 2.5
Apache KStreams 2.4
Deployment to OpenShift (containerized)
Objective
Group a set of messages from a topic using a set of value attributes & assign a unique group identifier
-- This can be achieved by using selectKey and groupByKey
originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .groupByKey()

groupedStream.mapValues((k, v) -> {
    v.setGroupKey(k);
    return v;
});
For each message within a specific group, create a new message with an itemCount number as one of the attributes.
e.g. A group with key "keypart1|keyPart2" can have 10 messages, and each of the messages should have an incremental id from 1 through 10.
Options considered: aggregate? count plus some additional StateStore-based implementation?
One of the options (that I listed above) can make use of a couple of state stores:
state store 1 -> mapping of each groupId to the individual items (KTable)
state store 2 -> count per groupId (KTable)
A join of these 2 tables would stamp a sequence on the messages as they get published to the final topic.
Other statistics:
The average number of messages per group would be in the 1000s, except for an outlier case where it can go up to 500k.
In general the candidates for a group should be made available on the source within a span of 15 mins max.
The following points are of concern from an optimum-solution perspective.
I am still not clear how I would be able to stamp a sequence number on the messages unless some kind of state store is used for keeping track of messages published within a group.
Use of KTables and state stores (either explicitly, or implicitly through the use of a KTable) would add to the state store size considerably.
Given the problem involves some kind of stateful processing, the state store can't be avoided, but any possible optimizations might be useful.
Any thoughts or references to similar patterns would be helpful.
You can use one state store with which you maintain the ID for each composite key. When you get a message you select a new composite key and then you look up the next ID for the composite key in the state store. You stamp the message with the new ID that you just looked up. Finally, you increase the ID and write it back to the state store.
Code-wise, it would be something like:
// create state store (composite key -> next sequence ID)
StoreBuilder<KeyValueStore<String, Long>> keyValueStoreBuilder = Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("idMaintainer"),
        Serdes.String(),
        Serdes.Long()
);
// add store
builder.addStateStore(keyValueStoreBuilder);

originalStreamFromTopic
    .selectKey((k, v) -> String.join("|", v.attribute1, v.attribute2))
    .repartition()
    // the transformer needs the (composite) key, hence ValueTransformerWithKey; V is your value type
    .transformValues(() -> new ValueTransformerWithKey<String, V, V>() {
        private KeyValueStore<String, Long> state;

        @Override
        public void init(ProcessorContext context) {
            state = (KeyValueStore<String, Long>) context.getStateStore("idMaintainer");
        }

        @Override
        public V transform(String key, V value) {
            // get the next ID for the composite key (start at 1 for an unseen key)
            Long id = state.get(key);
            long nextId = (id == null) ? 1L : id;
            // stamp the record (assuming your value type has such a setter)
            value.setItemCount(nextId);
            // increase the ID and write it back to the state store
            state.put(key, nextId + 1);
            // return the stamped record
            return value;
        }

        @Override
        public void close() {
        }
    }, "idMaintainer")
    .to("output-topic");
You do not need to worry about concurrent access to the state store, because in Kafka Streams records with the same key are processed by a single task, and tasks do not share state stores. That means all records with the same new composite key will be processed by a single task that exclusively maintains the IDs for those composite keys in its state store.

Could using changelogs cause a bottleneck for the app itself?

I have a Spring Cloud Kafka Streams application that rekeys incoming data to be able to join two topics (selectKey, mapValues) and aggregates data. Over time the consumer lag seems to increase, and scaling by adding multiple instances of the app doesn't help a bit. With every instance the consumer lag seems to be increasing.
I scaled the instances up and down from 1 to 18, but no big difference is noticed. The number of messages it lags behind keeps increasing every 5 seconds, independent of the number of instances.
KStream<String, MappedOriginalSensorData> flattenedOriginalData = originalData
        .flatMap(flattenOriginalData())
        .through("atl-mapped-original-sensor-data-repartition",
                Produced.with(Serdes.String(), new MappedOriginalSensorDataSerde()));

//#2. Save modelid and algorithm parts of the key of the errorscore topic and reduce the key
//    to installationId:assetId:tagName.
//    Repartition ahead of time, avoiding multiple repartition topics and thereby duplicating data.
KStream<String, MappedErrorScoreData> enrichedErrorData = errorScoreData
        .map(enrichWithModelAndAlgorithmAndReduceKey())
        .through("atl-mapped-error-score-data-repartition",
                Produced.with(Serdes.String(), new MappedErrorScoreDataSerde()));

return enrichedErrorData
        //#3. Join
        .join(flattenedOriginalData, join(),
                JoinWindows.of(
                        // allow messages within one second to be joined together based on their timestamp
                        Duration.ofMillis(1000).toMillis())
                        // configure the retention period of the local state store involved in this join
                        .until(Long.parseLong(retention)),
                Joined.with(
                        Serdes.String(),
                        new MappedErrorScoreDataSerde(),
                        new MappedOriginalSensorDataSerde()))
        //#4. Set installation:assetid:modelinstance:algorithm::tag key back
        .selectKey((k, v) -> v.getOriginalKey())
        //#5. Map to ErrorScore (basically removing the originalKey field)
        .mapValues(removeOriginalKeyField())
        .through("atl-joined-data-repartition");
then the aggregation part:
Materialized<String, ErrorScore, WindowStore<Bytes, byte[]>> materialized = Materialized
        .as(localStore.getStoreName());

// Set retention of changelog topic
materialized.withLoggingEnabled(topicConfig);

// Configure what the windows look like and how long data will be retained in local stores
TimeWindows configuredTimeWindows = getConfiguredTimeWindows(
        localStore.getTimeUnit(), Long.parseLong(topicConfig.get(RETENTION_MS)));

// Processing description:
// 2. With groupByKey we group the data on the new key
// 3. With windowedBy we split up the data in time intervals depending on the provided LocalStore enum
// 4. With reduce we determine the maximum value in the time window
// 5. Materialized will make it stored in a table
stream.groupByKey()
        .windowedBy(configuredTimeWindows)
        .reduce((aggValue, newValue) -> getMaxErrorScore(aggValue, newValue), materialized);
}

private TimeWindows getConfiguredTimeWindows(long windowSizeMs, long retentionMs) {
    TimeWindows timeWindows = TimeWindows.of(windowSizeMs);
    timeWindows.until(retentionMs);
    return timeWindows;
}
I would expect that increasing the number of instances would decrease the consumer lag tremendously.
So in this setup there are multiple topics involved such as:
* original-sensor-data
* error-score
* kstream-joinother
* kstream-jointhis
* atl-mapped-original-sensor-data-repartition
* atl-mapped-error-score-data-repartition
* atl-joined-data-repartition
The idea is to join original-sensor-data with error-score. The rekeying requires the atl-mapped-* topics, then the join will use the kstream-* topics, and in the end, as a result of the join, atl-joined-data-repartition is filled. After that the aggregation also creates topics, but I leave this out of scope now.
original-sensor-data --> atl-mapped-original-sensor-data-repartition --> kstream-jointhis ---\
                                                                                               --> atl-joined-data-repartition
error-score --------> atl-mapped-error-score-data-repartition ------> kstream-joinother ----/
Since increasing the number of instances doesn't seem to have much of an effect anymore since I introduced the join and the atl-mapped topics, I'm wondering if it is possible that this topology has become its own bottleneck. From the consumer lag it seems that the original-sensor-data and error-score topics have a much smaller consumer lag compared to, for instance, the atl-mapped-* topics. Is there a way to cope with this by removing these changelogs, or does this mean it cannot scale?

gauge-like calculation of DistributionSummary with micrometer and prometheus

I want to get a DistributionSummary over some domain data that does not change very frequently. So it is not about monitoring requests or something like that.
Let's take the number of seats in an office as an example. The value for each office can change from time to time, and there can be new offices and also offices that get removed.
So now I need the current DistributionSummary over all offices, which I think needs to be calculated every time (similar to a Gauge).
I have a Spring Boot 2 app with micrometer and collect the metrics with prometheus and display them in grafana.
What I tried so far:
When I register a DistributionSummary, I can record all the values once during startup... this gives me the distribution, but calculated values like max get lost over time and I cannot update the DistributionSummary (recording new offices would work, but not changing existing ones)
// during startup
seatsInOffice = DistributionSummary.builder("office.seats")
        .publishPercentileHistogram()
        .sla(1, 5, 20, 50)
        .register(meterRegistry);
officeService.getAllOffices().forEach(o -> seatsInOffice.record(o.getNumberOfSeats()));
I also tried to use a #Scheduled task to remove and completely rebuild the DistributionSummary. This seems to work, but feels wrong somehow. Would that be a recommended approach? That would also probably need some synchronisation to not collect the metrics between removing and recalculating distribution.
@Scheduled(fixedRate = 5 * 60 * 1000)
public void recalculateMetrics() {
    if (seatsInOffice != null) {
        meterRegistry.remove(seatsInOffice);
    }
    seatsInOffice = DistributionSummary.builder("office.seats")
            .publishPercentileHistogram()
            .sla(1, 5, 20, 50)
            .register(meterRegistry);
    officeService.getAllOffices().forEach(o -> seatsInOffice.record(o.getNumberOfSeats()));
}
Another problem I just recognized with this approach: the /actuator/prometheus endpoint still returns the values for the old (removed) metrics, so everything is there multiple times.
For something like SLA boundaries I could also use some gauges to provide the values (by calculating them myself), but that would not give me quantiles. Is it possible to create a new DistributionSummary without registering it and just provide the values it collected somehow?
meterRegistry.gauge("office.seats", Tags.of("le", "1"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(1).size());
meterRegistry.gauge("office.seats", Tags.of("le", "5"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(5).size());
meterRegistry.gauge("office.seats", Tags.of("le", "20"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(20).size());
meterRegistry.gauge("office.seats", Tags.of("le", "50"), officeService,
x -> x.getAllOfficesWithLessThanXSeats(50).size());
I would like to have a DistributionSummary that takes a lambda or something like that to get the values. But maybe these tools are not made for this use case and I should use something else. Can you recommend something?
DistributionSummary has a config option, distributionStatisticExpiry, that can control rotating the recorded data. It's a workaround.
But PrometheusDistributionSummary won't use this field:
case Prometheus:
    histogram = new TimeWindowFixedBoundaryHistogram(clock, DistributionStatisticConfig.builder()
            .expiry(Duration.ofDays(1825)) // effectively never roll over
            .bufferLength(1)
            .build()
            .merge(distributionStatisticConfig), true);
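For registries that do honor it, a minimal sketch of setting the expiry on the builder (the duration value is illustrative):

seatsInOffice = DistributionSummary.builder("office.seats")
        .publishPercentileHistogram()
        .sla(1, 5, 20, 50)
        // rotate the recorded distribution statistics after this period (illustrative value)
        .distributionStatisticExpiry(Duration.ofMinutes(10))
        .register(meterRegistry);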

pl/sql: Functions

I have three column values in an Excel sheet:
A: # of unsuccessful transfers to CCR (CTI) = 11986
B: # of calls NOT wrapped = 8585
C: # of wrapped calls = 15283
The total of the three columns should be the # of incoming calls (CTI) = 37017 (that is, # of wrapped calls + # of unsuccessful transfers to CCR (CTI) + # of calls NOT wrapped).
I also calculate the # of unaccounted calls (this is the # of incoming calls minus the # of wrapped calls, the # of unsuccessful transfers to CCR (CTI) and the # of calls NOT wrapped).
So my # of unaccounted calls = 1163.
Now I have to find out the percentage of unaccounted calls, so I divide 1163 by 37017.
So my percentage is 3%; ideally it should be 0%. How do I find out in Oracle, out of that 3%, what percentage falls in A, B or C?
A, B and C come from database queries; the source is the same, but with a bunch of different filters for each query for A, B and C.
Since the sum of the counts from the three queries with additional filters is lower than the count from the query without those filters, you seem to have a gap in the filters themselves. If I had to guess, the first place I'd look is for incorrect handling of null values, trying to equate them (since null is neither equal nor not equal to anything, even itself). But that's clearly speculation, and without seeing the filters and knowing which columns can be null, it isn't much help.
You can maybe isolate the 1163 rows that aren't showing up by using minus to find the rows picked up by the 'total' query and not included by any of those producing A, B and C; something like:
select *
from xx_new.xx_cti_call_details@appsread.prd.com
where dealer_name = 'XYG'
and TRUNC(CREATION_DATE) BETWEEN '01-JUL-2012' AND '31-JUL-2012'
minus
select *
from xx_new.xx_cti_call_details@appsread.prd.com
where dealer_name = 'XYG'
and TRUNC(CREATION_DATE) BETWEEN '01-JUL-2012' AND '31-JUL-2012'
and <additional filters for A>
minus
select *
from xx_new.xx_cti_call_details@appsread.prd.com
where dealer_name = 'XYG'
and TRUNC(CREATION_DATE) BETWEEN '01-JUL-2012' AND '31-JUL-2012'
and <additional filters for B>
minus
select *
from xx_new.xx_cti_call_details@appsread.prd.com
where dealer_name = 'XYG'
and TRUNC(CREATION_DATE) BETWEEN '01-JUL-2012' AND '31-JUL-2012'
and <additional filters for C>
That might allow you to spot a pattern in the rows that aren't picked up by any of A, B or C, though you'd still need to work out which of the three queries you would have expected each row (or pattern of rows) to have been picked up by, and why they were missed.
I'm curious about you having a distinct in your initial query though, since it suggests you're counting the switches that calls are made from rather than the calls themselves. It also might mean the counts should not add up - though in that case I'd perhaps expect A+B+C to be greater than the simple count, as there would be the potential for overlaps - and that select * might actually return more than 1163 rows; in which case you might only want to select the columns you think might be a problem.
Incidentally, if creation_date is indexed then you might get better performance with where creation_date >= date '2012-07-01' and creation_date < date '2012-08-01', as the trunc() function would prevent the index being used. Might not be an issue for you though.
