Kafka Streams DSL: add an optional parameter to disable repartitioning when using `map`, `selectKey`, `groupBy`

According to the documentation, a stream is marked for repartitioning whenever map, selectKey, or groupBy is applied, even if the new key is already partitioned appropriately. Is it possible to add an optional parameter to disable repartitioning?
Here is my use case:
There is a topic that has already been partitioned by user_id.
# topic 'user', format '%key,%value'
partition-1:
user1,{'user_id':'user1', 'device_id':'device1'}
user1,{'user_id':'user1', 'device_id':'device1'}
user1,{'user_id':'user1', 'device_id':'device2'}
partition-2:
user2,{'user_id':'user2', 'device_id':'device3'}
user2,{'user_id':'user2', 'device_id':'device4'}
I want to count user_id-device_id pairs using the DSL as follows:
stream
    .groupBy((user_id, value) -> {
        JSONObject event = new JSONObject(value);
        String userId = event.getString("user_id");
        String deviceId = event.getString("device_id");
        return String.format("%s&%s", userId, deviceId);
    })
    .count();
The new key is effectively partitioned already, because it is derived from the existing key, so there is no need to repartition again.

If you use .groupBy(), it always causes data re-partitioning. If possible, use groupByKey() instead, which re-partitions data only if required.
In your case you are changing the key anyway, so a re-partition topic will be created.
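For illustration, here is a minimal sketch of both variants (topic name and serdes are assumptions based on the example above): groupByKey() keeps the existing partitioning by user_id, while groupBy() with a derived key marks the stream for repartitioning.
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream =
    builder.stream("user", Consumed.with(Serdes.String(), Serdes.String()));

// No repartitioning: the key (user_id) is unchanged.
KTable<String, Long> countsPerUser = stream
    .groupByKey()
    .count();

// Repartitioning: the key changes to "user_id&device_id", so the DSL
// inserts an internal repartition topic before the count.
KTable<String, Long> countsPerUserDevice = stream
    .groupBy((user_id, value) ->
        String.format("%s&%s", user_id, new JSONObject(value).getString("device_id")))
    .count();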

Related

influxDB: How to convert field to tag in influxDB v2.0

We need to convert a field to a tag in InfluxDB v2.0 but are not able to find any proper solution.
Can someone help me achieve this?
The solution we found was to create a new measurement by altering the fields and tags of the existing measurement, but we are not able to achieve that using the Flux language.
Using the Flux query below we can copy the data from one measurement to another, but we are not able to change a field to a tag while adding the data to the new measurement.
from(bucket: "bucket_name")
|> range(start: -10y)
|> filter(fn: (r) => r._measurement == "cu_om")
|> aggregateWindow(every: 5s, fn: last, createEmpty: false)
|> yield(name: "last")
|> set(key: "_measurement", value: "cu_om_new1")
|> to(org: "org_name", bucket: "bucket_name")
Any help appreciated.
You're almost there with your original code; the to() function takes extra parameters that allow this.
If your data already contains a column whose value should become a tag, you can list it in tagColumns in to().
Also, the new tag(s) must be string(s).
|> to(
    bucket: "NewBucketName",
    tagColumns: ["NewTagName"],
    fieldFn: (r) => ({"SomeValue": r._value})
)
Have a look at writing pivoted data to InfluxDB, maybe that's what you need. Using this method, you have control over which columns are written as fields and which as tags:
Use experimental.to() to write pivoted data to InfluxDB. Input data must have the following columns:
_time
_measurement
All columns in the group key other than _time and _measurement are written to InfluxDB as tags. Columns not in the group key are written to InfluxDB as fields.

Getting non-compacted keys/values from a day-window-based state store

Topology Definition:
KStream<String, JsonNode> transactions = builder.stream(inputTopic, Consumed.with(Serdes.String(), jsonSerde));
KTable<Windowed<String>, JsonNode> aggregation =
    transactions
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofSeconds(windowDuration))
            .grace(Duration.ofSeconds(windowGraceDuration)))
        .aggregate(() -> new Service().buildInitialStats(),
            (key, transaction, previous) -> new Service().build(key, transaction, previous),
            Materialized.<String, JsonNode, WindowStore<Bytes, byte[]>>as(statStoreName)
                .withRetention(Duration.ofSeconds(windowDuration + windowGraceDuration + windowRetentionDuration))
                .withKeySerde(Serdes.String())
                .withValueSerde(jsonSerde)
                .withCacheDisabled())
        .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
aggregation.toStream()
    .to(outputTopic, Produced.with(windowedSerde, jsonSerde));
State store API: fetch the key by looking up all time windows.
Instant timeFrom = Instant.ofEpochMilli(0);
Instant timeTo = Instant.now();
WindowStoreIterator<ObjectNode> value = store.fetch(key, timeFrom, timeTo);
while (value.hasNext()) {
    System.out.println(value.next());
}
As part of a test I performed two transactions, both producing key1. My requirement is to get key1 twice (current and previous) without compaction when I look up the state store. The result always returns the key with its final aggregated value.
Txn1 --> Key - Key1 | Value - {Count=1,attribute='test'}
Txn2 --> Key - Key1 | Value - {Count=2,attribute='test1'}
Current behavior after the state store lookup: I always get the compacted key1 with value = {Count=2,attribute='test1'}
Instead, I would like to get every key1 record for that window duration.
As part of a solution I made the changes below, but unfortunately they did not work:
Disabled caching at the topology level
Set cache.max.bytes.buffering to 0
Removed the compact cleanup policy manually from the internal changelog topic
I suspect the changelog topic is compacted and that I therefore get compacted keys when calling the state store API.
What changes are needed to get non-compacted keys through the state store API?
If you want to get all intermediate results, you should not use the suppress() operator. suppress() is designed to emit a single result record per window, i.e., it does the exact opposite of what you want.
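For illustration, a minimal sketch of the same aggregation with the suppress() step removed (it reuses the names from the topology above, e.g. transactions, Service, jsonSerde, windowedSerde); without suppress(), every incoming record produces an updated result that is forwarded downstream:
transactions
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofSeconds(windowDuration))
        .grace(Duration.ofSeconds(windowGraceDuration)))
    .aggregate(() -> new Service().buildInitialStats(),
        (key, transaction, previous) -> new Service().build(key, transaction, previous),
        Materialized.<String, JsonNode, WindowStore<Bytes, byte[]>>as(statStoreName)
            .withKeySerde(Serdes.String())
            .withValueSerde(jsonSerde)
            .withCacheDisabled())
    // no .suppress(...): each update to a window is emitted immediately
    .toStream()
    .to(outputTopic, Produced.with(windowedSerde, jsonSerde));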

How to use an auto-increment index in Tarantool?

I made an auto-increment index:
box.space.metric:create_index('primary', {
    parts = {{'id', 'unsigned'}},
    sequence = true,
})
Then I try to pass nil in the id field:
metric.id = nil
When I try to insert these values, I get an error:
Tuple field 1 type does not match one required by operation: expected unsigned
What value do I have to pass for the auto-increment field?
Second question: if I use a Tarantool cluster with a few instances (for example, a Cartridge-based application), is it proper to use auto-increment indexes? Will there be cases where duplicate keys appear on different instances?
It is not possible to pass nil. When you assign nil, you erase the field. Use box.NULL instead.
Better yet, use some kind of cluster-wide id that works well across the cluster instead of auto-increment, which only works inside one node.
For cluster-wide ids I would propose UUID or something like ULID (for example, from https://github.com/moonlibs/id).

Tombstone messages not removing record from KTable state store?

I am creating a KTable by processing data from a KStream. But when I send a tombstone message with a key and a null payload, it does not remove the record from the KTable.
Sample:
public KStream<String, GenericRecord> processRecord(@Input(Channel.TEST) KStream<GenericRecord, GenericRecord> testStream) {
    KTable<String, GenericRecord> table = testStream
        .map((genericRecord, genericRecord2) -> KeyValue.pair(genericRecord.get("field1") + "", genericRecord2))
        .groupByKey()
        .reduce((genericRecord, v1) -> v1, Materialized.as("test-store"));
GenericRecord genericRecord = new GenericData.Record(getAvroSchema(keySchema));
genericRecord.put("field1", Long.parseLong(test.getField1()));
ProducerRecord record = new ProducerRecord(Channel.TEST, genericRecord, null);
kafkaTemplate.send(record);
When I trigger a message with a null value, I can see the null payload in the testStream map function while debugging, but it does not remove the record from the KTable changelog "test-store". It looks like the record does not even reach the reduce method; I am not sure what I am missing here.
Appreciate any help on this!
Thanks.
As documented in the JavaDocs of reduce():
Records with {@code null} key or value are ignored.
Because the <key,null> record is dropped, and thus (genericRecord, v1) -> v1 is never executed, no tombstone is written to the store or the changelog topic.
For the use case you have in mind, you need to use a surrogate value that indicates "delete", for example a boolean flag within your Avro record. Your reduce function needs to check for the flag and return null if the flag is set; otherwise, it must process the record regularly.
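A minimal sketch of that idea (the boolean field name "deleted" is an assumption for illustration, not part of your schema):
KTable<String, GenericRecord> table = testStream
    .map((keyRecord, valueRecord) -> KeyValue.pair(keyRecord.get("field1") + "", valueRecord))
    .groupByKey()
    .reduce((previous, current) -> {
        // hypothetical flag inside the Avro value that marks a delete request
        Object deleted = current.get("deleted");
        if (deleted != null && (Boolean) deleted) {
            return null; // per the advice above, return null to remove the key
        }
        return current; // regular update: keep the latest value
    }, Materialized.as("test-store"));
Note that the reducer only runs when a previous value for the key already exists (see also the addition further below about the first record per key).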
Update:
Apache Kafka 2.6 adds the KStream#toTable() operator (via KIP-523) that allows to transform a KStream into a KTable.
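A minimal sketch of that variant (reusing testStream and the store name from the question); the stream is interpreted as a changelog, so a record with a null value should act as a delete for its key:
KTable<String, GenericRecord> table = testStream
    .map((keyRecord, valueRecord) -> KeyValue.pair(keyRecord.get("field1") + "", valueRecord))
    .toTable(Materialized.as("test-store"));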
An addition to the above answer by Matthias:
Reduce ignores the first record on the stream, so the mapped and grouped value will be stored as-is in the KTable, never passing through the reduce method for tombstoning. This means that it is not possible to simply join another stream on that table; the value itself also needs to be evaluated.
I hope KIP-523 solves this.

EmitMapper: ignoring a member at mapping time

I am using EmitMapper to copy values from one object to another.
When mapping the objects, I need to prevent certain fields from being mapped/copied over. The fields to be ignored keep changing depending on the scenario.
How can this be done in EmitMapper? The .Map method itself does not take any additional parameters to ignore certain properties. I can specify fields to be ignored using DefaultMapConfig, but that is static and cannot be changed during mapping.
Please help.
You have to configure the Mapper:
string[] fieldsToIgnore = { "NameOfThePropertyToIgnore" };
var mapper = ObjectMapperManager.DefaultInstance
    .GetMapper<SourceClass, DestClass>(
        new DefaultMapConfig()
            .IgnoreMembers<SourceClass, DestClass>(fieldsToIgnore)
    );
