StateMap keys across different instances of the same processor - apache-nifi

NiFi 1.2.0.
In a custom processor, an LSN is used to fetch data from a SQL Server database table.
Following are the code snippets used for:
Storing a key-value pair
final StateManager stateManager = context.getStateManager();
try {
    StateMap stateMap = stateManager.getState(Scope.CLUSTER);
    final Map<String, String> newStateMapProperties = new HashMap<>();
    String lsnUsedDuringLastLoadStr = Base64.getEncoder().encodeToString(lsnUsedDuringLastLoad);
    // Just a constant String used as key
    newStateMapProperties.put(ProcessorConstants.LAST_MAX_LSN, lsnUsedDuringLastLoadStr);
    if (stateMap.getVersion() == -1) {
        stateManager.setState(newStateMapProperties, Scope.CLUSTER);
    } else {
        stateManager.replace(stateMap, newStateMapProperties, Scope.CLUSTER);
    }
} catch (IOException e) {
    // handle/log the failure to persist state
}
Retrieving the key-value pair
final StateManager stateManager = context.getStateManager();
final StateMap stateMap;
final Map<String, String> stateMapProperties;
byte[] lastMaxLSN = null;
try {
    stateMap = stateManager.getState(Scope.CLUSTER);
    stateMapProperties = new HashMap<>(stateMap.toMap());
    lastMaxLSN = (stateMapProperties.get(ProcessorConstants.LAST_MAX_LSN) == null
            || stateMapProperties.get(ProcessorConstants.LAST_MAX_LSN).isEmpty()) ? null
            : Base64.getDecoder()
                    .decode(stateMapProperties.get(ProcessorConstants.LAST_MAX_LSN).getBytes());
} catch (IOException e) {
    // handle/log the failure to read state
}
When a single instance of this processor is running, the LSN is stored and retrieved properly and the logic of fetching data from SQL Server tables works fine.
As per the NiFi documentation about state management:
Storing and Retrieving State
State is stored using the StateManager's getState, setState, replace, and clear methods. All of these methods require that a Scope be provided. It should be noted that the state that is stored with the Local scope is entirely different than state stored with a Cluster scope. If a Processor stores a value with the key of My Key using the Scope.CLUSTER scope, and then attempts to retrieve the value using the Scope.LOCAL scope, the value retrieved will be null (unless a value was also stored with the same key using the Scope.CLUSTER scope). Each Processor's state is stored in isolation from other Processors' state.
When two instances of this processor are running, only one is able to fetch the data. This has led to the following question:
Is the StateMap a 'global map' whose keys must be unique across instances of the same processor and also across instances of different processors? In simple words, whenever a processor puts a key into the StateMap, must that key be unique across all NiFi processors (and other services, if any, that use the State API)? If yes, can anyone suggest what unique key I should use in my case?
Note: I quickly glanced at the standard MySQL CDC processor class (CaptureChangeMySQL.java) and it has similar logic to store and retrieve state, so am I overlooking something?

The StateMap for a processor is stored underneath the id of the component, so if you have two instances of the same type of processor (meaning you can see two processors on the canvas) you would have something like:
/components/1111-1111-1111-1111 -> serialized state map
/components/2222-2222-2222-2222 -> serialized state map
Assuming 1111-1111-1111-1111 was the UUID of processor 1 and 2222-2222-2222-2222 was the UUID of processor 2. So the keys in the StateMap don't have to be unique across all instances, because they are scoped per component id.
In a cluster, the component id of each component is the same on all nodes. So if you have a 3 node cluster and processor 1 has id 1111-1111-1111-1111, then there is a processor with that id on each node.
If that processor is scheduled to run on all nodes and stores cluster state, then all three instances of the processor are going to be updating the same StateMap in the clustered state provider (ZooKeeper).
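A sketch of how this plays out in code (reusing the question's constants; not part of the original post): because all nodes update the same cluster-scoped StateMap for this component, replace() returns false when another node has modified the state since it was read, and the update should be re-attempted with the fresh state.
final StateManager stateManager = context.getStateManager();
try {
    boolean updated = false;
    while (!updated) {
        final StateMap current = stateManager.getState(Scope.CLUSTER);
        final Map<String, String> newState = new HashMap<>(current.toMap());
        newState.put(ProcessorConstants.LAST_MAX_LSN, lsnUsedDuringLastLoadStr);
        if (current.getVersion() == -1) {
            // No state has been stored for this component yet
            stateManager.setState(newState, Scope.CLUSTER);
            updated = true;
        } else {
            // Succeeds only if the state has not changed since getState()
            updated = stateManager.replace(current, newState, Scope.CLUSTER);
        }
    }
} catch (IOException e) {
    // handle/log the failure to update state
}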

Related

Utilize a single processor to process data from multiple sources of different Key and Value "Serdes"

Is it possible to utilize a single processor to process data from multiple sources of different Key and Value "Serdes"?
Below is my topology
topology.addSource("MarketData", Serdes.String().deserializer(), marketDataSerde.deserializer(),"market.data")
.addSource("EventData", Serdes.String().deserializer(), eventDataSerde.deserializer(),"event.data")
.addProcessor("StrategyTwo", new StrategyTwoProcessorSupplier(), "MarketData", "EventData")
.addSink("StrategyTwoSignal", "signal.data", Serdes.String().serializer(), signalSerde.serializer(),"StrategyTwo");
Below is the process method from the processor.
public void process(Record<String, MarketData> record) {
    MarketData marketData = record.value();
}
Is it possible to have a generic record in the process method that can be processed differently depending on the type of record?
In the event that the above solution is not feasible, is it possible to have multiple sources and processors without having intermediate topics as a result? Example:
topology.addSource("MarketData", Serdes.String().deserializer(), marketDataSerde.deserializer(),"market.data")
.addProcessor("StrategyTwoMarketData", new StrategyTwoMarketDataProcessorSupplier(), "MarketData")
.addSource("EventData", Serdes.String().deserializer(), eventDataSerde.deserializer(),"event.data")
.addProcessor("StrategyTwoEventData", new StrategyTwoEventDataProcessorSupplier(), "EventData")
.addProcessor("StrategyTwo", new StrategyTwoProcessorSupplier(), "EventData")
.addSink("StrategyTwoSignal", "signal.data", Serdes.String().serializer(), signalSerde.serializer(),"StrategyTwo");

MeterRegistry creating tags dynamically for gauge and updating data based on tag id for Prometheus

So I have a monitoring service that is essentially trying to monitor timestamps in my Kafka clusters. Each topic has some number of partitions, and in my tags I want to display the partition number as well as the topic name and cluster. Is there a way for me to create the tags dynamically and also check whether a gauge with those tags already exists? If it does, we just update the gauge value; if it does not, we create a new gauge with the appropriate tags.
// Some pseudo-code in Java
Map<Integer, Integer> partitionMap = new HashMap<>(); // Key is partition and value is some arbitrary data
for (every Kafka cluster : kc) {
    for (every key value pair in partitionMap) {
        AtomicLong myGauge = new AtomicLong(-1);
        Tags someTags = Tags.of("topic", kc.topicName, "cluster", kc.clusterName, "partition", key);
        if (meterRegistry.get("name.of.query") != null && meterRegistry.get("name.of.query").contains(someTags)) {
            myGauge.get(someTags).set(partitionMap.get(key)); // Update the data point at the partition based on value
        } else {
            myGauge = meterRegistry.gauge("name.of.query", someTags, new AtomicLong(partitionMap.get(key)));
        }
    }
}
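One common Micrometer pattern (a sketch, not from the question): register each gauge once per unique tag combination against an AtomicLong kept in a map, and on later polls only update the backing AtomicLong. The meter name kafka.partition.timestamp and the class/method names are placeholders. Keeping the AtomicLong in the map also holds a strong reference, so the gauge is not garbage-collected.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class PartitionTimestampMetrics {

    private final MeterRegistry meterRegistry;
    // One AtomicLong per unique tag combination; the registered gauge reads from it.
    private final Map<Tags, AtomicLong> gauges = new ConcurrentHashMap<>();

    public PartitionTimestampMetrics(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void record(String cluster, String topic, int partition, long timestamp) {
        Tags tags = Tags.of("cluster", cluster, "topic", topic, "partition", String.valueOf(partition));
        // computeIfAbsent registers the gauge the first time this tag set is seen;
        // afterwards only the backing AtomicLong is updated.
        AtomicLong holder = gauges.computeIfAbsent(tags,
                t -> meterRegistry.gauge("kafka.partition.timestamp", t, new AtomicLong()));
        holder.set(timestamp);
    }
}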

Getting non compacted key/value from day window based statestore

Topology Definition:
KStream<String, JsonNode> transactions = builder.stream(inputTopic, Consumed.with(Serdes.String(), jsonSerde));
KTable<Windowed<String>, JsonNode> aggregation =
        transactions
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofSeconds(windowDuration))
                        .grace(Duration.ofSeconds(windowGraceDuration)))
                .aggregate(() -> new Service().buildInitialStats(),
                        (key, transaction, previous) -> new Service().build(key, transaction, previous),
                        Materialized.<String, JsonNode, WindowStore<Bytes, byte[]>>as(statStoreName)
                                .withRetention(Duration.ofSeconds(windowDuration + windowGraceDuration + windowRetentionDuration))
                                .withKeySerde(Serdes.String())
                                .withValueSerde(jsonSerde)
                                .withCacheDisabled())
                .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
aggregation.toStream()
        .to(outputTopic, Produced.with(windowedSerde, jsonSerde));
State store API: fetch the key by looking up all time windows.
Instant timeFrom = Instant.ofEpochMilli(0);
Instant timeTo = Instant.now();
WindowStoreIterator<ObjectNode> values = store.fetch(key, timeFrom, timeTo);
while (values.hasNext()) {
    System.out.println(values.next());
}
As part of a test, I performed 2 transactions and both produce key1. My requirement is to get key1 twice (current & previous), without compaction, when I look up the state store. The result always returns the final record with the key and the final aggregated value.
Txn1 --> Key - Key1 | Value - {Count=1,attribute='test'}
Txn2 --> Key - Key1 | Value - {Count=2,attribute='test1'}
Current behavior after the state store lookup: I always get the compacted key1 with value = {Count=2,attribute='test1'}.
Instead, I would like to get all key1 entries for that window duration.
As part of a solution I made the changes below, but unfortunately they did not work:
Disabled caching at the topology level
Set cache.max.bytes.buffering to 0
Removed the compact policy manually from the internal changelog topic
I suspect the changelog topic is compacted, and thus I get compacted keys when calling the state store API.
What changes are needed to get non-compacted keys through the state store API?
If you want to get all intermediate results, you should not use the suppress() operator. suppress() is designed to emit a single result record per window, i.e., it does the exact opposite of what you want.
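A minimal sketch of that suggestion, reusing the topology from the question with the suppress() step removed, so that every update to a window's aggregate is forwarded downstream (caching is already disabled here, so updates are not batched either):
KTable<Windowed<String>, JsonNode> aggregation =
        transactions
                .groupByKey()
                .windowedBy(TimeWindows.of(Duration.ofSeconds(windowDuration))
                        .grace(Duration.ofSeconds(windowGraceDuration)))
                .aggregate(() -> new Service().buildInitialStats(),
                        (key, transaction, previous) -> new Service().build(key, transaction, previous),
                        Materialized.<String, JsonNode, WindowStore<Bytes, byte[]>>as(statStoreName)
                                .withRetention(Duration.ofSeconds(windowDuration + windowGraceDuration + windowRetentionDuration))
                                .withKeySerde(Serdes.String())
                                .withValueSerde(jsonSerde)
                                .withCacheDisabled());
// No .suppress(...) here: each incoming record produces an updated aggregate downstream.
aggregation.toStream().to(outputTopic, Produced.with(windowedSerde, jsonSerde));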

Enrich each existing value in a cache with the data from another cache in an Ignite cluster

What is the most performant way to update a field of each existing value in an Ignite cache with data from another cache in the same cluster (tens of millions of records, about a kilobyte each)?
Pseudo code:
try (mappings = getCache("mappings")) {
    try (entities = getCache("entities")) {
        entities.foreach((key, entity) -> entity.setInternalId(mappings.getValue(entity.getExternalId())));
    }
}
I would advise using compute and sending a closure to all the nodes in the cache topology. Then, on each node, you would iterate through the local primary set and do the updates. Even with this approach you would still be better off batching up the updates and issuing them with a putAll call (or maybe using IgniteDataStreamer).
NOTE: for the example below, it is important that keys in "mappings" and "entities" caches are either identical or colocated. More information on collocation is here:
https://apacheignite.readme.io/docs/affinity-collocation
The pseudo code would look something like this:
ClusterGroup cacheNodes = ignite.cluster().forCacheNodes("mappings");
IgniteCompute compute = ignite.compute(cacheNodes);
compute.broadcast(() -> {
    IgniteCache<K, V1> mappings = getCache("mappings");
    IgniteCache<K, V2> entities = getCache("entities");
    // Iterate over local primary entries.
    entities.localEntries(CachePeekMode.PRIMARY).forEach(entry -> {
        V1 mappingVal = mappings.get(entry.getKey());
        V2 entityVal = entry.getValue();
        V2 newEntityVal = // do enrichment;
        // It would be better to create a batch and then call putAll(...);
        // using a simple put call for simplicity.
        entities.put(entry.getKey(), newEntityVal);
    });
});
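A sketch of the batching variant mentioned in the comment above, assuming String keys and a hypothetical Entity value type with getExternalId/setInternalId accessors; the batch size is arbitrary:
int batchSize = 1_000;
Map<String, Entity> batch = new HashMap<>();
entities.localEntries(CachePeekMode.PRIMARY).forEach(entry -> {
    Entity entity = entry.getValue();
    entity.setInternalId(mappings.get(entity.getExternalId()));
    batch.put(entry.getKey(), entity);
    if (batch.size() >= batchSize) {
        entities.putAll(batch); // one bulk update per batch instead of per-entry puts
        batch.clear();
    }
});
if (!batch.isEmpty()) {
    entities.putAll(batch); // flush the remainder
}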

Tombstone messages not removing record from KTable state store?

I am creating a KTable by processing data from a KStream. But when I trigger a tombstone message with a key and a null payload, it does not remove the record from the KTable.
Sample -
public KStream<String, GenericRecord> processRecord(@Input(Channel.TEST) KStream<GenericRecord, GenericRecord> testStream, ...) {
    KTable<String, GenericRecord> table = testStream
            .map((genericRecord, genericRecord2) -> KeyValue.pair(genericRecord.get("field1") + "", genericRecord2))
            .groupByKey()
            .reduce((genericRecord, v1) -> v1, Materialized.as("test-store"));
    ...
}

GenericRecord genericRecord = new GenericData.Record(getAvroSchema(keySchema));
genericRecord.put("field1", Long.parseLong(test.getField1()));
ProducerRecord record = new ProducerRecord(Channel.TEST, genericRecord, null);
kafkaTemplate.send(record);
Upon triggering a message with a null value, I can see the null payload in the testStream map function while debugging, but it doesn't remove the record from the KTable changelog "test-store". It looks like it doesn't even reach the reduce method; I'm not sure what I am missing here.
Appreciate any help on this!
Thanks.
As documented in the JavaDocs of reduce():
Records with {@code null} key or value are ignored.
Because the <key,null> record is dropped, and thus (genericRecord, v1) -> v1 is never executed, no tombstone is written to the store or changelog topic.
For the use case you have in mind, you need to use a surrogate value that indicates "delete", for example a boolean flag within your Avro record. Your reduce function needs to check for the flag and return null if the flag is set; otherwise, it must process the record regularly.
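A sketch of that approach, assuming a hypothetical boolean "deleted" field in the Avro value schema; returning null from the reducer produces the tombstone:
KTable<String, GenericRecord> table = testStream
        .map((keyRecord, valueRecord) -> KeyValue.pair(keyRecord.get("field1") + "", valueRecord))
        .groupByKey()
        .reduce((previousValue, newValue) -> {
            // Returning null deletes the key from the store and writes a tombstone to the changelog
            if (Boolean.TRUE.equals(newValue.get("deleted"))) {
                return null;
            }
            return newValue;
        }, Materialized.as("test-store"));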
Update:
Apache Kafka 2.6 adds the KStream#toTable() operator (via KIP-523) that allows transforming a KStream into a KTable.
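With 2.6+, a sketch of the same table built via toTable(); here a record with a null value is treated as a tombstone and removes the key, so no surrogate flag is needed:
KTable<String, GenericRecord> table = testStream
        .map((keyRecord, valueRecord) -> KeyValue.pair(keyRecord.get("field1") + "", valueRecord))
        .toTable(Materialized.as("test-store"));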
An addition to the above answer by Matthias:
The reducer is not invoked for the first record with a given key, so the mapped and grouped value is stored as-is in the KTable, never passing through the reduce method for tombstoning. This means it will not be possible to just join another stream on that table; the value itself also needs to be evaluated.
I hope KIP-523 solves this.
