Kafka Streams state store distribution - apache-kafka-streams

I have a question about sharing state stores. Do the applications each need their own local state store in order to exchange metadata?
I mean, let's assume there are 2 applications: one processes data published to a topic into its local state store, and the other exposes APIs to access that data. As part of the second application, the stream builder adds the state store to its topology.
Would this be possible by defining application.server in the streams configuration?
Also, when we define the application.server property for streaming applications, what is the port number? Is it just any random number we provide? In the documentation's examples "host1:4460", "host5:5307" and "host3:4777", what do these numbers stand for?
Thanks
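For reference, here is a minimal sketch of setting that property (the application id, broker address and host1:4460 are placeholders). The port is not a random number that Kafka Streams picks for you: it should be the port at which this instance's own query endpoint (for example a REST service you run yourself) is reachable by the other instances. Kafka Streams only distributes this host:port metadata; it does not open the port itself.

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ServerConfigSketch {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        // Placeholder endpoint: the host:port at which THIS instance's own
        // query layer (e.g. a REST API you implement) listens; other
        // instances discover it via the streams metadata.
        props.put(StreamsConfig.APPLICATION_SERVER_CONFIG, "host1:4460");
        return props;
    }
}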

Related

Best way to track/trace a JSON object (a piece of time series data) as it flows through a system of microservices on an IoT platform

We are working on an IoT platform which ingests many device parameter values (time series) every second from many devices. Once each JSON (a batch of multiple parameter values captured at a particular instant) is ingested, what is the best way to track it as it flows through many microservices downstream in an event-driven way?
We use Spring Boot predominantly and all the services are containerised.
E.g.: Option 1 - Is associating a UUID with each object and then updating its state idempotently in Redis as each microservice processes it ideal? The problem is that every microservice would then be tied to Redis, and we have seen Redis performance degrade as the number of API calls to it increases, since it is single-threaded (we can scale it out, though).
Option 2 - Zipkin?
Note: We use Kafka/RabbitMQ to process the messages in a distributed way, as you mentioned here. My question is about a strategy for tracking each of these messages and its status (to enable replay if needed and to attain exactly-once delivery). Let's say message1 is being processed by Service A, Service B and Service C. We are now having trouble telling whether the message failed to be processed at Service B or at Service C, as we get a lot of messages.
A better approach would be to use Kafka instead of Redis.
Create a topic for every microservice and keep moving the packet from one topic to another after processing:
topic(raw-data) - |MS One| - topic(processed-data-1) - |MS Two| - topic(processed-data-2) ... etc.
Keep appending the results to the same object and keep moving it down the line, until every microservice has processed it.
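A minimal sketch of one hop in such a pipeline, written with Kafka Streams (the application id and broker address are assumptions; the topic names follow the example above, and the enrichment step is a placeholder):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MsOneHop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ms-one");            // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("raw-data")
               .mapValues(json -> json + " [+ MS One result]") // placeholder: append this service's result
               .to("processed-data-1");                        // the next service consumes this topic

        new KafkaStreams(builder.build(), props).start();
    }
}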

Publishing snapshot data when subscriber connects to publisher in ZeroMQ PUB/SUB model

I have a simple ZeroMQ PUB/SUB architecture for streaming data from the publisher to subscribers. When a subscriber connects, the publisher starts streaming the data, but I want to modify this so that the publisher first publishes the most recent snapshot of the data and only then starts streaming.
How can I achieve this?
Q : How can I achieve this?
( this being: "... I want to modify it, so that publisher publishes the most recent snapshot of the data first and after that starts streaming." )
Solution :
Instantiate a pair of PUB-s, the first called aSnapshotPUBLISHER, the second aStreamingPUBLISHER. Using the XPUB archetype for the former may help to easily integrate some add-on logic for subscriber-base management ( a nice-to-have feature, yet kind of off-topic at the moment ).
Configure the former with aSnapshotPUBLISHER.setsockopt( ZMQ_CONFLATE, 1 ); other settings may focus on reducing latency and on ensuring all the needed resources are available, both for smooth streaming via aStreamingPUBLISHER and for keeping the most recent snapshot readily available in aSnapshotPUBLISHER for any newcomer.
SUB-side agents simply follow this approach: having set up a pair of working ( .bind()/.connect() ) links ( to either of the PUB-s, or to a pair of XPUB+PUB ) and having confirmed that the links are up and running smoothly, they stop sourcing snapshots from aSnapshotPUBLISHER and keep consuming only the ( now synced, using a BaseID / TimeStamp / FrameMarker or similar alignment ) streaming data from aStreamingPUBLISHER.
The known as-is limitation of ZMQ_CONFLATE mode, namely that it does not support multi-frame message payloads, need not be considered a problem, since a low-latency rule of thumb is to pack/compress any data into right-sized BLOBs rather than moving any sort of "decorated" but inefficient data-representation formats over the wire.
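A minimal JeroMQ sketch of the two-publisher idea (endpoints and the payload are placeholders); the key line is setConflate(true), which maps to ZMQ_CONFLATE and keeps only the newest message for late joiners:

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class DualPublisherSketch {
    public static void main(String[] args) {
        try (ZContext ctx = new ZContext()) {
            // aSnapshotPUBLISHER: CONFLATE keeps only the most recent message,
            // so a newly connected subscriber gets the latest snapshot first.
            ZMQ.Socket snapshotPub = ctx.createSocket(SocketType.PUB);
            snapshotPub.setConflate(true);      // must be set before bind()
            snapshotPub.bind("tcp://*:5556");   // placeholder endpoint

            // aStreamingPUBLISHER: a plain PUB socket for the live feed.
            ZMQ.Socket streamingPub = ctx.createSocket(SocketType.PUB);
            streamingPub.bind("tcp://*:5557");  // placeholder endpoint

            for (long seq = 0; !Thread.currentThread().isInterrupted(); seq++) {
                String update = "tick " + seq;  // placeholder single-frame payload
                snapshotPub.send(update);       // overwritten by every newer update
                streamingPub.send(update);      // delivered to all live subscribers
            }
        }
    }
}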

What is the behaviour of ProcessorContext.getStateStore(String name) & ReadOnlyKeyValueStore.get(String key) in Kafka Streams

I have a 1.0.0 Kafka Streams application with two classes, as described in How to evaluate consuming time in kafka stream application. In my application, I read the events, perform some conditional checks and forward them to another topic on the same Kafka cluster. During my evaluation, I fetch some expressions from Kafka with the help of a global table store, and I observed that most of the time is spent getting the value from the store (sample code is below).
Is it read from Kafka only once and then maintained in the local store?
or
Is it read from Kafka every time we call the org.apache.kafka.streams.state.ReadOnlyKeyValueStore.get(String key) API? If so, how do we maintain a local store instead of reading from Kafka every time?
Please help.
Ex:
private KeyValueStore<String, List<String>> policyStore =
        (KeyValueStore<String, List<String>>) this.context.getStateStore(policyGlobalTableName);
List<String> policyIds = policyStore.get(event.getCustomerCode());
By default, stores use an application-local RocksDB instance to buffer data. Thus, if you query the store with a get(), it will not go over the network to the brokers, but only to the local RocksDB instance.
You can try changing RocksDB settings to improve performance, but I have no guidelines at the moment on which configs you might want to change. Configuring RocksDB is quite tricky, but you might want to search the Internet for further information about it.
You can pass in RocksDB configs via StreamsConfig (cf. https://docs.confluent.io/current/streams/developer-guide/config-streams.html#rocksdb-config-setter)
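A minimal sketch of such a config setter (the 16 MB block-cache size is purely illustrative, not a recommendation):

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class CustomRocksDBConfig implements RocksDBConfigSetter {
    @Override
    public void setConfig(final String storeName, final Options options,
                          final Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig();
        tableConfig.setBlockCacheSize(16 * 1024 * 1024L); // illustrative 16 MB block cache
        options.setTableFormatConfig(tableConfig);
    }
}

// Registered via:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, CustomRocksDBConfig.class);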
As an alternative, you could also try reconfiguring Streams to use in-memory stores instead of RocksDB. Note that this will increase your rebalance time, as there is no local buffered state if you use in-memory stores instead of RocksDB. (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-and-creating-a-state-store)
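For illustration, an in-memory store can be declared via the Stores factory from the 1.0 API (the store name and serdes below are assumptions):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class InMemoryStoreSketch {
    public static StoreBuilder<KeyValueStore<String, String>> policyStoreBuilder() {
        // in-memory supplier instead of the default persistent (RocksDB) one
        return Stores.keyValueStoreBuilder(
                Stores.inMemoryKeyValueStore("policy-store"), // assumed store name
                Serdes.String(),
                Serdes.String());
    }
}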

Spring Cloud | Gather response from multiple destinations

I'm wondering whether Spring Cloud Stream would be a good fit for a specific system we're thinking of building from the ground up. There's currently a monolith (an ESB) in use, but we are looking to benefit from the goodness of microservices (the Spring Cloud ecosystem especially).
We receive requests from the input source (a JMS queue; ActiveMQ, to be specific) at the rate of 5 requests/second.
We will need different routing rules (based on the payload or some derived logic) to route each message to different output destinations (say A, B, C). The output destinations are JMS queues.
Finally, we'll have to receive the 3 responses from A, B and C (by listening to a different set of queues) and mash up the final response. This response is then dispatched to another output channel (which is another JMS queue).
There are a few corner cases, such as when the response from A takes more than 5 seconds; in that case we'll want to mash up the responses from B and C with an error object for A. The same goes for B and C.
Also, the destinations 'A','B' and 'C' are dynamic. We could have more target systems 'D', 'E' etc in the future. We're looking at not having to change the main orchestration layer if a new system is introduced.
Is Spring Cloud Stream the right choice? I'm looking for more specific pointers in case of Aggregating the responses from multiple JMS queues (with timeouts) and mashing up the response.
What you are describing is fully covered by the Aggregator EIP or its more powerful friend, Scatter-Gather.
Both of them are available in Spring Integration:
Aggregator
Scatter-Gather
So, you will need some correlationKey to be able to gather all the responses into the same group and aggregate them in the end.
There is also a group-timeout option, which allows you to release the group when not all replies have arrived after some time.
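A minimal Spring Integration Java DSL sketch of that aggregation (the channel names, the correlation header and the group size are placeholders for whatever your flow actually uses):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Configuration
public class ReplyAggregationSketch {
    @Bean
    public IntegrationFlow aggregateReplies() {
        return IntegrationFlows.from("repliesChannel")       // fed by the A/B/C reply queues
                .aggregate(a -> a
                        .correlationExpression("headers['correlationId']") // assumed header
                        .releaseExpression("size() == 3")    // complete once A, B and C replied
                        .groupTimeout(5000)                  // the 5-second corner case
                        .sendPartialResultOnExpiry(true))    // release B + C even if A is late
                .channel("mashupChannel")                    // downstream mash-up / error handling
                .get();
    }
}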

What is the purpose of spring cloud stream instanceCount?

In Spring Cloud Stream, what exactly is the usage of the property spring.cloud.stream.instanceCount?
I mean, if that value becomes wrong because one or more microservice instances are down at a given moment, how could this affect the behaviour of our infrastructure?
instanceCount is used to partition data across different consumers. Having one or more services down should not really impact your producers; that's the job of the broker.
So let's say you have a source that sends data to 3 partitions: you'd have instanceCount=3, and each instance would have its own partition assigned via instanceIndex.
Each instance would be consuming data, but if instance 2 crashes, instances 0 and 1 would still be reading data from their partitions, and the source would still be sending data as usual.
Assuming your platform has some sort of recoverability in place, your crashed instance should come back to life and resume its operations.
What we still don't support is dynamic allocation of partitions at runtime; we are investigating this as a story for a future release.
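For reference, a sketch of the matching settings in application.properties (the binding names 'output'/'input' and the key expression are placeholders):

# producer side: spread data over 3 partitions
spring.cloud.stream.bindings.output.producer.partitionCount=3
spring.cloud.stream.bindings.output.producer.partitionKeyExpression=payload.id

# consumer side: one of 3 instances; instanceIndex is 0-based and unique per instance
spring.cloud.stream.instanceCount=3
spring.cloud.stream.instanceIndex=0
spring.cloud.stream.bindings.input.consumer.partitioned=true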
