Access Record Offset in Kafka Streams Aggregator

Access Record Offset in Kafka Streams Aggregator - apache-kafka-streams

I have a simple windowing topology:
builder.stream("input-topic", Consumed.with(...))
.groupByKey()
.windowedBy(TimeWindows.of(windowSize).advanceBy(windowAdvance).grace(windowGrace))
.aggregate(Frame::new,
this::windowAggregator,
...
)
.transformValues(FrameTransformer::new)
.toStream()
.selectKey((key, value) -> value...)
.to("output-topic", Produced.with(...));
I'd like to put the actual record offset of the beginning of the window into the Frame aggregating object.
How can I get access to the record offset from the windowAggregator (aggregate() handler) function?
I know that I can get access to the record offset in the FrameTransformer, but that doesn't help me creating accurate Frame objects describing my windows in terms of start and end offset.
I've heard there's a way of doing that by inserting another .transform() call before the groupByKey(), there I can access the offsets, but then I'd need to modify the schema of my event records to store there the offset information inside.
Is there a (simpler) way of achieving my intention?
Update
In fact, I was able to get the accurate window start and end offsets in the Frame objects in the following way
builder.stream("input-topic", Consumed.with(...))
.transformValues(EventTransformer::new)
.groupByKey()
.windowedBy(TimeWindows.of(windowSize).advanceBy(windowAdvance).grace(windowGrace))
.aggregate(Frame::new,
this::windowAggregator,
...
)
.toStream()
.selectKey((key, value) -> value...)
.to("output-topic", Produced.with(...));
But as mentioned above, at the expense of editing the schema of Event object.

How can I get access to the record offset from the windowAggregator (aggregate() handler) function?
You can't. Your approach to use transformValues() before the aggregation (and to enrich the Event object is the right approach.
There was a proposal to extend the API to allow accessing record metadata within aggregate() and other DSL operators, but it was never pushed over the finish line (cf https://cwiki.apache.org/confluence/display/KAFKA/KIP-159%3A+Introducing+Rich+functions+to+Streams).

Related

How to create MutableSharedFlow in Kotlin Coroutines simillar to PublishSubject from RxJava?

Is there an equivalent of PublishSubject from RxJava in Kotlin Coroutines library?
Channels cannot be a replacement for PublishSubject since they do not publish values to multiple collectors (each value can be collected by a single collector only). Even MutableSharedFlow that supports multiple collectors, still does not allow emitting values without waiting for collectors to finish processing previous values. How can we create a flow with functionality similar to the PublishSubject?

The following code will create a Flow equivalent to the PublishSubject:
fun <T> publishFlow(): MutableSharedFlow<T> {
return MutableSharedFlow(
replay = 0,
extraBufferCapacity = Int.MAX_VALUE
)
}
The main attributes of the PublishSubject are that it does not replay old values to new observers, and still allows to publish new values/events without waiting for the observers to handle them. So this functionality can be achieved with MutableSharedFlow by specifying replay = 0 for preventing new collectors from collecting old values, and extraBufferCapacity = Int.MAX_VALUE to allow publishing new values without waiting for busy collectors to finish collecting previous values.
One can add the following forceEmit function to be called instead of tryEmit, to ensure that the value is actually emitted:
fun <T> MutableSharedFlow<T>.forceEmit(value: T) {
val emitted = tryEmit(value)
check(emitted){ "Failed to emit into shared flow." }
}
Since we have a buffer with MAX_VALUE capacity, this forceEmit function should never fail if we use it with our publishFlow. If the flow will be replaced somehow with a different flow that does not support emitting without suspending, we will get an exception and will know to handle the case where the buffer is full and one cannot emit without suspending.
Notice that having a buffer of MAX_VALUE capacity may cause high consumption of memory if the collection of values by the collectors takes a long time, so it is more suitable for cases where the collectors perform a short synchronous operation (similarly to RxJava observers).

How to store the updates of state in an offchain database?

I want to store all the blockchain data in offchain database.
rpc has a function called EXPERIMENTAL_changes, I was told that I can do that by http polling of this method but I am unable to find out how to use it.
http post https://rpc.testnet.near.org jsonrpc=2.0 id=dontcare method=EXPERIMENTAL_changes \ params:='{ "changes_type": "data_changes", "account_ids": ["guest-book.testnet"], "key_prefix_base64": "", "block_id": 19450732 }'
For example here the results give:
"change": { "account_id": "guest-book.testnet", "key_base64": "bTo6Mzk=", "value_base64": "eyJwcmVtaXVtIjpmYWxzZSwic2VuZGVyIjoiZmhyLnRlc3RuZXQiLCJ0ZXh0IjoiSGkifQ==" }
What is key_base64?
Decoding it to string gives m::39
What is m::39?
For example, I have the following state data in the rust structure.
pub struct Demo {
user_profile_map: TreeMap<u128, User>,
user_products_map: TreeMap<u128, UnorderedSet<u128>>, // (user_id, set<product_id>)
product_reviews_map: TreeMap<u128, UnorderedSet<u128>>, // (product_id, set<review_id>)
product_check_bounty: LookupMap<u128, Vector<u64>>
}
How to know anything gets changed in these variables?
Will I have to check every block id for the point the contract is deployed, to know where there is the change?

I want to store all the blockchain data in offchain database.
If so, I recommend you take a look at the Indexer Framework, which allows you to get a stream of blocks and handle them. We use it to build Indexer for Wallet (keeps track of every added and deleted access key, and stores those into Postgres) and Indexer for Explorer (keeps track of every block, chunk, transaction, receipt, execution outcome, state changes, accounts, and access keys, and stores all of that in Postgres)
What is m::39?
Contracts in NEAR Protocol have access to the key-value storage (state), so at the lowest-level, you operate with key-value operations (NEAR SDK for AssemblyScript defines Storage class with get and set operations, and NEAR SDK for Rust has storage_read and storage_write calls to preserve data).
Guest Book example uses a high-level abstraction called PersistentVector, which automatically reads and writes its records from/to NEAR key-value storage (state). As you can see:
export const messages = new PersistentVector<PostedMessage>("m");
Guest Book defines the messages to be stored in the storage with m prefix, hense you see m::39, which basically means it is messages[39] stored in the key-value storage.
What is key_base64?
As key-value storage implies, the data is stored and accessed by keys, and the key can be binary, so base64 encoding is used to enable JSON-RPC API users with a way to query those binary keys as well (there is no way you can pass a raw binary blob in JSON).
How to know anything gets changed in these variables? Will I have to check every block id for the point the contract is deployed, to know where there is the change?
Correct, you need to follow every block, and check the changes. That is why we have built the Indexer Framework in order to enable community building services on top of that (we chose to build applications Indexer for Wallet and Indexer for Explorer, but others may decide to build GraphQL service like TheGraph)

How does Spark Streaming's countByValueAndWindow work?

I have a Spark Streaming application that is processing a stream of website click events. Each event has a property containing a GUID that identifies the user session that the event belongs to.
My application is counting up the number of events that occurred for each session, using windowing:
def countEvents(kafkaStream: DStream[(String, Event)]): DStream[(String, Session)] = {
// Get a list of the session GUIDs from the events
val sessionGuids = kafkaStream
.map(_._2)
.map(_.getSessionGuid)
// Count up the GUIDs over our sliding window
val sessionGuidCountsInWindow = sessionGuids.countByValueAndWindow(Seconds(60), Seconds(1))
// Create new session objects with the event count
sessionGuidCountsInWindow
.map({
case (guidS, eventCount) =>
guidS -> new Session().setGuid(guidS).setEventCount(eventCount)
})
}
My understanding was that the countByValueAndWindow function is only counting the values in the DStream on which the function is called. In other words, in the code above, the call to countByValueAndWindow should return the event counts only for the session GUIDs in the sessionGuids DStream on which we're calling that function.
But I'm observing something different; the call to countByValueAndWindow is returning counts for session GUIDs that are not in sessionGUIDs. It appears to be returning counts for session GUIDs that were processed in previous batches. Am I just misunderstanding how this function works? I haven't been able to find anything in the way of useful documentation online.

A colleague of mine who is much more versed in the ways of Spark than I has helped me with this. Apparently I was mis-understanding the way that the countByValueAndWindow function works. I thought that it would only return counts for values in the DStream for which you're calling the function. But, in fact, it returns counts for all values across the entire window. To address my issue, I simply perform a join between my input DStream and the DStream resulting from the countByValueAndWindow operation. Thus I only end up with results for values in my input DStream.

Closing over java.util.concurrent.ConcurrentHashMap inside a Future of Actor's receive method?

I've an actor where I want to store my mutable state inside a map.
Clients can send Get(key:String) and Put(key:String,value:String) messages to this actor.
I'm considering the following options.
Don't use futures inside the Actor's receive method. In this may have a negative impact on both latency as well as throughput in case I've a large number of gets/puts because all operations will be performed in order.
Use java.util.concurrent.ConcurrentHashMap and then invoke the gets and puts inside a Future.
Given that java.util.concurrent.ConcurrentHashMap is thread-safe and providers finer level of granularity, I was wondering if it is still a problem to close over the concurrentHashMap inside a Future created for each put and get.
I'm aware of the fact that it's a really bad idea to close over mutable state inside a Future inside an Actor but I'm still interested to know if in this particular case it is correct or not?

In general, java.util.concurrent.ConcurrentHashMap is made for concurrent use. As long as you don't try to transport the closure to another machine, and you think through the implications of it being used concurrently (e.g. if you read a value, use a function to modify it, and then put it back, do you want to use the replace(key, oldValue, newValue) method to make sure it hasn't changed while you were doing the processing?), it should be fine in Futures.

May be a little late, but still, in the book Reactive Web Applications, the author has indicated an indirection to this specific problem, using pipeTo as below.
def receive = {
case ComputeReach(tweetId) =>
fetchRetweets(tweetId, sender()) pipeTo self
case fetchedRetweets: FetchedRetweets =>
followerCountsByRetweet += fetchedRetweets -> List.empty
fetchedRetweets.retweets.foreach { rt =>
userFollowersCounter ! FetchFollowerCount(
fetchedRetweets.tweetId, rt.user
)
}
...
}
where followerCountsByRetweet is a mutable state of the actor. The result of fetchRetweets() which is a Future is piped to the same actor as a FetchedRetweets message, which then acts on the message on to modify the state of the acto., this will mitigate any concurrent operation on the state

Relation between command handlers, aggregates, the repository and the event store in CQRS

I'd like to understand some details of the relations between command handlers, aggregates, the repository and the event store in CQRS-based systems.
What I've understood so far:
Command handlers receive commands from the bus. They are responsible for loading the appropriate aggregate from the repository and call the domain logic on the aggregate. Once finished, they remove the command from the bus.
An aggregate provides behavior and an internal state. State is never public. The only way to change state is by using the behavior. The methods that model this behavior create events from the command's properties, and apply these events to the aggregate, which in turn call an event handlers that sets the internal state accordingly.
The repository simply allows loading aggregates on a given ID, and adding new aggregates. Basically, the repository connects the domain to the event store.
The event store, last but not least, is responsible for storing events to a database (or whatever storage is used), and reloading these events as a so-called event stream.
So far, so good.
Now there are some issues that I did not yet get:
If a command handler is to call behavior on a yet existing aggregate, everything is quite easy. The command handler gets a reference to the repository, calls its loadById method and the aggregate is returned. But what does the command handler do when there is no aggregate yet, but one should be created? From my understanding the aggregate should later-on be rebuilt using the events. This means that creation of the aggregate is done in reply to a fooCreated event. But to be able to store any event (including the fooCreated one), I need an aggregate. So this looks to me like a chicken-and-egg problem: I can not create the aggregate without the event, but the only component that should create events is the aggregate. So basically it comes down to: How do I create new aggregates, who does what?
When an aggregate triggers an event, an internal event handler responses to it (typically by being called via an apply method) and changes the aggregate's state. How is this event handed over to the repository? Who originates the "please send the new events to the repository / event store" action? The aggregate itself? The repository by watching the aggregate? Someone else who is subscribed to the internal events? ...?
Last but not least I have a problem understanding the concept of an event stream correctly: In my imagination, it's simply something like an ordered list of events. What's of importance is that it's "ordered". Is this right?

The following is based on my own experience and my experiments with various frameworks like Lokad.CQRS, NCQRS, etc. I'm sure there are multiple ways to handle this. I'll post what makes most sense to me.
1. Aggregate Creation:
Every time a command handler needs an aggregate, it uses a repository. The repository retrieves the respective list of events from the event store and calls an overloaded constructor, injecting the events
var stream = eventStore.LoadStream(id)
var User = new User(stream)
If the aggregate didn't exist before, the stream will be empty and the newly created object will be in it's original state. You might want to make sure that in this state only a few commands are allowed to bring the aggregate to life, e.g. User.Create().
2. Storage of new Events
Command handling happens inside a Unit of Work. During command execution every resulting event will be added to a list inside the aggregate (User.Changes). Once execution is finished, the changes will be appended to the event store. In the example below this happens in the following line:
store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
3. Order of Events
Just imagine what would happen, if two subsequent CustomerMoved events are replayed in the wrong order.
An Example
I'll try to illustrate the with a piece of pseudo-code (I deliberately left repository concerns inside the command handler to show what would happen behind the scenes):
Application Service:
UserCommandHandler
Handle(CreateUser cmd)
stream = store.LoadStream(cmd.UserId)
user = new User(stream.Events)
user.Create(cmd.UserName, ...)
store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
Handle(BlockUser cmd)
stream = store.LoadStream(cmd.UserId)
user = new User(stream.Events)
user.Block(string reason)
store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
Aggregate:
User
created = false
blocked = false
Changes = new List<Event>
ctor(eventStream)
isNewEvent = false
foreach (event in eventStream)
this.Apply(event, isNewEvent)
Create(userName, ...)
if (this.created) throw "User already exists"
isNewEvent = true
this.Apply(new UserCreated(...), isNewEvent)
Block(reason)
if (!this.created) throw "No such user"
if (this.blocked) throw "User is already blocked"
isNewEvent = true
this.Apply(new UserBlocked(...), isNewEvent)
Apply(userCreatedEvent, isNewEvent)
this.created = true
if (isNewEvent) this.Changes.Add(userCreatedEvent)
Apply(userBlockedEvent, isNewEvent)
this.blocked = true
if (isNewEvent) this.Changes.Add(userBlockedEvent)
Update:
As a side note: Yves' answer reminded me of an interesting article by Udi Dahan from a couple of years ago:
Don’t Create Aggregate Roots

A small variation on Dennis excellent answer:
When dealing with "creational" use cases (i.e. that should spin off new aggregates), try to find another aggregate or factory you can move that responsibility to. This does not conflict with having a ctor that takes events to hydrate (or any other mechanism to rehydrate for that matter). Sometimes the factory is just a static method (good for "context"/"intent" capturing), sometimes it's an instance method of another aggregate (good place for "data" inheritance), sometimes it's an explicit factory object (good place for "complex" creation logic).
I like to provide an explicit GetChanges() method on my aggregate that returns the internal list as an array. If my aggregate is to stay in memory beyond one execution, I also add an AcceptChanges() method to indicate the internal list should be cleared (typically called after things were flushed to the event store). You can use either a pull (GetChanges/Changes) or push (think .net event or IObservable) based model here. Much depends on the transactional semantics, tech, needs, etc ...
Your eventstream is a linked list. Each revision (event/changeset) pointing to the previous one (a.k.a. the parent). Your eventstream is a sequence of events/changes that happened to a specific aggregate. The order is only to be guaranteed within the aggregate boundary.

I almost agree with yves-reynhout and dennis-traub but I want to show you how I do this. I want to strip my aggregates of the responsibility to apply the events on themselves or to re-hydrate themselves; otherwise there is a lot of code duplication: every aggregate constructor will look the same:
UserAggregate:
ctor(eventStream)
foreach (event in eventStream)
this.Apply(event)
OrderAggregate:
ctor(eventStream)
foreach (event in eventStream)
this.Apply(event)
ProfileAggregate:
ctor(eventStream)
foreach (event in eventStream)
this.Apply(event)
Those responsibilities could be left to the command dispatcher. The command is handled directly by the aggregate.
Command dispatcher class
dispatchCommand(command) method:
newEvents = ConcurentProofFunctionCaller.executeFunctionUntilSucceeds(tryToDispatchCommand)
EventDispatcher.dispatchEvents(newEvents)
tryToDispatchCommand(command) method:
aggregateClass = CommandSubscriber.getAggregateClassForCommand(command)
aggregate = AggregateRepository.loadAggregate(aggregateClass, command.getAggregateId())
newEvents = CommandApplier.applyCommandOnAggregate(aggregate, command)
AggregateRepository.saveAggregate(command.getAggregateId(), aggregate, newEvents)
ConcurentProofFunctionCaller class
executeFunctionUntilSucceeds(pureFunction) method:
do this n times
try
call result=pureFunction()
return result
catch(ConcurentWriteException)
continue
throw TooManyRetries
AggregateRepository class
loadAggregate(aggregateClass, aggregateId) method:
aggregate = new aggregateClass
priorEvents = EventStore.loadEvents()
this.applyEventsOnAggregate(aggregate, priorEvents)
saveAggregate(aggregateId, aggregate, newEvents)
this.applyEventsOnAggregate(aggregate, newEvents)
EventStore.saveEventsForAggregate(aggregateId, newEvents, priorEvents.version)
SomeAggregate class
handleCommand1(command1) method:
return new SomeEvent or throw someException BUT don't change state!
applySomeEvent(SomeEvent) method:
changeStateSomehow() and not throw any exception and don't return anything!
Keep in mind that this is pseudo code projected from a PHP application; the real code should have things injected and other responsibilities refactored out in other classes. The ideea is to keep aggregates as clean as possible and avoid code duplication.
Some important aspects about aggregates:
command handlers should not change state; they yield events or
throw exceptions
event applies should not throw any exception and should not return anything; they only change internal state
An open-source PHP implementation of this could be found here.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio