With Akka Streams, how do I know when a source has completed? - elasticsearch

I have an Alpakka Elasticsearch Sink that I'm keeping around between requests. When I get a request, I create a Source from an HTTP request and turn that into a Source of Elasticsearch WriteMessages, then run that with mySource.runWith(theElasticseachSink).
How do I get notified when the source has completed? Nothing useful seems to be materialized.
Will completion of the source be passed to the sink, meaning I have to create a new one each time?
If yes to the above, would decoupling them somehow with Flow.fromSourceAndSink help?
My goal is to know when the HTTP download has completed (including the vias it goes through) and to be able to reuse the sink.

you can pass around the single parts of a flow as you wish, you can even pass around the whole executabe graph (those are immutables). The run() call materializes the flow, but does not change your graph or its parts.
1)
Since you want to know when the HttpDownload passed the flow , why not use the full graphs Future[Done] ? Assuming your call to elasticsearch is asynchronous, this should be equal since your sink just fires the call and does not wait.
You could also use Source.queue (https://doc.akka.io/docs/akka/2.5/stream/operators/Source/queue.html) and just add your messages to the queue, which then reuses the defined graph so you can add new messages when proocessing is needed. This one also materializes a SourceQueueWithComplete allowing you to stop the stream.
Apart from this, reuse the sink wherever needed without needing to wait for another stream using it.
2) As described above: no, you do not need to instantiate a sink multiple times.
Best Regards,
Andi

It turns out that Alpakka's Elasticsearch library also supports flow shapes, so I can have my source go via that and run it via any sink that materializes a future. Sink.foreach works fine here for testing purposes, for example, as in https://github.com/danellis/akka-es-test.
Flow fromFunction { product: Product =>
WriteMessage.createUpsertMessage(product.id, product.attributes)
} via ElasticsearchFlow.create[Map[String, String]](index, "_doc")
to define es.flow and then
val graph = response.entity.withSizeLimit(MaxFeedSize).dataBytes
.via(scanner)
.via(CsvToMap.toMap(Utf8))
.map(attrs => Product(attrs("id").decodeString(Utf8), attrs.mapValues(_.decodeString(Utf8))))
.via(es.flow)
val futureDone = graph.runWith(Sink.foreach(println))
futureDone onComplete {
case Success(_) => println("Done")
case Failure(e) => println(e)
}

Related

How does ktor websocket flow api works?

I'm using ktor for server side development with websockets.
Documentations shows us this example of using incoming channel:
for (frame in incoming.mapNotNull { it as? Frame.Text }) {
// some
}
But mapNotNull is marked as deprecated in favor of Flow. How should I use this API and what problems could be there? For example, the Flow is a cold stream. It means that the producer function will be called on each collect. How does it work in context of websocket. Will it be reopened on second collect call, or maybe old messages will be delivered once after the next collect? How can I collect N messages, then stop collecting, then collect again?
Thanks in advance :)
How should I use this API and what problems could be there?
What I am using and what I have seen in one of the examples somewhere in the docs is the consumeAsFlow() method called on ReceiveChannel. Here is the entire snippet:
webSocket("/websocket") { //this: DefaultWebSocketServerSession
incoming
.consumeAsFlow()
.map { receive(it) }
.collect()
}
Haven't seen major issues with this approach. One thing you should be aware of (but that goes for the non-flow approach as well) is that if you throw inside your flow, then it will break the WebSocket connection, which is usually not something you'd like to do. It might be worth considering wrapping the entire thing in a try-catch.
Will it be reopened on second collect call, or maybe old messages will be delivered once after the next collect?
You open the websocket before you even start consuming the messages from the flow. You can see that inside webSocket() {} you are in the context of DefaultWebSocketServerSession. This is your connection management. Inside your flow you are simply receiving messages one by one as they arrive (after the connection has been established). If the connection breaks, then you're out of the flow. It needs to be re-established before you can process your messages. This establishing bit is done by the Route.webSocket() method. I do recommend taking a look at its Javadoc.
If you wish to add some clean up after the connection is closed you can add a finally block like so:
webSocket("/chat") {
try {
incoming
.consumeAsFlow()
.map { receive(it, client) }
.collect()
} finally {
// cleanup
}
}
In short: collect is called once per received message. If there is no connection (or it was broken) then collect won't be called.
How can I collect N messages, then stop collecting, then collect again?
What is the use case for this? I don't think you should be doing this with any flow. You can of course take(n) items from a flow, but you won't be able to take any more from it again.

Project reactor processors v3.X

We are trying to migrate from 2.X to 3.X.
https://github.com/reactor/reactor-core/issues/375
We have used the EventBus as event manager in our application(Low latency FX system) and it works very well for us.
After the change we decided to take every module and create his own processor to handle event.
1. Does this use seems to be correct from your point of view? Because lack of document at the current stage and after reviewing everything we could we don't really know what to do here
2. We have tried to use Flux in order to perform action every X interval
For example: Market is arriving 1000 for 1 second but we want to process an update only 4 time in a second. After upgrading we are using:
Processor with buffer and sending to another method.
In this method we have Flux that get list and try to work in parallel in order to complete his task.
We had 2 major problems:
1. Sometimes we received Null event which we cannot find that our system is sending to i suppose maybe we are miss using the processor
//Definition of processor
ReplayProcessor<Event> classAEventProcessor = ReplayProcessor.create();
//Event handler subscribing
public void onMyEventX(Consumer<Event> consumer) {
Flux<Event> handler = classAEventProcessor .filter(event -> event.getType().equals(EVENT_X));
handler.subscribe(consumer);
}
in the example above the event in the handler sometimes get null.. Once he does the stream stop working until we are restating server(Because only on restart we are doing creating processor)
2.We have tried to us parallel but sometimes some of the message were disappeared so maybe we are misusing the framework
//On constructor
tickProcessor.buffer(1024, Duration.of(250, ChronoUnit.MILLIS)).subscribe(markets ->
handleMarkets(markets));
//Handler
Flux.fromIterable(getListToProcess())
.parallel()
.runOn(Schedulers.parallel())
.doOnNext(entryMap -> {
DoBlockingWork(entryMap);
})
.sequential()
.subscribe();
The intention of this is that the processor will wakeup every 250ms and invoke the handler. The handler will work work with Flux parallel in order to make better and faster processing.
*In case that DoBlockingWork takes more than 250ms i couldn't understand what will be the behavior
UPDATE:
The EventBus was wrapped by us and every event subscribed throw the wrapped event manager.
Now we have tried to create event processor for every module but it works very slow. We have used TopicProcessor with ThreadExecutor and still very slow.. EventBus did the same work in high speed
Anyone has any idea? BTW when i tried to use DirectProcessor it seems to work much better that the TopicProcessor
Reactor 3 is built around the concept that you should avoid blocking as much as you can, so in your second snippet DoBlockingWork doesn't look good.
How are the events generated? Do you maybe have an listener-based asynchronous API to get them? If so, you could try using Flux.create.
For your use case of "we have 1000 events in 1 second, but only want to process 4", I'd chain a sample operator. For instance, sample(Duration.ofMillis(250)) will divide each second into 4 windows, from which it will only emit the last element.
The reference guide is being written, as well as a page where you can find links to external articles and learning material.There's a preview of the WIP reference guide here and the learning resources page here.

IBM Integration Bus, best practices for calling multiple services

So I have this requirement, that takes in one document, and from that needs to create one or more documents in the output.
During the cause of this, it needs to determine if the document is already there, because there are different operations to apply for create and update scenarios.
In straight code, this would be simple (conceptually)
InputData in = <something>
if (getItemFromExternalSystem(in.key1) == null) {
createItemSpecificToKey1InExternalSystem(in.key1);
}
if (getItemFromExternalSystem(in.key2) == null) {
createItemSpecificToKey2InExternalSystem(in.key1, in.key2);
}
createItemFromInput(in.key1,in.key2, in.moreData);
In effect a kind of "ensure this data is present".
However, in IIB How would i go about achieving this? If i used a subflow for the Get/create cycle, the output of the subflow would be whatever the result of the last operation is, is returned from the subflow as the new "message" of the flow, but really, I don't care about the value from the "ensure data present" subflow. I need instead to keep working on my original message, but still wait for the different subflows to finish before i can run my final "createItem"
You can use Aggregation Nodes: for example, use 3 flows:
first would be propagate your original message to third
second would be invoke operations createItemSpecificToKey1InExternalSystem and createItemSpecificToKey2InExternalSystem
third would be aggregate results of first and second and invoke createItemFromInput.
Have you considered using the Collector node? It will collect your records into N 'collections', and then you can iterate over the collections and output one document per collection.

Closing over java.util.concurrent.ConcurrentHashMap inside a Future of Actor's receive method?

I've an actor where I want to store my mutable state inside a map.
Clients can send Get(key:String) and Put(key:String,value:String) messages to this actor.
I'm considering the following options.
Don't use futures inside the Actor's receive method. In this may have a negative impact on both latency as well as throughput in case I've a large number of gets/puts because all operations will be performed in order.
Use java.util.concurrent.ConcurrentHashMap and then invoke the gets and puts inside a Future.
Given that java.util.concurrent.ConcurrentHashMap is thread-safe and providers finer level of granularity, I was wondering if it is still a problem to close over the concurrentHashMap inside a Future created for each put and get.
I'm aware of the fact that it's a really bad idea to close over mutable state inside a Future inside an Actor but I'm still interested to know if in this particular case it is correct or not?
In general, java.util.concurrent.ConcurrentHashMap is made for concurrent use. As long as you don't try to transport the closure to another machine, and you think through the implications of it being used concurrently (e.g. if you read a value, use a function to modify it, and then put it back, do you want to use the replace(key, oldValue, newValue) method to make sure it hasn't changed while you were doing the processing?), it should be fine in Futures.
May be a little late, but still, in the book Reactive Web Applications, the author has indicated an indirection to this specific problem, using pipeTo as below.
def receive = {
case ComputeReach(tweetId) =>
fetchRetweets(tweetId, sender()) pipeTo self
case fetchedRetweets: FetchedRetweets =>
followerCountsByRetweet += fetchedRetweets -> List.empty
fetchedRetweets.retweets.foreach { rt =>
userFollowersCounter ! FetchFollowerCount(
fetchedRetweets.tweetId, rt.user
)
}
...
}
where followerCountsByRetweet is a mutable state of the actor. The result of fetchRetweets() which is a Future is piped to the same actor as a FetchedRetweets message, which then acts on the message on to modify the state of the acto., this will mitigate any concurrent operation on the state

Relation between command handlers, aggregates, the repository and the event store in CQRS

I'd like to understand some details of the relations between command handlers, aggregates, the repository and the event store in CQRS-based systems.
What I've understood so far:
Command handlers receive commands from the bus. They are responsible for loading the appropriate aggregate from the repository and call the domain logic on the aggregate. Once finished, they remove the command from the bus.
An aggregate provides behavior and an internal state. State is never public. The only way to change state is by using the behavior. The methods that model this behavior create events from the command's properties, and apply these events to the aggregate, which in turn call an event handlers that sets the internal state accordingly.
The repository simply allows loading aggregates on a given ID, and adding new aggregates. Basically, the repository connects the domain to the event store.
The event store, last but not least, is responsible for storing events to a database (or whatever storage is used), and reloading these events as a so-called event stream.
So far, so good.
Now there are some issues that I did not yet get:
If a command handler is to call behavior on a yet existing aggregate, everything is quite easy. The command handler gets a reference to the repository, calls its loadById method and the aggregate is returned. But what does the command handler do when there is no aggregate yet, but one should be created? From my understanding the aggregate should later-on be rebuilt using the events. This means that creation of the aggregate is done in reply to a fooCreated event. But to be able to store any event (including the fooCreated one), I need an aggregate. So this looks to me like a chicken-and-egg problem: I can not create the aggregate without the event, but the only component that should create events is the aggregate. So basically it comes down to: How do I create new aggregates, who does what?
When an aggregate triggers an event, an internal event handler responses to it (typically by being called via an apply method) and changes the aggregate's state. How is this event handed over to the repository? Who originates the "please send the new events to the repository / event store" action? The aggregate itself? The repository by watching the aggregate? Someone else who is subscribed to the internal events? ...?
Last but not least I have a problem understanding the concept of an event stream correctly: In my imagination, it's simply something like an ordered list of events. What's of importance is that it's "ordered". Is this right?
The following is based on my own experience and my experiments with various frameworks like Lokad.CQRS, NCQRS, etc. I'm sure there are multiple ways to handle this. I'll post what makes most sense to me.
1. Aggregate Creation:
Every time a command handler needs an aggregate, it uses a repository. The repository retrieves the respective list of events from the event store and calls an overloaded constructor, injecting the events
var stream = eventStore.LoadStream(id)
var User = new User(stream)
If the aggregate didn't exist before, the stream will be empty and the newly created object will be in it's original state. You might want to make sure that in this state only a few commands are allowed to bring the aggregate to life, e.g. User.Create().
2. Storage of new Events
Command handling happens inside a Unit of Work. During command execution every resulting event will be added to a list inside the aggregate (User.Changes). Once execution is finished, the changes will be appended to the event store. In the example below this happens in the following line:
store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
3. Order of Events
Just imagine what would happen, if two subsequent CustomerMoved events are replayed in the wrong order.
An Example
I'll try to illustrate the with a piece of pseudo-code (I deliberately left repository concerns inside the command handler to show what would happen behind the scenes):
Application Service:
UserCommandHandler
Handle(CreateUser cmd)
stream = store.LoadStream(cmd.UserId)
user = new User(stream.Events)
user.Create(cmd.UserName, ...)
store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
Handle(BlockUser cmd)
stream = store.LoadStream(cmd.UserId)
user = new User(stream.Events)
user.Block(string reason)
store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
Aggregate:
User
created = false
blocked = false
Changes = new List<Event>
ctor(eventStream)
isNewEvent = false
foreach (event in eventStream)
this.Apply(event, isNewEvent)
Create(userName, ...)
if (this.created) throw "User already exists"
isNewEvent = true
this.Apply(new UserCreated(...), isNewEvent)
Block(reason)
if (!this.created) throw "No such user"
if (this.blocked) throw "User is already blocked"
isNewEvent = true
this.Apply(new UserBlocked(...), isNewEvent)
Apply(userCreatedEvent, isNewEvent)
this.created = true
if (isNewEvent) this.Changes.Add(userCreatedEvent)
Apply(userBlockedEvent, isNewEvent)
this.blocked = true
if (isNewEvent) this.Changes.Add(userBlockedEvent)
Update:
As a side note: Yves' answer reminded me of an interesting article by Udi Dahan from a couple of years ago:
Don’t Create Aggregate Roots
A small variation on Dennis excellent answer:
When dealing with "creational" use cases (i.e. that should spin off new aggregates), try to find another aggregate or factory you can move that responsibility to. This does not conflict with having a ctor that takes events to hydrate (or any other mechanism to rehydrate for that matter). Sometimes the factory is just a static method (good for "context"/"intent" capturing), sometimes it's an instance method of another aggregate (good place for "data" inheritance), sometimes it's an explicit factory object (good place for "complex" creation logic).
I like to provide an explicit GetChanges() method on my aggregate that returns the internal list as an array. If my aggregate is to stay in memory beyond one execution, I also add an AcceptChanges() method to indicate the internal list should be cleared (typically called after things were flushed to the event store). You can use either a pull (GetChanges/Changes) or push (think .net event or IObservable) based model here. Much depends on the transactional semantics, tech, needs, etc ...
Your eventstream is a linked list. Each revision (event/changeset) pointing to the previous one (a.k.a. the parent). Your eventstream is a sequence of events/changes that happened to a specific aggregate. The order is only to be guaranteed within the aggregate boundary.
I almost agree with yves-reynhout and dennis-traub but I want to show you how I do this. I want to strip my aggregates of the responsibility to apply the events on themselves or to re-hydrate themselves; otherwise there is a lot of code duplication: every aggregate constructor will look the same:
UserAggregate:
ctor(eventStream)
foreach (event in eventStream)
this.Apply(event)
OrderAggregate:
ctor(eventStream)
foreach (event in eventStream)
this.Apply(event)
ProfileAggregate:
ctor(eventStream)
foreach (event in eventStream)
this.Apply(event)
Those responsibilities could be left to the command dispatcher. The command is handled directly by the aggregate.
Command dispatcher class
dispatchCommand(command) method:
newEvents = ConcurentProofFunctionCaller.executeFunctionUntilSucceeds(tryToDispatchCommand)
EventDispatcher.dispatchEvents(newEvents)
tryToDispatchCommand(command) method:
aggregateClass = CommandSubscriber.getAggregateClassForCommand(command)
aggregate = AggregateRepository.loadAggregate(aggregateClass, command.getAggregateId())
newEvents = CommandApplier.applyCommandOnAggregate(aggregate, command)
AggregateRepository.saveAggregate(command.getAggregateId(), aggregate, newEvents)
ConcurentProofFunctionCaller class
executeFunctionUntilSucceeds(pureFunction) method:
do this n times
try
call result=pureFunction()
return result
catch(ConcurentWriteException)
continue
throw TooManyRetries
AggregateRepository class
loadAggregate(aggregateClass, aggregateId) method:
aggregate = new aggregateClass
priorEvents = EventStore.loadEvents()
this.applyEventsOnAggregate(aggregate, priorEvents)
saveAggregate(aggregateId, aggregate, newEvents)
this.applyEventsOnAggregate(aggregate, newEvents)
EventStore.saveEventsForAggregate(aggregateId, newEvents, priorEvents.version)
SomeAggregate class
handleCommand1(command1) method:
return new SomeEvent or throw someException BUT don't change state!
applySomeEvent(SomeEvent) method:
changeStateSomehow() and not throw any exception and don't return anything!
Keep in mind that this is pseudo code projected from a PHP application; the real code should have things injected and other responsibilities refactored out in other classes. The ideea is to keep aggregates as clean as possible and avoid code duplication.
Some important aspects about aggregates:
command handlers should not change state; they yield events or
throw exceptions
event applies should not throw any exception and should not return anything; they only change internal state
An open-source PHP implementation of this could be found here.

Resources