So I have this requirement, that takes in one document, and from that needs to create one or more documents in the output.
During the course of this, it needs to determine whether the document is already there, because different operations apply to the create and update scenarios.
In straight code, this would be simple (conceptually)
InputData in = <something>
if (getItemFromExternalSystem(in.key1) == null) {
createItemSpecificToKey1InExternalSystem(in.key1);
}
if (getItemFromExternalSystem(in.key2) == null) {
createItemSpecificToKey2InExternalSystem(in.key1, in.key2);
}
createItemFromInput(in.key1,in.key2, in.moreData);
In effect a kind of "ensure this data is present".
However, how would I go about achieving this in IIB? If I used a subflow for the get/create cycle, the subflow's output (the result of its last operation) would be returned as the new "message" of the flow. I don't care about the value returned by the "ensure data present" subflow, though. Instead, I need to keep working on my original message, while still waiting for the different subflows to finish before I run my final "createItem".
You can use Aggregation nodes. For example, use three flows:
The first propagates your original message to the third.
The second invokes the operations createItemSpecificToKey1InExternalSystem and createItemSpecificToKey2InExternalSystem.
The third aggregates the results of the first and second and invokes createItemFromInput.
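The three-flow idea can be sketched outside IIB as a plain fan-out/fan-in. This is only an illustration in Python, not IIB code: `external_system`, `ensure_present`, and the branch functions are hypothetical stand-ins, and the message is assumed to be a dict with `key1`, `key2`, and `moreData`.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for the external system.
external_system = {}


def ensure_present(key, factory):
    """Get-or-create: build the item only if the lookup finds nothing."""
    if external_system.get(key) is None:
        external_system[key] = factory()
    return external_system[key]


def passthrough_branch(msg):
    # First flow: hand the original message to the aggregation step untouched.
    return msg


def ensure_branch(msg):
    # Second flow: run both "ensure" steps; their results are not needed later.
    ensure_present(msg["key1"], lambda: {"kind": "key1-item"})
    ensure_present(msg["key2"], lambda: {"kind": "key2-item", "parent": msg["key1"]})
    return "ensure-done"


def process(msg):
    with ThreadPoolExecutor(max_workers=2) as pool:
        original = pool.submit(passthrough_branch, msg)
        ensured = pool.submit(ensure_branch, msg)
        ensured.result()             # wait here; re-raises any branch failure
        message = original.result()
    # Third flow: both branches are done, so continue with the original message.
    external_system[(message["key1"], message["key2"])] = message["moreData"]
    return message
```

The point of the sketch is that the value returned by the "ensure" branch is simply discarded, while the join still guarantees that it has finished before the final create runs.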
Have you considered using the Collector node? It will collect your records into N 'collections', and then you can iterate over the collections and output one document per collection.
I'm looking at writing one of our ETL (or ETL like) processes in kiba and I wonder how to structure it. The main question I have is the overall architecture. The process works roughly like this:
1. Fetch data from an HTTP endpoint.
2. For each item returned from that API, make one more HTTP call.
3. Do some transformations for each of the items returned from step 2.
4. Send each item somewhere else.
Now my question is: Is it OK if only step one is a source and anything until the end is a transform? Or would it be better to somehow have each HTTP call be a source and then combine these somehow, maybe using multiple jobs?
It is indeed best to use a single source, that you will use to fetch the main stream of the data.
General advice: try to work in batches as much as you can (e.g. pagination in the source, but also bulk HTTP lookup if the API supports it in step 2).
Source section
The source in your case could be a paginating HTTP resource, for instance.
A first option to implement it would be to write a dedicated class, as explained in the documentation.
A second option is to use Kiba::Common::Sources::Enumerable (https://github.com/thbar/kiba-common#kibacommonsourcesenumerable) like this:
source Kiba::Common::Sources::Enumerable, -> {
Enumerator.new do |y|
# do your pagination & splitting here
y << your_item
end
}
# then
transform Kiba::Common::Transforms::EnumerableExploder
Join with secondary HTTP source
It can be done this way:
transform do |r|
# here make secondary HTTP query
result = my_query(...)
# then merge the result
r.merge(secondary_data: ...)
end
There is support for parallelisation of the queries in that step via Kiba Pro's ParallelTransform (https://github.com/thbar/kiba/wiki/Parallel-Transform):
parallel_transform(max_threads: 10) do |r|
# this code will run in its own thread
extra_data = get_extra_json_hash_from_http!(r.fetch(:extra_data_url))
r.merge(extra_data: extra_data)
end
It must be noted too that if you can structure your HTTP calls to handle N rows at one time (if the HTTP backend is flexible enough) things will be even faster.
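To make the batching advice concrete, here is a minimal sketch in Python (it is not Kiba code). It assumes rows are dicts with an `id` key and that `bulk_lookup` is a hypothetical function wrapping an HTTP endpoint that accepts several ids at once and returns a dict keyed by id.

```python
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def enrich_in_batches(rows, bulk_lookup, size=100):
    """Make one lookup call per batch instead of one call per row."""
    for chunk in batches(rows, size):
        extra = bulk_lookup([row["id"] for row in chunk])
        for row in chunk:
            # Merge the secondary data into a copy of the row.
            yield {**row, "extra_data": extra.get(row["id"])}
```

With five rows and a batch size of two, this makes three lookup calls rather than five.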
Step 3 does not need specific advice.
Send each item somewhere else
I would most likely implement a destination for that (but it could also be implemented as a transform actually, and parallelized still with parallel_transform if needed).
I have written a REST API (javax.ws.rs) that uses the high-level Elasticsearch API. This is with ES 7.2.
The client needs to index a record and then execute a search that includes that record. However, there is some delay after the index operation before the document actually appears in searches.
Is there any way to block the index operation until the newly indexed record appears in search results?
Failing that, is there any way to get an asynchronous notification that the document is now searchable?
To give an idea of my use case, here is the code from the client side:
const cr = await this.client.dNodeCreate(fixedNode).toPromise();
const fr = await this.client.dNodeGetById(cr._id).toPromise();
await this.client.dNodeCreate(replyRoot).toPromise();
The first line causes an Index request to ES and returns the status object. That object includes the ID of the new document.
The second line fetches the record by ID. This always works.
The third line fails. The document it attempts to index is dependent on the first document, which the REST middleware attempts to look up by a search (not by the ID). This is the equivalent of an SQL relation enforced by the REST layer.
I can always make the code work by introducing a delay (say 1500ms) before the third call but this is really a non-robust solution. It might always work in development mode (all the servers are on my laptop and no other users) but there is no way to predict how long the delay needs to be in actual production.
UPDATE: Solved.
The marked answer below seems to do the trick. For reference, the needed call in the Java API looks like this:
IndexRequest req = new IndexRequest(DNode.INDEX);
req.setRefreshPolicy(WriteRequest.RefreshPolicy.WAIT_UNTIL);
The 'refresh' parameter is what you are looking for. From the Elasticsearch documentation:
Refresh (Index API): If true, Elasticsearch refreshes the affected shards to make this operation visible to search, if wait_for then wait for a refresh to make this operation visible to search, if false do nothing with refreshes. Valid values: true, false, wait_for. Default: false
So your index request should look something like this:
PUT /<index>/_doc/<_id>?refresh=wait_for
I do not believe there is a built-in way to get an asynchronous notification that the document is now searchable. That being said, if you already have access to the document ID it might make more sense to use that in the code instead of a search.
I have an Alpakka Elasticsearch Sink that I'm keeping around between requests. When I get a request, I create a Source from an HTTP request and turn that into a Source of Elasticsearch WriteMessages, then run that with mySource.runWith(theElasticseachSink).
How do I get notified when the source has completed? Nothing useful seems to be materialized.
Will completion of the source be passed to the sink, meaning I have to create a new one each time?
If yes to the above, would decoupling them somehow with Flow.fromSourceAndSink help?
My goal is to know when the HTTP download has completed (including the vias it goes through) and to be able to reuse the sink.
You can pass around the individual parts of a flow as you wish; you can even pass around the whole executable graph (they are immutable). The run() call materializes the flow, but does not change your graph or its parts.
1) Since you want to know when the HTTP download has passed through the flow, why not use the full graph's Future[Done]? Assuming your call to Elasticsearch is asynchronous, this should be equivalent, since your sink just fires the call and does not wait.
You could also use Source.queue (https://doc.akka.io/docs/akka/2.5/stream/operators/Source/queue.html) and just add your messages to the queue, which then reuses the defined graph, so you can add new messages whenever processing is needed. This also materializes a SourceQueueWithComplete, allowing you to stop the stream.
Apart from this, reuse the sink wherever needed without needing to wait for another stream using it.
2) As described above: no, you do not need to instantiate a sink multiple times.
Best Regards,
Andi
It turns out that Alpakka's Elasticsearch library also supports flow shapes, so I can have my source go via that and run it via any sink that materializes a future. Sink.foreach works fine here for testing purposes, for example, as in https://github.com/danellis/akka-es-test.
Flow fromFunction { product: Product =>
WriteMessage.createUpsertMessage(product.id, product.attributes)
} via ElasticsearchFlow.create[Map[String, String]](index, "_doc")
to define es.flow and then
val graph = response.entity.withSizeLimit(MaxFeedSize).dataBytes
.via(scanner)
.via(CsvToMap.toMap(Utf8))
.map(attrs => Product(attrs("id").decodeString(Utf8), attrs.mapValues(_.decodeString(Utf8))))
.via(es.flow)
val futureDone = graph.runWith(Sink.foreach(println))
futureDone onComplete {
case Success(_) => println("Done")
case Failure(e) => println(e)
}
In a Twitter-like application, one of the things they do when someone posts a tweet is iterate over all followers and create a copy of the tweet in each follower's timeline. I need something similar. What is the best way to insert a tweet ID into, say, 10/100/1000 followers' lists, assuming I have a list of follower IDs?
I am doing it within Azure WebJobs using Azure Redis. A webjob is automatically created for every tweet received in the queue, so I may have around 16 jobs running simultaneously, each going through followers and inserting tweets. I'm thinking that if 99% of the inserts succeed, the job should not stop because one or a few have failed. I need to continue but log the failure.
Question: Should I use CreateBatch as below? If I need to retrieve the latest tweets first, in reverse chronological order, is the code below fine and performant?
var tasks = new List<Task>();
var batch = _cache.CreateBatch();
//loop start
tasks.Add(batch.ListRightPushAsync("follower_id", "tweet_id"));
//loop end
batch.Execute();
await Task.WhenAll(tasks.ToArray());
a) But how do I catch if something fails? try catch?
b) how do I check in a batch for a total # in each list and pop one out if it reaches a certain #? I want to do a LeftPop if the list is > 800. Not sure how to do it all inside the batch.
Please point me to a sample or let me have a snippet here. Struggling to find a good way. Thank you so much.
UPDATE
Does this look right based on @marc's comments?
var tasks = new List<Task>();
followers.ForEach(f =>
{
var key = f.FollowerId;
var task = _cache.ListRightPushAsync(key, value);
task.ContinueWith(t =>
{
if (t.Result > 800) _cache.ListLeftPopAsync(key).Wait();
});
tasks.Add(task);
});
Task.WaitAll(tasks.ToArray());
CreateBatch probably doesn't do what you think it does. What it does is defer a set of operations and ensure they get sent contiguously relative to a single connection - there are some occasions this is useful, but not all that common - I'd probably just send them individually if it was me. There is also CreateTransaction (MULTI/EXEC), but I don't think that would be a good choice here.
That depends on whether you care about the data you're popping. If not: I'd send a LTRIM, [L|R]PUSH pair - to trim the list to (max-1) before adding. Another option would be Lua, but it seems overkill. If you care about the old data, you'll need to do a range query too.
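As a rough illustration of the trim-then-push idea, here is a sketch in Python. `FakeRedis` is a hypothetical in-memory stand-in that implements only the LTRIM and RPUSH semantics needed here (inclusive indices, negative indices counting from the tail), and `push_capped` is an illustrative helper name; against a real server the same two commands would be sent through your client library.

```python
class FakeRedis:
    """Tiny in-memory stand-in covering only LTRIM and RPUSH."""

    def __init__(self):
        self.lists = {}

    def ltrim(self, key, start, end):
        items = self.lists.get(key, [])
        n = len(items)
        # Redis ranges are inclusive; negative indices count from the tail.
        if start < 0:
            start = max(n + start, 0)
        if end < 0:
            end = n + end
        self.lists[key] = items[start:end + 1]
        return True

    def rpush(self, key, *values):
        self.lists.setdefault(key, []).extend(values)
        return len(self.lists[key])


def push_capped(client, follower_id, tweet_id, max_len=800):
    """Trim to the newest (max_len - 1) entries, then append the new one."""
    if max_len < 2:
        raise ValueError("max_len must be at least 2")
    client.ltrim(follower_id, -(max_len - 1), -1)
    return client.rpush(follower_id, tweet_id)
```

Because the trim runs before the push, the list never exceeds `max_len` entries and the old entries are simply discarded, which matches the "don't care about the popped data" case.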
I'd like to understand some details of the relations between command handlers, aggregates, the repository and the event store in CQRS-based systems.
What I've understood so far:
Command handlers receive commands from the bus. They are responsible for loading the appropriate aggregate from the repository and calling the domain logic on the aggregate. Once finished, they remove the command from the bus.
An aggregate provides behavior and an internal state. State is never public. The only way to change state is by using the behavior. The methods that model this behavior create events from the command's properties and apply these events to the aggregate, which in turn calls an event handler that sets the internal state accordingly.
The repository simply allows loading aggregates by a given ID, and adding new aggregates. Basically, the repository connects the domain to the event store.
The event store, last but not least, is responsible for storing events to a database (or whatever storage is used), and reloading these events as a so-called event stream.
So far, so good.
Now there are some issues that I did not yet get:
If a command handler is to call behavior on an already existing aggregate, everything is quite easy. The command handler gets a reference to the repository, calls its loadById method and the aggregate is returned. But what does the command handler do when there is no aggregate yet, but one should be created? From my understanding the aggregate should later on be rebuilt using the events. This means that creation of the aggregate is done in reply to a fooCreated event. But to be able to store any event (including the fooCreated one), I need an aggregate. So this looks to me like a chicken-and-egg problem: I cannot create the aggregate without the event, but the only component that should create events is the aggregate. So basically it comes down to: How do I create new aggregates, who does what?
When an aggregate triggers an event, an internal event handler responds to it (typically by being called via an apply method) and changes the aggregate's state. How is this event handed over to the repository? Who originates the "please send the new events to the repository / event store" action? The aggregate itself? The repository by watching the aggregate? Someone else who is subscribed to the internal events? ...?
Last but not least I have a problem understanding the concept of an event stream correctly: In my imagination, it's simply something like an ordered list of events. What's of importance is that it's "ordered". Is this right?
The following is based on my own experience and my experiments with various frameworks like Lokad.CQRS, NCQRS, etc. I'm sure there are multiple ways to handle this. I'll post what makes most sense to me.
1. Aggregate Creation:
Every time a command handler needs an aggregate, it uses a repository. The repository retrieves the respective list of events from the event store and calls an overloaded constructor, injecting the events
var stream = eventStore.LoadStream(id)
var user = new User(stream)
If the aggregate didn't exist before, the stream will be empty and the newly created object will be in its original state. You might want to make sure that in this state only a few commands are allowed to bring the aggregate to life, e.g. User.Create().
2. Storage of new Events
Command handling happens inside a Unit of Work. During command execution every resulting event will be added to a list inside the aggregate (User.Changes). Once execution is finished, the changes will be appended to the event store. In the example below this happens in the following line:
store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
3. Order of Events
Just imagine what would happen, if two subsequent CustomerMoved events are replayed in the wrong order.
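A tiny sketch in Python of why the order matters. The event shape (a tuple of event name and new city) and the `current_address` helper are assumptions made purely for illustration.

```python
def current_address(events):
    """Replay events in the given order; the last CustomerMoved wins."""
    address = None
    for kind, payload in events:
        if kind == "CustomerMoved":
            address = payload
    return address


history = [("CustomerMoved", "Berlin"), ("CustomerMoved", "Paris")]
in_order = current_address(history)
out_of_order = current_address(reversed(history))
```

Replayed in the recorded order the customer ends up in Paris, while the reversed replay leaves the customer in Berlin, so the rebuilt state would be wrong.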
An Example
I'll try to illustrate this with a piece of pseudo-code (I deliberately left repository concerns inside the command handler to show what would happen behind the scenes):
Application Service:
UserCommandHandler
    Handle(CreateUser cmd)
        stream = store.LoadStream(cmd.UserId)
        user = new User(stream.Events)
        user.Create(cmd.UserName, ...)
        store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
    Handle(BlockUser cmd)
        stream = store.LoadStream(cmd.UserId)
        user = new User(stream.Events)
        user.Block(cmd.Reason)
        store.AppendToStream(cmd.UserId, stream.Version, user.Changes)
Aggregate:
User
    created = false
    blocked = false
    Changes = new List<Event>
    ctor(eventStream)
        isNewEvent = false
        foreach (event in eventStream)
            this.Apply(event, isNewEvent)
    Create(userName, ...)
        if (this.created) throw "User already exists"
        isNewEvent = true
        this.Apply(new UserCreated(...), isNewEvent)
    Block(reason)
        if (!this.created) throw "No such user"
        if (this.blocked) throw "User is already blocked"
        isNewEvent = true
        this.Apply(new UserBlocked(...), isNewEvent)
    Apply(userCreatedEvent, isNewEvent)
        this.created = true
        if (isNewEvent) this.Changes.Add(userCreatedEvent)
    Apply(userBlockedEvent, isNewEvent)
        this.blocked = true
        if (isNewEvent) this.Changes.Add(userBlockedEvent)
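For anyone who wants to run the idea, here is a minimal Python rendering of the pseudo-code. It is a sketch under assumptions: `InMemoryEventStore` and `ConcurrencyError` are hypothetical, events are plain tuples, and the stream version is taken to be the number of stored events.

```python
class ConcurrencyError(Exception):
    pass


class InMemoryEventStore:
    """Hypothetical store: one ordered list of events per aggregate id."""

    def __init__(self):
        self._streams = {}

    def load_stream(self, stream_id):
        events = list(self._streams.get(stream_id, []))
        return events, len(events)            # (events, version)

    def append_to_stream(self, stream_id, expected_version, changes):
        stream = self._streams.setdefault(stream_id, [])
        if len(stream) != expected_version:    # someone else wrote in between
            raise ConcurrencyError(stream_id)
        stream.extend(changes)


class User:
    def __init__(self, events=()):
        self.created = False
        self.blocked = False
        self.changes = []
        for event in events:                   # replay history, record nothing
            self._apply(event, is_new=False)

    def create(self, user_name):
        if self.created:
            raise ValueError("User already exists")
        self._apply(("UserCreated", user_name), is_new=True)

    def block(self, reason):
        if not self.created:
            raise ValueError("No such user")
        if self.blocked:
            raise ValueError("User is already blocked")
        self._apply(("UserBlocked", reason), is_new=True)

    def _apply(self, event, is_new):
        if event[0] == "UserCreated":
            self.created = True
        elif event[0] == "UserBlocked":
            self.blocked = True
        if is_new:                             # only new events become changes
            self.changes.append(event)


def handle_create_user(store, user_id, user_name):
    events, version = store.load_stream(user_id)
    user = User(events)
    user.create(user_name)
    store.append_to_stream(user_id, version, user.changes)


def handle_block_user(store, user_id, reason):
    events, version = store.load_stream(user_id)
    user = User(events)
    user.block(reason)
    store.append_to_stream(user_id, version, user.changes)
```

Creating a user that does not exist yet works because an empty stream yields a `User` in its initial state, and `create` is the only command accepted in that state.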
Update:
As a side note: Yves' answer reminded me of an interesting article by Udi Dahan from a couple of years ago:
Don’t Create Aggregate Roots
A small variation on Dennis' excellent answer:
When dealing with "creational" use cases (i.e. that should spin off new aggregates), try to find another aggregate or factory you can move that responsibility to. This does not conflict with having a ctor that takes events to hydrate (or any other mechanism to rehydrate for that matter). Sometimes the factory is just a static method (good for "context"/"intent" capturing), sometimes it's an instance method of another aggregate (good place for "data" inheritance), sometimes it's an explicit factory object (good place for "complex" creation logic).
I like to provide an explicit GetChanges() method on my aggregate that returns the internal list as an array. If my aggregate is to stay in memory beyond one execution, I also add an AcceptChanges() method to indicate the internal list should be cleared (typically called after things were flushed to the event store). You can use either a pull (GetChanges/Changes) or push (think .net event or IObservable) based model here. Much depends on the transactional semantics, tech, needs, etc ...
Your eventstream is a linked list. Each revision (event/changeset) pointing to the previous one (a.k.a. the parent). Your eventstream is a sequence of events/changes that happened to a specific aggregate. The order is only to be guaranteed within the aggregate boundary.
I almost agree with yves-reynhout and dennis-traub but I want to show you how I do this. I want to strip my aggregates of the responsibility to apply the events on themselves or to re-hydrate themselves; otherwise there is a lot of code duplication: every aggregate constructor will look the same:
UserAggregate:
    ctor(eventStream)
        foreach (event in eventStream)
            this.Apply(event)
OrderAggregate:
    ctor(eventStream)
        foreach (event in eventStream)
            this.Apply(event)
ProfileAggregate:
    ctor(eventStream)
        foreach (event in eventStream)
            this.Apply(event)
Those responsibilities could be left to the command dispatcher. The command is handled directly by the aggregate.
Command dispatcher class
    dispatchCommand(command) method:
        newEvents = ConcurentProofFunctionCaller.executeFunctionUntilSucceeds(tryToDispatchCommand)
        EventDispatcher.dispatchEvents(newEvents)
    tryToDispatchCommand(command) method:
        aggregateClass = CommandSubscriber.getAggregateClassForCommand(command)
        aggregate = AggregateRepository.loadAggregate(aggregateClass, command.getAggregateId())
        newEvents = CommandApplier.applyCommandOnAggregate(aggregate, command)
        AggregateRepository.saveAggregate(command.getAggregateId(), aggregate, newEvents)
        return newEvents
ConcurentProofFunctionCaller class
    executeFunctionUntilSucceeds(pureFunction) method:
        do this n times
            try
                call result=pureFunction()
                return result
            catch(ConcurentWriteException)
                continue
        throw TooManyRetries
AggregateRepository class
    loadAggregate(aggregateClass, aggregateId) method:
        aggregate = new aggregateClass
        priorEvents = EventStore.loadEvents(aggregateId)
        this.applyEventsOnAggregate(aggregate, priorEvents)
        return aggregate
    saveAggregate(aggregateId, aggregate, newEvents) method:
        this.applyEventsOnAggregate(aggregate, newEvents)
        EventStore.saveEventsForAggregate(aggregateId, newEvents, priorEvents.version)
SomeAggregate class
    handleCommand1(command1) method:
        return new SomeEvent or throw someException BUT don't change state!
    applySomeEvent(SomeEvent) method:
        changeStateSomehow() and not throw any exception and don't return anything!
Keep in mind that this is pseudo-code projected from a PHP application; the real code should have things injected and other responsibilities refactored out into other classes. The idea is to keep aggregates as clean as possible and avoid code duplication.
Some important aspects about aggregates:
command handlers should not change state; they yield events or throw exceptions
event applies should not throw any exception and should not return anything; they only change internal state
An open-source PHP implementation of this could be found here.
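The retry loop and the handle/apply split can be sketched in Python as follows. This is an illustration under assumptions, not the PHP implementation: `Account`, `dispatch`, and the exception names are hypothetical, the store is a plain dict of event lists, and the version check compares list lengths.

```python
class ConcurrentWriteError(Exception):
    pass


class TooManyRetries(Exception):
    pass


def execute_until_succeeds(func, attempts=3):
    """Call func again only when it fails with a concurrent-write error."""
    for _ in range(attempts):
        try:
            return func()
        except ConcurrentWriteError:
            continue
    raise TooManyRetries(f"gave up after {attempts} attempts")


class Account:
    def __init__(self):
        self.balance = 0

    def handle_deposit(self, amount):
        # Command handler: return events, never change state.
        return [("Deposited", amount)]

    def handle_withdraw(self, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        return [("Withdrawn", amount)]

    def apply(self, event):
        # Event applier: change state only, never raise or return anything.
        kind, amount = event
        if kind == "Deposited":
            self.balance += amount
        elif kind == "Withdrawn":
            self.balance -= amount


def dispatch(store, aggregate_id, command, attempts=3):
    def try_once():
        prior = list(store.get(aggregate_id, []))
        account = Account()
        for event in prior:                    # rehydrate outside the aggregate
            account.apply(event)
        new_events = command(account)
        if len(store.get(aggregate_id, [])) != len(prior):
            raise ConcurrentWriteError(aggregate_id)
        store[aggregate_id] = prior + new_events
        return new_events

    return execute_until_succeeds(try_once, attempts)
```

A call such as `dispatch(store, "acc-1", lambda a: a.handle_deposit(50))` returns the new events. Retrying is safe because the command handler has no side effects, so a failed attempt leaves nothing behind.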