How do I combine a Kafka Processor with a Kafka Streams application? - apache-kafka-streams

I am trying to construct a Kafka Streams operator that takes in a Stream<timestamp, data> and outputs another stream where the timestamps are sorted in ascending order; the purpose is to deal with streams that have "out of order" entries due to delays in the supplier.
At first, I thought about doing this with time-windowed aggregation, but then I happened upon a solution using a Kafka Processor. I figured I could then say something like:
class SortProcessor implements Processor<timestamp, data> ...
class SortProcessorSupplier ...supplies a suitably initialized SortProcessor
KStream<timestamp, data> input_stream = ...sourced from "input_topic"
KStream<timestamp, data> output_stream =
    input_stream.process( new SortProcessorSupplier(...parameters...) );
However, this doesn't work because KStream.process returns void.
So, my question is: How do I "wrap" the Processor so that I can use it as follows:
KStream<timestamp, data> input_stream = ...sourced from "input_topic"
KStream<timestamp, data> output_stream =
    new WrappedSortProcessor( input_stream, ...parameters... )

Instead of a Processor you can use a Transformer, which is very similar to a Processor but is better suited to forwarding results on to the stream. You can then invoke it from the stream using the KStream.transform() method instead of process().

Related

RunnableGraph to wait for multiple response from source

I am using Akka in a Play controller and performing ask() to an actor named publish; the publish actor in turn performs ask to multiple actors and passes along the sender reference. The controller needs to wait for the responses from the multiple actors and build a list of the responses.
Please find the code below; as written it waits for only one response and then terminates. Please suggest a fix.
// Performs ask to publish actor
Source<Object, NotUsed> inAsk = Source.fromFuture(
    ask(publishActor, service.getOfferVerifyRequest(request).getPayloadData(), 1000));

final Sink<String, CompletionStage<String>> sink = Sink.head();

final Flow<Object, String, NotUsed> f3 = Flow.of(Object.class).map(elem -> {
    log.info("Data in Graph is " + elem.toString());
    return elem.toString();
});

RunnableGraph<CompletionStage<String>> result = RunnableGraph.fromGraph(
    GraphDSL.create(sink, (builder, out) -> {
        final Outlet<Object> source = builder.add(inAsk).out();
        builder
            .from(source)
            .via(builder.add(f3))
            .to(out); // to() expects a SinkShape
        return ClosedShape.getInstance();
    }));

ActorMaterializer mat = ActorMaterializer.create(aSystem);
CompletionStage<String> fin = result.run(mat);
fin.toCompletableFuture().thenApply(a -> {
    log.info("Data is " + a);
    return true;
});
log.info("COMPLETED CONTROLLER ");
If you expect several responses, ask won't cut it; it is only for a single request-response, where the response ends up in a Future/CompletionStage.
There are a few different strategies to wait for all answers:
One is to create an intermediate actor whose only job is to collect all the answers and, when all partial responses have arrived, respond to the original requester; that way you can still use ask to get a single aggregate response back.
Another option would be to use Source.actorRef to get an ActorRef that you can use as the sender together with tell (and skip using ask). Inside the stream you would then take elements until some criterion is met (a timeout has passed, or enough elements have been seen). You may have to add an operator that mimics the ask response timeout to make sure the stream fails if the actor never responds.
There are some other issues with the code shared. One is creating a materializer on each request: materializers have a lifecycle and will fill up your heap over time, so you should instead have a Materializer injected by Play.
With the given logic there is no need whatsoever to use the GraphDSL; that is only needed for complex graphs with multiple inputs and outputs or cycles. You should be able to compose the operators using the Flow API alone (see for example https://doc.akka.io/docs/akka/current/stream/stream-flows-and-basics.html#defining-and-running-streams )

There is no way to create a reference to stream & it’s not possible to reuse the same stream multiple times

Reading article about java 8 stream, and found
Java Streams are consumable, so there is no way to create a reference
to stream for future usage. Since the data is on-demand, it’s not
possible to reuse the same stream multiple times.
at the same time at the same article
//sequential stream
Stream<Integer> sequentialStream = myList.stream();
//parallel stream
Stream<Integer> parallelStream = myList.parallelStream();
What does "there is no way to create a reference to a stream for future usage" mean? Aren't sequentialStream and parallelStream references to streams?
Also, what does "it’s not possible to reuse the same stream multiple times" mean?
What it means is that every time you need to operate on a stream, you must make a new one.
So you cannot, for example, have something like:
class Person {
    private Stream<String> phoneNumbers;

    Stream<String> getPhoneNumbers() {
        return phoneNumbers;
    }
}
and just reuse that one stream whenever you like. Instead, you must have something like
class Person {
    private List<String> phoneNumbers;

    Stream<String> getPhoneNumbers() {
        return phoneNumbers.stream(); // make a NEW stream over the same data
    }
}
The code snippet you included does just that: it makes two different streams over the same data.
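A minimal, self-contained sketch of the difference (the phone-number data is just illustrative): consuming a stored stream twice throws IllegalStateException, while a fresh stream over the same data works every time.

```java
import java.util.List;
import java.util.stream.Stream;

public class StreamReuseDemo {
    /** Consumes the given stream twice; returns true if the second use threw. */
    static boolean secondUseFails(Stream<String> s) {
        s.count(); // first terminal operation consumes the stream
        try {
            s.count(); // second use: "stream has already been operated upon or closed"
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> phoneNumbers = List.of("555-0100", "555-0101");

        // A stored stream can be consumed only once:
        System.out.println(secondUseFails(phoneNumbers.stream())); // true

        // A fresh stream over the same data works every time:
        System.out.println(phoneNumbers.stream().count()); // 2
        System.out.println(phoneNumbers.stream().count()); // 2
    }
}
```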

Is it possible to know the size of a stream without using a terminal operation

I have 3 interfaces
public interface IGhOrg {
    int getId();
    String getLogin();
    String getName();
    String getLocation();
    Stream<IGhRepo> getRepos();
}

public interface IGhRepo {
    int getId();
    int getSize();
    int getWatchersCount();
    String getLanguage();
    Stream<IGhUser> getContributors();
}

public interface IGhUser {
    int getId();
    String getLogin();
    String getName();
    String getCompany();
    Stream<IGhOrg> getOrgs();
}
and I need to implement Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations).
This method returns the IGhRepo with the most contributors (getContributors()).
I tried this
Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations) {
    return organizations
        .flatMap(IGhOrg::getRepos)
        .max((repo1, repo2) ->
            (int) repo1.getContributors().count() - (int) repo2.getContributors().count());
}
but it gives me the
java.lang.IllegalStateException: stream has already been operated upon or closed
I understand that count() is a terminal operation in Stream but I can't solve this problem, please help!
thanks
Is it possible to know the size of a stream without using a terminal operation
No, it is not, because streams can be infinite or generate output on demand; they are not necessarily backed by collections.
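A small illustration of why: a generated stream has no predetermined size, so the only way to get a count is a terminal operation, which would never return on an infinite stream unless you bound it first.

```java
import java.util.stream.Stream;

public class StreamSizeDemo {
    public static void main(String[] args) {
        // An infinite stream: it has no size to report.
        Stream<Integer> naturals = Stream.iterate(0, i -> i + 1);

        // naturals.count() would never return; you must bound the stream first,
        // and even then count() is a terminal operation that consumes it.
        long bounded = naturals.limit(5).count();
        System.out.println(bounded); // 5
    }
}
```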
but it gives me the
java.lang.IllegalStateException: stream has already been operated upon or closed
That's because you are returning the same stream instance on each method invocation. You should return a new Stream instead.
I understand that count() is a terminal operation in Stream but I can't solve this problem, please help!
IMHO you are misusing streams here. Performance- and simplicity-wise, it is much better to return some Collection<XXX> instead of Stream<XXX>.
No. It is not possible to know the size of a stream in Java.
As mentioned in the Java 8 stream docs:
No storage. A stream is not a data structure that stores elements;
instead, it conveys elements from a source such as a data structure,
an array, a generator function, or an I/O channel, through a pipeline
of computational operations.
You don't specify this, but it looks like some or possibly all of the interface methods that return Stream<...> values don't return a fresh stream each time they are called.
This seems problematic to me from an API point of view, as it means each of these streams, and a fair chunk of the object's functionality, can be used at most once.
You may be able to solve the particular problem you are having by ensuring that the stream from each object is used only once in the method, something like this:
Optional<IGhRepo> highestContributors(Stream<IGhOrg> organizations) {
    return organizations
        .flatMap(IGhOrg::getRepos)
        .distinct()
        .map(repo -> new AbstractMap.SimpleEntry<>(repo, repo.getContributors().count()))
        .max(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey);
}
Unfortunately it looks like you will now be stuck if you want to (for example) print a list of the contributors, as the stream returned from getContributors() for the returned IGhRepo has already been consumed.
You might want to consider having your implementation objects return a fresh stream each time a stream returning method is called.
You could keep a counter that is incremented per "iteration" using peek. In the example below, the counter is incremented before each item is processed with doSomeLogic.
final var counter = new AtomicInteger();
getStream().peek(item -> counter.incrementAndGet()).forEach(this::doSomeLogic);
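A self-contained version of that idea (getStream() and doSomeLogic in the fragment above are stand-ins; here plain Stream.of and println take their place):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class PeekCounterDemo {
    public static void main(String[] args) {
        final AtomicInteger counter = new AtomicInteger();

        // peek() increments the counter as each element flows past,
        // without itself consuming the stream.
        Stream.of("a", "b", "c")
              .peek(item -> counter.incrementAndGet())
              .forEach(System.out::println);

        System.out.println(counter.get()); // 3
    }
}
```

Note that the count is only known once the terminal forEach has run; peek alone does not pull any elements through the pipeline.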

how to choose a field value from a specific stream in storm

public void execute(Tuple input) {
    Object value = input.getValueByField(FIELD_NAME);
    ...
}
When calling getValueByField, how do I specify a particular stream name emitted by a previous bolt/spout, so that FIELD_NAME is read from that stream?
I need to know this because I'm facing the following exception:
InvalidTopologyException(msg:Component: [bolt2-name] subscribes from non-existent stream: [default] of component [bolt1-name])
So, I want to specify a particular stream while calling getValueBy... methods.
I don't remember a way of doing it on a tuple, but you can get the information of who sent you the tuple:
String sourceComponent = tuple.getSourceComponent();
String streamId = tuple.getSourceStreamId();
Then you can use a classic switch/case in Java to call a specific method that knows which fields are available.
You can also iterate over the fields included in your tuple to check whether a field is available, but I find that approach dirty.
for (String field : tuple.getFields()) {
    // Check something on field...
}
Just found out that binding to a specific stream can be done while building the topology.
The spout can declare fields on a stream (in its declareOutputFields method):
declarer.declareStream(streamName, new Fields(field1, field2));
...and emit values to that stream:
collector.emit(streamName, new Values(value1, value2...), msgID);
When the bolt is added to the topology, it can subscribe to a specific stream from a preceding spout or bolt like the following:
topologyBuilder.setBolt(boltId, new BoltClass(), parallelismLevel)
    .localOrShuffleGrouping(spoutORBoltID, streamID);
The overloaded version of localOrShuffleGrouping provides the option to specify the streamID as the last argument.

Can I write different types of messages to a chronicle-queue?

I would like to write different types of messages to a chronicle-queue, and process messages in consumers depending on their types.
How can I do that?
Chronicle-Queue provides low-level building blocks you can use to write any kind of message, so it is up to you to choose the right data structure.
For example, you can prefix the data you write to a chronicle with a small header containing some metadata, and then use that as a discriminator for data processing.
To achieve this I use Wire:
try (DocumentContext dc = appender.writingDocument()) {
    final Wire wire = dc.wire();
    final ValueOut valueOut = wire.getValueOut();
    valueOut.typePrefix(m.getClass());
    valueOut.marshallable(m);
}
When reading back I do:
try (DocumentContext dc = tailer.readingDocument()) {
    final Wire wire = dc.wire();
    final ValueIn valueIn = wire.getValueIn();
    final Class<?> clazz = valueIn.typePrefix();
    // msgPool is a preallocated map from class to a reusable message instance
    final ReadMarshallable readObject = msgPool.get(clazz);
    valueIn.readMarshallable(readObject);
    // readObject can now be used
}
You can also write/read a generic object. This will be slightly slower than using your own scheme, but it is a simple way to always read back the type you wrote.