When creating a Kafka stream with GQLAlchemy, which arguments are required?

I've taken a look at the documentation on creating Kafka streams using GQLAlchemy, but I can't tell which parameters are optional and which ones are required.

The required parameters are name, topics, and transform. When you create a MemgraphKafkaStream instance, you actually run a CREATE KAFKA STREAM query, so it's best to refer to the Memgraph docs for such questions; there you can see that all the other arguments are optional.
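For example, a minimal sketch along the lines of the GQLAlchemy how-to guides (the stream name, topic, and transform module below are made up, and a local Memgraph instance is assumed):

from gqlalchemy import Memgraph, MemgraphKafkaStream

memgraph = Memgraph()  # assumes Memgraph is running on localhost:7687

# Only name, topics and transform are required.
stream = MemgraphKafkaStream(
    name="ratings_stream",          # hypothetical stream name
    topics=["ratings"],             # hypothetical Kafka topic
    transform="movielens.rating",   # hypothetical transformation procedure
    # everything else, e.g. bootstrap_servers="localhost:9093", is optional
)

memgraph.create_stream(stream)  # runs the underlying CREATE KAFKA STREAM query
memgraph.start_stream(stream)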

Related

KafkaItemWriter writing but not to default topic

Can someone post an example or help with how I can use KafkaItemWriter to write Java objects into a non-default topic? Just like in Spring Kafka we use KafkaTemplate.send(topicName, user), is there a way to do this with KafkaItemWriter as well? I do not want to write to the default topic configured through the properties file, i.e. spring.kafka.template.default-topic=topicName.
In my Spring Batch job I want to run two different batch jobs and write to two different topics.
I’m not sure this is what you want or not.
But KafkaTemplate has method “setDefaultTopic” (https://docs.spring.io/spring-kafka/api/org/springframework/kafka/core/KafkaTemplate.html#setDefaultTopic(java.lang.String))
I think you can change default topic what you want by using setDefaultTopic before call KafakItemWriter step.
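A rough sketch of that idea (the User type, helper method, and topic names are hypothetical; the usual Spring Batch and Spring Kafka setup is assumed to already be in place):

import org.springframework.batch.item.kafka.KafkaItemWriter;
import org.springframework.batch.item.kafka.builder.KafkaItemWriterBuilder;
import org.springframework.kafka.core.KafkaTemplate;

public class TopicSpecificWriters {

    // Simple payload type used only for this sketch.
    public record User(String id, String name) {}

    // Builds a writer that publishes to the given topic instead of the one
    // configured via spring.kafka.template.default-topic.
    public static KafkaItemWriter<String, User> writerFor(KafkaTemplate<String, User> template,
                                                          String topic) {
        // KafkaItemWriter always sends to the template's default topic,
        // so point the template at the topic this step should write to.
        template.setDefaultTopic(topic);
        return new KafkaItemWriterBuilder<String, User>()
                .kafkaTemplate(template)
                .itemKeyMapper(User::id)   // key each record by the user id
                .build();
    }
}

One step could then use writerFor(template, "topic-a") and the other writerFor(template, "topic-b"); alternatively, give each step its own KafkaTemplate so the two writers don't mutate a shared default topic.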

Kafka Streams - override default addSink implementation / custom producer

This is my first post here and I am not sure if this was covered before, but here goes: I have a Kafka Streams application, using the Processor API, with the topology below:
1. Consume data from an input topic (processor.addSource())
2. Inserts data into a DB (processor.addProcessor())
3. Produce its process status to an output topic (processor.addSink())
The app works fine; however, for traceability purposes, I need to log the moment Kafka Streams produces a message to the output topic, along with its RecordMetadata (topic, partition, offset).
Example below:
KEY="MY_KEY" OUTPUT_TOPIC="MY-OUTPUT-TOPIC" PARTITION="1" OFFSET="1000" STATUS="SUCCESS"
I am not sure if there is a way to override the default Kafka Streams producer to add this logging, or maybe to create my own producer and plug it into the addSink() step. I partially achieved it by implementing my own exception handler (default.production.exception.handler), but it only covers the exceptions.
Thanks in advance,
Guilherme
If you configure the streams application to use a ProducerInterceptor, then you should be able to get the information you need. Specifically, implementing the onAcknowledgement() method will provide access to everything you listed above.
To configure interceptors in a streams application:
Properties props = new Properties();
// add this configuration in addition to your other streams configs
props.put(StreamsConfig.producerPrefix(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG), Collections.singletonList(MyProducerInterceptor.class));
You can provide more than one interceptor if desired; just add the class names and change the list from a singleton to a regular List. The interceptors execute in the order in which the classes appear in the list.
EDIT: Just to be clear, you can override the provided producer in Kafka Streams via the KafkaClientSupplier interface, but IMHO using an interceptor is the cleaner approach; which direction to go is up to you. You pass your KafkaClientSupplier to an overloaded KafkaStreams constructor.
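A sketch of what such an interceptor could look like (reusing the MyProducerInterceptor name from the configuration snippet above; the log format mirrors the example in the question). Note that onAcknowledgement() only exposes the RecordMetadata and any exception, so the record key itself is only visible in onSend():

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical interceptor that logs every record produced by the Streams app.
public class MyProducerInterceptor implements ProducerInterceptor<String, String> {

    private static final Logger log = LoggerFactory.getLogger(MyProducerInterceptor.class);

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // The key and value are only visible here, before the send is acknowledged.
        log.info("KEY=\"{}\" OUTPUT_TOPIC=\"{}\" STATUS=\"SENDING\"", record.key(), record.topic());
        return record; // do not modify the record
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        if (exception == null) {
            log.info("OUTPUT_TOPIC=\"{}\" PARTITION=\"{}\" OFFSET=\"{}\" STATUS=\"SUCCESS\"",
                    metadata.topic(), metadata.partition(), metadata.offset());
        } else {
            log.warn("STATUS=\"FAILURE\"", exception);
        }
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}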

Kafka streams using context forward from processor called in dsl api

I have a processor and would like to call context.forward() in it. However, I feel like I need to set a sink topic for it to actually get forwarded. If I were using the Topology directly I would just .addSource(), .addProcessor(), .addSink(). However, with the DSL I have a StreamsBuilder/KStream. Is there any way to use context.forward() when calling a processor from the DSL?
NOTE: I need to use a processor instead of a transform because I have custom logic for when to forward records downstream.
stream.process(() -> new WindowAggregatorProcessor(storeName), storeName);
stream.process() is a terminal operation in the DSL. You can use stream.transform() instead to get an output stream; a Transformer is basically the same as a Processor, and you can still return null from transform() and emit records via context.forward() whenever your custom logic decides to.
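A minimal sketch of that pattern (the class name, the forwarding condition, and the output topic are hypothetical; it uses the older Transformer API the question refers to):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Hypothetical transformer that only forwards records matching a custom rule.
public class SelectiveForwardTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        // a state store could also be fetched here via context.getStateStore(storeName)
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        // Custom logic decides when to emit; forward() may be called zero or more times.
        if (value != null && value.contains("EMIT")) {   // hypothetical condition
            context.forward(key, value);
        }
        return null; // nothing is emitted via the return value
    }

    @Override
    public void close() {}
}

Wired into the DSL this would look roughly like stream.transform(SelectiveForwardTransformer::new, storeName).to("output-topic"), where .to() plays the role of .addSink().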

Is there a way to get offset for each message consumed in kafka streams?

In order to avoid re-reading messages that were processed but whose offsets did not get committed when a Kafka Streams application is killed, I want to get the offset of each message along with its key and value, so that I can store it somewhere and use it to avoid reprocessing already-processed messages.
Yes, this is possible. See the FAQ entry at http://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information.
I'll copy-paste the key information below:
Accessing record metadata such as topic, partition, and offset information?
Record metadata is accessible through the Processor API. It is also accessible indirectly through the DSL thanks to its Processor API integration.
With the Processor API, you can access record metadata through a ProcessorContext. You can store a reference to the context in an instance field of your processor during Processor#init(), and then query the processor context within Processor#process(), for example (same for Transformer). The context is updated automatically to match the record that is currently being processed, which means that methods such as ProcessorContext#partition() always return the current record's metadata. Some caveats apply when calling the processor context within punctuate(), see the Javadocs for details.
If you use the DSL combined with a custom Transformer, for example, you could transform an input record's value to also include partition and offset metadata, and subsequent DSL operations such as map or filter could then leverage this information.
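As a rough illustration of that last suggestion (the class name and the enriched value format are made up), a custom Transformer could attach the current record's metadata to each value before handing it back to the DSL:

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Hypothetical transformer that attaches topic/partition/offset to each value.
public class MetadataEnrichingTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        // Keep a reference; the context always reflects the record currently being processed.
        this.context = context;
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        String enriched = String.format("%s|topic=%s|partition=%d|offset=%d",
                value, context.topic(), context.partition(), context.offset());
        return KeyValue.pair(key, enriched);
    }

    @Override
    public void close() {}
}

Downstream operations such as map or filter then see the topic, partition, and offset and can persist them wherever the deduplication state from the question is kept.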

Storm: how to set up various metrics for the same data source

I'm trying to set up Storm to aggregate a stream, but with various (DRPC-available) metrics on the same stream.
E.g. the stream consists of messages that have a sender, a recipient, the channel through which the message arrived, and a gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me e.g. the total count of messages by gateway and/or by channel. And besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that will accept messaging events, and from there aggregate the data as needed. Currently I'm playing around with Trident and DRPC and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if any?!
The entire source is available at this gist.
It has three classes:
RandomMessageSpout
used to emit the messaging data
simulates the real data source
SeparateTopology
creates a separate DRPC stream for each metric needed
also a separate query state is created for each metric
they all use the same spout instance
CombinedTopology
creates a single DRPC stream with all the metrics needed
creates a separate query state for each metric
each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
SeparateTopology
is it necessary to use the same spout instance or can I just say new RandomMessageSpout() each time?
I like the idea that I don't need to persist grouped data by all the metrics, but just the groupings we need to extract later
is the data emitted by the spout actually processed by all the state/query combinations, and not just by the first one that gets it?
would this also later enable dynamic addition of new state/query combinations at runtime?
CombinedTopology
I don't really like the idea that I need to persist data grouped by all the metrics since I don't need all the combinations
it came as a surprise that all the metrics always return the same data
e.g. channel and gateway inquiries return status metrics data
I found that this was always the data grouped by the first field in state definition
this topic explains the reasoning behind this behaviour
but I'm wondering if this is a good way of doing things in the first place (and I will find a way around this issue if need be)
SnapshotGet vs TupleCollectionGet in stateQuery
with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
any pointers as to what the correct way of doing this is?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams as needed for each metric you need (see below), and then do a groupBy() by that metric's field and persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.
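A rough sketch of that layout (package names follow the newer org.apache.storm.* structure, the spout is assumed to emit tuples with "gateway" and "channel" fields like the gist's RandomMessageSpout, and MemoryMapState stands in for whatever state backend you actually persist to):

import org.apache.storm.trident.Stream;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.operation.builtin.MapGet;
import org.apache.storm.trident.spout.IBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.tuple.Fields;

public class MetricsTopology {

    // One spout, one stream, and a separate grouped aggregation per metric.
    public static TridentTopology build(IBatchSpout spout) {   // e.g. the gist's RandomMessageSpout
        TridentTopology topology = new TridentTopology();
        Stream messages = topology.newStream("messages", spout);

        // Branch the same stream once per metric and persist only that grouping.
        TridentState countsByGateway = messages
                .groupBy(new Fields("gateway"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        TridentState countsByChannel = messages
                .groupBy(new Fields("channel"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

        // One DRPC query per persisted state; the DRPC argument is the key to look up.
        topology.newDRPCStream("count_by_gateway")
                .groupBy(new Fields("args"))
                .stateQuery(countsByGateway, new Fields("args"), new MapGet(), new Fields("count"));

        topology.newDRPCStream("count_by_channel")
                .groupBy(new Fields("args"))
                .stateQuery(countsByChannel, new Fields("args"), new MapGet(), new Fields("count"));

        return topology;
    }
}

Each branch persists only its own grouping, which avoids the CombinedTopology problem of storing data grouped by every metric at once.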
