What are the differences between KTable vs GlobalKTable and leftJoin() vs outerJoin()? - apache-kafka-streams

In the Kafka Streams library, I want to know the difference between KTable and GlobalKTable.
Also, the KStream class has two methods, leftJoin() and outerJoin(). What is the difference between these two methods?
I read KStream.leftJoin, but did not manage to find an exact answer.

KTable VS GlobalKTable
A KTable shards the data between all running Kafka Streams instances, while a GlobalKTable has a full copy of all data on each instance. The disadvantage of a GlobalKTable is that it obviously needs more memory. The advantage is that you can do a KStream-GlobalKTable join with a non-key attribute from the stream. For a KStream-KTable join, using a non-key stream attribute is only possible by extracting the join attribute and setting it as the key before doing the join -- this results in a repartitioning step of the stream before the join can be computed.
Note, though, that there is also a semantic difference: for a stream-table join, Kafka Streams aligns record processing based on record timestamps. Thus, updates to the table are aligned with the records of your stream. For a GlobalKTable, there is no time synchronization, so updates to the GlobalKTable are completely decoupled from the processing of the stream records (thus, you get weaker semantics).
For further details, see KIP-99: Add Global Tables to Kafka Streams.
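For illustration, here is a minimal sketch of a KStream-GlobalKTable join that uses a non-key attribute of the stream; the topic names and the Order/Customer/EnrichedOrder types are made up for this example, and serde configuration is omitted:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();

KStream<String, Order> orders = builder.stream("orders");                     // key = orderId
GlobalKTable<String, Customer> customers = builder.globalTable("customers");  // key = customerId

// The second argument extracts the lookup key from the stream record,
// so no repartitioning of the stream is needed.
KStream<String, EnrichedOrder> enriched = orders.join(
        customers,
        (orderId, order) -> order.getCustomerId(),
        (order, customer) -> new EnrichedOrder(order, customer));

The equivalent KStream-KTable join would first need selectKey((orderId, order) -> order.getCustomerId()) to re-key the stream, which triggers the repartitioning step mentioned above.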
leftJoin() VS outerJoin()
About left and outer joins: they correspond to a left-outer join and a full-outer join in a database, respectively.
For a left outer join, you might "lose" data from your right input stream if there is no join match on the left-hand side.
For a (full) outer join, no data is dropped and every input record of both streams will be in the result stream.
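As a rough sketch (assuming two String-valued topics and default String serdes), the difference shows up in which side of the ValueJoiner can be null:

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> left = builder.stream("left-topic");
KStream<String, String> right = builder.stream("right-topic");

// Left join: every left record appears in the result; the right value may be null.
KStream<String, String> leftJoined = left.leftJoin(
        right,
        (l, r) -> l + "/" + (r == null ? "<no match>" : r),
        JoinWindows.of(Duration.ofMinutes(5)));

// (Full) outer join: records from both sides appear; either value may be null.
KStream<String, String> outerJoined = left.outerJoin(
        right,
        (l, r) -> (l == null ? "<no match>" : l) + "/" + (r == null ? "<no match>" : r),
        JoinWindows.of(Duration.ofMinutes(5)));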

Related

Flink Join two streams inside a session window

I have two streams and want to join the second stream to the first inside a window, because I need to do some computation on the join of the two streams related to a session (one of the streams governs the session).
Actually, as I read from the documentation, the (session) window allows computations only on a single stream, not in a join.
I have tried to use a combination of a session window and a CoProcessFunction, but the result is not exactly what I expected.
Is there a way to merge two streams related to a session window in Flink?
Flink's DataStream API includes a session window join, which is described here.
You'll have to see if its semantics match what you have in mind. The session gap is defined by both streams having no events during that interval, and the join is an inner join, so if there is a session window that only contains elements from one stream, no output will be emitted.
If that doesn't meet your needs, then I would suggest a CoProcessFunction, but without a session window. In other words, I'm suggesting you might implement all of the logic yourself.
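For reference, a session window join in Flink's DataStream API looks roughly like the sketch below; the Event type, its getSessionKey() accessor, the Event.merge() helper, and the 10-minute gap are assumptions, and the join is an inner join per session:

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// first: the stream that governs the session; second: the stream to join in.
public static DataStream<Event> joinPerSession(DataStream<Event> first, DataStream<Event> second) {
    KeySelector<Event, String> bySessionKey = new KeySelector<Event, String>() {
        @Override
        public String getKey(Event e) {
            return e.getSessionKey();
        }
    };
    return first
            .join(second)                     // inner join: a session with only one side emits nothing
            .where(bySessionKey)
            .equalTo(bySessionKey)
            // the session gap only closes when *both* streams are idle for 10 minutes
            .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
            .apply(new JoinFunction<Event, Event, Event>() {
                @Override
                public Event join(Event a, Event b) {
                    return Event.merge(a, b); // hypothetical merge of the two sides
                }
            });
}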

Joining separate topics with Kafka Streams?

In my current project we have created a data pipeline using Kafka, Kafka Connect, and Elasticsearch. The data ends up on a topic "signal-topic" and is of the form
KeyValue<id:String, obj:Signal>
Now I'm trying to introduce Kafka Streams to be able to do some processing of the data on its way from Kafka to Elasticsearch.
My first goal is to be able to enhance the data with different kinds of side information. A typical scenario would be to attach another field to the data based on some information already existing in the data. For instance, the data contains a "rawevent" field, and based on that I want to add an "event-description" and then output to a different topic.
What would be the "correct" way of implementing this?
I was thinking of maybe having the side data on a separate topic in Kafka
KeyValue<rawEvent:String, eventDesc:String>
and having Streams join the two topics, but I'm not sure how to accomplish that.
Would this be possible? All the examples that I've come across seem to require that the keys of the data sources be the same, and since mine aren't, I'm not sure it's possible.
If anyone has a snippet for how this could be done, that would be great.
Thanks in advance.
You have two possibilities:
You can extract rawEvent from Signal and set it as the new key to do the join against a KTable<rawEvent:String, eventDesc:String>. Something like KStream#selectKey(...)#join(KTable...)
You can do a KStream-GlobalKTable join: this allows you to extract a non-key join attribute from the KStream (in your case rawEvent) that is used to do a GlobalKTable lookup to compute the join.
Note that both joins provide different semantics, as a KStream-KTable join is synchronized on time, while a KStream-GlobalKTable join is not. Check out this blog post for more details: https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/
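A rough sketch of the first option (the "event-descriptions" topic name, the Signal accessor, and the EnrichedSignal type are assumptions, and serde configuration is omitted):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

KStream<String, Signal> signals = builder.stream("signal-topic");          // key = id
KTable<String, String> eventDescs = builder.table("event-descriptions");   // key = rawEvent

// Option 1: re-key the stream by rawEvent (this triggers a repartition step),
// then join against the KTable, which is keyed by rawEvent.
KStream<String, EnrichedSignal> enriched = signals
        .selectKey((id, signal) -> signal.getRawEvent())
        .join(eventDescs, (signal, desc) -> new EnrichedSignal(signal, desc));

enriched.to("enriched-signal-topic");

For the second option you would use builder.globalTable(...) instead, together with the three-argument join that extracts rawEvent from the Signal value, which avoids the repartition step.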

Kafka streams to build materialised views

I'm trying to produce some kind of materialized view from a stream of database updates (provided by e.g. the DBMS's transaction log, with the help of e.g. maxwell-daemon). The view is materialized as a Kafka compacted topic.
The view is a simple join and could be expressed as a query like this:
SELECT u.email user_email, t.title todo_title, t.state todo_state
FROM User u
JOIN Todo t
ON t.user_id = u.id
I want the view to be updated every time User or Todo changes (a message to be published on the view's Kafka topic).
With Kafka Streams it seems to be possible to achieve that by doing this:
Make a KTable of User changes
Make a KTable of Todo changes
Join both
However, I'm not sure of a few things:
Is that even possible ?
Will this maintain original ordering of events ? e.g. if User is changed, then Todo is changed, am I guaranteed to see these changes in this order in the result of the join ?
How to handle transactions ? e.g. multiple database changes might be part of the same transaction. How to make sure that both KTables are updated atomically, and that all join results show only fully-applied transactions ?
Is that even possible ?
Yes. The pattern you describe will compute what you want out-of-the-box.
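For completeness: newer Kafka Streams releases (2.4+) also provide a foreign-key KTable-KTable join that maps directly onto the t.user_id = u.id condition. A rough sketch, with the topic names and the User/Todo/TodoView types assumed and serde configuration omitted:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

KTable<String, User> users = builder.table("users");   // key = user id
KTable<String, Todo> todos = builder.table("todos");   // key = todo id

// Foreign-key join (Kafka Streams 2.4+): extract user_id from each Todo value and
// look it up in the users table; the result stays keyed by todo id and is updated
// whenever either side changes.
KTable<String, TodoView> view = todos.join(
        users,
        todo -> todo.getUserId(),
        (todo, user) -> new TodoView(user.getEmail(), todo.getTitle(), todo.getState()));

view.toStream().to("todo-view-topic");   // the compacted "materialized view" topic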
Will this maintain original ordering of events ? e.g. if User is changed, then Todo is changed, am I guaranteed to see these changes in this order in the result of the join ?
Streams will process data according to timestamps (ie, records with smaller timestamps first). Thus, in general this will work as expected. However, there is no strict guarantee, because in stream processing it's more important to make progress all the time (and not block). Thus, Streams only applies a "best effort" approach with regard to processing records in timestamp order. For example, if one changelog does not provide any data, Streams will just keep going, processing only data from the other changelog (and not block). This might lead to "out of order" processing with regard to timestamps from different partitions/topics.
How to handle transactions ? e.g. multiple database changes might be part of the same transaction. How to make sure that both KTables are updated atomically, and that all join results show only fully-applied transactions ?
That's not possible at the moment. Each update will be processed individually, and you will see each intermediate (ie, not committed) result. However, Kafka will introduce "transactional processing" in the future that will make it possible to handle transactions. (see https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging and https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics)

Does the Kafka streams aggregation have any ordering guarantee?

My Kafka topic contains statuses keyed by deviceId. I would like to use KStreamBuilder.stream().groupByKey().aggregate(...) to only keep the latest value of a status in a TimeWindow. I guess that, as long as the topic is partitioned by key, the aggregation function can always return the latest values in this fashion:
(key, value, older_value) -> value
Is this a guarantee I can expect from Kafka Streams? Should I roll my own processing method that checks the timestamp?
Kafka Streams guarantees ordering by offset, but not by timestamp. Thus, by default the "last update wins" policy is based on offsets, not on timestamps. Late-arriving records ("late" defined on timestamps) are out of order with regard to timestamps, and they will not be reordered; the original offset order is kept.
If you want your window to contain the latest value based on timestamps, you will need to use the Processor API (PAPI) to make this work.
Within Kafka Streams' DSL, you cannot access the record timestamp that is required to get the correct result. An easy way might be to put a .transform() before the .groupBy() and add the timestamp to the record (ie, its value) itself. Thus, you can use the timestamp within your Aggregator (btw: a .reduce(), which is simpler to use, might also work instead of .aggregate()). Finally, you need a .mapValues() after the .aggregate() to remove the timestamp from the value again.
Using this mix-and-match approach of DSL and PAPI should simplify your code, as you can use the DSL's windowing support and KTable and do not need to do low-level time-window and state management.
Of course, you can also just do all this in a single low-level stateful processor, but I would not recommend it.
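A sketch of that mix-and-match approach might look as follows; it uses transformValues() since only the value changes, the TimestampedStatus holder is a made-up helper, and serdes for the intermediate value are omitted for brevity:

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.ValueTransformer;
import org.apache.kafka.streams.kstream.ValueTransformerSupplier;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.processor.ProcessorContext;

// Made-up helper that pairs a status value with its record timestamp.
class TimestampedStatus {
    final long ts;
    final String status;
    TimestampedStatus(long ts, String status) { this.ts = ts; this.status = status; }
}

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> statuses = builder.stream("status-topic");   // key = deviceId

// Step 1: tag each record with its timestamp (PAPI access via the ProcessorContext).
ValueTransformerSupplier<String, TimestampedStatus> tagWithTimestamp =
        () -> new ValueTransformer<String, TimestampedStatus>() {
            private ProcessorContext context;
            @Override public void init(ProcessorContext context) { this.context = context; }
            @Override public TimestampedStatus transform(String value) {
                return new TimestampedStatus(context.timestamp(), value);
            }
            @Override public void close() {}
        };

KTable<Windowed<String>, String> latestPerWindow = statuses
        .transformValues(tagWithTimestamp)
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
        // Step 2: "last update wins" decided by timestamp instead of offset.
        .reduce((current, incoming) -> incoming.ts >= current.ts ? incoming : current)
        // Step 3: strip the timestamp from the value again.
        .mapValues(tv -> tv.status);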

Which is faster in Apache Pig: Split then Union or Filter and Left Join?

I am currently processing a large input table (10^7 rows) in Pig Latin. The table is filtered on some field, processed, and the processed rows are returned back into the original table. When the processed rows are returned to the original table, the fields the filters are based on are changed, so that in subsequent filtering the already-processed rows are ignored.
Is it more efficient in Apache Pig to first split the processed and unprocessed tables on the filtering criteria, apply the processing, and union the two tables back together, or to filter the first table, apply the processing to the filtered table, and perform a left join back into the original table using a primary key?
I can't say which one will actually run faster; I would simply run both versions and compare execution times :)
If you go for the solution with the join, make sure to specify the smaller (if there is one) of the two tables first in the join operation (probably that's going to be the newly added data). The Pig documentation suggests that this will lead to a performance improvement, because the last table is "not brought into memory but streamed through instead".
