I have two streams and want to join the second stream to the first inside a window, because I need to do some computation on the join of the two streams related to a session (one of the streams governs the session).
As I read the documentation, a (session) window allows computations only on a single stream, not on a join.
I have tried to use a combination of a session window and a CoProcessFunction, but the result is not exactly what I expected.
Is there a way to merge two streams related to a session window in Flink?
Flink's DataStream API includes a session window join, which is described in the documentation.
You'll have to see if its semantics match what you have in mind. The session gap is defined by both streams having no events during that interval, and the join is an inner join, so if there is a session window that only contains elements from one stream, no output will be emitted.
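As a rough illustration, the built-in join looks something like this (a minimal sketch only; the SessionEvent, DetailEvent and JoinedResult types, the getSessionId() key selectors, and the 10-minute gap are assumptions, not part of your setup):

// Sketch of Flink's session window join, assuming both streams carry a session id
// and event-time timestamps; all type names are placeholders.
DataStream<JoinedResult> joined = first
    .join(second)
    .where(SessionEvent::getSessionId)
    .equalTo(DetailEvent::getSessionId)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .apply(new JoinFunction<SessionEvent, DetailEvent, JoinedResult>() {
        @Override
        public JoinedResult join(SessionEvent s, DetailEvent d) {
            return new JoinedResult(s, d);
        }
    });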
If that doesn't meet your needs, then I would suggest a CoProcessFunction, but without a session window. In other words, I'm suggesting you might implement all of the logic yourself.
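If you go that route, a hand-rolled version might look roughly like the sketch below. This is only an outline, assuming both streams are keyed by the same session key, use event time, and that a 10-minute inactivity gap closes the session; SessionEvent, DetailEvent, JoinedResult and JoinedResult.of(...) are placeholder names.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Buffers events from both streams per key and emits one joined result when the
// session gap expires -- even if only one side contributed events.
public class SessionJoinFunction
        extends CoProcessFunction<SessionEvent, DetailEvent, JoinedResult> {

    private static final long GAP_MS = 10 * 60 * 1000L; // assumed session gap

    private transient ListState<SessionEvent> sessionBuffer;
    private transient ListState<DetailEvent> detailBuffer;
    private transient ValueState<Long> sessionEndTimer;

    @Override
    public void open(Configuration parameters) {
        sessionBuffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("sessions", SessionEvent.class));
        detailBuffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("details", DetailEvent.class));
        sessionEndTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("endTimer", Long.class));
    }

    @Override
    public void processElement1(SessionEvent value, Context ctx,
                                Collector<JoinedResult> out) throws Exception {
        sessionBuffer.add(value);
        extendSession(ctx);
    }

    @Override
    public void processElement2(DetailEvent value, Context ctx,
                                Collector<JoinedResult> out) throws Exception {
        detailBuffer.add(value);
        extendSession(ctx);
    }

    // push the "end of session" timer forward on every new event
    private void extendSession(Context ctx) throws Exception {
        Long current = sessionEndTimer.value();
        if (current != null) {
            ctx.timerService().deleteEventTimeTimer(current);
        }
        long newEnd = ctx.timestamp() + GAP_MS;
        ctx.timerService().registerEventTimeTimer(newEnd);
        sessionEndTimer.update(newEnd);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<JoinedResult> out) throws Exception {
        // session closed: build whatever join result you need from the two buffers,
        // even if one of them is empty
        out.collect(JoinedResult.of(sessionBuffer.get(), detailBuffer.get()));
        sessionBuffer.clear();
        detailBuffer.clear();
        sessionEndTimer.clear();
    }
}

You would apply it with something like first.connect(second).keyBy(keySelector1, keySelector2).process(new SessionJoinFunction()), so that state and timers are scoped per session key.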
I'm considering creating a Spring Batch job that uses rows from a table to create events and push the events to a PubSub implementation. The problem here is that the order of events should be the same as the order of the rows in the table that is used as the source for the event creation process.
It seems to me now that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel. The only ugly but probably working solution for this problem would be to do all the processing in the reader (so the reader would do reading + processing + writing to PubSub). That could help keep the order inside paginated batches, but even that doesn't seem to guarantee the order of the batches, according to the docs.
Any ideas how the transition from ordered rows to ordered events could be implemented using Spring Batch or, at least, Spring Boot? Thank you in advance!
It seems to me now that Spring Batch is not designed for such order preservation, as batches are processed and then written in parallel.
This is true only for a multi-threaded or a partitioned step. The default (single-threaded) chunk-oriented step implementation processes items in the same order returned by the item reader. So if you make your database reader return items in the order you want, those items will be written to your Pub/Sub broker in the same order.
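For illustration, a minimal sketch using the Spring Batch 4-style builder API; the EventRow type, the table and column names, and the pubSubWriter bean are assumptions. The important parts are the ORDER BY in the reader's query and the absence of a taskExecutor on the step.

// Sketch of a single-threaded chunk-oriented step with an ordered reader.
// EventRow and pubSubWriter are hypothetical; only the ordering mechanics matter here.
@Bean
public JdbcCursorItemReader<EventRow> orderedReader(DataSource dataSource) {
    return new JdbcCursorItemReaderBuilder<EventRow>()
            .name("orderedReader")
            .dataSource(dataSource)
            .sql("SELECT id, payload FROM events ORDER BY id ASC") // defines the event order
            .rowMapper(new BeanPropertyRowMapper<>(EventRow.class))
            .build();
}

@Bean
public Step publishStep(StepBuilderFactory steps,
                        JdbcCursorItemReader<EventRow> orderedReader,
                        ItemWriter<EventRow> pubSubWriter) {
    return steps.get("publishStep")
            .<EventRow, EventRow>chunk(100)
            .reader(orderedReader)
            .writer(pubSubWriter) // no taskExecutor(...) -> single-threaded, order preserved
            .build();
}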
In my current project we have created a data pipeline using Kafka, Kafka Connect and Elasticsearch. The data ends up on a topic "signal-topic" and is of the form
KeyValue<id:String, obj:Signal>
Now I'm trying to introduce Kafka Streams to be able to do some processing of the data on its way from Kafka to Elasticsearch.
My first goal is to be able to enhance the data with different kinds of side information. A typical scenario would be to attach another field to the data based on some information already existing in the data. For instance, the data contains a "rawevent" field, and based on that I want to add an "event-description" and then output to a different topic.
What would be the "correct" way of implementing this?
I was thinking of maybe having the side data on a separate topic in Kafka
KeyValue<rawEvent:String, eventDesc:String>
and having Kafka Streams join the two topics, but I'm not sure how to accomplish that.
Would this be possible? All the examples I've come across seem to require that the keys of the data sources be the same, and since mine aren't, I'm not sure it's possible.
If anyone has a snippet for how this could be done, it would be great.
Thanks in advance.
You have two possibilities:
You can extract rawEvent from Signal and set it as the new key to do the join against a KTable<rawEvent:String, eventDesc:String>. Something like KStream#selectKey(...)#join(KTable...)
You can do a KStream-GlobalKTable join: this allows you to extract a non-key join attribute from the KStream (in your case rawEvent) that is used to do a GlobalKTable lookup to compute the join.
Note that both joins provide different semantics: a KStream-KTable join is synchronized on time, while a KStream-GlobalKTable join is not. Check out this blog post for more details: https://www.confluent.io/blog/crossing-streams-joins-apache-kafka/
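A rough sketch of both options follows (you would use one of the two, not both). Topic names, default serdes, and the Signal, EnrichedSignal types with a getRawEvent() accessor are assumptions for the sake of the example.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Signal> signals = builder.stream("signal-topic");

// Option 1: re-key the stream (triggers a repartition) and join against a KTable
KTable<String, String> eventDescriptions = builder.table("event-descriptions-topic");
signals.selectKey((id, signal) -> signal.getRawEvent())
       .join(eventDescriptions,
             (signal, eventDesc) -> new EnrichedSignal(signal, eventDesc))
       .to("enriched-signal-topic");

// Option 2: KStream-GlobalKTable join with a non-key lookup, no repartitioning
GlobalKTable<String, String> globalDescriptions =
        builder.globalTable("event-descriptions-topic");
signals.join(globalDescriptions,
             (id, signal) -> signal.getRawEvent(),   // extract the lookup key per record
             (signal, eventDesc) -> new EnrichedSignal(signal, eventDesc))
       .to("enriched-signal-topic");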
In the Kafka Streams library, I want to know the difference between KTable and GlobalKTable.
Also, in the KStream class there are two methods, leftJoin() and outerJoin(). What is the difference between these two methods as well?
I read KStream.leftJoin, but did not manage to find an exact difference.
KTable VS GlobalKTable
A KTable shards the data between all running Kafka Streams instances, while a GlobalKTable has a full copy of all data on each instance. The disadvantage of GlobalKTable is that it obviously needs more memory. The advantage is that you can do a KStream-GlobalKTable join with a non-key attribute from the stream. A KStream-KTable join with a non-key stream attribute is only possible by extracting the join attribute and setting it as the key before doing the join -- this results in a repartitioning step of the stream before the join can be computed.
Note, though, that there is also a semantic difference: for a stream-table join, Kafka Streams aligns record processing order based on record timestamps. Thus, the updates to the table are aligned with the records of your stream. For GlobalKTable, there is no time synchronization, and thus updates to the GlobalKTable are completely decoupled from the processing of the stream records (thus, you get weaker semantics).
For further details, see KIP-99: Add Global Tables to Kafka Streams.
leftJoin() VS outerJoin()
About left and outer joins: they are like a left-outer join and a full-outer join in a database, respectively.
For a left outer join, you might "lose" data from your right input stream if there is no join match on the left-hand side.
For a (full) outer join, no data is dropped and each input record of both streams will be in the result stream.
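For example, with two KStreams and a windowed join (the topic names, String values, and the 5-minute window are just assumptions for this sketch; older Kafka Streams versions take the window size in milliseconds rather than a Duration):

import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> left = builder.stream("left-topic");
KStream<String, String> right = builder.stream("right-topic");
JoinWindows window = JoinWindows.of(Duration.ofMinutes(5));

// left outer join: every left record appears in the output (right side may be null);
// unmatched right records are dropped
left.leftJoin(right, (l, r) -> l + "/" + r, window).to("left-join-output");

// full outer join: every record from both sides appears in the output;
// the missing side is null when there is no match
left.outerJoin(right, (l, r) -> l + "/" + r, window).to("outer-join-output");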
I'm trying to produce some kind of materialized view from a stream of database updates (provided by e.g. the DBMS's transaction log, with the help of e.g. maxwell-daemon). The view is materialized as a Kafka compacted topic.
The view is a simple join and could be expressed as a query like this:
SELECT u.email user_email, t.title todo_title, t.state todo_state
FROM User u
JOIN Todo t
ON t.user_id = u.id
I want the view to be updated every time User or Todo changes (i.e., a message published on the view's Kafka topic).
With Kafka Streams it seems to be possible to achieve that by doing this:
Make a KTable of User changes
Make a KTable of Todo changes
Join both
However, I'm not sure of a few things:
Is that even possible?
Will this maintain the original ordering of events? E.g., if User is changed and then Todo is changed, am I guaranteed to see these changes in this order in the result of the join?
How to handle transactions? E.g., multiple database changes might be part of the same transaction. How to make sure that both KTables are updated atomically, and that all join results show only fully-applied transactions?
Is that even possible?
Yes. The pattern you describe will compute what you want out-of-the-box.
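For example, with the foreign-key join available in newer Kafka Streams versions (2.4+), the pattern looks roughly like this (topic names and the User, Todo and TodoView types are assumptions; on older versions you would have to re-key one of the tables manually before the join):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

StreamsBuilder builder = new StreamsBuilder();

KTable<String, User> users = builder.table("users-changelog"); // keyed by user id
KTable<String, Todo> todos = builder.table("todos-changelog"); // keyed by todo id

// foreign-key join: t.user_id = u.id, result stays keyed by todo id
KTable<String, TodoView> view = todos.join(
        users,
        todo -> todo.getUserId(),
        (todo, user) -> new TodoView(user.getEmail(), todo.getTitle(), todo.getState()));

// publish every update of the joined view to the (compacted) output topic
view.toStream().to("todo-view-topic");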
Will this maintain the original ordering of events? E.g., if User is changed and then Todo is changed, am I guaranteed to see these changes in this order in the result of the join?
Streams will process data according to timestamps (ie, records with smaller timestamps first). Thus, in general this will work as expected. However, there is no strict guarantee, because in stream processing it's more important to make progress all the time (and not block). Thus, Streams only applies a "best effort" approach with regard to processing records in timestamp order. For example, if one changelog does not provide any data, Streams will just keep going, only processing data from the other changelog (and not block). This might lead to "out of order" processing with regard to timestamps from different partitions/topics.
How to handle transactions? E.g., multiple database changes might be part of the same transaction. How to make sure that both KTables are updated atomically, and that all join results show only fully-applied transactions?
That's not possible at the moment. Each update will be processed individually and you will see each intermediate (ie, not committed) result. However, Kafka will introduce "transactional processing" in the future that will make it possible to handle transactions. (See https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging and https://cwiki.apache.org/confluence/display/KAFKA/KIP-129%3A+Streams+Exactly-Once+Semantics)
I'm trying to create a Ruby script that spawns several concurrent child processes, each of which needs to access the same data store (a queue of some type) and do something with the data. The problem is that each row of data should be processed only once, and a child process has no way of knowing whether another child process might be operating on the same data at the same instant.
I haven't picked a data store yet, but I'm leaning toward PostgreSQL simply because it's what I'm used to. I've seen the following SQL fragment suggested as a way to avoid race conditions, because the UPDATE clause supposedly locks the table row before the SELECT takes place:
UPDATE jobs
SET status = 'processed'
WHERE id = (
SELECT id FROM jobs WHERE status = 'pending' LIMIT 1
) RETURNING id, data_to_process;
But will this really work? It doesn't seem intuitive that Postgres (or any other database) could lock the table row before performing the SELECT, since the SELECT has to be executed to determine which row needs to be locked for updating. In other words, I'm concerned that this SQL fragment won't really prevent two separate processes from selecting and operating on the same table row.
Am I being paranoid? And are there better options than traditional RDBMSs to handle concurrency situations like this?
As you said, use a queue. The standard solution for this in PostgreSQL is PgQ. It has all these concurrency problems worked out for you.
Do you really want many concurrent child processes that must operate serially on a single data store? I suggest that you create one writer process that has sole access to the database (whatever you use) and accepts requests from the other processes to do the database operations you want. Then do the appropriate queue management in that process rather than making your database do it, and you are assured that only one process accesses the database at any time.
The situation you are describing is called "Non-repeatable read". There are two ways to solve this.
The preferred way would be to set the transaction isolation level to at least REPEATABLE READ. This means that concurrent updates of the nature you described will fail: if two processes update the same rows in overlapping transactions, one of them will be canceled, its changes ignored, and an error returned. That transaction will then have to be retried. This is achieved by calling
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
at the start of the transaction. I can't seem to find documentation that explains an idiomatic way of doing this for Ruby; you may have to emit that SQL explicitly.
The other option is to manage the locking of tables explicitly, which can cause a transaction to block (and possibly deadlock) until the table is free. Transactions won't fail in the same way as they do above, but contention will be much higher, and so I won't describe the details.
That's pretty close to the approach I took when I wrote pg_message_queue, which is a simple queue implementation for PostgreSQL. Unlike PgQ, it requires no components outside of PostgreSQL to use.
It will work just fine. MVCC will come to the rescue.