Kafka state-store on different scaled instances - spring-boot

I have 5 different machines, each running one instance of a Spring Boot Kafka Streams application scaled to 5 instances. I am using a compacted topic with 50 partitions, together with 2-3 other topics, and each instance runs with a concurrency of 10. I am using Docker Swarm and Docker volumes. From these topics my Kafka Streams app builds KTables or KStreams and performs flatMap, map and join operations.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/kafka-streams");
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 2);
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, EXACTLY_ONCE);
props.put("num.stream.threads", 10);
props.put("application.id", applicationId);
If everything goes OK there is nothing wrong and no data loss in my application's .join() operations, but when one of my instances goes down my join operations stop actually joining.
My question is: when the app is restarted or redeployed (and given that it's running inside a non-persistent container), its state is gone, right? Then my join operations don't work. When I redeploy my instance and re-populate my compacted topic from Elasticsearch with the latest entities, my join operations are OK again. So I think when my application starts on a new machine the local state store is gone? But the Kafka documentation says:
If tasks run on a machine that fails and are restarted on another machine, Kafka Streams guarantees to restore their associated state stores to the content before the failure by replaying the corresponding changelog topics prior to resuming the processing on the newly started tasks. As a result, failure handling is completely transparent to the end user.
Note that the cost of task (re)initialization typically depends primarily on the time for restoring the state by replaying the state stores' associated changelog topics. To minimize this restoration time, users can configure their applications to have standby replicas of local states (i.e. fully replicated copies of the state). When a task migration happens, Kafka Streams then attempts to assign a task to an application instance where such a standby replica already exists in order to minimize the task (re)initialization cost. See num.standby.replicas at the Kafka Streams Configs Section.
(https://kafka.apache.org/0102/documentation/streams/architecture)
Does my downed instance rebuild its Kafka state store when it comes back up? If so, why am I losing data? I have no idea :/ Or can it not reload the state store because of the committed offsets, since all my instances use the same applicationId?
Thanks !

The changelog topics are always read from the earliest offset, and they're compacted, so they don't lose data.
If you're joining non-compacted topics, then sure, you can lose data, but that's not specific to Kafka Streams or your use case... You'll need to configure the topic to retain data for at least as long as you think it'll take you to resolve any downtime. While the data is retained, you can always seek your consumer back to it.
If you want persistent storage, use a volume mount into your container (via Kubernetes, for example), or plug in a state store that lives outside the container, such as Redis: https://github.com/andreas-schroeder/redisks
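For example, here is a minimal sketch, assuming the container mounts a persistent volume at /var/lib/kafka-streams (a hypothetical mount point), of pointing the Streams state directory at that mount instead of the ephemeral /tmp path, and keeping standby replicas so a failed-over task can resume without a full changelog replay:
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // same id on all instances
// State dir backed by a mounted, persistent volume so the local RocksDB state
// survives container restarts (adjust the path to your volume configuration).
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
// Hot standby copies of each store on other instances shorten restore time
// when a task migrates after an instance failure.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 2);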

Related

Uploading data to kafka producer

I am new to Kafka with Spring Boot. I have been through many tutorials and have a fair knowledge of the basics.
Currently I have been assigned a task and I am facing an issue. Hope to get some help here.
The scenario is as follows.
1) I have a DB which is continuously updated with millions of records.
2) I have to hit the DB every 5 minutes, pick the recently updated data and send it to Kafka.
Condition: the old data that I have picked in my previous iteration should not be picked again in my next DB call and push to Kafka.
I am done with the Spring Scheduling part, picking the data with findAll() from Spring Data JPA, but how can I write the logic so that it does not pick the old DB records and only takes the new ones and pushes them to Kafka?
My DB table also has a field called "Recent_timeStamp" of type "datetime".
It's hard to tell without really seeing your logic and the way you work with the database, but from what you've described you shouldn't just do a findAll() here.
Instead you should treat your DB table as time-driven data:
Since it has a timestamp field, make sure there is an index on it.
Instead of findAll(), execute something like:
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > ?
ORDER BY RECENT_TIMESTAMP ASC
In this case you'll get the records ordered by increasing timestamp.
The ? denotes the last memorized timestamp that you've handled, so you'll have to maintain that state yourself; a sketch of this approach follows below.
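Here is a minimal sketch of that first approach, assuming Spring Data JPA, Spring Kafka and Spring's scheduling support; the entity (UserRecord), repository, topic name and the in-memory lastSeen pointer are all hypothetical, and for crash recovery the pointer should be persisted somewhere durable rather than kept in a field:
import java.time.LocalDateTime;
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Hypothetical entity with fields id and recentTimeStamp; the derived query below
// translates to "WHERE recent_time_stamp > ? ORDER BY recent_time_stamp ASC".
interface UserRecordRepository extends JpaRepository<UserRecord, String> {
    List<UserRecord> findByRecentTimeStampAfterOrderByRecentTimeStampAsc(LocalDateTime since);
}

@Component
class RecentChangesPublisher {

    private final UserRecordRepository repository;
    private final KafkaTemplate<String, UserRecord> kafkaTemplate;

    // Last handled timestamp; persist it (e.g. in a small bookkeeping table) to survive restarts.
    private LocalDateTime lastSeen = LocalDateTime.of(1970, 1, 1, 0, 0);

    RecentChangesPublisher(UserRecordRepository repository,
                           KafkaTemplate<String, UserRecord> kafkaTemplate) {
        this.repository = repository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @Scheduled(fixedDelay = 300_000) // every 5 minutes
    void publishRecentChanges() {
        List<UserRecord> changed =
                repository.findByRecentTimeStampAfterOrderByRecentTimeStampAsc(lastSeen);
        for (UserRecord record : changed) {
            kafkaTemplate.send("db-updates", record.getId(), record);
            lastSeen = record.getRecentTimeStamp();
        }
    }
}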
Another option is to query the data whose timestamp is less than 5 minutes old; in this case the query will look like this (pseudocode, since the actual syntax varies):
SELECT <...>
FROM <YOUR_TABLE>
WHERE RECENT_TIMESTAMP > now() - 5 minutes
ORDER BY RECENT_TIMESTAMP ASC
The first method is more robust because if your Spring Boot application is "down" for some reason, you'll be able to recover and query all the records from the point where it failed to send the data. On the other hand you'll have to save this pointer in some kind of persistent storage.
The second solution is "easier" in the sense that you don't have state to maintain, but on the other hand you will miss any data that was written while the application was down after a restart.
In both cases you might want to use some kind of pagination, because you don't know in advance how many records you'll get from the database, and if the number of records exceeds your memory limits the application will end up throwing an OutOfMemoryError; a paged variant is sketched below.
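A possible paged variant of the scheduled fetch above, again assuming Spring Data JPA (the repository method and page size are hypothetical; repository, kafkaTemplate and lastSeen are the same as in the earlier sketch):
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.domain.Slice;
import org.springframework.data.domain.Sort;

// Process the matching rows one page at a time to keep memory usage bounded.
Pageable page = PageRequest.of(0, 500, Sort.by("recentTimeStamp").ascending());
Slice<UserRecord> slice;
do {
    slice = repository.findByRecentTimeStampAfter(lastSeen, page);
    slice.forEach(record -> kafkaTemplate.send("db-updates", record.getId(), record));
    page = slice.nextPageable();
} while (slice.hasNext());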
A completely different approach is to publish the data to Kafka when you write to the database, instead of when you read from it. At that point you're dealing with a data chunk of (probably) reasonably limited size, and in general you don't need the state, because you can store to the DB and send to Kafka from the same service, if the architecture of your application permits doing so.
You can look into kafka connect component if it serves your purpose.
Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka® and other data systems. It makes it simple to quickly define connectors that move large data sets in and out of Kafka. Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency. An export connector can deliver data from Kafka topics into secondary indexes like Elasticsearch, or into batch systems such as Hadoop for offline analysis.

system design - How to update cache only after persisted to database?

After watching this awesome talk by Martin Kleppmann about how Kafka can be used to stream events so that we can get rid of 2-phase commits, I have a couple of questions related to updating a cache only when the database is updated properly.
Problem Statement
Let's say you have a Redis cache which stores the user's profile pic, and a Postgres database which is used for all the user-related operations (creation, update, deletion, etc.).
I want to update my Redis cache if and only if a new user has been successfully added to my database.
How can I do that using Kafka ?
If I am to take the example given in the video then the workflow would follow something like this:
User registers
Request is handled by User Registration Micro service
User Registration Microservice inserts a new entry into the User's table.
Then it generates a User Creation Event in the user_created topic.
Cache population microservice consumes the newly created User Creation Event
Cache population microservice updates the redis cache.
The problem starts here: what would happen if the User Registration Microservice crashed just after writing to the database, but failed to send the event to Kafka?
What would be the correct way of handling this?
Does the User Registration Microservice maintain the last event it published? How can it reliably do that? Does it write to a DB? Then the problem starts all over again: what if it publishes the event to Kafka but fails before it can update its last known offset?
There are three broad approaches one can take for this:
There's the transactional outbox pattern, wherein, in the same transaction as inserting the new entry into the user table, a corresponding user creation event is inserted into an outbox table. Some process then eventually queries that outbox table, publishes the events in that table to Kafka, and deletes the events in the table. Since the inserts are in the same transaction, they either both occur or neither occurs; barring a bug in the process which publishes the outbox to Kafka, this guarantees that every user insert eventually has an associated event published (at least once) to Kafka.
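A minimal sketch of the transactional outbox idea, assuming Spring's @Transactional with JPA; the entities, repositories and table layout are hypothetical, and the relay process that polls the outbox table and publishes to Kafka is not shown:
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
class UserRegistrationService {

    private final UserRepository users;          // hypothetical JPA repository for the user table
    private final OutboxEventRepository outbox;  // hypothetical JPA repository for the outbox table

    UserRegistrationService(UserRepository users, OutboxEventRepository outbox) {
        this.users = users;
        this.outbox = outbox;
    }

    @Transactional
    public void register(User user) {
        // Both inserts share one DB transaction: either the user row and its
        // outbox event are committed together, or neither of them is.
        users.save(user);
        outbox.save(new OutboxEvent("user_created", user.getId(), serialize(user)));
    }

    private String serialize(User user) {
        // Payload format is up to you; any JSON mapper would do here.
        return "{\"id\":\"" + user.getId() + "\"}";
    }
}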
There's a more event-sourcingish pattern, where you publish the user creation event to Kafka and then some consuming process inserts into the user table based on the event. Since this happens with a delay, this strongly suggests that the user registration service needs to keep state of which users it has published creation events for (with the combination of Kafka and Postgres being the source of truth for this). Since Kafka allows a message to be consumed by arbitrarily many consumers, a different consumer can then update Redis.
Change data capture (e.g. Debezium) can be used to tie into Postgres' write-ahead log (as Postgres actually event sources under the hood...) and publish an event that essentially says "this row was inserted into the user table" to Kafka. A consumer of that event can then translate that into a user created event.
CDC in some sense moves the transactional outbox into the infrastructure, at the cost of requiring that the context it inherently throws away be reconstructed later (which is not always possible).
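As an illustration of the CDC route, here is a sketch of a small Kafka Streams topology that turns raw change events from the user table into domain-level user created events; the topic names and serdes are hypothetical, and it assumes the CDC records have already been unwrapped to the plain row state (for example with Debezium's ExtractNewRecordState transform):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

// Raw CDC stream of inserted/updated rows from the user table (hypothetical topic name).
builder.stream("pg.public.users", Consumed.with(Serdes.String(), Serdes.String()))
       // Translate the row-level change into a domain event; here the row JSON is
       // simply wrapped in an envelope, but in practice you would map it to your event schema.
       .mapValues(rowJson -> "{\"type\":\"user_created\",\"data\":" + rowJson + "}")
       .to("user_created", Produced.with(Serdes.String(), Serdes.String()));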
That said, I'd strongly advise against having ____ creation be a microservice and I'd likewise strongly advise against a RInK store like Redis. Both of these smell like attempts to paper over architectural deficiencies by adding microservices and caches.
The one-foot-on-the-way-to-event-sourcing approach isn't one I'd recommend, but if one starts there, the requirement to make the registration service stateful suddenly opens up possibilities which may remove the need for Redis, limit the need for a Kafka-like thing, and allow you to treat the existence of a DB as an implementation detail.

Is KSQL making remote requests under the hood, or is a Table actually a global KTable?

I have a Kafka topic containing customer records, called "customer-created". Each customer is a new record in the topic. There are 4 partitions.
I have two ksql-server instances running, based on the docker image confluentinc/cp-ksql-server:5.3.0. Both use the same KSQL Service Id.
I've created a table:
CREATE TABLE t_customer (id VARCHAR,
firstname VARCHAR,
lastname VARCHAR)
WITH (KAFKA_TOPIC = 'customer-created',
VALUE_FORMAT='JSON',
KEY = 'id');
I'm new to KSQL, but my understanding was that KSQL builds on top of Kafka Streams and that each ksql-server instance is roughly equivalent to a Kafka Streams application instance. The first thing I notice is that as soon as I start a new instance of the ksql-server, it already knows about the tables/streams created on the first instance, even though it is an interactive instance in developer mode. Secondly, I can select the same customer by its ID from both instances, but I expected to only be able to do that from one of them, because I assumed a KSQL table is equivalent to a KTable, i.e. it should only contain local data, i.e. data from the partitions being processed by that ksql-server instance.
SET 'auto.offset.reset'='earliest';
select * from t_customer where id = '7e1a141b-b8a6-4f4a-b368-45da2a9e92a1';
Regardless of which instance of the ksql-server I attach the ksql-cli to, I get a result. The only way that I can get this to work when using plain Kafka Streams, is to use a global KTable. The fact that I get the result from both instances surprised me a little because according to the docs, "Only the Kafka Streams DSL has the notion of a GlobalKTable", so I expected only one of the two instances to find the customer. I haven't found any docs anywhere that explain how to specify that a KSQL Table should be a local or global table.
So here is my question: is a KSQL Table the equivalent of a global KTable and the docs are misleading, or is the ksql-server instance that I am connected to, making a remote request under the hood, to the instance responsible for the ID (presumably based on the partition), as described here, for Kafka Streams?
KSQL does not support GlobalKTables atm.
Your analogy between a KSQL server and a Kafka Streams program is not 100% accurate though. Each query is a Kafka Streams program (note that a "program" can have multiple instances). Also, there is a difference between persistent queries and transient queries. When you create a TABLE from a topic, the command itself is a metadata operation only (similarly for CREATE STREAM from a topic). For both, no query is executed and no Kafka Streams program is started.
The information about all created STREAMS and TABLES is stored in a shared "command topic" in the Kafka cluster. All servers with the same ID receive the same information about created streams and tables.
Queries run in the CLI are transient queries and they are executed by a single server. The information about such transient queries is not distributed to other servers. Basically, a unique query id (i.e., application.id) is generated and the server runs a single-instance KafkaStreams program. Hence, that server/program will subscribe to all partitions.
A persistent query (i.e., CREATE STREAM AS or CREATE TABLE AS) is a query that reads a STREAM or TABLE and produces a STREAM or TABLE as output. The information about persistent queries is distributed via the "command topic" to all servers (however, not all servers will execute all persistent queries -- how many execute each one depends on the configured parallelism). For persistent queries, each server that participates in executing the query creates a KafkaStreams instance running the same program, and all use the same query id (i.e., application.id), and thus different servers will subscribe to different partitions of the input topics.

How does KafkaStreams determine whether a GlobalKTable is fully populated while bootstrapping?

The topic I use to create a GlobalKTable is very active. In the documentation of KStream-GlobalKTable join I read
The GlobalKTable is fully bootstrapped upon (re)start of a KafkaStreams instance, which means the table is fully populated with all the data in the underlying topic that is available at the time of the startup. The actual data processing begins only once the bootstrapping has completed.
How does KafkaStreams determine whether all data is read? Does it read all the messages with a timestamp below the KafkaStreams instance bootstrap time? Or does it use some kind of timeout?
Either way, I guess we better get the retention and log compaction of the underlying topic right or a restart might take a while.
On startup, Kafka Streams reads the current log-end offsets, and bootstrapping is finished once all data up to those offsets has been loaded (cf. KIP-99).
Note, GlobalKTable is designed with static/rarely changing data in mind.
Either way, I guess we better get the retention and log compaction of the underlying topic right or a restart might take a while.
GlobalKTable checkpoints as of 0.11 (released today) so bootstrapping should be much faster on restart than in 0.10.2.
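For reference, a minimal sketch of defining a GlobalKTable and joining a stream against it with the Streams DSL of a recent Kafka version (hypothetical topic names, String serdes for brevity):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

// Fully bootstrapped from the whole topic on (re)start before processing begins.
GlobalKTable<String, String> customers =
        builder.globalTable("customers", Consumed.with(Serdes.String(), Serdes.String()));

KStream<String, String> orders =
        builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

orders.join(customers,
            (orderKey, orderValue) -> orderKey,                 // map each stream record to the table key
            (orderValue, customer) -> orderValue + " | " + customer)
      .to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));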

Kafka Streams with lookup data on HDFS

I'm writing an application with Kafka Streams (v0.10.0.1) and would like to enrich the records I'm processing with lookup data. This data (a timestamped file) is written into an HDFS directory on a daily basis (or 2-3 times a day).
How can I load this in the Kafka Streams application and join to the actual KStream?
What would be the best practice to reread the data from HDFS when a new file arrives there?
Or would it be better switching to Kafka Connect and write the RDBMS table content to a Kafka topic which can be consumed by all the Kafka Streams application instances?
Update:
As suggested, Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis, I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of the semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted, etc. For me, having a scheduled fetch looks safer in this case.
The lookup data is not big, and records may be deleted / added / modified. I also don't know how I could always push a full dump into a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted would probably not work, as I don't know what has been deleted in the source system. Additionally, AFAIK I have no control over when compaction happens.
The recommended approach is indeed to ingest the lookup data into Kafka, too -- for example via Kafka Connect -- as you suggested above yourself.
But in this case how can I schedule the Connect job to run on a daily basis rather than continuously fetch from the source table which is not necessary in my case?
Perhaps you can update your question to explain why you do not want to have a continuous Kafka Connect job running? Are you concerned about resource consumption (load on the DB), are you concerned about the semantics of the processing if it's not "daily updates", or...?
Update:
As suggested, Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis, I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of the semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted, etc. For me, having a scheduled fetch looks safer in this case.
Kafka Connect is safe, and the JDBC connector has been built for exactly the purpose of feeding DB tables into Kafka in a robust, fault-tolerant, and performant way (there are many production deployments already). So I would suggest not to fall back to a "batch update" pattern just because "it looks safer"; personally, I think triggering daily ingestions is operationally less convenient than just keeping it running for continuous (and real-time!) ingestion, and it also leads to several downsides for your actual use case (see next paragraph).
But of course, your mileage may vary -- so if you are set on updating just once a day, go for it. But you lose a) the ability to enrich your incoming records with the very latest DB data at the point in time when the enrichment happens, and, conversely, b) you might actually enrich the incoming records with stale/old data until the next daily update completes, which most probably will lead to incorrect data that you are sending downstream / making available to other applications for consumption. If, for example, a customer updates her shipping address (in the DB) but you only make this information available to your stream processing app (and potentially many other apps) once per day, then an order-processing app will ship packages to the wrong address until the next daily ingest completes.
The lookup data is not big, and records may be deleted / added / modified. I also don't know how I could always push a full dump into a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted would probably not work, as I don't know what has been deleted in the source system.
The JDBC connector for Kafka Connect already handles this automatically for you: 1. it ensures that DB inserts/updates/deletes are properly reflected in a Kafka topic, and 2. Kafka's log compaction ensures that the target topic doesn't grow out of bounds. Perhaps you may want to read up on the JDBC connector in the docs to learn which functionality you just get for free: http://docs.confluent.io/current/connect/connect-jdbc/docs/ ?
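Once the lookup data is flowing into a (compacted) Kafka topic via the JDBC connector, the enrichment itself is a plain KStream-KTable join; a minimal sketch with hypothetical topic names and String serdes follows (unlike a GlobalKTable join, this requires the stream and the table topic to be co-partitioned on the same key):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

StreamsBuilder builder = new StreamsBuilder();

// Lookup table ingested from the RDBMS by Kafka Connect, kept up to date as rows change.
KTable<String, String> lookup =
        builder.table("rdbms-lookup", Consumed.with(Serdes.String(), Serdes.String()));

// The event stream to enrich; it must be keyed the same way as the lookup topic.
KStream<String, String> events =
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()));

events.join(lookup, (event, lookupValue) -> event + " | " + lookupValue)
      .to("events-enriched", Produced.with(Serdes.String(), Serdes.String()));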

Resources