Kafka Streams RocksDB large state - events

Is it okay to hold large state in RocksDB when using Kafka Streams? We are planning to use RocksDB as an event store to hold billions of events for an unbounded amount of time.

Yes, you can store a lot of state there, but there are some considerations:
The entire state will also be replicated to the changelog topics, which means your brokers will need to have enough disk space for it. Note that this will NOT be mitigated by KIP-405 (Tiered Storage), as tiered storage does not apply to compacted topics.
As #OneCricketeer mentioned, rebuilding the state can take a long time after a crash. However, you can mitigate this in several ways:
Use a persistent store and restart the application on a node with access to the same disk (a StatefulSet plus a PersistentVolume in Kubernetes works).
Under exactly-once semantics, the state will still be rebuilt from scratch after an unclean shutdown until KIP-844 is implemented. Once that change is merged, only a small amount of data will have to be replayed.
Have standby replicas. They enable failover as soon as the consumer session timeout expires after a Kafka Streams instance crashes (see the configuration sketch below).
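As a rough sketch, the mitigations above map to standard Kafka Streams configuration properties; the application id, broker address, and state path here are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class LargeStateConfigSketch {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-store-app");   // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder brokers
        // Keep RocksDB state on a disk that survives restarts (e.g. a PersistentVolume).
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");
        // One hot standby per task so another instance can take over without a full restore.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
        // Exactly-once semantics; note the KIP-844 caveat above for unclean shutdowns.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        return props;
    }
}
```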

The main limitation would be disk space, so sure, it can be done, but if the app crashes for any reason, you might be waiting for a while for the app to rebuild its state.

Related

What can I do when Redis is down?

I have many Spring Boot services that depend on Redis to generate continuous IDs such as 1, 2, 3...
What can I do when Redis is down?
Extra:
a single Redis instance, not a master-replica setup
Does Redis persistence keep data from being lost?
You can configure Redis to persist data on disk, i.e. in the AOF and RDB formats. However, since the persistence is asynchronous (with AOF, you can sync your writes on every operation, but that way you'll have performance problems), you still might lose data.
In your case, it seems that you use the INCR command to generate IDs. If Redis goes down without dumping all its data, you'll get duplicate IDs when Redis restarts (see the sketch below).
This problem cannot be solved even with a master-replica setup, since the synchronization between master and replica is also asynchronous.
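To make the failure mode concrete, here is a minimal sketch using the Jedis client; the address and key name are made up for illustration:

```java
import redis.clients.jedis.Jedis;

public class IdGeneratorSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) { // assumed Redis address
            // INCR is atomic, so concurrent services get distinct ids...
            long id = jedis.incr("service:next-id");       // hypothetical counter key
            System.out.println("generated id: " + id);
            // ...but if Redis crashes before this counter reaches the AOF/RDB file,
            // it restarts from the last dump and INCR hands out already-used ids.
        }
    }
}
```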

How to share a cache in a Flink Kinesis stream

I've been using Flink and Kinesis Analytics recently.
I have a stream of data, and I also need a cache to be shared with the stream.
To share the cache data with the Kinesis stream, it's connected to a broadcast stream.
The cache source extends SourceFunction and implements ProcessingTimeCallback. It gets the data from DynamoDB every 300 seconds and broadcasts it to the next stream using a KeyedBroadcastProcessFunction.
But after adding the broadcast stream (in the previous version I didn't have a cache and was using a KeyedProcessFunction for the Kinesis stream), when I execute it in Kinesis Analytics, it keeps restarting about every 1000 seconds without any exception!
I have no configuration with this value, and the scenario works fine in between restarts!
Could anybody help me figure out what the issue could be?
My first thought is to wonder if this might be related to checkpointing. Do you have access to the server logs? Flink's logging should make it somewhat clear what's causing the restart.
The reason why I suspect checkpointing is that it occurs at predictable times (and with a long timeout), and using broadcast state can put a lot of pressure on checkpointing. Each parallel instance will checkpoint a full copy of the broadcast state.
Broadcast state has to be kept on-heap, so another possibility is that you are running out of memory.
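For reference, broadcast state in Flink is declared via a MapStateDescriptor, and every parallel subtask keeps and checkpoints its own full copy of it; the types, names, and function body below are illustrative, not taken from the question:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class CacheBroadcastSketch {

    // Descriptor for the cache entries loaded from DynamoDB (illustrative types).
    static final MapStateDescriptor<String, String> CACHE_DESCRIPTOR =
            new MapStateDescriptor<>("cache-state", Types.STRING, Types.STRING);

    public static DataStream<String> wireUp(DataStream<String> mainStream,
                                            DataStream<String> cacheSource) {
        // Every parallel subtask of the keyed stream receives, keeps on-heap, and
        // checkpoints its own full copy of this broadcast state.
        BroadcastStream<String> cacheStream = cacheSource.broadcast(CACHE_DESCRIPTOR);
        return mainStream
                .keyBy(value -> value)
                .connect(cacheStream)
                .process(new KeyedBroadcastProcessFunction<String, String, String, String>() {
                    @Override
                    public void processElement(String value, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        String cached = ctx.getBroadcastState(CACHE_DESCRIPTOR).get(value);
                        out.collect(cached != null ? cached : value);
                    }

                    @Override
                    public void processBroadcastElement(String value, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // Illustrative only: store each broadcast record under itself.
                        ctx.getBroadcastState(CACHE_DESCRIPTOR).put(value, value);
                    }
                });
    }
}
```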

Is the Kafka state store RocksDB fault tolerant?

Is the Kafka state store RocksDB fault tolerant? How can the data held by a store that is no longer functioning be restored from the changelog?
The restoration of all built-in storage engines in the Kafka Streams API is fully automated.
Further details are described at http://docs.confluent.io/current/streams/developer-guide.html#fault-tolerant-state-stores, some of which I quote here:
In order to make state stores fault-tolerant (e.g., to recover from machine crashes) as well as to allow for state store migration without data loss (e.g., to migrate a stateful stream task from one machine to another when elastically adding or removing capacity from your application), a state store can be continuously backed up to a Kafka topic behind the scenes. We sometimes refer to this topic as the state store’s associated changelog topic or simply its changelog. In the case of a machine failure, for example, the state store and thus the application’s state can be fully restored from its changelog. You can enable or disable this backup feature for a state store, and thus its fault tolerance.
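As an illustration of the enable/disable knob mentioned in the quote, a store's changelog can be configured through the Materialized API; the store name and topic config below are examples, not prescribed values:

```java
import java.util.Map;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.KGroupedStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class ChangelogSketch {
    public static KTable<String, Long> countWithChangelog(KGroupedStream<String, String> grouped) {
        return grouped.count(
                Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store")
                        // Changelog backup is on by default; withLoggingEnabled passes extra
                        // configs to the changelog topic, withLoggingDisabled turns it off
                        // (trading fault tolerance for less broker load).
                        .withLoggingEnabled(Map.of("min.insync.replicas", "2")));
    }
}
```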

What is the use of Kafka in a Big Data cluster?

I have recently deployed a Big Data cluster, in which I've used Apache Kafka and ZooKeeper. But I still don't understand their usage in the cluster. When are both required, and for what purpose?
I am simplifying the concepts here. You can find a detailed explanation in this article.
Kafka is a fast, scalable, partitioned and replicated commit-log service that is distributed by design. It has a unique design.
A stream of Messages of a particular type is defined as a Topic.
A Producer can be anyone who can publish messages to a Topic.
The published messages are then stored at a set of servers called Brokers or Kafka Cluster.
A Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers, as sketched below.
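A minimal producer sketch with the standard Java client; the broker address and the topic name "events" are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the "events" topic; the brokers persist it in the log.
            producer.send(new ProducerRecord<>("events", "user-42", "signed-up"));
        }
    }
}
```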
ZooKeeper is a distributed, hierarchical file system that facilitates loose coupling between clients.
ZooKeeper achieves high availability by running multiple ZooKeeper servers, called an ensemble.
ZooKeeper is used for managing and coordinating Kafka brokers.
Each Kafka broker coordinates with the other Kafka brokers using ZooKeeper.
Producers and consumers are notified by the ZooKeeper service about the presence of a new broker in the Kafka system or the failure of a broker.
Kafka is a distributed messaging system optimised for high throughput. It has a persistent queue, with messages being appended to files in on-disk structures, and it performs consistently even with very modest hardware. In short, you will use Kafka to load data into your big data cluster, and you will be able to do this at high speed even on modest hardware because of Kafka's distributed nature.
Regarding ZooKeeper, it's a centralized service for configuration and naming in large distributed systems. It is robust, since the persisted data is distributed between multiple nodes; a client connects to any of them and migrates if one node fails, as long as a strict majority of nodes are working. So in short, ZooKeeper makes sure your big data cluster remains online even if some of its nodes are offline.
In regards to Kafka, I would add a couple of things.
Kafka describes itself as a log, not a queue. A log is an append-only, totally-ordered sequence of records ordered by time.
In a strict data-structures sense, a queue is a FIFO collection that is designed to hold data, but once an item is taken out of the queue there's no way to get it back. Jaco does describe Kafka as a persistent queue, but using the different terms (queue vs. log) can help in understanding.
Kafka's log is saved to disk instead of being kept in memory. The designers of Kafka chose this because (1) they wanted to avoid a lot of the JVM overhead you get when storing things in data structures, and (2) they wanted messages to persist even if the Java process dies for some reason.
Kafka is designed for multiple consumers (a Kafka term) to read from the same logs. Each consumer tracks its own offset in the log: Consumer A might be at offset 2, Consumer B at offset 8, and so on. Tracking consumers by offset eliminates a lot of complexity on Kafka's side, as the sketch below illustrates.
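A minimal consumer sketch to illustrate per-group offset tracking; the broker address, group id, and topic are placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "consumer-a");              // each group tracks its own offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));        // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // The offset is this group's position in the log; reading does not
                // remove the record, unlike taking an item out of a classic queue.
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```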
Reading that first link will explain a lot of the differences between Kafka and other messaging services.

Data replication in microservices: restoring a database backup

I am currently working with a legacy system that consists of several services which (among others) communicate through some kind of Enterprise Service Bus (ESB) to synchronize data.
I would like to gradually move this system in the direction of a microservices architecture. I am planning to reduce the dependency on the ESB and use more of a message broker like RabbitMQ or Kafka. Due to some resource and existing-technology limitations, I don't think I will be able to completely avoid data replication between services, even though I should be able to clearly define a single service as the data owner.
What I am wondering now, how can I safely do a database backup restore for a single service when necessary? Doing so will cause the service to be out of sync with other services that hold the replicated data. Any experience/suggestion regarding this?
Have your primary database publish events every time a database mutation occurs, and let the replicated services subscribe to these events and apply the same mutation to their replicated data (a sketch follows below).
You already use a message broker, so you can leverage your existing stack for broadcasting the events. With replication done through events, a restore applied to the primary database will be propagated to all the other services.
Depending on the scale of the backup, there will be a short period where the data on the other services will be stale. This might or might not be acceptable for your use case. Think of the staleness as some sort of eventual consistency model.
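A minimal sketch of the idea, using a Kafka producer as the broker; the topic name and event shape are made up for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MutationPublisher {
    private final KafkaProducer<String, String> producer;

    public MutationPublisher(Properties props) {
        this.producer = new KafkaProducer<>(props);
    }

    // Called by the owning service after each committed database mutation.
    // During a backup restore, replaying the restored rows through this same
    // path brings the subscribers back in sync (eventually).
    public void publishMutation(String entityId, String mutationJson) {
        producer.send(new ProducerRecord<>("customer-mutations", entityId, mutationJson)); // placeholder topic
    }
}
```

Keying the events by entity id preserves per-entity ordering on the subscriber side (Kafka orders records within a partition), which matters when mutations for the same row are replayed during a restore.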
