How to share a cache with a Flink Kinesis stream

I've been using Flink and Kinesis Data Analytics recently.
I have a stream of data, and I also need a cache to be shared with the stream.
To share the cache data with the Kinesis stream, it's connected to a broadcast stream.
The cache source extends SourceFunction and implements ProcessingTimeCallback. It fetches the data from DynamoDB every 300 seconds and broadcasts it to the next stream through a KeyedBroadcastProcessFunction.
But after adding the broadcast stream (in the previous version I didn't have a cache and was using a KeyedProcessFunction for the Kinesis stream), when I execute it in Kinesis Analytics, it keeps restarting about every 1000 seconds without any exception!
I have no configuration with this value, and everything works fine between the restarts.
Could anybody help me figure out what the issue could be?
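Roughly, the wiring looks like this (a sketch only; CacheSource, CacheEntry, Event, and EnrichFunction are placeholder names for my own classes):

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

MapStateDescriptor<String, CacheEntry> cacheDescriptor =
    new MapStateDescriptor<>("cache", Types.STRING, Types.POJO(CacheEntry.class));

// CacheSource extends SourceFunction and implements ProcessingTimeCallback;
// it polls DynamoDB every 300 seconds and emits the refreshed cache entries.
BroadcastStream<CacheEntry> cacheStream =
    env.addSource(new CacheSource()).broadcast(cacheDescriptor);

env.addSource(kinesisConsumer)      // the existing Kinesis source
   .keyBy(Event::getKey)
   .connect(cacheStream)
   .process(new EnrichFunction());  // a KeyedBroadcastProcessFunction
```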

My first thought is to wonder if this might be related to checkpointing. Do you have access to the server logs? Flink's logging should make it somewhat clear what's causing the restart.
The reason I suspect checkpointing is that it occurs at predictable times (and with a long timeout), and using broadcast state can put a lot of pressure on checkpointing: each parallel instance will checkpoint a full copy of the broadcast state.
Broadcast state has to be kept on-heap, so another possibility is that you are running out of memory.
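If it is checkpointing, the interval and timeout settings are worth comparing against the ~1000-second restart period. A sketch of the relevant knobs (the values here are illustrative, and Kinesis Data Analytics manages some of these settings itself, so check what the service actually applies):

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Illustrative values only.
env.enableCheckpointing(60_000);                         // checkpoint every 60 s
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setCheckpointTimeout(600_000);          // fail a checkpoint after 10 min
checkpointConfig.setMinPauseBetweenCheckpoints(30_000);  // breathing room between checkpoints
```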

Related

Kafka Streams RocksDB large state

Is it okay to hold large state in RocksDB when using Kafka Streams? We are planning to use RocksDB as an event store to hold billions of events indefinitely.
Yes, you can store a lot of state there, but there are some considerations:
The entire state will also be replicated on the changelog topics, which means your broker will need to have enough disk space for it. Note that this will NOT be mitigated by KIP-405 (Tiered Storage), as tiered storage does not apply to compacted topics.
As @OneCricketeer mentioned, rebuilding the state can take a long time if there's a crash. However, you can mitigate it in multiple ways (see the configuration sketch below):
Use a persistent store and restart the application on a node with access to the same disk (a StatefulSet plus a PersistentVolume in K8s works). Note that with exactly-once semantics, until KIP-844 is implemented the state will still be rebuilt from scratch after an unclean shutdown; once that PR is merged, only a small amount of data will have to be replayed.
Have standby replicas. They enable failover as soon as the consumer session timeout expires after a Kafka Streams instance crashes.
The main limitation would be disk space, so sure, it can be done, but if the app crashes for any reason, you might be waiting a while for the app to rebuild its state.
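Both mitigations are plain Streams configuration. A minimal sketch (the application id, bootstrap server, and state directory are placeholders for your own values):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-store-app");   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");    // placeholder

// Point the RocksDB files at a persistent volume so a restarted
// instance can reuse them instead of replaying the changelog.
props.put(StreamsConfig.STATE_DIR_CONFIG, "/persistent-volume/kafka-streams");

// Keep one hot standby copy of each store for faster failover.
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
```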

NiFi from Hadoop to Kafka with exactly-once guarantee

Is it possible for NiFi to read from HDFS (or Hive) and publish data rows to Kafka with an exactly-once delivery guarantee?
Publishing to Kafka from NiFi is an at-least-once guarantee, because a failure could occur after Kafka has already received the message but before NiFi receives the response; this could be due to a network issue, or NiFi crashing and restarting at that exact moment.
In any of those cases, the flow file would be put back in the original queue before the PublishKafka processor (i.e., the session was never committed), and so it would be tried again.
Due to the threading model, where different threads may execute the processor, it can't be guaranteed that the same thread that originally did the publishing will be the one that does the retry, and so NiFi can't make use of the "idempotent producer" concept.
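For contrast, this is roughly what the "idempotent producer" concept looks like with the plain Kafka client, where the same producer instance performs the retries (broker address and topic are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer");

// The broker deduplicates internal retries from this producer session;
// NiFi cannot rely on this because a retried flow file may be published
// by a different thread (and a different producer instance).
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("rows", "key", "value"));     // placeholder topic
}
```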

What happens when the output of Auditbeat is down

I am using the following pipeline to forward data
Auditbeat ---> logstash ---> ES
Suppose the Logstash machine goes down; I want to know how Auditbeat handles the situation.
I would like to know the specifics:
Is there a retry mechanism?
How long will it retry?
What happens to the audit logs? Will they be lost?
The reason I ask question 3 is that we enabled Auditbeat by disabling the auditd service (which was generating the audit logs under /var/log/audit/audit.log). So if Logstash goes down, there is no data forwarding happening, and hence there is a chance of data loss. Please clarify.
If Auditbeat stores the data while Logstash is down, where does it do so? And what memory (disk space) is allocated to this buffering?
Thanks in advance
Auditbeat has an internal queue which stores the events before sending them to the configured output; by default this is a memory queue that will store up to 4096 events.
If the queue is full, no more events will be stored until the output comes back and starts receiving data from Auditbeat again, so there is a risk of data loss here.
You can change the number of events that the memory queue stores; see the sketch below.
There is also the option to use a file queue, which will save the events to disk before sending them to the configured output, but this feature is still in beta.
You can read about the internal queue in the documentation.
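For example, the queue settings in auditbeat.yml look roughly like this (the values are illustrative, and the exact options depend on your Beats version, so verify them against the internal-queue documentation):

```yaml
# auditbeat.yml -- illustrative values only
queue.mem:
  events: 4096          # default; raise it to survive longer output outages

# The beta file spool queue buffers events on disk instead of in memory:
#queue.spool:
#  file:
#    path: "${path.data}/spool.dat"
#    size: 512MiB
```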

Is the Kafka state store RocksDB fault tolerant?

Is the Kafka state store RocksDB fault tolerant? If a store stops functioning, how can its data be restored from the changelog?
The restoration of all built-in storage engines in the Kafka Streams API is fully automated.
Further details are described at http://docs.confluent.io/current/streams/developer-guide.html#fault-tolerant-state-stores, some of which I quote here:
In order to make state stores fault-tolerant (e.g., to recover from machine crashes) as well as to allow for state store migration without data loss (e.g., to migrate a stateful stream task from one machine to another when elastically adding or removing capacity from your application), a state store can be continuously backed up to a Kafka topic behind the scenes. We sometimes refer to this topic as the state store’s associated changelog topic or simply its changelog. In the case of a machine failure, for example, the state store and thus the application’s state can be fully restored from its changelog. You can enable or disable this backup feature for a state store, and thus its fault tolerance.
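As the quote notes, the backup feature can be enabled or disabled per store. A minimal sketch of where that toggle lives in the Streams DSL (the store and topic names are placeholders):

```java
import java.util.Collections;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// "counts" and "input-topic" are placeholder names.
builder.table("input-topic",
    Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.Long())
        .withLoggingEnabled(Collections.emptyMap()));  // changelog backup on (the default)
        // .withLoggingDisabled() would turn the backup, and the fault tolerance, off
```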

Concurrent batch jobs writing logs to database

My production system has nearly 170 Ctrl-M jobs (essentially cron jobs) running every day. These jobs are woven together (by creating dependencies) to perform ETL operations. For example, a Ctrl-M job (Ctrl-M is a scheduler, like cron) almost always starts with a shell script, which then executes a bunch of Python scripts, Hive scripts, or map-reduce jobs in a specific order.
I am trying to implement logging in each of these processes to be able to better monitor the tasks and the pipelines as a whole. The logs would be used to build a monitoring dashboard.
Currently I have implemented logging using a central wrapper which is called by each of the processes to log information. This wrapper in turn opens a Teradata connection EACH time and calls a Teradata stored procedure to write into a Teradata table.
This works fine for now. But in my case, multiple concurrent processes (spawning even more parallel child processes) run at the same time, and I have started experiencing dropped connections while doing some load testing. Below are the approaches I have been thinking about:
Make processes write to some kind of message queue (e.g., AWS SQS). A listener would pick data from these message queues asynchronously and then batch-write to Teradata (a sketch of this idea follows below).
Use files or some other structure to perform batch writes to the Teradata DB.
I would definitely like to hear your thoughts on these or any other better approaches. Eventually the endpoint of the logging will be shifted to Redshift, hence thinking along the lines of AWS SQS queues.
Thanks in advance.
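A minimal sketch of the batch-writer idea, using an in-process queue to keep it self-contained; with SQS, the drain loop would poll for messages instead. The JDBC URL and table name are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LogBatchWriter implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Producers call this instead of opening their own DB connection.
    public void log(String message) {
        queue.offer(message);
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection("jdbc:teradata://host/db"); // placeholder
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO job_log (message) VALUES (?)")) {                        // placeholder
            while (true) {
                batch.clear();
                batch.add(queue.take());      // block until at least one event arrives
                queue.drainTo(batch, 499);    // then grab up to 500 in total
                for (String msg : batch) {
                    ps.setString(1, msg);
                    ps.addBatch();
                }
                ps.executeBatch();            // one round trip per batch
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```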
I think Kinesis Firehose is the perfect solution for this. Setting up a Firehose delivery stream is incredibly quick and easy to configure, it is very inexpensive, and it will stream your data to the S3 bucket of your choice and optionally stream your logs directly to Redshift.
If Redshift is your end goal (or even just S3), Kinesis Firehose couldn't make it easier.
https://aws.amazon.com/kinesis/firehose/
Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture and automatically load streaming data into Amazon S3 and Amazon Redshift, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security. You can easily create a Firehose delivery stream from the AWS Management Console, configure it with a few clicks, and start sending data to the stream from hundreds of thousands of data sources to be loaded continuously to AWS – all in just a few minutes.
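Sending a log line from your wrapper then reduces to a single SDK call. A sketch with the AWS SDK for Java ("job-logs" is a placeholder delivery stream name; the S3/Redshift destination is configured on the stream itself):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;

AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

String logLine = "job=load_orders status=OK\n";  // placeholder payload
firehose.putRecord(new PutRecordRequest()
    .withDeliveryStreamName("job-logs")          // placeholder stream name
    .withRecord(new Record()
        .withData(ByteBuffer.wrap(logLine.getBytes(StandardCharsets.UTF_8)))));
```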
