Is it possible to configure Kafka Streams to use RocksDB-Cloud rather than the default RocksDB database as the storage engine? If so, is there any configuration recipe? I would like to persist data in S3 buckets instead of the local filesystem.
I am streaming real-time data with Kinesis Data Streams into Kinesis Data Analytics, and I want to send the transformed data to DynamoDB. Currently this can be done by sending the transformed data into another stream, which triggers Lambdas that write into DynamoDB.
But I was wondering if there is a way to call Lambda directly from Kinesis Data Analytics?
Officially, Apache Flink does not provide any sink connector for DynamoDB.
You either have to build your own sink connector or use a non-official one. You can give the following one a try:
Flink connector Dynamodb
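If you end up building your own sink, a minimal sketch of a custom Flink RichSinkFunction writing to DynamoDB with the AWS SDK (v1) could look like the code below. The record type (a simple map of attribute names to values), the table name, and the omission of batching, retries, and error handling are all simplifying assumptions, not a definitive implementation.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;

// Hypothetical sink: writes each record (a map of attribute name -> value) as one
// DynamoDB item. Batching, retries, and error handling are omitted for brevity.
public class DynamoDbSink extends RichSinkFunction<Map<String, String>> {

    private final String tableName;            // e.g. "my-table" (placeholder)
    private transient AmazonDynamoDB client;

    public DynamoDbSink(String tableName) {
        this.tableName = tableName;
    }

    @Override
    public void open(Configuration parameters) {
        // One client per parallel sink instance; credentials and region come from the default provider chain.
        client = AmazonDynamoDBClientBuilder.defaultClient();
    }

    @Override
    public void invoke(Map<String, String> record, Context context) {
        Map<String, AttributeValue> item = new HashMap<>();
        for (Map.Entry<String, String> e : record.entrySet()) {
            item.put(e.getKey(), new AttributeValue(e.getValue()));
        }
        client.putItem(tableName, item);
    }

    @Override
    public void close() {
        if (client != null) {
            client.shutdown();
        }
    }
}

You would then attach it to your transformed stream with something like transformedStream.addSink(new DynamoDbSink("my-table"));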
I am pulling a data stream from RabbitMQ using Apache Flink 1.10.0, and I am currently using the default in-memory checkpoint configuration. To make the job recover when a task manager restarts, I need to store the state and checkpoints in a filesystem. All the demos say I should use "hdfs://namenode:4000/....", but I have no HDFS cluster; my Apache Flink is running in a Kubernetes cluster. How can I store my checkpoints in a filesystem?
I read the Apache Flink docs, and they say it supports:
A persistent (or durable) data source that can replay records for a certain amount of time. Examples for such sources are persistent messages queues (e.g., Apache Kafka, RabbitMQ, Amazon Kinesis, Google PubSub) or file systems (e.g., HDFS, S3, GFS, NFS, Ceph, …).
A persistent storage for state, typically a distributed filesystem (e.g., HDFS, S3, GFS, NFS, Ceph, …)
How do I configure Flink to use NFS to store checkpoints and state? I searched the internet and found nothing about this setup.
To use NFS for checkpointing with Flink you should specify a checkpoint directory using a file: URI that is accessible from every node in the cluster (the job manager and all task managers need to have access using the same URI).
So, for example, you might mount your NFS volume at /data/flink/checkpoints on each machine, and then specify
state.checkpoints.dir: file:///data/flink/checkpoints
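If you would rather configure this per job instead of in flink-conf.yaml, a rough programmatic equivalent (assuming the same NFS mount path as above, and Flink 1.10) is:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NfsCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        // Write checkpoints to the NFS mount that is visible on every node.
        env.setStateBackend(new FsStateBackend("file:///data/flink/checkpoints"));

        // ... build and execute your job here ...
    }
}

Either way, the path must resolve to the same NFS mount on the job manager and on every task manager.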
My application is configured to read a topic from a configured Kafka cluster and then write the transformed result to Hadoop HDFS. In order to do so, it needs to be launched on a YARN cluster node.
To do that, we'd like to use Spring Data Flow. But since this application doesn't need any input from another flow (it already knows where to pull its source from) and outputs nothing, how can I create a valid Data Flow stream from it?
In other words, this would be a stream composed of only one app, which should run indefinitely on a YARN node.
In this case you need a stream definition that connects to a named destination in Kafka and writes to HDFS.
For instance, the stream would look like this:
stream create a1 --definition ":myKafkaTopic > hdfs"
You can read the Spring Cloud Data Flow documentation on named destinations for more info on this.
We have a project requirement to test the data at the Kafka layer. JSON files are moved into a Hadoop area, and Kafka reads the live data from Hadoop (raw JSON files). Now I have to test whether the data sent from the other system and read by Kafka is the same.
Can I validate the data at Kafka? Does Kafka store the messages internally on HDFS? If yes, is it stored in a file structure similar to what Hive uses internally, i.e., a single folder for a single table?
Kafka stores data in local files (i.e., on the local file system of each running broker). For those files, Kafka uses its own storage format that is based on a partitioned append-only log abstraction.
The local storage directory can be configured via the parameter log.dir. This is configured individually for each broker, i.e., each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered storage, which will allow brokers to not only use local disks, but also to offload "cold data" into a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed is a Kafka internal implementation detail, so you should not rely on it. To get the current state of your cluster, you can request metadata about topics, partitions, etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for a code example). Also keep in mind that partitions are replicated, and if you write, you always need to write to the partition leader (if you create a KafkaProducer, it will automatically find the leader for each partition you write to).
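The wiki page linked above uses the old SimpleConsumer API; with current Kafka clients, a rough sketch of requesting topic and partition metadata (including the partition leaders) via the AdminClient looks like this (the bootstrap address and topic name are placeholders):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class TopicMetadataExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Fetch the current metadata for the topic from the cluster.
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singletonList("my-topic")).all().get();

            for (TopicPartitionInfo p : topics.get("my-topic").partitions()) {
                System.out.printf("partition %d, leader %s, replicas %s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}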
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
I think you can, but you have to do it manually. You can use Kafka Connect to sink the output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly you can do the following:
Assuming all your servers are running (check the Confluent website)
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: This approach assumes that you are using the Confluent Platform, which I haven't used.
Start the Kafka-HDFS sink connector.
Also you might find more useful details in this Stack Overflow discussion.
This happens to most beginners. Let's first understand that a component you see in Big Data processing may not be related to Hadoop at all.
YARN, MapReduce, and HDFS are the 3 main core components of Hadoop. Hive, Pig, Oozie, Sqoop, HBase, etc. work on top of Hadoop.
Frameworks like Kafka or Spark do not depend on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used for Spark's cluster mode, and HDFS for storage.
In the same way, Kafka, as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.
Is it possible to get data from Oracle using Flume and store it in local Linux folders, not in HDFS?
Using "File Roll Sink" you can store streaming data to local system. But Flume can't use to ingest data from any RDMS tool.
Not sure about Oracle, but writing to local filesystem is implemented by File Roll Sink.