How to store Apache Flink checkpoints in an NFS filesystem

I am pulling a data stream from RabbitMQ using Apache Flink 1.10.0, and I am currently using the default in-memory checkpoint configuration. To make the job recoverable when a task manager restarts, I need to store the state and checkpoints in a filesystem. All the demos say to use "hdfs://namenode:4000/....", but I have no HDFS cluster; my Apache Flink is running in a Kubernetes cluster. How can I store my checkpoints in a filesystem?
I read the Apache Flink docs, which say it supports:
A persistent (or durable) data source that can replay records for a certain amount of time. Examples for such sources are persistent messages queues (e.g., Apache Kafka, RabbitMQ, Amazon Kinesis, Google PubSub) or file systems (e.g., HDFS, S3, GFS, NFS, Ceph, …).
A persistent storage for state, typically a distributed filesystem (e.g., HDFS, S3, GFS, NFS, Ceph, …)
How do I configure Flink to use NFS to store checkpoints and state? I searched the internet and found nothing describing this setup.

To use NFS for checkpointing with Flink you should specify a checkpoint directory using a file: URI that is accessible from every node in the cluster (the job manager and all task managers need to have access using the same URI).
So, for example, you might mount your NFS volume at /data/flink/checkpoints on each machine, and then specify
state.checkpoints.dir: file:///data/flink/checkpoints
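If you would rather set this programmatically than in flink-conf.yaml, the same file: URI can be passed to a filesystem state backend in the job code. A minimal sketch for Flink 1.10, assuming the NFS volume is mounted at /data/flink/checkpoints inside every job manager and task manager pod (the checkpoint interval below is just illustrative):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NfsCheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Take a checkpoint every 10 seconds.
        env.enableCheckpointing(10_000);
        // Point the state backend at the shared NFS mount via a file: URI.
        env.setStateBackend(new FsStateBackend("file:///data/flink/checkpoints"));
        // ... add the RabbitMQ source and the rest of the pipeline here ...
        env.execute("nfs-checkpoint-job");
    }
}

Either way, the key requirement is that the same path resolves to the same NFS share on every node.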

Related

What is the recommended DefaultFS (File system) for Hadoop on ephemeral Dataproc clusters?

What is the recommended defaultFS (file system) for Hadoop on Dataproc? Are there any benchmarks or considerations available around using GCS vs. HDFS as the default file system?
I was also trying to test things out and discovered that when I set the defaultFS to a gs:// path, the Hive scratch files get created on both the HDFS and the GCS paths. Is this happening synchronously and adding to latency, or does the write to GCS happen after the fact?
Would appreciate any guidance or references around this.
Thank you
PS: These are ephemeral Dataproc clusters that are going to be using GCS for all persistent data.
HDFS is faster. There should already be public benchmarks for that, or it can simply be taken as fact, because GCS is networked storage whereas HDFS is backed by disks attached directly to the Dataproc VMs.
"Recommended" would be persistent storage, though, so GCS, but perhaps only after finalizing the data in your applications. For example, you might not want Hive scratch files in GCS since they will never be used outside the current query session, but you would want Spark checkpoints there if you are running periodic batch jobs that scale down the HDFS cluster in between executions.
I would say the default (HDFS) is the recommended option. Typically, the input and output data of Dataproc jobs are persisted outside the cluster in GCS or BigQuery, and the cluster is used for compute and intermediate data. That intermediate data is stored on local disks directly or through HDFS, which eventually also goes to local disks. After the job is done, you can safely delete the cluster and pay only for the storage of the input and output data, to save cost.
Also, HDFS usually has lower latency for intermediate data, especially for lots of small files and for metadata operations, e.g. directory renames. GCS is better at throughput for large files.
But when using HDFS, you need to provision sufficient disk space (at least 1 TB per node) and consider using local SSDs. See https://cloud.google.com/dataproc/docs/support/spark-job-tuning#optimize_disk_size for more details.
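To make the layout described above concrete, here is a small sketch of a Spark job on Dataproc that keeps persistent input and output in GCS while writing intermediate data to the cluster-local HDFS (the bucket name and paths are hypothetical, and the GCS connector is assumed to be available, as it is on Dataproc by default):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GcsHdfsLayoutJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("gcs-hdfs-layout").getOrCreate();
        // Persistent input lives in GCS and survives cluster deletion.
        Dataset<Row> input = spark.read().json("gs://my-bucket/input/");
        // Intermediate results go to the cluster-local HDFS for lower latency.
        input.write().mode("overwrite").parquet("hdfs:///tmp/intermediate/");
        Dataset<Row> intermediate = spark.read().parquet("hdfs:///tmp/intermediate/");
        // Final output is persisted back to GCS so the cluster can be deleted afterwards.
        intermediate.write().mode("overwrite").parquet("gs://my-bucket/output/");
        spark.stop();
    }
}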

NiFi PutHDFS writes to local filesystem

Challenge
I currently have two Hortonworks clusters, a NiFi cluster and an HDFS cluster, and want to write to HDFS using NiFi.
On the NiFi cluster I use a simple GetFile connected to a PutHDFS.
When pushing a file through this, the PutHDFS terminates in success. However, rather than seeing a file dropped on my HDFS (on the HDFS cluster), I just see a file being dropped onto the local filesystem where I run NiFi.
This confuses me, hence my question:
How to make sure PutHDFS writes to HDFS, rather than to the local filesystem?
Possibly relevant context:
In the PutHDFS I have linked to the hive-site and core-site of the HDFS cluster (I tried updating all server references to the HDFS namenode, but with no effect)
I don't use Kerberos on the HDFS cluster (I do use it on the NIFI cluster)
I did not see anything looking like an error in the NiFi app log (which makes sense, as it successfully writes, just in the wrong place)
Both clusters are newly generated on Amazon AWS with CloudBreak, and opening all nodes to all traffic did not help
Can you make sure that you are able to move a file from the NiFi node to Hadoop using the command below:
hadoop fs -put <local_file> /<hdfs_target_dir>
If you are able to move the file using the command above, then check the Hadoop configuration files you are passing to your PutHDFS processor; in particular, fs.defaultFS in core-site.xml must point at the HDFS cluster's namenode (hdfs://<namenode-host>:8020), not at the local filesystem (file:///), otherwise PutHDFS will write to the local filesystem.
Also, check that you don't have any other flow running, to make sure that no other flow is processing that file.

Flink Cluster Performance is much worse than standalone

I use Flink to process HDFS files or local files.
When I use the standalone setup, the server can process the data at 500k/s.
But when I use a cluster, the servers can only process the data at 100k/s.
It is so weird; I cannot figure out what is going on.
I found that when I use a cluster (2 servers), there is always one server with low read/write speeds. The Flink cluster is based on Hadoop.
Can anyone help me?

Does Apache Kafka Store the messages internally in HDFS or Some other File system

We have a project requirement of testing the data at the Kafka layer. JSON files are moved into the Hadoop area, and Kafka reads the live data from Hadoop (raw JSON files). Now I have to test whether the data sent from the other system and read by Kafka are the same.
Can I validate the data at Kafka? Does Kafka store the messages internally on HDFS? If yes, is it stored in a file structure similar to what Hive uses internally, i.e. a single folder for a single table?
Kafka stores data in local files (i.e., the local file system of each running broker). For those files, Kafka uses its own storage format, which is based on a partitioned append-only log abstraction.
The local storage directory can be configured via the parameter log.dir. This configuration happens individually for each broker, i.e., each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered storage, which will allow brokers not only to use local disks but also to offload "cold data" to a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed is a Kafka-internal implementation detail, so you should not rely on it. To get the current state of your cluster, you can request metadata about topics, partitions, etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for a code example). Also keep in mind that partitions are replicated and that, if you write, you always need to write to the partition leader (if you use a KafkaProducer, it will automatically find the leader for each partition you write to).
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
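As a concrete illustration of requesting that metadata, here is a small sketch using the Kafka AdminClient (the broker address and topic name are placeholders; the wiki page linked above shows an older, lower-level approach):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("your-topic"))
                    .values().get("your-topic").get();
            for (TopicPartitionInfo p : desc.partitions()) {
                // Print the leader and replica set for each partition of the topic.
                System.out.printf("partition %d: leader=%s, replicas=%s%n",
                        p.partition(), p.leader(), p.replicas());
            }
        }
    }
}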
I think you can, but you have to do that manually. You can let Kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly, one can do the following:
Assuming you have all servers running (check the Confluent website)
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: The approach assumes that you are using their platform (the Confluent Platform), which I haven't used.
Fire the kafka-hdfs streamer.
Also you might find more useful details in this Stack Overflow discussion.
This happens with most beginners. Let's first understand that a component you see used in big data processing may not be related to Hadoop at all.
YARN, MapReduce, and HDFS are the three main core components of Hadoop. Hive, Pig, Oozie, Sqoop, HBase, etc. work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used for Spark's cluster mode and HDFS for storage.
In the same way, Kafka, as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.
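On the original question of validating the data at the Kafka layer: one simple manual approach is to consume the topic and compare each payload against the JSON produced by the source system. A minimal sketch, with the broker address and topic name as placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ValidateTopicExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "validation-check");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("your-topic"));        // placeholder topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Compare record.value() against the JSON emitted by the source system.
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}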

Google Cloud click to deploy Hadoop

Why does the Google Cloud click-to-deploy Hadoop workflow require picking a size for the local persistent disk even if you plan to use the Hadoop connector for Cloud Storage? The default size is 500 GB. I was thinking that if it does need some disk, it should be much smaller. Is there a recommended persistent disk size when using the Cloud Storage connector with Hadoop on Google Cloud?
"Deploying Apache Hadoop on Google Cloud Platform
The Apache Hadoop framework supports distributed processing of large data sets across clusters of computers.
Hadoop will be deployed in a single cluster. The default deployment creates 1 master VM instance and 2 worker VMs, each having 4 vCPUs, 15 GB of memory, and a 500-GB disk. A temporary deployment-coordinator VM instance is created to manage cluster setup.
The Hadoop cluster uses a Cloud Storage bucket as its default file system, accessed through Google Cloud Storage Connector. Visit Cloud Storage browser to find or create a bucket that you can use in your Hadoop deployment.
Apache Hadoop on Google Compute Engine: Click to Deploy Apache Hadoop
Form fields: ZONE (us-central1-a), WORKER NODE COUNT, CLOUD STORAGE BUCKET (select a bucket), HADOOP VERSION (1.2.1), MASTER NODE DISK TYPE (Standard Persistent Disk), MASTER NODE DISK SIZE (GB), WORKER NODE DISK TYPE (Standard Persistent Disk), WORKER NODE DISK SIZE (GB)"
The three big uses of persistent disks (PDs) are:
- Logs, both daemon and job (or container in YARN): these can get quite large with debug logging turned on and can result in many writes per second.
- MapReduce shuffle: these can be large, but benefit more from higher IOPS and throughput.
- HDFS (image and data).
Due to the layout of directories, persistent disks will also be used for other items like job data (JARs, auxiliary data distributed with the application, etc.), but those could just as easily use the boot PD.
Bigger persistent disks are almost always better, due to the way GCE scales IOPS and throughput with disk size [1]. 500 GB is probably a good starting point for profiling your applications and usage. If you don't use HDFS, find that your applications don't log much, and don't spill to disk when shuffling, then a smaller disk can probably work well.
If you find that you actually don't want or need any persistent disk, then bdutil [2] also exists as a command line script that can create clusters with more configurability and customizability.
[1] https://cloud.google.com/developers/articles/compute-engine-disks-price-performance-and-persistence/
[2] https://cloud.google.com/hadoop/
