Does Apache Kafka Store the messages internally in HDFS or Some other File system - hadoop

We have a project requirement of testing the data at Kafka Layer. So JSON files are moving into hadoop area and kafka is reading the live data in hadoop(Raw Json File). Now I have to test whether the data sent from the other system and read by kafka should be same.
Can i validate the data at kafka?. Does kafka store the messages internally on HDFS?. If yes then is it stored in a file structure similar to what hive saves internally just like a single folder for single table.

Kafka stores data in local files (ie, local file system for each running broker). For those files, Kafka uses its own storage format that is based on a partitioned append-only log abstraction.
The local storage directory, can be configured via parameter log.dir. This configuration happens individually for each broker, ie, each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered-storage, that will allow brokers to no only use local disks, but to offload "cold data" into a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed, is a Kafka internal implementation detail. Thus you should now rely on it. To get the current state of your cluster, you can request meta data about topics and partitions etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for an code example). Also keep in mind, that partitions are replicated and if you write, you always need to write to the partition leader (if you create a KafkaProducer is will automatically find the leader for each partition you write to).
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index

I think you can, but you have to do that manually. You can let kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly one can do the followings:
Assuming you have all servers are running (check the confluent
website)
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: The approach assumes that you are using their platform
(confluent platform) which I haven't use.
Fire the kafka-hdfs streamer.
Also you might find more useful details in this Stack Overflow discussion.

This happens with most of the beginner. Let's first understand that component you see in Big Data processing may not be at all related to Hadoop.
Yarn, MapReduce, HDFS are 3 main core component of Hadoop. Hive, Pig, OOOZIE, SQOOP, HBase etc work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop, they are independent entities. Spark supports Hadoop, like Yarn, can be used for Spark's Cluster mode, HDFS for storage.
Same way Kafka as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.

Related

How to set HDFS as statebackend for flink

I want to store flink store in HDFS so that after crash I can recover the flink state from HDFS. I am planning to write state to HDFS every 60 second. How Can I achieve this ?
Is this the config I need to follow ?
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/state_backends.html#setting-default-state-backend
And where do I specify the check point interval ? Any link or sample code would be helpful
Choosing where checkpoints are stored (e.g., HDFS) is separate from deciding which state backend to use for managing your working state (which can be on-heap, or in local files managed by the RocksDB library).
These two concepts were cleanly separated in Flink 1.12. In early versions of Flink, the two appeared to be more strongly related than they actually are because the filesystem and rocksdb state backend constructors took a file URI as a parameter, specifying where the checkpoints should be stored.
The best way to manage all of this is to leave this out of your code, and to specify the configuration you want in flink-conf.yaml, e.g.,
state.backend: filesystem
state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
execution.checkpointing.interval: 10s
Information about checkpointing and savepointing can be found at https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
On how to configure HDFS as a filesystem, you should check https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/overview/

Newbie: Hadoop IIS Logs - Reasonable approach?

I am a totaly beginner at the topic hadoop - so sorry if this is a stupid question.
My fictional scenario is, that I have several webserver (IIS) with several log locations. I want to centralize this log files and based on the data I want to analyze the health of the applications and the webservers.
Since the eco system of hadoop overs a variety of tools I am not sure if my solution is a valid one.
So I thought that I move the log files to hdfs, create an external table on the directory and an internal table and copy the data via hive (insert into ...select from) from the external table to internal table (with some filtering because of the comment lines beginning with #)
When the data is stored within the internal table I delete the previous moved files from hdfs.
Technical it works, I tried it already - but is this is reasonable aproach?
And if yes - how would I automatize this steps since now I did all the stuff manually via Ambari.
THanks for your input
BW
Yes, this is perfectly fine approach.
Outside of setting up the Hive table ahead of time, what's the left to automate?
You want to run things on a schedule? Use Oozie, Luigi, Airflow, or Azkaban.
Ingesting logs from other Windows servers because you have a highly available web service? Use Puppet, for example, to configure your log collections agents (not Hadoop related)
Note, if it's only log file collection that you care about, I would probably have used Elasticsearch instead of Hadoop to store data, Filebeat to continuously watch log files, Logstash to apply per-message level filtering, and Kibana to do visualizations. If combining Elasticsearch for fast indexing/searching and Hadoop for archival, you can insert Kafka between the log message ingestion and message writers/consumers

Change single to cluster hadoop installation keeping persisted data

I'm going to do a Hadoop POC in a production environment. The POC consists of:
1. Receive lots of (real life) events
2. Accumulate them to have a set of events with enough size
3. Persist the set of events in a single file HDFS
In case the POC is successful, I want to install a cluster environment but I need to keep the data persisted in the single cluster installation (POC).
Then, the question: How difficult is to migrate the data already persisted in HDFS single cluster to a real cluster HDFS environment?
Thanks in advance (and sorry for my bad english)
Regards
You don't need to migrate anything.
If you're running Hadoop in Pseudo distributed mode, all you need to do is add datanodes that are pointing at your existing namenode and that's it!
I would like to point out
Persist the set of events in a single file HDFS
I'm not sure about making "a single file", but I suggest you do periodic checkpointing. What if the stream fails? How do you catch dropped events? Spark, Flume, Kafka Connect, NiFi, etc can allow you to do this.
And if all you're doing is streaming events, and want to store them for a variable time period, then Kafka is more built for that use case. You don't necessarily need Hadoop. Push events to Kafka, consume them where it makes sense, for example, a search engine or a database (Hadoop is not a database)

Spark Architecture for processing small binary files saved in HDFS

I don't know how to build architecture for following use case:
I have an Web application where users can upload files(pdf&pptx) and directories to be processed. After upload is complete web application put this files and directories in HDFS, then send a messages on kafka with path to this files.
Spark Application read messages from kafka streaming, collect them on master(driver), and after that process them. I collect messages first because i need to move the code to data, and not move data where the message is received. I understood that spark assign job to executor which already have file locally.
I have issues with kafka because i was forced to collect them first for the above reason, and when want to create checkpoint app crash "because you are attempting to reference SparkContext from a broadcast variable" even if the code run before adding checkpointing( I use sparkContext there because i need to save data to ElasticSearch and PostgreSQL. I don't know how exactly i can do code upgrading in this conditions.
I read about hadoop small files problems, and I understand what problems are in this case. I read that HBase is a better solution to save small files than just save in hdfs. Other problem in hadoop small files problems is big number of mappers and reducers created for computation, but i don't understand if this problem there in spark.
What is the best architecture for this use case?
How to do Job Scheduling? It's kafka good for that? or I need to use other service like rabbitMQ or something else?
Exist some method to add jobs to an running Spark application through some REST API?
How is the best way to save files? Is better to use Hbase because i have small files(<100MB)? Or I need to use SequenceFile? I think SequenceFile isn't for my use case because i need to reprocess some files randomly.
What is the best architecture do you think for this use case?
Thanks!
There is no one single "the best" way to build architecture. You need to make decisions and stick to them. Make the architecture flexible and decoupled so that you can easily replace components if needed.
Consider following stages/layers in your architecture:
Retrieval/Acquisition/Transport of source data (files)
Data processing/transformation
Data archival
As a retrieval component, I would use Flume. It is flexible, supports a lot of sources, channels (including Kafka) and sinks. In your case you can configure source that monitors the directory and extracts the newly received files.
For data processing/transformation - it depends what task you are solving. You probably decided on Spark Streaming. Spark streaming can be integrated with Flume sink (http://spark.apache.org/docs/latest/streaming-flume-integration.html) There are other options available, e.g. Apache Storm. Flume combines very well with Storm. Some transformations can also be applied in Flume.
For data archival - do not store/archive the files directly in Hadoop, unless they are bigger than few hundredths of megabytes. One solution would be to put them in HBase.
Make your architecture more flexible. I would place processed files in a temporary HDFS location and have some job regualarly archive them into zip, HBase, Hadoop Archive (there is such an animal) or any other solution.
Consider using Apache NiFi (aka HDF - Hortonworks Data Flow). It uses internally queues, provides a lot of processors. It can make your life easier and get the workflow developed in minutes. Give it a try. There is nice Hortonworks tutorial which , combined with HDP Sandbox running on a virtual machine/Docker, can bring you up to speed in very short time (1-2 hours?).

how to load a Kafka topic to HDFS?

I am using hortonworks sandbox.
creating topic:
./kafka-topics.sh --create --zookeeper 10.25.3.207:2181 --replication-factor 1 --partitions 1 --topic lognew
tailing the apache access log directory:
tail -f /var/log/httpd/access_log |./kafka-console-producer.sh --broker-list 10.25.3.207:6667 --topic lognew
At another terminal (of kafka bin) start consumer:
./kafka-console-consumer.sh --zookeeper 10.25.3.207:2181 --topic lognew --from-beginning
The apache access logs are sent to the kafka topic "lognew".
I need to store them to HDFS.
Any ideas or suggestions regarding how to do this.
Thanks in advance.
Deepthy
we use camus.
Camus is a simple MapReduce job developed by LinkedIn to load data
from Kafka into HDFS. It is capable of incrementally copying data from
Kafka into HDFS such that every run of the MapReduce job picks up
where the previous run left off. At LinkedIn, Camus is used to load
billions of messages per day from Kafka into HDFS.
But it looks like it's replaced with gobblin
Gobblin is a universal data ingestion framework for extracting,
transforming, and loading large volume of data from a variety of data
sources, e.g., databases, rest APIs, FTP/SFTP servers, filers, etc.,
onto Hadoop. Gobblin handles the common routine tasks required for all
data ingestion ETLs, including job/task scheduling, task partitioning,
error handling, state management, data quality checking, data
publishing, etc. Gobblin ingests data from different data sources in
the same execution framework, and manages metadata of different
sources all in one place. This, combined with other features such as
auto scalability, fault tolerance, data quality assurance,
extensibility, and the ability of handling data model evolution, makes
Gobblin an easy-to-use, self-serving, and efficient data ingestion
framework.
You have several other options as well:
Use Apache Flume to read messages from Kafka and write them to your HDFS. There are several examples of how you can set it up, but one article from Cloudera covers that topic quite well. They even named the solution Flafka ;)
Use Kafka HDFS Connector, which is quite simple to set up. However, it would require Confluent Kafka (which still is open sourced).
We tested both quite successfully.

Resources