Read from Kafka and write to hdfs in parquet - hadoop

I am new to the Big Data ecosystem and am just getting started.
I have read several articles about reading a Kafka topic using Spark Streaming, but I would like to know whether it is possible to read from Kafka using a Spark batch job instead of streaming.
If yes, could you point me to some articles or code snippets that can get me started?
The second part of my question is about writing to HDFS in Parquet format.
Once I read from Kafka, I assume I will have an RDD.
I would convert this RDD into a DataFrame and then write the DataFrame as a Parquet file.
Is this the right approach?
Any help appreciated.
Thanks

To read data from Kafka and write it to HDFS in Parquet format using a Spark batch job rather than a streaming job, you can use the Kafka source that comes with Spark Structured Streaming; it supports batch queries as well as streaming ones.
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
It comes with Kafka as a built in Source, i.e., we can poll data from Kafka. It’s compatible with Kafka broker versions 0.10.0 or higher.
For pulling the data from Kafka in batch mode, you can create a Dataset/DataFrame for a defined range of offsets.
// Needed for .as[(String, String)]; assumes a SparkSession named spark
import spark.implicits._

// Subscribe to 1 topic; defaults to the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to multiple topics, specifying explicit Kafka offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Subscribe to a pattern, at the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
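The offset JSON strings passed to startingOffsets and endingOffsets are easy to get wrong when written by hand. Here is a small sketch in plain Python (no Spark needed) that builds them from a dictionary. The names offsets_json, EARLIEST and LATEST are made up for this illustration; the only thing taken from the Kafka source is the convention that -2 means the earliest offset and -1 the latest.

```python
import json

# Special offset values understood by Spark's Kafka source
EARLIEST = -2
LATEST = -1


def offsets_json(offsets):
    """Build the JSON string for startingOffsets / endingOffsets.

    `offsets` maps topic -> {partition number -> offset}.
    JSON object keys must be strings, so partition numbers are converted.
    (Hypothetical helper, not part of Spark.)
    """
    as_strings = {
        topic: {str(partition): offset for partition, offset in parts.items()}
        for topic, parts in offsets.items()
    }
    return json.dumps(as_strings, separators=(",", ":"))


starting = offsets_json({"topic1": {0: 23, 1: EARLIEST}, "topic2": {0: EARLIEST}})
ending = offsets_json({"topic1": {0: 50, 1: LATEST}, "topic2": {0: LATEST}})
print(starting)
print(ending)
```

The resulting strings are what you would pass as the option values in the second example above.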
Each row in the source has the following schema:
| Column | Type |
|:-----------------|--------------:|
| key | binary |
| value | binary |
| topic | string |
| partition | int |
| offset | long |
| timestamp | long |
| timestampType | int |
Now, to write the data to HDFS in Parquet format, you can use the following code:
df.write.parquet("hdfs:///data.parquet")
For more information on Spark Structured Streaming + Kafka, please refer to the Kafka Integration Guide.
I hope it helps!

You already have a couple of good answers on the topic.
Just wanted to stress one thing: be careful about streaming directly into a Parquet table.
Parquet's performance shines when Parquet row groups are large enough (for simplicity, you can say the file size should be on the order of 64-256 MB, for example) to take advantage of dictionary compression, bloom filters, etc. (One Parquet file can contain multiple row groups, and normally does; a row group can't span multiple Parquet files, though.)
If you're streaming directly to a Parquet table, you will very likely end up with a bunch of tiny Parquet files (depending on the mini-batch size of Spark Streaming and the volume of data). Querying such files can be very slow. For example, Parquet may need to read every file's metadata (stored in the footer) to reconcile the schema, which is a big overhead. If this is the case, you will need a separate process that, as a workaround, reads the older files and writes them back "merged" (this wouldn't be a simple file-level merge; the process would actually need to read in all the Parquet data and write out larger Parquet files).
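To make the "merge" step more concrete, here is a minimal sketch of the planning part only: deciding which small files to rewrite together so that each output lands near a target size. It is plain Python with made-up file names and sizes, and plan_compaction is a hypothetical helper; the actual rewrite would still have to be done by a Spark/Hive job that reads and re-writes the Parquet data.

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 * 1024):
    """Group (name, size) pairs into batches of roughly target_bytes each.

    Files are taken in the given order; a batch is closed as soon as adding
    the next file would push it over the target. Each batch would then be
    read and rewritten as one larger Parquet file by the compaction job.
    """
    batches, current, current_size = [], [], 0
    for name, size in file_sizes:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


MB = 1024 * 1024
# Illustrative input: thirty 10 MB files (names and sizes are invented)
small_files = [(f"part-{i:05d}.parquet", 10 * MB) for i in range(30)]
plan = plan_compaction(small_files)
print([len(batch) for batch in plan])
```

The 128 MB default is just one value inside the 64-256 MB range mentioned above; pick whatever suits your cluster.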
This workaround may defeat the original purpose of "streaming" the data. You could also look at other technologies that may work better here, such as Apache Kudu, Apache Kafka, Apache Druid, Kinesis, etc.
Update: since I posted this answer, there is now a new strong player here - Delta Lake. https://delta.io/ If you're used to parquet, you'll find Delta very attractive (actually, Delta is built on top of parquet layer + metadata). Delta Lake offers:
ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
Scalable metadata handling: Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease.
Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
Upserts and deletes: Supports merge, update and delete operations to enable complex usecases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.

Use Kafka Streams. Spark Streaming is a misnomer (it's mini-batch under the hood, at least up to 2.2).
https://eng.verizondigitalmedia.com/2017/04/28/Kafka-to-Hdfs-ParquetSerializer/

Related

How to manage small files created due to insertion of stream data into Hive?

I am reading Kafka messages using simple Kafka consumer.
Storing the output into HDFS and doing some filtering.
After filtering, I write this data into Hive, which produces many small ORC files in Hive.
Could someone advise me how to handle such a scenario?
You can reduce the number of existing ORC files afterwards by running
ALTER TABLE tablename CONCATENATE;
or ALTER TABLE tablename PARTITION (field=value) CONCATENATE;
To prevent Hive from generating too many ORC files, try
set hive.merge.mapredfiles=true;
There are tools out there, such as Camus and Apache Gobblin, which have scripts for pulling Kafka data continuously and "sweeper / compaction" processes that can be run by schedulers such as Oozie to build larger time partitions.
You can also look at the Kafka Connect framework with the HDFS plugin by Confluent (you do not need to be running Confluent's Kafka installation to use it). It supports batching and large files (I've gotten up to 4 GB files per Kafka partition from it), and it will build Hive partitions for you automatically.
Alternatively, Apache NiFi can be used between your streams and storage to compress the data before it lands on Hadoop.
The only other alternatives I know of are MapReduce-based tools on GitHub (filecrush is one) or writing your own Hive/Pig/Spark script that reads a location, does very little transformation to it (like calculating a date partition), then writes it out somewhere else. This causes the smaller files to be combined into fewer, larger ones, and there are Hadoop settings in each framework to control how much data should be output per file.

Import small stream in Impala

We are currently working on a Big Data project.
The Big Data platform is Cloudera Hadoop.
As input to our system we have a small flow of data, which we collect via Kafka (approximately 80 MB/h, continuously).
The messages are then stored in HDFS to be queried via Impala.
Our client does not want to separate hot data from cold data. After 5 minutes, the data must be accessible alongside the historical (cold) data. We chose to have a single database.
To insert the data, we use the JDBC connector provided by the Impala API (e.g. INSERT INTO ...).
We are aware that this is not the recommended solution: each Impala insertion creates a file (<10 KB) in HDFS.
We are looking for a way to insert a small stream into an Impala database that avoids creating many small files.
What solution would you recommend?
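For a sense of scale, the figures in the question can be turned into a rough file count. This is a back-of-the-envelope sketch in Python; it assumes decimal units (1 MB = 1,000,000 bytes) and treats the 10 KB figure as an upper bound on file size, so the result is a lower bound on the number of files. The 5-minute batching comparison is only an illustration based on the stated freshness requirement, not a recommendation taken from the thread.

```python
# Figures from the question: about 80 MB/h of input, files under 10 KB each.
MB = 1000 * 1000  # assumption: decimal megabytes
KB = 1000

hourly_bytes = 80 * MB
max_file_bytes = 10 * KB

# Every file holds at most 10 KB, so there are at least this many per hour
min_files_per_hour = hourly_bytes // max_file_bytes
min_files_per_day = min_files_per_hour * 24

# For comparison: writing one file per 5-minute window instead
batches_per_hour = 60 // 5
avg_batch_bytes = hourly_bytes / batches_per_hour

print(min_files_per_hour, min_files_per_day)
print(batches_per_hour, round(avg_batch_bytes / MB, 2))
```

Even the 5-minute files stay well below typical HDFS block sizes, so some later compaction would probably still be worthwhile.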

Clarification of Sqoop and Flume

I am very new to big data and I am a little confused about Sqoop and Flume.
As I understand it, the difference between Sqoop and Flume is:
Sqoop is for transferring bulk data from an RDBMS
Flume is for streaming data such as log files
My confusion arises because the big data architecture I am looking at (which I have no virtual copy of) groups structured data as transferred by Sqoop and unstructured data as streamed by Flume.
My question is: does that mean Flume is only for streaming?
What about high-frequency data? Does Flume support transferring unstructured data other than log files (e.g. audio, video), or would Sqoop be able to handle that?
My final question: can Sqoop work with federated data sources? If yes, with both real and virtual ones?
Thanks,
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (it imports the data, the data is transformed in Hadoop MapReduce, and it then exports the data).
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Source: sqoop-vs-flume-battle-of-the-hadoop
Reference: INGESTION AND STREAMING
Flume is efficient with streams, and if you just want to dump data from an RDBMS, why not use Sqoop?
If by high-frequency data you mean social media, then yes, Flume can handle it. Unstructured data: yes, Flume may handle that too.
Sqoop is essentially a tool to ingest data into HDFS from an RDBMS. Under the hood, it generates simple Java code which submits a query to the RDBMS and writes the result to HDFS. This means that you can import with Sqoop anything that can be accessed via a JDBC connection and has a Java driver available. For this reason, you can't use it for files (like logs) or things like that.
So Sqoop can't handle video or audio files.
Flume, instead, is used to monitor and ingest information in real time. You can ingest anything for which there is a Flume source available (https://flume.apache.org/FlumeUserGuide.html#flume-sources).
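The "submit a query, write the result out" idea behind Sqoop can be illustrated with a tiny stand-in. The sketch below is plain Python, not Sqoop: an in-memory SQLite database plays the role of the RDBMS, a string buffer plays the role of a file on HDFS, and dump_query_to_csv is a hypothetical helper. Real Sqoop generates Java code and runs the import as parallel MapReduce tasks, which this does not attempt to show.

```python
import csv
import io
import sqlite3


def dump_query_to_csv(conn, query, out):
    """Run `query` on a DB-API connection and write every row to `out` as CSV.

    Returns the number of rows written.
    """
    cursor = conn.execute(query)
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Stand-in for the source RDBMS (table and rows are invented for the example)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])

buffer = io.StringIO()  # stand-in for a file on HDFS
rows_written = dump_query_to_csv(conn, "SELECT id, name FROM users ORDER BY id", buffer)
print(rows_written)
print(buffer.getvalue())
```

It also shows why this approach only fits data reachable through a database driver: there is nothing here that could pick up a log, audio or video file.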

how to efficiently move data from Kafka to an Impala table?

Here are the steps to the current process:
Flafka writes logs to a 'landing zone' on HDFS.
A job, scheduled by Oozie, copies complete files from the landing zone to a staging area.
The staging data is 'schema-ified' by a Hive table that uses the staging area as its location.
Records from the staging table are added to a permanent Hive table (e.g. insert into permanent_table select * from staging_table).
The data, from the Hive table, is available in Impala by executing refresh permanent_table in Impala.
I look at the process I've built and it "smells" bad: there are too many intermediate steps that impair the flow of data.
About 20 months ago, I saw a demo where data was being streamed from an Amazon Kinesis pipe and was queryable, in near real-time, by Impala. I don't suppose they did something quite so ugly/convoluted. Is there a more efficient way to stream data from Kafka to Impala (possibly a Kafka consumer that can serialize to Parquet)?
I imagine that "streaming data to low-latency SQL" must be a fairly common use case, and so I'm interested to know how other people have solved this problem.
If you need to dump your Kafka data as-is to HDFS, the best option is to use Kafka Connect and the Confluent HDFS connector.
You can dump the data to Parquet files on HDFS, which you can then load in Impala.
I think you'll want to use a TimeBasedPartitioner to create Parquet files every X milliseconds (by tuning the partition.duration.ms configuration parameter).
Adding something like this to your Kafka Connect configuration might do the trick:
# Don't flush less than 1000 messages to HDFS
flush.size = 1000
# Dump to parquet files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class = TimeBasedPartitioner
# One file every hour. If you change this, remember to change the filename format to reflect this change
partition.duration.ms = 3600000
# Filename format
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
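To see how partition.duration.ms and the path format fit together, here is a rough Python imitation of time-based partitioning. It is only a sketch of the idea (floor the record timestamp to its window, then format the window start as a directory path), not the connector's code. It assumes UTC, and partition_path is a hypothetical name. The minute component is left out because, under this flooring assumption, one-hour windows would always give minute 00.

```python
from datetime import datetime, timezone


def partition_path(timestamp_ms, duration_ms=3_600_000):
    """Return the directory a record with this timestamp would fall into.

    The timestamp is floored to the start of its duration_ms window and
    rendered in a year=/month=/day=/hour= layout (UTC assumed).
    """
    window_start_ms = timestamp_ms - (timestamp_ms % duration_ms)
    window_start = datetime.fromtimestamp(window_start_ms / 1000, tz=timezone.utc)
    return window_start.strftime("year=%Y/month=%m/day=%d/hour=%H")


# 2022-03-01 14:37:05 UTC as epoch milliseconds (arbitrary example timestamp)
ts = int(datetime(2022, 3, 1, 14, 37, 5, tzinfo=timezone.utc).timestamp() * 1000)
print(partition_path(ts))
print(partition_path(ts + 30 * 60 * 1000))  # 30 minutes later: next window
```

Records that are 30 minutes apart can therefore end up in different directories, which is why the filename format needs to change whenever the duration does.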
Answering this question in 2022, I would say the solution is to stream messages from Kafka to Kudu and integrate Impala with Kudu, as the two are already tightly integrated.
Here is an example of an Impala schema for Kudu:
CREATE EXTERNAL TABLE my_table
STORED AS KUDU
TBLPROPERTIES (
'kudu.table_name' = 'my_kudu_table'
);
Apache Kudu supports SQL inserts and uses its own file format under the hood. Alternatively, you could use Apache Phoenix, which supports inserts and upserts (if you need exactly-once semantics) and uses HBase under the hood.
As long as Impala is your final way of accessing the data, you shouldn't need to care about the underlying formats.

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a map reduce on my data and now I want to query it so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hood converts SQL statements into a series of MapReduce jobs with stage builds in between. Also, Hive is a high-latency system: depending on the size of your dataset, a complicated query can take minutes, hours, or even days to process.
So, if you want to serve the results of your MapReduce job on your website, it's highly recommended that you export the results back to an RDBMS using Sqoop and take it from there.
Or, if the data itself is huge and cannot be exported back to an RDBMS, another option you could consider is a NoSQL system like HBase.
welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data inbound and outbound from your HDFS cluster. The video is easy-to-watch and describes advantages / disadvantages to each tool, but this outline should give you the basics of the Hadoop Ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
HBase - Allows interactive access of HDFS. Sits on top of HDFS and applies structure to data. Allows for random reads, scales horizontally with cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Row-key indexes only on data. Does not use Map Reduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS, then yes, you can do so via Hive.
However, you'll benefit most from it if your data is already organized (for instance, in columns).
Let's take an example: your MapReduce job produced a CSV file named wordcount.csv containing two columns, word and count. This CSV file is on HDFS.
Let's now suppose you want to know the number of occurrences of the word "gloubiboulga". You can achieve this with the following code:
-- Table matching the two columns of wordcount.csv
CREATE TABLE data
(
  word STRING,
  `count` INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- The file is already on HDFS, so LOCAL is not used here
LOAD DATA INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
SELECT word, `count` FROM data WHERE word = 'gloubiboulga';
Please note that while this language looks a lot like SQL, you'll still have to learn a few things about it.

Resources