how to efficiently move data from Kafka to an Impala table? - hadoop

Here are the steps to the current process:
Flafka writes logs to a 'landing zone' on HDFS.
A job, scheduled by Oozie, copies complete files from the landing zone to a staging area.
The staging data is 'schema-ified' by a Hive table that uses the staging area as its location.
Records from the staging table are added to a permanent Hive table (e.g. insert into permanent_table select * from staging_table).
The data, from the Hive table, is available in Impala by executing refresh permanent_table in Impala.
I look at the process I've built and it "smells" bad: there are too many intermediate steps that impair the flow of data.
About 20 months ago, I saw a demo where data was being streamed from an Amazon Kinesis pipe and was queryable, in near real-time, by Impala. I don't suppose they did something quite so ugly/convoluted. Is there a more efficient way to stream data from Kafka to Impala (possibly a Kafka consumer that can serialize to Parquet)?
I imagine that "streaming data to low-latency SQL" must be a fairly common use case, and so I'm interested to know how other people have solved this problem.

If you need to dump your Kafka data as-is to HDFS the best option is using Kafka Connect and Confluent HDFS connector.
You can either dump the data to a parket file on HDFS you can load in Impala.
You'll need I think you'll want to use a TimeBasedPartitioner partitioner to make parquet files every X miliseconds (tuning the partition.duration.ms configuration parameter).
Addign something like this to your Kafka Connect configuration might do the trick:
# Don't flush less than 1000 messages to HDFS
flush.size = 1000
# Dump to parquet files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class = TimebasedPartitioner
# One file every hour. If you change this, remember to change the filename format to reflect this change
partition.duration.ms = 3600000
# Filename format
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm

Answering that question in year 2022, I would say that solution would be streaming messages from Kafka to Kudu and integrate Impala with Kudu, as it has already tight integration.
Here is example of Impala schema for Kudu:
CREATE EXTERNAL TABLE my_table
STORED AS KUDU
TBLPROPERTIES (
'kudu.table_name' = 'my_kudu_table'
);
Apache Kudu supports SQL inserts and it uses own file format under the hood. Alternatively you could use Apache Phoenix which supports inserts and upserts (if you need exactly once semantic) and uses HBase under the hood.
As long as the Impala is your final way of accessing the data, you shouldn't care about underlaying formats.

Related

How to manage small files created due to insertion of stream data into Hive?

I am reading Kafka messages using simple Kafka consumer.
Storing the output into HDFS and doing some filtering.
After filtration, I am writing this data into Hive, which causes small orc files into the hive.
Could someone advise me how to handle such a scenario?
You can reduce the number of existing ORC files afterwards by running
ALTER TABLE tablename CONCATENATE;
or ALTER TABLE tablename PARTITION (field=value) CONCATENATE;
To prevent HIVE generating too many ORC files, try with
set hive.merge.mapredfiles=true;
There's tools out there such as Camus and Apache Gobblin which have scripts for the purposes of pulling Kafka data continuously, and having "sweeper / compaction" processes that can be run by schedulers such as Oozie to build larger time partitions
You can also look at Kafka Connect framework with the HDFS plugin by Confluent (you do not need to be running Confluent's Kafka installation to use it). It has support for batching up and large files (I've gotten up to 4GB files per Kafka partition from it) and it will build Hive partitions for you automatically
Or Apache Nifi can be used in between your streams and storage to compress the data before landing on Hadoop
The only other alternative I know of are mapreduce based tools on Github (filecrush is one) or writing your own Hive/Pig/Spark script that reads a location, does very little transformation to it (like calculating a date partition), then writes it out somewhere else. This will cause the smaller blocks to be combined into multiple, and there are hadoop settings in each framework to control how much data should be output per file

What are Hive Common Use Cases?

I'm new to Hive; so, I'm not sure how companies use Hive. Let me give you a scenario and see if I'm conceptually correct about the use of Hive.
Let's say my company wants to keep some web server log files and be able to always search through and analyze the logs. So, I create a table columns of which correspond to the columns in the log file. Then I load the log file into the table. Now, I can start query the data. So, as the data comes in at future dates, I just keep adding the data to this table, and thus I always have my log files as a table in Hive that I can search through and analyze.
Is that scenario above a common use? And if it is, then how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?
You can use Hive, for analysis over static datasets, but if you have streaming logs, I really wouldn't suggest Hive for this. It's not a search engine and will take minutes just to find any reasonable data you're looking for.
HBase would probably be a better alternative if you must stay within the Hadoop ecosystem. (Hive can query Hbase)
Use Splunk, or the open source alternatives of Solr / Elasticsearch / Graylog if you want reasonable tools for log analysis.
But to answer your questions
how do I keep adding new log files to the table? Do I have to keep adding them to the table manually each day?
Use an EXTERNAL Hive table over an HDFS location for your logs. Use Flume to send log data to that path (or send your logs to Kafka, and from Kafka to HDFS, as well as a search/analytics system)
You only need to update the table if you're adding date partitions (which you should because that's how you get faster Hive queries). You'd use MSCK REPAIR TABLE to detect missing partitions on HDFS. Or run ALTER TABLE ADD PARTITION yourself on a schedule. Note: Confluent's HDFS Kafka Connect will automatically create Hive table partitions for you
If you must use Hive, you can improve the queries better if you convert the data into ORC or Parquet format

Import small stream in Impala

We are currently on a Big Data project.
The Big Data platform Hadoop Cloudera.
Input of our system we have a small flow of data, we collect via Kafka (approximately 80Mo/h continuously).
Then the messages are stored in HDFS to be queried via Impala.
Our client does not want to separate the hot data with the cold data. After 5 mins, the data must be accessible in the history data (cold data). We chose to have a single database.
To insert the data, we use the JDBC connector provided by Impala API (eg INSERT INTO ...).
we are aware that this is not the recommended solution, each Impala insertion creates a file (<10kb) in HDFS.
We seek a solution to insert a small stream in a Imapala base which avoids getting many small files.
What solution we preconize?

Loading data into HIVE to support front end application

We have a datawarehousing application which we are planning to convert to Hadoop.
Currently, there are 20 feeds that we receive on daily basis and load this data into MySQL database.
As the data is getting large, we are planning to move to Hadoop for faster query processing.
As the first step we are planning to load the data into HIVE on a daily basis instead of MySQL.
Question:-
1.Can I convert Hadoop similar to a DWH application to process files on daily basis?
2.When I load the data in Master Node, will it be sync'd automatically?
It really depends on the size of your data. The Question is a bit complex but in general you will have to design your own pipeline.
If you are analyzing raw logs HDFS will be a good choice to start from. You can use Java, Python or Scala to schedule the Hive jobs on daily basis and use Sqoop if you still need some MySQL data.
In Hive you will have to create partitioned table to be synced and available upon query execution. Partition creation can be also scheduled.
I would suggest to go with Impala instead of Hive as it is more tunable, fault tolerant and easier to use.

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a map reduce on my data and now I want to query it so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hoods converts the SQL statements to bunch of MapReduce jobs with stage builds in between. Also, Hive is a high latency system i.e. based on your dataset sizes you are looking at minutes to hours or even days to process a complicated query.
So, if you want to serve the results from your MapReduce job output in your website, its highly recommended you export the results back to a RDBMS using sqoop and then take it from there.
Or, if the data itself is huge and cannot be exported back to RDBMS. Then another option you could think of is using a NoSQL system like HBase.
welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data inbound and outbound from your HDFS cluster. The video is easy-to-watch and describes advantages / disadvantages to each tool, but this outline should give you the basics of the Hadoop Ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
Hbase - Allows interactive access of HDFS. Sits on top of HDFS and apply structure to data. Allows for random reads, scales horizontally with cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Row-key indexes only on data. Does not use Map Reduce paradigm.
Impala - Similar to Hive, high-performance SQL Engine for querying vast amounts of data stored in HDFS. Does not use Map Reduce. Good alternative to Hive.
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll most beneficiate from it if your data are already organized (for instance, in columns).
Lets take an example : your map-reduce job produced a csv file named wordcount.csv and containing two rows : word and count. This csv file is on HDFS.
Let's now suppose you want to know the occurence of the word "gloubiboulga". You can simply achieve this via the following code :
CREATE TABLE data
(
word STRING,
count INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA LOCAL INPATH '/wordcount.csv'
OVERWRITE INTO TABLE data;
select word, count from data where word=="gloubiboulga";
Please note that while this language looks highly like SQL, you'll still have to learn a few things about it.

Resources