Moving Hive data from one Hadoop cluster to another without using the distcp command

How can I move Hive data from one Hadoop cluster to another without using the distcp command? We cannot use it in our environment. Do we have another option, such as Sqoop or Flume?

distcp is the most efficient way to move large amounts of data from one Hadoop cluster to another.
Sqoop and Flume cannot be used to transfer data from one Hadoop cluster to another. Sqoop is predominantly used to move data between Hadoop and relational databases, whereas Flume is used to ingest streaming data into Hadoop.
Your other options would be to:
use a high-throughput message queue like Kafka, but this would be more complicated than using distcp;
use traditional hadoop fs shell commands, such as cp, or get followed by put (see the sketch below).
FYI, when you are talking about Hive data, you should also consider keeping the Hive metadata (metastore) in sync between the clusters.
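As a rough illustration of the shell-command option, here is a minimal sketch that combines Hive's EXPORT/IMPORT statements (which also carry the table metadata) with plain hadoop fs commands instead of distcp. The table name, the paths, and the destination NameNode address nn2 are hypothetical.
--hive (run on the source cluster)
export table my_table to '/tmp/export/my_table';
#bash/shell (pull the export to local disk, then push it to the destination cluster)
hadoop fs -get /tmp/export/my_table /tmp/my_table_export
hadoop fs -mkdir -p hdfs://nn2:8020/tmp/export
hadoop fs -put /tmp/my_table_export hdfs://nn2:8020/tmp/export/my_table
--hive (run on the destination cluster)
import table my_table from '/tmp/export/my_table';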

Related

Can you use HDFS as your principal storage?

Is it reliable to save your data in Hadoop and consume it using Spark/Hive, etc.?
What are the advantages of using HDFS as your main storage?
HDFS is only as reliable as the NameNode(s) that maintain the file metadata. You should set up NameNode HA, take frequent snapshots of the NameNode metadata, and store those externally, away from HDFS.
If all NameNodes are unavailable, or their metadata storage is corrupted, you will be unable to read the HDFS DataNode data, even though the underlying blocks themselves are intact and highly available.
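As a rough sketch of the backup idea, you can periodically pull the latest fsimage from the NameNode with hdfs dfsadmin -fetchImage and copy it to storage outside the cluster; the backup paths and host below are hypothetical:
#bash/shell
# download the most recent fsimage from the NameNode into a dated local directory
mkdir -p /backup/namenode/$(date +%Y%m%d)
hdfs dfsadmin -fetchImage /backup/namenode/$(date +%Y%m%d)
# copy the backup to storage outside the cluster (host and path are hypothetical)
rsync -a /backup/namenode/ backup-host:/srv/hdfs-fsimage-backups/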
Here are some considerations for storing your data in Hive vs HDFS (and/or HBase).
Hive:
HDFS is a filesystem that supports failover and high availability. HDFS replicates the data across several DataNodes based on the replication factor you have chosen. Hive is built on top of Hadoop and therefore stores its data in HDFS as well, leveraging the HA properties of HDFS.
Hive uses predicate pushdown, which provides significant performance benefits. Hive can also be combined with modern file formats such as Parquet and ORC, improving performance even further via predicate pushdown (see the sketch after this list).
Hive provides very easy access to data via HQL (Hive Query Language), which is a SQL-like language.
Hive works very well with Spark, and you can combine the two, e.g. reading Hive data into DataFrames and saving DataFrames back into Hive.
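A minimal HiveQL sketch of the ORC point above; the table name, columns, and filter are hypothetical:
--hive
-- store the table as ORC so Hive can push predicates down into the file format
create table events_orc (event_id bigint, event_type string, event_ts timestamp)
stored as orc;
-- this filter can be evaluated against ORC stripe statistics instead of scanning every row
select count(*) from events_orc where event_type = 'click';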
HDFS/HBase:
Hive is a warehouse system used for data analysis, so Hive CRUD operations are relatively slow compared to direct access to HDFS files (or to HBase, which is built for fast CRUD operations). For instance, in a streaming application, saving data to HDFS or HBase will be much faster than saving it through Hive. If you need fast storage (or fast insert queries) and you don't do any analysis on large datasets, then prefer HDFS/HBase over Hive.
If performance is crucial for your application, you may prefer to skip the extra layer of Hive and access the HDFS files directly.
The team decides not to use SQL.
Related post:
When to use Hadoop, HBase, Hive and Pig?

Do HBase and Hive need to be co-located on the same machine?

We are currently building a new HBase cluster.
Now I want to add a new Hive cluster, but I don't know whether I should build Hive on the same Hadoop cluster as HBase, or build a new Hadoop cluster for Hive.
If I build a new Hadoop cluster for Hive, will the traffic between Hive and HBase be too heavy when I execute SQL to aggregate HBase data (AFAIK, Hive needs to import the HBase data into its own HDFS storage)?
Let's first define what the traffic between HBase and Hive actually is. In every Hive query you will either:
use Hive to query HBase directly (see the sketch at the end of this answer),
use Hive to perform join requests of warehouse tables with an HBase table, or
use Hive to perform join requests of external tables with an HBase table.
If you take a deep dive into the architecture of all these components, you will notice that the Hive warehouse and HBase both use HDFS behind the scenes. The same applies to external tables located in HDFS. So if you deploy Hive into your current architecture, its internal data would physically be stored in the same place as HBase's data: in the DataNodes of your rack. This means traffic would be optimal for the tasks that YARN schedules for your Hive queries. If you create another rack for the Hive warehouse with a set of separate DataNodes located in another data center, it would negatively impact your performance.
So if you have enough capacity to keep the data within a single rack for all of your services, go that way. You can read more about Hive and HBase infrastructure in the referenced links.
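For reference, the first case above (Hive querying HBase directly) is typically set up with the HBase storage handler. A minimal sketch, assuming an existing HBase table named events with a column family cf (all names are hypothetical):
--hive
-- map an existing HBase table into Hive so it can be queried (and joined) with HQL
create external table hbase_events (rowkey string, payload string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,cf:payload")
tblproperties ("hbase.table.name" = "events");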

How to use DistCp to directly convert data into tables in Hive?

I am using DistCp to copy data from cluster 1 to cluster 2. I was able to copy the table data from cluster 1 into cluster 2 successfully. However, the data arrived as plain files on HDFS (visible in the file browser) rather than as Hive tables.
Is there any direct way to convert this HDFS data into Hive tables (including data types, delimiters, etc.) using DistCp command(s)? I can certainly query it to gather the data from HDFS, but then I'd have to convert the tables one by one. I'm looking for a more efficient way to do this. Thanks!
Example:
hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://nn2:8020/destination
I haven't found any documentation saying you can use DistCp to copy tables directly. However, if anyone is facing a similar situation, they can use the approach below. It worked for me.
--hive
export table <<<table_name>>> to '<<<hdfs path>>>';
#bash/shell
hadoop distcp source destination
--hive
import table <<<table_name>>> from '<<<hdfs path>>>';
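For illustration, here is what that flow looks like with concrete (hypothetical) names, assuming nn1 is the source NameNode and nn2 the destination:
--hive (source cluster)
export table sales to '/tmp/export/sales';
#bash/shell
hadoop distcp hdfs://nn1:8020/tmp/export/sales hdfs://nn2:8020/tmp/export/sales
--hive (destination cluster)
import table sales from '/tmp/export/sales';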

Moving data from an RDBMS to Hadoop, using SQOOP and FLUME

I am in the process of learning Hadoop and am stuck on a few concepts about moving data from a relational database to Hadoop and vice versa.
I have transferred files from MySQL to HDFS using SQOOP import queries. The files I transferred were structured datasets, not server log data. I recently read that Flume is usually used for moving log files into Hadoop.
My questions are:
1. Can we use SQOOP for moving log files as well?
2. If yes, which of SQOOP or FLUME is preferred for log files, and why?
1) Sqoop can be used to transfer data between any RDBMS and HDFS. To use Sqoop, the data has to be structured, usually described by the schema of the database from which the data is being imported or exported. Log files are not always structured (it depends on the source and type of log), so Sqoop is not used for moving log files.
2) Flume can collect and aggregate data from many different kinds of customizable data sources. It gives you more flexibility in controlling which specific events to capture and how to use them in a user-defined workflow before storing them in, say, HDFS.
I hope this clarifies the difference between Sqoop and Flume.
SQOOP is designed to transfer data from an RDBMS to HDFS, whereas FLUME is for moving large volumes of log data.
Both are different and specialized for different purposes.
For example:
you can use SQOOP to import data via JDBC (which you cannot do with FLUME),
and
you can use FLUME to say something like "I want to tail the last 200 lines of the log file from this server" (a sketch follows below).
Read more about FLUME here
http://flume.apache.org/
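As a rough sketch of that tail use case, a single Flume agent (here named a1) can run an exec source that tails a log file into an HDFS sink; the log path, HDFS path, and NameNode address are hypothetical:
#bash/shell
# write a minimal Flume agent config: exec source tailing a log, memory channel, HDFS sink
cat > tail-to-hdfs.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/myapp/app.log
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://nn1:8020/logs/myapp/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
EOF
# start the agent
flume-ng agent --conf conf --conf-file tail-to-hdfs.conf --name a1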
SQOOP not only transfers data from RDBMSs but also from NoSQL databases like MongoDB. You can transfer the data directly to HDFS or Hive.
When transferring data to Hive you do not need to create the table beforehand; Sqoop takes the schema from the database itself.
Flume is used to fetch log data or streaming data.
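A minimal sketch of a Sqoop import straight into Hive; the JDBC connection string, credentials, and table names are hypothetical:
#bash/shell
# import a MySQL table into Hive; Sqoop creates the Hive table and infers the schema
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales_db \
  --username sqoop_user -P \
  --table orders \
  --hive-import \
  --hive-table orders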

Hadoop HDFS dependency

In the Hadoop MapReduce programming model, when we are processing files, is it mandatory to keep the files in the HDFS filesystem, or can I keep the files in another filesystem and still get the benefit of the MapReduce programming model?
Mappers read input data from an implementation of InputFormat. Most implementations descend from FileInputFormat, which reads data from the local machine or HDFS. (By default, data is read from HDFS, and the results of the MapReduce job are stored in HDFS as well.) You can write a custom InputFormat when you want your data to be read from an alternative data source other than HDFS.
TableInputFormat reads data records directly from HBase, and DBInputFormat accesses data from relational databases. You could also imagine a system where data is streamed to each machine over the network on a particular port; the InputFormat would read data from the port and parse it into individual records for mapping.
However, in your case, you have data on an ext4 filesystem on one or more servers. In order to conveniently access this data within Hadoop, you'd have to copy it into HDFS first. This way you benefit from data locality when the file chunks are processed in parallel.
I strongly suggest reading the Yahoo! tutorial on this topic for detailed information. For collecting log files for MapReduce processing, also take a look at Flume.
You can keep the files elsewhere but you'd lose the data locality advantage.
For example, if you're using AWS, you can store your files on S3 and access them directly from MapReduce code, Pig, Hive, etc.
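As an illustration of the S3 option, a Hive external table can point directly at an S3 location, assuming the cluster is configured with the s3a connector and credentials; the bucket, path, and schema are hypothetical:
--hive
-- query data that lives in S3 without copying it into HDFS first
create external table access_logs (ip string, ts string, request string)
row format delimited fields terminated by '\t'
location 's3a://my-bucket/logs/access/';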
In order to use Apache Hadoop you must have your files in HDFS, the Hadoop file system. Though there are different storage backends that Hadoop can treat like HDFS, such as AWS S3, at a basic level they all play the role of HDFS storage.
The data needs to be in HDFS because HDFS distributes the data across your cluster. During the map phase, each Mapper goes through the data stored on its node and then sends it to the appropriate node running the reducer code for the given chunk.
You can't have Hadoop MapReduce without using HDFS.
