How to push data from SQL to HDFS - etl

I have the following use case:
We have several SQL databases in different locations and we need to load some data from them into HDFS.
The problem is that we do not have access to those servers from our Hadoop cluster (due to security concerns), but we can push data to our cluster.
Is there any tool like Apache Sqoop to do such bulk loading?

Dump the data as files from your SQL databases in some delimited format, for instance CSV, then do a simple hadoop fs -put to push all the files into HDFS.
That's it.
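For example, a rough sketch of that flow (table, paths and the edge host are made-up placeholders, not from the question; the CSV conversion here is naive and does no quoting):

# On the database host: dump a table to delimited text (mysql --batch emits tab-separated rows).
mysql -u etl_user -p --batch -e "SELECT * FROM mydb.orders" | tr '\t' ',' > /data/exports/orders.csv
# Copy the file to a host that is allowed to reach the cluster, then push it into HDFS from there.
hadoop fs -mkdir -p /user/etl/orders
hadoop fs -put /data/exports/orders.csv /user/etl/orders/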

Let us assume I am working in a small company with a 30-node cluster processing 100 GB of data daily. This data comes from different sources, i.e. RDBMSs such as Oracle, MySQL, IBM Netezza, DB2, etc. We do not need to install Sqoop on all 30 nodes; the minimum number of nodes Sqoop needs to be installed on is 1. After installing it on one machine we can reach the database machines from there and use Sqoop to import the data.
As far as security is concerned, no import can be done until the database administrator runs the following two commands (in MySQL, granting access from the Sqoop machine's IP):
mysql> GRANT ALL PRIVILEGES ON mydb.table TO ''@'IP address of the Sqoop machine';
mysql> GRANT ALL PRIVILEGES ON mydb.table TO '%'@'IP address of the Sqoop machine';
These two commands must be run by the admin.
Then we can run our sqoop import commands.
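For illustration, a minimal sqoop import run from that single Sqoop node might look like this (host, credentials, table and target directory are placeholders):

sqoop import \
  --connect jdbc:mysql://db-host:3306/mydb \
  --username etl_user -P \
  --table customers \
  --target-dir /user/etl/customers \
  --num-mappers 4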

Related

Importing data to hbase using sqoop

When I want to import data to Hive using Sqoop I can specify --hive-home <dir> and Sqoop will call that specified copy of Hive installed on the machine where the script is being executed. But what about HBase? How does Sqoop know which HBase instance/database I want the data to be imported into?
Maybe the documentation helps?
By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS
Every example I see just shows that option along with column families and so on, so I assume it depends on whatever variables are set in sqoop-env.sh, as the Hortonworks docs suggest.
When you give the Hive home directory, that's not telling it any database or table information either, but rather where the Hive configuration files exist on the machine you're running Sqoop on. By default, that's set to the environment variable $HIVE_HOME.
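So, assuming the HBase client configuration (hbase-site.xml on the machine running Sqoop) already points at the right cluster, an HBase import could be sketched like this (connection string, table, column family and row key are placeholders):

sqoop import \
  --connect jdbc:mysql://db-host:3306/mydb \
  --username etl_user -P \
  --table customers \
  --hbase-table customers \
  --column-family cf \
  --hbase-row-key id \
  --hbase-create-table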

hadoop architecture query example

Currently I have two machines. One of them is the Hortonworks sandbox; I have configured it as the NameNode and decommissioned the DataNode from it. The other machine I built myself and made it a DataNode, and I have installed the Hive server on it.
I also assigned the slave role to it, and I used Ambari to finish the setup.
My question, as this is my first time ever using Hadoop: my plan is to transfer data from an SQL database into Hadoop, so does this mean I have to install MySQL on the DataNode while I will be using Sqoop? And what will the NameNode do? Shall I query it and it passes the queries on to the DataNode? I am really very confused and under huge pressure to finish, so forgive me as I am a newbie. The installations on the machines are all default; I have chosen DataNode for the first machine and NodeManager for the second one, with no special configurations. I would appreciate a simple example from which I can understand this.
Thanks a lot, fellows.
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export from the Hadoop file system to relational databases.
For example: you have some data in MySQL on another machine and you have to transfer that data into your Hadoop HDFS. In that situation Sqoop would be used.
The NameNode stores metadata (number of blocks, on which rack and which DataNode the data is stored, and other details) about the data being stored in the DataNodes, whereas the DataNodes store the actual data.
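To see that split in practice, you can ask the NameNode where a file's blocks live (the path here is just a placeholder):

hdfs fsck /user/etl/customers -files -blocks -locations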

Export of whole database from one hbase to another

Recently I started working with HBase. Now I need to get the application data from one HBase system to another. How can I export a dump of the whole HBase from one system to another?
You can use the HBase Export/Import commands for transferring data from one HBase cluster to another.
Please refer to the link below:
http://hbase.apache.org/0.94/book/ops_mgt.html
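A rough sketch of those commands (the table name and paths are placeholders; the exported files still have to be copied between clusters, e.g. with distcp, and the target table with the same column families must already exist):

# On the source cluster
hbase org.apache.hadoop.hbase.mapreduce.Export my_table /backup/my_table
# On the destination cluster, after copying /backup/my_table across
hbase org.apache.hadoop.hbase.mapreduce.Import my_table /backup/my_table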

What's the difference between Flume and Sqoop?

Both Flume and Sqoop are meant for data movement, so what is the difference between them? Under what conditions should I use Flume or Sqoop?
From http://flume.apache.org/
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Flume helps to collect data from a variety of sources, like logs, JMS, directories, etc. Multiple Flume agents can be configured to collect high volumes of data.
It scales horizontally.
From http://sqoop.apache.org/
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Sqoop helps to move data between Hadoop and other databases, and it can transfer data in parallel for performance.
Both Sqoop and Flume pull the data from the source and push it to the sink. The main difference is that Flume is event driven, while Sqoop is not.
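To make that contrast concrete, a rough sketch (connection string, table, paths, agent name and config file are all placeholders):

# Sqoop: a one-shot bulk import from a relational table into HDFS
sqoop import --connect jdbc:mysql://db-host:3306/mydb --username etl_user -P \
  --table clicks --target-dir /user/etl/clicks
# Flume: a long-running agent that keeps shipping events (e.g. log lines) into HDFS,
# driven by its separate agent configuration file
flume-ng agent --name a1 --conf ./conf --conf-file ./conf/a1.properties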
Flume:
Flume is a framework for populating Hadoop with data. Agents are populated throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.
Sqoop:
Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
Flume:
A very common use case is collecting log data from one system – a bank of web servers – and aggregating it in HDFS for later analysis.
Sqoop:
Sqoop, on the other hand, is designed for performing bulk imports of data into HDFS from structured data stores. A simple use case is an organization that runs a nightly Sqoop import to load the day's data from a production DB into a Hive data warehouse for analysis.
-- From the Definitive Guide.
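A sketch of that nightly-load use case (connection string, tables and the date filter are placeholders, not from the book):

sqoop import \
  --connect jdbc:mysql://prod-db:3306/sales \
  --username etl_user -P \
  --table orders \
  --where "order_date = '2015-06-01'" \
  --hive-import \
  --hive-table analytics.orders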
1. Apache Sqoop and Apache Flume work with different kinds of data sources. Flume works well with streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop is designed to work well with any kind of relational database system that has JDBC connectivity.
2. Sqoop can also import data from NoSQL databases like MongoDB or Cassandra, and allows direct data transfer to Hive or HDFS. For transferring data to Hive using Apache Sqoop, a table has to be created, for which the schema is taken from the source database itself.
3. In Apache Flume data loading is event driven, whereas in Apache Sqoop the data load is not driven by events.
4. Flume is a better choice when moving bulk streaming data from sources like JMS or a spooling directory, whereas Sqoop is an ideal fit if the data is sitting in databases like Teradata, Oracle, MySQL Server, Postgres or any other JDBC-compatible database.
5. In Apache Flume, data flows to HDFS through multiple channels, whereas in Apache Sqoop HDFS is the destination for imported data.
6. Apache Flume has an agent-based architecture, i.e. the code written in Flume is known as an agent, which is responsible for fetching data, whereas Apache Sqoop's architecture is based on connectors. The connectors in Sqoop know how to connect to the various data sources and fetch the data accordingly.
7. Lastly, Sqoop and Flume cannot be used to achieve the same tasks, as they are developed specifically to serve different purposes. Apache Flume agents are designed to fetch streaming data like tweets from Twitter or log files from a web server, whereas Sqoop connectors are designed to work only with structured data sources and fetch data from them.
Apache Sqoop is mainly used for parallel data transfers and data imports, as it copies data quickly, whereas Apache Flume is used for collecting and aggregating data because of its distributed, reliable nature and highly available backup routes.
Sqoop and Flume are both meant to fulfill data ingestion needs, but they serve different purposes. Apache Flume works well for streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop works well with any RDBMS that has JDBC connectivity.
Sqoop is meant for bulk data transfers between Hadoop and any other structured data store. Flume collects log data from many sources, aggregates it, and writes it to HDFS.
I came across an interesting infographic that explains the differences between the two Apache projects, Sqoop and Flume:
Difference between Sqoop and Flume
Sqoop
Sqoop can perform import/export between an RDBMS and HDFS/Hive/HBase.
Sqoop only imports/exports structured data, not unstructured or semi-structured data.
Flume
Flume imports streaming data from multiple sources, mostly semi-structured and unstructured in nature. Kafka is now a better alternative to Flume.

SQOOP export command VS DB2 LOAD CLIENT

I have a scenario where I have to copy data from Hive to DB2. There are two ways I can implement this: one is the Sqoop export command and the other is the DB2 load client. I need to know which is the better approach with respect to performance. Please give me a suggestion.
Sqoop can be used to transfer large data files in HDFS to DB2 concurrently (using mappers). I have no idea about the DB2 load client.
It depends. If you are using DB2 LUW with the Sqoop connector, it can be faster depending on how many cluster nodes (mappers) you have available. DB2 LOAD (at least in the z/OS world) can do parallel loading, so depending on how many CPs the database system has, that could be faster. So I guess it depends on your environment (the database system vs the Hadoop cluster).
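For reference, a Sqoop export from a Hive warehouse directory into DB2 might be sketched like this (host, port, database, table, path and mapper count are placeholders; '\001' is Hive's default field delimiter):

sqoop export \
  --connect jdbc:db2://db2-host:50000/SALESDB \
  --username etl_user -P \
  --table ORDERS \
  --export-dir /user/hive/warehouse/orders \
  --input-fields-terminated-by '\001' \
  --num-mappers 8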
