SQOOP export command VS DB2 LOAD CLIENT - performance

I have a scenario where I have to copy data from Hive to DB2. There are two ways I can implement this: one is the Sqoop export command and the other is the DB2 load client. I need to know which approach is better with respect to performance. Please give me a suggestion.

Sqoop can transfer a large data file in HDFS to DB2 concurrently (using mappers). I have no idea about the DB2 load client.

It depends. If you are using DB2 LUW, the Sqoop connector can be faster depending on how many mappers your cluster can make available. DB2 LOAD (at least in the z/OS world) can do parallel loading, so depending on how many CPs the database system has, that could be faster. So I guess it depends on your environment (the database system vs. the Hadoop cluster).
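For reference, a Sqoop export from a Hive table's warehouse directory into DB2 LUW might look like the sketch below. Host, port, credentials, table and path are placeholders, the DB2 JDBC driver jar has to be on Sqoop's classpath, and \001 is assumed to be the Hive field delimiter; the --num-mappers value controls the degree of parallelism discussed above.

sqoop export \
  --connect jdbc:db2://db2host:50000/SALESDB \
  --username dbuser --password-file /user/etl/db2.pwd \
  --table SALES_FACT \
  --export-dir /user/hive/warehouse/sales_fact \
  --input-fields-terminated-by '\001' \
  --num-mappers 8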

Related

Oracle Hadoop Connectors vs Sqoop

I have used Sqoop to ingest data from Oracle to Hadoop and it worked well. It took only 4 minutes to bring 86 million records from Oracle into a Hive table, without using partitions in Sqoop. Can anyone give some details about the Oracle Hadoop connectors? Will they perform better than Sqoop?
Most of the connectors will give close to the same performance, because at the very end of your workflow you will have a set of MapReduce jobs, and those play the main role in your overall performance.
Oracle provides a set of different connectors for accessing Hive, and you can check a nice overview of the standard solutions, but I doubt that in the end you will see significant performance differences beyond what you see with Sqoop:
https://docs.oracle.com/cd/E37231_01/doc.20/e36961/start.htm#BDCUG119
Sqoop is a generic tool for working with relational databases from the Hadoop realm, and it is not limited to Oracle. Besides, it integrates with other Hadoop tools such as Oozie for building complicated workflows, which makes it a good candidate over other types of connectors.
Personally, I prefer Sqoop for Hadoop-driven import-export operations and the connector approach for querying data in Hadoop.
Sqoop will use a standard JDBC connection. Oracle's connector works with a fastloader/fastexport class integrated into the Sqoop connection, so it should be faster than plain Sqoop.
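For a rough comparison, a plain JDBC-based Sqoop import from Oracle into HDFS might look like the sketch below (connection string, credentials, table, split column and target directory are all placeholders). On Sqoop versions that bundle the Oracle connector, adding --direct switches the same job to that faster code path.

sqoop import \
  --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
  --username scott --password-file /user/etl/ora.pwd \
  --table SALES \
  --split-by SALE_ID \
  --num-mappers 8 \
  --target-dir /data/sales
# add --direct to use the bundled Oracle connector instead of plain JDBC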

How to push data from SQL to HDFS

I have the following use case:
We have several SQL databases in different locations and we need to load some data from them into HDFS.
The problem is that we do not have access to those servers from our Hadoop cluster (due to security concerns), but we can push data to our cluster.
Is there any tool like Apache Sqoop to do such bulk loading?
Dump the data from your SQL databases as files in some delimited format, for instance CSV, and then use a simple hadoop fs -put to copy all the files to HDFS.
That's it.
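A minimal sketch of that push-style flow, assuming a MySQL source and an edge node that can reach the cluster (host names, database, table and HDFS paths are placeholders; the tab-to-comma conversion is rough and does not handle embedded delimiters or NULLs):

# dump a table as delimited text on the database side
mysql -h dbhost -u etl -p --batch -e "SELECT * FROM mydb.orders" | tr '\t' ',' > orders.csv

# push the file to HDFS from the edge node
hadoop fs -mkdir -p /data/mydb/orders
hadoop fs -put orders.csv /data/mydb/orders/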
Let us assume I am working at a small company with a 30-node cluster, processing 100 GB of data daily. This data comes from different sources, such as RDBMSs like Oracle, MySQL, IBM Netezza, and DB2. We do not need to install Sqoop on all 30 nodes; the minimum number of nodes Sqoop has to be installed on is 1. After installing it on one machine, we can reach those source databases from there and import the data using Sqoop.
As far as security is concerned, no import can be done unless the database administrator runs the following two commands:
MYSQL> grant all privileges on mydb.table to ''@'IP address of the Sqoop machine';
MYSQL> grant all privileges on mydb.table to '%'@'IP address of the Sqoop machine';
These two commands have to be run by the admin.
Then we can use our sqoop import commands, and so on.
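Once the grants are in place, a typical import from that MySQL host could look like this sketch (database, table, credentials and target directory are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --username sqoopuser --password-file /user/etl/mysql.pwd \
  --table customers \
  --split-by customer_id \
  --num-mappers 4 \
  --target-dir /data/mydb/customers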

Sqoop vs Informatica Big Data edition for Data sourcing

I have the option of using Sqoop or Informatica Big Data Edition to source data into HDFS. The source systems are Teradata and Oracle.
I would like to know which one is better, and the reasons behind it.
Note:
My current utility is able to pull data into HDFS using Sqoop and create a Hive staging table and an archive external table.
Informatica is the ETL tool used in the organization.
Regards
Sanjeeb
Sqoop
Sqoop is capable of performing full and incremental loads from Oracle/Teradata (see the sketch after this list).
Sqoop does parallel copy of data from source systems.
Sqoop scripts can be custom generated and scheduled by Oozie.
Open source solution for any size cluster. No license cost.
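For instance, an incremental append import driven by a numeric key might look like the sketch below (connection string, credentials, table, column and path names are placeholders); the same command can be wrapped in an Oozie sqoop action for scheduling:

sqoop import \
  --connect jdbc:oracle:thin:@//orahost:1521/ORCL \
  --username etl --password-file /user/etl/ora.pwd \
  --table ORDERS \
  --incremental append \
  --check-column ORDER_ID \
  --last-value 1000000 \
  --split-by ORDER_ID \
  --num-mappers 8 \
  --target-dir /data/orders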
Informatica
Best interface in the ETL industry for managing mappings.
Does not provide parallel copy options, but provides a Hive mode for parallel processing: it basically converts transformations into Hive queries for execution. It also supports pushdown to generate MapReduce code.
Licensing cost is per node. If you plan on 500 Hadoop nodes for future data storage, you will pay roughly 10 times as much as for a 50-node cluster when you scale out.
Informatica BDE is a relatively new product in the market. INFA Developer will be useful for working on big data, but there are challenges in supporting all the latest Hadoop platform features in Informatica, as well as traditional RDBMS features like sequence generation, stateful mappings, sessions, and lookup transformations in Informatica BDE.
Informatica MDM does not support Hadoop.
If price is the criterion for decision making, go for Sqoop. If you want the flexibility of switching Hadoop platform tools, use Sqoop (the Sqoop project is also considering moving over to Spark).
If you are tied to Informatica for some reason, go for Informatica. But most Informatica developers want to move to Hadoop technologies.
Although this was asked a year ago, sharing the new features in Informatica:
Informatica BDM version 10.1 supports Sqoop connectivity, i.e. you can use Sqoop to read data from an RDBMS and load it into Hadoop/Hive.
Also, there are many new features in BDM version 10.2, especially the parameterization support in the developer tool and dynamic mappings.
Tool versus hand-coding has always been a debate.
The Informatica tool gives an enterprise-level solution which is easier to maintain.
BDM 10.1.1 supports Sqoop with the Spark engine. Spark 2.0.1 is supported in this version, so performance is pretty good.
BDM 10.2 has just been released, with new features like stateful variable support which was missing in earlier versions.
Sqoop should be used for the data exchange; it gives you a lot of options with which you can get optimal performance. Also, if you try to exchange data along the path RDBMS (Teradata/Oracle) <-> Informatica <-> Hadoop cluster, the data first needs to be brought to the Informatica server, which may involve additional I/O.
If the data processing must be done within Hive, Informatica BDE should be used.

Build an application for reporting and analysis on Hadoop framework

I have an application with SAS where I pull data from Oracle and produce reports in Excel using Base SAS and SAS macros. Now the problem is that my database is getting bigger day by day, and fetching data from Oracle takes more and more time, so my jobs are running slowly.
So I want my application to be built on Hadoop for reporting and analysis purposes. Can someone please suggest an approach and the tools I would need to use for this?
The short answer is: it depends.
For unloading data from Oracle I would recommend Sqoop (http://sqoop.apache.org/). It is designed for this specific use case, can even do incremental loads, and can create a Hive table for the unloaded data.
Once the data is unloaded, you can use Impala to build the report you need. Impala can natively work with Hive tables, so things are really simple. Of course, you would have to rewrite your SAS code as a set of SQL statements that run on top of Impala.
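For example, once the unloaded data is exposed as a Hive table (a hypothetical sales table is assumed here, as are the column names and the impalad host), the Impala side can be as simple as:

impala-shell -i impalad-host -q "INVALIDATE METADATA sales"
impala-shell -i impalad-host -q "SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region ORDER BY total_amount DESC"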
Next, if you need a visualization tool to run on top of it, you can try something like Tableau or any other tool that can use ODBC/JDBC to connect to Impala.
Finally, I think Hadoop + Sqoop + Impala would cover your needs. But I'd also recommend taking a look at MPP databases, because using SAS means you have fairly structured data, and an MPP database might be a better fit for this case.

HDFS for Teradata

As per my understanding, HDFS is useful for data that is unstructured and large in quantity. I wanted to know: is it possible to use HDFS with Teradata, given that Teradata is an RDBMS and hence not so unstructured?
Also, how does HDFS come into the picture with the database anyway? Is it that the file system contains the data, or how exactly does it work, in simple terms? Thanks
With Teradata DB itself - no.
However :), Teradata provides the so-called UDA (Unified Data Architecture), where Teradata, Aster DB and Hadoop (HDFS) are interconnected and can work together almost seamlessly :).
In general, if you want to work with unstructured data only, choose Aster, which is a Teradata product and can connect to HDFS directly. HDFS is used here as cheap and fast data storage.
An even more interesting solution will come with the new Aster version (6), where AFS (Aster File System) is going to be implemented. AFS is a distributed filesystem similar to HDFS. I'm looking forward to giving it a try as well ;)
To add some more details to xhudik's answer:
To connect Teradata with Hadoop, you need a connector. One is called Teradata QueryGrid for Hadoop. It is an add-on to the Teradata DWH and connects to HCatalog, and HCatalog connects to HDFS.
You can also use the Teradata Connector for Hadoop, which is a Sqoop extension, so you can connect to Teradata from Hadoop.
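As an illustration of the Sqoop route (not an official Teradata Connector for Hadoop example), a plain JDBC-based import from Teradata might look like the sketch below; host, database, table and path names are placeholders, and the Teradata JDBC driver jars (terajdbc4.jar, tdgssconfig.jar) must be available to Sqoop:

sqoop import \
  --connect jdbc:teradata://tdhost/DATABASE=edw \
  --driver com.teradata.jdbc.TeraDriver \
  --username etl --password-file /user/etl/td.pwd \
  --table CUSTOMER \
  --split-by CUSTOMER_ID \
  --num-mappers 4 \
  --target-dir /data/edw/customer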
