Sqoop vs Informatica Big Data Edition for data sourcing - Hadoop

I have the option of using Sqoop or Informatica Big Data Edition to source data into HDFS. The source systems are Teradata and Oracle.
I would like to know which one is better, and the reasons behind it.
Note:
My current utility is able to pull data into HDFS using Sqoop, create a Hive staging table, and an archive external table.
Informatica is the ETL tool used in the organization.
Regards
Sanjeeb

Sqoop
Sqoop is capable of performing full and incremental loads from Oracle/Teradata.
Sqoop copies data from source systems in parallel (multiple mappers); see the sketch after this list.
Sqoop scripts can be custom generated and scheduled through Oozie.
It is an open source solution for any cluster size, with no license cost.
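As a rough illustration of the first three points, here is a hedged sketch of an incremental, parallel Sqoop import wrapped in Python so it could be scheduled from a script; the connection string, credentials, table, and column names are placeholders rather than details from the question.
    # Hedged sketch: incremental, parallel Sqoop import from Oracle into a Hive
    # staging table. All hosts, credentials, tables and columns are placeholders;
    # assumes the sqoop client is available on the PATH.
    import subprocess

    sqoop_import = [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@//oradb.example.com:1521/ORCL",
        "--username", "etl_user",
        "--password-file", "/user/etl/.ora_pwd",   # keeps the password off the command line
        "--table", "ORDERS",
        "--num-mappers", "8",                       # parallel copy with 8 map tasks
        "--split-by", "ORDER_ID",                   # column used to divide work among mappers
        "--incremental", "append",                  # only rows beyond the last stored value
        "--check-column", "ORDER_ID",
        "--last-value", "1000000",
        "--hive-import",                            # land the data in a Hive staging table
        "--hive-table", "staging.orders",
    ]
    subprocess.run(sqoop_import, check=True)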
Informatica
Arguably the best interface in the ETL industry for managing mappings.
Does not provide a parallel copy option of its own; instead it offers a Hive mode for parallel processing, essentially converting transformations into Hive queries for execution. It also supports pushdown to generate MapReduce code.
Licensing is per node. If you plan on 500 Hadoop nodes for future data storage, you pay roughly 10 times what a 50-node cluster costs when you scale up.
Informatica BDE is a relatively new product in the market. INFA Developer will be useful for working on big data, but there are challenges in supporting the latest Hadoop platform features on Informatica, as well as traditional RDBMS features such as sequence generation, stateful mappings, sessions, and Lookup transformations in Informatica BDE.
Informatica MDM does not support Hadoop.
If price is the criterion for decision making, go for Sqoop. If you want to leverage the flexibility of switching Hadoop platform tools, use Sqoop (the Sqoop project is also considering moving over to Spark).
If you are tied to Informatica for some reason, go for Informatica. But most Informatica developers want to move to Hadoop technologies.

Although this was asked a year ago, I am sharing the new features in Informatica.
Informatica BDM version 10.1 supports Sqoop connectivity, i.e. you can use Sqoop to read data from an RDBMS and load it into Hadoop/Hive.
Also, there are many new features in BDM version 10.2, especially the parameterization support in the developer tool and dynamic mappings.

The tool-versus-hand-coding debate has always been there.
The Informatica tool gives an enterprise-level solution which is easier to maintain.
BDM 10.1.1 supports Sqoop with the Spark engine. Spark 2.0.1 is supported in this version, so performance is pretty good.
BDM 10.2 has just been released, with new features like stateful variable support which was missing in earlier versions.

Sqoop should be used for the data exchange. It gives you a lot of options with which you can achieve optimal performance. Also, if you are trying to exchange data along the path RDBMS (Teradata/Oracle) <-> Informatica <-> Hadoop cluster, the data would first need to be brought to the Informatica server, which may involve additional I/O.
If the data processing must be done within Hive, Informatica BDE should be used.

Related

How can data analytics be implemented on data that resides in multiple Oracle databases?

I am new to data analytics and big data concepts, and I am stuck deciding which technology to use to implement my requirement.
My need is as follows:
My client is using more than one Oracle database as their organization's ERP backend. These two databases have different structures and different types of data. I need to create a data analytics application using the data from these two databases. What technology can I adopt for this implementation? Can I go with Hadoop and its associated applications?
If I go with Hadoop, how can I sync my Oracle databases to Hadoop? I am looking for a solution with real-time syncing.
Or can I use a native connection to the databases to implement the database access and create my new application?
The size of the databases would be around 1.5 TB.
There are a lot of layers to this question, so I'll keep it somewhat general to give you a push in the right direction.
You suggest two approaches - one would keep your data in Oracle, the other would bring it to Hadoop.
If you stay in Oracle, you may need to use a DI tool such as Informatica, Pentaho, SAS DI or SAS Enterprise to interrogate the different tables in different schemas, extract the data you need, and call in analytics either from native steps or by integrating Python, R or Weka scripts.
To the best of my knowledge, Hadoop doesn't natively integrate with Oracle but instead manages its own file system, HDFS. Sqoop jobs run on Hadoop can extract from Oracle and write to Hive or HBase tables, and then your integration would be using Hive Context on Spark, which enables you to perform analytics.
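As a rough sketch of that last step (assuming Spark 1.x and invented table/column names), a Hive table populated by Sqoop can be queried through a HiveContext:
    # Hedged sketch: querying a Sqoop-loaded Hive table with Spark's HiveContext
    # (Spark 1.x style API). Table and column names are illustrative only.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="erp-analytics")
    hc = HiveContext(sc)

    # Aggregate directly against the Hive metastore-backed table
    df = hc.sql("""
        SELECT customer_id, SUM(amount) AS total_amount
        FROM staging.orders
        GROUP BY customer_id
    """)
    df.show(20)
    sc.stop()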
You may be able to interrogate the databases directly using R or Python. Packt offered a guide at one point on Business Intelligence Using R that included chapters on the ETL (Extract-Transform-Load) process using R. I will tell you this isn't a common solution in the industry, because R is primarily an analyst's language, not an ETL developer's tool. That said, R should be able to query most Oracle databases unless they're really old, and perform the integration and analytics. The downside is that R's kernel may need more processing power and threads than RStudio can provide - this is why Oracle SQL Developer and Toad handle large-scale queries so well. Python can probably take the same approach using the cx_Oracle library.
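For the Python route, a minimal sketch with cx_Oracle and pandas; the connection details, schema, and query are placeholders:
    # Hedged sketch: pulling data from one of the Oracle ERP schemas into pandas
    # for analysis. Connection details and the query are placeholders.
    import cx_Oracle
    import pandas as pd

    dsn = cx_Oracle.makedsn("oradb.example.com", 1521, service_name="ERP1")
    conn = cx_Oracle.connect(user="analyst", password="secret", dsn=dsn)

    df = pd.read_sql("SELECT order_id, customer_id, amount FROM sales.orders", conn)
    print(df.groupby("customer_id")["amount"].sum().head())

    conn.close()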

What do we do with the processed data (output) from Hadoop DB?

I am a newbie to Hadoop, so my apologies if my question is too immature.
I understand Hadoop is used for analyzing large data sets.
In the end, what do we do with the analyzed data - create reports and presentations?
For example:
In the case of SSRS reports, the reports are generated from the result data that is pulled from an RDBMS using SQL queries.
But how do things work with a Hadoop-based DB? If a client requests a particular report that needs data points from the Hadoop DB, what would the flow be?
I am sure the client will not directly run a job in Hadoop to pull the data needed for its report generation, as a Hadoop job takes more time to process.
My question is: when running MR jobs on the Hadoop DB, is the processed data (result set) stored in an intermediate DB, such as an RDBMS,
so that the client can pull the required data for generating reports?
Kindly clarify this for me.
Hadoop has 2 main components:
Distributed storage (HDFS)
Distributed computing (MapReduce)
Hadoop should be viewed more as a distributed operating system, with HDFS as the distributed storage and MapReduce as the kernel. There are many tools such as Hive, Pig, Sqoop, Impala, Datameer, Spark, etc. which can leverage these distributed capabilities.
Once you run heavyweight data processing such as ETL, you can load the data back into a lightweight relational database and connect enterprise BI tools such as SSRS for reporting purposes. Also, BI tools like Tableau have connectors to Hadoop via Spark with which you can report directly out of Hadoop. Datameer is a Hadoop-based visualization tool which can be used to report on the data.
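To make the "load the data back" step concrete, a Sqoop export can push an HDFS result set (produced by an MR or Hive job) into an RDBMS reporting table that a tool like SSRS can query; the connection string, paths, and table names below are placeholders:
    # Hedged sketch: exporting an aggregated HDFS result set back into an RDBMS
    # reporting table via Sqoop. All names, paths and credentials are placeholders.
    import subprocess

    sqoop_export = [
        "sqoop", "export",
        "--connect", "jdbc:sqlserver://reportdb.example.com:1433;databaseName=Reports",
        "--username", "report_etl",
        "--password-file", "/user/etl/.mssql_pwd",
        "--table", "daily_sales_summary",                   # pre-created reporting table
        "--export-dir", "/warehouse/summary/daily_sales",   # HDFS output of the batch job
        "--input-fields-terminated-by", "\t",
        "--num-mappers", "4",
    ]
    subprocess.run(sqoop_export, check=True)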
In short, one should not compare tools like SSRS with Hadoop. Hadoop is a technology which provides distributed capabilities seamlessly, and the ecosystem around it can be used to solve business problems by leveraging them.

Build an application for reporting and analysis on Hadoop framework

I have an application with SAS where I pull data from Oracle and produce reports to Excel using Base SAS and SAS macros. Now the problem is that day by day my database is getting huge, and fetching data from Oracle is taking more time; as a result my jobs are running slowly.
So I want my application to be built on Hadoop for reporting and analysis purposes. Can someone please suggest an approach and the tools I need to use for this?
The short answer is: it depends.
For unloading data from Oracle I would recommend using Sqoop (http://sqoop.apache.org/); it is designed for this specific use case, can even do incremental loads, and can create a Hive table for the unloaded data.
When the data is unloaded, you can use Impala to build the report you need. Impala can natively work with Hive tables, so things are really simple. Of course, you would have to rewrite your SAS code as a set of SQL statements that would run on top of Impala.
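To give a feel for that step, one way to run the rewritten SQL against Impala from a script is the impyla client (any ODBC/JDBC-capable tool works just as well); the host, port, and query are placeholders:
    # Hedged sketch: running a report query against Impala through the impyla
    # client. Host, port and the query itself are placeholders.
    from impala.dbapi import connect

    conn = connect(host="impalad.example.com", port=21050)
    cur = conn.cursor()
    cur.execute("""
        SELECT region, SUM(revenue) AS total_revenue
        FROM sales_mart.orders
        GROUP BY region
        ORDER BY total_revenue DESC
    """)
    for region, total in cur.fetchall():
        print(region, total)
    cur.close()
    conn.close()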
Next, if you need a visualization tool to run on top of it, you can try something like Tableau or any other tool that is capable of using ODBC/JDBC to connect to Impala.
Finally, I think Hadoop + Sqoop + Impala would cover your needs. But I'd also recommend taking a look at MPP databases, because using SAS means you have pretty structured data, and an MPP database would be a better fit for this case.

Oracle to Hadoop data ingestion in real-time

I have a requirement to ingest the data from an Oracle database to Hadoop in real-time.
What's the best way to achieve this on Hadoop?
The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.
Other things that matter for this answer are:
What is the target for the data and what are you going to do with it?
just store plain HDFS files and access for adhoc queries with something like Impala?
store in HBase for use in other apps?
use in a CEP solution like Storm?
...
What tools is your team familiar with?
Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
or do you prefer a Data integration tool like Informatica?
Coming back to CDC, there are three different approaches to it:
Easy: if you don't need true real-time and have a way to identify new data with a SQL query that executes fast enough for the required data latency, then you can run this query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools); see the sketch after this list.
Complicated: roll your own CDC solution: download the database logs, parse them into a series of inserts/updates/deletes, and ingest these into Hadoop.
Expensive: buy a CDC solution that does this for you (like GoldenGate or Attunity).
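A bare-bones sketch of the "easy", query-based approach: poll Oracle on a watermark column and hand each chunk of new rows to whatever landing step you use (the table, columns, connection details, and the landing step itself are placeholders):
    # Hedged sketch of the query-based ("easy") approach: repeatedly select rows
    # newer than a stored watermark. Table, columns and connection details are
    # placeholders; the landing step is left as a stub.
    import time
    import cx_Oracle

    conn = cx_Oracle.connect(user="cdc_user", password="secret",
                             dsn="oradb.example.com:1521/ERP1")
    watermark = "2020-01-01 00:00:00"   # last successfully ingested UPDATED_AT value

    while True:
        cur = conn.cursor()
        cur.execute(
            "SELECT order_id, status, updated_at FROM orders "
            "WHERE updated_at > TO_TIMESTAMP(:wm, 'YYYY-MM-DD HH24:MI:SS') "
            "ORDER BY updated_at",
            wm=watermark)
        rows = cur.fetchall()
        if rows:
            # Placeholder landing step: in a real pipeline this is where you would
            # write the chunk to HDFS, publish it to Kafka, etc.
            print("ingesting %d changed rows" % len(rows))
            watermark = rows[-1][2].strftime("%Y-%m-%d %H:%M:%S")
        cur.close()
        time.sleep(60)   # the data latency you can tolerate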
Expanding a bit on what #Nickolay mentioned, there are a few options, but the best would be too opinion-based to state.
Tungsten (open source)
Tungsten Replicator is an open source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle and Amazon RDS, and applied to transactional stores, including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data warehouse stores such as Vertica, Hadoop, and Amazon RDS.
Oracle GoldenGate
Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.
Dell Shareplex
SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near real-time copy of the source tables.
Apache Sqoop is a data transfer tool for transferring bulk data from any RDBMS with JDBC connectivity (it supports Oracle as well) to Hadoop HDFS.

Spark in Business Intelligence

Currently I am doing a project in the Business Intelligence and Big Data area, two areas in which, in all honesty, I am new and very green.
I was planning to build a Hive data warehouse using MongoDB and connect it with a Business Intelligence platform like Pentaho. While researching I came across Spark and got interested in its Shark module due to its in-memory functionality and the increase in performance when doing queries.
I know that I can connect Hive to Pentaho, but what I was wondering is whether I could use Shark queries between them for performance. If not, does anyone know of any other BI platform that would allow that?
As I said, I am pretty new to these areas, so feel free to correct me, since there is a good chance I have some concepts mixed up and have said something idiotic.
I think that you should build a Hive data warehouse using Hive, or a MongoDB data warehouse using MongoDB. I didn't understand how you are going to mix them, but I will try to answer the question anyway.
Usually, you configure in the BI tool a JDBC driver for the DB of your choice (e.g. Hive), and the BI tool fetches the data using that JDBC driver. How the driver fetches the data from the DB is completely transparent to the BI tool.
Thus, you can use Hive, Shark or any other DB that comes with a JDBC driver.
I can summarize your options this way:
Hive: has the most complete feature set and is the most compatible tool. It can be used over plain data, or you can ETL the data into its ORC format to boost performance (see the sketch after this list).
Impala: claims to be faster than Hive but has a less complete feature set. It can be used over plain data, or you can ETL the data into its Parquet format to boost performance.
Shark: cutting edge, not mainstream yet. Performance depends on what percentage of your data can fit into RAM across your cluster.
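A sketch of the ORC option mentioned for Hive above, run here through Spark's HiveContext with a plain CTAS statement; the database and table names are invented for the example:
    # Hedged sketch: converting a plain-text Hive table into ORC with a CTAS
    # statement executed through HiveContext. Database/table names are illustrative.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="orc-etl")
    hc = HiveContext(sc)

    # Queries against the ORC copy benefit from columnar storage and predicate pushdown.
    hc.sql("""
        CREATE TABLE IF NOT EXISTS dw.events_orc
        STORED AS ORC
        AS SELECT * FROM raw.events
    """)
    sc.stop()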
First of all, Shark is being absorbed by Spark SQL.
Spark SQL provides a JDBC/ODBC connector. That should allow you to integrate it with most of your existing platforms.
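For example, the Spark SQL Thrift server speaks the HiveServer2 protocol, so a HiveServer2 client such as PyHive (or any ODBC/JDBC-capable BI tool) can query it; the host, port, and table below are placeholders:
    # Hedged sketch: querying the Spark SQL Thrift server through PyHive, which
    # uses the HiveServer2 protocol. Host, port and table are placeholders.
    from pyhive import hive

    conn = hive.connect(host="sparksql.example.com", port=10000)
    cur = conn.cursor()
    cur.execute("SELECT product, COUNT(*) AS cnt FROM dw.sales GROUP BY product")
    for product, cnt in cur.fetchall():
        print(product, cnt)
    cur.close()
    conn.close()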
