Hadoop remote data sources connectivity

What are the available Hadoop remote data source connectivity options?
I know about drivers for MongoDB, MySQL, and Vertica connectivity, but what other data sources have drivers for Hadoop connectivity?

These are a few I am aware of:
Oracle
ArcGIS Geodatabase
Teradata
Microsoft SQL Server 2008 R2 Parallel Data Warehouse (PDW)
PostgreSQL
IBM InfoSphere warehouse
Couchbase
Netezza
Tresata
But I am still wondering about the intent of this question. Every data source fits a particular use case: Couchbase for document storage, Tresata for financial data, and so on. Are you going to choose your store based on connector availability? I don't think so.

Any such list will be too long to be useful.
Just one reference: Cascading gives you access to almost anything you want to access. Moreover, you're not limited to Java; for example, the Scalding component provides a very good framework for Scala programmers.

Related

Retrieve data from data lake to analytical system

We have created a new data lake on the Hadoop file system. Data is stored in ORC format. Currently the analytical system connects directly to the data lake to read these ORC files.
Is there any way to create a middle layer between the data lake and the analytical system to serve the data?
Which analytical software are you using?
Is it possible to use an API or RESTful web services to access the data lake?
Please clarify a little more.
The question is very generic, but one common and easy way to build a data lake is to use Presto (https://prestodb.io).
Presto can read several formats, but it can also connect to different data sources such as MySQL databases and others, presenting the data as tables.
Clients can use SQL, also through JDBC/ODBC, and hence access the data lake even from tools like Excel or other analytical tools (MicroStrategy, Tableau, etc.).
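For programmatic access outside a BI tool, here is a minimal sketch using the presto-python-client (prestodb) DB-API interface; the host, catalog, schema, and table names are placeholders:

    import prestodb

    # Host, port, catalog, schema and table are placeholders for illustration.
    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )
    cur = conn.cursor()
    cur.execute("SELECT order_id, total FROM orders LIMIT 10")
    print(cur.fetchall())

The same query could equally be issued from Excel or Tableau through the Presto ODBC/JDBC drivers; the point is that the analytical tool only ever sees tables and SQL.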

How can data analytics be implemented on data that resides in multiple Oracle databases?

I am new to data analytics and big data concepts. I am stuck deciding which technology to use to implement my requirement.
My need is as follows:
My client is using more than one Oracle database as their organization's ERP backend. These two databases have different structures and different types of data. I need to create a data analytics application with the data from these two databases. What technology can I adopt for this implementation? Can I go with Hadoop and its associated applications?
If I go with Hadoop, how can I sync my Oracle databases to Hadoop? I am looking for a solution with real-time syncing.
Or can I use native database connections to implement the data access and build my new application?
The size of the databases would be around 1.5 TB.
There are a lot of layers to this question, so I'll keep it somewhat general to give you a push in the right direction.
You suggest two approaches - one would keep your data in Oracle, the other would bring it to Hadoop.
If you stay in Oracle, you may need to use a DI tool such as Informatica, Pentaho, SAS DI, or SAS Enterprise to interrogate the different tables in different schemas, extract the data you need, and run analytics either from native steps or by integrating Python, R, or Weka scripts.
To the best of my knowledge, Hadoop doesn't natively integrate with Oracle but instead manages its own file system, HDFS. Sqoop jobs running on Hadoop can extract from Oracle and write to Hive or HBase tables, and your integration would then use a Hive Context on Spark, which enables you to perform analytics.
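As a rough sketch of that last step, once Sqoop has loaded the data into a Hive table, a HiveContext on Spark (the Spark 1.x API) lets you run SQL over it from PySpark; the table and column names below are hypothetical:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    # Spark 1.x style: create a SparkContext, then a HiveContext on top of it.
    conf = SparkConf().setAppName("erp-analytics")
    sc = SparkContext(conf=conf)
    hc = HiveContext(sc)

    # Query the Hive table that Sqoop populated (table and columns are hypothetical).
    df = hc.sql("SELECT region, SUM(amount) AS total FROM erp_orders GROUP BY region")
    df.show()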
You may be able to interrogate the databases directly using R or Python. Packt offered a guide at one point, Business Intelligence Using R, that included chapters on the ETL (Extract-Transform-Load) process using R. I will tell you this isn't a common solution in the industry, because R is primarily an analyst's language, not an ETL developer's tool. That said, R should be able to query most Oracle databases (unless they're really old) and perform the integration and analytics. The downside is that R's kernel may need more processing power and threads than RStudio can provide, which is why Oracle SQL Developer and Toad handle large-scale queries so well. Python can probably take the same approach using the cx_Oracle library.
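As a small sketch of the Python route, cx_Oracle can query one of the Oracle databases directly; the connection string, credentials, and table below are made up for illustration:

    import datetime
    import cx_Oracle

    # EZConnect-style connection string: host:port/service_name (placeholder values).
    conn = cx_Oracle.connect("analyst", "secret", "erp-db1.example.com:1521/ERP1")

    cursor = conn.cursor()
    cursor.execute(
        "SELECT customer_id, order_date, amount FROM orders WHERE order_date >= :cutoff",
        cutoff=datetime.datetime(2016, 1, 1),
    )
    for customer_id, order_date, amount in cursor:
        pass  # feed rows into pandas, scikit-learn, etc.

    cursor.close()
    conn.close()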

Oracle to Hadoop data ingestion in real-time

I have a requirement to ingest the data from an Oracle database to Hadoop in real-time.
What's the best way to achieve this on Hadoop?
The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.
Other things that matter for this answer are:
What is the target for the data and what are you going to do with it?
just store plain HDFS files and access them for ad-hoc queries with something like Impala?
store in HBase for use in other apps?
use in a CEP solution like Storm?
...
What tools is your team familiar with?
Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
or do you prefer a Data integration tool like Informatica?
Coming back to CDC, there are three different approaches to it:
Easy: if you don't need true real-time and have a way to identify new data with an SQL query that executes fast enough for the required data latency, then you can run this query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools); see the polling sketch after this list.
Complicated: roll your own CDC solution: download the database logs, parse them into a series of inserts/updates/deletes, and ingest these into Hadoop.
Expensive: buy a CDC solution that does this for you (like GoldenGate or Attunity).
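To make the "easy" option concrete, here is a minimal polling sketch in Python using cx_Oracle and a timestamp watermark; every connection detail, table, and column name is hypothetical, and each chunk file would still need to be pushed into HDFS or loaded into Hive by a follow-up step:

    import csv
    import datetime
    import time
    import cx_Oracle

    # All connection details, table and column names are hypothetical.
    conn = cx_Oracle.connect("ingest", "secret", "erp-db.example.com:1521/ERP")
    watermark = datetime.datetime(1970, 1, 1)  # or a checkpoint saved from a previous run

    while True:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT id, modified_at, payload FROM source_table "
            "WHERE modified_at > :wm ORDER BY modified_at",
            wm=watermark,
        )
        rows = cursor.fetchall()
        cursor.close()

        if rows:
            # Write each chunk to a local file; a follow-up step moves it into HDFS/Hive.
            fname = "chunk_%s.csv" % datetime.datetime.now().strftime("%Y%m%d%H%M%S")
            with open(fname, "w", newline="") as f:
                csv.writer(f).writerows(rows)
            watermark = max(r[1] for r in rows)  # advance the watermark

        time.sleep(60)  # roughly one minute of data latency

Note that this only catches inserts and updates that touch the modified_at column; deletes and untracked updates are exactly why the log-based approaches above exist.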
Expanding a bit on what @Nickolay mentioned, there are a few options, but the best would be too opinion-based to state.
Tungsten (open source)
Tungsten Replicator is an open source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle, and Amazon RDS, and applied to transactional stores including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB; and data warehouse stores such as Vertica, Hadoop, and Amazon RDS.
Oracle GoldenGate
Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.
Dell Shareplex
SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near real-time copy of source tables.
Apache Sqoop is a data transfer tool for moving bulk data from any RDBMS with JDBC connectivity (Oracle included) into Hadoop HDFS.

Connect to Spark SQL via ODBC

According to this page: https://spark.apache.org/sql/ you can connect existing BI tools to Spark SQL via ODBC or JDBC:
I don't mean Shark, as that is basically EOL:
It is for this reason that we are ending development in Shark as a separate project and moving all our development resources to Spark SQL, a new component in Spark.
How would a BI tool (like Tableau) connect to Spark SQL via ODBC?
With the release of Spark SQL 1.1 you also have the Thrift JDBC server; see https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
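Since the Thrift server speaks the HiveServer2 protocol, a HiveServer2 client such as PyHive can also query it programmatically; the host and table below are placeholders:

    from pyhive import hive

    # The Spark Thrift JDBC/ODBC server listens on the HiveServer2 port (10000 by default).
    # Host and table name are placeholders.
    conn = hive.connect(host="spark-thrift.example.com", port=10000, username="analyst")
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM web_logs")
    print(cursor.fetchone())
    cursor.close()
    conn.close()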
Simba provides the ODBC driver that Databricks uses; however, that is only for the Databricks distribution. We are launching the public version for use with Apache Spark tomorrow (Wed, Dec 3rd) at www.simba.com. You'll be able to download and trial the driver for use with Tableau then.
As Carlos said, Stratio Meta is a module that acts as a parser, validator, planner, and coordinator layer over the different persistence layers (currently only Cassandra and MongoDB, but also HDFS in the short term). This module offers a shell with a SQL-like language, a Java/Scala API, a REST API, and ODBC (JDBC shortly). It also uses another Stratio module, Stratio Deep, which allows us to use Apache Spark to execute queries in an efficient and fast way.
Disclaimer: I am currently employed by Stratio Big Data
Please take a look at: http://www.openstratio.org/blog/connecting-to-the-stratio-big-data-platform-using-odbc-2/
Stratio is a platform that includes a certified Spark distribution that allows you to connect Spark to any type of data repository (like Cassandra, MongoDB, ...). It has an ODBC driver, so you can write SQL queries that will be translated into Spark jobs or, even faster, into direct queries to Cassandra (or whichever database you connect to it), if possible. This way, it is pretty simple to connect Tableau to Spark and your data repository. If you need any help, we will be more than glad to assist you.
Disclaimer: I'm one of Stratio's ODBC developers
Simba will offer one: http://databricks.com/blog/2014/04/30/Databricks-selects-Simba-ODBC-driver-for-shark.html. No known official release date.
[update]
Use Hive's ODBC driver to connect to Spark SQL as described here and here.
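The same ODBC route works programmatically as well; a minimal pyodbc sketch, assuming a DSN named SparkSQL has already been configured against the Hive (or Spark) ODBC driver:

    import pyodbc

    # Assumes an ODBC DSN named "SparkSQL" pointing at the Thrift server
    # has been configured with the Hive/Spark ODBC driver.
    conn = pyodbc.connect("DSN=SparkSQL", autocommit=True)
    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")
    for row in cursor.fetchall():
        print(row)
    conn.close()

A BI tool like Tableau simply points at the same DSN instead of calling it from code.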
For Spark on Azure HDInsight, you can connect Tableau (or PowerBI) as described here https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-bi-tools/. The ODBC driver is here: http://www.microsoft.com/en-us/download/details.aspx?id=47713

Spark in Business Intelligence

Currently I am doing a project in the Business Intelligence and Big Data area, two areas in which, in all honesty, I am new and very green.
I was planning to build a Hive data warehouse using MongoDB and connect it to a Business Intelligence platform like Pentaho. While researching I came across Spark and got interested in its Shark module due to its in-memory functionality and the performance increase when running queries.
I know that I can connect Hive to Pentaho, but what I was wondering is whether I could use Shark queries between them for performance. If not, does anyone know of another BI platform that would allow that?
As I said, I am pretty new to these areas, so feel free to correct me, since there is a good chance I have some concepts mixed up and have said something idiotic.
I think that you should build a Hive data warehouse using Hive, or a MongoDB data warehouse using MongoDB. I didn't understand how you are going to mix them, but I will try to answer the question anyway.
Usually, you configure the BI tool with a JDBC driver for the DB of your choice (e.g. Hive), and the BI tool fetches the data using that JDBC driver. How the driver fetches the data from the DB is completely transparent to the BI tool.
Thus, you can use Hive, Shark, or any other DB that comes with a JDBC driver.
I can summarize your options this way:
Hive: the most complete feature set and the most compatible tool. It can be used over plain data, or you can ETL the data into its ORC format to boost performance.
Impala: claims to be faster than Hive but has a less complete feature set. It can be used over plain data, or you can ETL the data into its Parquet format to boost performance.
Shark: cutting edge, not mainstream yet. Performance depends on what percentage of your data can fit into RAM across your cluster.
First of all, Shark is being absorbed by Spark SQL.
Spark SQL provides a JDBC/ODBC connector. That should allow you to integrate it with most of your existing platforms.
