Retrieve data from data lake to analytical system

Retrieve data from data lake to analytical system - hadoop

We have created a new data lake in Hadoop file system. Data is stored in the form of ORC. Currently analytical system is directly connecting to data lake to read these ORC file.
Is there any way to create a middle layer between data lake and analytical system to serve data ?

Which are your analytical software?
Is it possible to use the API or the RESTful web services to access the data lake?
Please, clarify a little more.

The question is very generic, but one common and easy way to build a data lake is to use Presto (https://prestodb.io).
Presto can read several formats, but also connect to different data sources like mysql databases and others, presenting the data as table.
Clients can use SQL, also through jdbc/odbc and hence access the data lake even from tools like excel, or other analytical tools (microstrategy, tableau, etc).

Related

ETL tool Snowflake

We are going to move from SQL server to Snowflake as our target database for the warehouse.
Today we have most of our ETL development done in ODI (Oracle Data Ingegrator).
So I'm intressted in to know if anyone is using ODI together with Snowflake and how it's woking.
And what experince/recommendations you have of other ETL tools together with Snowflake as target.
For example
Matillion
DBT
Xplenty
Today we have started with using NIFI moving the data from source to Azure blob storage.
But we are not sure if ODI is the right tool for the rest when we are in the cloud.
I'm really looking forward to see all your answers

Snowflake supports both transformations during (ETL) or after loading (ELT).
Snowflake works with a wide range of data integration tools, including Informatica, Talend, Tableau, Matillion and others.
In data engineering, new tools and self-service pipelines are eliminating traditional tasks such as manual ETL coding and data cleaning companies. With easy ETL or ELT options via Snowflake, data engineers can instead spend more time working on critical data strategy and pipeline optimization projects.
With a Snowflake as your data lake and data warehouse, ETL can be effectively eliminated, as no pre-transformations or pre-schemas are needed.
In addition, Snowflake Snowpark is designed to make building complex data pipelines a breeze and to allow developers to interact with Snowflake directly without moving data. Read more about Snowpark here.
https://www.snowflake.com/trending/etl-tools

If you started to transfer data from the source to Azure blob storage, I assume that you have a subscription in Azure and it is possible that Snowflake itself is placed in the Azure region.
In this case, I recommend using Azure Data Factory directly, so you have everything on one provider and support for data migration from SQL Server.
Link to documentation: Copy and transform data in Snowflake using Azure Data Factory

How data analytics can be implemented in the data resides in multiple oracle databases?

I am new to data analytics and the big data concepts. I stuck for deciding, what would be the technology to implement my requirement.
My need is as follows:
My client is using more than one oracle databases as their organization's ERP backend. These two databases having different structures and different types of data. I need to create a data analytics application with the data from these two databases. What technology can be adapted by me for this implementation. Can I go with Hadoop and it's associated applications?.
If I go with hadoop, how can I synch my oracle databases to hadoop. I am looking for a solution with realtime synching.
Or can I use the native connection with databases to implement the database access and create my new application?
The size of the databases would be around 1.5 TB.

There are a lot of layers to this question, so I'll keep it somewhat general to give you a push in the right direction.
You suggest two approaches - one would keep your data in Oracle, the other would bring it to Hadoop.
If you stay in Oracle, you may need to use a DI tool such as Informatica, Pentaho, SAS DI or SAS Enterprise to interrogate the different tables in different schemas, extract the data you need, and call in analytics either from native steps or by integrating Python, R or Weka scripts.
To the best of my knowledge, Hadoop doesn't natively integrate with Oracle but instead manages its own file system, HDFS. Sqoop jobs run on Hadoop can extract from Oracle and write to Hive or HBase tables, and then your integration would be using Hive Context on Spark, which enables you to perform analytics.
You may be able to interrogate the databases directly using R or Python. Packt offered a guide at one point on Business Intelligence Using R that included chapters on the ETL (Extract-Transform-Load) process using R. I will tell you this isn't a common solution in the industry because R is primarily an analyst's language, not an ETL Developer's tool. That said, R should be able to query most oracle databases unless they're really old and perform the integration and analytics. The downside is that R's kernel may need more processing power and threads than RStudio can provide - this is why Oracle SQL Developer and Toad handle large-scale queries so well. Python can probably perform the approach using the CX_oracle library.

Oracle to Hadoop data ingestion in real-time

I have a requirement to ingest the data from an Oracle database to Hadoop in real-time.
What's the best way to achieve this on Hadoop?

The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.
Other things that matter for this answer are:
What is the target for the data and what are you going to do with it?
just store plain HDFS files and access for adhoc queries with something like Impala?
store in HBase for use in other apps?
use in a CEP solution like Storm?
...
What tools is your team familiar with
Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
or do you prefer a Data integration tool like Informatica?
Coming back to CDC, there are three different approaches to it:
Easy: if you don't need true real-time and have a way to identify new data with an SQL query that executes fast enough for the required data latency. Then you can run this query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools)
Complicated: Roll your own CDC solution: download the database logs, parse them into series of inserts/updates/deletes, ingest these to Hadoop.
Expensive: buy a CDC solution, that does this for you (like GoldenGate or Attunity)

Expanding a bit on what #Nickolay mentioned, there are a few options, but the best would be too opinion based to state.
Tungsten (open source)
Tungsten Replicator is an open source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle and Amazon RDS, and applied to transactional stores, including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB, and datawarehouse stores such as Vertica, Hadoop, and Amazon rDS.
Oracle GoldenGate
Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.
Dell Shareplex
SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near real-time copy of source tables

Apache Sqoop is a data transfer tool to transfer bulk data from any RDBMS with JDBC connectivity(supports Oracle also) to hadoop HDFS.

Spark in Business Intelligence

Currently I am doing a project in Business Intelligence and Big Data area, 2 areas in which in all honesty I am new and very green.
I was planning to build a Hive Datawarehouse using MongoDB and connect it with a Business Intelligence platform like Pentaho. While researching I came across Spark and got interested in it's Shark module due to it's in-memory functionality and increase in performance while doing queries.
I know that I can connect Hive to Pentaho but the thing I was wondering is if I could use Shark queries between them for performance? If not is does anyone know of any other BI platform that would allow that?
As I said I am pretty new in this areas so feel free to correct me since there is a good chance of me having some concepts mixed up and having said something idiotic.

I think that you should build Hive Datawarehouse using Hive or MongoDB Datawarehouse using MongoDB. I didn't understand how you are going to mix them, but I will try to answer the question anyway.
Usually, you configure for a BI tool a JDBC driver for DB of your choice (e.g. Hive) and the BI tool fetches the data using that JDBC driver. How the driver fetches the data from DB is completely transparent for the BI tool.
Thus, you can use Hive, Shark or any other DB which comes with a JDBC driver.
I can summarize your options this way:
Hive: the most complete feature set, and is the most compatible tool. Can be used over plain data or, you can ETL the data into its ORC format boosting performance.
Impala: claims to be faster than Hive but has less complete feature set. Can be used over plain data or, you can ETL the data into its Parquet format boosting performance.
Shark: cutting edge, not mainstream yet. Performance depends on which percent of your data can fit into RAM over your cluster.

First of all Shark is being absorbed by Spark SQL.
SparkSQL provides a JDBC/ ODBC connector. That should allow you to integrate it with most of your existing platforms.

Hadoop remote data sources connectivity

What are the available hadoop remote data sources connectivity options?
I know about drivers for MongoDB, MySQL and Vertica connectivity but my question is what are other available data sources that have driver for hadoop connectivity?

These are the few I am aware of :
Oracle
ArcGIS Geodatabase
Teradata
Microsoft SQL Server 2008 R2 Parallel Data Warehouse (PDW)
PostgreSQL
IBM InfoSphere warehouse
Couchbase
Netezza
Tresata
But I am still wondering about the intent of this question. Every data source fits into a particular use case. Like, Couchbase for document data storage, Tresata for financial data storage and so on. Are you going to decide your store based on the connector availability??I don't think so.

Your list will be too long to be useful.
Just one reference: cascading gives you access to almost anything you want to access. More, you're not limited with Java. For example there is scalding component which provides very good framework for Scala programmers.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio