RDBMS & big data into a datamart? - hadoop

I have an RDBMS (SQL Server/Oracle) on one end and a Hadoop data store on the other.
The primary key 'customer' is common to both data stores.
A few questions:
Is it possible to have a datamart that can pull data from both the RDBMS and the big data store and produce reports? What would be an example of such a tool?
Does the datamart itself need to be an RDBMS store, or can it be something in-memory?
What's the best way to run data analytics in this environment?
What about data visualization?
Or should I just get all data into an RDBMS data warehouse and then solve for these questions?

Data virtualization or data federation is what you're looking for, i.e. the ability to query a single source that accesses multiple underlying resources as needed.
Databases usually have some limited capability in this area that lets you define external tables; see, for example, this link for Oracle and HDFS.
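To make the federation idea concrete, here is a minimal sketch in Python. It is not how any particular virtualization tool works internally. It assumes an in-memory SQLite database as a stand-in for SQL Server/Oracle and a plain list of rows as a stand-in for data fetched from Hadoop (for example, a Hive query result). The table name `customers` and the function names are hypothetical.

```python
import sqlite3

def load_rdbms_customers(conn):
    """Read customer master data from the relational side."""
    rows = conn.execute("SELECT customer, name FROM customers").fetchall()
    return {customer: name for customer, name in rows}

def federated_report(conn, hadoop_rows):
    """Join RDBMS customers with per-customer totals from the big-data side."""
    names = load_rdbms_customers(conn)
    totals = {}
    for customer, amount in hadoop_rows:
        totals[customer] = totals.get(customer, 0) + amount
    # Inner join on the shared 'customer' key; unknown customers are dropped.
    return sorted(
        (customer, names[customer], total)
        for customer, total in totals.items()
        if customer in names
    )

# Stand-in for SQL Server/Oracle (assumption): an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Stand-in for rows fetched from Hadoop (assumption): (customer, amount) pairs.
hadoop_rows = [(1, 10.0), (2, 5.0), (1, 2.5), (3, 99.0)]

report = federated_report(conn, hadoop_rows)
```

A real federation layer would push filters and aggregations down to each source instead of pulling rows into one process, but the join on the common 'customer' key is the same idea.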

Related

Mechanism to interact timeseries database in hadoop with structured RDBMS data

I am new to this Hadoop thing. What I want to accomplish is to store time series data in the cloud in a distributed system. From what I have read on the web, OpenTSDB seems to be a feasible option for this.
I also have some RDBMS databases whose data can be stored in a distributed system and accessed through Hive.
What we plan to do is use a time series database together with the structured RDBMS data (read and written via Hive), and then join the time series data with this structured data. The output should be stored in such a way that it can be read and written in an SQL-like fashion via Hive.
Not sure if you're asking about Hadoop or TSDBs.
If you already have a Hadoop environment, then sure, adding HBase and then OpenTSDB might make sense.
If you want alternatives that are more query-centric, then InfluxDB or TimescaleDB seem more popular in that realm.
If scalability is really the issue, then Cassandra with KairosDB is another option.
As far as Hive-like processing goes, Spark SQL can probably interact with all of the above.

Cassandra for datawarehouse

Is Cassandra a good alternative to Hadoop as a data warehouse where data is append-only, and where updates in the source databases should not overwrite existing rows in the data warehouse but be appended instead? Is Cassandra really meant to act as a data warehouse, or just as a database to store the results of batch/stream queries?
Cassandra can be used both as a data warehouse (raw data storage) and as a database (for final data storage). It depends more on what you want to do with the data.
You even may need to have both Hadoop and Cassandra for different purposes.
Assume you need to gather and process data from multiple mobile devices and provide some complex aggregation report to the user.
First, you need to save the data as fast as possible (as new portions arrive very often), so you use Cassandra here. As Cassandra is limited in aggregation features, you load the data into HDFS and do some processing via HQL scripts (assuming you're not very good at coding but great at complicated SQL). Then you move the report results from HDFS to Cassandra, into a dedicated reports table partitioned by user id.
So when a user wants an aggregation report about their activity in the last month, the application takes the id of the active user and returns the aggregated result from Cassandra (as it is a simple key-value lookup).
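The flow described above can be sketched in plain Python. This is only an illustration: a list of tuples stands in for the raw events written to Cassandra, a dict stands in for the reports table keyed by user id and month, and the function names and the `duration` field are made up for the example.

```python
from collections import defaultdict

def build_monthly_reports(raw_events):
    """Batch step: aggregate raw events into one row per (user_id, month)."""
    reports = defaultdict(lambda: {"events": 0, "total_duration": 0})
    for user_id, month, duration in raw_events:
        row = reports[(user_id, month)]
        row["events"] += 1
        row["total_duration"] += duration
    return dict(reports)

def get_report(reports, user_id, month):
    """Serving step: a single key lookup, no aggregation at read time."""
    return reports.get((user_id, month))

# Hypothetical raw events: (user_id, month, duration).
raw_events = [
    ("u1", "2024-01", 30),
    ("u1", "2024-01", 45),
    ("u2", "2024-01", 10),
    ("u1", "2024-02", 5),
]
reports = build_monthly_reports(raw_events)
january_u1 = get_report(reports, "u1", "2024-01")
```

The point is the split: the expensive aggregation happens once in the batch layer, and the read path only needs the partition key.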
So for your question, yes, it could be an alternative, but the selection strategy depends on the data types and your application business cases.
You can read more information about the usage of Cassandra here.

At what data volume should you choose Hadoop over a traditional database?

At what data volume should you choose Hadoop over a traditional database? What is the basic benchmark parameter for choosing a Hadoop system over a traditional database?
There is no specific "size" at which to move from an RDBMS to Hadoop. Two things to know:
They are very different (read on to know more).
The size of data that an RDBMS can handle depends on the capability of the database server.
Traditional databases are RDBMSs (Relational Database Management Systems), where we insert data as rows that get stored in the database. You may alter, query, or update the database.
Hadoop is a framework for storing and processing large amounts of data. It has two parts: storage (the Hadoop Distributed File System, HDFS) and processing (the MapReduce framework).
Hadoop stores data as files on its file system. So you cannot directly update, alter, or query it the way you would an RDBMS.
We do have SQL wrappers over Hadoop, like Hive or Impala, but they aren't as performant as an RDBMS on data that is not big data.
Even so, many are considering moving from an RDBMS to Hadoop because an RDBMS under-performs with large data (big data). Hadoop can be used as a data store, and queries over it can be run using Hive/Impala. Updates are not readily supported on Hadoop.
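Since files are written once rather than updated in place, a common workaround is to append a new version of a record and resolve the latest version at read time. Below is a minimal sketch of that idea in plain Python. The record layout `(key, version, value)` and the function name are assumptions for illustration, not a Hadoop API.

```python
def latest_versions(records):
    """Resolve current state from an append-only log of (key, version, value)."""
    current = {}
    for key, version, value in records:
        # Keep the record with the highest version seen so far for each key.
        if key not in current or version > current[key][0]:
            current[key] = (version, value)
    return {key: value for key, (version, value) in current.items()}

# Hypothetical append-only log: the "update" to cust-1 is a new appended row.
log = [
    ("cust-1", 1, "Berlin"),
    ("cust-2", 1, "Paris"),
    ("cust-1", 2, "Munich"),
]
state = latest_versions(log)
```

The old row for cust-1 is never modified; it is simply superseded when the data is read.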
There are many pros and cons of using Hadoop over an RDBMS. Read more here or here.

What is a Data warehouse in this use case

I'm trying to figure out the difference (between tools/services/programs) between Data Warehouse, Clustered Data Processing and the tools/infrastructure for querying a Data Warehouse
So let's say I have the following setup to perform some data processing for a certain use case:
A Hadoop cluster for distributed data processing
Hive for providing the infrastructure and functions for querying data from a data warehouse
My data sitting in an RDBMS or a NoSQL database
In the above example, what exactly is the data warehouse? My naive brain thinks that the RDBMS or the NoSQL database is the data warehouse in this context. But by definition, isn't a data warehouse a database used for reporting and data analysis? (Definition shamelessly stolen from Wikipedia.) So can I call a traditional RDBMS/NoSQL database a data warehouse? Thanks.
You cannot call every relational database system a data warehouse, since one of a data warehouse's main features is aggregating data from multiple databases (with different schemas). This is usually done with a "star schema", which allows combining multiple dimensions and multiple granularities.
Because NoSQL database systems (graph-based or map-reduce-based) are schema-less, they can indeed store data from different schemas. Moreover, Map-Reduce can be used to aggregate data with different granularities (e.g. aggregating daily data to compare it with monthly data).
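As a rough illustration of aggregating daily data to a monthly granularity in a map-reduce style, here is a small sketch in plain Python. It runs in a single process and only mimics the map and reduce phases; the function names and the `(date, value)` record shape are assumptions.

```python
from collections import defaultdict
from functools import reduce

def map_daily(record):
    """Map phase: emit (month, value) for a (YYYY-MM-DD, value) record."""
    date, value = record
    return (date[:7], value)

def reduce_by_key(pairs):
    """Shuffle and reduce phase: group values by key, then sum each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}

# Hypothetical daily measurements.
daily = [("2024-01-01", 3), ("2024-01-15", 4), ("2024-02-01", 10)]
monthly = reduce_by_key(map(map_daily, daily))
```

Once daily values are rolled up like this, they can be compared against data that only exists at monthly granularity.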

Look up data in a relational database from the Hadoop side

I am converting an SSIS solution to Hadoop for ETL processing in the data warehouse.
My expected system:
ETL - landing & staging (Hadoop) ----put-data---> Data-warehouse(MySQL)
The problem is: in the transform phase, I need to look up data in MySQL from the Hadoop side (Pig or MapReduce job). There are 2 solutions:
1st: Clone all the tables needed for lookups from MySQL into Hadoop. This means we need to maintain the data in 2 places.
2nd: Query MySQL directly. I am worried about too many connections hitting the MySQL server.
What is the solution/best practice for this problem? Are there any other solutions?
You will have to have some representation of your dimension tables in Hadoop. Depending on how you do the ETL of the dimension data, you might actually get them as a side effect of the ETL.
Are you planning to store the most granular fact data in MySQL? In my experience, Hive + Hadoop beat relational databases when it comes to storing and analyzing fact data. If you need real-time access to the results of the queries, you can then "cache" the summary results by storing them in MySQL.
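To show what a lookup against a Hadoop-side copy of a dimension table can look like, here is a minimal sketch in plain Python. It assumes the dimension extract is small enough to hold in memory on each worker (often called a map-side or replicated join). The function names and sample rows are hypothetical, and this is not Pig or MapReduce code.

```python
def load_dimension(rows):
    """Build an in-memory lookup from a dimension extract: key -> attributes."""
    return {key: attrs for key, attrs in rows}

def enrich(fact_rows, dimension, default=None):
    """Map-side lookup: attach dimension attributes to each fact row."""
    for key, measure in fact_rows:
        yield key, measure, dimension.get(key, default)

# Hypothetical dimension extract cloned from MySQL into Hadoop.
dim_rows = [(1, {"country": "DE"}), (2, {"country": "FR"})]
# Hypothetical fact rows in staging: (customer_key, amount).
facts = [(1, 100), (2, 50), (9, 7)]

dimension = load_dimension(dim_rows)
enriched = list(enrich(facts, dimension))
```

Each worker loads the extract once and then resolves lookups locally, so no per-row connection to MySQL is needed.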
