I'm trying to figure out the difference (between tools/services/programs) between Data Warehouse, Clustered Data Processing and the tools/infrastructure for querying a Data Warehouse
So let's say I have the following setup to perform some data processing for a certain use case:
Hadoop Cluster for Distributed Data processing
Hive for providing infrastructure and functions for querying data from a data warehouse
My data sitting in an RDBMS or a NoSQL database
In the above example, what exactly is the data warehouse? My naive brain thinks that the RDBMS or the NoSQL database in the above context is the data warehouse. But by definition, isn't a data warehouse a database used for reporting and data analysis? (Definition shamelessly stolen from Wikipedia.) So can I call a traditional RDBMS/NoSQL database a data warehouse? Thanks.
You cannot call every relational database system a data warehouse, since one of a data warehouse's main features is to aggregate data from multiple databases (with different schemas). This is usually done with a "star schema", which allows combining multiple dimensions and multiple granularities.
Because NoSQL database systems (graph-based or map-reduce-based) are schema-less, they can indeed store data from different schemas. Moreover, MapReduce can be used to aggregate data at different granularities (e.g. aggregate daily data to compare it with monthly data).
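As a toy illustration of that last point (re-aggregating a finer granularity into a coarser one in map-reduce style), here is a minimal sketch; the dates and values are made up:

```python
from collections import defaultdict

# Hypothetical daily measurements: (date "YYYY-MM-DD", value).
daily = [
    ("2023-01-01", 10), ("2023-01-15", 5),
    ("2023-02-03", 7),  ("2023-02-20", 8),
]

# "Map": emit (month, value) pairs by truncating the date key.
mapped = [(date[:7], value) for date, value in daily]

# "Reduce": sum all values that share the same month key.
monthly = defaultdict(int)
for month, value in mapped:
    monthly[month] += value

print(dict(monthly))  # {'2023-01': 15, '2023-02': 15}
```

A real MapReduce job distributes the map and reduce steps across the cluster, but the shape of the computation is the same.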
Related
I know Google BigQuery is a data warehouse, but are Dataproc, Bigtable, and Pub/Sub considered data warehouses? Would that make Hadoop a data warehouse?
A "Data warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
Regarding your questions, a simple answer would be:
Google BigQuery is a query execution (and/or data processing) engine that you can use over data stores of different kinds.
Google BigTable is a database service that can be used to implement a data warehouse or any other data store.
Google DataProc is a data processing service composed by common Hadoop processing components like MapReduce (or Spark, if you consider it part of Hadoop).
Hadoop is a framework/platform for data storage and processing comprised of different components (e.g. data storage via HDFS, data processing via MapReduce). You could use a Hadoop platform to build a data warehouse, e.g. by using MapReduce to process data and load it into ORC files stored in HDFS, which can then be queried by Hive. But it would only be appropriate to call it a data warehouse if it is a "centralized, single version of the truth about data" ;)
Dataproc could work as a data lake, since it is a Hadoop cluster, but it could also be considered a data warehouse because some tools can query its data.
BigTable stores up to petabytes of data; however, it is designed for applications that need very high throughput and scalability. Nevertheless, due to its high storage capacity and stream processing/analytics capabilities, it could be considered a data warehouse too.
Pub/Sub is not a data warehouse, as it is a publish-subscribe messaging service.
Is Cassandra a good alternative to Hadoop as a data warehouse where data is append-only, and all updates in the source databases should not overwrite existing rows in the data warehouse but get appended? Is Cassandra really meant to act as a data warehouse, or just as a database to store the results of batch/stream queries?
Cassandra can be used both as a data warehouse (raw data storage) and as a database (for final data storage). It depends on what you want to do with the data.
You even may need to have both Hadoop and Cassandra for different purposes.
Assume you need to gather and process data from multiple mobile devices and provide a complex aggregation report to the user.
So first, you need to save data as fast as possible (as new portions arrive very often), so you use Cassandra here. As Cassandra is limited in aggregation features, you load the data into HDFS and do the processing via HQL scripts (assuming you are not very good at coding but great at complicated SQL). Then you move the report results from HDFS to Cassandra, into a dedicated reports table partitioned by user id.
So when the user wants an aggregation report about his activity over the last month, the application takes the id of the active user and returns the aggregated result from Cassandra (as it is a simple key-value lookup).
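The read path above can be sketched in a few lines; the table contents and user ids here are hypothetical, and a Python dict stands in for the Cassandra reports table, whose partition key is the user id:

```python
# Toy stand-in for the Cassandra reports table described above: the
# partition key is the user id, so fetching a user's report is a single
# key lookup rather than an aggregation computed at read time.
reports_table = {
    "user-42": {"month": "2023-01", "total_activity": 315},
    "user-77": {"month": "2023-01", "total_activity": 128},
}

def get_monthly_report(user_id):
    """Simple key-value fetch, mirroring SELECT ... WHERE user_id = ?"""
    return reports_table.get(user_id)

print(get_monthly_report("user-42"))  # {'month': '2023-01', 'total_activity': 315}
```

All the expensive aggregation happened earlier in the batch job; the serving layer only does point reads.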
So to answer your question: yes, it could be an alternative, but the selection strategy depends on the data types and your application's business cases.
You can read more about Cassandra usage here
I have some questions about migration, data model and performance of Hadoop/Impala.
How to migrate an Oracle application to Cloudera Hadoop/Impala
1.1 How to replace Oracle stored procedures with Impala, M/R, or a Java/Python app.
For example, the original SP includes several parameters and SQL statements.
1.2 How to replace unsupported or complex SQL, like OVER (PARTITION BY ...), when moving from Oracle to Impala.
Are there any existing examples or Impala UDF?
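For what it's worth, if a given analytic function really is unavailable in your Impala version, its logic can often be reproduced in an M/R or application-side step. A hedged sketch (with made-up departments and salaries) emulating ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC):

```python
from collections import defaultdict

# Hypothetical rows: (department, employee, salary).
rows = [
    ("sales", "ann", 70), ("sales", "bob", 90),
    ("it", "cho", 80), ("it", "dee", 60),
]

# Group by the PARTITION BY key.
groups = defaultdict(list)
for dept, name, salary in rows:
    groups[dept].append((name, salary))

# Within each partition, sort by the ORDER BY key and assign row numbers.
ranked = {}
for dept, members in groups.items():
    members.sort(key=lambda m: m[1], reverse=True)
    ranked[dept] = [(rank + 1, name, salary)
                    for rank, (name, salary) in enumerate(members)]

print(ranked["sales"])  # [(1, 'bob', 90), (2, 'ann', 70)]
```

Note that recent Impala releases do support analytic (window) functions natively, so check your version before rewriting anything.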
1.3 How to handle update operations, since part of the data has to be updated.
For example, use a data timestamp? Use a storage model that supports updates, like HBase? Or delete all data/partitions/directories and insert again (INSERT OVERWRITE)?
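The timestamp approach usually boils down to "latest record wins" when a partition is rewritten. A minimal sketch of that deduplication step, with hypothetical keys and timestamps:

```python
# Toy "update by timestamp" pattern: keep only the newest version of
# each key, as you would when rebuilding a partition with INSERT OVERWRITE.
incoming = [
    ("row-1", "2023-01-01T00:00", "old value"),
    ("row-2", "2023-01-02T00:00", "unchanged"),
    ("row-1", "2023-01-03T00:00", "new value"),  # supersedes the first row-1
]

latest = {}
for key, ts, payload in incoming:
    # Keep the record with the greatest timestamp per key.
    if key not in latest or ts > latest[key][0]:
        latest[key] = (ts, payload)

print(latest["row-1"][1])  # new value
```

In Hive/Impala SQL the same idea is typically expressed with a window function or a self-join on MAX(timestamp) per key before overwriting the partition.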
Data storage model, partition design, and query performance
2.1 How to choose between Impala internal tables and external tables (CSV, Parquet, HBase)?
For example, if there are several kinds of data, like existing large data imported from Oracle into Hadoop, new business data, data computed in Hadoop, and frequently updated data, how do we choose the data model? Does it need special attention if the different kinds of data need to be joined?
We have XX TB of data from Oracle; do you have any suggestions about the file format, like CSV or Parquet? Do we need to import the computed results into an Impala internal table or into HDFS after calculation? If those kinds of data can be updated, how should we handle that?
2.2 How to partition the table/external table when joining
For example, there is a huge amount of sensor data, and each record includes measurement data, an acquisition timestamp, and region information.
We need:
Calculate measurement data by region.
Query a series of measurements during a certain time interval for a specific sensor or region.
Query a specific sensor's data across all time, from a huge data set.
Query data for all sensors on a specific date.
Could you please give us some suggestions about how to set up the partitioning for internal tables and the directory structure for external (CSV) tables?
In addition, for the directory structure, which is better: date=20090101/area=BEIJING or year=2009/month=01/day=01/area=BEIJING? Is there any guide about that?
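To make the two layouts concrete, here is a small sketch that builds both path styles for the same (hypothetical) partition; the trade-off is one partition level versus three, where the nested form makes pruning by year or month alone more natural:

```python
from datetime import date

def flat_partition(d, area):
    # date=20090101/area=BEIJING : the whole date is one partition level.
    return f"date={d:%Y%m%d}/area={area}"

def nested_partition(d, area):
    # year=2009/month=01/day=01/area=BEIJING : the date is split across
    # three levels, so queries can prune on year or month alone.
    return f"year={d:%Y}/month={d:%m}/day={d:%d}/area={area}"

d = date(2009, 1, 1)
flat = flat_partition(d, "BEIJING")
nested = nested_partition(d, "BEIJING")
print(flat)    # date=20090101/area=BEIJING
print(nested)  # year=2009/month=01/day=01/area=BEIJING
```

Either way, keep an eye on the total partition count: very fine-grained partitioning multiplied by many areas can produce far more partitions (and small files) than the metastore handles comfortably.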
I have an RDBMS (SQL Server/ Oracle) and a Hadoop database on the other end.
Primary-key 'customer' is common in both data stores.
A few questions:
Is it possible to have a datamart that can pull data from both the RDBMS and big data sides and produce reports? What would be an example tool?
Does the datamart itself need to be an RDBMS store, or can it be something in-memory?
What's the best way to run data analytics in this environment?
What about data visualization?
Or should I just get all the data into an RDBMS data warehouse and then solve these questions there?
Data virtualization or data federation is what you're looking for, i.e. the ability to access a single source that will access multiple resources as needed.
Databases usually have some limited capability in this area, which lets you define external tables; see for example this link for Oracle and HDFS.
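If a full virtualization layer is overkill, the same join can happen in the application. A minimal sketch under made-up data, with an in-memory SQLite database standing in for the RDBMS and a plain dict standing in for an extract pulled from the Hadoop side, joined on the shared 'customer' key:

```python
import sqlite3

# "RDBMS" side: SQLite stands in for SQL Server/Oracle here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer TEXT, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [("c1", "EU"), ("c2", "US")])

# "Big data" side: a pre-computed extract, e.g. event counts per customer.
hadoop_extract = {"c1": 120, "c2": 75}

# Federate in the application by joining on the common 'customer' key.
report = []
for customer, region in conn.execute("SELECT customer, region FROM customers"):
    report.append((customer, region, hadoop_extract.get(customer, 0)))

print(report)  # [('c1', 'EU', 120), ('c2', 'US', 75)]
```

This scales poorly compared to a proper federation tool or external tables, but it illustrates what those tools do under the hood.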
I am converting an SSIS solution to Hadoop for ETL processing in the data warehouse.
My expected system:
ETL - landing & staging (Hadoop) ----put-data---> Data-warehouse(MySQL)
The problem is: in the transform phase, I need to look up data in MySQL from the Hadoop side (Pig or MapReduce job). There are two solutions:
1st: Clone all the tables that need to be looked up from MySQL into Hadoop. This means we need to maintain data in two places.
2nd: Query MySQL directly. I am worried about the many connections coming into the MySQL server.
What is the solution/best practice for this problem? Are there any other solutions?
You will have to have some representation of your dimension tables in Hadoop. Depending on how you do the ETL of the dimension data, you might actually get them as a side effect of the ETL.
Are you planning to store the most granular fact data in MySQL? In my experience, Hive + Hadoop beats relational databases when it comes to storing and analyzing fact data. If you need real-time access to query results, you can then "cache" the summary results by storing them in MySQL.
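The first option from the question (replicating the MySQL lookup tables into Hadoop) is essentially a map-side join: the dimension table is shipped once and each mapper joins against its in-memory copy instead of opening a MySQL connection per record. A hedged sketch with made-up product data:

```python
# Dimension table cloned from MySQL once (e.g. via a periodic Sqoop
# export), then held in memory on each mapper.
dimension = {
    "p1": "books",
    "p2": "games",
}

# Stream of fact records arriving on the Hadoop side: (product_id, qty).
fact_stream = [("p1", 3), ("p2", 1), ("p1", 2)]

def map_side_join(record):
    """Enrich one fact record from the in-memory dimension copy."""
    product_id, qty = record
    return (product_id, dimension.get(product_id, "unknown"), qty)

joined = [map_side_join(r) for r in fact_stream]
print(joined)  # [('p1', 'books', 3), ('p2', 'games', 1), ('p1', 'books', 2)]
```

Pig's "replicated" join and Hive's map joins implement exactly this pattern, which is why keeping a (possibly slightly stale) copy of the dimensions in Hadoop is usually preferred over per-record MySQL queries.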