GCP Hadoop data warehouse? - hadoop

I know Google BigQuery is a data warehouse, but are Dataproc, Bigtable, and Pub/Sub considered data warehouses? Would that make Hadoop a data warehouse?

A "Data warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.
From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."
Regarding your questions, a simple answer would be:
Google BigQuery is a query execution (and/or data processing) engine that you can use over data stores of different kinds.
Google Bigtable is a database service that can be used to implement a data warehouse or any other data store.
Google Dataproc is a data processing service composed of common Hadoop processing components like MapReduce (or Spark, if you consider it part of Hadoop).
Hadoop is a framework/platform for data storage and processing comprising different components (e.g. data storage via HDFS, data processing via MapReduce). You could use a Hadoop platform to build a data warehouse, e.g. by using MapReduce to process data and load it into ORC files that are stored in HDFS and can be queried by Hive. But it would only be appropriate to call it a data warehouse if it is a "centralized, single version of the truth about data" ;)
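To make that last point concrete, here is a minimal sketch of querying such a Hive-over-ORC warehouse from Python with PyHive. The host, database, and the sales_orc table are made-up placeholders, not anything from the original answer.

```python
# A minimal sketch, assuming a HiveServer2 endpoint at "hive-host:10000" and a
# hypothetical "sales_orc" table already loaded from ORC files on HDFS.
from pyhive import hive  # pip install 'pyhive[hive]'

conn = hive.Connection(host="hive-host", port=10000, database="default")
cur = conn.cursor()

# Typical warehouse-style aggregation over the ORC-backed table.
cur.execute("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales_orc
    GROUP BY region
""")
for region, total in cur.fetchall():
    print(region, total)

cur.close()
conn.close()
```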

Dataproc usually plays the role of a data lake, since it is a managed Hadoop cluster, but it could be considered a data warehouse to the extent that tools can query the information it holds.
Bigtable stores up to petabytes of data; however, it is designed for applications that need very high throughput and scalability. Nevertheless, given its storage capacity and its use in stream processing/analytics, it could be considered a data warehouse too.
Pub/Sub is not a data warehouse; it is a publish-subscribe messaging service.
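As a short illustration of why Pub/Sub is a transport rather than a store, here is a hedged sketch using the google-cloud-pubsub Python client; the project and topic names are made up.

```python
# A minimal sketch, assuming google-cloud-pubsub is installed and the
# hypothetical project "my-project" has a topic named "ingest-events".
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "ingest-events")

# Messages are delivered to subscribers and then acknowledged; Pub/Sub keeps
# them only until they are consumed (or expire), so nothing is "warehoused".
future = publisher.publish(topic_path, b'{"sensor_id": 42, "reading": 3.14}')
print("published message id:", future.result())
```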

Related

Cassandra for datawarehouse

Is Cassandra a good alternative to Hadoop as a data warehouse where data is append-only and updates in the source databases should not overwrite existing rows in the data warehouse but get appended? Is Cassandra really meant to act as a data warehouse, or just as a database to store the results of batch/stream queries?
Cassandra can be used both as a data warehouse (raw data storage) and as a database (for final data storage). It depends more on what you want to do with the data.
You may even need both Hadoop and Cassandra for different purposes.
Assume you need to gather and process data from multiple mobile devices and provide some complex aggregation report to the user.
First, you need to save data as fast as possible (new portions arrive very often), so you use Cassandra here. Since Cassandra is limited in aggregation features, you load the data into HDFS and do some processing via HQL scripts (assume you are not very good at coding but great at complicated SQL). Then you move the report results from HDFS back to Cassandra, into a dedicated reports table partitioned by user id.
So when the user wants an aggregation report about his activity in the last month, the application takes the id of the active user and returns the aggregated result from Cassandra (a simple key-based lookup).
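Here is a minimal sketch of that final lookup step, using the DataStax cassandra-driver; the keyspace, table, and column names below are made-up placeholders for the "reports table" described above.

```python
# A minimal sketch of the per-user report lookup, using cassandra-driver.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("analytics")

# The report rows are partitioned by user_id, so fetching one user's monthly
# aggregates is a simple partition-key lookup.
rows = session.execute(
    "SELECT month, total_actions FROM monthly_reports WHERE user_id = %s",
    (12345,),
)
for row in rows:
    print(row.month, row.total_actions)

cluster.shutdown()
```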
So for your question, yes, it could be an alternative, but the selection strategy depends on the data types and your application business cases.
You can read more information about usage of Cassandra here.

At what data volume should you choose Hadoop over a traditional database?

At what data volume should you choose Hadoop over a traditional database? What is the basic benchmarked parameter for choosing a Hadoop system over a traditional database?
There is no specific "size" to move from RDBMS to Hadoop. Two things to know:
They are very different (read on to know more).
The size of data that an RDBMS can handle depends on the capability of the database server.
Traditional databases are RDBMSs (Relational Database Management Systems) where we insert data as rows, which get stored in the database. You may alter/query/update the database.
Hadoop is a framework for storing and processing data (large amounts of data). It has two parts: storage (the Hadoop Distributed File System) and MapReduce (the processing framework).
Hadoop stores data as files on its file system. So if you want to update/alter/query it like an RDBMS, that is not possible.
We do have SQL wrappers over Hadoop, like Hive or Impala, but they aren't as performant as an RDBMS on data that isn't big data.
Even with all this, many are considering moving from an RDBMS to Hadoop because an RDBMS under-performs with large data (big data). Hadoop can be used as a data store, and queries over it can be run using Hive/Impala. Updates are not readily supported on Hadoop.
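To show what the MapReduce "processing" half looks like in practice, here is a hedged toy sketch using the mrjob library (a word count, not a benchmark); the input path and runner options are illustrative.

```python
# A minimal MapReduce sketch using mrjob (pip install mrjob).
# Run locally with:            python word_count.py input.txt
# Or on a Hadoop cluster with: python word_count.py -r hadoop hdfs:///path/to/input
from mrjob.job import MRJob


class WordCount(MRJob):
    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum the counts emitted by all mappers for this word.
        yield word, sum(counts)


if __name__ == "__main__":
    WordCount.run()
```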
There are many pros and cons of using Hadoop over an RDBMS. Read more here or here.

Clarification of Sqoop and Flume

I am very new to big data and I am a little confused about Sqoop and Flume.
So I get the difference between Sqoop and Flume:
Sqoop is for transferring bulk data from RDBMS
Flume is for streaming of data such as log files
My confusion comes from the big data architecture I am looking at (of which I have no virtual copy): structured data is grouped and transferred by Sqoop, and unstructured data is streamed by Flume.
My question regarding that is: does that mean Flume is only for streaming?
What about high-frequency data? And does Flume support the transfer of unstructured data that is not log files (e.g. audio, video), or would Sqoop be able to handle that?
My final question is: can Sqoop work with federated data sources? If yes, with both real and virtual ones?
Thanks,
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (it imports data, transforms it with Hadoop MapReduce, and then exports it).
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Source: sqoop-vs-flume-battle-of-the-hadoop
Reference: INGESTION AND STREAMING
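To make the Sqoop side concrete, here is a hedged sketch that shells out to the standard sqoop CLI from Python; the JDBC URL, credentials, table, and HDFS target directory are made-up placeholders.

```python
# A minimal sketch that invokes the Sqoop CLI from Python. The JDBC URL,
# credentials, table, and HDFS target directory are made-up placeholders.
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host:3306/sales",
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",
        "--table", "orders",
        "--target-dir", "/data/raw/orders",
        "--num-mappers", "4",  # parallel map tasks doing the copy
    ],
    check=True,
)
```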
Flume is efficient with streams, and if you just want to dump data from an RDBMS, why not use Sqoop?
If by high-frequency data you mean social media, then yes, Flume can handle it. Unstructured data? Yes, Flume may handle that too.
Sqoop is essentially a tool to ingest data into HDFS from an RDBMS. Under the hood, it generates simple Java code which submits a query to the RDBMS and writes the result to HDFS. This means that you can import with Sqoop anything which can be accessed via a JDBC connection and which has a Java driver available. For this reason, you can't use it for files (like logs) or things like that.
So Sqoop can't handle video or audio files.
Flume, instead, is used to monitor and ingest information in real time. You can ingest anything for which there is a Flume source available (https://flume.apache.org/FlumeUserGuide.html#flume-sources).
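For completeness, here is a hedged sketch of launching a Flume agent from Python with the standard flume-ng launcher; the agent name and configuration file path are made-up placeholders, and the actual source/channel/sink definitions would live in that config file.

```python
# A minimal sketch of starting a Flume agent from Python. The agent name "a1"
# and the config path "conf/log-ingest.conf" are hypothetical placeholders.
import subprocess

subprocess.run(
    [
        "flume-ng", "agent",
        "--name", "a1",
        "--conf", "conf",
        "--conf-file", "conf/log-ingest.conf",
    ],
    check=True,
)
```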

How to build a big data platform to receive and store big data in Hadoop

I am trying to build a big data platform to receive and store large amounts of heterogeneous data (documents, videos, images, sensor data, etc.) in Hadoop and then implement a classification process.
What architecture can help me? I'm currently using:
VMware VSphere EXSi
Hadoop
HBase
Thrift
XAMPP
All of these are working fine, but I don't know how to receive and store a large amount of data, because I discovered that HBase is a column-oriented database and not a data warehouse.
You have to customize the solution for the type of big data (structured, semi-structured and unstructured):
You can use Hive/HBase for structured data if the total data size is <= 10 TB.
You can use Sqoop to import structured data from a traditional RDBMS such as Oracle or SQL Server.
You can use Flume for processing unstructured data.
You can use a Content Management System to process unstructured and semi-structured data at terabyte or petabyte scale. If you are storing unstructured data, I prefer to store the data in the CMS and keep the metadata in a NoSQL database like HBase (see the sketch after this list).
To process big data streams, you can use Pig.
Have a look at Structured Data and Un-Structured data handling in Hadoop
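Here is a hedged sketch of the "metadata in HBase" idea, using happybase over the HBase Thrift server already in the stack above; the table, column family, and paths are made-up placeholders.

```python
# A minimal sketch, assuming the HBase Thrift server is running and happybase
# is installed (pip install happybase). Names below are hypothetical.
import happybase

connection = happybase.Connection("hbase-host", port=9090)

# One column family "meta" holding metadata about files kept elsewhere
# (e.g. in a CMS or on HDFS), keyed by a document id.
if b"documents" not in connection.tables():
    connection.create_table("documents", {"meta": dict()})

table = connection.table("documents")
table.put(
    b"doc-00042",
    {
        b"meta:type": b"video",
        b"meta:hdfs_path": b"/data/raw/videos/doc-00042.mp4",
        b"meta:label": b"unclassified",
    },
)

row = table.row(b"doc-00042")
print(row[b"meta:hdfs_path"])
```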

Hadoop vs. NoSQL-Databases

As I am new to Big Data and the related technologies my question is, as the title implies:
When would you use Hadoop and when would you use some kind of NoSQL-Databases to store and analyse massive amounts of data?
I know that Hadoop is a framework and that Hadoop and NoSQL differ.
But you can save lots of data with Hadoop on HDFS and also with NoSQL-DBs like MongoDB, Neo4j...
So maybe the use of Hadoop or of a NoSQL database depends on whether you just want to analyse data or just want to store data?
Or is it just that HDFS can store, let's say, raw data, while a NoSQL DB is more structured (more structured than raw data and less structured than an RDBMS)?
Hadoop is an entire framework, one of whose components can be a NoSQL database.
Hadoop generally refers to a cluster of systems working together to analyze data. You can take data from a NoSQL store and process it in parallel using Hadoop.
HBase is a NoSQL database that is part of the Hadoop ecosystem. You can use other NoSQL databases too.
Your question is misleading: you are comparing Hadoop, which is a framework, to a database ...
Hadoop contains a lot of features (including a NoSQL database named HBase) in order to provide a big data environment. If you have a massive quantity of data, you will probably use Hadoop (for the MapReduce functionality or the data warehouse capabilities), but it's not certain: it depends on what you're processing and how you want to process it. If you're just storing a lot of data and don't need other features (batch data processing, data transformations, ...), a simple NoSQL database is enough.
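As a hedged illustration of the "just store and look up" case, here is a minimal sketch using MongoDB via pymongo; the database, collection, and field names are made-up placeholders.

```python
# A minimal sketch of plain storage and key-based lookup in a NoSQL database,
# using pymongo (pip install pymongo). Names below are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["telemetry"]["events"]

# Writes and key-based lookups like these need no Hadoop cluster at all.
events.insert_one({"device_id": "sensor-42", "reading": 3.14, "ts": 1700000000})
doc = events.find_one({"device_id": "sensor-42"})
print(doc)

# Heavy batch aggregation or transformation across the whole dataset is where a
# framework like Hadoop (MapReduce, Hive, Spark on YARN) starts to pay off.
```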

Resources