How is Hadoop different from a database?

I was doing a case study on Spotify. I found out that Spotify uses Cassandra as a DB and also Hadoop. My question is: how is Hadoop different from a database? What type of files does a Hadoop datanode store? Why does every corporation have both a DB and Hadoop? I know Hadoop is not a DB, but what is it used for if there is a DB cluster to save data?

Hadoop is not a database at all. Hadoop is a set of tools for distributed storage and processing, such as the distributed filesystem (HDFS), the MapReduce framework libraries, and the YARN resource manager.
Other tools like Hive, Spark, Pig, Giraph, Sqoop, etc. can use Hadoop or its components. For example, Hive is a database: it uses HDFS for storing its data and MapReduce framework primitives for building its query execution graph.
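To make the "Hive builds on MapReduce primitives" point concrete, here is a pure-Python sketch of the map/shuffle/reduce model. The function names and the tiny in-memory dataset are illustrative; a real job would run distributed on YARN against files in HDFS.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs -- here, (word, 1) for a word count."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values -- here, summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["hadoop stores files", "hive queries files"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["files"])  # -> 2; each phase could run on a different node
```

A Hive `SELECT word, COUNT(*) ... GROUP BY word` query compiles down to essentially this shape of job, just distributed across the cluster.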

Related

What is the relationship between Spark, Hadoop and Cassandra

My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.
Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have a SQL-style interface. However, Spark has its own SQL. Why would one use Cassandra/Hive instead of Spark's native SQL, assuming that this is a brand new project with no existing installation?
Spark is a distributed in-memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop uses HDFS (the Hadoop Distributed File System) to store its data, so Spark is able to read data from HDFS and to save results back to HDFS.
For speed, Spark keeps its data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, HBase, a Cassandra database, etc. Once the data is loaded into memory, Spark can run many transformations on the data set to calculate a desired result. The final result is then typically written back to durable storage.
In terms of being an alternative to Hadoop, Spark can be much faster than Hadoop at certain operations. For example, a multi-pass MapReduce operation can be dramatically faster in Spark than with Hadoop MapReduce, since most of Hadoop's disk I/O is avoided. Spark can read data formatted for Apache Hive, so Spark SQL can be much faster than using HQL (Hive Query Language).
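The "multi-pass in memory" idea above can be sketched without Spark at all. The toy class below mimics the RDD API in pure Python: transformations are merely recorded, and nothing runs until an action is called, so a multi-step pipeline never materializes intermediate results on disk. The class and method names are illustrative, not Spark's actual implementation.

```python
class ToyRDD:
    """A toy stand-in for a Spark RDD: lazy transformations, eager actions."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []  # pending transformations, not yet executed

    def map(self, fn):
        # Transformations just queue work and return a new dataset handle.
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        """Action: apply all queued transformations, entirely in memory."""
        items = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # -> [0, 4, 16, 36, 64]; nothing ran until collect()
```

In Hadoop MapReduce, each of those two steps would typically be a separate job writing its intermediate output to HDFS; keeping the chain in memory is where Spark's speedup on multi-pass work comes from.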
Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL and is quite poor for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature-rich query language and lets you do data analytics that native CQL doesn't provide.
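To show the kind of query that motivates this pairing, here is an ad hoc aggregation across all rows. CQL largely restricts you to partition-key lookups, so a full-table GROUP BY like this is what Spark SQL adds on top of Cassandra. The stdlib sqlite3 module stands in here purely to show the query shape; the table and data are made up.

```python
import sqlite3

# An in-memory table of listening events (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (user_id TEXT, track TEXT, seconds INT)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [("a", "song1", 200), ("a", "song2", 100), ("b", "song1", 50)],
)

# An ad hoc aggregation across all partitions -- easy in SQL,
# awkward or impossible in plain CQL.
rows = conn.execute(
    "SELECT user_id, SUM(seconds) FROM plays GROUP BY user_id ORDER BY user_id"
).fetchall()
print(rows)  # -> [('a', 300), ('b', 50)]
```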
Another use case for Spark is stream processing. Spark can be set up to ingest incoming real-time data, process it in micro-batches, and then save the result to durable storage, such as HDFS, Cassandra, etc.
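The micro-batch idea can be sketched in a few lines: buffer incoming records, process them in small fixed-size batches, and append each batch's result to durable storage. The batch size, the event values, and the in-memory "storage" list are all illustrative stand-ins.

```python
def micro_batches(stream, batch_size):
    """Yield successive fixed-size batches from an event stream."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any final partial batch
        yield batch

storage = []  # stands in for HDFS, Cassandra, etc.
events = [1, 2, 3, 4, 5, 6, 7]
for batch in micro_batches(events, batch_size=3):
    storage.append(sum(batch))  # per-batch aggregation

print(storage)  # -> [6, 15, 7]
```

Spark Streaming does essentially this at cluster scale, with the batch interval defined in time (e.g. every few seconds) rather than by record count.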
So Spark is really a standalone in-memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack, such as stream processing.
I'm writing a paper about Hadoop for university and stumbled across your question. Spark uses Hadoop only for persistence, and only if you want it to. It's possible to use Spark with other persistence tiers, such as Amazon S3.
On the other hand, Spark runs in-memory, and it was not primarily built for MapReduce-style use cases the way Hadoop was/is.
I can recommend this article if you'd like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
The README.md file in Spark can solve your puzzle:
A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at
"Specifying the Hadoop Version"
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.

What is "Hadoop" - the definition of Hadoop?

It is kind of obvious, and we will all agree, that we can call HDFS + YARN + MapReduce "Hadoop". But what about other combinations and other products in the Hadoop ecosystem?
Is, for example, HDFS + YARN + Spark still Hadoop? Is HBase Hadoop? I guess we consider HDFS + YARN + Pig to be Hadoop, since Pig uses MapReduce.
Are only the MapReduce tools considered Hadoop, while anything else that runs on HDFS + YARN (like Spark) is non-Hadoop?
I agree with your impression that the "Hadoop" term does not have a useful definition. "We have a Hadoop cluster" may mean various things.
There is an official answer though at http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F:
The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
So "Hadoop" is the name of a project and a software library. Any other usage is ill-defined.
In addition to the Apache Hadoop definition from the official website, I would like to highlight that Hadoop is a framework and there are many subsystems in the Hadoop ecosystem.
I am quoting this content from the official website so that broken links in the future do not affect this answer.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
More or less,
Hadoop = Distributed Storage (HDFS) + Distributed Processing (YARN + MapReduce)
But these four modules do not cover the complete Hadoop ecosystem. There are many Hadoop-related projects and 40+ subsystems in the Hadoop ecosystem.
Other Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
Coming back to your question:
Just have a look at the 40+ subsystems in the Hadoop ecosystem. Not everything you have quoted is Hadoop itself, but most of it is related to Hadoop.
Spark is part of the Hadoop ecosystem, but it does not have to use HDFS or YARN: it can load data from other sources into RDDs (resilient distributed datasets) and can run in standalone mode without YARN.
Have a look at this article and this article for Hadoop & Spark comparison.
Spark use cases over Hadoop:
Iterative Algorithms in Machine Learning
Interactive Data Mining and Data Processing
Stream processing
Sensor data processing
Since Spark doesn't have its own storage system, it has to depend on a distributed storage layer, and HDFS is one of them.
Have a look at related SE question:
Can apache spark run without hadoop?
The most generally accepted understanding of Hadoop: HDFS and MapReduce and their related processes and tooling.
A related term is the Hadoop ecosystem: Hive/Pig/HBase, ZooKeeper, Oozie. Also vendor-specific tools such as Impala and Ambari.
Why do we require a big data system?
STORE (to store massive amounts of data)
PROCESS (process the data/requests in a timely manner)
SCALE (scale easily as data grows)
There was a big data solution provided by Google:
Google File System: to solve distributed storage.
MapReduce: to solve distributed computing.
Google published the research papers. Apache developed an open source system similar to the one developed by Google; it is known as HADOOP.
The counterpart of the Google File System in Hadoop is HDFS (Hadoop Distributed File System): a file system to manage the storage of data.
MapReduce: a framework to process data across multiple servers.
Note: In 2013, Apache released HADOOP 2.0, in which MapReduce was broken into two components:
MapReduce: a framework to define the data processing task.
YARN: a framework to run the data processing task.
HADOOP ECOSYSTEM
Hadoop was not easy to understand, and it was limited to hard-core developers. So, to make Hadoop easier to use, lots of tools sprang up, which as a whole are known as the Hadoop Ecosystem.
Hadoop Ecosystem contains tools like:
HIVE:
provides an SQL interface to Hadoop.
a bridge to Hadoop for folks who don't have exposure to OOP in Java.
HBASE:
a database management system on top of Hadoop.
integrates with your app just like a traditional database.
PIG:
a data manipulation language.
transforms unstructured data into a structured format.
you can query this structured data using interfaces like Hive.
SPARK:
a distributed computing engine used along with Hadoop.
an interactive shell to quickly process datasets.
has a bunch of built-in libraries for machine learning, stream processing, graph processing, etc.
OOZIE:
a tool to schedule workflows across all the Hadoop ecosystem technologies.
FLUME/SQOOP:
tools to transfer data between other systems and Hadoop.
This concludes a very high-level overview of Hadoop.
Hadoop is not a stack like the LAMP or MEAN stack. Hadoop is a collection of frameworks and tools that work together to solve complex big data problems.
It is basically a project under the Apache foundation. Various subprojects like MapReduce, Ambari, Sqoop, Spark, ZooKeeper, etc. together make up what is called the Hadoop Ecosystem.
Source : https://www.datacloudschool.com/2020/01/introduction-what-is-hadoop.html

Questions about Hadoop, Hive, and Presto

I am looking into using Hive on our Hadoop cluster, and then using Presto to do some analytics on the data stored in Hadoop, but I am still confused about some things:
Files are stored in Hadoop (some kind of file manager)
Hive needs tables to store data from Hadoop (data manager)
Do Hadoop and Hive store their data separately, or does Hive just use the files from Hadoop (in terms of hard disk space and so on)?
-> So does Hive import data from Hadoop into tables and leave Hadoop alone, or how should I see this?
Can Presto be used without Hive and just on Hadoop directly?
Thanks in advance for answering my questions :)
First things first: files are stored in the Hadoop Distributed File System (HDFS). Is that what you call the data manager?
Actually, Hive can use both: "regular" files in HDFS, or tables, which are once again "regular" files plus additional metadata stored in a special datastore (the Hive metastore; the table data itself sits in an HDFS directory called the warehouse).
Concerning Presto: it has built-in support for the Hive metastore, but you can also write your own connector plugin for any data source.
Please read more info about Hive connector configuration here and about connector plugins here.

Consolidating multiple Hadoop clusters

We have multiple Hadoop clusters using Hive and Pig. What is the best way to consolidate them into one? In BI this was done by building an EDW or via an MDM approach. How about Hadoop? Is anyone thinking about this?
Pig and Hive are client-side libraries, so there is nothing to migrate for them. The Pig and Hive clients only have to point to the appropriate Hadoop cluster.
Regarding the data, you could use DistCp to move the data across clusters.

In a hadoop cluster, should hive be installed on all nodes?

I am a newbie to Hadoop / Hive and I have just started reading the docs. There are lots of blogs on installing Hadoop in cluster mode. Also, I know that Hive runs on top of Hadoop.
My question is: Hadoop is installed on all the cluster nodes. Should I also install Hive on all the cluster nodes or only on the master node?
No, it is not something you install on worker nodes. Hive is a Hadoop client. Just run Hive according to the instructions you see at the Hive site.
From Cloudera's Hive installation Guide:
Install Hive on your client machine(s) from which you submit jobs; you do not need to install it on the nodes in your Hadoop cluster.
Hive is basically used for processing structured and semi-structured data in Hadoop. Using Hive, we can also perform analysis of large datasets present in HDFS, or in the Amazon S3 filesystem. In order to query data, Hive provides a query language known as HiveQL, which is similar to SQL. Using Hive, one can easily run ad hoc queries for data analysis. With Hive we don't need to write complex MapReduce jobs; we just submit SQL queries, and Hive converts them into MapReduce jobs.
Finally, Hive SQL gets converted to MapReduce jobs, and just as we don't have to submit MapReduce jobs from every node in a Hadoop cluster, we don't need Hive to be installed on every node of the Hadoop cluster.
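To make "Hive converts SQL into MapReduce" concrete, the snippet below computes the same GROUP BY twice: once declaratively (sqlite3 stands in for HiveQL, which has a very similar SELECT syntax) and once as the map/shuffle/reduce aggregation Hive would generate under the hood. The table and column names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

rows = [("us", 3), ("eu", 5), ("us", 2)]

# Declarative: the query a Hive user writes at the client.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_result = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# What the generated job does on the cluster: map emits (key, value)
# pairs, shuffle groups them by key, reduce sums each group.
groups = defaultdict(int)
for region, amount in rows:
    groups[region] += amount

print(sql_result == dict(groups))  # -> True: the two plans agree
```

The query is written once on the client machine; the generated job is what actually fans out across the worker nodes, which is why Hive itself only needs to live on the client.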
