Incremental copy between Hadoop clusters (using Spark) given that Falcon is deprecated

Given that Falcon is deprecated and distcp cannot do incremental copies of data that are not in separate 'nice slices', how would one copy in a Sqoop-like way between two Hadoop clusters?
Can Spark attach to 2 clusters simultaneously? I never had the need to try this.
I can see many issues listed with cross-realm HDFS aspects.

In summary:
You can use Cloudera Replication Manager for HDFS:
for HDP, CDP and CDH environments; this is a file-based replication option.
for CDH you can also do this for a given table (or set of tables).
For Kudu there is no such option.
You can use a Spark app with two clusters by giving the full hdfs://namenode:port/path address and "rolling your own" logic (see the sketch below). In fact this applies to both HDFS/Hive and the Kudu storage manager.
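As a rough illustration of the "roll your own" Spark approach, here is a minimal sketch; the namenode hostnames, ports, paths and the ds partition column are placeholders, and it assumes any Kerberos/cross-realm trust is already sorted out:

import org.apache.spark.sql.SparkSession

object IncrementalCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-copy")
      .getOrCreate()

    // Read only the new slice from the source cluster, using a fully qualified URI.
    val newData = spark.read
      .parquet("hdfs://nn1.source.example.com:8020/data/events")
      .where("ds = '2023-01-01'")   // hypothetical incremental predicate

    // Append it to the target cluster, again with a fully qualified URI.
    newData.write
      .mode("append")
      .parquet("hdfs://nn2.target.example.com:8020/data/events")

    spark.stop()
  }
}

The "incremental" part is whatever predicate or watermark you track yourself; Spark simply gives you one engine that can address both clusters at once.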

Related

Can apache spark run without hadoop?

Are there any dependencies between Spark and Hadoop?
If not, are there any features I'll miss when I run Spark without Hadoop?
Spark is an in-memory distributed computing engine.
Hadoop is a framework for distributed storage (HDFS) and distributed processing (YARN).
Spark can run with or without Hadoop components (HDFS/YARN).
Distributed Storage:
Since Spark does not have its own distributed storage system, it has to depend on one of these storage systems for distributed computing.
S3 – Non-urgent batch jobs. S3 fits very specific use cases when data locality isn’t critical.
Cassandra – Perfect for streaming data analysis, but overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
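As a quick, hedged illustration of the point above, the same Spark code can target any of these storage systems just by changing the path scheme; the hostnames and bucket below are placeholders, and s3a:// additionally needs the hadoop-aws module on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("storage-backends")
  .master("local[*]")   // no YARN needed for this demo
  .getOrCreate()

// Same API, different storage system, selected purely by the URI scheme.
val fromLocal = spark.read.textFile("file:///tmp/input.txt")
val fromHdfs  = spark.read.textFile("hdfs://namenode.example.com:8020/data/input.txt")
val fromS3    = spark.read.textFile("s3a://my-bucket/data/input.txt")

println(fromLocal.count())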
Distributed processing:
You can run Spark in three different modes: Standalone, YARN and Mesos
Have a look at the below SE question for a detailed explanation about both distributed storage and distributed processing.
Which cluster type should I choose for Spark?
Spark can run without Hadoop, but some of its functionality relies on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3, which was a little tricky to set up but works really well once done (you can read a summary of what is needed to set it up properly here).
(Edit) Note: since version 2.3.0 Spark has also added native support for Kubernetes.
By default, Spark does not have a storage mechanism.
To store data, it needs a fast and scalable file system. You can use S3, HDFS or any other file system. Hadoop is an economical option due to its low cost.
Additionally, if you use Tachyon, it will boost performance with Hadoop. Hadoop is highly recommended for Apache Spark processing.
As per the Spark documentation, Spark can run without Hadoop.
You may run it in Standalone mode without any resource manager.
But if you want to run in a multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS, S3, etc.
Yes, Spark can run without Hadoop. All core Spark features will continue to work, but you'll miss things like easily distributing all your files (code as well as data) to all the nodes in the cluster via HDFS, etc.
Yes, you can install Spark without Hadoop.
That would be a little tricky.
You can refer to Arnon's link on using Parquet to configure S3 as the data storage.
http://arnon.me/2015/08/spark-parquet-s3/
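For flavour, here is a rough sketch of the kind of S3 wiring that post describes, using the standard s3a Hadoop connector; the bucket name and the environment variables holding the credentials are placeholders, and hadoop-aws plus its AWS SDK dependency must be on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-on-s3")
  // Credentials pulled from the environment; key names are the standard s3a properties.
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// Read and write Parquet directly against the bucket.
val df = spark.read.parquet("s3a://my-bucket/warehouse/events/")
df.write.mode("overwrite").parquet("s3a://my-bucket/warehouse/events_copy/")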
Spark only does processing, and it uses dynamic memory to perform the task, but to store the data you need some data storage system. Here Hadoop comes into play with Spark: it provides the storage for Spark.
One more reason for using Hadoop with Spark is that both are open source and integrate with each other easily compared to other data storage systems. For other storage like S3, configuration is a bit tricky, as mentioned in the link above.
But Hadoop also has its own processing unit, called MapReduce.
Want to know the difference between the two?
Check this article: https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
I think this article will help you understand
what to use,
when to use and
how to use!
Yes, of course. Spark is an independent computation framework. Hadoop is a distributed storage system (HDFS) with the MapReduce computation framework. Spark can get data from HDFS, as well as from any other data source, such as a traditional database (JDBC), Kafka or even local disk.
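To make that concrete, here is a hedged sketch of reading from a few of those non-HDFS sources; the JDBC URL, table, Kafka broker and topic are all placeholders, and the JDBC driver jar and the spark-sql-kafka-0-10 package have to be provided separately:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("other-sources").getOrCreate()

// Local disk
val localCsv = spark.read.option("header", "true").csv("file:///tmp/people.csv")

// Traditional database over JDBC
val fromDb = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com:5432/sales")
  .option("dbtable", "public.orders")
  .option("user", "report")
  .option("password", sys.env("DB_PASSWORD"))
  .load()

// Kafka, as a streaming source
val fromKafka = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker1.example.com:9092")
  .option("subscribe", "events")
  .load()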
Yes, Spark can run with or without a Hadoop installation. For more details you can visit https://spark.apache.org/docs/latest/
Yes, Spark can run without Hadoop. You can install Spark on your local machine without Hadoop. But the Spark distribution comes with pre-built Hadoop libraries, which are used when installing it on your local machine.
You can run Spark without Hadoop, but on Windows Spark depends on Hadoop's winutils, so some features may not work. Also, if you want to read Hive tables from Spark, then you need Hadoop.
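As a small, hedged sketch of the Hive case: reading a Hive table from Spark needs Hive support enabled on the session plus a reachable metastore (hive-site.xml on the classpath); the database and table names here are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-read")
  .enableHiveSupport()   // requires Hive/Hadoop jars and a metastore to talk to
  .getOrCreate()

val orders = spark.sql("SELECT * FROM sales_db.orders LIMIT 10")
orders.show()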
I'm not good at English, forgive me!
TL;DR
Use local (single node) or standalone (cluster) mode to run Spark without Hadoop, but you still need the Hadoop dependencies for logging and some file processing.
Windows is strongly NOT recommended for running Spark!
Local mode
There are many run modes in Spark, and one of them, called local, runs without Hadoop dependencies.
So here is the first question: how do we tell Spark we want to run in local mode?
After reading this official doc, I just gave it a try on my Linux OS:
You must install Java and Scala; that is not the core content here, so I skip it.
Download the Spark package
There are 2 types of package: "without hadoop" and "hadoop integrated".
The most important thing is that "without hadoop" does NOT mean it runs without Hadoop, but just that it is not bundled with Hadoop, so you can bundle it with your custom Hadoop!
Spark can run without Hadoop (HDFS and YARN), but it needs Hadoop dependency jars such as the Parquet/Avro SerDe classes, so I strongly recommend using the "integrated" package (and you will find some log dependencies missing, such as log4j and slf4j and other common utility classes, if you choose the "without hadoop" package, while all of these come bundled with the hadoop-integrated package)!
Run in local mode
The simplest way is just to run the shell, and you will see the welcome log
# as same as ./bin/spark-shell --master local[*]
./bin/spark-shell
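Once the prompt appears, a trivial sanity check pasted straight into the shell confirms that local mode works with no HDFS or YARN involved:

// runs entirely in local JVM memory, no HDFS or YARN involved
val ds = spark.range(0, 1000)
println(ds.count())                        // expect 1000
println(ds.selectExpr("sum(id)").first())  // expect [499500]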
Standalone mode
Same as above, but different at step 3.
# Start up the cluster
# if you want to run it in the foreground
# export SPARK_NO_DAEMONIZE=true
./sbin/start-master.sh
# run this on each of your workers
./sbin/start-worker.sh spark://VMS110109:7077
# Submit a job or just open a shell
./bin/spark-shell --master spark://VMS110109:7077
On Windows?
I know many people run Spark on Windows just for study, but things are quite different on Windows, and I really strongly do NOT recommend using Windows.
The most important thing is to download winutils.exe from here and configure the system variable HADOOP_HOME to point to where winutils is located.
At this moment 3.2.1 is the latest release version of Spark, but a bug exists. You will get an exception like Illegal character in path at index 32: spark://xxxxxx:63293/D:\classe when running ./bin/spark-shell.cmd; only starting up a standalone cluster and then using ./bin/spark-shell.cmd, or using a lower version, can temporarily work around this.
For more detail and a solution you can refer here.
No. It requires a full-blown Hadoop installation to start working - https://issues.apache.org/jira/browse/SPARK-10944

What is the relationship between Spark, Hadoop and Cassandra

My understanding was that Spark is an alternative to Hadoop. However, when trying to install Spark, the installation page asks for an existing Hadoop installation. I'm not able to find anything that clarifies that relationship.
Secondly, Spark apparently has good connectivity to Cassandra and Hive. Both have an SQL-style interface. However, Spark has its own SQL. Why would one use Cassandra/Hive instead of Spark's native SQL? Assuming that this is a brand new project with no existing installation?
Spark is a distributed in memory processing engine. It does not need to be paired with Hadoop, but since Hadoop is one of the most popular big data processing tools, Spark is designed to work well in that environment. For example, Hadoop uses the HDFS (Hadoop Distributed File System) to store its data, so Spark is able to read data from HDFS, and to save results in HDFS.
For speed, Spark keeps its data sets in memory. It will typically start a job by loading data from durable storage, such as HDFS, Hbase, a Cassandra database, etc. Once loaded into memory, Spark can run many transformations on the data set to calculate a desired result. The final result is then typically written back to durable storage.
In terms of it being an alternative to Hadoop, it can be much faster than Hadoop at certain operations. For example a multi-pass map reduce operation can be dramatically faster in Spark than with Hadoop map reduce since most of the disk I/O of Hadoop is avoided. Spark can read data formatted for Apache Hive, so Spark SQL can be much faster than using HQL (Hive Query Language).
Cassandra has its own native query language called CQL (Cassandra Query Language), but it is a small subset of full SQL and is quite poor for things like aggregation and ad hoc queries. So when Spark is paired with Cassandra, it offers a more feature rich query language and allows you to do data analytics that native CQL doesn't provide.
Another use case for Spark is for stream processing. Spark can be set up to ingest incoming real time data and process it in micro-batches, and then save the result to durable storage, such as HDFS, Cassandra, etc.
So Spark is really a standalone in-memory system that can be paired with many different distributed databases and file systems to add performance, a more complete SQL implementation, and features they may lack, such as stream processing.
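For example, with the (external) DataStax Spark Cassandra Connector on the classpath, a Cassandra table can be loaded as a DataFrame and aggregated with full Spark SQL; the host, keyspace, table and column names below are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark = SparkSession.builder()
  .appName("cassandra-analytics")
  .config("spark.cassandra.connection.host", "cassandra1.example.com")
  .getOrCreate()

// Load a Cassandra table through the connector's data source.
val users = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "shop", "table" -> "users"))
  .load()

// An aggregation that plain CQL cannot express directly.
users.groupBy("country").count().orderBy(desc("count")).show()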
I'm writing a paper about Hadoop for university and stumbled over your question. Spark uses Hadoop only for persistence, and only if you want it to. It's possible to use it with other persistence tiers, like Amazon EC2.
On the other hand, Spark runs in-memory, and it is not primarily built for map-reduce use cases the way Hadoop was/is.
I can recommend this article if you'd like a more detailed description: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
The README.md file in Spark can solve your puzzle:
A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at
"Specifying the Hadoop Version"
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.

Falcon's role in Hadoop ecosystem

I am supposed to work on cluster mirroring, where I have to set up a similar HDFS cluster (same master and slaves) as an existing one, copy the data to the new one, and then run the same jobs as is.
I have read about Falcon as a feed processing and workflow coordination tool, and that it is used for mirroring of HDFS clusters as well. Can someone enlighten me on what Falcon's role in the Hadoop ecosystem is, and how it helps with mirroring in particular? I am looking to understand what all Falcon offers when it is part of my Hadoop ecosystem (HDP).
Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components.
Falcon replication is asynchronous with delta changes. Recovery is done by running a process and swapping the source and target.
Data loss – Delta data may be lost if the primary cluster is completely shut down
Backup can be scheduled when needed depending on the bandwidth and network availability.

What is "Hadoop" - the definition of Hadoop?

It is kind of obvious, and we will all agree, that we can call HDFS + YARN + MapReduce "Hadoop". But what about the various other combinations and other products in the Hadoop ecosystem?
Is, for example, HDFS + YARN + Spark still Hadoop? Is HBase Hadoop? I guess we consider HDFS + YARN + Pig Hadoop, since Pig uses MapReduce.
Are only the MapReduce tools considered Hadoop, while anything else that runs on HDFS + YARN (like Spark) is non-Hadoop?
I agree with your impression that the "Hadoop" term does not have a useful definition. "We have a Hadoop cluster" may mean various things.
There is an official answer though at http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F:
The Apache™ Hadoop® project develops open-source software for
reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
So "Hadoop" is the name of a project and a software library. Any other usage is ill-defined.
In addition to the Apache Hadoop definition from the official website, I would like to highlight that Hadoop is a framework and there are many sub-systems in the Hadoop ecosystem.
I am quoting this content from the official website so that broken links in the future do not cause any issue to this answer.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
More or less,
Hadoop = Distributed Storage (HDFS) + Distributed Processing ( YARN + Map Reduce)
But these four modules do not cover the complete Hadoop ecosystem. There are many Hadoop-related projects and 40+ subsystems in the Hadoop ecosystem.
Other Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A Scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
Coming back to your question:
Just have a look at the 40+ sub-systems in the Hadoop ecosystem. Not everything you have quoted is necessarily Hadoop, but most of it is related to Hadoop.
Spark is part of the Hadoop ecosystem, but it does not have to use either HDFS or YARN: HDFS data sets can be replaced with RDDs (resilient distributed datasets), and Spark can run in Standalone mode without YARN.
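A tiny sketch of that point, as typed into a spark-shell started with --master local[*] (the file path is just an example):

// An RDD built from a local collection: no storage system at all.
val numbers = sc.parallelize(1 to 100)
println(numbers.sum())                        // 5050.0

// An RDD from a local file rather than hdfs://
val lines = sc.textFile("file:///etc/hosts")
println(lines.count())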
Have a look at this article and this article for Hadoop & Spark comparison.
Spark use cases over Hadoop:
Iterative Algorithms in Machine Learning
Interactive Data Mining and Data Processing
Stream processing
Sensor data processing
Since Spark doesn't have a storage system, it has to depend on one of the distributed storage systems, of which HDFS is one.
Have a look at related SE question:
Can apache spark run without hadoop?
The most generally accepted understanding of Hadoop: HDFS and Map/Reduce and their related processes and tooling.
A related term, the Hadoop ecosystem: Hive/Pig/HBase, ZooKeeper, Oozie. Also vendor-specific ones such as Impala and Ambari.
Why do we require a big data system?
STORE (to store massive amount of data)
PROCESS (process the data/requests in timely manner)
SCALE (scale easily as data grows)
There was a big data solution provided by Google:
Google File System: to solve distributed storage.
MapReduce: to solve distributed computing.
Google published the research papers. Apache developed an open-source system similar to the one developed by Google; it is known as HADOOP.
The counterpart of the Google File System in Hadoop is HDFS (Hadoop Distributed File System): a file system to manage the storage of data.
MapReduce: a framework to process data across multiple servers.
Note: In 2013, Apache released HADOOP 2.0
(MapReduce was broken into two components:
MapReduce: A framework to define the data processing task.
YARN: A framework to run the data processing task.
)
HADOOP ECOSYSTEM
Hadoop was not easy to understand, and it was limited to hard-core developers. So, to make the usage of Hadoop easier, lots of tools sprang up, which as a whole are known as the Hadoop Ecosystem.
The Hadoop Ecosystem contains tools like:
HIVE:
provides an SQL interface to Hadoop.
a bridge to Hadoop for folks who don't have exposure to OOP in Java.
HBASE:
a database management system on top of Hadoop.
integrates with your app just like a traditional database.
PIG:
a data manipulation language.
transforms unstructured data into a structured format.
you can query this structured data using interfaces like Hive.
SPARK:
a distributed computing engine used along with Hadoop.
an interactive shell to quickly process datasets.
has a bunch of built-in libraries for machine learning, stream processing, graph processing, etc.
OOZIE:
a tool to schedule workflows across all the Hadoop ecosystem technologies.
FLUME/SQOOP:
tools to transfer data between other systems and Hadoop.
This concludes a very high level overview of Hadoop.
Hadoop is not a stack like the LAMP or MEAN stack. Hadoop is a collection of frameworks and tools that work together to solve complex big data problems.
It is basically a project under the Apache Foundation. Various subprojects like MapReduce, Ambari, Sqoop, Spark, ZooKeeper, etc. together make up what is called the Hadoop Ecosystem.
Source : https://www.datacloudschool.com/2020/01/introduction-what-is-hadoop.html

data backup and recovery in hadoop 2.2.0

I am new to Hadoop and very interested in Hadoop administration, so I tried to install Hadoop 2.2.0 on Ubuntu 12.04 in pseudo-distributed mode. It installed successfully and I also ran some example jar files. Now I am trying to learn further, starting with the data backup and recovery part. Can anyone tell me ways to take data backups and recover them in Hadoop 2.2.0? Please also suggest some good books for Hadoop administration and steps to learn Hadoop administration.
Thanks in advance.
There is no classic backup and recovery functionality in Hadoop. There are several reasons for this:
HDFS uses block level replication for data protection via redundancy.
HDFS scales out massively in size, and it is becoming more economical to back up to disk rather than tape.
The size of "Big Data" doesn't lend itself to being easily backed up.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3 copies). It also has a function called 'distcp', which allows you to replicate copies of data between clusters. This is what's typically done for "backups" by most Hadoop operators.
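If you do roll your own copy instead of using distcp, the Hadoop FileSystem API is the building block; here is a hedged sketch (hostnames and paths are placeholders, and for real backups distcp is still preferable since it handles parallelism, retries and checksums):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val srcFs = FileSystem.get(new URI("hdfs://nn1.example.com:8020"), conf)
val dstFs = FileSystem.get(new URI("hdfs://backup-nn.example.com:8020"), conf)

// Copy a directory tree from the source cluster to the backup cluster.
val copied = FileUtil.copy(
  srcFs, new Path("/data/events"),
  dstFs, new Path("/backups/data/events"),
  false,   // do not delete the source
  conf)
println(s"copy succeeded: $copied")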
Some companies, like Cloudera, are incorporating the distcp tool into creating a 'backup' or 'replication' service for their distribution of Hadoop. It operates against a specific directory in HDFS, and replicates it to another cluster.
If you really wanted to create a backup service for Hadoop, you could create one manually yourself. You would need some mechanism for accessing the data (NFS gateway, WebHDFS, etc.), and could then use tape libraries, VTLs, etc. to create backups.

Resources