Falcon's role in the Hadoop ecosystem

I am supposed to work on cluster mirroring, where I have to set up a similar HDFS cluster (same master and slaves) as an existing one, copy the data over to the new cluster, and then run the same jobs as-is.
I have read about Falcon as a feed-processing and workflow-coordinating tool that is also used for mirroring HDFS clusters. Can someone enlighten me on Falcon's role in the Hadoop ecosystem and how it helps with mirroring in particular? I am looking to understand everything Falcon offers when it is part of my Hadoop ecosystem (HDP).

Apache Falcon simplifies the configuration of data motion with replication, lifecycle management, and lineage and traceability. This provides data-governance consistency across Hadoop components.
Falcon replication is asynchronous and ships delta changes. Recovery is done by running a recovery process and swapping the source and target clusters.
Data loss: delta changes that have not yet been replicated may be lost if the primary cluster is completely shut down.
Backups can be scheduled as needed, depending on bandwidth and network availability.
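To make "asynchronous replication with delta changes" concrete, here is a minimal Scala sketch that copies only files that are missing or newer on the source, using the standard Hadoop FileSystem API. This is only an illustration of the idea, not Falcon's actual implementation; the namenode addresses and paths are placeholders.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    object DeltaMirror {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Placeholder namenode addresses; substitute your own clusters.
        val srcFs = FileSystem.get(new URI("hdfs://source-nn:8020"), conf)
        val dstFs = FileSystem.get(new URI("hdfs://target-nn:8020"), conf)
        val srcDir = new Path("/data/feed")
        val dstDir = new Path("/data/feed")

        for (status <- srcFs.listStatus(srcDir)) {
          val target = new Path(dstDir, status.getPath.getName)
          // Copy only files that are missing or older on the target (the "delta").
          val stale = !dstFs.exists(target) ||
            dstFs.getFileStatus(target).getModificationTime < status.getModificationTime
          if (status.isFile && stale) {
            FileUtil.copy(srcFs, status.getPath, dstFs, target,
              false /* deleteSource */, true /* overwrite */, conf)
          }
        }
      }
    }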

Related

Incremental copy between Hadoop clusters (using Spark), given that Falcon is deprecated

Given that Falcon is deprecated and distcp cannot do incremental copies of data that do not fall into separate 'nice slices', how would one copy in a Sqoop-like way between two Hadoop clusters?
Can Spark attach to 2 clusters simultaneously? I have never had the need to try this.
I can see many issues listed with cross-realm HDFS access.
In summary:
You can use Cloudera Replication Manager for HDFS:
for HDP, CDP, and CDH environments; this is a file-based replication option.
for CDH you can also do this for a given (set of) table(s).
For Kudu there is no such option.
You can use a Spark app with two clusters by giving the full hdfs://namenode:port/path address and "rolling your own" logic; see the sketch below. This in fact applies to both HDFS/Hive and the Kudu storage manager.
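A minimal sketch of the "roll your own" Spark option: a single Spark job can read from one cluster and write to another simply by using fully qualified hdfs:// URIs. The hostnames, ports, paths, and the event_date column below are placeholders, and Kerberos/cross-realm trust must already be configured between the clusters.

    import org.apache.spark.sql.SparkSession

    object CrossClusterCopy {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("cross-cluster-copy").getOrCreate()

        // Read from cluster A using its fully qualified URI.
        val df = spark.read.parquet("hdfs://clusterA-nn:8020/warehouse/events")

        // An incremental, Sqoop-like copy can filter on a partition or timestamp
        // column (assumed here to be called "event_date") before writing out.
        df.filter(df("event_date") >= "2020-01-01")
          .write
          .mode("append")
          .parquet("hdfs://clusterB-nn:8020/warehouse/events")

        spark.stop()
      }
    }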

Replication vs Snapshot in HBase

We have two systems: an offline system (performance is not critical here), where MapReduce jobs run on the HBase cluster, and an online system (performance is very critical here), where an API reads from the same HBase cluster. Because the MapReduce jobs run on the same cluster, there are performance issues on the online system. So we are trying to set up a separate HBase cluster for the offline system, replicating a few column families from the source cluster.
On the source cluster the heavy MapReduce jobs run; on the replicated cluster only the online system runs, giving the best performance.
My question here is: can't we use the snapshot feature in HBase to do the same? I also wanted to know the difference between the two approaches.
If you use the snapshot feature for MapReduce, it will still spend CPU, memory, and disk I/O on the live HBase cluster nodes. So if disk I/O or CPU is the bottleneck for you, a separate cluster for MapReduce jobs is the better solution.
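For context, taking a snapshot and shipping it to a second cluster looks roughly like this through the HBase client API (called here from Scala). The table name, snapshot name, and target cluster address are placeholders; ExportSnapshot is the stock HBase tool for copying a snapshot's HFiles to another cluster, and it is the step that spends I/O on the live cluster.

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.snapshot.ExportSnapshot
    import org.apache.hadoop.util.ToolRunner

    object SnapshotShip {
      def main(args: Array[String]): Unit = {
        val conf = HBaseConfiguration.create()
        val connection = ConnectionFactory.createConnection(conf)
        val admin = connection.getAdmin

        // Take a point-in-time snapshot of the table (cheap: no data copy).
        admin.snapshot("events_snap", TableName.valueOf("events"))

        // Ship the snapshot's HFiles to the offline cluster's HBase root dir.
        ToolRunner.run(conf, new ExportSnapshot(),
          Array("-snapshot", "events_snap",
                "-copy-to", "hdfs://offline-nn:8020/hbase"))

        admin.close()
        connection.close()
      }
    }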

Differences between existing MapReduce and YARN (MRv2)

Could anyone tell me the differences between the existing MapReduce and YARN? I cannot find a clear list of the differences between the two.
P.S.: I'm asking for something like a side-by-side comparison of these.
Thanks!
MRv1 uses the JobTracker to create and assign tasks to data nodes, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a ResourceManager for each cluster, and each data node runs a NodeManager. For each job, one container on a worker node hosts the ApplicationMaster, which monitors resources, tasks, and so on.
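One practical consequence of this split is that the execution framework became a client-side configuration choice rather than a hard-wired JobTracker. A small sketch of this, where the property names are real Hadoop 2 settings but the ResourceManager host is a placeholder:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    object FrameworkSelect {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Under Hadoop 2 the same MapReduce job is routed to YARN (or run
        // locally) by configuration instead of a fixed JobTracker address.
        conf.set("mapreduce.framework.name", "yarn")              // or "local"
        conf.set("yarn.resourcemanager.address", "rm-host:8032")  // placeholder

        val job = Job.getInstance(conf, "wordcount-on-yarn")
        // ... set mapper/reducer/input/output as usual, then:
        // job.waitForCompletion(true)
      }
    }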
MRv1, also called Hadoop 1, is the version in which resource management and scheduling are tightly coupled to the MapReduce programming framework (inside the JobTracker).
Because of this, non-batch applications cannot be run on Hadoop 1.
It also has a single NameNode, so it does not provide high availability, and its scalability is limited.
In MRv2 (aka Hadoop 2), resource management and scheduling are separated from MapReduce and handled by YARN (Yet Another Resource Negotiator).
The resource-management and scheduling layer lies beneath the MapReduce layer.
Hadoop 2 also provides high availability and better scalability, as we can run redundant NameNodes.
There is also the new snapshot feature, through which we can take backups of the filesystem, which helps with disaster recovery.
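As a concrete illustration of the snapshot point, HDFS snapshots can be taken programmatically once a directory has been made snapshottable. The namenode host, path, and snapshot name below are placeholders, and allowSnapshot requires HDFS superuser privileges:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.hdfs.DistributedFileSystem

    object SnapshotDemo {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val fs = FileSystem.get(new URI("hdfs://nn-host:8020"), conf)
          .asInstanceOf[DistributedFileSystem]

        val dir = new Path("/data/important")
        fs.allowSnapshot(dir)  // superuser only; needed once per directory
        val snap = fs.createSnapshot(dir, "backup-2020-01-01")
        // Snapshots are read-only and live under /data/important/.snapshot/...
        println(s"created snapshot at $snap")
      }
    }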

Hadoop Disaster Recovery

Could anyone help me understand Hadoop disaster recovery?
Should I replicate data from one cluster to another cluster as a backup, using distcp?
Or can I use copyToLocal to copy my data to my local machine?
Does anyone have an idea about this?
A DR plan goes beyond just the technology, and the requirements can greatly affect the solution.
For instance, if you can't afford to lose any data, you'd want an active/active setup and would send data to two Hadoop clusters simultaneously. On the other side of the spectrum, Hadoop's replication (the default is 3 copies, but you can change that) and rack awareness can give you a copy on a secondary rack. In between, you can use tools like distcp, which you mention, to copy data from cluster to cluster, as in the sketch below.
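For the distcp option, the same copy can also be scheduled programmatically (for example from a cron-driven driver) instead of from the shell. A hedged sketch, with placeholder namenode hosts, using distcp's real -update flag for incremental syncs:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.tools.DistCp
    import org.apache.hadoop.util.ToolRunner

    object DrSync {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // -update copies only files that are missing or changed on the target,
        // which keeps periodic DR syncs cheap after the first full run.
        val exitCode = ToolRunner.run(new DistCp(conf, null),
          Array("-update",
                "hdfs://prod-nn:8020/data",
                "hdfs://dr-nn:8020/data"))
        sys.exit(exitCode)
      }
    }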
Additionally, you might want to follow the Falcon project, which is a new initiative for Hadoop data lifecycle management.

HBase and Hadoop

HBase requires a Hadoop installation, based on what I have read so far. It looks like HBase can be set up to use an existing Hadoop cluster (one shared with some other users), or it can be set up with a dedicated Hadoop cluster. I guess the latter would be the safer configuration, but I am wondering if anybody has experience with the former (then again, I am not very sure my understanding of HBase setup is correct).
I know that Facebook and other large organizations separate their HBase cluster (real time access) from their Hadoop cluster (batch analytics) for performance reasons. Large MapReduce jobs on the cluster have the ability to impact performance of the real-time interface, which can be problematic.
In a smaller organization or in a situation in which your HBase response time doesn't necessarily need to be consistent, you can just use the same cluster.
There aren't many (or any) concerns with coexistence other than performance.
We've set it up with an existing Hadoop cluster that's 1,000 cores strong. Short answer: it works just fine, at least with Cloudera CH2 +149.88. But your mileage may vary depending on the Hadoop version.
In distributed mode, Hadoop is used for its HDFS storage. HBase stores its HFiles on HDFS and thus benefits from the replication strategies and data-locality principles provided by the DataNodes.
RegionServers mostly handle local data, but may still have to fetch data from other DataNodes.
I hope that helps you understand why and how Hadoop is used with HBase.
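To make the wiring concrete: in fully distributed mode, HBase is pointed at the (possibly shared) HDFS cluster through the hbase.rootdir property, which is also how you would aim it at a dedicated cluster instead. A minimal sketch with placeholder hosts; hbase.rootdir is normally set in hbase-site.xml on the servers and is shown programmatically here only for illustration:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.ConnectionFactory

    object HBaseOnHdfs {
      def main(args: Array[String]): Unit = {
        val conf = HBaseConfiguration.create()
        // Server-side, this directory is where HBase persists HFiles and WALs,
        // so it inherits HDFS replication and data locality on the DataNodes.
        conf.set("hbase.rootdir", "hdfs://shared-nn:8020/hbase")  // placeholder
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3")         // placeholder

        val connection = ConnectionFactory.createConnection(conf)
        println("connected: " + !connection.isClosed)
        connection.close()
      }
    }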
