Replication vs Snapshot in HBase - hadoop

We have two systems: an offline system (performance is not critical), where MapReduce jobs run on the HBase cluster, and an online system (performance is very critical), where an API reads from the same HBase cluster. Because the MapReduce jobs run on the same cluster, the online system has performance issues. So we are trying to set up a separate HBase cluster that replicates a few column families from the source cluster, so that the two workloads no longer share a cluster.
The heavy MapReduce jobs then run on the source cluster, and only the online system runs on the replicated cluster, giving it the best performance.
My question is: can't we use the snapshot feature in HBase to do the same? I would also like to know what the difference between the two is.

If you use the snapshot feature for MapReduce, it will still consume CPU, memory, and disk I/O on the live HBase cluster nodes. So if disk I/O or CPU is your bottleneck, a separate cluster for MapReduce jobs is the better solution.
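To make the difference concrete: replication continuously ships new edits for the column families you mark, while a snapshot is a point-in-time copy that has to be taken and exported again each time you want fresher data. A minimal sketch that only builds the command text for both approaches follows. The table, family, peer, and URL values are hypothetical, and the exact `add_peer` syntax is an assumption that varies between HBase versions, so check it against your release.

```python
def replication_commands(table, families, peer_id, peer_cluster_key):
    """Build HBase shell lines that replicate only `families` of `table`.

    Assumes the HBase 1.x+ style `add_peer` with CLUSTER_KEY; older
    releases use a different form.
    """
    lines = [f"add_peer '{peer_id}', CLUSTER_KEY => \"{peer_cluster_key}\""]
    for family in families:
        # REPLICATION_SCOPE => '1' marks the column family for replication.
        lines.append(
            f"alter '{table}', {{NAME => '{family}', REPLICATION_SCOPE => '1'}}"
        )
    return lines


def snapshot_commands(table, snapshot_name, target_hdfs_url, mappers=4):
    """Build the commands for a one-off snapshot and its export.

    The first line is typed in the HBase shell; the second is run from the
    operating system shell and copies the snapshot to another cluster.
    """
    return [
        f"snapshot '{table}', '{snapshot_name}'",
        "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot "
        f"-snapshot {snapshot_name} -copy-to {target_hdfs_url} -mappers {mappers}",
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for line in replication_commands(
        "events", ["meta", "stats"], "1", "zk1.example.org:2181:/hbase"
    ):
        print(line)
    for line in snapshot_commands(
        "events", "events_snap", "hdfs://offline-nn.example.org:8020/hbase"
    ):
        print(line)
```

Note that the export step itself runs as a MapReduce job reading files from the source cluster, which is the resource cost mentioned above.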

Related

HBase standalone performance vs. running on an HDFS cluster

My application is connected to HBase and communicates with it heavily (hundreds or thousands of reads/writes per second). This strongly affects performance, probably because of the I/O operations HBase performs on every request.
Doo.dle are calls to my code - the difference between blue and red is time consumed by HBase.
Currently, I've only tested in standalone mode, where HBase stores data on the local file system. I was wondering whether running it in distributed mode on an actual HDFS could significantly improve performance, or would just yield the same results. I'm trying to get a clue before spending too much time getting a cluster up and running.
A second question I've asked myself is whether a standalone HBase could be configured to persist data only to memory (RAM) instead of writing it to the file system, for the purpose of performance measurements.
In standalone mode, HBase does not use HDFS; it runs all HBase daemons and a local ZooKeeper in the same JVM.
In pseudo-distributed mode, HBase can run against the local filesystem or against an instance of the Hadoop Distributed File System (HDFS), but all daemons still run on a single host. So in terms of performance there is no real difference between standalone and pseudo-distributed mode.
Fully-distributed mode requires HDFS, which means the work is distributed across the cluster, and in my experience that takes time.
So running HBase in fully-distributed mode on an actual HDFS could significantly improve performance.
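In configuration terms, the modes described above differ mainly in two hbase-site.xml properties: `hbase.rootdir` (a `file://` path versus an `hdfs://` URL) and `hbase.cluster.distributed`. A minimal sketch that renders both variants follows; the paths and the NameNode host are hypothetical placeholders, and a real cluster needs further settings (for example the ZooKeeper quorum).

```python
import xml.etree.ElementTree as ET


def hbase_site(rootdir, distributed):
    """Render a minimal hbase-site.xml as a string."""
    conf = ET.Element("configuration")
    properties = (
        ("hbase.rootdir", rootdir),
        ("hbase.cluster.distributed", str(distributed).lower()),
    )
    for name, value in properties:
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(conf, encoding="unicode")


# Standalone: local filesystem, all daemons in one JVM.
standalone = hbase_site("file:///tmp/hbase", False)

# Distributed: data on HDFS (hypothetical NameNode address).
distributed = hbase_site("hdfs://namenode.example.org:8020/hbase", True)

if __name__ == "__main__":
    print(standalone)
    print(distributed)
```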

Falcon's role in Hadoop ecosystem

I am supposed to work on cluster mirroring, where I have to set up an HDFS cluster similar to an existing one (same master and slaves), copy the data to the new cluster, and then run the same jobs as is.
I have read that Falcon is a feed-processing and workflow-coordination tool and that it is also used for mirroring HDFS clusters. Can someone explain Falcon's role in the Hadoop ecosystem and, in particular, how it helps with mirroring? I want to understand everything Falcon offers when it is part of my Hadoop ecosystem (HDP).
Apache Falcon simplifies the configuration of data motion with: replication; lifecycle management; lineage and traceability. This provides data governance consistency across Hadoop components.
Falcon replication is asynchronous and copies delta changes. Recovery is done by running a process and swapping the source and target.
Data loss – delta data may be lost if the primary cluster is completely shut down.
Backups can be scheduled as needed, depending on bandwidth and network availability.
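The data-loss point follows directly from asynchronous delta replication: anything written after the last scheduled run exists only on the primary. A toy model of that behaviour is sketched below. It does not use Falcon's API; the `Primary` and `Replica` classes are hypothetical and exist only to illustrate the idea.

```python
class Primary:
    """Toy primary cluster: an append-only log of (key, value) writes."""

    def __init__(self):
        self.log = []

    def write(self, key, value):
        self.log.append((key, value))


class Replica:
    """Toy mirror that copies only entries added since its last sync."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already copied

    def sync_from(self, primary):
        delta = primary.log[self.applied:]  # only the changes since last run
        for key, value in delta:
            self.data[key] = value
        self.applied = len(primary.log)
        return len(delta)


primary = Primary()
replica = Replica()

primary.write("a", 1)
primary.write("b", 2)
copied = replica.sync_from(primary)  # scheduled replication run

primary.write("c", 3)  # written after the last run

# If the primary is lost now, these entries never reached the replica.
lost = primary.log[replica.applied:]

if __name__ == "__main__":
    print("copied in last run:", copied)
    print("replica contents:", replica.data)
    print("at risk if primary fails:", lost)
```

Scheduling the runs more often shrinks the window of data at risk, at the cost of more bandwidth.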

Differences between existing MapReduce and YARN (MRv2)

Could anyone tell me what the differences are between the existing MapReduce and YARN? I cannot find a clear list of the differences between the two.
P.S.: I'm asking for something like a comparison between them.
Thanks!
MRv1 uses the JobTracker to create tasks and assign them to the TaskTrackers on the worker nodes, which can become a resource bottleneck when the cluster scales out far enough (usually around 4,000 nodes).
MRv2 (aka YARN, "Yet Another Resource Negotiator") has one Resource Manager per cluster, and each worker node runs a Node Manager. For each job, one worker node hosts the Application Master, which monitors that job's resources, tasks, and so on.
MRv1 is the MapReduce of Hadoop 1, where resource management and scheduling are tightly coupled with the MapReduce programming framework.
Because of this, non-batch applications cannot run on Hadoop 1.
Hadoop 1 has a single NameNode, so it does not provide high availability or scalability.
MRv2 (aka Hadoop 2): in this version of Hadoop, resource management and scheduling are separated from MapReduce and handled by YARN (Yet Another Resource Negotiator).
The resource management and scheduling layer lies beneath the MapReduce layer.
Hadoop 2 also provides high availability and scalability, because redundant NameNodes can be configured.
It also adds a snapshot feature for backing up the filesystem, which helps with disaster recovery.
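The division of responsibilities described in these answers can be pictured with a toy count of what each daemon has to keep track of. This is only an illustrative sketch, not Hadoop code: the job names and task counts are made up, and the real ResourceManager also handles container allocation, which is left out here.

```python
# Hypothetical workload: job name -> number of tasks.
jobs = {"etl": 100, "report": 250}


def mrv1_view(jobs):
    """MRv1: the single JobTracker schedules and monitors every task of every job."""
    return {"JobTracker": sum(jobs.values())}


def yarn_view(jobs):
    """YARN: the ResourceManager tracks applications (one entry per job),
    while each job's own ApplicationMaster monitors that job's tasks."""
    view = {"ResourceManager": len(jobs)}
    for name, tasks in jobs.items():
        view[f"ApplicationMaster[{name}]"] = tasks
    return view


if __name__ == "__main__":
    print(mrv1_view(jobs))
    print(yarn_view(jobs))
```

In the MRv1 view, the per-task bookkeeping for the whole cluster lands on one daemon; in the YARN view, it is spread across one Application Master per job.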

How many HBase servers should I have per Hadoop server?

I have a system that feeds small image files into an HBase table, with Hadoop as the underlying file system.
I currently have 2 instances of Hadoop and 1 instance of HBase, but my question is: what should the ratio be? Should I have 1 Hadoop server per HBase server, or does it not really matter?
The answer is: it depends.
It depends on how much data you have, the CPU utilization of the RegionServers, and various other factors. You need to run some proofs of concept to work out the sizing of your Hadoop and HBase clusters. How Hadoop and HBase are best used together varies by use case.
As a matter of fact, I recently saw a setup where the Hadoop and HBase clusters were totally decoupled. In that setup, the HBase cluster reads from and writes to HDFS on the remote Hadoop cluster.

HBase and Hadoop

Based on what I have read so far, HBase requires a Hadoop installation. It looks like HBase can be set up either to use an existing Hadoop cluster (shared with other users) or to use a dedicated Hadoop cluster. I guess the latter would be the safer configuration, but I am wondering whether anybody has experience with the former (though I am not sure my understanding of the HBase setup is correct).
I know that Facebook and other large organizations separate their HBase cluster (real time access) from their Hadoop cluster (batch analytics) for performance reasons. Large MapReduce jobs on the cluster have the ability to impact performance of the real-time interface, which can be problematic.
In a smaller organization or in a situation in which your HBase response time doesn't necessarily need to be consistent, you can just use the same cluster.
There aren't many (or any) concerns with coexistence other than performance concerns.
We've set it up with an existing Hadoop cluster that's 1,000 cores strong. Short answer: it works just fine, at least with Cloudera CH2 +149.88. Depending on your Hadoop version, your mileage may vary.
In distributed mode, HBase uses Hadoop for its HDFS storage. HBase stores its HFiles on HDFS and thus benefits from the replication strategies and data-locality principles provided by the DataNodes.
RegionServers mostly handle local data, but may still have to fetch data from other DataNodes.
Hope that will help you to understand why and how hadoop is used with HBase.
