HBase standalone performance vs. running on an HDFS cluster

My application is connected to HBase and does a lot of communication (hundreds or thousands of reads/writes per second). This strongly affects performance, probably due to the I/O operations HBase performs on every request.
(Profiler chart not reproduced here: the Doo.dle entries are calls to my code; the difference between the blue and red series is the time consumed by HBase.)
Currently, I've only tested in standalone mode, where HBase stores data using the local file system. I was wondering whether running HBase in distributed mode against an actual HDFS could significantly improve performance, or whether it would just yield the same results. I'm trying to get a clue before sinking too much time into getting a cluster up and running.
A second question I've asked myself is whether a standalone HBase could be configured to persist data to memory (RAM) instead of writing it to the file system, for the purpose of performance measurements.

In standalone mode, HBase does not use HDFS; it runs all HBase daemons and a local ZooKeeper together in the same JVM.
In pseudo-distributed mode, HBase can run against the local filesystem or against an instance of the Hadoop Distributed File System. So, as far as performance is concerned, there is no difference between standalone and pseudo-distributed mode.
Fully-distributed mode requires HDFS, which means that tasks run as distributed jobs, and that takes time according to my experience.
So using HBase in fully-distributed mode with an actual HDFS could significantly improve performance.
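As for the second question (keeping the data in RAM), a minimal sketch of one approach, assuming the hbase-testing-util artifact is on the classpath: HBaseTestingUtility runs the master, a region server, and ZooKeeper inside the current JVM against a temporary local directory, which you can place on a tmpfs/RAM-disk mount if you want to take the disk out of the measurement. The table name and write loop below are illustrative.

```java
import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class InMemoryHBaseBenchmark {
    public static void main(String[] args) throws Exception {
        HBaseTestingUtility util = new HBaseTestingUtility();
        util.startMiniCluster();                    // all daemons in this JVM
        Table table = util.createTable(
                TableName.valueOf("bench"), Bytes.toBytes("cf"));

        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {          // simple write loop
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
            table.put(put);
        }
        System.out.printf("10k puts in %d ms%n",
                (System.nanoTime() - start) / 1_000_000);

        util.shutdownMiniCluster();
    }
}
```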

Related

Is there a significant performance difference between Pseudo-Distributed and Fully-distributed mode in Hadoop?

I was reading the Hadoop documentation, and I found this:
"Both standalone mode and pseudo-distributed mode are provided for the purposes of small-scale testing".
I have 2 questions.
First, how big is considered small-scale? More specifically, I'm going to use at most 32 nodes; is it OK for me to run in pseudo-distributed mode?
Second, even at small scale, is there any performance difference between pseudo-distributed and fully-distributed mode? I'm running Hadoop on my Mac, and it's kind of difficult for me to find a real cluster system. Is there anything I have to pay attention to?
"at most 32 nodes, is this ok for me to run it in the pseudo-distributed mode?"
Pseudo-distributed specifically means you have only one node. All Hadoop services run as separate daemons, capable of talking to each other as if they were on an external interface (not just localhost), and they use HDFS, not just the local filesystem.
To create a true "distributed mode" cluster, you add additional nodes to that single node with the correct configuration. Tip: Apache Ambari makes this process much easier.
However, HDFS wants to be able to replicate blocks at least three times by default, and to accommodate downtime in these services, 5 nodes is a good minimum. I also recommend that you set up High Availability in your cluster using a standalone installation of 3-5 ZooKeeper servers.
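For the replication default mentioned above, a small sketch (my illustration; the file path is made up, and it assumes fs.defaultFS points at an HDFS cluster) of how the factor can be read and overridden through the client-side Hadoop Configuration; the cluster-wide default lives in hdfs-site.xml as dfs.replication:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);  // default is 3; needs >= 3 datanodes
        FileSystem fs = FileSystem.get(conf);

        // Files created through this client inherit the replication factor:
        Path p = new Path("/tmp/replicated-file");
        fs.create(p).close();
        System.out.println("replication = " + fs.getFileStatus(p).getReplication());
    }
}
```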

Pig script runs fine on Sandbox but fails on a real cluster

Environments:
Hortonworks Sandbox running HDP 2.5
Hortonworks HDP 2.5 Hadoop cluster managed by Ambari
We are facing a tricky situation. We run a Pig script from the Hadoop tutorial; the script works with tiny data. It runs fine on the Sandbox but fails on the real cluster, where it complains about insufficient memory for the container.
A "container is running beyond physical memory limit" message can be seen in the logs.
The tricky part is that the Sandbox has far less memory available than the real cluster (about 3 times less). Also, most memory settings in the Sandbox (MapReduce memory, YARN memory, YARN container sizes) allow much less memory than the corresponding settings in the real cluster. Still, that is sufficient for Pig in the Sandbox but not in the real cluster.
Another note: Hive queries doing a similar job work fine in both environments; they do not complain about memory.
Apparently there is some setting somewhere (within the second environment) that makes Pig request too much memory. Can anybody recommend which parameter should be modified to stop the Pig script from requesting so much memory?
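As a hedged illustration of the knobs usually involved (the values below are illustrative, not a diagnosis): Pig forwards job properties to YARN, so the per-container sizes and the matching JVM heap options are the usual suspects when a container is killed for exceeding its physical limit. The same properties can also be set with `set` statements at the top of a Pig script; here they are set through the embedded PigServer API:

```java
import java.util.Properties;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigMemorySettings {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Ask YARN for 2 GB containers and give ~80% of that to the JVM heap,
        // leaving headroom so the container is not killed for off-heap usage:
        props.setProperty("mapreduce.map.memory.mb", "2048");
        props.setProperty("mapreduce.reduce.memory.mb", "2048");
        props.setProperty("mapreduce.map.java.opts", "-Xmx1638m");
        props.setProperty("mapreduce.reduce.java.opts", "-Xmx1638m");

        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
        pig.registerQuery("a = LOAD '/tmp/input' AS (line:chararray);");
        pig.store("a", "/tmp/output");
        pig.shutdown();
    }
}
```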

Replication vs Snapshot in HBase

We have two systems: an offline system (performance is not critical here), where MapReduce jobs run on the HBase cluster, and an online system (performance is very critical here), where an API reads from the same HBase cluster. Because the MapReduce jobs run on the same cluster, there are performance issues on the online system. So we are trying to set up a separate HBase cluster for the offline system, which replicates a few column families from the source cluster.
So the heavy MapReduce jobs run on the source cluster, while only the online system runs on the replicated cluster, giving it the best performance.
My question here is: can't we use the snapshot feature in HBase to do the same? I also wanted to know what the difference between them is.
If you use the snapshot feature for MapReduce, it will still spend CPU, memory, and disk I/O on the live HBase cluster nodes. So if disk I/O or CPU is the bottleneck for you, a separate cluster for the MapReduce jobs is the better solution.
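For illustration (not part of the original answer): HBase can run a MapReduce job directly over a snapshot via TableMapReduceUtil.initTableSnapshotMapperJob, which bypasses the region servers but still reads HFiles from the live cluster's disks; that is exactly the resource cost the answer describes. The snapshot name and restore path below are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SnapshotScanJob {
    // Counts rows read from the snapshot, without touching region servers.
    static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
            ctx.getCounter("snapshot", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-from-snapshot");
        job.setJarByClass(SnapshotScanJob.class);
        TableMapReduceUtil.initTableSnapshotMapperJob(
                "my_snapshot",                         // hypothetical snapshot name
                new Scan(),                            // full scan of the snapshot
                RowCountMapper.class,
                NullWritable.class, NullWritable.class,
                job, true,
                new Path("/tmp/snapshot-restore"));    // scratch dir on HDFS
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```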

HBase Standalone Mode Functions

When I execute start-hbase.sh in standalone mode (not in distributed or pseudo-distributed mode), it does not start separate region servers, ZooKeeper, or a backup master; only the master starts (since there is no HDFS file system, it cannot run region servers). Does this mode affect the recovery part of HBase? For example, crashing the VM while HBase data is still in the MemStore, then restarting the VM and HBase.
I tried the above experiment, and HBase was not able to recover. What could be the reason?
There is no difference whether it is standalone mode or pseudo-distributed mode, but it will take some time writing into the WAL in standalone mode.
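A small sketch (my illustration, not from the answer) of why recovery hinges on the WAL: each mutation carries a Durability setting, and only WAL-logged writes can be replayed from the log after a crash; anything written with SKIP_WAL and still in the MemStore is lost. Table and column names are made up.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WalDurability {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("t"))) {

            Put safe = new Put(Bytes.toBytes("row1"));
            safe.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
            safe.setDurability(Durability.SYNC_WAL);  // logged; replayed on restart
            table.put(safe);

            Put fast = new Put(Bytes.toBytes("row2"));
            fast.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v"));
            fast.setDurability(Durability.SKIP_WAL);  // faster; lost on a crash
            table.put(fast);
        }
    }
}
```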

Difference between Pig local and MapReduce mode

What is the actual difference between running Pig scripts locally and in MapReduce mode?
I understand MapReduce mode is when you run it on a cluster that has HDFS installed. Does this mean local mode does not need HDFS, and so even MapReduce jobs don't get triggered? What is the difference, and when do you use one or the other?
Local mode builds a simulated MapReduce job running off a local file on disk. It is in theory equivalent to MapReduce, but it's not a "real" MR job. You shouldn't be able to tell the difference from a user perspective.
Local mode is great for development.
Local mode: All scripts are run on a single machine without requiring Hadoop MapReduce and HDFS. This can be useful for developing and testing Pig logic. If you're using a small set of data to develop or test your code, then local mode could be faster than going through the MapReduce infrastructure.
Local mode doesn’t require Hadoop. When you run in Local mode, the Pig program runs in the context of a local Java Virtual Machine, and data access is via the local file system of a single machine. Local mode is actually a local simulation of MapReduce in Hadoop’s LocalJobRunner class.
MapReduce mode (also known as Hadoop mode): Pig is executed on the Hadoop cluster. In this case, the Pig Script gets converted into a series of MapReduce jobs that are then run on the Hadoop cluster.
If you have a terabyte of data that you want to perform operations on and you want to interactively develop a program, you may soon find things slowing down considerably, and you may start growing your storage. Local mode allows you to work with a subset of your data in a more interactive manner so that you can figure out the logic (and work out the bugs) of your Pig program.
After you have things set up as you want them and your operations are running smoothly, you can then run the script against the full data set using MapReduce mode.
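A minimal sketch of the two modes side by side, using Pig's embedded Java API (the paths are placeholders): the Pig Latin is identical, and only the ExecType, and with it the filesystem and execution engine, changes.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigModes {
    public static void main(String[] args) throws Exception {
        // Local mode: single JVM, local filesystem, LocalJobRunner underneath.
        PigServer local = new PigServer(ExecType.LOCAL);
        local.registerQuery("lines = LOAD '/tmp/sample.txt' AS (line:chararray);");
        local.store("lines", "/tmp/out-local");
        local.shutdown();

        // MapReduce mode: compiled into MR jobs submitted to the cluster,
        // reading from and writing to HDFS paths instead.
        PigServer cluster = new PigServer(ExecType.MAPREDUCE);
        cluster.registerQuery("lines = LOAD '/data/sample.txt' AS (line:chararray);");
        cluster.store("lines", "/data/out-mr");
        cluster.shutdown();
    }
}
```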
