HBase Standalone Mode Functions - hadoop

When I execute start-hbase.sh in standalone mode (not in distributed or pseudo-distributed mode), it only starts the master; it does not start region servers, ZooKeeper, or a backup master (since there is no HDFS filesystem, it cannot run region servers). Does this mode affect the recovery side of HBase? For example, crashing the VM while HBase data is still in the MemStore, then restarting the VM and HBase.
I tried the above experiment and HBase was not able to recover. What could be the reason?

There is no difference whether it is standalone mode or pseudo-distributed mode, but writing into the WAL will take some time in standalone mode.
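By default, standalone HBase keeps its root directory (and therefore its write-ahead logs) under the system temp directory, which may not survive a VM restart. A minimal sketch of pointing it at a durable local path instead; the path /data/hbase is a made-up example, and the WALs directory name can vary by HBase version:

    # Hedged sketch: keep standalone HBase data (including its write-ahead logs)
    # on a durable local path instead of the default under /tmp.
    # /data/hbase is an example path only. Add this property to conf/hbase-site.xml:
    #
    #   <property>
    #     <name>hbase.rootdir</name>
    #     <value>file:///data/hbase</value>
    #   </property>
    #
    # After a crash and restart, the logs that recovery replays should be visible
    # under the root directory (directory name may differ by version):
    ls /data/hbase/WALs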

Related

Is there a significant performance difference between Pseudo-Distributed and Fully-distributed mode in Hadoop?

I was reading the Hadoop documentation, and I found this:
"Both standalone mode and pseudo-distributed mode are provided for the purposes of small-scale testing".
I have 2 questions.
First, how big is considered small-scale? More specifically, I'm going to use at most 32 nodes; is it OK for me to run in pseudo-distributed mode?
Second, even for small scale, is there any performance difference between pseudo-distributed and fully-distributed mode? Since I'm running Hadoop on my Mac, it's kind of difficult for me to find a real cluster system. Is there anything I have to pay attention to?
at most 32 nodes, is this ok for me to run it in the pseudo-distributed mode?
Pseudo-distributed specifically means you have only one node. All Hadoop services are able to talk to each other as if they were on an external interface (not just localhost), and they use HDFS rather than just the local filesystem.
To create a "distributed mode" cluster, you can add additional nodes to your single node with the correct configuration. Tip: Apache Ambari makes this process much easier.
However, HDFS will want to replicate blocks at least three times by default, and to accommodate downtime in these services, 5 nodes is a good minimum. I also recommend that you set up High Availability in your cluster using a standalone installation of 3-5 ZooKeeper servers.
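A hedged sketch of the kind of settings that separate a pseudo-distributed setup (single node, everything on localhost) from a real multi-node cluster; the hostname is hypothetical and the property names are the stock Apache Hadoop ones:

    # Hedged sketch; hostname and values are examples only.
    #
    # core-site.xml - clients talk to a real NameNode instead of localhost:
    #   <property>
    #     <name>fs.defaultFS</name>
    #     <value>hdfs://namenode-1:8020</value>
    #   </property>
    #
    # hdfs-site.xml - with 5+ DataNodes the default replication of 3 can stay;
    # a single pseudo-distributed node usually has to drop it to 1:
    #   <property>
    #     <name>dfs.replication</name>
    #     <value>3</value>
    #   </property>

    # Check what the cluster is actually using:
    hdfs getconf -confKey dfs.replication
    hdfs dfsadmin -report    # lists the live DataNodes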

Pig script runs fine on Sandbox but fails on a real cluster

Environments:
Hortonworks Sandbox running HDP 2.5
Hortonworks HDP 2.5 Hadoop cluster managed by Ambari
We are facing a tricky situation. We run a Pig script from the Hadoop tutorial. The script works with tiny data. It runs fine on the Sandbox but fails on the real cluster, where it complains about insufficient memory for the container.
The message "container is running beyond physical memory limit" can be seen in the logs.
The tricky part is: the Sandbox has way less memory available than the real cluster (about 3 times less). Also, most memory settings on the Sandbox (MapReduce memory, YARN memory, YARN container sizes) allow much less memory than the corresponding settings in the real cluster. Still, it is sufficient for Pig on the Sandbox but not in the real cluster.
Another note: Hive queries doing a similar job also work fine (in both environments); they do not complain about memory.
Apparently there is some setting somewhere (within environment 2) that makes Pig request too much memory? Can anybody please recommend which parameter should be modified to stop the Pig script from requesting so much memory?
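For context, a hedged sketch of the kind of MapReduce memory knobs the question refers to, overridden for a single Pig run. The property names are the standard MR2 ones, the values are arbitrary examples, and script.pig is a placeholder; the actual culprit setting on the cluster may well be a different one:

    # Hedged sketch: override YARN container sizes and JVM heaps for one Pig run.
    # Values are examples; the Java heap (-Xmx) should stay well below the
    # matching container size.
    pig -Dmapreduce.map.memory.mb=2048 \
        -Dmapreduce.map.java.opts=-Xmx1536m \
        -Dmapreduce.reduce.memory.mb=4096 \
        -Dmapreduce.reduce.java.opts=-Xmx3072m \
        -x mapreduce script.pig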

HBase standalone performance vs. running on an HDFS cluster

My application is connected to HBase and does a lot of communication (hundreds or thousands of reads/writes per second). This strongly affects performance, probably due to the I/O operations HBase performs on every request.
Doo.dle entries are calls to my code; the difference between blue and red is the time consumed by HBase.
Currently, I've only tested in standalone mode, where HBase stores data using the local file system. I was wondering whether running in distributed mode against an actual HDFS could significantly improve performance, or would just yield the same results. I'm trying to get a clue before sinking too much time into getting a cluster up and running.
A second question I've asked myself is whether a standalone HBase could be configured to just persist data to memory (RAM) instead of writing it to the file system, as a performance measure.
In standalone mode, HBase does not use HDFS; it runs all HBase daemons and a local ZooKeeper together in the same JVM.
In pseudo-distributed mode, HBase can run against the local filesystem or against an instance of the Hadoop Distributed File System. So there is no performance difference between standalone and pseudo-distributed mode.
The fully-distributed mode requires the use of HDFS, which means the tasks run as jobs distributed across the cluster; in my experience that is what makes the difference.
So using HBase in fully-distributed mode with an actual HDFS could significantly improve performance.
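A hedged sketch of the configuration difference described above; the property names are standard HBase ones, while the path and NameNode host are made-up examples:

    # Hedged sketch; path and hostname are examples only.
    #
    # Standalone / pseudo-distributed on the local filesystem - hbase-site.xml:
    #   <property>
    #     <name>hbase.rootdir</name>
    #     <value>file:///data/hbase</value>
    #   </property>
    #
    # Fully distributed on HDFS - hbase-site.xml:
    #   <property>
    #     <name>hbase.rootdir</name>
    #     <value>hdfs://namenode-1:8020/hbase</value>
    #   </property>
    #   <property>
    #     <name>hbase.cluster.distributed</name>
    #     <value>true</value>
    #   </property>

    # Once HBase runs against HDFS, its root directory is visible there:
    hdfs dfs -ls /hbase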

Should the HBase region server and Hadoop data node on the same machine?

Sorry that I don't have the resources to set up a cluster to test it; I'm just wondering:
Can I deploy the HBase region server on a separate machine, other than the Hadoop data node machine? I guess the answer is yes, but I'm not sure.
Is it good or bad to deploy the HBase region server and the Hadoop data node on different machines?
When putting some data into HBase, where is this data eventually stored: on the data node or on the region server? I guess it's the data node, but then what are the StoreFile and HFile on the region server? Aren't they the physical files that store our data?
Thank you!
RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance.
Very bad; that works against the data locality principle. (If you want to know a little more about data locality, check this: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html)
Actual data will be stored in HDFS (on the DataNodes); RegionServers are responsible for serving and managing regions.
For more information about HBase architecture please check this excellent post from Lars' blog: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
BTW, as long as you have a PC with decent RAM you can set up a demo cluster with virtual machines. Do not ever try to set up a production environment without properly testing the platform first in a development environment.
To go into more detail about this answer:
"RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance."
I'm not sure how anyone would interpret the term "alongside", so let's try to be even more precise:
What makes any physical server an "XYZ" server is that it's running a program called a daemon (think "eternally-running background event-handling" program);
What makes a "file" server is that it's running a file-serving daemon;
What makes a "web" server is that it's running a web-serving daemon;
AND
What makes a "data node" server is that it's running the HDFS data-serving daemon;
What makes a "region" server then is that it's running the HBase region-serving daemon (program);
So, in all Hadoop distributions (e.g. Cloudera, MapR, Hortonworks, and others), the general best practice is that for HBase, the "RegionServers" are "co-located" with the "DataNodeServers".
This means that the actual slave (datanode) servers which form the HDFS cluster are each running the HDFS data-serving daemon (program)
and they're also running the HBase region-serving daemon (program)!
This way we ensure locality: the concurrent processing and storing of data on all the individual nodes in an HDFS cluster, with no "movement" of gigantic loads of big data from "storage" locations to "processing" locations. Locality is vital to the success of a Hadoop cluster, such that HBase region servers (data nodes running the HBase daemon as well) must also do all their processing (putting/getting/scanning) on each data node containing the HFiles, which make up HRegions, which make up HTables, which make up HBases (Hadoop-dataBases).
So servers (VMs or physical, on Windows, Linux, etc.) can run multiple daemons concurrently; often they regularly run dozens of them.
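As a hedged illustration of this co-location (the process IDs and the exact daemon list below are illustrative and will differ per node and distribution), a quick check on any slave node shows which daemons it is running:

    # List the Java daemons running on this node. On a co-located slave you would
    # expect to see both the HDFS DataNode and the HBase HRegionServer, plus
    # whatever else the node hosts (e.g. a YARN NodeManager).
    jps
    # Illustrative output:
    #   2101 DataNode
    #   2345 NodeManager
    #   2678 HRegionServer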

Difference between PIG local and mapreduce mode

What is the actual difference between running PIG scripts locally and on mapreduce?
I understand MapReduce mode is when you run it on a cluster that has HDFS installed. Does this mean local mode does not need HDFS, and so no MapReduce jobs get triggered at all? What is the difference, and when do you use one or the other?
Local mode builds a simulated MapReduce job running off of a local file on disk. It is in theory equivalent to MapReduce, but it's not a "real" MR job. You shouldn't be able to tell the difference from a user perspective.
Local mode is great for development.
Local mode: All scripts are run on a single machine without requiring Hadoop MapReduce and HDFS. This can be useful for developing and testing Pig logic. If you're using a small set of data to develop or test your code, then local mode could be faster than going through the MapReduce infrastructure.
Local mode doesn’t require Hadoop. When you run in Local mode, the Pig program runs in the context of a local Java Virtual Machine, and data access is via the local file system of a single machine. Local mode is actually a local simulation of MapReduce in Hadoop’s LocalJobRunner class.
MapReduce mode (also known as Hadoop mode): Pig is executed on the Hadoop cluster. In this case, the Pig Script gets converted into a series of MapReduce jobs that are then run on the Hadoop cluster.
If you have a terabyte of data that you want to perform operations on and you want to interactively develop a program, you may soon find things slowing down considerably, and you may start growing your storage. Local mode allows you to work with a subset of your data in a more interactive manner so that you can figure out the logic (and work out the bugs) of your Pig program.
After you have things set up as you want them and your operations are running smoothly, you can then run the script against the full data set using MapReduce mode.
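As a quick sketch, the execution mode is selected with Pig's -x flag; the script name here is a placeholder:

    # Local mode: single JVM, local filesystem, LocalJobRunner under the hood.
    pig -x local wordcount.pig

    # MapReduce (Hadoop) mode: the script is compiled into MapReduce jobs that run
    # on the cluster and read/write HDFS.
    pig -x mapreduce wordcount.pig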

Resources