Repartitioning in Hadoop Distributed File System (HDFS)

Is there a way to repartition data directly in HDFS? If you notice that your partitions are unbalanced (one or more is much bigger than the others), how can you deal with it?
I know that it could be done, e.g., in Apache Spark, but running a job just to repartition seems like overhead. Or maybe it is a good idea?

Run hdfs balancer. This tool distributes HDFS blocks evenly across datanodes.
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#balancer
If you are running a Cloudera Manager- or Ambari-managed distribution, you can run the HDFS balancer from their web UIs.
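As a minimal sketch, this is how the balancer can be invoked from the command line on any node with the HDFS client configuration; the threshold value here is an assumption and should be tuned for your cluster:

```
# Run the HDFS balancer; -threshold is the allowed deviation (in percent)
# of each datanode's disk usage from the cluster average (10 is an assumed value)
hdfs balancer -threshold 10
```

Note that the balancer evens out block placement across datanodes; it does not change how a dataset is split into files, so if the goal is to merge many small or skewed files, a Spark repartition job (as mentioned in the question) would still be the tool for that.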

Related

Migrating Hadoop Clusters from Big Insights to Cloudera

What are the best approaches to migrate a cluster of about 1 TB from Big Insights to Cloudera?
The Cloudera cluster is Kerberized.
The current approach we are following works in batches:
a. Copy the data out of the cluster onto the Unix filesystem
b. SCP it to the Cloudera cluster's filesystem
c. Load it from the Cloudera filesystem into Cloudera HDFS
This is not an effective approach.
DistCp does work with a Kerberized cluster.
However, it's not clear whether you actually have 333 GB × 3 replicas = 1 TB, or 1 TB of raw data.
In either case, you're more than welcome to purchase an external 4TB (or more) drive and copyToLocal every file on your cluster, then upload it anywhere else.
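If DistCp is an option, a basic invocation between two clusters might look like the following sketch; the namenode hosts, ports, paths, and Kerberos principal are placeholders, and on a Kerberized cluster you would need a valid ticket first:

```
# Obtain a Kerberos ticket (principal is a placeholder)
kinit user@EXAMPLE.COM

# Copy a directory from the source cluster to the destination cluster
# (namenode addresses and paths below are assumed placeholders)
hadoop distcp hdfs://source-nn:8020/data/input hdfs://dest-nn:8020/data/input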

Do I need to use Spark with YARN to achieve NODE LOCAL data locality with HDFS?

If I use the Spark standalone cluster manager and have my data distributed in an HDFS cluster, how will Spark know that the data is located locally on the nodes?
YARN is a resource manager. It deals with memory and processes, and not with the workings of HDFS or data-locality.
Since Spark can read from HDFS sources, and the namenodes and datanodes take care of all the HDFS block data management outside of YARN, I believe the answer is no, you don't need YARN. But you already have HDFS, which means you have Hadoop, so why not take advantage of integrating Spark with YARN?
The standalone mode has its own cluster manager/resource manager, which will talk to the namenode for locality information; the client/driver will then place tasks based on the results.

How to create HDFS for a text file in Apache Spark?

This is the first time I am dealing with big data and using a cluster.
In order to distribute the bytes among the slave nodes, I read that it is easy to use HDFS with Apache Spark.
How do I create HDFS?
You can use Apache Spark to process files even without HDFS if you just want to experiment and learn. Take a look at the Spark Quick Start.
If you do need an HDFS cluster, your best bet is installing one of the big Hadoop distributions. Cloudera, Hortonworks, and MapR all provide easy, free installers and paid support services.
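As a minimal sketch of the no-HDFS route, the following word count reads a plain file from the local filesystem; the application name and the "data.txt" path are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object LocalWordCount {
  def main(args: Array[String]): Unit = {
    // Run Spark locally; no HDFS or external cluster manager required
    val spark = SparkSession.builder()
      .appName("LocalWordCount") // placeholder name
      .master("local[*]")
      .getOrCreate()

    // "data.txt" is a placeholder path on the local filesystem
    val counts = spark.sparkContext
      .textFile("data.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

The same code works unchanged against an HDFS path (e.g. an hdfs:// URI) once a cluster is available, which is why starting locally is a reasonable way to learn.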

How many HBase servers should I have per Hadoop server?

I have a system that will feed smaller image files into an HBase table, which uses Hadoop for its file system.
I currently have 2 instances of Hadoop and 1 instance of HBase, but my question is: what should the ratio be? Should I have 1 Hadoop node per HBase server, or does it really matter?
The answer is: it depends.
It depends on how much data you have, the CPU utilization of the region servers, and various other factors. You need to do some proof-of-concept work to determine the sizing of your Hadoop and HBase clusters. How best to combine Hadoop and HBase depends on your use cases.
As a matter of fact, I have recently seen a setup where the Hadoop and HBase clusters were totally decoupled: the HBase cluster remotely used the Hadoop cluster to read and write on HDFS.

HBase and Hadoop

HBase requires a Hadoop installation, based on what I have read so far. And it looks like HBase can be set up to use an existing Hadoop cluster (which is shared with some other users), or it can be set up to use a dedicated Hadoop cluster? I guess the latter would be the safer configuration, but I am wondering if anybody has any experience with the former (though I am not very sure whether my understanding of the HBase setup is correct).
I know that Facebook and other large organizations separate their HBase cluster (real time access) from their Hadoop cluster (batch analytics) for performance reasons. Large MapReduce jobs on the cluster have the ability to impact performance of the real-time interface, which can be problematic.
In a smaller organization or in a situation in which your HBase response time doesn't necessarily need to be consistent, you can just use the same cluster.
There aren't many (or any) concerns with coexistence other than performance.
We've set it up with an existing Hadoop cluster that's 1,000 cores strong. Short answer: it works just fine, at least with Cloudera CH2 +149.88. But your mileage may vary with other Hadoop versions.
In distributed mode, Hadoop is used for its HDFS storage. HBase stores HFiles on HDFS, and thus benefits from the replication strategies and data-locality principles provided by the datanodes.
RegionServers mostly handle local data, but might still have to fetch data from other datanodes.
Hope that helps you understand why and how Hadoop is used with HBase.
