Migrating Hadoop Clusters from IBM BigInsights to Cloudera

What are the best approaches to migrate clusters of size 1 TB from BigInsights to Cloudera?
The Cloudera cluster is kerberized.
The current approach we are following works in batches:
a. Copy the data out of the cluster onto the local Unix filesystem
b. SCP the files to the local filesystem of a Cloudera node
c. Load the files from the Cloudera node's local filesystem into Cloudera HDFS
This is not an effective approach.

DistCp does work with a kerberized cluster.
However, it's not clear whether you actually have 333 GB × 3 replicas = 1 TB, or 1 TB of raw data.
In either case, you're more than welcome to purchase an external 4 TB (or larger) drive, copyToLocal every file on your cluster, and then upload it anywhere else.
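For the DistCp route, a minimal sketch is shown below. It assumes you run the copy from the kerberized Cloudera side after obtaining a ticket with kinit, and that the BigInsights NameNode is reachable over WebHDFS; the hostnames, ports, paths, and principal are placeholders, and the fallback property is only needed because the source cluster is not kerberized.

    # Obtain a Kerberos ticket on the Cloudera cluster first (principal is a placeholder)
    kinit etl_user@EXAMPLE.COM

    # Pull data from the non-kerberized BigInsights cluster over WebHDFS
    # into the kerberized Cloudera cluster's HDFS
    hadoop distcp \
        -D ipc.client.fallback-to-simple-auth-allowed=true \
        -update \
        webhdfs://biginsights-nn.example.com:50070/user/data \
        hdfs://cloudera-nn.example.com:8020/user/data

This avoids the intermediate copy to the local filesystem and lets the copy run as a distributed MapReduce job instead of a single-threaded SCP.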

Related

Repartitioning in Hadoop Distributed File System (HDFS)

Is there a way to repartition data directly in HDFS? If you notice that your partitions are unbalanced (one or more is much bigger than the others), how can you deal with it?
I know it could be done in, for example, Apache Spark, but running a job just to repartition seems like overhead - or maybe it is a good idea?
Run hdfs balancer. This tool distributes HDFS blocks evenly across datanodes.
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#balancer
In case you are running a Cloudera Manager or Ambari managed distribution, you can run HDFS balancer from their web UI.
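For reference, a minimal invocation looks like the following; the threshold value is illustrative and means each DataNode's utilization should end up within 10 percentage points of the cluster average.

    # Rebalance HDFS blocks across DataNodes (run as the HDFS superuser)
    hdfs balancer -threshold 10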

How to allocate memory to datanode in hadoop configuration

We have the below requirement.
We have a total of 5 servers which will be utilized for building a Big Data Hadoop data warehouse system (we are not going to use any distribution like Cloudera, Hortonworks, etc.).
All servers have the same configuration: 512 GB RAM, 30 TB storage, 16 cores, and Ubuntu Linux 14.04 LTS Server.
We would install Hadoop on all the servers. Servers 3, 4 and 5 will be used entirely as datanodes (slave machines), whereas server 1 would have the active Namenode and a Datanode, and server 2 would have the standby Namenode and a Datanode.
We want to allocate 300 GB RAM to the Namenode and 212 GB RAM to the Datanode while configuring Hadoop.
Could anyone help me with how to do that? Which configuration files in Hadoop need to be changed, and what are the parameters we need to configure?
You can set these properties from Cloudera Manager (in case you are using CDH) or from Ambari (if you use Hortonworks).
Also, you do not need 300 GB for the Namenode, as the Namenode only stores metadata. Roughly speaking, 1 GB of Namenode heap can store the metadata of about 1 million blocks (block size = 128 MB).
More details here : https://issues.apache.org/jira/browse/HADOOP-1687
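As a rough sanity check, assuming the default 128 MB block size, replication factor 3, and files that mostly fill their blocks: 5 servers × 30 TB = 150 TB of raw capacity, which is about 50 TB of logical data, or roughly 400,000 blocks. By the rule of thumb above, that is well under 1 GB of Namenode heap for block metadata, so a Namenode heap of a few GB (plus headroom for files, directories, and RPC load) is far more realistic than 300 GB.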
Assuming that you are going to use the latest Hadoop distribution with YARN:
Read this article - Reference. It explains every parameter in detail, and the explanations are excellent.
There is one more article from Hortonworks; though written for their distribution, it is applicable to all Apache-based Hadoop distributions.
Lastly, keep this handy - Yarn-configuration. It is self-explanatory.
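If you are configuring a plain Apache Hadoop install by hand (as the question states), the daemon heaps are set in hadoop-env.sh, and the memory YARN may hand out to containers on each node is set via yarn.nodemanager.resource.memory-mb in yarn-site.xml. A minimal sketch of the hadoop-env.sh part follows; the heap sizes are illustrative placeholders, not recommendations.

    # hadoop-env.sh - JVM heap for the HDFS daemons (values are illustrative)
    export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g ${HADOOP_NAMENODE_OPTS}"
    export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g ${HADOOP_DATANODE_OPTS}"

Note that a DataNode itself needs only a few GB of heap; the large remainder of each server's RAM is better given to YARN containers (yarn.nodemanager.resource.memory-mb) and the OS page cache than to the DataNode JVM.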

Hadoop backup and recovery tool and guidance

I am new to Hadoop and need to learn the details of backup and recovery. I have revised Oracle backup and recovery; will it help in Hadoop? Where should I start?
There are a few options for backup and recovery. As s.singh points out, data replication is not DR.
HDFS supports snapshotting. This can be used to prevent user errors, recover files, etc. That being said, this isn't DR in the event of a total failure of the Hadoop cluster. (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html)
Your best bet is keeping off-site backups. This can be to another Hadoop cluster, S3, etc., and can be performed using distcp. (http://hadoop.apache.org/docs/stable1/distcp2.html), (https://wiki.apache.org/hadoop/AmazonS3)
Here is a Slideshare by Cloudera discussing DR (http://www.slideshare.net/cloudera/hadoop-backup-and-disaster-recovery)
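As a concrete illustration of combining the two, the sketch below takes a snapshot of a directory and ships it off-site with distcp; the paths, snapshot name, and destination hosts/buckets are placeholders.

    # Allow and create a snapshot of the directory to protect (paths are placeholders)
    hdfs dfsadmin -allowSnapshot /data/warehouse
    hdfs dfs -createSnapshot /data/warehouse backup-20160101

    # Copy the read-only snapshot to a second cluster ...
    hadoop distcp /data/warehouse/.snapshot/backup-20160101 \
        hdfs://dr-cluster-nn.example.com:8020/backups/warehouse/backup-20160101

    # ... or to S3 (assumes credentials are configured for the s3a connector)
    hadoop distcp /data/warehouse/.snapshot/backup-20160101 \
        s3a://my-backup-bucket/warehouse/backup-20160101

Copying from the snapshot rather than the live directory gives the backup a consistent point-in-time view.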
Hadoop is designed to work on big clusters with thousands of nodes, so data loss is less likely. You can increase the replication factor to replicate the data onto many nodes across the cluster.
Refer Data Replication
For Namenode log backup, you can either use the Secondary Namenode or Hadoop High Availability.
Secondary Namenode
The Secondary Namenode takes a backup of the Namenode logs. If the Namenode fails, you can recover the Namenode logs (which hold the data block information) from the Secondary Namenode.
High Availability
High Availability is a newer feature that lets you run more than one Namenode in the cluster. One Namenode will be active and the other will be in standby; the logs are saved on both Namenodes. If one Namenode fails, the other one becomes active and handles operations.
But we also need to consider backup and disaster recovery in most cases. Refer to brandon.bell's answer.
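For reference, a minimal sketch of checking and exercising HA from the command line is shown below; nn1 and nn2 are assumed to be the NameNode IDs configured under dfs.ha.namenodes in hdfs-site.xml.

    # Which NameNode is currently active? (nn1/nn2 are the configured NameNode IDs)
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Manually fail over from nn1 to nn2 (with automatic failover, ZKFC handles this)
    hdfs haadmin -failover nn1 nn2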
You can use the HDFS sync application on DataTorrent for DR use cases to backup high volumes of data from one HDFS cluster to another.
https://www.datatorrent.com/apphub/hdfs-sync/
It uses Apache Apex as a processing engine.
Start with the official documentation website: HdfsUserGuide
Have a look at the below SE posts:
Hadoop 2.0 data write operation acknowledgement
Hadoop: HDFS File Writes & Reads
Hadoop 2.0 Name Node, Secondary Node and Checkpoint node for High Availability
How does Hadoop Namenode failover process works?
Documentation page regarding Recovery_Mode:
Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can read the metadata from one of the other storage locations.
However, what can you do if the only storage locations available are corrupt? In this case, there is a special NameNode startup mode called Recovery mode that may allow you to recover most of your data.
You can start the NameNode in recovery mode like so: namenode -recover
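A minimal recovery-mode sketch, assuming a plain Apache Hadoop install where you stop and start the daemon yourself (use your cluster manager's stop/start actions instead if you have one):

    # Stop the NameNode, then start it in recovery mode and follow the prompts.
    # Recovery mode may discard corrupt or missing metadata, so back up the
    # dfs.namenode.name.dir directories before running it.
    hadoop-daemon.sh stop namenode
    hdfs namenode -recover
    hadoop-daemon.sh start namenode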

How many HBase servers should I have per Hadoop server?

I have a system which will feed smaller image files that are stored in an HBase table, which uses Hadoop for the file system.
I currently have 2 instances of Hadoop and 1 instance of HBase, but my question is: what should the ratio here be? Should I have 1 Hadoop node per HBase server, or does it really matter?
The answer is: it depends.
It depends on how much data you have, the CPU utilization of the regionservers, and various other factors. You need to do some proofs of concept to work out the sizing of your Hadoop and HBase clusters, and how Hadoop and HBase are used varies by use case.
As a matter of fact, I have recently seen a setup where the Hadoop and HBase clusters were totally decoupled: the HBase cluster remotely used Hadoop to read and write on HDFS.
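For context, such a decoupled setup is mostly a matter of where hbase.rootdir points. A minimal sketch, with the remote NameNode hostname as a placeholder:

    # In hbase-site.xml on every HBase node, hbase.rootdir would point at the
    # remote HDFS instead of a local one, e.g.:
    #   hbase.rootdir = hdfs://remote-nn.example.com:8020/hbase

    # Sanity-check that the HBase hosts can reach that HDFS before starting HBase
    hdfs dfs -ls hdfs://remote-nn.example.com:8020/

That said, the usual deployment co-locates regionservers with DataNodes so HBase can benefit from local and short-circuit reads.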

Writing to local file during map phase in hadoop

Hadoop writes the intermediate results to the local disk and the results of the reducer to HDFS. What does HDFS mean? What does it physically translate to?
HDFS is the Hadoop Distributed File System. Physically, it is a program running on each node of the cluster that provides a file system interface very similar to that of a local file system. However, data written to HDFS is not just stored on the local disk but rather is distributed on disks across the cluster. Data stored in HDFS is typically also replicated, so the same block of data may appear on multiple nodes in the cluster. This provides reliable access so that one node's crashing or being busy will not prevent someone from being able to read any particular block of data from HDFS.
Check out http://en.wikipedia.org/wiki/Hadoop_Distributed_File_System#Hadoop_Distributed_File_System for more information.
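To make the replication and distribution concrete, here is a small sketch (the paths and filename are placeholders) that uploads a file and then asks HDFS where its blocks actually live:

    # Copy a local file into HDFS
    hdfs dfs -put sample.txt /user/demo/sample.txt

    # Show the file's blocks, their replicas, and which DataNodes hold them
    hdfs fsck /user/demo/sample.txt -files -blocks -locations

The fsck output lists each block ID along with the set of DataNode addresses holding a replica, which is exactly the "same block on multiple nodes" behaviour described above.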
As Chase indicated, HDFS is Hadoop Distributed File System.
If I may, I recommend this tutorial and video of how HDFS and the Map/Reduce framework works and will serve you as a guide into the world of Hadoop: http://www.cloudera.com/resource/introduction-to-apache-mapreduce-and-hdfs/
