Pivotal: HDFS-HAWQ - Migration to New Hardware

We have a 5-node PHD 3.0 Hadoop cluster installed with Ambari on Rackspace. We have to migrate it to Google Cloud Platform (GCP),
but I can't find steps for how to do the following:
Q1: How do I migrate the data, metadata, and configuration from the old machines to the new machines? (The old machines run CentOS 6.5, and the target machines also run CentOS 6.5.)
Q2: Which components and folders should be backed up? What would be the commands?
Q3: How do I back up the namenode and datanodes?
Q4: Do we need to take a backup of the Ambari database as well?
Any help on this would be much appreciated.

I would personally prefer to provision a Hadoop cluster in GCP and move the data to the new cluster using DistCp.
For HAWQ-managed tables, move the data into HDFS first and then run DistCp.
On some occasions on AWS, I have moved data to S3 and imported it back into Hadoop.
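A minimal sketch of the DistCp step; the host names, paths, and bucket are placeholders, and copying to gs:// assumes the GCS connector is installed on the source cluster:

    # Copy HDFS data directly between the two clusters (both namenodes must be reachable):
    hadoop distcp hdfs://old-namenode:8020/user/data hdfs://new-namenode:8020/user/data

    # Or stage the data through a Google Cloud Storage bucket:
    hadoop distcp hdfs://old-namenode:8020/user/data gs://my-migration-bucket/user/data

    # For HAWQ-managed tables, unload them into HDFS first (e.g. via writable
    # external tables), then DistCp the resulting HDFS directories as above.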

Related

Migrating from one cluster to another

I have a source EMR cluster (EMR version 3.1) with the Hive metastore in an external MySQL database.
We are planning to upgrade the cluster to 5.11.1.
Does anyone know how to migrate Hive and HDFS from one cluster to another with a remote MySQL metastore?
None of your data should live on HDFS persistently. Copy any important files to S3.
Hive provides metastore upgrade scripts for all versions.
Use the schematool command, available under /usr/lib/hive/bin, to perform the upgrade migration.
https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool
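A hedged sketch of the upgrade; the database name hive_metastore and the MySQL user are placeholders, and hive-site.xml is assumed to already point at the remote MySQL instance:

    # Back up the metastore database first
    mysqldump -u hive -p hive_metastore > metastore_backup.sql

    # Check the current schema version, then run the upgrade scripts
    /usr/lib/hive/bin/schematool -dbType mysql -info
    /usr/lib/hive/bin/schematool -dbType mysql -upgradeSchema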

Apache Hive installation in a pseudo-distributed or multi-node cluster environment

I have installed Hadoop in a multi-node environment on my PC, as below:
1: 4 VirtualBox instances loaded with Ubuntu 14.04
2: 1 master node, 2 slave nodes, and the remaining VM instance acts as a client
Note: all 4 VMs are running on my PC itself.
I was able to complete the Apache Hadoop 2.6 setup successfully on the setup mentioned above. Now I want to install Hive in order to do some data summarization, querying, and analysis.
But I am not sure how to proceed further. I have a few queries, mentioned below:
Q1: Do I need to install/set up Apache Hive (0.14) on all nodes (master/namenode and slave/datanodes), or only on the master node?
Q2: Which mode should be used for the metastore: local mode or remote mode?
Q3: If I want to use MySQL for the Hive metastore, should I install it on the master/namenode itself, or do I need a separate client machine for it?
Could someone also share the steps to be followed to configure the metastore in a multi-node/pseudo-distributed environment?
BR,
San
You need to install the required Hive services (HiveServer2, Metastore, WebHCat) only once. In your lab scenario, you would probably put them on the master node. The client can then run Beeline (the HiveServer2 client).
If you configure the Metastore as local, Hive will use a local Derby database. Again, for your lab setup, this is probably just what you need/want.
In a production scenario, you would
set up a dedicated server for the supporting services, so that they do not compete for resources with the namenode process(es),
and use a dedicated database server for your Metastore database, which will therefore be remote.
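For the lab setup described above, a minimal sketch of what "install once on the master" looks like in practice; $HIVE_HOME, the host name master, and the user name are assumptions:

    # On the master node: start the metastore and HiveServer2 once
    $HIVE_HOME/bin/hive --service metastore &
    $HIVE_HOME/bin/hiveserver2 &

    # On the client VM: connect through Beeline over JDBC
    $HIVE_HOME/bin/beeline -u jdbc:hive2://master:10000 -n hadoop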

AWS EMR Hadoop Administration

We are currently using Apache Hadoop (vanilla version) in our org. We are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it); I'm mainly interested in the Hadoop administration steps, how the master and slaves communicate, and the various configurations. I have already checked the AWS EMR documentation, but I don't see a detailed comparison.
Can someone recommend a link/tutorial for migrating to AWS EMR from Apache Hadoop?
During EMR cluster creation, it will ask you to specify the master and the nodes; the default settings will provision 1 master and 2 nodes for you. You can also specify which applications you want in the cluster (e.g. Hadoop, Hive, Spark, Zeppelin, Hue, etc.).
Once the cluster is created, it will provision all the services. You can click on these services and access them via the web, or by SSHing into the master. For example, to access the Ambari interface, go to the service within EMR and click it; a new window will be launched with the Ambari monitoring interface.
Installing these applications is very easy: all you have to do is specify the services during cluster creation.
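For reference, the same provisioning can be scripted with the AWS CLI; this is only a sketch, with placeholder name, key pair, and instance sizes:

    aws emr create-cluster \
      --name "migration-test" \
      --release-label emr-5.11.1 \
      --applications Name=Hadoop Name=Hive Name=Spark Name=Hue \
      --instance-type m4.large \
      --instance-count 3 \
      --use-default-roles \
      --ec2-attributes KeyName=my-key-pair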
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and its associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR lie in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that the software is already loaded on each instance and HDFS is configured across the core nodes.
The Master and Slave (Core/Task) nodes communicate in exactly the same way as in any Hadoop cluster. However, only one Master is supported (with no backup Master).
When migrating to EMR, check that you are using compatible versions of the software (e.g. Hadoop, Hive, Pig, Impala, etc.). Also consider using Amazon S3 for data storage instead of HDFS, especially for source data, since data on S3 persists even after the EMR cluster is terminated.
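A sketch of moving existing HDFS data into S3 so that it outlives the cluster; the bucket name and paths are placeholders:

    # Plain DistCp works with the s3a:// (or, on EMR, s3://) filesystem:
    hadoop distcp hdfs:///user/data s3a://my-durable-bucket/user/data

    # On EMR, s3-dist-cp is also available for the same purpose:
    s3-dist-cp --src hdfs:///user/data --dest s3://my-durable-bucket/user/data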
Technically, the Hadoop provided with EMR can be a few releases behind; you should check the EMR release notes for the detailed list of applications provided with each version. EMR takes care of application provisioning, setup, and configuration. Depending on the EC2 instance type, the Hadoop (and other application) configuration will change. You can override the default settings using application configurations.
Other than that, the Hadoop you have on premises and the one on EMR should be the same.
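A sketch of what such an override looks like at cluster creation, using a configuration classification; the property and value are only illustrative:

    aws emr create-cluster \
      --release-label emr-5.11.1 \
      --applications Name=Hadoop \
      --instance-type m4.large --instance-count 3 --use-default-roles \
      --configurations '[{"Classification":"yarn-site","Properties":{"yarn.nodemanager.resource.memory-mb":"6144"}}]'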

Running Mahout using Hadoop on Amazon's EMR/EC2

I want to migrate my current local Hadoop cluster into Amazon. In this Hadoop cluster I am using services like Mahout, HBase, and Hive. I now have two options in Amazon: either go for pure EC2 instances or an Elastic MapReduce cluster. I would like some suggestions on which is the better option for moving a cluster with these kinds of requirements.
I always suggest people go for EMR, as it is managed. It will be a bit more costly than using pure EC2, but the headache and time you would otherwise spend configuring the clusters and then managing them are saved by running a managed service like EMR.
Mahout can easily be run as a custom JAR.
A Hive cluster can also be launched within minutes.
Similarly for HBase: Amazon has recently added the ability to create an HBase cluster on EMR.
See other views here.
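Submitting a Mahout job as a custom JAR step might look like the sketch below; the cluster id, bucket, JAR, and arguments are placeholders:

    aws emr add-steps \
      --cluster-id j-XXXXXXXXXXXXX \
      --steps Type=CUSTOM_JAR,Name=MahoutJob,ActionOnFailure=CONTINUE,Jar=s3://my-bucket/jars/mahout-job.jar,Args=["-i","s3://my-bucket/input","-o","s3://my-bucket/output"]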

Moving files to Hadoop HDFS using SFTP

I have a VPC subnet which has multiple machines inside it.
On one of the machines, I have some files stored. On another machine, I have the Hadoop HDFS service installed and running.
I need to move those files from the first machine into the HDFS file system using SFTP.
Does Hadoop have any APIs that can achieve this goal?
PS: I've installed Hadoop using the Cloudera CDH4 distribution.
This is a requirement which is much easier to implement on the FTP/SFTP server side than on the HDFS side.
Check out hdfs-over-ftp, an FTP server that works on top of HDFS.
A workflow written in Apache Oozie would do it; Oozie comes with the Cloudera distribution. Other tools for orchestration could be Talend or PDI (Kettle).
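If an extra service is not wanted, a simpler hedged sketch is to pull the files over SFTP onto a machine that has an HDFS client and then push them into HDFS; the host name, user, and paths below are placeholders:

    # Pull the files from the source machine over SFTP into a staging directory
    sftp user@source-machine:/data/exports/*.csv /tmp/staging/

    # Push them into HDFS from the edge/client node
    hdfs dfs -mkdir -p /user/etl/incoming
    hdfs dfs -put /tmp/staging/*.csv /user/etl/incoming/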
