Transferring a whole HDFS from one cluster to another

I have lots of Hive tables stored in HDFS on a test cluster with 5 nodes. The data should be around 70 GB * 3 (replication). Now I want to transfer the whole setup to a different environment with many more nodes. A network connection between the two clusters is not possible.
The thing is that I don't have much time with the new cluster, and I have no way to test the transfer in another test environment. Therefore I need a solid plan. :)
What options do I have?
How can I transfer the Hive setup with a minimum of configuration effort on the new cluster?
Is it possible to just copy the HDFS directories of the 5 nodes to 5 nodes of the new cluster, then add the rest of the nodes to the new cluster and start the balancer?

Without a network connection, it will be tricky!
I would:
1) Copy the files out of HDFS onto some kind of removable storage (USB stick, external HDD, etc.)
2) Move the storage to the new cluster
3) Copy the files back into HDFS
Note that this won't preserve metadata like file creation/last access time, and, more importantly, ownership and permissions.
Small-scale testing of this process should be pretty simple.
If you can get (even temporarily) network connectivity between the two clusters, then distcp would be the way to go. It uses MapReduce to parallelise the transfers, potentially resulting in massive time savings.
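The offline copy steps in this answer can be sketched as a small helper that only builds the two command lines, so they can be reviewed before anything is run. This is a minimal sketch, not a tested migration script: the paths `/user/hive/warehouse` and `/mnt/usb` are hypothetical, and it assumes the parent directory already exists on the new cluster while the target directory itself does not.

```python
import posixpath
from typing import List


def offline_transfer_commands(hdfs_dir: str, mount_point: str) -> List[List[str]]:
    """Build (but do not run) the export and import commands for one HDFS directory."""
    hdfs_dir = hdfs_dir.rstrip("/")
    mount_point = mount_point.rstrip("/")
    name = posixpath.basename(hdfs_dir)      # e.g. "warehouse"
    parent = posixpath.dirname(hdfs_dir)     # e.g. "/user/hive"
    # On the old cluster: copy the directory out of HDFS onto the removable storage.
    export_cmd = ["hdfs", "dfs", "-get", hdfs_dir, mount_point]
    # On the new cluster: copy the exported directory back under the same parent.
    import_cmd = ["hdfs", "dfs", "-put", posixpath.join(mount_point, name), parent]
    return [export_cmd, import_cmd]


# Hypothetical paths, for illustration only.
commands = offline_transfer_commands("/user/hive/warehouse", "/mnt/usb")
for command in commands:
    print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` on the respective cluster once the paths have been checked.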

You can copy directories and files from one cluster to another using the hadoop distcp command.
Here is a small example that describes its usage:
http://souravgulati.webs.com/apps/forums/topics/show/8534378-hadoop-copy-files-from-one-hadoop-cluster-to-other-hadoop-cluster

You can copy the data with this command:
sudo -u hdfs hadoop --config {PathtotheVpcCluster}/vpcCluster distcp hdfs://SourceIP:8020/user/hdfs/WholeData hdfs://DestinationIP:8020/user/hdfs/WholeData

Related

Alluxio with/without HDFS

I have a cluster with HDFS as the under storage distributed file system, but I've just read about Alluxio, which is described as fast and flexible. So my question is: should I use Alluxio with HDFS, or is Alluxio an alternative to HDFS? (I see on their site that the shared under storage file system can be a network file system (NFS), so I think HDFS is not required. Correct me if I am mistaken.)
In which mode is performance better: HDFS with Alluxio, or Alluxio standalone? (By standalone I mean used alone in the cluster, not locally.)
Reply from Alluxio maintainer.
First of all, Alluxio is not a replacement for HDFS. Instead, it is a new abstraction layer on top of other distributed/cloud storage systems, including HDFS, S3, Azure Object Store and other possible choices. In your case, if your data is already in HDFS, you will probably still keep HDFS as the persistent data layer for Alluxio.
The typical scenarios in which users add Alluxio and see significant benefits include:
Your physical data is not co-located with your compute, e.g. your big data engine is reading data from S3 or other object storage. In this case, by deploying Alluxio on the compute nodes, one can make Alluxio work as a filesystem-level cache to avoid fetching data across the network repeatedly. See http://www.alluxio.org/overview/remote-data-acceleration
You are managing multiple storage systems and want to expose a single data access layer to simplify the management, e.g. one can "mount" multiple S3/ buckets into one Alluxio deployment so they appear as different directories under the same namespace. See http://www.alluxio.org/overview/storage-unification
Regarding your original performance question, the answer is: it depends. If your HDFS is remote from compute, you would expect a good performance gain. I have also seen cases where HDFS is the bottleneck; there Alluxio may help reduce the load and provide a good SLA for certain mission-critical jobs.
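To make the "filesystem-level cache" idea concrete, here is a toy read-through cache in Python. It is only a model of the concept and does not use or imitate the Alluxio API; the class name `ReadThroughCache`, the path, and the in-memory `remote_store` are all hypothetical stand-ins for a remote store such as S3 or a distant HDFS.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Keep a local copy of every file after its first read from the remote store."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote
        self._local: Dict[str, bytes] = {}
        self.remote_reads = 0  # counts how often the network would be crossed

    def read(self, path: str) -> bytes:
        if path not in self._local:
            # Cache miss: fetch once from the remote store and remember the bytes.
            self.remote_reads += 1
            self._local[path] = self._fetch_remote(path)
        return self._local[path]


# Hypothetical remote store, standing in for S3 or a remote HDFS.
remote_store = {"/data/events.csv": b"a,b\n1,2\n"}
cache = ReadThroughCache(lambda path: remote_store[path])

first = cache.read("/data/events.csv")   # goes to the remote store
second = cache.read("/data/events.csv")  # served from the local copy
print(cache.remote_reads)
```

The gain depends on how expensive the remote read is, which matches the "it depends" in the reply: when storage is already local to compute, there is little for such a cache to save.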

Replicating data between multiple Hadoop clusters residing in different data centers

I was wondering what would be the best way to replicate the data present in a Hadoop cluster H1 in data center DC1 to another Hadoop cluster H2 in data center DC2 (warm backup preferably). I know that Hadoop does data replication and that the number of copies of the data is decided by the replication factor set in hdfs-site.xml. I have a few questions related to this:
Would it make sense to have the data nodes of one cluster spread across both data centers, so that the data nodes for H1 would be present in both DC1 and DC2? If this makes sense and is viable, does it mean we do not need H2?
Would it make sense to have the namenodes and datanodes distributed across both data centers, rather than having only the datanodes distributed across both data centers?
I have also heard that people use distcp and many tools built on top of distcp. But distcp does lazy backups, and I would prefer warm backups over cold ones.
Some people suggest using Kafka for this but I am not sure how to go about using it.
Any help would be appreciated. Thanks.
It depends on what you are trying to protect against. If you want to protect against site failure, distcp seems to be the only option for cross-datacenter replication. However, as you pointed out, distcp has limitations. You can use snapshots to protect against user mistakes or application corruption, because replication (keeping multiple replicas) will not protect against those. Commercial tools are also available for automating the backup process if you don't want to write and maintain code.
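One way to combine the two mechanisms mentioned above is to take a snapshot and then run distcp from the read-only snapshot path, so the copy sees a consistent view. The sketch below only builds the command lines. It assumes the source directory has already been made snapshottable (`hdfs dfsadmin -allowSnapshot`), and the NameNode addresses, path and snapshot name are hypothetical.

```python
from typing import List


def snapshot_backup_commands(
    src_namenode: str, dst_namenode: str, path: str, snapshot: str
) -> List[List[str]]:
    """Build (but do not run) a snapshot command followed by a distcp from that snapshot."""
    path = path.rstrip("/")
    snapshot_uri = f"{src_namenode}{path}/.snapshot/{snapshot}"
    return [
        # Freeze the current state of the directory under a named snapshot.
        ["hdfs", "dfs", "-createSnapshot", path, snapshot],
        # Copy the frozen state; -update only transfers files that differ on the target.
        ["hadoop", "distcp", "-update", snapshot_uri, f"{dst_namenode}{path}"],
    ]


# Hypothetical cluster addresses and names, for illustration only.
backup = snapshot_backup_commands(
    "hdfs://nn-dc1:8020", "hdfs://nn-dc2:8020", "/data/warehouse", "backup-001"
)
for command in backup:
    print(" ".join(command))
```

Run on a schedule, this still gives a periodic rather than continuous copy, so it does not remove the "lazy backup" limitation raised in the question.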

Need of maintaining replication factor on datanodes

Please pardon me if this question has come up earlier; I was not able to find any related question.
1) I want to know why it is important to maintain the same replication factor (or, for that matter, any configuration) across the datanodes and namenodes in the cluster.
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
3) Wouldn't maintaining the configuration only on the namenodes suffice?
4) What are the implications of having the configuration different across namenode and datanodes?
Any Help is much appreciated. Thank you! :)
I will try to answer your question taking replication as an example.
A few things to keep in mind:
Data always resides on the datanodes. The Namenode never handles or stores file data; it only keeps metadata about the data.
The replication factor is configurable per file: for example, file1 may have a replication factor of 2 while file2 has a replication factor of 3. In a similar way, some other properties can also be set at execution time.
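As an illustration of the per-file replication factor, the helpers below build the two shell commands usually used for it: passing `dfs.replication` as a generic option when uploading, and `-setrep` for a file that already exists. This is a sketch that only constructs the argument lists; the file names are hypothetical.

```python
from typing import List


def put_with_replication(local_path: str, hdfs_path: str, replication: int) -> List[str]:
    """Upload command that overrides dfs.replication for this one upload."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return ["hdfs", "dfs", "-D", f"dfs.replication={replication}", "-put", local_path, hdfs_path]


def set_replication(hdfs_path: str, replication: int) -> List[str]:
    """Command that changes the replication factor of an existing file and waits (-w)."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return ["hdfs", "dfs", "-setrep", "-w", str(replication), hdfs_path]


# Hypothetical files: file1 is uploaded with 2 replicas, file2 is changed to 3.
upload_file1 = put_with_replication("file1.csv", "/data/file1.csv", 2)
change_file2 = set_replication("/data/file2.csv", 3)
print(" ".join(upload_file1))
print(" ".join(change_file2))
```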
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
I am not sure exactly what you mean by the namenode managing the storage. Here is how a file upload to HDFS is executed:
1) The client sends a request to the Namenode to create the file in HDFS.
2) The client splits the data into blocks using the block size from its configuration (unless the client application specifies one explicitly); the number of blocks follows from the file size.
3) For each block, the Namenode decides which Datanodes will store it, based on the replication factor in the configuration (unless the client application specifies one explicitly).
4) The Namenode sends the list of Datanodes chosen in step 3 to the client.
5) The client application writes each block to the first Datanode in that list, say DN1.
6) DN1 forwards the data it receives to the next Datanode chosen by the Namenode in step 3, which forwards it to the next, until every replica is written. If a replica is lost later, the Namenode instructs Datanodes to re-replicate the block.
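The upload steps above can be modelled in a few lines of Python. This is a toy simulation, not HDFS code: the placement rule simply rotates through the Datanodes, whereas real HDFS placement is rack-aware, and the file size, block size and node names are assumptions chosen for the example.

```python
from typing import Dict, List, Union

MB = 1024 * 1024


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the size in bytes of each block the file is cut into."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def choose_pipeline(datanodes: List[str], replication: int, block_index: int) -> List[str]:
    """Toy stand-in for the Namenode's choice: rotate through the Datanodes."""
    count = min(replication, len(datanodes))
    return [datanodes[(block_index + i) % len(datanodes)] for i in range(count)]


def plan_write(
    file_size: int, block_size: int, datanodes: List[str], replication: int
) -> List[Dict[str, Union[int, List[str]]]]:
    """For every block, list the Datanodes it is written to, first node first."""
    return [
        {"block": index, "bytes": size, "pipeline": choose_pipeline(datanodes, replication, index)}
        for index, size in enumerate(split_into_blocks(file_size, block_size))
    ]


# Assumed example: a 300 MB file, 128 MB blocks, 4 Datanodes, replication factor 3.
plan = plan_write(300 * MB, 128 * MB, ["DN1", "DN2", "DN3", "DN4"], 3)
for entry in plan:
    print(entry)
```

In this model the client writes each block to the first node of its pipeline, and the remaining nodes in the list receive the forwarded copies.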
For your questions #3 and #4, it is important to understand that any distributed application requires a set of configurations to be available on each node, so that the nodes can interact with each other and perform their designated tasks as expected. If every node chose its own configuration, what would be the basis of coordination? If DN1 has a replication factor of 5 while DN2 has 2, how would data actually be replicated?
Update start
hdfs-site.xml contains many other configuration settings as well, for the namenode, datanode and secondary namenode, plus some client- and HDFS-specific settings, not just the replication factor.
Now imagine having a 50-node cluster: would you rather configure each node by hand or simply copy a pre-configured file?
Update end
If you keep all configuration in one location, each node has to connect to that shared resource to load the configuration every time it performs an action. This adds latency, on top of the consistency/synchronization issues that arise when a config property is changed.
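The "copy a pre-configured file" approach can be sketched by generating one hdfs-site.xml from a single dictionary and distributing the result unchanged to every node. This is a minimal sketch: the two properties and their values are only examples, and how the file is copied to the nodes is left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def render_hdfs_site(properties: Dict[str, str]) -> str:
    """Render Hadoop-style <configuration><property>... XML from a dict."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Example values only; a real cluster would carry many more properties.
xml_text = render_hdfs_site({"dfs.replication": "3", "dfs.blocksize": "134217728"})
print(xml_text)
```

Because every node receives the same generated text, a changed property is edited in one place and pushed out again, rather than being adjusted by hand on each machine.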
Hope this helps.
Hadoop is designed to deal with large datasets. It's not a good idea to store a large dataset on a single machine, because if your storage system or hard disk crashes you may lose all of your data.
Before Hadoop, people used traditional systems to store large amounts of data, but those systems were very costly. Analyzing large datasets was also a challenge, because reading the data from a traditional system was a time-consuming process. With these things in mind, the Hadoop framework was designed.
In the Hadoop framework, when you load large amounts of data, it splits the data into small chunks known as blocks. These blocks are the units in which data is placed on the datanodes of a distributed cluster, and they are also used during the analysis of the data.
The reason behind splitting the data is parallel processing and distributed storage (i.e. you can store your data on multiple machines, and when you want to analyze it you can do so in parallel).
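The idea of splitting data so that the pieces can be processed in parallel can be shown with a small local example. This is only an analogy on one machine using a thread pool, not Hadoop code; the chunk size and the sample data are arbitrary choices for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut the data into fixed-size chunks; the last one may be shorter."""
    return [data[start:start + chunk_size] for start in range(0, len(data), chunk_size)]


def count_newlines(chunk: bytes) -> int:
    """Per-chunk work: counting line endings gives the same total wherever the cuts fall."""
    return chunk.count(b"\n")


data = b"row\n" * 1000                 # 4000 bytes of sample data
chunks = split_into_chunks(data, 256)  # the "blocks"

# Each chunk is processed independently, then the partial results are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    total_lines = sum(pool.map(count_newlines, chunks))

print(len(chunks), total_lines)
```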
Now Coming to your questions:
Reason: Hadoop is a framework that allows distributed storage and computing. In other words, you can store the data on multiple machines. Its replication feature means you keep multiple copies (based on the replication factor) of the same data.
Ans1: Hadoop is designed to run on commodity hardware, where failures are common. If you store the data on a single machine and that machine crashes, you lose all of your data. In a Hadoop cluster you can recover the data from another replica (if the replication factor is greater than 1), because Hadoop does not store a replicated copy of the data on the same machine as the original replica. Hadoop handles this itself.
Ans2: When you upload a file to HDFS, the actual data goes to the datanodes and the NameNode keeps the metadata about your data. The NameNode metadata contains things like block name, block location, filename and directory location of the file.
Ans3: You need to maintain the entire configuration related to your Hadoop cluster. Maintaining a single configuration file is not sufficient, and you may run into other problems later.
Ans4: NameNode configuration properties relate to NameNode functionality, such as namespace services, metadata location, and the RPC address that handles all client requests. DataNode configuration properties relate to the services performed by the DataNode, such as storage balancing among the DataNode's volumes, available disk space, and the DataNode server address and port for data transfer.
Please check this link to understand more about the different configuration properties.
Please clarify questions 3 and 4 if there is something more you want to know.

Copying a large file (~6 GB) from S3 to every node of an Elastic MapReduce cluster

Turns out that copying a large file (~6 GB) from S3 to every node in an Elastic MapReduce cluster in a bootstrap action doesn't scale well; the pipe is only so big, and downloads to the nodes get throttled as the number of nodes gets large.
I'm running a job flow with 22 steps, and this file is needed by maybe 8 of them. Sure, I can copy from S3 to HDFS and cache the file before every step, but that's a major speed kill (and can affect scalability). Ideally, the job flow would start with the file on every node.
There are StackOverflow questions at least obliquely addressing persisting a cached file through a job flow:
Re-use files in Hadoop Distributed cache,
Life of distributed cache in Hadoop.
I don't think they help me. Anyone have some fresh ideas?
Two ideas, please consider your case specifics and disregard at will:
Share the file through NFS from a server whose instance type has good enough networking, in the same placement group or AZ.
Have EBS PIOPS volumes and EBS-Optimized instances with the file pre-loaded and just attach them to your nodes in a bootstrap action.
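The scaling problem described in the question can be expressed as simple arithmetic. The model below is an assumption, not a measurement: it supposes each node has its own download limit and that all nodes together share one aggregate limit, and every number used is made up for illustration.

```python
def download_seconds(
    file_mb: float, nodes: int, per_node_mb_s: float, shared_cap_mb_s: float
) -> float:
    """Time for every node to fetch the file when they all download at once."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    # Each node gets its own limit or an equal share of the shared pipe, whichever is lower.
    effective_mb_s = min(per_node_mb_s, shared_cap_mb_s / nodes)
    return file_mb / effective_mb_s


# Made-up figures: 6000 MB file, 100 MB/s per node, 2000 MB/s shared.
small_cluster = download_seconds(6000, 10, 100, 2000)
large_cluster = download_seconds(6000, 100, 100, 2000)
print(small_cluster, large_cluster)
```

Under these assumptions the per-node limit dominates for a small cluster, while the shared limit dominates once the node count grows, which is the pattern the question reports.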

Is there a way to have a secondary storage or backup for data blocks in Hadoop?

I have Hadoop running on a cluster that has non-dedicated nodes (i.e. it shares nodes with other applications/users). When other users are using one of the cluster's nodes, Hadoop jobs are not allowed to run on that node. Thus, it is possible that only a few nodes are available at a given moment, and that these few nodes do not hold all the data blocks (replicas) needed by the Hadoop job.
I also have a big Network-Attached Storage that is used for backup. So, I am wondering if there is a way to use it as a secondary storage for Hadoop. For example, if some data block is missing in the cluster, Hadoop would get the block from the secondary/backup storage.
Any ideas?
Thanks in advance!
I am not aware of such a "mixed" storage mode for Hadoop, so I do not think your scenario is directly supported.
To me it looks like you need a more "elastic" solution. If EMR were available as open source, it might be a good choice, with the NAS playing the role of S3.
I would suggest the following solution in your case:
Install and run data nodes on all available servers. They are not as resource hungry as task trackers, since they only read and write data sequentially.
Install task trackers on all machines as well, but run them only on those that are not currently in use. Hadoop is smart enough to preserve data locality when possible. At the same time, Hadoop copes with a change in the number of task trackers much more easily than with disappearing data nodes.
Alternatively, you can build a cluster of task trackers only, not use HDFS, and run jobs against the NAS.
In all cases, the main interference with other users that I still expect is network congestion: during the shuffle stage Hadoop usually saturates the network.
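The situation from the question, where the nodes that happen to be free do not hold every block, can be checked with a small set computation. This is a toy model with hypothetical block and node names; it does not query a real cluster.

```python
from typing import Dict, List, Set


def missing_blocks(replicas: Dict[str, Set[str]], available_nodes: Set[str]) -> List[str]:
    """Return the blocks that have no replica on any currently available node."""
    return sorted(
        block for block, nodes in replicas.items() if not (nodes & available_nodes)
    )


# Hypothetical placement: each block has three replicas on different nodes.
replicas = {
    "blk_1": {"n1", "n2", "n3"},
    "blk_2": {"n2", "n4", "n5"},
    "blk_3": {"n4", "n5", "n6"},
}

only_two_free = missing_blocks(replicas, {"n1", "n2"})
all_datanodes_up = missing_blocks(replicas, {"n1", "n2", "n3", "n4", "n5", "n6"})
print(only_two_free, all_datanodes_up)
```

Keeping the data nodes running on all servers, as suggested above, corresponds to the second call: every block stays reachable even when only a few machines are free to run tasks.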

Resources