Replicating data between multiple Hadoop clusters residing in different data centers

I was wondering what would be the best way to replicate the data present in a Hadoop cluster H1 in data center DC1 to another Hadoop cluster H2 in data center DC2 (warm backup preferably). I know that Hadoop does data replication and that the number of copies of the data is decided by the replication factor set in hdfs-site.xml. I have a few questions related to this:
Would it make sense to have the data nodes of one cluster spread across both data centers, so that the data nodes for H1 would be present in both DC1 and DC2? If this makes sense and is viable, does it mean we do not need H2?
Would it make sense to have the namenodes and datanodes distributed across both data centers rather than having only the datanodes distributed across both data centers?
I have also heard that people use distcp and the many tools built on top of distcp. But distcp does lazy backups, and I would prefer warm backups over cold ones.
Some people suggest using Kafka for this but I am not sure how to go about using it.
Any help would be appreciated. Thanks.

It depends on what you are trying to protect against. If you want to protect against site failure, distcp seems to be the only option for cross datacenter replication. However, as you pointed out, distcp has limitations. You can use snapshots to protect against user mistakes or application corruptions because replication or multiple replicas will not protect against that. Other commercial tools are available for automating the backup process as well if you don't want to write code and maintain it.
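To make the snapshot suggestion a bit more concrete, here is a minimal sketch that only builds the HDFS commands for a timestamped snapshot. `hdfs dfsadmin -allowSnapshot` and `hdfs dfs -createSnapshot` are standard HDFS commands; the directory `/data/warehouse` and the `backup-` naming scheme are assumptions made up for illustration.

```python
from datetime import datetime, timezone


def snapshot_commands(path, when=None):
    """Build the HDFS CLI commands (as argument lists) to snapshot `path`.

    The "backup-<timestamp>" snapshot name is a hypothetical convention.
    """
    when = when or datetime.now(timezone.utc)
    name = "backup-" + when.strftime("%Y%m%d-%H%M%S")
    return [
        # One-time step: an administrator must mark the directory snapshottable.
        ["hdfs", "dfsadmin", "-allowSnapshot", path],
        # Creates a read-only, point-in-time view under <path>/.snapshot/<name>.
        ["hdfs", "dfs", "-createSnapshot", path, name],
    ]


# "/data/warehouse" is a placeholder directory, not taken from the question.
for command in snapshot_commands("/data/warehouse"):
    print(" ".join(command))
```

The commands are only printed here; you could hand each list to `subprocess.run` on a machine with the Hadoop client installed. Keep in mind that a snapshot lives in the same cluster, so it helps against accidental deletes, not against losing the site.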

Related

Need of maintaining replication factor on datanodes

Please pardon if this question has come up earlier as I'm not able to find any related question for this.
1) I want to know why it is important to maintain the same replication factor (or, for that matter, any configuration) across the datanodes and namenodes in the cluster.
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
3) Wouldn't maintaining the configuration only on the namenodes suffice?
4) What are the implications of having the configuration different across namenode and datanodes?
Any Help is much appreciated. Thank you! :)
I will try to answer your question taking replication as an example.
A few things to keep in mind:
Data always resides on the datanodes. The Namenode never handles or stores the data itself; it only keeps metadata about the data.
The replication factor is configurable per file: for example, file1 may have a replication factor of 2 while file2 has a replication factor of 3. In a similar way, some other properties can also be configured at the time of execution.
2) When we upload any file to HDFS, isn't it the namenode which manages the storage?
I am not sure what exactly you mean by the namenode managing the storage; here is how a file upload to HDFS gets executed:
1) The client sends a request to the Namenode to upload a file to HDFS.
2) The file is broken into blocks based on the block size in the configuration (if not explicitly specified by the client application).
3) For each block, the Namenode decides which Datanodes will store it, based on the replication factor specified in the configuration (if not explicitly specified by the client application).
4) The Namenode sends the list of Datanodes chosen in step #3 to the client.
5) The client application writes each block to the first Datanode in that list, say DN1.
6) DN1 is then responsible for forwarding the data it receives to the next Datanode chosen by the Namenode in #3, which forwards it to the next one, and so on, until the block has the requested number of replicas.
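The steps above can be sketched as a toy model. This is not how the Namenode really picks nodes (the real placement policy is rack-aware); the round-robin choice, the function names, and the DN1..DN4 node names are all assumptions made to keep the example small.

```python
def split_into_blocks(file_size, block_size):
    """Number of blocks needed for a file (the last block may be partial)."""
    return (file_size + block_size - 1) // block_size


def choose_pipelines(num_blocks, datanodes, replication):
    """Pick `replication` distinct datanodes for every block.

    Round-robin is a stand-in for the real, rack-aware placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of datanodes")
    pipelines = []
    for block in range(num_blocks):
        start = block % len(datanodes)
        rotated = datanodes[start:] + datanodes[:start]
        # The client writes to the first node; each node forwards to the next.
        pipelines.append(rotated[:replication])
    return pipelines


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)  # 300 MB file, 128 MB blocks
for number, pipeline in enumerate(
        choose_pipelines(blocks, ["DN1", "DN2", "DN3", "DN4"], 3)):
    print(f"block {number}: " + " -> ".join(pipeline))
```

Note that the block size and replication factor are inputs to the whole procedure, which is why every participant needs to agree on them (or the client has to state them explicitly).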
For your questions #3 and #4, it is important to understand that any distributed application requires a set of configurations to be available on each node so that the nodes can interact with each other and perform their designated tasks as expected. If every node chose its own configuration, what would be the basis of coordination? If DN1 has a replication factor of 5 while DN2 has 2, how would data actually be replicated?
Update start
hdfs-site.xml contains lots of other config specifications as well for namenode, datanode and secondary namenode, some client and hdfs specific settings and not just the replication factor.
Now imagine having a 50 node cluster, would you like to go and configure on each node or simply copy a pre-configured file?
Update end
If you keep all configurations at one location, each node will need to connect to that shared resource to load the configuration every time it has to perform an action. This would add latency, on top of the consistency/synchronization issues that arise whenever a config property is changed.
Hope this helps.
Hadoop is designed to deal with large datasets. It's not a good idea to store a large dataset on a single machine because if your storage system or hard disk crashes, you may lose all of your data.
Before Hadoop, people were using traditional systems to store large amounts of data, but those systems were very costly. Analyzing large datasets was also a challenge, as reading data from a traditional system was a time-consuming process. With these things in mind, the Hadoop framework was designed.
In the Hadoop framework, when you load large amounts of data, it splits the data into small chunks, known as blocks. These blocks are used to place the data on datanodes in a distributed cluster, and they are also used during the analysis of the data.
The reason behind splitting the data is parallel processing and distributed storage (i.e., you can store your data on multiple machines, and when you want to analyze it you can do so in parallel).
Now coming to your questions:
Reason: Hadoop is a framework that allows distributed storage and computing. In other words, you can store the data on multiple machines. Its replication feature means you keep multiple copies (based on the replication factor) of the same data.
Ans1: Hadoop is designed to run on commodity hardware, where failures are common. If you store the data on a single machine and that machine crashes, you lose all of your data. In a Hadoop cluster you can recover the data from another replica (if the replication factor is greater than 1), because Hadoop doesn't store two replicas of the same block on the same machine. Hadoop handles this itself.
Ans2: When you upload a file to HDFS, the actual data goes to the datanodes and the NameNode keeps the metadata for your data. The NameNode metadata contains things like block name, block location, filename, and the directory location of the file.
Ans3: You need to maintain the entire configuration related to your Hadoop cluster. Maintaining only one configuration file is not sufficient, and you may run into other problems later.
Ans4: NameNode configuration properties relate to NameNode functionality, such as the namespace service, the metadata location, and the RPC address that handles all client requests. DataNode configuration properties relate to the services performed by the DataNode, such as storage balancing among the DataNode's volumes, available disk space, and the DataNode server address and port for data transfer.
Please check this link to understand more about the different configuration properties.
Please clarify questions 3 and 4 further if there is something more you want to know.

Can HDFS replicate to other clusters?

HDFS replicates with a factor of 3 within the same cluster. That is fine, but is there a way to set up HDFS so that it also replicates to different clusters/servers? Let's say one replica in the same cluster and another one somewhere far away in another HDFS cluster.
If HDFS does not support this, are there any tools around Hadoop that allow us to do so? How do you guys replicate to other servers?
Currently there are no mechanisms for what you're asking for. Cross-cluster replication has been implemented for HBase, but not for HDFS. There is a plan to support cross datacenter replication in HDFS but it's not implemented yet.
You can use the distcp mechanism to copy your data to another cluster at a regular interval. This will place 3 replicas on each cluster (which is typically what you want for cross dc/cluster replication anyway). Note however that since this has to be done periodically, it's not exactly a replacement for realtime replication. If you lose a cluster in between copies, whatever data was written to the "primary" cluster since the last copy will be unavailable until that cluster has been restored.
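A rough sketch of what the periodic copy and its exposure window look like. `hadoop distcp -update` is a real invocation; the namenode hostnames, the path, and the interval numbers are placeholders I made up, and the loss-window formula is a simple worst-case estimate rather than anything Hadoop reports.

```python
def distcp_command(source_uri, target_uri):
    """Argument list for an incremental copy (-update skips unchanged files)."""
    return ["hadoop", "distcp", "-update", source_uri, target_uri]


def worst_case_loss_minutes(interval_minutes, copy_duration_minutes):
    """Upper bound on how old the newest safely copied data can be.

    Data written right after a copy starts may be missed by that run and is
    only safe once the following run (one interval later) has finished.
    """
    return interval_minutes + copy_duration_minutes


# Hypothetical namenode addresses for the two clusters.
command = distcp_command(
    "hdfs://nn1.dc1.example.com:8020/data",
    "hdfs://nn2.dc2.example.com:8020/data",
)
print(" ".join(command))
print("worst-case exposure (minutes):", worst_case_loss_minutes(60, 15))
```

Running the command from cron (or a workflow scheduler) every hour with a 15-minute copy would, under these assumptions, leave up to 75 minutes of writes that exist only on the primary cluster.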

Falcon vs Wandisco Non-stop

Use case: I need to copy all my data from one HDFS cluster to another cluster with the same setup of masters and slaves; I will then release the previous cluster and start running my jobs on the new cluster.
I have read about Apache Falcon and WANdisco Non-Stop Hadoop, which help with this kind of mirroring. But I'm not sure what other advantages they offer once they are part of my ecosystem. Which would be more advantageous? I realize it depends on my use case, but I want to know if there are any advantages in particular. Can a comparison of Falcon vs WANdisco be provided, based on your experience with them?
(Disclaimer: I work at WANdisco.)
My view is that the products are complementary. Falcon does a lot of things besides data transfer, like setting up data workflow stages. WANdisco's products do active-active data replication (which means that data can be used equivalently from both the source and target clusters).
In your use case, if you use Falcon then you're actually using DistCP to copy data to your new cluster. You might do an initial transfer to get the bulk of the data over, and then at some point you need to do a final cutover to pick up all the deltas, and then you can let applications run on the new cluster.
If you did the data transfer with WANdisco's products, you could use both clusters at the same time as the replication engine coordinates the changes using a Paxos algorithm. That might make an incremental migration easier.
Other scenarios where you'll notice a difference between a continuous active-active replication compared to DistCP are things like backup and disaster recovery and ingesting into multiple data centers. Hope that helps.

Control data locality in Impala by partitioning

I would like to avoid Impala nodes unnecessarily requesting data from other nodes over the network in cases when the ideal data locality or layout is known at table creation time. This would be helpful with 'non-additive' operations where all records from a partition are needed at the same place (node) anyway (for ex. percentiles).
Is it possible to tell Impala that all data in a partition should always be co-located on a single node for any HDFS replica?
In Impala SQL, I am not sure if the "PARTITIONED BY" clause provides this feature. In my understanding, Impala stores its partitions as separate files on HDFS, but HDFS does not guarantee the co-location of related files or blocks by default (rather, it tries to achieve the opposite).
I found some information about Impala's impact on HDFS development, but it is not clear whether these features are already implemented or still planned:
http://www.slideshare.net/deview/aaron-myers-hdfs-impala
(slides 23-24)
Thank you in advance for all.
About the slides you mention ("Co-located block replicas") - it's about an HDFS feature (HDFS-2576) implemented in Hadoop 2.1. It provides a Java API to give hints to HDFS as to where the blocks should be placed.
It's not used in Impala as of 2014, but it definitely seems like building some groundwork for that - as it would give Impala a performance equivalent of specifying distribution key in traditional MPP databases.
No, that completely defeats the purpose of having a distributed file system and MPP computing. It also creates a single point of failure and a bottleneck especially if you're talking about a 250GB table that is joined to itself. Exactly the kind of problems that Hadoop was designed to solve. Partitioning data creates sub-directories in HDFS on the namenode and that data is then replicated throughout the datanodes in the cluster.

How to use HBase and Hadoop to serve live traffic AND perform analytics? (Single cluster vs separate clusters?)

Our primary purpose is to use Hadoop for doing analytics. In this use case, we do batch processing, so throughput is more important than latency, meaning that HBase is not necessarily a good fit (although getting closer to real-time analytics does sound appealing). We are playing around with Hive and we like it so far.
Although analytics is the main thing we want to do in the immediate future with Hadoop, we are also looking to potentially migrate parts of our operations to HBase and to serve live traffic out of it. The data that would be stored there is the same data that we use in our analytics, and I wonder if we could just have one system for both live traffic and analytics.
I have read a lot of reports and it seems that most organizations choose to have separate clusters for serving traffic and for analytics. This seems like a reasonable choice for stability purposes, since we plan to have many people writing Hive queries, and badly written queries could potentially compromise the live operations.
Now my question is: how are those two different use cases reconciled (serving live traffic and doing batch analytics)? Do organizations use systems to write all data in two otherwise independent clusters? Or is it possible to do this out of the box with a single cluster in which some of the nodes serve live traffic and others do only analytics?
What I'm thinking is that we could perhaps have all data coming into the nodes that are used for serving live traffic, and let the HDFS replication mechanisms manage the copying of data into nodes that are used for analytics (increasing the replication higher than the default 3 probably makes sense in such scenario). Hadoop can be made aware of special network topologies, and it has functionality to always replicate at least one copy to different racks, so this seems to mesh well with what I'm describing.
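For reference, this is roughly what the topology awareness I mention looks like in practice: Hadoop can call a script (configured via `net.topology.script.file.name`) with one or more host addresses and read back one rack path per address. The sketch below is such a script; the IP prefixes and the `/live/...` and `/analytics/...` rack names are assumptions invented for this example.

```python
import sys

# Hypothetical mapping from IP prefix to rack path.
RACKS_BY_PREFIX = {
    "10.0.1.": "/live/rack1",
    "10.0.2.": "/analytics/rack1",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for a host, or the default rack if unknown."""
    for prefix, rack in RACKS_BY_PREFIX.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # Hadoop passes the hosts as arguments and expects one rack per host.
    for host in sys.argv[1:]:
        print(resolve_rack(host))
```

As far as I understand, the default placement policy only makes sure replicas span more than one rack; a script like this tells HDFS where nodes are, but by itself it does not pin a replica to the "analytics" group.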
The nodes dedicated to live traffic could be set to have zero (or few) map and reduce slots, so that all Hive queries end up being processed by the nodes dedicated to analytics.
The nodes dedicated to analytics would always be a little behind those dedicated to serving live traffic, but that does not seem to be a problem.
Does that kind of solution make sense? I am thinking it could be simpler to have one cluster than two, but would this be significantly riskier? Are there known cases of companies using an HBase cluster to serve live traffic while also running batch analytics jobs on it?
I'd love to get your opinions on this :) !
Thanks.
EDIT: What about Brisk? It's based on Cassandra instead of HBase, but it seems to be made exactly for what I'm describing (hybrid clusters). Has anyone worked with it before? Is it mature?
--
Felix
Your approach has a few problems... even in rack-aware mode, if you have more than a few racks I don't see how you can be guaranteed that your blocks will be replicated onto those particular nodes. If you lose one of your "live" nodes, then you will be under-replicated for a while and won't have access to that data.
HBase is greedy in terms of resources and I've found it doesn't play well with others (in terms of memory and CPU) in high load situations. You mention, too, that heavy analytics can impact live performance, which is also true.
In my cluster, we use Hadoop quite a bit to preprocess data for ingest into HBase. We do things like enrichment, filtering out records we don't want, transforming, summarization, etc. If you are thinking you want to do something like this, I suggest sending your data to HDFS on your Hadoop cluster first, then offloading it to your HBase cluster.
There is nothing stopping you from having your HBase cluster and Hadoop cluster on the same network backplane. I suggest that instead of having hybrid nodes, you just dedicate some nodes to your Hadoop cluster and some nodes to your HBase cluster. The network transfer between the two will be quite snappy.
Just my personal experience so I'm not sure how much of it is relevant. I hope you find it useful and best of luck!
I think this kind of solution might make sense, since MR is mostly CPU intensive and HBase is a memory-hungry beast. What we do need is to arrange resource management properly. I think it is possible in the following way:
a) CPU. We can define the maximum number of MR mapper/reducer slots per node, and assuming that each mapper is single threaded, we can limit the CPU consumption of MR. The rest will go to HBase.
b) Memory. We can limit the memory for mappers and reducers and give the rest to HBase.
c) I think we cannot properly manage HDFS bandwidth sharing, but I do not think it should be a problem for HBase, since disk operations are not on its critical path.
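Points a) and b) boil down to simple arithmetic per node. A back-of-the-envelope sketch follows; the node size, slot counts, and heap sizes are made-up numbers, and on classic MapReduce the slot limits would correspond to `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`.

```python
def hbase_memory_budget_mb(node_mb, map_slots, reduce_slots,
                           task_heap_mb, os_reserved_mb):
    """Memory left for HBase after the OS and all MR slots take their share."""
    mapreduce_mb = (map_slots + reduce_slots) * task_heap_mb
    remaining = node_mb - os_reserved_mb - mapreduce_mb
    if remaining <= 0:
        raise ValueError("MR slots and OS reserve leave no memory for HBase")
    return remaining


def cores_left_for_hbase(total_cores, map_slots, reduce_slots):
    """Cores not claimed by MR, assuming one busy thread per task slot."""
    return max(total_cores - map_slots - reduce_slots, 0)


# Hypothetical node: 64 GB RAM, 16 cores, 8 map + 4 reduce slots, 1 GB per task.
print(hbase_memory_budget_mb(65536, 8, 4, 1024, 4096), "MB left for HBase")
print(cores_left_for_hbase(16, 8, 4), "cores left for HBase")
```

With these numbers 49152 MB and 4 cores remain for HBase. This ignores off-heap usage and the HDFS datanode process itself, so treat it as a starting point for sizing rather than a guarantee.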
