Hadoop Data Corrupted Following Power Failure

I'm new to Hadoop and learning to use it by working with a small cluster where each node is an Ubuntu Server VM. The cluster consists of 1 name node and 3 data nodes with a replication factor of 3. After a power loss on the machine hosting the VMs, all files stored in the cluster were corrupted, with the blocks storing those files missing. No queries were running at the time power was lost and no files were being written to or read from the cluster.
If I shut down the VMs correctly (even without first stopping the Hadoop cluster), then the data is preserved and I don't run into any issues with missing or corrupted blocks.
The only information I've been able to find suggested setting dfs.datanode.sync.behind.writes to true, but this did not resolve the issue (killing the VMs from the host causes the same issue as a power failure). The information I found here seems to indicate this property will only have an effect when writing data to the disk.
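For reference, this is roughly how that property is set in hdfs-site.xml (a minimal sketch of what was tried; nothing else in the file is shown):

    <!-- hdfs-site.xml (sketch): the property mentioned above -->
    <property>
      <name>dfs.datanode.sync.behind.writes</name>
      <value>true</value>
      <!-- Tells the datanode to ask the OS to flush written data to disk
           promptly; it only affects blocks that are in the process of being
           written, not data already sitting on disk. -->
    </property>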
I also tried running hdfs namenode -recover, but this did not resolve the issue. Ultimately I had to remove the data stored in the dfs.namenode.name.dir directory, reboot each VM in the cluster to remove any Hadoop files in /tmp, and reformat the name node before copying the data back into the cluster from local file storage.
I understand that having all nodes in the cluster running on the same hardware and only 3 data nodes to go with a replication factor of 3 is not an ideal configuration, but I'd like a way to ensure that any data that is already written to disk is not corrupted by a power loss. Is there a property or other configuration I need to implement to avoid this in the future (besides separate hardware, more nodes, power backup, etc.)?
EDIT: To clarify further, the issue I'm trying to resolve is data corruption, not cluster availability. I understand I need to make changes to the overall cluster architecture to improve reliability, but I'd like a way to ensure data is not lost even in the event of a cluster-wide power failure.
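One hedged, illustrative thing to double-check, given the mention of Hadoop files under /tmp: by default hadoop.tmp.dir lives under /tmp, so unless dfs.namenode.name.dir and dfs.datanode.data.dir are pointed at persistent locations, metadata and blocks can end up on storage that does not survive a crash or reboot. The paths below are placeholders, not a recommendation of specific locations:

    <!-- hdfs-site.xml (sketch): keep NameNode metadata and DataNode blocks
         off /tmp; the paths shown here are purely illustrative. -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///var/hadoop/dfs/name</value>
    </property>
    <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///var/hadoop/dfs/data</value>
    </property>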

Related

Why is the default replication factor 1 in the Cloudera 5.12 VM whereas it is 3 in previous versions?

If the default replication factor was changed from 3 to 1, then are we not losing the reliability of HDFS? How can a Hadoop engineer retrieve the only copy of a block if it crashes or is deleted for some reason?
It may be 1 in the Virtual Machine, as it only has a single datanode process. If it was set to 3 on the VM, then any files you create will be under-replicated and HDFS will not be able to repair them.
On a real cluster with many datanodes, the default should be 3.
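For context, the cluster-wide default is controlled by dfs.replication in hdfs-site.xml, and you can check the effective value from the command line; a minimal sketch:

    <!-- hdfs-site.xml (sketch): cluster-wide default replication factor -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # Check the value the cluster is actually using:
    hdfs getconf -confKey dfs.replication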
With respect to your question
Why is the default replication factor 1 in the Cloudera 5.12 VM whereas it is 3 in previous versions?
I've checked the documentation about DataNodes and found that
The default replication factor for HDFS is three. That is, three copies of data are maintained at all times.
So it seems this is not the case for the general Cloudera software distribution, but perhaps only for your specific case with the Cloudera QuickStart VM 5.12.
If the default replication factor was changed from 3 to 1, then are we not losing the reliability of HDFS?
You are correct about this.
How can a Hadoop engineer retrieve the only copy of a block if it crashes or is deleted for some reason?
This also wouldn't be possible.
As @Stephen ODonnell already mentioned:
It may be 1 in the Virtual Machine, as it only has a single data node process.
In a single virtual (demo?) environment there might not be the need or the resources for many nodes, high availability, and so on.
If it was set to 3 on the VM, then any files you create will be under-replicated and HDFS will not be able to repair them.
Which might be OK for a one-node cluster in a single VM, to save resources.
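If you want to see whether files on such a single-node VM are under-replicated, a quick check with fsck is enough (the path is illustrative):

    # Report per-file block, replication, and location information
    hdfs fsck / -files -blocks -locations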

How to reconfigure a non-HA HDFS cluster to HA with minimal/no downtime?

I have a single namenode HDFS cluster with multiple datanodes that store many terabytes of data. I want to enable high availability on that cluster and add another namenode. What is the most efficient and least error-prone way to achieve that? Ideally that would work without any downtime or with a simple restart.
The two options that came to mind are:
Edit the configuration of the namenode to facilitate the HA features and restart it. Afterwards, add the second namenode, then reconfigure and restart the datanodes so that they are aware that the cluster is now HA.
Create an identical cluster in terms of data, but with two namenodes. Then migrate the data from the old datanodes to the new datanodes and finally adjust the pointers of all HDFS clients.
The first approach seems easier, but requires some downtime and I am not sure if that is even possible. The second one is somewhat cleaner, but there are potential problems with the data migration and the pointer adjustments.
You won't be able to do this in-place without any down time; a non-HA setup is exactly that, not highly available, and thus any code/configuration changes require downtime.
To incur the least amount of downtime while doing this in-place, you probably want to:
Set up configurations for an HA setup. This includes things like a shared edits directory or journal nodes - see https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html or https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html.
Create a new fsImage using the hdfs dfsadmin command. This will ensure that the NameNode is able to restart quickly (on startup, the NN reads the most recent fsImage, then applies all edits from the EditLog that were created after that fsImage). A command and configuration sketch follows these steps.
Restart your current NameNode and put it into active mode.
Start the new NameNode in standby.
Update configurations on DataNodes and restart them to apply.
Update configurations on other clients and restart to apply.
At this point everything will be HA-aware, and the only downtime incurred was a quick restart of the active NN - equivalent to what you would experience during any code/configuration change in a non-HA setup.
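A hedged sketch of steps 1, 2, and 4 for the QJM option; the nameservice ID, namenode IDs, and host names below are placeholders, and the full set of per-namenode RPC/HTTP address properties from the linked docs is omitted:

    <!-- hdfs-site.xml (sketch, QJM-based HA); IDs and hosts are placeholders -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
    </property>

    # Step 2: checkpoint a fresh fsImage so the NameNode restart is quick
    hdfs dfsadmin -safemode enter
    hdfs dfsadmin -saveNamespace
    hdfs dfsadmin -safemode leave

    # Step 4 (on the new NameNode host): copy the metadata from the active
    # NameNode before starting this one in standby
    hdfs namenode -bootstrapStandby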
Your second approach should work, but remember that you will need twice as much hardware, and maintaining consistency across the two clusters during the migration may be difficult.
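If you do go with the second approach, the data migration between the two clusters is typically done with DistCp; a minimal sketch with placeholder addresses and paths:

    # Copy data from the old non-HA cluster into the new HA cluster
    # (namenode address, nameservice, and paths are placeholders)
    hadoop distcp hdfs://old-nn:8020/data hdfs://mycluster/data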

HDFS migrate datanodes servers to new servers

I want to migrate our Hadoop server, with all the data and components, to new servers (a newer version of Red Hat).
I saw a post on the Cloudera site about how to move the namenode,
but I don't know how to move all the datanodes without data loss.
We have a replication factor of 2.
If I shut down one datanode at a time, will HDFS generate new replicas?
Is there a way to migrate all the datanodes at once? What is the correct way to transfer all the datanodes (about 20 servers) to a new cluster?
I also wanted to know if HBase will have the same problem, or if I can just delete the roles and add them on the new servers.
Update to clarify:
My Hadoop cluster already contains two sets of servers (they are in the same Hadoop cluster; I just split it logically for the example).
The first set is the older version of Linux servers.
The second set is the newer version of Linux servers.
Both sets already share data and components (the namenode is in the old set of servers).
I want to remove the old set of servers so only the new set of servers will remain in the Hadoop cluster.
Should the procedure be:
shut down one datanode (from the old server set)
run the balancer and wait for it to finish
do the same for the next datanodes
Because if so, the balancer operation takes a lot of time and the whole migration will take a lot of time.
The same problem applies to HBase:
right now the HBase region servers and master are only on the old set of servers, and I want to remove them and install them on the new set of servers without data loss.
Thanks
New Datanodes can be freely added without touching the namenode. But you definitely shouldn't shut down more than one at a time.
As an example, if you pick two servers to shut down at random, and both hold a block of a file, there's no chance of it replicating somewhere else. Therefore, upgrade one at a time if you're reusing the same hardware.
In an ideal scenario, your OS disk is separated from the HDFS disks. In which case, you can unmount them, upgrade the OS, reinstall HDFS service, remount the disks, and everything will work as previously. If that isn't how you have the server set up, you should do that before your next upgrade.
In order to get replicas added to any new datanodes, you'll need to either 1) increase the replication factor or 2) run the HDFS balancer to ensure that the replicas are shuffled across the cluster.
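A hedged sketch of those two options (the path and threshold are illustrative):

    # Option 1: raise replication on existing data and wait for it to complete
    hdfs dfs -setrep -w 3 /

    # Option 2: run the balancer until datanode usage is within 10% of the mean
    hdfs balancer -threshold 10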
I'm not too familiar with HBase, but I know you'll need to flush the regionservers before you install and migrate that service to other servers. But if you flush the majority of them without rebalancing the regions, you'll have one server that holds all the data. I'm sure the master server has similar caveats, although hbase backup seems to be a command worth trying.
@guylot - After adding the new nodes and running the balancer process, take the old nodes out of the cluster by going through the decommissioning process. The decommissioning process will move the data to another node in your cluster. As a matter of precaution, only run it against one node at a time. This will limit the potential for a data loss incident.
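A hedged sketch of that decommissioning flow for a plain Apache-style setup (the exclude-file path is a placeholder; a management tool such as Cloudera Manager can also drive this from its UI):

    <!-- hdfs-site.xml (sketch): point the namenode at an exclude file -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/dfs.exclude</value>
    </property>

    # Add one old datanode's hostname to the exclude file, then:
    hdfs dfsadmin -refreshNodes
    # Watch until that node reports "Decommissioned" before doing the next one
    hdfs dfsadmin -report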

Add server with existing data as DataNode to Hadoop

I need to build a distributed, as fail-proof as possible, cluster from several servers with existing data.
I'm new to Hadoop, but as far as I can tell, it comes closer to satisfying my requirements than other products.
The problem is that I already have some data (quite large files) which I want to be available in Hadoop.
Is it possible to add server with existing data as DataNode to Hadoop?
What should I do to make it possible?
It appears to be impossible, except for moving the existing data into HDFS after deploying a DataNode on that box.
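In other words, once a DataNode is running on that box, the existing files still have to be copied into HDFS explicitly; a minimal sketch with illustrative paths:

    # Copy the pre-existing local files into HDFS (paths are illustrative)
    hdfs dfs -mkdir -p /data/imported
    hdfs dfs -put /srv/existing-data /data/imported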

Should the HBase region server and the Hadoop data node be on the same machine?

Sorry that I don't have the resources to set up a cluster to test this; I'm just wondering:
Can I deploy the HBase region server on a separate machine other than the Hadoop data node machine? I guess the answer is yes, but I'm not sure.
Is it good or bad to deploy the HBase region server and the Hadoop data node on different machines?
When putting some data into HBase, where is this data eventually stored: the data node or the region server? I guess it's the data node, but then what are the StoreFile and HFile in the region server? Aren't they the physical files that store our data?
Thank you!
RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance.
Very bad; that would work against the data locality principle (if you want to know a little more about data locality, check this: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html).
Actual data will be stored in HDFS (on the DataNodes); RegionServers are responsible for serving and managing regions.
For more information about HBase architecture please check this excellent post from Lars' blog: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
BTW, as long as you have a PC with decent RAM you can set up a demo cluster with virtual machines. Do not ever try to set up a production environment without properly testing the platform first in a development environment.
To go into more detail about this answer:
"RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance."
I'm not sure how anyone would interpret the term "alongside", so let's try to be even more precise:
What makes any physical server an "XYZ" server is that it's running a program called a daemon (think "eternally-running background event-handling" program);
What makes a "file" server is that it's running a file-serving daemon;
What makes a "web" server is that it's running a web-serving daemon;
AND
What makes a "data node" server is that it's running the HDFS data-serving daemon;
What makes a "region" server then is that it's running the HBase region-serving daemon (program);
So, in all Hadoop distributions (e.g. Cloudera, MapR, Hortonworks, and others), the general best practice is that, for HBase, the RegionServers are co-located with the DataNode servers.
This means that the actual slave (datanode) servers which form the HDFS cluster are each running the HDFS data-serving daemon (program)
and they're also running the HBase region-serving daemon (program) as well!
This way we ensure locality: the concurrent processing and storing of data on all the individual nodes in an HDFS cluster, with no "movement" of gigantic loads of big data from "storage" locations to "processing" locations. Locality is vital to the success of a Hadoop cluster, such that HBase region servers (data nodes also running the HBase daemon) must do all their processing (putting/getting/scanning) on each data node containing the HFiles which make up HRegions, which make up HTables, which make up HBases (Hadoop databases).
So, servers (VMs or physical, on Windows, Linux, etc.) can run multiple daemons concurrently; in fact, they often run dozens of them.
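As a quick illustration of co-location, a process listing on a properly set up worker node would show both daemons in the same JVM list; a sketch of what that check might look like (the PIDs are obviously illustrative):

    # On a worker node, list the running Hadoop/HBase Java daemons
    $ jps
    2345 DataNode
    2567 HRegionServer
    2789 NodeManager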
