Does hard restart of agent delete data? - hadoop

Does hard restart of agent delete data?
I did a hard restart of my agent, and now I do not see any data in hadoop fs -ls /user/hue or hadoop fs -ls /user/hive. I also do not see my other users, only hue and hive. What do I do? Where did the data go?
I don't think data in hdfs should go anywhere with that.
If I query my tables in hive, I keep getting
The operation has no results
Help please!

Doing a hard restart of the Cloudera Manager agent will not cause data loss, but will cause all of the Hadoop daemons to be restarted. A normal restart of the agent does not do this, so a hard restart is useful if you need to force a stop of all the running processes.
If you are seeing no data in HDFS following a restart, check the status of the HDFS service in Cloudera Manager. It will tell you how much capacity is used in HDFS, the number of files, and other metrics. If you're seeing no data, it could be that your DataNodes have not been started. Check whether this is the case and whether your NameNode is still in safe mode.
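If you prefer the command line to the Cloudera Manager UI, the same checks can be done with the standard HDFS admin commands; a minimal sketch, assuming you run it as the HDFS superuser on a cluster node:

hdfs dfsadmin -report        # capacity used, block counts, live and dead DataNodes
hdfs dfsadmin -safemode get  # shows whether the NameNode is still in safe mode

# Only if the DataNodes are healthy and the NameNode is stuck, leave safe mode manually
hdfs dfsadmin -safemode leave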

Related

HDFS /tmp filesystem is filling up rapidly and expected to cause outage

In our Hadoop cluster (Cloudera distribution), we recently found that a Hive job started by a user created 160 TB of files under '/tmp'. It consumed almost all of the remaining HDFS space and was about to cause an outage. We eventually killed that particular job, since we were unable to reach the user who started it.
So my question is: can we set an alert on the '/tmp' location when someone creates huge files there, or should we restrict users with an HDFS quota? Please share any other suggestions you may have.
You can set and manage quotas for a directory using the commands below:
hdfs dfsadmin -setQuota <N> <directory>...<directory>
hdfs dfsadmin -clrQuota <directory>...<directory>
hdfs dfsadmin -setSpaceQuota <N> <directory>...<directory>
hdfs dfsadmin -clrSpaceQuota <directory>...<directory>
where N is the limit you want to set: for -setQuota it is the maximum number of names (files and directories) under the tree, and for -setSpaceQuota it is the maximum number of bytes (suffixes such as g and t are accepted).
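For example, to cap '/tmp' at roughly 10 TB of raw space and then verify the limit (the path and the 10t figure are only placeholders; pick values that fit your cluster):

hdfs dfsadmin -setSpaceQuota 10t /tmp
hdfs dfs -count -q -h /tmp   # prints the quota, remaining quota, and current usage

Note that the space quota counts raw space including all replicas, so with replication factor 3 a 10 TB quota allows roughly 3.3 TB of user data.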
Reference Link
Helpful article
Hope this helps your scenario.
On the processing side, you can also manage resources from Cloudera Manager through YARN resource pools. You can limit the maximum cores and memory allocated to each user or service running on your cluster.

Corrupted block in hdfs cluster

The screenshot added below shows the output of hdfs fsck /. It shows that the "/" directory is corrupted. This is the master node of my Hadoop cluster. What should I do?
If you are using Hadoop 2, you can run a Standby namenode to achieve High Availability. Without that, your cluster's master will be a Single Point of Failure.
You cannot retrieve the NameNode's data from anywhere else, since it is different from the usual data you store. If your NameNode goes down, your blocks and files will still be there, but you won't be able to access them, because there would be no related metadata in the NameNode.
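If the corruption is limited to individual blocks rather than a lost NameNode, fsck can tell you which files are affected; a minimal sketch, to be run as the HDFS superuser:

hdfs fsck / -list-corruptfileblocks    # list files that have corrupt or missing blocks
hdfs fsck / -files -blocks -locations  # show every block and the DataNodes holding it

# Last resort: move unrecoverable files to /lost+found, or delete them outright
hdfs fsck / -move
hdfs fsck / -delete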

Restarting NameNode in Hadoop Cluster without format

For certain reasons I had to shut down the master node of my cluster. Now when we start the cluster again, the NameNode won't run unless we format it again. Is there any solution for starting the NameNode without formatting? I have tried everything:
start-all.sh, and starting the NameNode/DataNodes individually, but the NameNode won't start until I format it again. How can I start the NameNode without formatting?
Thanks in Advance
Please post the log information.
In fact, you don't need to format when you restart Hadoop. The HDFS metadata is stored on disk, and if you format the NameNode, that metadata will be lost.
Check whether a NameNode process still exists after you stop the cluster, using the command ps -e | grep java. If one does, kill it and start the NameNode again.
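A minimal sketch of that check and restart, assuming the standard Hadoop 2 daemon scripts are on your PATH:

ps -e | grep java   # look for a leftover Java process after stopping the cluster
jps                 # lists running Hadoop daemons by name (NameNode, DataNode, ...)

# If a stale NameNode is still running, kill it by PID, then start it again
kill <pid>
hadoop-daemon.sh start namenode   # on Hadoop 3 the equivalent is: hdfs --daemon start namenode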

Hadoop backup and recovery tool and guidance

I am new to Hadoop and need to learn the details of backup and recovery. I have studied Oracle backup and recovery; will it help with Hadoop? Where should I start?
There are a few options for backup and recovery. As s.singh points out, data replication is not DR.
HDFS supports snapshots. These can be used to protect against user errors, recover files, etc. That being said, snapshots aren't DR in the event of a total failure of the Hadoop cluster. (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html)
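A minimal sketch of taking and restoring from a snapshot; the directory and snapshot name here are only examples:

hdfs dfsadmin -allowSnapshot /user/hive/warehouse   # enable snapshots on the directory (superuser)
hdfs dfs -createSnapshot /user/hive/warehouse before-upgrade

# Snapshots are read-only copies exposed under .snapshot; restore by copying files back
hdfs dfs -ls /user/hive/warehouse/.snapshot/before-upgrade
hdfs dfs -cp /user/hive/warehouse/.snapshot/before-upgrade/some_table /user/hive/warehouse/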
Your best bet is keeping off-site backups. This can be to another Hadoop cluster, S3, etc and can be performed using distcp. (http://hadoop.apache.org/docs/stable1/distcp2.html), (https://wiki.apache.org/hadoop/AmazonS3)
Here is a Slideshare by Cloudera discussing DR (http://www.slideshare.net/cloudera/hadoop-backup-and-disaster-recovery)
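For example, a sketch of copying a directory to a second cluster and to S3 with distcp; the NameNode hostnames and the bucket name are placeholders:

hadoop distcp hdfs://active-nn:8020/user/data hdfs://backup-nn:8020/backups/user/data
hadoop distcp /user/data s3a://my-backup-bucket/user/data   # older releases use the s3n:// scheme

distcp runs as a MapReduce job, so it scales with the cluster and can be scheduled like any other batch job.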
Hadoop is designed to run on big clusters with thousands of nodes, so data loss is less likely. You can increase the replication factor to replicate the data onto more nodes across the cluster.
Refer to Data Replication.
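For example, the replication factor of existing files can be raised with setrep; the path and factor here are just an illustration:

hdfs dfs -setrep -w 3 /user/data   # re-replicate everything under /user/data and wait until done

New files pick up the default from dfs.replication in hdfs-site.xml.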
For NameNode metadata backup, you can use either the secondary NameNode or Hadoop High Availability.
Secondary NameNode
The secondary NameNode keeps a backup of the NameNode metadata (fsimage and edit logs). If the NameNode fails, you can recover the metadata (which holds the data block information) from the secondary NameNode.
High Availability
High Availability is a feature that lets you run more than one NameNode in the cluster. One NameNode will be active and the other will be on standby. The edit logs are available to both NameNodes. If one NameNode fails, the other becomes active and handles operations.
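With HA configured, you can check and trigger failover from the command line; a sketch assuming the two NameNodes are registered with the service IDs nn1 and nn2:

hdfs haadmin -getServiceState nn1   # prints active or standby
hdfs haadmin -getServiceState nn2
hdfs haadmin -failover nn1 nn2      # manual failover; automatic failover requires ZKFC/ZooKeeper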
But in most cases you also need to plan for backup and disaster recovery. Refer to #brandon.bell's answer.
You can use the HDFS sync application on DataTorrent for DR use cases to backup high volumes of data from one HDFS cluster to another.
https://www.datatorrent.com/apphub/hdfs-sync/
It uses Apache Apex as a processing engine.
Start with the official documentation website: HdfsUserGuide
Have a look at the SE posts below:
Hadoop 2.0 data write operation acknowledgement
Hadoop: HDFS File Writes & Reads
Hadoop 2.0 Name Node, Secondary Node and Checkpoint node for High Availability
How does Hadoop Namenode failover process works?
Documentation page regarding Recovery_Mode:
Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can read the metadata from one of the other storage locations.
However, what can you do if the only storage locations available are corrupt? In this case, there is a special NameNode startup mode called Recovery mode that may allow you to recover most of your data.
You can start the NameNode in recovery mode like so: namenode -recover

How does Apache Spark handles system failure when deployed in YARN?

Preconditions
Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is running. How does Spark handle the situations listed below?
Cases & Questions
One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data was lost.
What will happen to tasks that were running on that node?
One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost; Spark simply cannot find a file anymore that was pre-configured as a resource for the workflow.
How will it handle this situation?
During execution the primary NameNode fails over.
Does Spark automatically use the failover NameNode?
What happens when the secondary NameNode fails as well?
For some reason, during a workflow, the cluster is totally shut down.
Will Spark restart with the cluster automatically?
Will it resume from the last "save" point in the workflow?
I know some of these questions might sound odd. Anyway, I hope you can answer some or all of them.
Thanks in advance. :)
Here are the answers given by the mailing list to the questions (answers were provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restart is part of administration and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."