Manually start HDFS every time I boot? - bash

In short: should I start HDFS every time I come back to the cluster after powering it off?
I have successfully created a Hadoop cluster (after losing some battles) and now I want to be very careful proceeding with it.
Should I execute start-dfs.sh every time I power on the cluster, or is it ready to run my application's code as-is? The same question applies to start-yarn.sh.
I am afraid that if I run it when something is not right, it might leave garbage directories behind.

Just from playing around with the Hortonworks and Cloudera sandboxes, I can say that turning them on and off doesn't seem to have any "side effects".
However, you do need to start the required services every time the cluster comes up.
As far as power cycling goes in a real cluster, it is recommended to stop the services running on the respective nodes before powering them down (stop-dfs.sh and stop-yarn.sh). That way there are no odd problems, and any errors encountered while stopping the services are properly logged on each node.
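For reference, here is a minimal sketch of that sequence, assuming the scripts are run as the HDFS/YARN admin user and are on the PATH (adjust for your install):

# before powering the cluster off: stop YARN first, then HDFS
stop-yarn.sh
stop-dfs.sh

# after powering the cluster back on: start HDFS first, then YARN
start-dfs.sh
start-yarn.sh

# sanity-check that the daemons are up before submitting anything
jps
hdfs dfsadmin -report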

Related

HDFS: migrate datanode servers to new servers

I want to migrate our Hadoop server, with all the data and components, to new servers (a newer version of Red Hat).
I saw a post on the Cloudera site about how to move the namenode,
but I don't know how to move all the datanodes without data loss.
We have a replication factor of 2.
If I shut down one datanode at a time, will HDFS generate new replicas?
Is there a way to migrate all the datanodes at once? What is the correct way to transfer all the datanodes (about 20 servers) to a new cluster?
I also wanted to know whether HBase will have the same problem, or whether I can just delete and add the roles on the new servers.
Update for clarification:
My Hadoop cluster already contains two sets of servers (they are in the same Hadoop cluster; I just split them logically for the example):
the first set is the older Linux servers,
the second set is the newer Linux servers.
Both sets already share data and components (the namenode is in the old set of servers).
I want to remove the whole old set of servers so that only the new set remains in the Hadoop cluster.
Should the process be:
shut down one datanode (from the old set),
run the balancer and wait for it to finish,
do the same for the next datanode?
Because if so, the balancer operation takes a lot of time and the whole migration will take a lot of time.
The same problem applies to HBase.
Right now the HBase regionservers and master are only on the old set of servers, and I want to remove them and install them on the new set without data loss.
Thanks
New datanodes can be freely added without touching the namenode, but you definitely shouldn't shut down more than one at a time.
As an example, if you pick two servers to shut down at random and both hold a block of a file, there is no chance of that block replicating somewhere else. Therefore, upgrade one node at a time if you're reusing the same hardware.
In an ideal scenario, your OS disk is separate from the HDFS disks. In that case, you can unmount them, upgrade the OS, reinstall the HDFS service, remount the disks, and everything will work as before. If that isn't how your servers are set up, you should fix that before your next upgrade.
In order to get replicas onto any new datanodes, you'll need to either 1) increase the replication factor or 2) run the HDFS balancer to ensure that the replicas are shuffled across the cluster.
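As a rough sketch of those two options (the target replication factor, path, and threshold below are placeholders, not values from the question):

# option 1: raise the replication factor of existing data (here to 3, cluster-wide)
hdfs dfs -setrep -w 3 /

# option 2: run the balancer so blocks get spread onto the new datanodes
# (the threshold is the allowed deviation in disk usage between nodes, in percent)
hdfs balancer -threshold 10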
I'm not too familiar with HBase, but I know you'll need to flush the regionservers before you install and migrate that service to other servers. If you flush the majority of them without rebalancing the regions, though, you'll end up with one server holding all the data. I'm sure the master has similar caveats, although hbase backup seems to be a command worth trying.
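For what it's worth, a flush can be triggered from the HBase shell; the table name here is just a placeholder:

# force a table's memstores out to HFiles on HDFS before moving the regionservers
echo "flush 'my_table'" | hbase shell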
#guylot - After adding the new nodes and running the balancer process, take the old nodes out of the cluster by going through the decommissioning process. Decommissioning moves the data onto the other nodes in your cluster. As a precaution, run it against only one node at a time to limit the potential for data loss.
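A hedged outline of what decommissioning looks like on the command line (the exclude-file path is an example; it must match whatever dfs.hosts.exclude points to in your hdfs-site.xml):

# 1. add the old datanode's hostname to the HDFS exclude file
echo "old-datanode-01.example.com" >> /etc/hadoop/conf/dfs.exclude

# 2. tell the namenode (and the resourcemanager, if the node also runs YARN) to re-read the host lists
hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes

# 3. watch the node's status; wait for "Decommissioned" before powering it off
hdfs dfsadmin -report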

Do EC2 system reboot events cause EMR to wait for the node or to replace it?

I have a long-running EMR cluster. I received EC2 event notifications of upcoming system reboots. The help document advises that rebooting these instances manually will not reschedule the event, though stopping and starting them might.
The EMR cluster claims that if a core node goes unresponsive it will provision a new one. I suspect this provisioning takes longer than a reboot, so what I cannot find in the documentation is whether the EC2 event is known to EMR and the cluster will wait for its missing core nodes (or task nodes) to reboot and rejoin, or whether EMR will respond as though those instances disappeared unexpectedly and start provisioning replacements even as the nodes come back and rejoin the cluster.
Does anyone know which it will be?
It turns out the AWS service person handling the hardware replacement and rebooting the instance was tasked with making the correct adjustments in EMR for the instance change. They started by adding a node, then draining the old node of tasks. Then they rebooted the node and it was reattached to EMR. Finally, they drained the added node and shut it down.
I'm not sure this happens every time there's a reboot event, though. It seems like the script of service steps is adapted for different types of cases.

Closing down Amazon EMR temporarily

I wanted to know if I can temporarily shut down my EMR EC2 instances to avoid extra charges. Can I take a snapshot of my cluster and close the EC2 instances temporarily?
You cannot currently terminate/stop your master instance without losing everything on your cluster, including in HDFS, but one thing you might be able to do is shrink your core/task node instance groups when you don't need them. You must still keep at least one core instance (or more if you have a lot of data in HDFS that you want to keep), but you can resize your task instance groups down to zero if your cluster is not in use.
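For example, shrinking a task instance group to zero can be done with the AWS CLI; the cluster and instance-group IDs below are placeholders:

# find the ID of the task instance group
aws emr list-instance-groups --cluster-id j-XXXXXXXXXXXXX

# resize the task instance group to zero instances while the cluster is idle
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=0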
On the other hand, unless you really have something on your cluster that you need to keep, you might just want to terminate your cluster when you no longer need it, then clone it to a new cluster when you require it again.
For instance, if you only ever store your output data in S3 or another service external to your EMR cluster, there is no reason you need to keep your EMR cluster running while idle and no reason to need to "snapshot" it because your cluster is essentially stateless.
If you do have any data/configuration stored on your cluster that you don't want to lose, you might want to consider moving it off the cluster so that you can shut down your cluster when not in use. (Of course, how you would do this depends upon what exactly we're talking about.)
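If that data lives in HDFS, one option (bucket and paths below are placeholders) is to copy it to S3 before terminating:

# copy HDFS data to S3 with the S3DistCp tool that ships with EMR
s3-dist-cp --src hdfs:///user/hadoop/output --dest s3://my-bucket/backup/output

# plain distcp works too if s3-dist-cp is not available
hadoop distcp hdfs:///user/hadoop/output s3a://my-bucket/backup/output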

Hadoop environment is Down

I am a computer science student, and as part of my research I am working with a Hadoop environment. The person who worked on this research before me configured 9 datanodes with a namenode and a standby node. Our network traffic data is stored in Hive, and I am developing Hive queries to identify network attacks. That person has since left for another job and is busy with it, so I have a couple of questions:
1) How can I understand the HDFS architecture of my environment, i.e. how the machines are connected to build it? Also, which services are installed on which machines?
2) We currently have 9 datanodes in the environment, and my professor wants to reduce that. Her goal is to do the research with 2-3 (minimal) machines in this environment.
3) What are good, easy resources for getting an understanding of Cloudera and Hadoop? Also, which commands can be used to explicitly start and stop a service?
4) Right now in Cloudera Manager I am not able to start the namenode server, the secondary namenode, and one more service. I stopped all the services in order from Cloudera Manager and am now starting them in order; the HDFS service comes first, and while starting it I get failure messages for the namenode, a datanode, and datanode8.
I tried several ways but had no luck. Please suggest some ways I can solve these issues and a good resource (for a beginner) I can refer to in order to dig into this more.
Thanks.
There are several resources to start with. For everything Cloudera/CDH, the place to go is the Cloudera Documentation. For Hadoop, the place to go is the Hadoop Documentation. Now, I reckon, this is a rather big bite to chew. If you're new to Hadoop, it's better to start with a book or some other introduction (I can't recommend one since I haven't read any).
For your specific problem, it seems that some services don't start. You need to look at the services' logs on the respective nodes. I can't tell you exactly where those logs are, because it depends on your distribution version and on how it was configured. I suspect one vital service does not start (probably HDFS; it looks like the namenode is down) and this causes every other service to fail. The Hadoop Wiki has a troubleshooting guide; try to follow it and see if it helps you.
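As a starting point, something like the following on the namenode host will usually surface the real error; the log path is typical for a Cloudera-managed install but may differ on yours:

# check whether a namenode JVM is running at all
ps aux | grep -i [n]amenode

# look at the end of the most recent namenode log for the actual failure
sudo tail -n 200 /var/log/hadoop-hdfs/*NAMENODE*.log.out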
As for the question about how to adjust the cluster size: first get it up and running, and then consider changing it. Refer to Decommissioning and Recommissioning Hosts.

How does Apache Spark handle system failure when deployed in YARN?

Preconditions
Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is running. How does Spark handle the situations listed below?
Cases & Questions
One node of the Hadoop cluster fails due to a disk error, but replication is high enough and no data was lost.
What will happen to tasks that were running on that node?
One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost; Spark simply can't find a file anymore that was pre-configured as a resource for the workflow.
How will it handle this situation?
During execution, the primary namenode fails over.
Will Spark automatically use the failover namenode?
What happens when the secondary namenode fails as well?
For some reason, during a workflow the cluster is totally shut down.
Will Spark restart with the cluster automatically?
Will it resume from the last "save" point of the workflow?
I know some of these questions might sound odd. Anyway, I hope you can answer some or all of them.
Thanks in advance. :)
Here are the answers given on the mailing list to these questions (the answers were provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restart is part of administration and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."
