This user's topologies are in danger of being unscheduled due to the owner's over-use of cluster resources - apache-storm

We are in the process of upgrading Storm version from 1.x.x to 2.1.0, and were able to bring up the Storm cluster, and it's running fine so far, with just one topology.
We see a new owner section in the Storm UI, with the following message in red background:
This user's topologies are in danger of being unscheduled due to the owner's over-use of cluster resources.
Could someone please let us know why this happens and a way to resolve it?

Related

Does storm support ha host, e.g. for the nimbus host?

If so, from which version, does storm support this? I want to know this because I want to upgrade my storm version(now my storm version is 0.10.0).
High availability for Nimbus was introduced in Storm 1.0.0 (see https://storm.apache.org/2016/04/12/storm100-released.html)
However, even for prior Storm versions, missing HA for Nimbus was not a critical issue, because a failing Nimbus does not affect running topologies. The only problem if Nimbus is down is, that no interaction with the cluster is possible from outside (eg, submitting new topologies etc.).
Workers are HA too, ie, supervisors can restart failing workers. Supervisors are not HA -- however, the task they host will be redistributed automatically to other supervisors if one supervisor fails.

Hadoop environment is Down

I am student and doing computer science. As a part of my research i am working on the hadoop environment. The person who was working on this research before me has configured 9 Datanode with a namenode and a stand by node. we have our network traffic data stored in the hive and i am developing hive queries to identify network attack. The person who was working on this already left from our place and working somewhere else and busy with job. so i have couple of questions :
1) how can I understand the architecture on HDFS of my environment i.e how the machines are connected to build this environment. Also what services for this environment installed on which machines?
2) Now we have 9 datanodes in the environement and my professor wants to reduce the datanodes. her goal is to do the research with 2-3 (minimal) machine in this environment.
3) What are the good and easy source to get understanding about the cloudera and hadoop ? Also the commands which can be used to explicitly start and stop a service.
4) Right now in cloudera manager I am not able to start the Namenode server, Secondary datanode and one more. I stop all the services in order from cloudera and now starting in order and in that order the HDFS service comes first so while starting it, it gives the failure message for namenode datanode and datanode8.
I tried several ways but no luck. Please suggest me some ways I can solve issues and good resource(for beginner), I can refer to dig into this more.
Thanks.
There are several resources to start. For everything Cloudera/CDH, the place to go is Cloudera Documentation. For Hadoop, the place to go is Hadoop Documentation. Now, I reckon, this is a rather big bite to chew. If you're new to Hadoop, better start with a book, some introduction (I can't recommend one since I haven't read any).
For your specific problem, it seems that the some services don't start. You need to look at the services' logs, on the respective nodes. I can't tell you where those logs are, because it depends on the your distribution version and on how it was configured. I suspect one vital service does not start (probably HDFS, looks like namenode is down) and this causes every other service to fail. Hadoop Wiki has a troubsleshooting guide, try to follow that and see if it helps you.
As for the question on how to adjust the cluster size, first get it up and running and then consider changing it. Refer to Decommissioning and Recommissioning Hosts.

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my spark cluster as standalone mode. I am reading data from flat files or Cassandra(depending upon the job) and writing back the processed data to the Cassandra itself.
I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?
Currently sometime when I am processing huge chunk of data during shuffling with a possibility of stage failure. If I migrate to a YARN, can Resource manager address this issue?
Spark standalone cluster manager can also give you cluster mode capabilities.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode all you job related files would be copied on to one of the machines on the cluster which would then submit the job on your behalf, if you submit the application in client mode the machine from which the job is being submitted would be taking care of driver related activities. This means that the machine from which the job has been submitted cannot go offline, whereas in cluster mode the machine from which the job has been submitted can go offline.
Having a Cassandra cluster would also not change any of these behaviors except it can save you network traffic if you can get the nearest contact point for the spark executor(Just like Data locality).
The failed stages gets rescheduled if you use either of the cluster managers.
I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?
In Standalone cluster model, each application uses all the available nodes in the cluster.
From spark-standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
In other cases (when you are running multiple applications in the cluster) , you can prefer YARN.
Currently sometime when I am processing huge chunk of data during shuffling with a possibility of stage failure. If I migrate to a YARN, can Resource manager address this issue?
Not sure since your application logic is not known. But you can give a try with YARN.
Have a look at related SE question for benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?

How does Apache Spark handles system failure when deployed in YARN?

Preconditions
Let's assume Apache Spark is deployed on a hadoop cluster using YARN. Furthermore a spark execution is running. How does spark handle the situations listed below?
Cases & Questions
One node of the hadoop clusters fails due to a disc error. However replication is high enough and no data was lost.
What will happen to tasks that where running at that node?
One node of the hadoop clusters fails due to a disc error. Replication was not high enough and data was lost. Simply spark couldn't find a file anymore which was pre-configured as resource for the work flow.
How will it handle this situation?
During execution the primary namenode fails over.
Did spark automatically use the fail over namenode?
What happens when the secondary namenode fails as well?
For some reasons during a work flow the cluster is totally shut down.
Will spark restart with the cluster automatically?
Will it resume to the last "save" point during the work flow?
I know, some questions might sound odd. Anyway, I hope you can answer some or all.
Thanks in advance. :)
Here are the answers given by the mailing list to the questions (answers where provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restart is part of administration and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."

Need help regarding storm

1) What happens if Nimbus fails? Can we convert some other node into a Nimbus?
2) Where is the output of topology stored? When a bolt emits a tuple, where is it stored ?
3) What happens if zookeeper fails ?
Nimbus is itself a failure-tolerant process, which means it doesn't store its state in-memory but in an external database (Zookeeper). So if Nimbus crashes (an unlikely scenario), on the next start it will resume processing just where it stopped. Nimbus usually must be setup to be monitored by an external monitoring system, such as Monit, which will check the Nimbus process state periodically and restart it if any problem occurs. I suggest you read the Storm project's wiki for further information.
Nimbus is the master node of a Storm cluster and isn't possible to have multiple Nimbus nodes. (Update: the Storm community is now (as of 5/2014) actively working on making the Nimbus daemon fault tolerant in a failover manner, by having multiple Nimbuses heartbeating each other)
The tuple is "stored" in the tuple tree, and it is passed to the next bolt in the topology execution chain as topology execution progresses. As for physical storage, tuples are probably stored in an in-memory structure and seralized as necessary to be distributed among the cluster's nodes. The complete Storm cluster's state itself is stored in Zookeeper. Storm doesn't concern itself with persisent storage of a topology or a bolt's output -- it is your job to persist the results of the processing.
Same as for Nimbus, Zookeper in a real, production Storm cluster must be configured for reliability, and for Zookeeper that means having an odd number of Zookeeper nodes running on different servers. You can find more information on configuring a Zookeeper production cluster in the Zookeper Administrator's Guide. If Zookeeper would fail (altough a highly unlikely scenario in a properly configured Zookeeper cluster) the Storm cluster wouldn't be able to continue processing, since all cluster's state is stored in Zookeeper.
Regarding question 1), this bug report and subsequent comment from Storm author and maintainer Nathan Marz clarifies the issue:
Storm is not designed for having topologies partially running. When you bring down the master, it is unable to reassign failed workers. We are working on Nimbus failover. Nimbus is fault-tolerant to the process restarting, which has made it fault-tolerant enough for our and most people's use cases.

Resources