What if I run a Mesos cluster with both development and, say, mission-critical applications? Is it possible to have "privileged" tasks executed in the cluster for such cases, and even have nodes shut down less privileged services to make sure the privileged service gets processing power?
Currently, there is no notion of privileged tasks in Mesos (0.24.1 at the time of writing). Preemption is likely an upcoming feature, to be introduced in support of other features such as Quota and Optimistic Offers. However, there are reserved resources on which critical tasks can run.
Resources can be reserved for a role, and frameworks are registered under a certain role. For example, if a framework F registers under role R, F receives resources with role * (i.e. unreserved) as well as resources with role R (i.e. reserved for R).
The privileged tasks would then be launched on these reserved resources. Since reserved resources are only offered to frameworks in the role, the resources will still be available for a relaunch even if the critical task crashes.
NOTE: Since multiple frameworks can register under R, you can assign R uniquely to F to grant it sole ownership of the reserved resources (refer to register_frameworks under Authorization).
Refer to the Reservation documentation for further information. A minimal sketch of a static reservation follows.
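For illustration, assuming an agent started with role-specific resources (the role name "critical", the master address, and the amounts here are made-up placeholders; the flag syntax is from the Mesos reservation docs):
# Start an agent (mesos-slave in 0.24.x) with half of its resources
# statically reserved for role "critical"; unannotated resources
# remain unreserved (role *).
mesos-slave --master=master.example.com:5050 \
  --resources="cpus(critical):4;mem(critical):8192;cpus:4;mem:8192"
A framework that then registers with role "critical" is offered both the reserved resources and the unreserved (*) ones, as described above.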
I have a cluster that does not allow direct ssh access, but does permit me to submit commands through a proprietary interface.
Is there any way to either manually launch OpenMPI jobs, or documentation on how to write a custom launcher?
I don't think you can do it without breaking some kind of agreement.
I assume you have some kind of web-based interface that lets you fill in certain fields and perhaps upload data, or something similar. What this interface probably does is generate a request/file for a scheduler, most likely SGE or PBS. Direct access to the cluster is limited in order to:
organize task priorities and order
prevent users from hogging the machines
make it easier to launch complicated tasks requiring complicated machine(s) configuration
So, effectively, you want to go around the scheduler. I don't think you can, or that you should.
However, clusters usually have so-called head nodes that do allow SSH access. These nodes serve as a place to submit scheduler requests and perhaps do light compilation or result processing (with very limited resources). Such a configuration eliminates the web interface but still keeps the scheduler, which is very important for a cluster used by many people concurrently. A typical submission from a head node is sketched below.
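For example, a minimal PBS job script, submitted with qsub (the job name, node counts, walltime, and program name are made-up placeholders):
#!/bin/bash
#PBS -N my_mpi_job
#PBS -l nodes=4:ppn=8
#PBS -l walltime=01:00:00
# Run from the directory the job was submitted from.
cd $PBS_O_WORKDIR
# On an Open MPI build integrated with the scheduler, mpirun picks up
# the host list from the PBS environment automatically.
mpirun -np 32 ./my_mpi_program
This way the scheduler, not you, picks the machines the job runs on.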
I have been using the /reserve Mesos HTTP endpoint to reserve resources for specific roles. However, this only allows me to reserve unused resources. What I would actually like to do is kill some of the tasks on the Mesos agent to make room. Is there a way to tell Mesos to kill these tasks to free up the resources?
This is a bit of a chicken-and-egg problem: if you kill before you reserve, the freed resources may be allocated to someone else before you reserve them, while if you reserve before you kill, there may not be enough resources.
I would suggest you look at Mesos quotas. They work slightly differently from reservations: resources are reserved in the cluster as a whole rather than on specific agents, and the operation does not fail if there are currently insufficient resources. Once you set a quota for a role, all free resources up to the quota will be reserved for that role. If there are currently not enough resources, Mesos will not kill tasks, but as tasks eventually terminate, the freed resources will be put toward your role.
In the future, we plan to implement revocation, as well as let operators hint to Mesos which tasks should be terminated first. A rough sketch of setting a quota follows.
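Setting a quota goes through the master's /quota endpoint (the request shape is from the Mesos quota docs; the master address, role name, and amounts here are made up):
# Guarantee 12 CPUs and 6144 MB of memory for role "critical".
cat > quota.json <<'EOF'
{
  "role": "critical",
  "guarantee": [
    {"name": "cpus", "type": "SCALAR", "scalar": {"value": 12}},
    {"name": "mem", "type": "SCALAR", "scalar": {"value": 6144}}
  ]
}
EOF
curl -X POST -d @quota.json http://master.example.com:5050/quota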
I just noticed that many Pig jobs on Hadoop are killed for the following reason: Container preempted by scheduler
Could someone explain to me what causes this, and whether I should (and am able to) do something about it?
Thanks!
If you have the Fair Scheduler and a number of different queues enabled, then higher-priority applications can terminate your jobs (in a preemptive fashion).
Hortonworks has a pretty good explanation with more details:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/preemption.html
Should you do anything about it? That depends on whether your application is within its SLAs and performing within expectations. General good practice is to review your job's priority and the queue it is assigned to. For reference, the Fair Scheduler's preemption switch is sketched below.
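As a hedged sketch, Fair Scheduler preemption is controlled by a single yarn-site.xml property (the property name is from the Hadoop docs; the snippet belongs inside the existing <configuration> element, so the sketch below just prints it):
# Property that enables preemption for the Fair Scheduler;
# add it inside <configuration> in yarn-site.xml.
cat <<'EOF'
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>
EOF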
If your Hadoop cluster is used by many business units, then the admins decide the queue for each of them, and every queue has its priorities (also decided by the admins). If preemption is enabled at the scheduler level, higher-priority applications do not have to wait just because lower-priority applications have taken up the available capacity. In that case, lower-priority tasks must release resources, if none are available in the cluster, to let higher-priority applications run.
I have a 12-node cluster running a YARN architecture. It seems that my nodes are busy most of the time, and jobs often fail. How can I check the usage of the resources at any point in time?
Also, is there any method to set a resource limit for a user? For example, if a user submits a job, they should be given only 25 GB of memory and 12 cores.
There are multiple ways to monitor the cluster.
If you are using the Cloudera distribution, you can go to Cloudera Manager to monitor and manage the resources
If you are using the Hortonworks distribution, you can go to the Ambari web interface to monitor and manage the resources
If you are not using any distribution, clusters are typically managed using the Ganglia or Nagios web interface
Even if you do not have any of these, you can go to the ResourceManager web interface, which typically runs on http://<resourcemanager-host>:8088. 8088 is the default port number; it can be customized, and you can get that information from yarn-site.xml
If your organization does not provide access to the web interfaces, you can use commands such as yarn application -list and mapred job -list to see what is going on in the cluster
It is a little tedious to monitor actual usage; you need to know Linux commands and be able to develop shell scripts for monitoring. A few starting points are sketched below.
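For instance (standard YARN/MapReduce CLI subcommands; exact output formats vary by Hadoop version):
# List all NodeManagers with their used and total resources.
yarn node -list -all
# List currently running YARN applications.
yarn application -list -appStates RUNNING
# List MapReduce jobs.
mapred job -list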
Also, is there any method to set a resource limit for a user? For example, if a user submits a job, they should be given only 25 GB of memory and 12 cores.
Yes, you need to use the queue (or pool) concept of the schedulers embedded in YARN. There are three types of scheduler: FIFO, Capacity, and Fair. FIFO should not be used in production clusters; it is mainly for development. You need to understand the Capacity and Fair Schedulers and set the limits, for example as sketched below.
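As a rough sketch with the Fair Scheduler, a cap close to what you describe can be placed on a queue (the queue name and file path are assumptions; the maxResources element is from the Fair Scheduler docs, and note it caps the queue as a whole, so a true per-user cap needs one queue per user):
# Minimal fair-scheduler.xml capping a queue at ~25 GB / 12 vcores.
cat > /etc/hadoop/conf/fair-scheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <queue name="limited_users">
    <maxResources>25600 mb,12 vcores</maxResources>
  </queue>
</allocations>
EOF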
It seems that my nodes are busy most of the time, and jobs often fail
You can implement some generic performance-tuning guidelines to improve the throughput. Have a look at this post: Tips to improve MapReduce Job performance in Hadoop, the Cloudera article, and the MapReduce performance article
Also, is there any method to set a resource limit for a user? For example, if a user submits a job, they should be given only 25 GB of memory and 12 cores.
Adding to Durga's answer,
Fair scheduling is a method of assigning resources to applications such that all apps get, on average, an equal share of resources over time. Hadoop NextGen is capable of scheduling multiple resource types.
By default, the Fair Scheduler bases scheduling fairness decisions only on memory. It can be configured to schedule with both memory and CPU, using the notion of Dominant Resource Fairness developed by Ghodsi et al.
The scheduler organizes apps further into “queues”, and shares resources fairly between these queues. By default, all users share a single queue, named “default”. If an app specifically lists a queue in a container resource request, the request is submitted to that queue. It is also possible to assign queues based on the user name included with the request through configuration.
e.g.
<allocations>
  <user name="sample_user">
    <maxRunningApps>30</maxRunningApps>
  </user>
  <!-- default limit for users without an explicit entry -->
  <userMaxAppsDefault>5</userMaxAppsDefault>
</allocations>
The CapacityScheduler is designed to run Hadoop applications as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster.
Traditionally each organization has its own private set of compute resources that have sufficient capacity to meet the organization's SLA under peak or near-peak conditions. This generally leads to poor average utilization and the overhead of managing multiple independent clusters, one per organization. Queue mappings let one shared cluster route users and groups to the right queues instead, for example:
<property>
  <name>yarn.scheduler.capacity.queue-mappings</name>
  <value>u:user1:queue1,g:group1:queue2,u:%user:%user,u:user2:%primary_group</value>
  <description>
    Here, user1 is mapped to queue1 and group1 is mapped to queue2;
    u:%user:%user maps each user to a queue with the same name as the
    user, and user2 is mapped to a queue named after user2's primary
    group. The mappings are evaluated from left to right, and the first
    valid mapping is used.
  </description>
</property>
Have a look at the Fair Scheduler and Capacity Scheduler documentation.
According to my reading, the JBoss documentation says:
We define high availability as the ability for the system to continue
functioning after failure of one or more of the servers. A part of
high availability is failover which we define as the ability for
client connections to migrate from one server to another in event of
server failure so client applications can continue to operate.
Is failover part of high availability? How can we differentiate failover vs high availability?
Failover is a means of achieving high availability (HA). Think of HA as a feature and failover as one possible implementation of that feature. Failover is not the only consideration when achieving HA.
For example, Cassandra achieves HA through replication, but the degree of availability is determined by data-consistency settings. In essence, these settings dictate how many nodes need to respond for an action (a read or a write) to succeed. Requiring more nodes to respond means less availability, and requiring fewer nodes means more availability. Concretely, with a replication factor of 3, requiring a QUORUM (2 of 3 replicas) to respond tolerates the loss of one node, while requiring ALL tolerates none. That's an example of HA that has nothing to do with failover, strictly speaking.
High Availability
Refers to the fact that the server system is in some way tolerant to failure.
Most of the time this is done with hardware redundancy. Assume a machine has redundant power supplies; if one fails, the machine keeps running.
Failover
Then you have application redundancy (failover), which usually refers to the ability of an application running on multiple hardware installations to respond to clients in a consistent manner from any of those installations. That way, if the hardware totally fails or the OS dies on a particular machine, another machine can carry on.
SQL Server deals with application redundancy in four ways:
Clustering
Mirroring
Replication
Log Shipping
High availability (HA for short) is a broad term, so when I think about it I tend to think of HA clusters.
From the Wikipedia article on High-availability clusters:
High-availability clusters are groups of computers that
support server applications that can be reliably utilized with a
minimum amount of down-time. They operate by using high availability
software to harness redundant computers in groups or clusters that
provide continued service when system components fail. Without
clustering, if a server running a particular application crashes, the
application will be unavailable until the crashed server is fixed.
So the takeaway from the description above is that HA clusters will give you the minimum amount of down-time during a failover. Let me explain the two types of failover that HA clusters can provide:
Hot-Hot / Active-Active: The redundant computers are truly operating in parallel, producing the exact same state and the exact same output. They are all active nodes, operating as perfect mirrors of each other. In this scenario, failover down-time is zero: you can simply pull the power plug on any machine in the cluster without any disruption to your service.
Hot-Warm / Active-Passive: Only one primary computer is active, while the other computers in the cluster passively rebuild the same state as the primary. When the primary computer fails, it has to be disabled or killed (automatically or by an operator), and then a passive computer from the cluster needs to be made active (automatically or by an operator).
So what is the catch? The catch is that applications that can operate in a HA cluster are not trivial to design as they need to be true deterministic finite-state machines. A classic problem is when your application needs to use the clock to build state based on time, as clocks are very non-deterministic by nature.
Disclaimer: I am one of the developers of CoralSequencer.