I have been using the /reserve Mesos HTTP endpoint to reserve resources for specific roles. However, this only allows me to reserve unused resources. What I would actually like to do is kill some of the tasks on the Mesos agent to make room. Is there a way to tell Mesos to kill these tasks to free up the resources?
This is a bit of a chicken-and-egg problem: if you kill before you reserve, the freed resources may be allocated before you reserve them, while if you reserve before you kill, there may not be enough resources.
I would suggest you look at Mesos quotas. They work slightly differently from reservations: resources are reserved in the cluster as a whole and not on specific agents, and the operation does not fail if there are currently insufficient resources. Once you set a quota for a role, all free resources up to the quota will be reserved for your role. If there are currently not enough resources, Mesos will not kill tasks, but as tasks eventually terminate, the freed resources will be given to your role.
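As a minimal sketch of what setting a quota looks like, assuming a Mesos version that ships the /quota master endpoint (0.27+); the master address, role name, and guarantee values below are placeholders:

```python
import requests

# Placeholder master address; replace with your own.
MASTER = "http://mesos-master.example.com:5050"

# Ask Mesos to guarantee 4 CPUs and 2 GB of memory to the role "critical".
quota_request = {
    "role": "critical",
    "guarantee": [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 4}},
        {"name": "mem",  "type": "SCALAR", "scalar": {"value": 2048}},
    ],
}

resp = requests.post(MASTER + "/quota", json=quota_request)
resp.raise_for_status()  # 200 OK means the quota request was accepted
```

Unlike /reserve, this succeeds even when the resources are not currently free; Mesos lays resources aside for the role as they become available.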
In the future, we plan to implement revocation, as well as let operators give Mesos hints about which tasks should be terminated first.
I have a Mesos/Marathon system, and it is working well for the most part. There are upwards of 20 processes running, most of them using only part of a CPU. However, sometimes (especially during development), a process will spin up and start using as much CPU as is available. I can see on my system monitor that there is a pegged CPU, but I can't tell which Marathon process is causing it.
Is there a monitor app showing CPU usage for marathon jobs? Something that shows it over time. This would also help with understanding scaling and CPU requirements. Tracking memory usage would be good, but secondary to CPU.
It seems that you haven't configured any isolation mechanism on your agent (slave) nodes. mesos-slave comes with an --isolation flag that defaults to posix/cpu,posix/mem, which means isolation at the process level (pretty much no isolation at all). Using cgroups/cpu,cgroups/mem isolation ensures that a task will be killed by the kernel if it exceeds its memory limit. Memory is a hard constraint that can be easily enforced.
Restricting CPU is more complicated. If you have a machine that offers 8 CPU cores to Mesos and each of your tasks is set to require cpus=2.0, you'll be able to run at most 4 tasks there. That's easy, but at any given moment any of those 4 tasks might utilize all idle cores. If one of your jobs misbehaves, it might affect the other jobs running on the same machine. For restricting CPU utilization, see the Completely Fair Scheduler (or the related question How to understand CPU allocation in Mesos? for more details).
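As a hedged sketch of what launching an agent with cgroups isolation might look like (wrapped in Python only for consistency with the other examples here; the ZooKeeper address is a placeholder, and --cgroups_enable_cfs turns on hard CPU caps via the CFS bandwidth limiter):

```python
import subprocess

# Placeholder master/ZooKeeper address; replace with your own cluster's.
cmd = [
    "mesos-slave",
    "--master=zk://zk.example.com:2181/mesos",
    "--isolation=cgroups/cpu,cgroups/mem",  # enforce task limits via cgroups
    "--cgroups_enable_cfs=true",            # hard-cap CPU instead of share-based throttling
]
subprocess.run(cmd, check=True)
```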
Regarding monitoring, there are many possibilities available; choose an option that suits your requirements. You can combine several of them, and some are open-source while others are enterprise-level solutions (in random order; a do-it-yourself check against the agent's statistics endpoint is also sketched after the list):
collectd for gathering stats, Graphite for storing, Grafana for visualization
Telegraf for gathering stats, InfluxDB for storing, Grafana for visualization
Prometheus for storing and gathering data, Grafana for visualization
Datadog for a cloud based monitoring solution
Sysdig platform for monitoring and deep insights
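For the do-it-yourself route, each agent exposes per-executor counters at /monitor/statistics.json, so you can compute per-task CPU usage from two snapshots taken a few seconds apart. A minimal sketch, assuming the default agent port 5051 (the host name is a placeholder):

```python
import time
import requests

AGENT = "http://mesos-agent.example.com:5051"  # placeholder agent address

def snapshot():
    """Map executor_id -> (cumulative CPU seconds, timestamp) from the agent."""
    out = {}
    for entry in requests.get(AGENT + "/monitor/statistics.json").json():
        s = entry["statistics"]
        cpu = s["cpus_user_time_secs"] + s["cpus_system_time_secs"]
        out[entry["executor_id"]] = (cpu, s["timestamp"])
    return out

before = snapshot()
time.sleep(10)
after = snapshot()

# CPU usage in cores = delta CPU-seconds / delta wall-clock seconds.
for executor_id, (cpu1, t1) in after.items():
    if executor_id in before:
        cpu0, t0 = before[executor_id]
        print(f"{executor_id}: {(cpu1 - cpu0) / (t1 - t0):.2f} cores")
```

The task pegging a core will stand out immediately; the monitoring stacks above essentially automate this collection and graph it over time.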
I just noticed that many Pig jobs on Hadoop are killed due to the following reason: Container preempted by scheduler
Could someone explain what causes this, and whether I should (and am able to) do something about it?
Thanks!
If you have the fair scheduler and a number of different queues enabled, then higher-priority applications can terminate your jobs (in a preemptive fashion).
Hortonworks has a pretty good explanation with more details:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/preemption.html
Should you do anything about it? That depends on whether your application is within its SLAs and performing within expectations. General good practice would be to review your job's priority and the queue it's assigned to.
If your Hadoop cluster is used by many business units, then the admins decide on a queue for each of them, and every queue has its own priority (that too is decided by the admins). If preemption is enabled at the scheduler level, higher-priority applications do not have to wait just because lower-priority applications have taken up the available capacity. So in this case a lower-priority task must release its resources, if none are available in the cluster, to let higher-priority applications run.
I am building an autoscaling system using Mesos and Marathon. The scenario is: I want to autoscale a task if more than 80% of the memory allocated to it is consumed. How do I find out the amount of memory used at the cgroup level?
Is this the right approach?
You can get the statistics for each task by hitting http://host:5051/monitor/statistics.json, where host is the Mesos slave.
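For example, a minimal sketch of the 80% check from the question, using the per-executor fields that endpoint returns (mem_rss_bytes and mem_limit_bytes; the slave host is a placeholder):

```python
import requests

AGENT = "http://mesos-slave.example.com:5051"  # placeholder slave address

for entry in requests.get(AGENT + "/monitor/statistics.json").json():
    s = entry["statistics"]
    used = s["mem_rss_bytes"] / s["mem_limit_bytes"]
    if used > 0.8:  # the threshold from the question
        print(f"{entry['executor_id']} is at {used:.0%} of its memory limit")
```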
This repo will give you an idea of how to autoscale Marathon applications.
What if I run a Mesos cluster with both development and, say, mission-critical applications? Is it possible to have a "privileged" task executed in the cluster for these kinds of cases, and even have nodes shut down less-privileged services to make sure the privileged service gets processing power?
Currently, there is no notion of privileged tasks in Mesos (0.24.1 at the time of writing). Preemption is likely an upcoming feature, to be introduced in support of other features such as Quota and Optimistic Offers. However, there are reserved resources on which critical tasks can run.
Resources can be reserved for a role, and frameworks are registered under a certain role. For example, if a framework F registers under role R, F receives resources with role * (i.e. unreserved) as well as resources with role R (i.e. reserved for R).
The privileged tasks would then be launched on these reserved resources. Since reserved resources are only offered to the frameworks in the role, the resources will be available for a relaunch of the critical task even if it were to crash.
NOTE: Since many frameworks can register under R, you can assign R uniquely to F to grant it sole ownership of the resources (refer to register_frameworks under Authorization).
Refer to the Reservation documentation for further information.
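For illustration, dynamically reserving resources for a role looks roughly like the following, assuming Mesos 0.25+ where the /reserve operator endpoint exists (the master address, agent ID, principal, and credentials are all placeholders):

```python
import json
import requests

MASTER = "http://mesos-master.example.com:5050"  # placeholder master address

# Reserve 2 CPUs and 1 GB on a specific agent for role "R".
resources = [
    {"name": "cpus", "type": "SCALAR", "scalar": {"value": 2},
     "role": "R", "reservation": {"principal": "operator"}},
    {"name": "mem", "type": "SCALAR", "scalar": {"value": 1024},
     "role": "R", "reservation": {"principal": "operator"}},
]

resp = requests.post(
    MASTER + "/reserve",
    data={"slaveId": "<agent-id>",  # placeholder agent ID
          "resources": json.dumps(resources)},
    auth=("operator", "secret"),    # only needed if HTTP authentication is enabled
)
resp.raise_for_status()
```

A framework registered under role R would then see these resources in its offers and could run the critical task on them.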
I've been playing with a Mesos cluster for a little bit, and I'm thinking of utilizing it in our production environment. One problem I can't seem to find an answer to: how do you properly schedule long-running apps that will have varying load?
Marathon has a "CPUs" property, where you can set a weight for CPU allocation to a particular app. (I'm planning on running Docker containers.) But from what I've read, it is only a weight, not a reservation, allocation, or limitation that I am setting for the app. It can still use 100% of the CPU on the server if it's the only thing that's running. The problem is that for long-running apps, resource demands change over time. A web server's load, for example, is directly proportional to the traffic. Coupled with Mesos treating this setting as a "reservation," I am choosing between two evils: set it too low, and Mesos may start too many processes on the same host and all of them will suffer, with the host CPU going past 100%. Set it too high, and CPU will go idle, as the reservation is made (or so Mesos thinks), but nothing is using those resources.
How do you approach this problem? Am I missing something in how Mesos and Marathon handle resources?
I was thinking of an ideal way of doing this:
Specify a weight for CPU for different apps (on the order of, say, 0.1 through 1), so that when the going gets tough, higher priority gets more (as is the case right now)
Have the Mesos slave report "Available LA" (load average) with its status (e.g. if the 10-minute LA is 2, with 8 CPUs available, report 6 "Available LA")
Configure Marathon to require the "Available LA" resource on the slave in order to schedule a task (e.g. don't start on a particular host if its Available LA is < 2)
When the available LA goes to 0 (due to an influx of traffic at the same time as some job was started on the same server before the influx), have Marathon move jobs to another slave, one that has more "Available LA"
Is there a way to achieve any of this?
So far, I gather that I could possibly write a custom isolator module that runs on the slaves and reports this custom metric to the master. Then I could use it in resource negotiation. Is this true?
I wasn't able to find anything on Marathon rescheduling tasks to different nodes if one becomes overloaded. Any suggestions?
As of Mesos 0.23.0, oversubscription is supported. Unfortunately, it is not yet implemented in Marathon: https://github.com/mesosphere/marathon/issues/2424
In order to do allocation dynamically, you can use the Mesos slave metrics along with the Marathon HTTP API to scale, for example, as I've done here, in a different context. My colleague Niklas did related work with nibbler, which might also be of help.
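As a minimal sketch of the scaling half, Marathon lets you change an app's instance count with a PUT to /v2/apps/{appId}; the Marathon address, app ID, and trigger logic below are placeholders you would wire up to whatever metric you collect:

```python
import requests

MARATHON = "http://marathon.example.com:8080"  # placeholder Marathon address
APP_ID = "/web-frontend"                       # hypothetical app

def scale(instances):
    """Set the instance count of a Marathon app via its REST API."""
    resp = requests.put(
        MARATHON + "/v2/apps" + APP_ID,
        json={"instances": instances},
    )
    resp.raise_for_status()
    return resp.json()  # includes a deploymentId you can poll for progress

# e.g. scale out when your collected metric (load average, CPU, ...) crosses a threshold
scale(6)
```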