Mesos: what are the OS level techniques for resources allocation? - mesos

I understand Mesos architecture at a high level, but I'm not clear about the OS level techniques used to implement resources allocation. For example, Mesos offers a framework 1 CPU and 400MB memory, and another framework 2 CPUs and 1GB memory, how is this actually implemented at OS level?

tl;dr: Mesos itself doesn't "allocate" any resources at the OS-level. The resources are still allocated by the OS, although Mesos can use OS-level primitives like cgroups to ensure that a task doesn't use more resources than it should.
The Mesos agent at the node advertises that some resources are available at the host (e.g., 4 CPUs and 16GB of RAM) -- either by auto-detecting what is available at the host or because the available resources have been explicitly configured (recommended for production).
The master then offers those resources to a framework.
The framework can then launch a task, using some or all of the resources available at the agent: e.g., the framework might launch a task with 2 CPUs and 8GB of RAM.
The agent then launches an executor to run the task.
How strictly the "2 CPUs and 8GB of RAM" resource limit is enforced depends on how Mesos is configured. For example, if the agent host supports cgroups and the agent is started with --isolation='cgroups/cpu,cgroups/mem', cgroups will be used to throttle the CPU appropriately, and to kill the task if it tries to exceed its memory allocation.

Related

Does Mesos really treat all your resources as a single pool?

Mesos is advertised as a system that lets you program against your datacenter like it's a single pool of resources (See the Mesos Website). But is this really true that you don't need to consider the configuration of the individual machines? Using Mesos, can you request more resources for a task than are available on a single machine?
For example, if you have 10 machines each with 2 cores and 2g of RAM and 20g HD, can you really request 10 cores, 15g of RAM and 100g of disk space for a single task?
If so, how does this work? Is Mesos able to address memory across machines for you, and use other CPUs as local threads and create a single filesystem from a number of distributed nodes?
How does it accomplish this without suffering from the Fallacies of distributed computing, especially those related to network latency and transport cost?
According to this Mesos architecture you can't aggregate resources from different slaves (agents / machines) to use them for one task.
As you can see there is strict "taks per agent" situation
Also their example says pretty much same
Let’s walk through the events in the figure.
Agent 1 reports to the master that it has 4 CPUs and 4 GB of memory
free. The master then invokes the allocation policy module, which
tells it that framework 1 should be offered all available resources.
The master sends a resource offer describing what is available on
agent 1 to framework 1. The framework’s scheduler replies to the
master with information about two tasks to run on the agent, using <2
CPUs, 1 GB RAM> for the first task, and <1 CPUs, 2 GB RAM> for the
second task. Finally, the master sends the tasks to the agent, which
allocates appropriate resources to the framework’s executor, which in
turn launches the two tasks (depicted with dotted-line borders in the
figure). Because 1 CPU and 1 GB of RAM are still unallocated, the
allocation module may now offer them to framework 2.

Could a Resource Manager and Node Manager in the same node? [MapR]

We have a node where Resource Manager and Node Manager are both running. Is that recommmended?
The Resource Manager and the Node Manager can run on the same node provided it has enough memory. If the memory you've allocated to both services plus the memory you've allocated to the rest of the services and daemons running on that node exceed the host's memory you could trigger OOMEs and have applications killed by the Linux OOM killer.
For most distributions it's not a recommended practice. Running the RM and NM on separate nodes provide better memory and CPU isolation. The NM in particular allocates work to containers and these require memory so if you've oversubscribed your nodes by using up all the memory to run services, you won't have any memory left to run your containers.
http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html

How do I find a marathon runaway process

I have a mesos / marathon system, and it is working well for the most part. There are upwards of 20 processes running, most of them using only part of a CPU. However, sometimes (especially during development), a process will spin up and start using as much CPU as is available. I can see on my system monitor that there is a pegged CPU, but I can't tell what marathon process is causing it.
Is there a monitor app showing CPU usage for marathon jobs? Something that shows it over time. This would also help with understanding scaling and CPU requirements. Tracking memory usage would be good, but secondary to CPU.
It seems that you haven't configured any isolation mechanism on your agent (slave) nodes. mesos-slave comes with an --isolation flag that defaults to posix/cpu,posix/mem. Which means isolation at process level (pretty much no isolation at all). Using cgroups/cpu,cgroups/mem isolation will ensure that given task will be killed by kernel if exceeds given memory limit. Memory is a hard constraint that can be easily enforced.
Restricting CPU is more complicated. If you have machine that offers 8 CPU cores to Mesos and each of your tasks is set to require cpu=2.0, you'll be able run there at most 4 tasks. That's easy, but at given moment any of your 4 tasks might be able to utilize all idle cores. In case some of your jobs is misbehaving, it might affect other jobs running on the same machine. For restricting CPU utilization see Completely Fair Scheduler (or related question How to understand CPU allocation in Mesos? for more details).
Regarding monitoring there are many possibilities available, choose an option that suits your requirements. You can combine many of the solutions, some are open-source other enterprise level solutions (in random order):
collectd for gathering stats, Graphite for storing, Grafana for visualization
Telegraf for gathering stats, InfluxDB for storing, Grafana for visualization
Prometheus for storing and gathering data, Grafana for visualization
Datadog for a cloud based monitoring solution
Sysdig platform for monitoring and deep insights

Mesos Cgroup resource usage

I am building an autoscaling system using mesos and marathon. The scenario is, I want to autoscale a task, if the amount of memory allocated to it is consumed more than 80%. How do I find out the amount of memory used at a cgroup level?
Is this the right approach?
You can get the stastics for each task by hitting http://host:5051/monitor/statistics.json where host is mesos slave.
This repo will give you an idea how to autoscale marathon applications.

CPU affinity in virtualised environments

Is taskset for CPU affinity applicable when trying to use L2 cache efficiently on a multi core processor in a virtualised environment like Amazon EC2?
No, especially towards the smaller instances, there's heavy CPU sharing, you're dependent on what the other instances are doing with the CPU.

Resources