Same context path services with different ports on Marathon-lb DCOS - elasticsearch

I have deployed Elasticsearch and Kibana with the application definitions below.
elasticsearch.json
{
  "id": "elasticsearch",
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "docker.elastic.co/elasticsearch/elasticsearch:6.3.2",
      "network": "BRIDGE",
      "portMappings": [
        { "hostPort": 9200, "containerPort": 9200, "servicePort": 0 },
        { "hostPort": 9300, "containerPort": 9300, "servicePort": 0 }
      ],
      "forcePullImage": true
    }
  },
  "instances": 1,
  "cpus": 1,
  "mem": 3048,
  "labels": {
    "HAPROXY_GROUP": "external",
    "HAPROXY_0_VHOST": "publichost",
    "HAPROXY_0_MODE": "http",
    "DCOS_PACKAGE_NAME": "elasticsearch"
  },
  "env": {
    "ES_JAVA_OPTS": "-Xmx2048m -Xms2048m"
  }
}
This deploys Elasticsearch on the "/" context path.
kibana.json
{
  "id": "kibana",
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "docker.elastic.co/kibana/kibana:6.3.2",
      "network": "BRIDGE",
      "portMappings": [
        { "hostPort": 5601, "containerPort": 5601, "servicePort": 0 }
      ],
      "forcePullImage": true
    },
    "volumes": [
      {
        "containerPath": "/usr/share/kibana/config",
        "hostPath": "/home/azureuser/kibana/config",
        "mode": "RW"
      }
    ]
  },
  "instances": 1,
  "cpus": 0.5,
  "mem": 2000,
  "labels": {
    "HAPROXY_0_VHOST": "publichost",
    "HAPROXY_0_MODE": "http",
    "DCOS_SERVICE_NAME": "kibana",
    "DCOS_SERVICE_SCHEME": "http",
    "DCOS_SERVICE_PORT_INDEX": "0"
  }
}
This also deploys Kibana on the "/" context path.
So how do I access Kibana?
When I try to access http://publichost/app/kibana, it doesn't work, because Elasticsearch is on "/".

I solved it by removing "HAPROXY_GROUP":"external" from Elasticsearch. Now marathon-lb will not expose it, and hence it won't be accessible via the browser.
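If Elasticsearch should stay reachable through marathon-lb, an alternative (a sketch I have not tested here; the HAPROXY_0_PATH label is documented by marathon-lb for path-based routing) is to keep both services in the external group and distinguish them by path under the same vhost, e.g. on the Kibana app:

```json
"labels": {
  "HAPROXY_GROUP": "external",
  "HAPROXY_0_VHOST": "publichost",
  "HAPROXY_0_MODE": "http",
  "HAPROXY_0_PATH": "/app/kibana"
}
```

Elasticsearch would then keep "/" (or get its own path), and marathon-lb routes requests on vhost plus path.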


jq: obtain all values of a field only if the value V is found inside a string array

I need to get all distinct "Node" values from an array of data objects, but only if the object's "ServiceTags" array contains the string "orchestrator".
For example, for the json:
[
  {
    "Node": "abc",
    "ServiceTags": [],
    "Type": "",
    "Definition": {
      "Interval": "0s",
      "Timeout": "0s"
    },
    "CreateIndex": 11241543,
    "ModifyIndex": 11241543
  },
  {
    "Node": "xyz",
    "ServiceTags": [
      "rules",
      "rhdm_es",
      "rhdm",
      "orchestrator"
    ],
    "Type": "http",
    "Definition": {
      "Interval": "0s",
      "Timeout": "0s"
    },
    "CreateIndex": 12907642,
    "ModifyIndex": 12907659
  },
  {
    "Node": "teb",
    "ServiceTags": [
      "rules",
      "orchestrator"
    ],
    "Type": "http",
    "Definition": {
      "Interval": "0s",
      "Timeout": "0s"
    },
    "CreateIndex": 12907642,
    "ModifyIndex": 12907659
  },
  {
    "Node": "iry",
    "ServiceTags": [
      "rules"
    ],
    "Type": "http",
    "Definition": {
      "Interval": "0s",
      "Timeout": "0s"
    },
    "CreateIndex": 12907642,
    "ModifyIndex": 12907659
  }
]
the expected result would be an array containing the values "xyz" and "teb", because "orchestrator" is part of their "ServiceTags" property.
I would appreciate any help. I'm currently doing this in a basic shell script that only prints all the ServiceTags:
# Get consul data in json format
consulResult=$(cat -)

# Validate that the json is not null or empty
if [ -z "$consulResult" ]; then
    echo "NULL OR EMPTY"
else
    echo "Not NULL"
    # Quote the variable so the JSON reaches jq unmangled
    echo "$consulResult" | jq '.[].ServiceTags'
fi
map(select(.ServiceTags | index("orchestrator")).Node)
This filters on whether "orchestrator" exists in the ServiceTags array, then outputs .Node for each matching object.
Output:
[
"xyz",
"teb"
]
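A minimal end-to-end run of that filter (assuming jq is installed; the file path and the trimmed input are mine, and `unique` is appended only to get the distinct values the question asks for):

```shell
cat > /tmp/consul.json <<'EOF'
[
  { "Node": "abc", "ServiceTags": [] },
  { "Node": "xyz", "ServiceTags": ["rules", "orchestrator"] },
  { "Node": "teb", "ServiceTags": ["rules", "orchestrator"] },
  { "Node": "iry", "ServiceTags": ["rules"] }
]
EOF

# Keep objects whose ServiceTags contains "orchestrator", emit their Node values
jq -c 'map(select(.ServiceTags | index("orchestrator")).Node)' /tmp/consul.json
# → ["xyz","teb"]

# Distinct (sorted) values, in case duplicates are possible:
jq -c 'map(select(.ServiceTags | index("orchestrator")).Node) | unique' /tmp/consul.json
# → ["teb","xyz"]
```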

not able to parse the ansible output

I am not able to parse the following Ansible output; I am trying to get the IP address.
I am trying to get the IP address of an Azure VM. To get it, I am using the azure_rm_networkinterface_facts module, feeding it the NIC name, which is stored in a dict.
Here is the output I want to parse:
ok: [
localhost
]=>(item={
'value': [
u'datamover-nic10'
],
'key': u'data-mover'
})=>{
"ansible_facts": {
"azure_networkinterfaces": [
{
"etag": "W/\"08842209-be15-1144f26\"",
"id": "/subscriptions/1cf78a5c-5a30--c52c2d3d21b6/resourceGroups/DEVT/providers/Microsoft.Network/networkInterfaces/datamover-nic10",
"location": "westus",
"name": "datamover-nic10",
"properties": {
"dnsSettings": {
"appliedDnsServers": [
],
"dnsServers": [
],
"internalDomainNameSuffix": "3endvnfzb.dx.internal.cloudapp.net"
},
"enableAcceleratedNetworking": false,
"enableIPForwarding": false,
"ipConfigurations": [
{
"etag": "W/\"088421144f26\"",
"id": "/subscriptions/1cf78a52c2d3d21b6/resourceGroups/DEVT/providers/Microsoft.Network/networkInterfaces/datamover-nic10/ipConfigurations/ip1",
"name": "ip1",
"properties": {
"primary": true,
"privateIPAddress": "10.172.240.11",
"privateIPAddressVersion": "IPv4",
"privateIPAllocationMethod": "Static",
"provisioningState": "Succeeded",
"subnet": {
"id": "/subscriptions/1cf78a5c-5ac2d3d21b6/resourceGroups/DEVT/providers/Microsoft.Network/virtualNetworks/vNetOne/subnets/vmsubnet"
}
}
}
],
"macAddress": "00-0D-3A-36-B3-5C",
"networkSecurityGroup": {
"id": "/subscriptions/1cf78ad3d21b6/resourceGroups/DEVT/providers/Microsoft.Network/networkSecurityGroups/datamover-nsg"
},
"primary": true,
"provisioningState": "Succeeded",
"resourceGuid": "03114",
"virtualMachine": {
"id": "/subscriptions/1cf7d3d21b6/resourceGroups/DEVT/providers/Microsoft.Compute/virtualMachines/datamover"
}
},
"tags": {
"component": "datamover",
"provider": "B50E5F"
},
"type": "Microsoft.Network/networkInterfaces"
}
]
},
"changed": false,
"item": {
"key": "data-mover",
"value": [
"datamover-nic10"
]
}
}ok: [
localhost
]=>(item={
'value': [
u'database-nic00'
],
'key': u'database'
})=>{
"ansible_facts": {
"azure_networkinterfaces": [
{
"etag": "W/\"60bfd8c17323612\"",
"id": "/subscriptions/1cf72d3d21b6/resourceGroups/DEVT/providers/Microsoft.Network/networkInterfaces/database-nic00",
"location": "westus",
"name": "database-nic00",
"properties": {
"dnsSettings": {
"appliedDnsServers": [
],
"dnsServers": [
],
"internalDomainNameSuffix": "3wjfzb.dx.internal.cloudapp.net"
},
"enableAcceleratedNetworking": false,
"enableIPForwarding": false,
"ipConfigurations": [
{
"etag": "W/\"603612\"",
"id": "/subscriptions/1c2d3d21b6/resourceGroups/DEVT/providers/Microsoft.Network/networkInterfaces/database-nic00/ipConfigurations/ip1",
"name": "ip1",
"properties": {
"primary": true,
"privateIPAddress": "10.172.240.4",
"privateIPAddressVersion": "IPv4",
"privateIPAllocationMethod": "Static",
"provisioningState": "Succeeded",
"subnet": {
"id": "/subscriptions/1c3d21b6/resourceGroups/DEVT/providers/Microsoft.Network/virtualNetworks/vNetOne/subnets/vmsubnet"
}
}
},
{
"etag": "W/\"60b3612\"",
"id": "/subscriptions/1cfd3d21b6/resourceGroups/DEVT/providers/Microsoft.Network/networkInterfaces/database-nic00/ipConfigurations/ip2",
"name": "ip2",
"properties": {
"primary": false,
"privateIPAddress": "10.172.0.6",
"privateIPAddressVersion": "IPv4",
"privateIPAllocationMethod": "Static",
"provisioningState": "Succeeded",
"subnet": {
"id": "/subscriptions/1d3d21b6/resourceGroups/DEVT/providers/Microsoft.Network/virtualNetworks/vNetOne/subnets/vmsubnet"
}
}
}
],
"macAddress": "00-0D-3A-36-BC-FB",
"networkSecurityGroup": {
"id": "/subscriptions/1cf52c2d3d21b6/resourceGroups/ImcSite-UPAASDEVT/providers/Microsoft.Network/networkSecurityGroups/database-nsg"
},
"primary": true,
"provisioningState": "Succeeded",
"resourceGuid": "4d2fd4441e3c",
"virtualMachine": {
"id": "/subscriptions/1cf7d3d21b6/resourceGroups/DEVT/providers/Microsoft.Compute/virtualMachines/database-vm0"
}
},
"tags": {
"component": "database",
"provider": "B52B9A0E5F"
},
"type": "Microsoft.Network/networkInterfaces"
}
]
},
"changed": false,
"item": {
"key": "database",
"value": [
"database-nic00"
]
}
}
I am trying to parse the output and get the 10.172.240.11 and 10.172.240.4 addresses using the following method. Could someone please help me with this?
- debug: msg="{{ item.value[0] }}"
  with_dict: "{{ vm_net_intf }}"
I expect the debug output to print the 10.172.240.11 and 10.172.240.4 IP addresses.
Given the last part of the ansible_facts, the play below
tasks:
  - debug:
      msg: "{{ item.properties.ipConfigurations |
               json_query('[].properties.privateIPAddress') }}"
    loop: "{{ ansible_facts.azure_networkinterfaces }}"
gives (abridged):
"msg": [
"10.172.240.4",
"10.172.0.6"
]
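If only the primary address of each NIC is wanted (10.172.240.11 and 10.172.240.4, as the question asks), the JMESPath expression can filter on the primary flag. This variant is a sketch along the same lines, not tested against the real facts:

```yaml
tasks:
  - debug:
      msg: "{{ item.properties.ipConfigurations |
               json_query('[?properties.primary].properties.privateIPAddress') }}"
    loop: "{{ ansible_facts.azure_networkinterfaces }}"
```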

Cloudformation AWS: Connect RDS to subnets

I am trying to build a CloudFormation template, but I am having trouble connecting my Oracle RDS instance to my two subnets.
My RDS resource is:
"3DCFDB": {
"Type": "AWS::RDS::DBInstance",
"Properties": {
"DBInstanceClass": "db.t2.micro",
"AllocatedStorage": "20",
"Engine": "oracle-se2",
"EngineVersion": "12.1.0.2.v13",
"MasterUsername": {
"Ref": "user"
},
"MasterUserPassword": {
"Ref": "password"
}
},
"Metadata": {
"AWS::CloudFormation::Designer": {
"id": "*"
}
},
"DependsOn": [
"3DEXPSUBPU",
"3DSUBPRI"
]
}
What parameter am I supposed to add to connect my RDS to 2 subnets?
If I understand correctly, you need to create a resource of type "AWS::RDS::DBSubnetGroup"; then, inside your "AWS::RDS::DBInstance", you can refer to the subnet group with something similar to this:
"3DCFDB": {
"Type": "AWS::RDS::DBInstance",
"Properties": {
"DBInstanceClass": "db.t2.micro",
"AllocatedStorage": "20",
"Engine": "oracle-se2",
"EngineVersion": "12.1.0.2.v13",
"DBSubnetGroupName": {
"Ref": "DBsubnetGroup"
}
"MasterUsername": {
"Ref": "user"
},
"MasterUserPassword": {
"Ref": "password"
}
},
"Metadata": {
"AWS::CloudFormation::Designer": {
"id": "*"
}
},
"DependsOn": [
"3DEXPSUBPU",
"3DSUBPRI"
]
},
"DBsubnetGroup": {
"Type" : "AWS::RDS::DBSubnetGroup",
...
...
}
More info can be found here
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-rds-dbsubnet-group.html
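For illustration, the subnet group resource might look like the following sketch; the subnet logical IDs 3DEXPSUBPU and 3DSUBPRI are taken from the question's DependsOn list, and the description text is made up:

```json
"DBsubnetGroup": {
  "Type": "AWS::RDS::DBSubnetGroup",
  "Properties": {
    "DBSubnetGroupDescription": "Subnets for the Oracle RDS instance",
    "SubnetIds": [
      { "Ref": "3DEXPSUBPU" },
      { "Ref": "3DSUBPRI" }
    ]
  }
}
```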

AWS Cloudformation and Autoscaling keeps placing instances in same AZ

Hi all, I have the following CloudFormation template that creates a RabbitMQ cluster using RMQ autoclustering. It works; however, every time I run it, all the instances end up in the same AZ! I've verified that the stack variables are correct, and the subnets are set correctly as well. It's all created in the correct account. I'm not sure what else to try, and I'm wondering if something is incorrect in the VPC that was supplied to me.
{
"AWSTemplateFormatVersion": "2010-09-09",
"Parameters": {
"EnvironmentValue": {
"AllowedValues": [
"Dev",
"Test"
],
"Default": "Dev",
"Description": "What environment is this?",
"Type": "String"
},
"RabbitMQErlangCookie": {
"Description": "The erlang cookie to propagate to all nodes in the cluster",
"Type": "String",
"MinLength": "20",
"MaxLength": "20",
"Default": "TGFBTKPLRTOYFHNVSTWN",
"AllowedPattern": "^[A-Z]*$",
"NoEcho": true
},
"RabbitMQAdminUserID": {
"Description": "The admin user name to create on the RabbitMQ cluster",
"Type": "String",
"MinLength": "5",
"MaxLength": "20",
"Default": "admin",
"AllowedPattern": "[a-zA-Z0-9]*",
"NoEcho": true
},
"RabbitMQAdminPassword": {
"Description": "The admin password for the admin account",
"Type": "String",
"MinLength": "5",
"MaxLength": "20",
"Default": "xxxxxx",
"AllowedPattern": "[a-zA-Z0-9!]*",
"NoEcho": true
},
"InstanceAvailabilityZones" : {
"Description" : "A list of avilability zones in which instances will be launched. ",
"Type" : "CommaDelimitedList",
"Default" : "us-east-1e,us-east-1d"
},
"Environment": {
"Description": "The environment to confgiure (dev, test, stage, prod",
"Type": "String",
"AllowedValues": [
"d",
"t"
],
"Default": "d",
"NoEcho": false
}
},
"Mappings": {
"Environments" : {
"Dev": {
"VPCProtectedApp":"vpc-protected-app",
"VPCProtectedDb":"vpc-protected-db",
"VPCProtectedFe":"vpc-protected-fe",
"ELB": "App-Dev",
"SecurityGroup": "sg-soa-db",
"Identifier": "d",
"Prefix": "Dev",
"RMQELB": "elb-soa-db-rmq-dev",
"RMQELBTargetGroup": "elb-soarmq-target-group-dev",
"RMQSubnets": "subnet-soa-db-1,subnet-soa-db-2",
"RMQSecurityGroup":"sg-soa-db",
"RMQClusterMin": "3",
"RMQClusterMax": "3",
"ConsulELB": "elb-soa-db-cons-dev",
"ConsulSubnets": "subnet-soa-db-1,subnet-soa-db-2",
"ConsulSecurityGroup":"sg-soa-db-cons",
"ConsulClusterMin": "3",
"ConsulClusterMax": "3"
},
"Test": {
"VPCProtectedApp":"vpc-protected-app",
"VPCProtectedDb":"vpc-protected-db",
"VPCProtectedFe":"vpc-protected-fe",
"ELB": "App-Dev",
"SecurityGroup": "sg-soa-db",
"Identifier": "t",
"Prefix": "Test",
"RMQELB": "elb-soa-db-rmq-test",
"RMQELBTargetGroup": "elb-soarmq-target-group-test",
"RMQSubnets": "subnet-soa-db-1,subnet-soa-db-2",
"RMQSecurityGroup":"sg-soa-db",
"RMQClusterMin": "3",
"RMQClusterMax": "3",
"ConsulELB": "elb-soa-db-cons-test",
"ConsulSubnets": "subnet-soa-db-1,subnet-soa-db-2",
"ConsulSecurityGroup":"sg-soa-db-cons",
"ConsulClusterMin": "3",
"ConsulClusterMax": "3"
}
}
},
"Resources": {
"RabbitMQRole": {
"Type": "AWS::IAM::Role",
"Properties": {
"AssumeRolePolicyDocument": {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": [
"ec2.amazonaws.com"
]
},
"Action": [
"sts:AssumeRole"
]
}
]
},
"Path": "/",
"Policies": [
{
"PolicyName": "root",
"PolicyDocument": {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingInstances",
"ec2:DescribeInstances"
],
"Resource": [
"*"
]
},
{
"Effect": "Allow",
"Action": [
"ecs:CreateCluster",
"ecs:DeregisterContainerInstance",
"ecs:DiscoverPollEndpoint",
"ecs:Poll",
"ecs:RegisterContainerInstance",
"ecs:Submit*"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": "ec2:DescribeInstances",
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:*"
],
"Resource": [
"arn:aws:logs:*:*:*"
]
}
]
}
}
]
}
},
"RabbitMQInstanceProfile": {
"Type": "AWS::IAM::InstanceProfile",
"Properties": {
"Path": "/",
"Roles": [
{
"Ref": "RabbitMQRole"
}
]
}
},
"ELBSOARabbitMQ": {
"Type": "AWS::ElasticLoadBalancingV2::LoadBalancer",
"Properties": {
"Name": {"Fn::FindInMap" : ["Environments", {"Ref" : "EnvironmentValue" },"RMQELB"]},
"Scheme": "internet-facing",
"Subnets": [
{
"Fn::ImportValue" : "subnet-soa-db-1"
},
{
"Fn::ImportValue" : "subnet-soa-db-2"
}
],
"SecurityGroups": [
{
"Fn::ImportValue" : "sg-soa-db"
}
]
}
},
"ELBSOARMQListener": {
"Type": "AWS::ElasticLoadBalancingV2::Listener",
"Properties": {
"DefaultActions": [
{
"TargetGroupArn": {
"Ref": "ELBSOARMQTargetGroup"
},
"Type": "forward"
}
],
"LoadBalancerArn": {
"Ref": "ELBSOARabbitMQ"
},
"Port": 80,
"Protocol": "HTTP"
}
},
"ELBSOARMQListenerRule": {
"Type": "AWS::ElasticLoadBalancingV2::ListenerRule",
"Properties": {
"Actions": [
{
"TargetGroupArn": {
"Ref": "ELBSOARMQTargetGroup"
},
"Type": "forward"
}
],
"Conditions": [
{
"Field": "path-pattern",
"Values": [
"/"
]
}
],
"ListenerArn": {
"Ref": "ELBSOARMQListener"
},
"Priority": 1
}
},
"ELBSOARMQTargetGroup": {
"Type": "AWS::ElasticLoadBalancingV2::TargetGroup",
"Properties": {
"TargetType": "instance",
"HealthCheckIntervalSeconds": 30,
"HealthCheckPort": 15672,
"HealthCheckProtocol": "HTTP",
"HealthCheckTimeoutSeconds": 3,
"HealthyThresholdCount": 2,
"Name":{"Fn::FindInMap" : ["Environments", {"Ref" : "EnvironmentValue" },"RMQELBTargetGroup"]},
"Port": 15672,
"Protocol": "HTTP",
"UnhealthyThresholdCount": 2,
"VpcId": {
"Fn::ImportValue" : "vpc-protected-db"
}
}
},
"SOARMQServerGroup": {
"Type": "AWS::AutoScaling::AutoScalingGroup",
"DependsOn": "ELBSOARabbitMQ",
"Properties": {
"LaunchConfigurationName": {
"Ref": "SOARMQEc2InstanceLC"
},
"MinSize": "3",
"MaxSize": "5",
"TargetGroupARNs": [
{
"Ref": "ELBSOARMQTargetGroup"
}
],
"Tags": [
{
"ResourceType": "auto-scaling-group",
"ResourceId": "my-asg",
"InstanceName": "rabbitmq",
"PropagateAtLaunch": true,
"Value": "test",
"Key": "environment"
},
{
"ResourceType": "auto-scaling-group",
"ResourceId": "my-asg",
"InstanceName": "rabbitmq",
"PropagateAtLaunch": true,
"Value": "vavd-soa-rmq",
"Key": "Name"
}
],
"AvailabilityZones" : { "Ref" : "InstanceAvailabilityZones" },
"VPCZoneIdentifier": [
{
"Fn::ImportValue": "subnet-soa-db-1"
},
{
"Fn::ImportValue": "subnet-soa-db-2"
}
]
}
},
"SOARMQEc2InstanceLC": {
"Type": "AWS::AutoScaling::LaunchConfiguration",
"DependsOn": "ELBSOARabbitMQ",
"Properties": {
"IamInstanceProfile" : { "Ref" : "RabbitMQInstanceProfile" },
"ImageId": "ami-5e414e24",
"InstanceType": "m1.small",
"KeyName": "soa_dev_us_east_1",
"SecurityGroups": [
{
"Fn::ImportValue" : "sg-soa-db"
}
],
"UserData": {
"Fn::Base64": {
"Fn::Join": [
"",
[
"#!/bin/bash -xe\n",
"sudo su\n",
"exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1\n",
"echo \"1. Installing yum updates\"\n",
"sudo yum update -y\n",
"sudo yum install wget -y\n",
"sudo yum install socat -y\n",
"yum install -y aws-cfn-bootstrap\n",
"echo \"2. Downloading erlang distro and install\"\n",
"wget https://github.com/rabbitmq/erlang-rpm/releases/download/v20.3.0/erlang-20.3-1.el6.x86_64.rpm\n",
"sudo rpm -ivh erlang-20.3-1.el6.x86_64.rpm\n",
"export EC2_PUBLIC_IP=$(curl http://169.254.169.254/latest/meta-data/public-ipv4)\n",
"echo \"3. Downloading rabbitmq distro and installing\"\n",
"wget http://dl.bintray.com/rabbitmq/all/rabbitmq-server/3.7.4/rabbitmq-server-3.7.4-1.el6.noarch.rpm\n",
"sudo rpm -Uvh rabbitmq-server-3.7.4-1.el6.noarch.rpm\n",
"export RABBITMQ_USE_LONGNAME=true\n",
"echo \"4. Setting the erlang cookie for clustering\"\n",
"sudo sh -c \"echo ''",
{
"Ref": "RabbitMQErlangCookie"
},
"'' > /var/lib/rabbitmq/.erlang.cookie\"\n",
"sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie\n",
"sudo chmod 600 /var/lib/rabbitmq/.erlang.cookie\n",
"echo \"5. Writing the rabbitmq configurations for AWS Autocluster Group peer discovery\"\n",
"sudo cat << EOF > /etc/rabbitmq/rabbitmq.conf\n",
"cluster_formation.peer_discovery_backend = rabbit_peer_discovery_aws\n",
"cluster_formation.aws.region = us-east-1\n",
"cluster_formation.aws.use_autoscaling_group = true\n",
"log.console.level = debug\n",
"log.file.level = debug\n",
"EOF\n",
"echo \"6. Enable the management and peer discovery plugins\"\n",
"sudo rabbitmq-plugins enable rabbitmq_management\n",
"sudo rabbitmq-plugins --offline enable rabbitmq_peer_discovery_aws\n",
"echo \"7. Restart the service - stop the app prior to clustering\"\n",
"sudo service rabbitmq-server restart\n",
"sudo rabbitmqctl stop_app\n",
"sudo rabbitmqctl reset\n",
"echo \"8. Starting the application\"\n",
"sudo rabbitmqctl start_app\n",
"echo \"9. Adding admin user and setting permissions\"\n",
"sudo rabbitmqctl add_user ",
{
"Ref": "RabbitMQAdminUserID"
},
" ",
{
"Ref": "RabbitMQAdminPassword"
},
"\n",
"sudo rabbitmqctl set_user_tags ",
{
"Ref": "RabbitMQAdminUserID"
},
" administrator\n",
"sudo rabbitmqctl set_permissions -p / ",
{
"Ref": "RabbitMQAdminUserID"
},
" \".*\" \".*\" \".*\" \n",
"echo \"10. Configuration complete!\"\n"
]
]
}
}
}
}
}
}

Why is Mesos framework not being offered resources?

I am using Mesos 1.0.1. I have added an agent with a new role, docker_gpu_worker, and I register a framework with this role. The framework does not receive any offers. Other frameworks (same Java code) using other roles work fine. I have not restarted the three Mesos masters. Does anyone have an idea of what might be going wrong?
At master/frameworks, I see my framework:
"{
"id": "fd01b1b0-eb73-4d40-8774-009171ae1db1-0701",
"name": "/data4/Users/mikeb/jobs/999",
"pid": "scheduler-77345362-b85c-4044-8db5-0106b9015119#x.x.x.x:57617",
"used_resources": {
"disk": 0,
"mem": 0,
"gpus": 0,
"cpus": 0
},
"offered_resources": {
"disk": 0,
"mem": 0,
"gpus": 0,
"cpus": 0
},
"capabilities": [],
"hostname": "x-x-x-x.ec2.internal",
"webui_url": "",
"active": true,
"user": "mikeb",
"failover_timeout": 10080,
"checkpoint": true,
"role": "docker_gpu_worker",
"registered_time": 1507028279.18887,
"unregistered_time": 0,
"principal": "test-framework-java",
"resources": {
"disk": 0,
"mem": 0,
"gpus": 0,
"cpus": 0
},
"tasks": [],
"completed_tasks": [],
"offers": [],
"executors": []
}"
At master/roles I see my role:
"{
"frameworks": [
"fd01b1b0-eb73-4d40-8774-009171ae1db1-0701",
"fd01b1b0-eb73-4d40-8774-009171ae1db1-0673",
"fd01b1b0-eb73-4d40-8774-009171ae1db1-0335"
],
"name": "docker_gpu_worker",
"resources": {
"cpus": 0,
"disk": 0,
"gpus": 0,
"mem": 0
},
"weight": 1
}"
At master/slaves I see my agent:
"{
"id": "fd01b1b0-eb73-4d40-8774-009171ae1db1-S5454",
"pid": "slave(1)#x.x.x.x:5051",
"hostname": "x-x-x-x.ec2.internal",
"registered_time": 1506692213.24938,
"resources": {
"disk": 35056,
"mem": 59363,
"gpus": 4,
"cpus": 32,
"ports": "[31000-32000]"
},
"used_resources": {
"disk": 0,
"mem": 0,
"gpus": 0,
"cpus": 0
},
"offered_resources": {
"disk": 0,
"mem": 0,
"gpus": 0,
"cpus": 0
},
"reserved_resources": {
"docker_gpu_worker": {
"disk": 35056,
"mem": 59363,
"gpus": 4,
"cpus": 32,
"ports": "[31000-32000]"
}
},
"unreserved_resources": {
"disk": 0,
"mem": 0,
"gpus": 0,
"cpus": 0
},
"attributes": {},
"active": true,
"version": "1.0.1",
"reserved_resources_full": {
"docker_gpu_worker": [
{
"name": "gpus",
"type": "SCALAR",
"scalar": {
"value": 4
},
"role": "docker_gpu_worker"
},
{
"name": "cpus",
"type": "SCALAR",
"scalar": {
"value": 32
},
"role": "docker_gpu_worker"
},
{
"name": "mem",
"type": "SCALAR",
"scalar": {
"value": 59363
},
"role": "docker_gpu_worker"
},
{
"name": "disk",
"type": "SCALAR",
"scalar": {
"value": 35056
},
"role": "docker_gpu_worker"
},
{
"name": "ports",
"type": "RANGES",
"ranges": {
"range": [
{
"begin": 31000,
"end": 32000
}
]
},
"role": "docker_gpu_worker"
}
]
},
"used_resources_full": [],
"offered_resources_full": []
}"
We have tracked the problem to this Mesos agent config:
--isolation="filesystem/linux,cgroups/devices,gpu/nvidia"
Removing that, the agent works properly, but without access to GPU resources. This config is a requirement according to the docs for Nvidia GPU support and those docs seem to indicate that version 1.0.1 supports it. We are continuing to investigate.
The GPU_RESOURCES capability must be enabled for frameworks.
As illustrated in http://mesos.readthedocs.io/en/latest/gpu-support/,
this can be achieved for example by specifying --framework_capabilities="GPU_RESOURCES" in the mesos-execute command, or with code like this in C++:
FrameworkInfo framework;
framework.add_capabilities()->set_type(
    FrameworkInfo::Capability::GPU_RESOURCES);
For Marathon frameworks instead, the Marathon service must be started with the option --enable_features "gpu_resources" as indicated in Enable GPU resources (CUDA) on DC/OS
Roles can be registered with the master statically; if you add an agent role at runtime, it is not known to the master, and a master restart is required for the master to see the role.
Try restarting the Mesos master.
