Unable to control swarm ingress network with Ansible

I'm deploying a Docker swarm with Ansible and I would like to ensure the ingress network has been created. To that end, I configured the following task:
- name: Ensure ingress network exists
  docker_network:
    state: present
    name: ingress
    driver: overlay
    driver_options:
      ingress: true
And I'm getting the following error:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: docker.errors.NotFound: 404 Client Error for http+docker://localhost/v1.41/networks/ingress/disconnect: Not Found ("No such container: ingress-endpoint")
fatal: [swarm-srv-1]: FAILED! => {"changed": false, "msg": "An unexpected docker error occurred: 404 Client Error for http+docker://localhost/v1.41/networks/ingress/disconnect: Not Found (\"No such container: ingress-endpoint\")"}
I've tried to add some arguments like:
scope: swarm
force: yes
But no change... I've also tried to delete the ingress network with Ansible (state: absent), but I always get the same error.
Note that I don't face any issue when deleting and recreating the ingress network manually on the swarm: docker network rm ingress
I don't know how to resolve this issue... Any help would be appreciated. Thanks!
Here is some information that may help...
# docker version
Version: 20.10.6
API version: 1.41
Go version: go1.13.15
Git commit: 370c289
Built: Fri Apr 9 22:47:35 2021
OS/Arch: linux/amd64
# docker inspect ingress
[
    {
        "Name": "ingress",
        "Id": "yb2tkhep8vtaj9q7w3mssc9lx",
        "Created": "2021-05-19T05:53:27.524446929-04:00",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.0.0/24",
                    "Gateway": "10.0.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": true,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "dfdc0f123d21a196c7a815c7e0a886924d0799ae5f3be2d38b64d527ed4620b1",
                "MacAddress": "02:42:0a:00:00:02",
                "IPv4Address": "10.0.0.2/24",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4096"
        },
        "Labels": {},
        "Peers": [
            {
                "Name": "8f8932d6f99f",
                "IP": "(ip address here)"
            },
            {
                "Name": "28b9ca95dcf0",
                "IP": "(ip address here)"
            },
            {
                "Name": "f7c48c8af2f5",
                "IP": "(ip address here)"
            }
        ]
    }
]

I had the exact same issue when trying to customize the IP range of the ingress network. It looks like the docker_network module does not support modifying swarm-specific networks: there is an open GitHub issue for this.
I went for the ugly workaround of removing the network through a shell command (docker network rm ingress) and adding it again. Creating it with the docker_network module also does not seem to work (it fails to set the ingress property of the network), so I ended up doing both the remove and the create operation through shell commands.
Since the removal will trigger a confirmation dialogue:
WARNING! Before removing the routing-mesh network, make sure all the nodes in your swarm run the same docker engine version. Otherwise, removal may not be effective and functionality of newly created ingress networks will be impaired.
Are you sure you want to continue? [y/N]
I used the expect module to confirm the dialogue:
- name: remove default ingress network
  ansible.builtin.expect:
    command: docker network rm ingress
    responses:
      "[y/N]": "y"

- name: create customized ingress network
  shell: "docker network create --ingress --subnet {{ docker_ingress_network }} --driver overlay ingress"
It is not perfect, but it works.
There was one last problem I experienced: when running this on an existing swarm, I ended up with network issues on the node where I ran it (somehow the docker_gwbridge network on that node could not handle the change). The fix was to fully remove the node and re-join the swarm, which regenerates the docker_gwbridge.
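If you want to keep this workaround idempotent, one option (a rough sketch, not something I have tested on every Docker/Ansible combination) is to read the current ingress subnet first and only remove and recreate the network when it differs from the desired docker_ingress_network value:

- name: read current ingress subnet
  ansible.builtin.shell: >-
    docker network inspect
    --format '{% raw %}{{ (index .IPAM.Config 0).Subnet }}{% endraw %}'
    ingress
  register: ingress_subnet
  changed_when: false
  failed_when: false  # the ingress network may not exist at all yet

- name: remove default ingress network
  ansible.builtin.expect:
    command: docker network rm ingress
    responses:
      "[y/N]": "y"
  when: (ingress_subnet.stdout | trim) != docker_ingress_network

- name: create customized ingress network
  ansible.builtin.shell: "docker network create --ingress --subnet {{ docker_ingress_network }} --driver overlay ingress"
  when: (ingress_subnet.stdout | trim) != docker_ingress_network

The {% raw %} block keeps Ansible from treating the Go-template braces as Jinja2. You would also typically run these tasks only once, against a single manager node, since the ingress network is cluster-wide.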

Related

Mounting AWS EBS into CoreOS

I have launched an EC2 instance with a 100 GB EBS volume, following the https://coreos.com/os/docs/latest/booting-on-ec2.html docs.
#cloud-config
coreos:
  units:
    - name: media-ephemeral.mount
      command: start
      content: |
        [Mount]
        What=/dev/xvdb
        Where=/media/ephemeral
        Type=ext4
    - name: format-ephemeral.service
      command: start
      content: |
        [Unit]
        Description=Formats the ephemeral drive
        [Service]
        Type=oneshot
        RemainAfterExit=yes
        ExecStart=/usr/sbin/wipefs -f /dev/xvdb
        ExecStart=/usr/sbin/mkfs.btrfs -f /dev/xvdb
    - name: var-lib-docker.mount
      command: start
      content: |
        [Unit]
        Description=Mount ephemeral to /var/lib/docker
        Requires=format-ephemeral.service
        After=format-ephemeral.service
        Before=docker.service
        [Mount]
        What=/dev/xvdb
        Where=/var/lib/docker
        Type=btrfs
If I run the above, the EBS volume is mounted correctly, but on system reboot the mount is not persistent.
Using the following config instead:
storage:
  filesystems:
    - name: ephemeral1
      mount:
        device: /dev/xvdb
        format: ext4
        wipe_filesystem: true
systemd:
  units:
    - name: media-ephemeral.mount
      enable: true
      contents: |
        [Unit]
        Before=local-fs.target
        [Mount]
        What=/dev/xvdb
        Where=/media/ephemeral
        Type=ext4
        [Install]
        WantedBy=local-fs.target
    - name: var-lib-docker.mount
      enable: true
      contents: |
        [Unit]
        Description=Mount ephemeral to /var/lib/docker
        Before=local-fs.target
        [Mount]
        What=/dev/xvdb
        Where=/var/lib/docker
        Type=ext4
        [Install]
        WantedBy=local-fs.target
    - name: docker.service
      dropins:
        - name: 10-wait-docker.conf
          contents: |
            [Unit]
            After=var-lib-docker.mount
            Requires=var-lib-docker.mount
as per the docs, I get:
core@ip-10-1-2-188 ~ $ sudo /usr/bin/coreos-cloudinit --from-file storage1.conf
2019/01/15 17:09:28 Checking availability of "local-file"
2019/01/15 17:09:28 Fetching user-data from datasource of type "local-file"
2019/01/15 17:09:28 line 2: warning: unrecognized key "storage"
2019/01/15 17:09:28 line 9: warning: unrecognized key "systemd"
2019/01/15 17:09:28 Fetching meta-data from datasource of type "local-file"
2019/01/15 17:09:28 Parsing user-data as cloud-config
2019/01/15 17:09:28 Merging cloud-config from meta-data and user-data
2019/01/15 17:09:28 Updated /etc/environment
2019/01/15 17:09:28 Ensuring runtime unit file "etcd.service" is unmasked
2019/01/15 17:09:28 Ensuring runtime unit file "etcd2.service" is unmasked
2019/01/15 17:09:28 Ensuring runtime unit file "fleet.service" is unmasked
2019/01/15 17:09:28 Ensuring runtime unit file "locksmithd.service" is unmasked
core@ip-10-1-2-188 ~ $ cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1967.3.0
VERSION_ID=1967.3.0
BUILD_ID=2019-01-08-0044
PRETTY_NAME="Container Linux by CoreOS 1967.3.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"
What is the correct way to mount the EBS volume on CoreOS?
Any advice is much appreciated
It looks like you missed a step. Cloud-configs have been deprecated for quite some time now. You correctly converted that cloud-config into a Container Linux Config (CLC) file, but missed using the Config Transpiler (ct) to then render an Ignition config. You can check this by running your config through the online validator. After running that CLC config through the config transpiler I get the following, which validates correctly:
{
  "ignition": {
    "config": {},
    "timeouts": {},
    "version": "2.1.0"
  },
  "networkd": {},
  "passwd": {},
  "storage": {
    "filesystems": [
      {
        "mount": {
          "device": "/dev/xvdb",
          "format": "ext4",
          "wipeFilesystem": true
        },
        "name": "ephemeral1"
      }
    ]
  },
  "systemd": {
    "units": [
      {
        "contents": "[Unit]\nBefore=local-fs.target\n[Mount]\nWhat=/dev/xvdb\nWhere=/media/ephemeral\nType=ext4\n[Install]\nWantedBy=local-fs.target\n",
        "enable": true,
        "name": "media-ephemeral.mount"
      },
      {
        "contents": "[Unit]\nDescription=Mount ephemeral to /var/lib/docker\nBefore=local-fs.target\n[Mount]\nWhat=/dev/xvdb\nWhere=/var/lib/docker\nType=ext4\n[Install]\nWantedBy=local-fs.target\n",
        "enable": true,
        "name": "var-lib-docker.mount"
      },
      {
        "dropins": [
          {
            "contents": "[Unit]\nAfter=var-lib-docker.mount\nRequires=var-lib-docker.mount\n",
            "name": "10-wait-docker.conf"
          }
        ],
        "name": "docker.service"
      }
    ]
  }
}
Additionally, it's important to note that there are other differences between Ignition and coreos-cloudinit, the most important of which is that Ignition only runs once, at first boot. Thus, for things like wiping the contents of that ephemeral disk, you should not expect wipe_filesystem: true to be applied on every boot.
Try booting the machine with this config instead. You should get the expected results.
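For completeness, the transpile step itself is a one-liner, assuming the ct binary from the coreos/container-linux-config-transpiler releases is on your PATH (the file names below are just placeholders):

# Read the Container Linux Config (YAML) on stdin, write the Ignition config to stdout
ct < container-linux-config.yaml > ignition.json

The resulting ignition.json is what you then pass as user-data when launching the instance.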

Error importing Kibana dashboards: fail to create the Kibana loader: Error creating Kibana client

I'm having a problem when I try to run the command sudo metricbeat -e -setup
It returns: Error importing Kibana dashboards: fail to create the Kibana loader: Error creating Kibana client
But if I run sudo metricbeat test config
Config OK
or sudo metricbeat test modules
nginx...
stubstatus...OK
result:
{
  "@timestamp": "2018-10-05T12:30:19.077Z",
  "metricset": {
    "host": "127.0.0.1:8085",
    "module": "nginx",
    "name": "stubstatus",
    "rtt": 438
  },
  "nginx": {
    "stubstatus": {
      "accepts": 2871,
      "active": 2,
      "current": 3559,
      "dropped": 0,
      "handled": 2871,
      "hostname": "127.0.0.1:8085",
      "reading": 0,
      "requests": 3559,
      "waiting": 1,
      "writing": 1
    }
  }
}
Is Kibana up and running?
Are the Kibana IP and port configured correctly in Metricbeat?
Metricbeat from version 6.x onwards imports its dashboards into Kibana, so errors like this occur when the Kibana endpoint isn't reachable.
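If the endpoint is the problem, it is configured in metricbeat.yml; a minimal sketch (host and port are assumptions, adjust to where your Kibana actually listens):

# metricbeat.yml (excerpt)
setup.kibana:
  # Must be reachable from this host for the dashboard import to work
  host: "localhost:5601"

With that in place, the dashboard import (the -setup flag above, or metricbeat setup --dashboards on 6.x) should succeed as long as Kibana is really up on that address.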

Multiple Service Definition in Config File Not Working Consul

I have been trying to add multiple services through configuration in Consul, but the Consul agent throws an error at startup.
The error is:
$ consul.exe agent --dev
Starting Consul agent...
panic: runtime error: invalid memory address or nil pointer dereference
github.com/hashicorp/consul/agent.(*Agent).loadServices(0xc0421268c0, 0xc04223aa80, 0xc042254a00, 0x0)
        /gopath/src/github.com/hashicorp/consul/agent/agent.go:2097
github.com/hashicorp/consul/agent.(*Agent).Start()
        /gopath/src/github.com/hashicorp/consul/agent/agent.go:326
github.com/hashicorp/consul/command.(*AgentCommand).run()
        /gopath/src/github.com/hashicorp/consul/command/agent.go:704
github.com/hashicorp/consul/command.(*AgentCommand).Run()
        /gopath/src/github.com/hashicorp/consul/command/agent.go:653
The config file is:
{
  "Services": [
    {
      "id": "somename",
      "name": "nameofthissevice",
      "service": "myservice",
      "address": "127.0.0.1",
      "port": 62133,
      "enableTagOverride": false,
      "tags": ["service1"]
    },
    {
      "id": "somename1",
      "name": "nameofthissevice",
      "service": "myservice2",
      "address": "127.0.0.1",
      "port": 64921,
      "enableTagOverride": false,
      "tags": ["service2"]
    }
  ]
}
I am using Windows 7.
Could anyone suggest some ideas?
Thanks
The configuration file is not being loaded, so the problem is not in the file itself. To load it you have to pass an additional flag to the agent (e.g. -config-file or -config-dir); otherwise Consul starts with its default configuration.
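To illustrate the point above, a service definition file in the format the Consul docs describe for multiple services would look roughly like this (file name and values are made up; note there is no per-entry "service" key, and config-file keys use snake_case):

{
  "services": [
    {
      "id": "somename",
      "name": "myservice",
      "tags": ["service1"],
      "address": "127.0.0.1",
      "port": 62133,
      "enable_tag_override": false
    },
    {
      "id": "somename1",
      "name": "myservice2",
      "tags": ["service2"],
      "address": "127.0.0.1",
      "port": 64921,
      "enable_tag_override": false
    }
  ]
}

It would then be loaded explicitly, e.g. consul.exe agent -dev -config-file=services.json.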
Looks like a faulty binary or an incompatible version.
Is your Windows 7 a 32-bit or 64-bit arch?
And which executable version of Consul have you downloaded?

Querying remote registry service on machine <IP Address> resulted in exception: Unable to change open service manager

My cluster config file is as follows:
{
  "name": "SampleCluster",
  "clusterConfigurationVersion": "1.0.0",
  "apiVersion": "01-2017",
  "nodes": [
    {
      "nodeName": "vm0",
      "iPAddress": "here is my VPS ip",
      "nodeTypeRef": "NodeType0",
      "faultDomain": "fd:/dc1/r0",
      "upgradeDomain": "UD0"
    },
    {
      "nodeName": "vm1",
      "iPAddress": "here is my another VPS ip",
      "nodeTypeRef": "NodeType0",
      "faultDomain": "fd:/dc1/r1",
      "upgradeDomain": "UD1"
    },
    {
      "nodeName": "vm2",
      "iPAddress": "here is my another VPS ip",
      "nodeTypeRef": "NodeType0",
      "faultDomain": "fd:/dc1/r2",
      "upgradeDomain": "UD2"
    }
  ],
  "properties": {
    "reliabilityLevel": "Bronze",
    "diagnosticsStore": {
      "metadata": "Please replace the diagnostics file share with an actual file share accessible from all cluster machines.",
      "dataDeletionAgeInDays": "7",
      "storeType": "FileShare",
      "IsEncrypted": "false",
      "connectionstring": "c:\\ProgramData\\SF\\DiagnosticsStore"
    },
    "nodeTypes": [
      {
        "name": "NodeType0",
        "clientConnectionEndpointPort": "19000",
        "clusterConnectionEndpointPort": "19001",
        "leaseDriverEndpointPort": "19002",
        "serviceConnectionEndpointPort": "19003",
        "httpGatewayEndpointPort": "19080",
        "reverseProxyEndpointPort": "19081",
        "applicationPorts": {
          "startPort": "20001",
          "endPort": "20031"
        },
        "isPrimary": true
      }
    ],
    "fabricSettings": [
      {
        "name": "Setup",
        "parameters": [
          {
            "name": "FabricDataRoot",
            "value": "C:\\ProgramData\\SF"
          },
          {
            "name": "FabricLogRoot",
            "value": "C:\\ProgramData\\SF\\Log"
          }
        ]
      }
    ]
  }
}
It is almost identical to the demo file for an unsecured cluster from the standalone Service Fabric download, except for my VPS IPs. I enabled the Remote Registry service. I ran
.\TestConfiguration.ps1 -ClusterConfigFilePath .\ClusterConfig.Unsecure.MultiMachine.json but I got the following error:
Unable to change open service manager handle because 5
Unable to query service configuration because System.InvalidOperationException: Unable to change open service manager handle because 5
   at System.Fabric.FabricDeployer.FabricDeployerServiceController.GetServiceStartupType(String machineName, String serviceName)
Querying remote registry service on machine <IP Address> resulted in exception: Unable to change open service manager handle because 5.
Unable to change open service manager handle because 5
Unable to query service configuration because System.InvalidOperationException: Unable to change open service manager handle because 5
   at System.Fabric.FabricDeployer.FabricDeployerServiceController.GetServiceStartupType(String machineName, String serviceName)
Querying remote registry service on machine <Another IP Address> resulted in exception: Unable to change open service manager handle because 5.
Best Practices Analyzer determined environment has an issue. Please see additional BPA log output in DeploymentTraces
LocalAdminPrivilege : True
IsJsonValid : True
IsCabValid :
RequiredPortsOpen : True
RemoteRegistryAvailable : False
FirewallAvailable :
RpcCheckPassed :
NoConflictingInstallations :
FabricInstallable :
DataDrivesAvailable :
Passed : False
Test Config failed with exception: System.InvalidOperationException: Best Practices Analyzer determined environment has an issue. Please see additional BPA log output in DeploymentTraces folder.
   at System.Management.Automation.MshCommandRuntime.ThrowTerminatingError(ErrorRecord errorRecord)
I don't understand the problem. The VPSs are not locally connected; all have public IPs. I don't know whether this may be an issue. How do I make a virtual LAN among these VPSs? Can anyone give me some direction about this error? Any help is greatly appreciated.
Edit: I used the term VM instead of VPS.
Finally I got this working. All the nodes actually are in a network; I thought they weren't. I enabled file sharing, then accessed the shared files on every other node from the node where I ran the configuration test, supplying the login credentials when prompted. After that it worked like a charm.
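For anyone else hitting the same BPA result (RemoteRegistryAvailable : False and "because 5", i.e. access denied), the manual checks described above can be scripted; a rough PowerShell sketch, with a placeholder node address:

# Placeholder address - substitute each machine listed in ClusterConfig.json
$node = "10.0.0.4"

# On every node: TestConfiguration.ps1 queries services through the Remote Registry service
Set-Service -Name RemoteRegistry -StartupType Automatic
Start-Service -Name RemoteRegistry

# From the machine running the test: confirm the admin share is reachable with your credentials
Test-Path "\\$node\C$"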

Chronos can't run a private Docker container

I'm playing on localhost with a DC/OS installation. While everything works fine, I can't seem to run a Docker image located inside a private repo. I'm using Python to communicate with Chronos:
@celery.task(name='add-job', soft_time_limit=5)
def add_job(job_id):
    job_document = mongo.jobs.find_one({
        '_id': job_id
    })
    if job_document:
        worker_document = mongo.workers.find_one({
            '_id': job_document['workerId']
        })
        if worker_document:
            job = {
                'async': True,
                'name': job_document['_id'],
                'owner': 'owner@gmail.com',
                'command': "python /code/run.py",
                "disabled": False,
                "shell": True,
                "cpus": worker_document['cpus'],
                "disk": worker_document['disk'],
                "mem": worker_document['memory'],
                'schedule': 'R1//PT300S',  # start now
                "epsilon": "PT60M",
                "container": {
                    "type": "DOCKER",
                    "forcePullImage": True,
                    "image": "quay.io/username/container",
                    "network": "HOST",
                    "volumes": [{
                        "containerPath": "/images/",
                        "hostPath": "/images/",
                        "mode": "RW"
                    }]
                },
                "uris": [
                    "file:///images/docker.tar.gz"
                ]
            }
            return chronos_client.add(job)
        else:
            return 'worker not found'
    else:
        return 'job not found'
The job runs fine with a public image (alpine:latest) but it fails without any error inside the dcos installation.
The job gets executed but it fails immediately. The error log of the job inside chronos looks like this:
I1212 12:39:11.141639 25058 fetcher.cpp:498] Fetcher Info: {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/61d6d037-c9f5-482b-a441-11d85554461b-S1\/root","items":[{"action":"BYPASS_CACHE","uri":{"cache":false,"executable":false,"extract":false,"value":"file:\/\/\/images\/docker.tar.gz"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\/61d6d037-c9f5-482b-a441-11d85554461b-S1\/docker\/links\/7029bbea-4c3d-439a-8720-411f6fe40eb9","user":"root"}
I1212 12:39:11.143575 25058 fetcher.cpp:409] Fetching URI 'file:///images/docker.tar.gz'
I1212 12:39:11.143587 25058 fetcher.cpp:250] Fetching directly into the sandbox directory
I1212 12:39:11.143602 25058 fetcher.cpp:187] Fetching URI 'file:///images/docker.tar.gz'
I1212 12:39:11.143612 25058 fetcher.cpp:167] Copying resource with command:cp '/images/docker.tar.gz' '/var/lib/mesos/slave/slaves/61d6d037-c9f5-482b-a441-11d85554461b-S1/docker/links/7029bbea-4c3d-439a-8720-411f6fe40eb9/docker.tar.gz'
I1212 12:39:11.146726 25058 fetcher.cpp:547] Fetched 'file:///images/docker.tar.gz' to '/var/lib/mesos/slave/slaves/61d6d037-c9f5-482b-a441-11d85554461b-S1/docker/links/7029bbea-4c3d-439a-8720-411f6fe40eb9/docker.tar.gz'
Stdout is empty. When executed directly inside Marathon as an application with the same settings, the authentication works and my image is downloaded and executed. Is this something that Chronos does not support? It should... I mean, it has options for Docker...
Update: digging deeper into the agent logs I found this:
Failed to run 'docker -H unix:///var/run/docker.sock pull quay.io/username/container': exited with status 1; stderr='Error: Status 403 trying to pull repository username/container: "{\"error\": \"Permission Denied\"}"
I tried the archive with its config.json file on the agent itself and it can download the image when triggered from the command line. I just can't seem to understand why Chronos is not using it properly. I can't find any other reference on how to provide my credentials other than this.
As it turns out... the uris param is deprecated in favor of fetch. I started from scratch with a Marathon config applied to Chronos and watched the logs carefully, until I saw this: {'message': 'Tried to add both uri (deprecated) and fetch parameters on aBPepwhG5z33e4teG', 'status': 'Bad Request'}. Then I changed my uris parameter into:
"fetch": [{
"uri": "/images/docker.tar.gz",
"extract": true,
"executable": false,
"cache": false
}]
...and it worked.
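For reference, the credentials archive the job fetches (file:///images/docker.tar.gz) is normally built so that the .docker directory structure is preserved inside the tarball, along these lines (the paths are just an example):

cd ~
tar czf docker.tar.gz .docker/config.json
# copy the archive to every agent, e.g. to /images/docker.tar.gz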
Your post looked a little like this one, which turned out to be a problem with volumes.
