Mesos Slave on Windows 2016 Not Connecting with Master - mesos

My current set up is as follows:
Mesos Master — 10.20.200.300:14081 - RHEL 7
Zookeeper — 10.20.200.300:14080 - RHEL 7
Mesos Agent — 10.21.210.310:5051 - Windows 2016
The master is up and is able to connect to ZooKeeper. However, after starting the agent, it connects to ZooKeeper but never registers with the master.
The master was started as a systemd process with the parameters below under /etc/mesos-master:
hostname - mymaster.mesos.com
quorum - 1
work_dir - /var/lib/mesos
advertise_ip - 10.20.200.300
advertise_port - 14081
Below are the logs from the master, slave, and ZooKeeper.
Master Logs (running on 10.20.200.300:14081):
E1208 12:22:21.269227 4302 process.cpp:2455] Failed to shutdown socket with fd 26, address 10.20.200.300:14081: Transport endpoint is not connected
ZooKeeper Logs (running on 10.20.200.300:14080):
2017-12-08 12:22:21,185 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:14080:ZooKeeperServer@942] - Client attempting to establish new session at /10.21.210.310:63039
2017-12-08 12:22:21,196 [myid:] - INFO [SyncThread:0:ZooKeeperServer@687] - Established session 0x160372c2b770010 with negotiated timeout 10000 for client /10.21.210.310:63039
Slave Logs (running on 10.21.210.310:5051):
I1208 12:22:21.179652 4224 slave.cpp:1007] New master detected at master@10.20.200.300:14081
I1208 12:22:21.195278 4224 slave.cpp:1031] No credentials provided. Attempting to register without authentication
I1208 12:22:21.195278 4224 slave.cpp:1042] Detecting new master
I1208 12:22:21.210924 6156 slave.cpp:5135] Got exited event for master@10.20.200.300:14081
W1208 12:22:21.210924 6156 slave.cpp:5140] Master disconnected! Waiting for a new master to be elected
I1208 12:22:21.226510 2700 slave.cpp:5135] Got exited event for master@10.20.200.300:14081
W1208 12:22:21.226510 2700 slave.cpp:5140] Master disconnected! Waiting for a new master to be elected
Does anyone know the reason for these?
I have tested the connectivity between slave -> master and master -> slave, and it was successful.
Test-NetConnection -ComputerName 10.20.200.300 -Port 14081
ComputerName : 10.20.200.300
RemoteAddress : 10.20.200.300
RemotePort : 14081
InterfaceAlias : Ethernet
SourceAddress : 10.21.210.310
TcpTestSucceeded : True
[root@mesos-master]# telnet 10.21.210.310 5051
Trying 10.21.210.310...
Connected to 10.21.210.310.
Escape character is '^]'.
I started the agent with the parameters below:
C:\Mesos\mesos\build\src>C:\Mesos\mesos\build\src\mesos-agent.exe \
--master=zk://10.20.200.300:14080/mesos \
--work_dir=C:\Mesos\Logs \
--launcher_dir=C:\Mesos\mesos\build\src \
--ip=10.21.210.310 \
--advertise_ip=10.21.210.310 \
--advertise_port=5051
Output of the master's /state endpoint:
{
  "version": "1.3.1",
  "git_sha": "1beaede8c13f0832d4921121da34f924deec8950",
  "git_tag": "1.3.1",
  "build_date": "2017-09-05 18:02:12",
  "build_time": 1504634532,
  "build_user": "centos",
  "start_time": 1513010072.51033,
  "elected_time": 1513010072.67995,
  "id": "90f5702f-f867-41ac-8087-5d20c87ea96f",
  "pid": "master@10.20.200.300:14081",
  "hostname": "MYhost.COM",
  "activated_slaves": 0,
  "deactivated_slaves": 0,
  "unreachable_slaves": 0,
  "leader": "master@10.20.200.300:14081",
  "leader_info": {
    "id": "90f5702f-f867-41ac-8087-5d20c87ea96f",
    "pid": "master@10.20.200.300:14081",
    "port": 14081,
    "hostname": "MYhost.COM"
  },
  "log_dir": "/var/log/mesos",
  "flags": {
    "advertise_ip": "10.20.200.300",
    "advertise_port": "14081",
    "agent_ping_timeout": "15secs",
    "agent_reregister_timeout": "10mins",
    "allocation_interval": "1secs",
    "allocator": "HierarchicalDRF",
    "authenticate_agents": "false",
    "authenticate_frameworks": "false",
    "authenticate_http_frameworks": "false",
    "authenticate_http_readonly": "false",
    "authenticate_http_readwrite": "false",
    "authenticators": "crammd5",
    "authorizers": "local",
    "framework_sorter": "drf",
    "help": "false",
    "hostname": "MYhost.COM",
    "hostname_lookup": "true",
    "http_authenticators": "basic",
    "initialize_driver_logging": "true",
    "log_auto_initialize": "true",
    "log_dir": "/var/log/mesos",
    "logbufsecs": "0",
    "logging_level": "INFO",
    "max_agent_ping_timeouts": "5",
    "max_completed_frameworks": "50",
    "max_completed_tasks_per_framework": "1000",
    "max_unreachable_tasks_per_framework": "1000",
    "port": "14081",
    "quiet": "false",
    "quorum": "1",
    "recovery_agent_removal_limit": "100%",
    "registry": "replicated_log",
    "registry_fetch_timeout": "1mins",
    "registry_gc_interval": "15mins",
    "registry_max_agent_age": "2weeks",
    "registry_max_agent_count": "102400",
    "registry_store_timeout": "20secs",
    "registry_strict": "false",
    "root_submissions": "true",
    "user_sorter": "drf",
    "version": "false",
    "webui_dir": "/usr/share/mesos/webui",
    "work_dir": "/var/lib/mesos",
    "zk": "zk://localhost:14080/mesos",
    "zk_session_timeout": "10secs"
  },
  "slaves": [],
  "recovered_slaves": [],
  "frameworks": [],
  "completed_frameworks": [],
  "orphan_tasks": [],
  "unregistered_frameworks": []
}
Do we need to test any other connectivity, or is this error due to some other reason?

I would try this:
Set a hostname on the slave (you can use --hostname=10.21.210.310)
Check the firewall on the Windows machine and allow incoming connections on port 5051
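A minimal sketch of both steps, run in an elevated PowerShell prompt on the Windows agent (the firewall rule name is an arbitrary choice for this example, and the paths/flags mirror the agent command from the question):

```powershell
# Allow inbound TCP on the agent port (run as Administrator).
# "Mesos Agent" is just an example rule name.
New-NetFirewallRule -DisplayName "Mesos Agent" -Direction Inbound `
    -Protocol TCP -LocalPort 5051 -Action Allow

# Restart the agent with an explicit hostname, so the master addresses
# its registration reply to something that resolves back to the agent.
C:\Mesos\mesos\build\src\mesos-agent.exe `
    --master=zk://10.20.200.300:14080/mesos `
    --work_dir=C:\Mesos\Logs `
    --launcher_dir=C:\Mesos\mesos\build\src `
    --hostname=10.21.210.310 `
    --advertise_ip=10.21.210.310 `
    --advertise_port=5051
```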

Related

unable to control swarm ingress network with ansible

I'm deploying Docker Swarm with Ansible and I would like to ensure the ingress network has been created. To that end, I configured the following task:
- name: Ensure ingress network exists
  docker_network:
    state: present
    name: ingress
    driver: overlay
    driver_options:
      ingress: true
And I'm getting the following error:
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: docker.errors.NotFound: 404 Client Error for http+docker://localhost/v1.41/networks/ingress/disconnect: Not Found ("No such container: ingress-endpoint")
fatal: [swarm-srv-1]: FAILED! => {"changed": false, "msg": "An unexpected docker error occurred: 404 Client Error for http+docker://localhost/v1.41/networks/ingress/disconnect: Not Found (\"No such container: ingress-endpoint\")"}
I've tried to add some arguments like:
scope: swarm
force: yes
But no change... I've also tried to delete the ingress with Ansible (state: absent), but I always get the same error.
Note that I don't face any issue when deleting and recreating the ingress network manually on the swarm: docker network rm ingress
I don't know how to resolve this issue... Any help would be appreciated. Thanks!
Here is some information that may help...
# docker version
Version: 20.10.6
API version: 1.41
Go version: go1.13.15
Git commit: 370c289
Built: Fri Apr 9 22:47:35 2021
OS/Arch: linux/amd64
# docker inspect ingress
[
  {
    "Name": "ingress",
    "Id": "yb2tkhep8vtaj9q7w3mssc9lx",
    "Created": "2021-05-19T05:53:27.524446929-04:00",
    "Scope": "swarm",
    "Driver": "overlay",
    "EnableIPv6": false,
    "IPAM": {
      "Driver": "default",
      "Options": null,
      "Config": [
        {
          "Subnet": "10.0.0.0/24",
          "Gateway": "10.0.0.1"
        }
      ]
    },
    "Internal": false,
    "Attachable": false,
    "Ingress": true,
    "ConfigFrom": {
      "Network": ""
    },
    "ConfigOnly": false,
    "Containers": {
      "ingress-sbox": {
        "Name": "ingress-endpoint",
        "EndpointID": "dfdc0f123d21a196c7a815c7e0a886924d0799ae5f3be2d38b64d527ed4620b1",
        "MacAddress": "02:42:0a:00:00:02",
        "IPv4Address": "10.0.0.2/24",
        "IPv6Address": ""
      }
    },
    "Options": {
      "com.docker.network.driver.overlay.vxlanid_list": "4096"
    },
    "Labels": {},
    "Peers": [
      {
        "Name": "8f8932d6f99f",
        "IP": "(ip address here)"
      },
      {
        "Name": "28b9ca95dcf0",
        "IP": "(ip address here)"
      },
      {
        "Name": "f7c48c8af2f5",
        "IP": "(ip address here)"
      }
    ]
  }
]
I had the exact same issue when trying to customize the IP range of the ingress network. It looks like the docker_network module does not support modification of swarm-specific networks: there is an open GitHub issue for this.
I went for the ugly workaround of removing the network through a shell command (docker network rm ingress) and adding it again. When adding it with the docker_network module, I found that creation also does not work (it fails to set the ingress property of the network). So I ended up doing both the remove and create operations through shell commands.
Since the removal will trigger a confirmation dialogue:
WARNING! Before removing the routing-mesh network, make sure all the nodes in your swarm run the same docker engine version. Otherwise, removal may not be effective and functionality of newly create ingress networks will be impaired.
Are you sure you want to continue? [y/N]
I used the expect module to confirm the dialogue:
- name: remove default ingress network
  ansible.builtin.expect:
    command: docker network rm ingress
    responses:
      "[y/N]": "y"

- name: create customized ingress network
  shell: "docker network create --ingress --subnet {{ docker_ingress_network }} --driver overlay ingress"
It is not perfect but it works.
There was one last problem I experienced: when running this on an existing swarm, I ended up having network issues on the node where I ran it (somehow the docker_gwbridge network on that node could not handle the change). The fix was to fully remove the node and re-join the swarm, which regenerates the docker_gwbridge.
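A rough sketch of that node-level fix, assuming the node runs no workloads you care about (the token and manager address below are placeholders, not values from the setup above):

```shell
# On a manager: print the current worker join token for later.
docker swarm join-token worker

# On the affected node: leave the swarm, tearing down its swarm-side
# networking state (including docker_gwbridge attachments).
docker swarm leave --force

# Back on a manager: remove the now-down node entry, then re-join the
# node with the token printed above (manager address is a placeholder).
docker node rm <node-name>
docker swarm join --token <worker-token> <manager-ip>:2377
```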

Issue with Consul Connect

I have a service that I want to proxy with Connect, and I followed the instructions on the HashiCorp Learn portal.
This is my "hello" service:
{
  "service": {
    "name": "node",
    "port": 3000,
    "connect": {
      "sidecar_service": {}
    }
  }
}
I then do a "consul reload" and create the proxy with
consul connect proxy -sidecar-for node &
When I create another service like this
consul connect proxy -service web -upstream node:9191
I can verify that I can reach my node service by calling the web service on port 9191 (curl localhost:9191). But when I define my web service in a JSON file as shown below, register it (with consul reload), and try to connect to it, I get the following error:
curl: (7) Failed to connect to localhost port 9191: Connection refused
web.json
{
  "service": {
    "name": "web",
    "connect": {
      "sidecar_service": {
        "proxy": {
          "upstreams": [
            {
              "destination_name": "node",
              "local_bind_port": 9191
            }
          ]
        }
      }
    }
  }
}
Is there anything I missed?
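One possible gap, offered as an assumption rather than a confirmed diagnosis: with Consul's built-in proxy, registering a sidecar_service only creates the registration; a proxy process still has to be started for it, just as was done for node. A sketch of the full sequence under that assumption:

```shell
# Register both service definitions, then start one built-in proxy
# process per sidecar. Only a running proxy for "web" would bind
# local port 9191 for the upstream.
consul reload
consul connect proxy -sidecar-for node &
consul connect proxy -sidecar-for web &

# If this is the missing step, the upstream should now answer:
curl localhost:9191
```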

Kafka not Publishing Oracle Data

I have a Confluent-on-RHEL setup and am trying to read data from an Oracle 12c table/view (I tried both), and it never creates messages at the consumer.
My suspicion is that it has something to do with the data in the tables being loaded using a bulk loader rather than unary inserts. I do have a unique incrementing id column in the data that I have specified, the config loads, and it shows my topic name as active/running.
Any ideas?
{
  "name": "oracle_source_05",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://<host>:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://<host>:8081",
    "connection.url": "<jdbc url>",
    "connection.user": "<user>",
    "connection.password": "<pw>",
    "table.whitelist": "<view name>",
    "table.type": "VIEW",
    "mode": "incrementing",
    "incrementing.column.name": "<id column>",
    "validate.non.null": "false",
    "topic.prefix": "ORACLE-"
  }
}
Log has this message:
[2018-04-17 10:59:19,965] DEBUG [Controller id=0] Topics not in preferred replica Map() (kafka.controller.KafkaController)
[2018-04-17 10:59:19,965] TRACE [Controller id=0] Leader imbalance ratio for broker 0 is 0.0 (kafka.controller.KafkaController)
server.log:
[2018-04-18 09:24:26,495] INFO Accepted socket connection from /127.0.0.1:39228 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2018-04-18 09:24:26,498] INFO Client attempting to establish new session at /127.0.0.1:39228 (org.apache.zookeeper.server.ZooKeeperServer)
[2018-04-18 09:24:26,499] INFO Established session 0x162d403daed0004 with negotiated timeout 30000 for client /127.0.0.1:39228 (org.apache.zookeeper.server.ZooKeeperServer)
[2018-04-18 09:24:26,516] INFO Processed session termination for sessionid: 0x162d403daed0004 (org.apache.zookeeper.server.PrepRequestProcessor)
[2018-04-18 09:24:26,517] INFO Closed socket connection for client /127.0.0.1:39228 which had sessionid 0x162d403daed0004 (org.apache.zookeeper.server.NIOServerCnxn)
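For problems like this, the Kafka Connect REST API is usually more informative than the broker logs. A sketch of the checks, assuming a Connect worker on its default port 8083 and a recent Confluent distribution (older CLI versions used --zookeeper / --broker-list instead of --bootstrap-server):

```shell
# Check that the connector and its task are RUNNING; a FAILED task's
# trace often carries the real error (JDBC, permissions, type mapping).
curl -s localhost:8083/connectors/oracle_source_05/status

# Confirm the topic was actually created, then try consuming it
# from the beginning.
kafka-topics --bootstrap-server localhost:9092 --list
kafka-avro-console-consumer --bootstrap-server localhost:9092 \
  --topic ORACLE-<view name> --from-beginning
```

Note also that in incrementing mode the connector only emits rows whose id column exceeds the last offset it recorded, regardless of how the rows were loaded.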

Multiple Service Definition in Config File Not Working Consul

I have been trying to add multiple services through configuration in Consul.
But Consul is throwing an error at startup of the agent itself.
The error is:
$consul.exe agent --dev
Starting Consul agent...
panic: runtime error: invalid memory address or nil pointer dereference
github.com/hashicorp/consul/agent.(*Agent).loadServices(0xc0421268c0, 0xc04223aa80, 0xc042254a00, 0x0)
    /gopath/src/github.com/hashicorp/consul/agent/agent.go:2097
github.com/hashicorp/consul/agent.(*Agent).Start()
    /gopath/src/github.com/hashicorp/consul/agent/agent.go:326
github.com/hashicorp/consul/command.(*AgentCommand).run()
    /gopath/src/github.com/hashicorp/consul/command/agent.go:704
github.com/hashicorp/consul/command.(*AgentCommand).Run()
    /gopath/src/github.com/hashicorp/consul/command/agent.go:653
Config file is:-
{
  "Services": [
    {
      "id": "somename",
      "name": "nameofthissevice",
      "service": "myservice",
      "address": "127.0.0.1",
      "port": 62133,
      "enableTagOverride": false,
      "tags": ["service1"]
    },
    {
      "id": "somename1",
      "name": "nameofthissevice",
      "service": "myservice2",
      "address": "127.0.0.1",
      "port": 64921,
      "enableTagOverride": false,
      "tags": ["service2"]
    }
  ]
}
I am using the Windows 7 platform.
Could anyone suggest some ideas?
Thanks
The configuration file is not being loaded, so the problem is not in the file. To load a configuration file you should add the corresponding flag (for example -config-file), otherwise Consul starts with its default configuration.
Looks like a faulty binary or an incompatible version.
Is your Windows 7 a 32-bit or 64-bit arch?
And which executable version of Consul have you downloaded?
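For comparison, a minimal multi-service file in the shape Consul's service definition docs describe (lower-case services key, no separate service field, snake_case enable_tag_override); this is a sketch adapted from the question's values, worth checking against your Consul version:

```json
{
  "services": [
    {
      "id": "somename",
      "name": "myservice",
      "address": "127.0.0.1",
      "port": 62133,
      "enable_tag_override": false,
      "tags": ["service1"]
    },
    {
      "id": "somename1",
      "name": "myservice2",
      "address": "127.0.0.1",
      "port": 64921,
      "enable_tag_override": false,
      "tags": ["service2"]
    }
  ]
}
```

A file like this would be loaded explicitly, e.g. consul agent -dev -config-file=services.json.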

Querying remote registry service on machine <IP Address> resulted in exception: Unable to change open service manager

My cluster config file is as follows:
{
  "name": "SampleCluster",
  "clusterConfigurationVersion": "1.0.0",
  "apiVersion": "01-2017",
  "nodes": [
    {
      "nodeName": "vm0",
      "iPAddress": "here is my VPS ip",
      "nodeTypeRef": "NodeType0",
      "faultDomain": "fd:/dc1/r0",
      "upgradeDomain": "UD0"
    },
    {
      "nodeName": "vm1",
      "iPAddress": "here is my another VPS ip",
      "nodeTypeRef": "NodeType0",
      "faultDomain": "fd:/dc1/r1",
      "upgradeDomain": "UD1"
    },
    {
      "nodeName": "vm2",
      "iPAddress": "here is my another VPS ip",
      "nodeTypeRef": "NodeType0",
      "faultDomain": "fd:/dc1/r2",
      "upgradeDomain": "UD2"
    }
  ],
  "properties": {
    "reliabilityLevel": "Bronze",
    "diagnosticsStore": {
      "metadata": "Please replace the diagnostics file share with an actual file share accessible from all cluster machines.",
      "dataDeletionAgeInDays": "7",
      "storeType": "FileShare",
      "IsEncrypted": "false",
      "connectionstring": "c:\\ProgramData\\SF\\DiagnosticsStore"
    },
    "nodeTypes": [
      {
        "name": "NodeType0",
        "clientConnectionEndpointPort": "19000",
        "clusterConnectionEndpointPort": "19001",
        "leaseDriverEndpointPort": "19002",
        "serviceConnectionEndpointPort": "19003",
        "httpGatewayEndpointPort": "19080",
        "reverseProxyEndpointPort": "19081",
        "applicationPorts": {
          "startPort": "20001",
          "endPort": "20031"
        },
        "isPrimary": true
      }
    ],
    "fabricSettings": [
      {
        "name": "Setup",
        "parameters": [
          {
            "name": "FabricDataRoot",
            "value": "C:\\ProgramData\\SF"
          },
          {
            "name": "FabricLogRoot",
            "value": "C:\\ProgramData\\SF\\Log"
          }
        ]
      }
    ]
  }
}
It is almost identical to the standalone Service Fabric demo file for an untrusted cluster, except for my VPS IPs. I enabled the Remote Registry service. I ran
.\TestConfiguration.ps1 -ClusterConfigFilePath .\ClusterConfig.Unsecure.MultiMachine.json but got the following error:
Unable to change open service manager handle because 5
Unable to query service configuration because System.InvalidOperationException: Unable to change open service manager handle because 5
   at System.Fabric.FabricDeployer.FabricDeployerServiceController.GetServiceStartupType(String machineName, String serviceName)
Querying remote registry service on machine <IP Address> resulted in exception: Unable to change open service manager handle because 5.
Unable to change open service manager handle because 5
Unable to query service configuration because System.InvalidOperationException: Unable to change open service manager handle because 5
   at System.Fabric.FabricDeployer.FabricDeployerServiceController.GetServiceStartupType(String machineName, String serviceName)
Querying remote registry service on machine <Another IP Address> resulted in exception: Unable to change open service manager handle because 5.
Best Practices Analyzer determined environment has an issue. Please see additional BPA log output in DeploymentTraces
LocalAdminPrivilege : True
IsJsonValid : True
IsCabValid :
RequiredPortsOpen : True
RemoteRegistryAvailable : False
FirewallAvailable :
RpcCheckPassed :
NoConflictingInstallations :
FabricInstallable :
DataDrivesAvailable :
Passed : False
Test Config failed with exception: System.InvalidOperationException: Best Practices Analyzer determined environment has an issue. Please see additional BPA log output in DeploymentTraces folder.
   at System.Management.Automation.MshCommandRuntime.ThrowTerminatingError(ErrorRecord errorRecord)
I don't understand the problem. The VPSs are not locally connected; all have public IPs. I don't know whether that may be an issue. How do I make a virtual LAN among these VPSs? Can anyone give me some direction on this error? Any help is greatly appreciated.
Edit: I used the term VM instead of VPS.
Finally I got this working. Actually, all the nodes were in a network; I thought they weren't. I enabled file sharing and tried to access the shared files from the node where I ran the configuration test to all the other nodes. I had to provide the login credentials, and then it worked like a charm.
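For reference, the "because 5" in the output is the Win32 ERROR_ACCESS_DENIED code, which is why RemoteRegistryAvailable came back False. A sketch of per-node checks before re-running TestConfiguration.ps1 (elevated PowerShell; <node-ip> is a placeholder):

```powershell
# Ensure the Remote Registry service is actually running on every node,
# not merely enabled.
Set-Service -Name RemoteRegistry -StartupType Automatic
Start-Service -Name RemoteRegistry

# File/printer sharing (SMB, port 445) must be reachable between nodes
# for the remote registry and share access to work.
Test-NetConnection -ComputerName <node-ip> -Port 445
```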