How to use Marathon health check command mode? - mesos

I am running Docker containers on Mesos / Marathon. I want to implement health checks, basically to run a health check script. My question is: will the health check command be run on the container itself, or does it run on the slave? It is probably container level since this is a per-application health check, so it seems obvious, but I would like to confirm it. I didn't find any relevant documentation that says where it is run.
Thanks
I did try an echo to /tmp/testfile via the command, and I see the file on the slave. Does this mean it runs on the slave? I just need confirmation. Any more information is welcome.

The short answer is: it depends. Long answer below : ).
Command health checks are run by the Mesos docker executor in your task container via docker exec. If you run your containers using the "unified containerizer", i.e., in the case of docker containers without the docker daemon, things are similar, with the difference that there is no docker exec and the Mesos executor simply enters the mnt namespace of your container before executing the command health check (see this doc). HTTP and TCP health checks are run by the Marathon scheduler and hence not necessarily on the node where your container is running (unless you run Marathon on the same node as the Mesos agent, which you probably should not be doing). Check out this page.
Starting with Mesos 1.2.0 and Marathon 1.3, it is possible to run so-called Mesos-native health checks. In this case, both HTTP(S) and TCP health checks run on the agent where your container is running. To make sure the container network can be reached, these checks enter the net namespace of your container.
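If you want to verify this on your own cluster, a quick sanity check (purely illustrative; the marker path and container name are placeholders, not anything Marathon requires) is to have the COMMAND check write a marker file and then look for it in both places:
# inside the health check's "command.value":  touch /tmp/hc-marker
docker exec <container-name> ls -l /tmp/hc-marker   # present if the check ran inside the container
ls -l /tmp/hc-marker                                # run this on the agent host; present only if the check ran there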

Mesos-level health checks (MESOS_HTTP, MESOS_HTTPS, MESOS_TCP, and COMMAND) are locally executed by Mesos on the agent running the corresponding task and thus test reachability from the Mesos executor. Mesos-level health checks offer the following advantages over Marathon-level health checks:
Mesos-level health checks are performed as close to the task as possible, so they are not affected by networking failures.
Mesos-level health checks are delegated to the agents running the tasks, so the number of tasks that can be checked can scale horizontally with the number of agents in the cluster.
Limitations and considerations
Mesos-level health checks consume extra resources on the agents; moreover, there is some overhead for fork-execing a process and entering the tasks’ namespaces every time a task is checked.
The health check processes share resources with the task that they check. Your application definition must account for the extra resources consumed by the health checks.
Mesos-level health checks require tasks to listen on the container’s loopback interface in addition to whatever interface they require. If you run a service in production, you will want to make sure that the users can reach it.
Marathon currently does NOT support the combination of Mesos and Marathon level health checks.
Example usage
HTTP:
{
  "path": "/api/health",
  "portIndex": 0,
  "protocol": "HTTP",
  "gracePeriodSeconds": 300,
  "intervalSeconds": 60,
  "timeoutSeconds": 20,
  "maxConsecutiveFailures": 3,
  "ignoreHttp1xx": false
}
or Mesos HTTP:
{
  "path": "/api/health",
  "portIndex": 0,
  "protocol": "MESOS_HTTP",
  "gracePeriodSeconds": 300,
  "intervalSeconds": 60,
  "timeoutSeconds": 20,
  "maxConsecutiveFailures": 3
}
or secure HTTP:
{
  "path": "/api/health",
  "portIndex": 0,
  "protocol": "HTTPS",
  "gracePeriodSeconds": 300,
  "intervalSeconds": 60,
  "timeoutSeconds": 20,
  "maxConsecutiveFailures": 3,
  "ignoreHttp1xx": false
}
Note: HTTPS health checks do not verify the SSL certificate.
or TCP:
{
  "portIndex": 0,
  "protocol": "TCP",
  "gracePeriodSeconds": 300,
  "intervalSeconds": 60,
  "timeoutSeconds": 20,
  "maxConsecutiveFailures": 0
}
or COMMAND:
{
  "protocol": "COMMAND",
  "command": { "value": "curl -f -X GET http://$HOST:$PORT0/health" },
  "gracePeriodSeconds": 300,
  "intervalSeconds": 60,
  "timeoutSeconds": 20,
  "maxConsecutiveFailures": 3
}
{
  "protocol": "COMMAND",
  "command": { "value": "/bin/bash -c \\\"</dev/tcp/$HOST/$PORT0\\\"" }
}
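For completeness, health checks live under the top-level healthChecks array of the app definition; a minimal (hypothetical) way to attach one of the definitions above to an existing app is via the Marathon REST API, where the app id /my-app and the Marathon address are placeholders:
curl -X PUT http://marathon.example.com:8080/v2/apps/my-app \
  -H 'Content-Type: application/json' \
  -d '{"healthChecks": [{"protocol": "MESOS_HTTP", "path": "/api/health", "portIndex": 0, "gracePeriodSeconds": 300, "intervalSeconds": 60, "timeoutSeconds": 20, "maxConsecutiveFailures": 3}]}'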
Further Information: https://mesosphere.github.io/marathon/docs/health-checks.html

Related

Problem in connecting Apache Superset running inside a Docker container to Kylin

I have Apache Superset running inside a Docker container, and I want to connect it to a running Apache Kylin (not inside Docker).
I am receiving the following error whenever I test the connection with this SQLAlchemy URI: 'kylin://ADMIN#KYLIN#local:7070/test':
[SupersetError(message='(builtins.NoneType) None\n(Background on this error at: http://sqlalche.me/e/13/dbapi)', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'Apache Kylin', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
"POST /api/v1/database/test_connection HTTP/1.1" 422 -
superset_app | 2021-07-02 18:44:17,224:INFO:werkzeug:172.28.0.1 - - [02/Jul/2021 18:44:17] "POST /api/v1/database/test_connection HTTP/1.1" 422 -
You might need to check your superset_app network first.
Use docker inspect [container name], e.g.:
docker inspect superset_app
In my case, it is running in the superset_default network:
"Networks": {
"superset_default": {
.....
}
}
Next, you need to connect your kylin docker container to this network, e.g.:
docker network connect superset_default kylin
Here kylin is your container name.
Now your superset_app and kylin containers are exposed within the same network. You can docker inspect your kylin container:
docker inspect kylin
and find the IPAddress
"Networks": {
"bridge": {
....
},
"superset_default": {
...
"IPAddress": "172.18.0.5",
...
}
}
In Superset, you can now connect to your kylin docker container using this IP address.
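As a rough end-to-end sketch of the steps above (container names, the network name and the resulting IP are just the examples from this answer; the Kylin port and project are placeholders):
docker inspect --format '{{json .NetworkSettings.Networks}}' superset_app            # find the network, e.g. superset_default
docker network connect superset_default kylin
docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' kylin   # e.g. 172.18.0.5
# then use that IP in the SQLAlchemy URI entered in the Superset UI, e.g. kylin://ADMIN:<password>@172.18.0.5:7070/<project>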
We have hosted Dremio and Superset on an AKS cluster in Azure and we are trying to connect Superset to the Dremio database (lakehouse) to fetch some dashboards. We have installed all the required drivers (arrowflight, sqlalchemy_dremio and unixodbc-dev) to establish the connection.
Strangely, we are not able to connect to Dremio from the Superset UI using the connection strings:
dremio+flight://admin:password#dremiohostname.westeurope.cloudapp.azure.com:32010/dremio
dremio://admin:adminpass#dremiohostname.westeurope.cloudapp.azure.com:31010/databaseschema.dataset/dremio?SSL=0
Here’s the error:
(builtins.NoneType) None\n(Background on this error at: https://sqlalche.me/e/14/dbapi)", "error_type": "GENERIC_DB_ENGINE_ERROR", "level": "error", "extra": {"engine_name": "Dremio", "issue_codes": [{"code": 1002, "message": "Issue 1002 - The database returned an unexpected error."}]}}]
However, when trying from inside the Superset pod, using the Python script linked [here][1], the connection goes through without any issues.
PS: One point to note is that we have not enabled SSL certificates for our hostnames.

Trino server is accepting connection, but is still initializing. How to wait until Trino is ready to receive queries?

For some background, we have included a Trino server as part of our CI setup, and tests currently fail while the server is still adding all of the catalogs. Currently, I have set up our CI to retry this curl command, but it does not wait until the server is fully started.
docker run appropriate/curl --retry 60 --retry-delay 1 --retry-connrefused http://trino:8080/
Trino responds before it is fully initialized, so the tests start failing due to the Trino server error: Trino server is still initializing.
To check whether a Trino server is still initializing, you can query information about the cluster by connecting to the coordinator. The following command shows information about the cluster.
docker run appropriate/curl --retry 60 --retry-delay 1 --retry-connrefused http://trino:8080/v1/info
Depending on the state of the server it will return something like the following JSON.
{
  "nodeVersion": {
    "version": "358"
  },
  "environment": "test",
  "coordinator": true,
  "starting": false,
  "uptime": "7.56m"
}
You may use the $.starting JSON field to determine whether the server is alive and ready for queries. A simple Ruby script that tries 60 times before giving up is provided below.
cat <<"RUBY" | ruby
60.times do
exit if `docker run appropriate/curl http://trino:8080/v1/info | jq .starting`.strip == 'false'
sleep 1
end
exit!
RUBY
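Since the CI job already shells out to curl, an equivalent loop in plain shell (an untested sketch; it assumes jq is available on the CI host, as in the Ruby version) could be:
for i in $(seq 1 60); do
  # .starting flips to false once the coordinator has finished loading catalogs
  if [ "$(docker run appropriate/curl -s http://trino:8080/v1/info | jq -r .starting)" = "false" ]; then
    exit 0
  fi
  sleep 1
done
exit 1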

Docker Desktop for windows fails on search/build with corporate proxy

I have installed Docker Desktop for Windows (Docker version 18.09.2, build 6247962), and I'm not able to build an image. Even a docker search does not seem to work.
The error message (for example, when doing a docker search) is:
Error response from daemon: Get https://index.docker.io/v1/search?q=ubuntu&n=25: proxyconnect tcp: dial tcp 172.17.14.133:3128: connect: no route to host
My office is behind a proxy. So in the "Proxies" settings of Docker Desktop I set http://172.17.14.133:3128 for both HTTP and HTTPS. But it still does not seem to work.
I have defined some ENV variables (both user and system) like this:
HTTPS_PROXY=http://proxypmi.tradyso.com:3128
HTTP_PROXY=http://proxypmi.tradyso.com:3128
Also:
C:\Users\my.user\AppData\Roaming\Docker\http_proxy.json:
{
  "http": "http://172.17.14.133:3128",
  "https": "http://172.17.14.133:3128",
  "exclude": null,
  "transparent_http_ports": [],
  "transparent_https_ports": []
}
C:\Users\my.user\AppData\Roaming\Docker\settings.json:
{
  "settingsVersion": 1,
  "autoStart": false,
  "checkForUpdates": true,
  "analyticsEnabled": false,
  "displayedWelcomeWhale": true,
  "displayed14393Deprecation": false,
  "displayRestartDialog": true,
  "displaySwitchWinLinContainers": true,
  "latestBannerKey": "",
  "debug": false,
  "memoryMiB": 2048,
  "swapMiB": 1024,
  "cpus": 2,
  "diskPath": null,
  "diskSizeMiB": 64000000000,
  "networkCIDR": "10.0.75.0/24",
  "proxyHttpMode": true,
  "overrideProxyHttp": "http://172.17.14.133:3128",
  "overrideProxyHttps": "http://172.17.14.133:3128",
  "overrideProxyExclude": null,
  "useDnsForwarder": true,
  "dns": "10.44.24.10",
  "kubernetesEnabled": false,
  "showKubernetesSystemContainers": false,
  "kubernetesInitialInstallPerformed": false,
  "cliConfigCreationDate": "03/22/2019 12:23:58",
  "linuxDaemonConfigCreationDate": "03/22/2019 12:22:19",
  "windowsDaemonConfigCreationDate": null,
  "versionPack": "default",
  "sharedDrives": {},
  "executableDate": "",
  "useWindowsContainers": false,
  "swarmFederationExplicitlyLoggedOut": false,
  "activeOrganizationName": null,
  "exposeDockerAPIOnTCP2375": false
}
C:\Users\my.user\.docker\config.json:
{
  "stackOrchestrator": "swarm",
  "auths": {},
  "credsStore": "wincred",
  "proxies": {
    "default": {
      "httpProxy": "http://172.17.14.133:3128",
      "httpsProxy": "http://172.17.14.133:3128",
      "noProxy": ""
    }
  }
}
I also tried passing build-arg to docker build:
docker build --build-arg HTTP_PROXY=http://172.17.14.133:3128 --build-arg HTTPS_PROXY=http://172.17.14.133:3128 ...
Finally, in the Docker Desktop network configuration, I have tried with DNS set to both "Automatic" and "Manual" (using my corporate DNS servers).
None of this has worked.
Any hint on what I should do?
Thank you.
A colleague found out the problem:
By default, the bridge network that Docker creates uses the same subnet as our office (172.17.0.0/16), and that causes trouble with the proxy IP address (172.17.14.133).
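To confirm you are hitting this clash yourself, you can compare the default bridge subnet with the proxy address (standard Docker CLI commands; the addresses are the ones from this answer):
docker network inspect --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}' bridge   # e.g. 172.17.0.0/16
# if the proxy (172.17.14.133) falls inside that subnet, traffic to it never leaves the bridge,
# which is why the daemon reports "no route to host"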
To solve this:
[Edit]: for a simpler method, use the following:
In the daemon configuration, add "bip": "new_subnet". For example: "bip": "172.19.0.1/16". Then restart Docker.
Now you won't even need to pass the extra --network="docker_gwbridge" parameter to the commands.
This shorter solution may not work with older versions of Docker for Windows, so you may consider the original answer below if this option does not work.
[Edit]: original method for old versions of docker for windows:
The bridge network cannot be deleted, but we can tell Docker not to create it.
Go to Daemon Settings, Advanced => add "bridge": "none", to the configuration.
After applying the changes, Docker will restart and we will be able to create our own custom bridge network.
In this example, we are going to use 172.19.0.0/16.
Open a console and write:
docker network create --subnet=172.19.0.0/16 --gateway 172.19.0.1 -o com.docker.network.bridge.enable_icc=true -o com.docker.network.bridge.name=docker_gwbridge -o com.docker.network.bridge.enable_ip_masquerade=true docker_gwbridge
Then we can do a docker network ls to check that the previous command was successful:
$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
2a3635a49ffa        docker_gwbridge     bridge              local
4e9ec758ee9f        host                host                local
737c6c5fc82b        none                null                local
Now do a docker search ubuntu (for example). It should be able to connect to the internet and fetch the images.
Important: From now on, some commands that need internet access will need to specify the new docker network with the extra parameter --network="docker_gwbridge"
For example a docker build command could be:
docker build --network="docker_gwbridge" --build-arg http_proxy=http://172.17.14.133:3128 --build-arg https_proxy=http://172.17.14.133:3128 -t foobar .

How to disable apache mesos memory/disk isolation?

I am checking out Apache Aurora (0.16.0) and Apache Mesos (1.1.0) with docker containers. Here is an example Aurora job definition,
process_nginx = Process(
    name='nginx',
    cmdline=textwrap.dedent(r'''
        exec /path_to/nginx -g "daemon off; pid /run/nginx.pid; error_log stderr notice;"
    '''),
    min_duration=3,
    daemon=True,
)
task_nginx = Task(
    name='nginx',
    processes=[process_nginx,],
    resources=Resources(
        cpu=0.1,
        ram=20*MB,
        disk=50*MB,
    ),
    finalization_wait=14,
)
job_nginx = Job(
    cluster='x',
    role='root',
    name='nginx',
    instances=6,
    service=True,
    task=task_nginx,
    priority=1,
    #tier='preferred',
    constraints={
        'X_HOST_MACHINE_ID': 'limit:2',
        'HOST_TYPE.FRONTEND': 'true',
    },
    update_config=UpdateConfig(
        batch_size=1,
        watch_secs=29,
        rollback_on_failure=True,
    ),
    container=Docker(
        image='my_nginx_docker_image_name',
        parameters=[
            {'name': 'network', 'value': 'host'},
            {'name': 'log-driver', 'value': 'journald'},
            {'name': 'log-opt', 'value': 'tag=nginx'},
            {'name': 'oom-score-adj', 'value': '-500'},
            {'name': 'memory-swappiness', 'value': '1'},
        ],
    ),
)
But since specifying disk and ram limits bothers me, I want to disable both.
problem 1
I thought only the CPU resource would be isolated (=limited) if all my mesos agents were launched with the option --isolation=cgroups/cpu (not --isolation=cgroups/cpu,cgroups/mem).
But even in this case, all docker containers launched by the mesos docker containerizer get the --memory option, which is a hard limit and triggers the OOM killer if a docker container requires more memory. (And it seems the mesos docker containerizer does not support --memory-reservation.)
problem 2
Even in the case of --isolation=cgroups/cpu, removing the ram or disk parameter from the Aurora Resources instance causes the following error.
Error loading configuration: TypeCheck(FAILED): MesosJob[task] failed: Task[resources] failed: Resources[ram] is required.
My question
Is it possible to disable memory and disk isolation?
What is the difference between --isolation=cgroups/cpu and --isolation=cgroups/cpu,cgroups/mem?
As you've discovered, you can disable the memory and disk isolators in Mesos by not specifying them as part of the isolation agent flag. I'm unsure about the behavior of the Docker Containerizer in this scenario, but you might want to try using the Mesos Containerizer instead, as this is the preferred way to run Docker images in Mesos going forward.
As far as omitting the Resources from your Aurora config goes, unfortunately that won't be possible. Every Aurora job must specify its resource requirements so that the scheduler can match your task instances up with an offer from Mesos.
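For illustration, an agent command line along the lines of this answer might look like the following (only a sketch; the ZooKeeper address and work dir are placeholders, and you should check the Mesos agent docs for the exact isolators your version supports):
mesos-agent \
  --master=zk://zk1:2181/mesos \
  --work_dir=/var/lib/mesos \
  --containerizers=mesos,docker \
  --image_providers=docker \
  --isolation=cgroups/cpu,filesystem/linux,docker/runtime   # note: no cgroups/mem and no disk isolator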

Recovering from Consul "No Cluster leader" state

I have:
one mesos-master on which I configured a consul server;
one mesos-slave on which I configured a consul client; and
one bootstrap server for consul.
When I hit start I am seeing the following error:
2016/04/21 19:31:31 [ERR] agent: failed to sync remote state: rpc error: No cluster leader
2016/04/21 19:31:44 [ERR] agent: coordinate update error: rpc error: No cluster leader
How do I recover from this state?
Did you look at the Consul docs?
It looks like you have performed an ungraceful stop and now need to clean your raft/peers.json file by removing all entries there to perform an outage recovery. See the above link for more details.
As of Consul 0.7 things work differently from Keyan P's answer. raft/peers.json (in the Consul data dir) has become a manual recovery mechanism. It doesn't exist unless you create it, and then when Consul starts it loads the file and deletes it from the filesystem so it won't be read on future starts. There are instructions in raft/peers.info. Note that if you delete raft/peers.info it won't read raft/peers.json but it will delete it anyway, and it will recreate raft/peers.info. The log will indicate when it's reading and deleting the file separately.
Assuming you've already tried the bootstrap or bootstrap_expect settings, that file might help. The Outage Recovery guide in Keyan P's answer is a helpful link. You create raft/peers.json in the data dir and start Consul, and the log should indicate that it's reading/deleting the file and then it should say something like "cluster leadership acquired". The file contents are:
[ { "id": "<node-id>", "address": "<node-ip>:8300", "non_voter": false } ]
where <node-id> can be found in the node-id file in the data dir.
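A condensed sketch of that recovery flow (the data dir path is the /opt/consul/data one used later in this thread; adjust it to your setup, and stop the agent before writing the file):
sudo service consul stop
cat /opt/consul/data/node-id                       # this is the <node-id> to put into the file
cat > /opt/consul/data/raft/peers.json <<'EOF'
[ { "id": "<node-id>", "address": "<node-ip>:8300", "non_voter": false } ]
EOF
sudo service consul start                          # the log should show the file being read and then deleted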
If your raft version is greater than 2:
[
  {
    "id": "e3a30829-9849-bad7-32bc-11be85a49200",
    "address": "10.88.0.59:8300",
    "non_voter": false
  },
  {
    "id": "326d7d5c-1c78-7d38-a306-e65988d5e9a3",
    "address": "10.88.0.45:8300",
    "non_voter": false
  },
  {
    "id": "a8d60750-4b33-99d7-1185-b3c6d7458d4f",
    "address": "10.233.103.119",
    "non_voter": false
  }
]
In my case, I had 2 worker nodes in the k8s cluster; after adding another node, the consul servers could elect a leader and everything came up and running.
I will update with what I did:
Little background: we scaled down the AWS Auto Scaling group, so we lost the leader, but we still had one server running without any leader.
What I did was:
Scale up to 3 servers (don't use 2 or 4).
Stop consul on all 3 servers: sudo service consul stop (you can do status/stop/start).
Create a peers.json file and put it on the old server (/opt/consul/data/raft).
Start the 3 servers (peers.json should be placed on 1 server only).
For the other 2 servers, join them to the leader using consul join 10.201.8.XXX.
Check that the peers are connected to the leader using consul operator raft list-peers.
Sample peers.json file
[
  {
    "id": "306efa34-1c9c-acff-1226-538vvvvvv",
    "address": "10.201.n.vvv:8300",
    "non_voter": false
  },
  {
    "id": "dbeeffce-c93e-8678-de97-b7",
    "address": "10.201.X.XXX:8300",
    "non_voter": false
  },
  {
    "id": "62d77513-e016-946b-e9bf-0149",
    "address": "10.201.X.XXX:8300",
    "non_voter": false
  }
]
These ids you can get from each server in /opt/consul/data/:
[root@ip-10-20 data]# ls
checkpoint-signature  node-id  raft  serf
[root@ip-10-1 data]# cat node-id
Some useful commands:
consul members
curl http://ip:8500/v1/status/peers
curl http://ip:8500/v1/status/leader
consul operator raft list-peers
cd opt/consul/data/raft/
consul info
sudo service consul status
consul catalog services
You may also ensure that the bootstrap parameter is set in your Consul configuration file config.json on the first node:
# /etc/consul/config.json
{
  "bootstrap": true,
  ...
}
or start the consul agent with the -bootstrap=1 option, as described in the official "Failure of a Single Server Cluster" Consul documentation.
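For example (only a sketch; the config path, data dir and bind address are illustrative), the first server could be started with:
consul agent -server -bootstrap \
  -data-dir=/opt/consul/data \
  -config-dir=/etc/consul \
  -bind=10.201.8.10
which is equivalent to setting "bootstrap": true in config.json as shown above.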
