Could not delete DC/OS service that failed to deploy - mesos

I deployed a service in DC/OS (the service is cassandra). The deployment failed and it kept retrying. Under DC/OS > Services > Tasks I could see a new task being created every few minutes, but they all had the status "Failed". Under the Debug tab I could see the TASK_FAILED state with an error message about how I had misconfigured the service (I picked a user that does not exist).
So I wanted to destroy the service and start over again.
Under Services, I clicked the menu on the service and selected "Delete". The command was accepted and the status changed to "Deleting", but then it stayed there forever.
If I checked the Tasks tab, I could see that DC/OS was still attempting to start the server every few minutes.
Now how do I delete the service? Thanks!

As per the latest DC/OS Cassandra service docs, you should uninstall it using the dcos CLI:
dcos package uninstall --app-id=<service-name> cassandra
If you are using DC/OS 1.9 or an older version, follow the steps below to uninstall the service:
$ MY_SERVICE_NAME=<service-name>
$ dcos package uninstall --app-id=$MY_SERVICE_NAME cassandra
$ dcos node ssh --master-proxy --leader "docker run mesosphere/janitor /janitor.py \
-r $MY_SERVICE_NAME-role \
-p $MY_SERVICE_NAME-principal \
-z dcos-service-$MY_SERVICE_NAME"
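Once the uninstall and the janitor run have completed, one way to confirm the service is really gone (this check is my own suggestion, not from the docs) is to list the Marathon apps and verify it no longer appears:
$ dcos marathon app list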

helm chart installation of cert-manager timed out for AKS

I'm trying to create two Kubernetes (AKS) clusters for a test environment using Azure DevOps. These clusters use Let's Encrypt certificates for their endpoints, so I'm automating the creation of these certificates using Helm charts.
For some reason, the cert-manager installation Helm task times out if I create two clusters around the same time.
I have tested the same release process with a single cluster, and there isn't a problem when I run my deployment.
The helm cert-manager installation command that runs is:
c:\agent\_work\_tool\helm\2.11.0\x64\windows-amd64\helm.exe install --set ingressShim.defaultIssuerName=letsencrypt-prod,ingressShim.defaultIssuerKind=ClusterIssuer,rbac.create=false,serviceAccount.create=false --name appl-cert-manager --wait stable/cert-manager
As I said, this command succeeds for the 1st cluster. I receive the message:
16:20:26.4583241Z cert-manager has been deployed successfully!
However, the second command takes about 5 minutes. Then I receive this message:
2018-11-08T16:28:14.4988796Z ##[error]Error: release appl-cert-manager failed: timed out waiting for the condition
Is this happening because the name has to be globally unique?
thanks
In case someone has the same problem: there is a simple solution that works consistently for me.
Add a timeout argument to helm, for example:
--timeout 600
which sets what I assume is a 10-minute timeout.
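Assuming the same install command as in the question, the full invocation would then look something like this (in Helm 2, --timeout is given in seconds, so 600 is 10 minutes):
helm install --set ingressShim.defaultIssuerName=letsencrypt-prod,ingressShim.defaultIssuerKind=ClusterIssuer,rbac.create=false,serviceAccount.create=false --name appl-cert-manager --wait --timeout 600 stable/cert-manager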

ICP fails to start after machine reboot

I have ICP V2.1 installed in a RHEL VMware image. After rebooting the image, ICP fails to start with what appears to be the first known issue in the documentation (Kubernetes controller manager fails to start after a master or cluster restart). However, the prescribed resolution does not get my system going.
Here is the running pod list:
NAME READY STATUS RESTARTS AGE
calico-node-amd64-dtl47 2/2 Running 14 20h
filebeat-ds-amd64-mvcsj 1/1 Running 8 20h
k8s-etcd-192.168.232.131 1/1 Running 7 20h
k8s-mariadb-192.168.232.131 1/1 Running 7 20h
k8s-master-192.168.232.131 2/3 CrashLoopBackOff 15 17m
k8s-proxy-192.168.232.131 1/1 Running 7 20h
metering-reader-amd64-gkwt4 1/1 Running 7 20h
monitoring-prometheus-nodeexporter-amd64-sghrv 1/1 Running 7 20h
Removing the k8s-master-192.168.232.131 pod and allowing it to restart only puts it back into the CrashLoopBackOff state. Here is the last line of the controller manager log:
F1029 23:55:07.345341 1 controllermanager.go:176] error building controller context: failed to get supported resources from server: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1alpha1: an error on the server ("Error: 'dial tcp 10.0.0.145:443: getsockopt: connection refused'\nTrying to reach: 'https://10.0.0.145:443/apis/servicecatalog.k8s.io/v1alpha1'") has prevented the request from succeeding
Removing the pod, or removing the failed controller-manager Docker container directly, has no effect. It seems like another service hasn't started yet, or failed to start. I've waited several hours to see if the issue resolves itself, but to no avail.
Thanks...
Before the fix from https://github.com/kubernetes/kubernetes/pull/49495, the Kubernetes controller manager failed to start if a registered extension apiserver was not ready. In ICP, the service catalog is implemented as an extension apiserver.
Usually, after the ICP master is restarted, kubelet starts the Kubernetes management services first, as static pods. After that, it gets the pod/node/service information from the Kubernetes API server and then starts all the pods, including the catalog API service. In that case, the whole cluster recovers.
In your case, however, there is a race condition: when kubelet gets the pod information from the Kubernetes API server and starts all the pods, it has not yet received the node information. As a result, kubelet fails to start the catalog API service because its nodeSelector is not met, and the whole cluster fails to recover.
In the next release, ICP 2.1.0.1, Kubernetes will be upgraded to 1.8.2, which includes the fix from https://github.com/kubernetes/kubernetes/pull/49495 and resolves this issue completely.
Until then, you can try the following workaround.
Use the -s form of each kubectl command below if your token has expired after the restart and you no longer have access to the GUI to re-establish it.
Delete the apiservice v1alpha1.servicecatalog.k8s.io:
kubectl delete apiservices v1alpha1.servicecatalog.k8s.io
kubectl -s 127.0.0.1:8888 delete apiservices v1alpha1.servicecatalog.k8s.io
Delete the dead controller manager:
docker rm <k8s controller manager>
Wait until the service catalog has started.
Recover the service catalog apiservice by re-registering v1alpha1.servicecatalog.k8s.io:
kubectl apply -f cluster/cfc-components/service-catalog/apiregistration.yaml
kubectl -s 127.0.0.1:8888 apply -f cluster/cfc-components/service-catalog/apiregistration.yaml
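One way to tell when the service catalog has come back (the "catalog" pod-name pattern here is an assumption; adjust it to your deployment) is to watch the pods until the catalog API pod is Running:
kubectl -s 127.0.0.1:8888 get pods --all-namespaces | grep catalog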

DCOS Slaves : add placement constraints

I have installed a DC/OS cluster following the guide at https://dcos.io/docs/1.10/installing/custom/advanced/.
Now the DC/OS cluster is up and running. I want to add "placement constraints" for the applications hosted on top of the DC/OS cluster.
I added the parameter MESOS_ATTRIBUTES=SPACE:RACK1 to the /opt/mesosphere/etc/mesos-slave-common file. After I added it, I could not bring the dcos-mesos-slave service up again.
Could you please advise how to approach this with the above DC/OS installation method?
etc # cat mesos-slave-common
MESOS_MASTER=zk://zk-1.zk:2181,zk-2.zk:2181,zk-3.zk:2181,zk-4.zk:2181,zk-5.zk:2181/mesos
MESOS_CONTAINERIZERS=docker,mesos
MESOS_EXTERNAL_LOG_FILE=/var/log/mesos/mesos-agent.log
MESOS_MODULES_DIR=/opt/mesosphere/etc/mesos-slave-modules
MESOS_CONTAINER_LOGGER=com_mesosphere_mesos_JournaldLogger
MESOS_ISOLATION=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,docker/volume,volume/sandbox_path,volume/secret,posix/rlimits,namespaces/pid,linux/capabilities,com_mesosphere_MetricsIsolatorModule,cgroups/devices,gpu/nvidia
MESOS_DOCKER_VOLUME_CHECKPOINT_DIR=/var/lib/mesos/isolators/docker/volume
MESOS_IMAGE_PROVIDERS=docker
MESOS_NETWORK_CNI_CONFIG_DIR=/opt/mesosphere/etc/dcos/network/cni
MESOS_NETWORK_CNI_PLUGINS_DIR=/opt/mesosphere/active/cni/:/opt/mesosphere/active/dcos-cni/:/opt/mesosphere/active/mesos/libexec/mesos
MESOS_WORK_DIR=/var/lib/mesos/slave
MESOS_SLAVE_SUBSYSTEMS=cpu,memory
MESOS_LAUNCHER_DIR=/opt/mesosphere/active/mesos/libexec/mesos
MESOS_EXECUTOR_ENVIRONMENT_VARIABLES=file:///opt/mesosphere/etc/mesos-executor-environment.json
MESOS_EXECUTOR_REGISTRATION_TIMEOUT=10mins
MESOS_CGROUPS_ENABLE_CFS=true
MESOS_CGROUPS_LIMIT_SWAP=false
MESOS_DISALLOW_SHARING_AGENT_PID_NAMESPACE=true
MESOS_DOCKER_REMOVE_DELAY=1hrs
MESOS_DOCKER_STOP_TIMEOUT=20secs
MESOS_DOCKER_STORE_DIR=/var/lib/mesos/slave/store/docker
MESOS_GC_DELAY=2days
MESOS_HOSTNAME_LOOKUP=false
GLOG_drop_log_memory=false
MESOS_ATTRIBUTES=SPACE:RACK1
Mesos attributes should be added to /var/lib/dcos/mesos-slave-common, not /opt/mesosphere/etc/mesos-slave-common. Note that you may need to create this file the first time.
Steps
Stop the slave: systemctl stop dcos-mesos-slave
Add your attributes to /var/lib/dcos/mesos-slave-common
Clean out old live executors: rm -f /var/lib/mesos/slave/meta/slaves/latest
Start the slave: systemctl restart dcos-mesos-slave
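As a rough sketch, /var/lib/dcos/mesos-slave-common would contain the same line used in the question (the attribute name and value are just the ones from above):
MESOS_ATTRIBUTES=SPACE:RACK1
An application can then target that attribute with a Marathon placement constraint in its app definition, for example:
"constraints": [["SPACE", "CLUSTER", "RACK1"]]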

On installation of GitLab, at the final step (sudo service gitlab start), I am facing an issue: the Unicorn web server is not running. What should I do?

While installing GitLab, everything is fine.
But at the final step of the installation, when I run the command
$ sudo service gitlab start
it shows an error: the Unicorn web server is not running, but the GitLab Sidekiq job dispatcher with pid 4182 is running properly.

Hadoop installation: what is "This is comment for WebHCat Service (sic)"

Using Ambari, This is comment for WebHcat Service is the final selection in the “Services Selection” step.
If I don't select this service, then the Customize Services step hangs indefinitely. It doesn't matter which other services are selected.
If I select it, then the Customize Services step functions normally, but the installation will stop on step four with the error message:
“org.apache.ambari.server.controller.spi.SystemException:
An internal system exception occurred:
Configuration with tag version1439256707212 exists for webhcat-site
This is on a clean install, for a single node SLES 11 SP3 server.
What is the service This is comment for WebHcat Service, and why is it a comment instead of a service name?
If this is a fresh install, it's strange that you're getting "configuration already exists" errors. I would try to clean your Ambari Server instance by running:
sudo ambari-server reset
This will reset the postgres database that ambari-server uses, giving you a clean slate to retry another cluster install.
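After the reset, you would typically start Ambari Server again before retrying the cluster install (this extra step is my assumption, not part of the original answer):
sudo ambari-server start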
