Marathon-lb installed successfully, but DC/OS has been unable to complete the marathon-lb deployment for more than an hour - mesos

I installed DC/OS locally and installed marathon-lb on top of it. The installation completed, but DC/OS has been unable to complete the marathon-lb deployment. It is still either waiting or delayed.

At least one node in your cluster needs to be installed as a public agent (slave_public). In that node's properties you'll see:
public_ip: true
marathon-lb tasks are supposed to run on nodes with:
RESOURCE ROLES: slave_public
as you can see on the marathon-lb configuration page. Such nodes should be facing the internet.

Related

CDH cluster installation failing in "distributing" stage - failed due to stall on seeded torrent

Hi,
We are trying to install a CDH cluster on a remote Red Hat 7 server using the cloudera-installer.bin file, in standalone mode (we have only one host). We specify the hostname/IP address of the machine during installation, and the installer is able to resolve it. But the installation halts during the parcel distribution stage. Here are the logs from cloudera-scm-agent (we tried both the Cloudera Express edition and the Enterprise trial version):
['http://INHUSZ1-V250152:7180/cmf/parcel/download/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel'] location=/opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel progress=0]
[03/Oct/2018 10:11:55 +0000] 28315 Thread-13 downloader INFO Current state: CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel [totalDownloaded=0 totalSize=2120090032 upload=0 state=downloading seed=['http://INHUSZ1-V250152:7180/cmf/parcel/download/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel'] location=/opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel progress=0]
[03/Oct/2018 10:11:57 +0000] 28315 Thread-13 downloader INFO Current state: CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel [totalDownloaded=0 totalSize=2120090032 upload=0 state=downloading seed=['http://INHUSZ1-V250152:7180/cmf/parcel/download/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel'] location=/opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel progress=0]
[03/Oct/2018 10:11:59 +0000] 28315 Thread-13 downloader INFO Current state: CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel [totalDownloaded=0 totalSize=2120090032 upload=0 state=downloading seed=['http://INHUSZ1-V250152:7180/cmf/parcel/download/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel'] location=/opt/cloudera/parcels/.flood/CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel progress=0]
Please let us know what can be done.
I just had the same error message and stall during install at the parcel distribution stage.
I was installing a single-node (test) cluster on CentOS 7.5 with CDH Express 5.15.
The solution that worked for me was adding the node's IP and FQDN to /etc/hosts (previously it only contained entries for 127.0.0.1 localhost):
[root@mynode ~]# vi /etc/hosts
192.168.1.1 myhostname.mydomain
Then I restarted the Cloudera SCM Agent:
[root@mynode ~]# service cloudera-scm-agent restart
Installation then continued successfully.
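To check for this condition before editing anything, a small helper can test whether a hosts file maps a name to a non-loopback address. This is only a sketch: the function name has_host_entry is made up for illustration, and it assumes the usual "address name [alias...]" hosts file layout.

```shell
#!/bin/sh
# has_host_entry HOSTS_FILE NAME  (hypothetical helper, not part of Cloudera)
# Succeeds if HOSTS_FILE maps NAME to an address that is not a loopback one.
has_host_entry() {
    awk -v name="$2" '
        /^[ \t]*#/ { next }                    # skip comment lines
        $1 ~ /^127\./ || $1 == "::1" { next }  # skip loopback addresses
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break           # stop at a trailing comment
                if ($i == name) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Typical use on the node being installed:
#   has_host_entry /etc/hosts "$(hostname -f)" || echo "add the node IP and FQDN to /etc/hosts"
```

If the check fails, add the line as described above and restart the agent.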
Do the following:
Stop all services.
Deactivate all in-use parcels.
Shut down the Cloudera Manager Agent on all hosts.
Move the existing parcels to the new location.
Configure the host parcel directory.
Start the Cloudera Manager Agents.
Activate the parcels.
Start all services.
Delete the corresponding parcel files, including the .torrent file, from the folder below:
/opt/cloudera/parcels/.flood/
Then download and distribute the parcel again.
This happens because the .torrent file is corrupted.
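The cleanup could be scripted roughly as follows. This is a sketch under assumptions: remove_flood_files is a hypothetical helper name, and it assumes the parcel's entries in .flood are named after the parcel itself (for example <parcel> and <parcel>.torrent). Check the directory contents before running it.

```shell
#!/bin/sh
# remove_flood_files FLOOD_DIR PARCEL_NAME  (hypothetical helper)
# Removes FLOOD_DIR/PARCEL_NAME and companions such as PARCEL_NAME.torrent.
remove_flood_files() {
    flood_dir=$1
    parcel=$2
    for f in "$flood_dir/$parcel" "$flood_dir/$parcel".*; do
        [ -e "$f" ] || continue    # skip glob patterns that matched nothing
        echo "removing $f"
        rm -rf -- "$f"
    done
}

# Typical use, with the agent stopped while the files are removed:
#   service cloudera-scm-agent stop
#   remove_flood_files /opt/cloudera/parcels/.flood CDH-5.15.1-1.cdh5.15.1.p0.4-el7.parcel
#   service cloudera-scm-agent start
```

Afterwards, trigger the download and distribution again from Cloudera Manager.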

Could not delete a DC/OS service that failed to deploy

I deployed a service in DC/OS (the service is cassandra). The deployment failed and it kept retrying. Under DC/OS > Services > Tasks I could see a new task being created every few minutes, but they all had the status "Failed". Under the Debug tab I could see the TASK_FAILED state with an error message about how I had misconfigured the service (I picked a user that does not exist).
So I wanted to destroy the service and start over again.
Under Services, I clicked on the menu on the service and selected "Delete". The command was accepted, and the Status changed to "Deleting". But then it stayed there forever.
When I checked the Tasks tab, I could see that DC/OS was still attempting to start the service every few minutes.
Now how do I delete the service? Thanks!
As per the latest DC/OS Cassandra service docs, you should uninstall it using the DC/OS CLI:
dcos package uninstall --app-id=<service-name> cassandra
If you are using DC/OS 1.9 or an older version, follow the steps below to uninstall the service:
$ MY_SERVICE_NAME=<service-name>
$ dcos package uninstall --app-id=$MY_SERVICE_NAME cassandra
$ dcos node ssh --master-proxy --leader "docker run mesosphere/janitor /janitor.py \
-r $MY_SERVICE_NAME-role \
-p $MY_SERVICE_NAME-principal \
-z dcos-service-$MY_SERVICE_NAME"

Upgrade MariaDB Cluster 10.1 to 10.2

I'm planning to upgrade a MariaDB Galera cluster from 10.1 to 10.2. Does anyone have detailed steps for the upgrade? My idea is something like this:
Backup
Shutdown cluster
Uninstall 10.1 from each node
Install 10.2 to each node
Run mysql_upgrade on the node that is going to be started first
Configure the first node and start it
Configure the rest of the nodes and start them
I have a three-node cluster with MaxScale load balancing.
You can upgrade the cluster in a rolling fashion, i.e. one node at a time without shutting down the others. That is one of the benefits of Galera cluster.
Make sure to avoid 10.2.9 or be ready to edit mysqld_safe, see here.
For each node:
maxadmin: set server $node-name maintenance
Backup databases and config files
Shutdown the mysqld instance
Uninstall 10.1. On Redhat use rpm -e --nodeps rather than yum remove to avoid uninstalling packages such as postfix and cronie.
Install 10.2
Copy back config files, change any mariadb-10.1 sections to mariadb-10.2
Startup the mysqld instance
If you're on Redhat, CentOS or Fedora run mysql_upgrade
maxadmin: clear server $node-name maintenance
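When moving from one node to the next, it can help to confirm that the node just upgraded has rejoined the cluster before clearing maintenance mode. The helper below is a sketch only: node_is_synced is a hypothetical name, and it assumes the mysql client can log in on the node. It relies on the Galera status variable wsrep_local_state_comment, which reads Synced once the node has caught up.

```shell
#!/bin/sh
# node_is_synced  (hypothetical helper)
# Reads the tab-separated output of
#   mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
# on stdin and succeeds only if the node reports the Synced state.
node_is_synced() {
    awk '$1 == "wsrep_local_state_comment" { state = $2 }
         END { exit (state == "Synced" ? 0 : 1) }'
}

# Typical use on the node that was just upgraded and restarted:
#   until mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" | node_is_synced; do
#       sleep 5
#   done
```

Only after the loop finishes would the node be taken out of maintenance in MaxScale.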

What docker images does DCOS Flink package require?

I have built a DCOS local universe and installed it into a cluster behind a firewall - there is no internet access to the cluster. One of the packages installed in the universe is Flink. I have installed DCOS using the cluster_docker_registry_url variable pointing at a local Docker registry which has a very small number of packages on it; it is not a mirror of the main Docker Hub.
When I try to install the Flink package into DCOS, I get 404 errors in the Mesos logs relating to missing docker images that I assume the package tries to download from the local Docker registry. The Flink cluster fails to start.
What Docker images does the Flink package try to download? I thought the build process of a local universe pulled all dependencies down when it is built, so there should be no external dependencies once it's built? What do I need to do to be able to install DCOS when there is no internet access?
That depends on the Scala version you are using:
Scala 2.10: mesosphere/dcos-flink:1.2.0-1.4
Scala 2.11: mesosphere/dcos-flink-2-11:1.2.0-1.4
See here
Furthermore, it requires
openjdk:8-jre, see here
For more details feel free to refer to the universe specification for the Apache Flink service (or ping me directly):
https://github.com/mesosphere/universe/blob/version-3.x/repo/packages/F/flink/1/
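One way to get those images into the local registry is to pull them on a machine with internet access, retag them, and push them. The sketch below only prints the docker commands so they can be reviewed first. The function name mirror_cmds and the registry address registry.example.local:5000 are made-up placeholders, and whether an official image such as openjdk needs a library/ prefix in your registry depends on how the registry is addressed, so treat the target names as an assumption.

```shell
#!/bin/sh
# mirror_cmds REGISTRY IMAGE...  (hypothetical helper)
# Prints the docker commands that would copy each IMAGE into REGISTRY.
mirror_cmds() {
    registry=$1
    shift
    for image in "$@"; do
        echo "docker pull $image"
        echo "docker tag $image $registry/$image"
        echo "docker push $registry/$image"
    done
}

# Placeholder registry address; replace it with your own.
mirror_cmds registry.example.local:5000 \
    mesosphere/dcos-flink:1.2.0-1.4 \
    openjdk:8-jre
```

Piping the output to sh on the connected machine would execute the commands. Use the dcos-flink-2-11 image instead if you are on Scala 2.11.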

Impala The Cloudera Manager Agent got an unexpected response from this role's web server

I have installed a Hadoop cluster with Cloudera Manager. After the installation, the Impala status became bad.
I have the following error for the master node:
Web Server Status
and this one for the nodes running the Impala daemon:
Impala Daemon Ready Check, Web Server Status
Looking into the logs, I found some errors:
The health test result for IMPALAD_WEB_METRIC_COLLECTION has become bad: The Cloudera Manager Agent got an unexpected response from this role's web server.
Looking into cloudera-scm-agent.log, there are these errors:
1261 Monitor-HostMonitor throttling_logger ERROR (29 skipped) Failed to collect NTP metrics
I tried to install NTP (sudo apt-get install ntp), but after this installation HDFS, Hive, YARN and other services went bad; after removing it, only Impala is bad.
MainThread agent ERROR Failed to connect to previous supervisor.
Another error is this:
Monitor-GenericMonitor throttling_logger ERROR Error fetching metrics at 'http://nodo-1:50075/jmx'
I checked all the hostnames and they seem correct...
So, what is this problem? How can I solve it?
I also had a problem with NTP. The problem still existed after installing NTP, but when I ran sudo service ntp restart the error was fixed.
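Before and after restarting the service, it may be worth checking whether ntpd has actually selected a time source. In ntpq -p output, the peer currently used for synchronization is marked with a leading "*". The helper name below is hypothetical, and whether this is what the Cloudera agent's NTP check looks at is an assumption.

```shell
#!/bin/sh
# ntp_has_sync_peer  (hypothetical helper)
# Reads `ntpq -pn` output on stdin and succeeds if some peer line starts
# with the "*" tally code, which marks the selected time source.
ntp_has_sync_peer() {
    grep -q '^\*'
}

# Typical use on a cluster host (requires the ntp package):
#   ntpq -pn | ntp_has_sync_peer || sudo service ntp restart
```

Note that ntpd can take a few minutes after a restart to select a peer, so an immediate re-check may still fail.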
