Waiting for configuring calico node to node mesh - ICP 2.1.0.2 Installation - ibm-cloud-private

During the installation I encountered this error on an Ubuntu 16.04 single node with the host IP 192.168.240.14:
TASK [network : Ensuring that the calico.yaml file exist] **********************
changed: [localhost]
TASK [network : include] *******************************************************
TASK [network : include] *******************************************************
TASK [network : include] *******************************************************
included: /installer/playbook/roles/network/tasks/calico.yaml for localhost
TASK [network : Enabling calico] ***********************************************
changed: [localhost]
TASK [network : Waiting for configuring calico service] ************************
ok: [localhost -> 192.168.240.14] => (item=192.168.240.14)
TASK [network : Waiting for configuring calico node to node mesh] **************
FAILED - RETRYING: Waiting for configuring calico node to node mesh (100 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (99 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (98 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (97 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (96 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (95 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (94 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (93 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (92 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (91 retries left).
I read that it's possible to disable Calico's node-to-node mesh functionality, but since Calico is installed via ICP, the calicoctl command is not recognized. I couldn't find an option to disable this setting in the config.yaml either.
So far I've tried to disable it by downloading and executing calicoctl separately, but the connection to the cluster can't be established:
user#user:~/Desktop/calicoctl$ ./calicoctl config set nodeToNodeMesh off
Error executing command: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp 127.0.0.1:2379: getsockopt: connection refused
I'm not sure if it's because it tries to dial the loopback IP address instead of 192.168.240.14, or something else. I also don't know whether this can actually fix the issue during the installation.
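From what I can tell, the standalone calicoctl binary reads its etcd endpoint from an environment variable, so presumably something like this would point it at the master instead of the loopback address (untested; the port, scheme, and TLS certificates that ICP's etcd actually uses may differ):
export ETCD_ENDPOINTS=http://192.168.240.14:2379
./calicoctl config set nodeToNodeMesh off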
I'm not very experienced with this and am thankful for any help!
EDIT:
I ran the installation with ICP 2.1.0.1 again and hit the same error, this time with 10 retries, and received the following error message:
TASK [network : Enabling calico] ***********************************************
changed: [localhost]
TASK [network : Waiting for configuring calico service] ************************
ok: [localhost -> 192.168.240.14] => (item=192.168.240.14)
FAILED - RETRYING: Waiting for configuring calico node to node mesh (10 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (9 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (8 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (7 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (6 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (5 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (4 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (3 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (2 retries left).
FAILED - RETRYING: Waiting for configuring calico node to node mesh (1 retries left).
TASK [network : Waiting for configuring calico node to node mesh] **************
fatal: [localhost]: FAILED! => {"attempts": 10, "changed": true, "cmd": "kubectl get pods --show-all --namespace=kube-system |grep configure-calico-mesh", "delta": "0:00:01.343071", "end": "2018-06-20 08:12:28.433186", "failed": true, "rc": 0, "start": "2018-06-20 08:12:27.090115", "stderr": "", "stderr_lines": [], "stdout": "configure-calico-mesh-9f756 0/1 Pending 0 5m", "stdout_lines": ["configure-calico-mesh-9f756 0/1 Pending 0 5m"]}
PLAY RECAP *********************************************************************
192.168.240.14 : ok=168 changed=54 unreachable=0 failed=0
localhost : ok=81 changed=16 unreachable=0 failed=1
Playbook run took 0 days, 0 hours, 19 minutes, 8 seconds
user#user:/opt/ibm-cloud-private-ce-2.1.0.1/cluster$
I don't understand why localhost is suddenly included in the setup steps, since I only specified my IP address in the hosts file:
[master]
192.168.240.14 ansible_user="user" ansible_ssh_pass="6CEd29CN" ansible_become=true ansible_become_pass="6CEd29CN" ansible_port="22" ansible_ssh_common_args="-oPubkeyAuthentication=no"
[worker]
192.168.240.14 ansible_user="user" ansible_ssh_pass="6CEd29CN" ansible_become=true ansible_become_pass="6CEd29CN" ansible_port="22" ansible_ssh_common_args="-oPubkeyAuthentication=no"
[proxy]
192.168.240.14 ansible_user="user" ansible_ssh_pass="6CEd29CN" ansible_become=true ansible_become_pass="6CEd29CN" ansible_port="22" ansible_ssh_common_args="-oPubkeyAuthentication=no"
#[management]
#4.4.4.4
#[va]
#5.5.5.5
My config.yaml file looks like this:
# Licensed Materials - Property of IBM
# IBM Cloud private
# Copyright IBM Corp. 2017 All Rights Reserved
# US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
---
###### docker0: 172.17.0.1
###### eth0: 192.168.240.14
## Network Settings
#network_type: calico
# network_helm_chart_path: < helm chart path >
## Network in IPv4 CIDR format
network_cidr: 10.1.0.0/16
## Kubernetes Settings
service_cluster_ip_range: 10.0.0.1/24
## Makes the Kubelet start if swap is enabled on the node. Remove
## this if your production env want to disble swap.
kubelet_extra_args: ["--fail-swap-on=false"]
# cluster_domain: cluster.local
# cluster_name: mycluster
cluster_CA_domain: "mydomain.icp"
# cluster_zone: "myzone"
# cluster_region: "myregion"
## Etcd Settings
#etcd_extra_args: ["--grpc-keepalive-timeout=0", "--grpc-keepalive-interval=0", #"--snapshot-count=10000"]
## General Settings
# wait_for_timeout: 600
# docker_api_timeout: 100
## Advanced Settings
default_admin_user: user
default_admin_password: 6CEd29CN
# ansible_user: <username>
# ansible_become: true
# ansible_become_password: <password>
## Kubernetes Settings
# kube_apiserver_extra_args: []
# kube_controller_manager_extra_args: []
# kube_proxy_extra_args: []
# kube_scheduler_extra_args: []
## Enable Kubernetes Audit Log
# auditlog_enabled: false
## GlusterFS Settings
# glusterfs: false
## GlusterFS Storage Settings
# storage:
# - kind: glusterfs
# nodes:
# - ip: <worker_node_m_IP_address>
# device: <link path>/<symlink of device aaa>,<link path>/<symlink of device bbb>
# - ip: <worker_node_n_IP_address>
# device: <link path>/<symlink of device ccc>
# - ip: <worker_node_o_IP_address>
# device: <link path>/<symlink of device ddd>
# storage_class:
# name:
# default: false
# volumetype: replicate:3
## Network Settings
## Calico Network Settings
### calico_ipip_enabled: true
calico_ipip_enabled: false
calico_tunnel_mtu: 1430
calico_ip_autodetection_method: interface=eth0
## IPSec mesh Settings
## If user wants to configure IPSec mesh, the following parameters
## should be configured through config.yaml
ipsec_mesh:
  enable: false
  # interface: <interface name on which IPsec will be enabled>
  # subnets: []
  # exclude_ips: "<list of IP addresses separated by a comma>"
kube_apiserver_insecure_port: 8080
kube_apiserver_secure_port: 8001
## External loadbalancer IP or domain
## Or floating IP in OpenStack environment
# cluster_lb_address: none
## External loadbalancer IP or domain
## Or floating IP in OpenStack environment
# proxy_lb_address: none
## Install in firewall enabled mode
firewall_enabled: false
## Allow loopback dns server in cluster nodes
loopback_dns: true
## High Availability Settings
# vip_manager: etcd
## High Availability Settings for master nodes
# vip_iface: eth0
# cluster_vip: 127.0.1.1
## High Availability Settings for Proxy nodes
# proxy_vip_iface: eth0
# proxy_vip: 127.0.1.1
## Federation cluster Settings
# federation_enabled: false
# federation_cluster: federation-cluster
# federation_domain: cluster.federation
# federation_apiserver_extra_args: []
# federation_controllermanager_extra_args: []
# federation_external_policy_engine_enabled: false
## vSphere cloud provider Settings
## If user wants to configure vSphere as cloud provider, vsphere_conf
## parameters should be configured through config.yaml
# kubelet_nodename: hostname
# cloud_provider: vsphere
# vsphere_conf:
# user: <vCenter username for vSphere cloud provider>
# password: <password for vCenter user>
# server: <vCenter server IP or FQDN>
# port: [vCenter Server Port; default: 443]
# insecure_flag: [set to 1 if vCenter uses a self-signed certificate]
# datacenter: <datacenter name on which Node VMs are deployed>
# datastore: <default datastore to be used for provisioning volumes>
# working_dir: <vCenter VM folder path in which node VMs are located>
## Disabled Management Services Settings
## You can disable the following management services: ["service-catalog", "metering", "monitoring", "istio", "vulnerability-advisor", "custom-metrics-adapter"]
#disabled_management_services: ["istio", "vulnerability-advisor", "custom-metrics-adapter"]
disabled_management_services: ["service-catalog", "metering", "monitoring", "istio", "vulnerability-advisor", "custom-metrics-adapter"]
## Docker Settings
# docker_env: []
# docker_extra_args: []
## The maximum size of the log before it is rolled
# docker_log_max_size: 50m
## The maximum number of log files that can be present
# docker_log_max_file: 10
## Install/upgrade docker version
# docker_version: 17.12.1
## ICP install docker automatically
# install_docker: true
## Ingress Controller Settings
## You can add your ingress controller configuration, and the allowed configuration can refer to
## https://github.com/kubernetes/ingress-nginx/blob/nginx-0.9.0/docs/user-guide/configmap.md#configuration-options
# ingress_controller:
# disable-access-log: 'true'
## Clean metrics indices in Elasticsearch older than this number of days
# metrics_max_age: 1
## Clean application log indices in Elasticsearch older than this number of days
# logs_maxage: 1
## Uncomment the line below to install Kibana as a managed service.
kibana_install: true
# STARTING_CLOUDANT
# cloudant:
# namespace: kube-system
# pullPolicy: IfNotPresent
# pvPath: /opt/ibm/cfc/cloudant
# database:
# password: fdrreedfddfreeedffde
# federatorCommand: hostname
# federationIdentifier: "-0"
# readinessProbePeriodSeconds: 2
# readinessProbeInitialDelaySeconds: 90
# END_CLOUDANT

I had a similar issue when deploying with Ansible on Ubuntu servers. As a user mentioned on Kubernetes issue 43156, "We should not inherit nameserver 127.x.x.x in the pod resolv.conf from the node, as the node localhost will not be accessible from the pod."
If your /etc/resolv.conf has a localhost IP in it, I suggest replacing it with the node's IP. For instance, if you are using Ubuntu, opt out of the local stub resolver (systemd-resolved) so it does not get set back after a restart:
systemctl disable --now systemd-resolved.service
cp /etc/resolv.conf /etc/resolv.conf.bkp
echo "nameserver <Node's_IP>" > /etc/resolv.conf
More details on opting-out of NetworkManager at the following link:
How to take back control of /etc/resolv.conf on Linux
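Once the node's resolv.conf no longer points at localhost, re-run the installer; to check on the stuck job, something like this can help (the pod name is taken from the log above and changes on every run):
kubectl get pods --namespace=kube-system | grep configure-calico-mesh
kubectl describe pod configure-calico-mesh-9f756 --namespace=kube-system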

Related

Logstash error : Failed to publish events caused by: write tcp YY.YY.YY.YY:40912->XX.XX.XX.XX:5044: write: connection reset by peer

I am using Filebeat to push my logs to Elasticsearch through Logstash, and the setup was working fine for me before. Now I am getting a "Failed to publish events" error.
filebeat | 2020-06-20T06:26:03.832969730Z 2020-06-20T06:26:03.832Z INFO log/harvester.go:254 Harvester started for file: /logs/app-service.log
filebeat | 2020-06-20T06:26:04.837664519Z 2020-06-20T06:26:04.837Z ERROR logstash/async.go:256 Failed to publish events caused by: write tcp YY.YY.YY.YY:40912->XX.XX.XX.XX:5044: write: connection reset by peer
filebeat | 2020-06-20T06:26:05.970506599Z 2020-06-20T06:26:05.970Z ERROR pipeline/output.go:121 Failed to publish events: write tcp YY.YY.YY.YY:40912->XX.XX.XX.XX:5044: write: connection reset by peer
filebeat | 2020-06-20T06:26:05.970749223Z 2020-06-20T06:26:05.970Z INFO pipeline/output.go:95 Connecting to backoff(async(tcp://xx.com:5044))
filebeat | 2020-06-20T06:26:05.972790871Z 2020-06-20T06:26:05.972Z INFO pipeline/output.go:105 Connection to backoff(async(tcp://xx.com:5044)) established
Logstash pipeline
02-beats-input.conf
input {
  beats {
    port => 5044
  }
}
10-syslog-filter.conf
filter {
  json {
    source => "message"
  }
}
30-elasticsearch-output.conf
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    manage_template => false
    index => "index-%{+YYYY.MM.dd}"
  }
}
Filebeat configuration
Sharing my filebeat config at /usr/share/filebeat/filebeat.yml
filebeat.inputs:
- type: log
  # Change to true to enable this input configuration.
  enabled: true
  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /logs/*
#============================= Filebeat modules ===============================
filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml
  # Set to true to enable config reloading
  reload.enabled: false
  # Period on which files under path should be checked for changes
  #reload.period: 10s
#==================== Elasticsearch template setting ==========================
setup.template.settings:
  index.number_of_shards: 3
  #index.codec: best_compression
  #_source.enabled: false
#============================== Kibana =====================================
# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:
  # Kibana Host
  # Scheme and port can be left out and will be set to the default (http and 5601)
  # In case you specify and additional path, the scheme is required: http://localhost:5601/path
  # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
  #host: "localhost:5601"
  # Kibana Space ID
  # ID of the Kibana Space into which the dashboards should be loaded. By default,
  # the Default Space will be used.
  #space.id:
#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["xx.com:5044"]
  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]
  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"
  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"
#================================ Processors =====================================
# Configure processors to enhance or manipulate events generated by the beat.
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
When I do telnet xx.xx 5044, this is what I see in the terminal:
Trying X.X.X.X...
Connected to xx.xx.
Escape character is '^]'
I had the same problem. Here are some steps which could help you find the root of your problem.
First I tested this chain: filebeat (localhost) -> logstash (localhost) -> elastic -> kibana, with every service on the same machine.
My /etc/logstash/conf.d/config.conf:
input {
  beats {
    port => 5044
    ssl => false
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
Here I deliberately disabled SSL (in my case it was the main cause of the issue, even though the certificates were correct, magic).
After that, don't forget to restart Logstash and test with the sudo filebeat -e command.
If everything is OK, you won't see the 'connection reset by peer' error.
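Restarting and testing could look like this (a sketch; the test output subcommand exists in recent Filebeat versions and checks the connection to the configured output):
sudo systemctl restart logstash
sudo filebeat test output   # verifies reachability of the Logstash host configured in filebeat.yml
sudo filebeat -e            # run in the foreground and watch for publish errors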
I had the same problem. Starting filebeat as a sudo user worked for me.
sudo ./filebeat -e
I made some changes to the input plugin config, such as specifying ssl => false, but it did not work without starting Filebeat as a sudo-privileged user or as root.
In order to start Filebeat as a sudo user, the filebeat.yml file must be owned by root. Change the ownership of the whole Filebeat folder to a sudo-privileged user with sudo chown -R some_sudo_user:some_group filebeat-7.15.0-linux-x86_64/, and then chown root filebeat.yml to change the ownership of the config file.
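Putting that together, a minimal sketch (the user, group, and directory names are the placeholders from the answer above, so substitute your own):
sudo chown -R some_sudo_user:some_group filebeat-7.15.0-linux-x86_64/
sudo chown root filebeat-7.15.0-linux-x86_64/filebeat.yml
cd filebeat-7.15.0-linux-x86_64 && sudo ./filebeat -e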

Setup network access for Elasticsearch on Ubuntu running on parallels

I have an instance of Ubuntu Server 20.04 installed on Parallels on a Mac, but I am unable to access it from other devices on my network, only from within the Ubuntu instance itself.
lsb_release -a gives the following result
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04 LTS
Release: 20.04
Codename: focal
I have installed Elasticsearch version 7.6.2 from the APT repository using the instructions here
I can run curl -X GET 'http://localhost:9200' on ubuntu and get the following output
{
  "name" : "dev",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "u4Xx8JDyTdaWv_HsYK6xXA",
  "version" : {
    "number" : "7.6.2",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "ef48eb35cf30adf4db14086e8aabd07ef6fb113f",
    "build_date" : "2020-03-26T06:34:37.794943Z",
    "build_snapshot" : false,
    "lucene_version" : "8.4.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
If I run hostname -I I get 10.211.55.11 fdb2:2c26:f4e4:0:21c:42ff:fee9:e2c5 which is the IP address of the ubuntu instance.
However when I run curl -X GET 'http://10.211.55.11:9200' from my Mac I get the following result curl: (7) Failed to connect to 10.211.55.11 port 9200: Connection refused
How can I get access to my instance of Elasticsearch from other devices on my network?
sudo ufw status gives me the following rules
Status: active
To                         Action      From
--                         ------      ----
9200                       ALLOW       Anywhere
9200 (v6)                  ALLOW       Anywhere (v6)
/etc/elasticsearch/elasticsearch.yml contains the following
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
#cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
#node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /var/lib/elasticsearch
#
# Path to log files:
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
#network.host: 0.0.0.0
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.seed_hosts: ["host1", "host2"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
#cluster.initial_master_nodes: ["node-1", "node-2"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
Adding network.host: 0.0.0.0
After adding the above line to the elasticsearch.yml file I get the following error
Job for elasticsearch.service failed because the control process exited with error code.
See "systemctl status elasticsearch.service" and "journalctl -xe" for details.
running systemctl status elasticsearch.service gives me the following error message
● elasticsearch.service - Elasticsearch
Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2020-05-01 14:52:23 UTC; 59s ago
Docs: http://www.elastic.co
Process: 7086 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet (code=exited, status=78)
Main PID: 7086 (code=exited, status=78)
May 01 14:52:03 dev systemd[1]: Starting Elasticsearch...
May 01 14:52:03 dev elasticsearch[7086]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
May 01 14:52:23 dev elasticsearch[7086]: ERROR: [1] bootstrap checks failed
May 01 14:52:23 dev elasticsearch[7086]: [1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nod>
May 01 14:52:23 dev elasticsearch[7086]: ERROR: Elasticsearch did not exit normally - check the logs at /var/log/elasticsearch/elasticsearch.log
May 01 14:52:23 dev systemd[1]: elasticsearch.service: Main process exited, code=exited, status=78/CONFIG
May 01 14:52:23 dev systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
May 01 14:52:23 dev systemd[1]: Failed to start Elasticsearch.
/var/log/elasticsearch/elasticsearch.log contains the following error logs
[2020-05-01T14:52:22,378][INFO ][o.e.n.Node ] [dev] starting ...
[2020-05-01T14:52:22,740][INFO ][o.e.t.TransportService ] [dev] publish_address {10.211.55.11:9300}, bound_addresses {[::]:9300}
[2020-05-01T14:52:23,333][INFO ][o.e.b.BootstrapChecks ] [dev] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2020-05-01T14:52:23,355][ERROR][o.e.b.Bootstrap ] [dev] node validation exception
[1] bootstrap checks failed
[1]: the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
[2020-05-01T14:52:23,361][INFO ][o.e.n.Node ] [dev] stopping ...
[2020-05-01T14:52:23,386][INFO ][o.e.n.Node ] [dev] stopped
[2020-05-01T14:52:23,387][INFO ][o.e.n.Node ] [dev] closing ...
[2020-05-01T14:52:23,435][INFO ][o.e.n.Node ] [dev] closed
[2020-05-01T14:52:23,436][INFO ][o.e.x.m.p.NativeController] [dev] Native controller process has stopped - no new native processes can be started
So there were a couple of issues, and the following steps should solve them:
Set network.host: 0.0.0.0, which exposes the port on non-loopback addresses (i.e. addresses other than localhost/127.0.0.1) so that other systems on the LAN can connect to the node's IP address.
Added discovery.type: single-node to the yml config to avoid the production bootstrap checks.

dcos mesos dns is not resolving masters and leaders

Could you please help me resolve this issue? I have been struggling with it for more than three days and could not fix it.
I am configuring a DC/OS installation with the guidance of https://dcos.io/docs/1.7/administration/installing/custom/advanced/
Unfortunately, my DC/OS DNS server is not working properly.
1) Below is the output of the nslookup command:
# nslookup leader.mesos
;; Warning: query response not set
;; Got SERVFAIL reply from 198.51.100.1, trying next server
;; Warning: query response not set
;; Got SERVFAIL reply from 198.51.100.2, trying next server
;; Warning: query response not set
Server: 198.51.100.3
Address: 198.51.100.3#53
** server can't find leader.mesos: SERVFAIL
2) Below is the output of /opt/mesosphere/etc/mesos-dns.json
{
  "zk": "zk://127.0.0.1:2181/mesos",
  "refreshSeconds": 30,
  "ttl": 60,
  "domain": "mesos",
  "port": 61053,
  "resolvers": ["172.31.0.2"],
  "timeout": 5,
  "listener": "0.0.0.0",
  "email": "root.mesos-dns.mesos",
  "IPSources": ["host", "netinfo"]
}
3) Below is the output of the journalctl -u dcos-mesos-dns -b
19:29:50 Authentication failed: EOF
***19:29:51 Failed to connect to 127.0.0.1:2181: dial tcp 127.0.0.1:2181: getsockopt: connection refused***
19:29:52 Connected to 127.0.0.1:2181
19:29:52 Authenticated: id=98693002200481794, timeout=40000
service: Main process exited, code=exited, status=1/FAILURE
service: Unit entered failed state.
service: Failed with result 'exit-code'.
7/09/20 19:30:14 generator.go:124: no master
7/09/20 19:30:14 resolver.go:156: Warning: Error generating records: no master; keeping old DNS state
7/09/20 19:30:14 main.go:80: master detection timed out after 30s
service: Service hold-off time over, scheduling restart.
NS: DNS based Service Discovery.
DNS: DNS based Service Discovery...
19:30:19 Connected to 127.0.0.1:2181
19:30:19 Authenticated: id=98693008095641600, timeout=40000
DNS: DNS based Service Discovery...
service: Main process exited, ***code=exited, status=2/INVALIDARGUMENT***
service: Unit entered failed state.
service: Failed with result 'exit-code'.
NS: DNS based Service Discovery.
DNS: DNS based Service Discovery...
19:39:57 Connected to 127.0.0.1:2181
19:39:57 Authenticated: id=98693008095641610, timeout=40000
lines 170-207/207 (END)
Please ask if you need more logs...
Thank you very much!
I was able to fix this issue by following the latest version of the DC/OS installation guide: https://dcos.io/docs/1.10/installing/custom/advanced/
my genconf/config.yaml
---
bootstrap_url: http://xx.xx.xx.xx:9090
cluster_name: 'ProjectName'
exhibitor_storage_backend: aws_s3
aws_access_key_id: xxxxxxxxxxxxxxxxxx
aws_secret_access_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
s3_bucket: cp-v902
aws_region: us-east-1
exhibitor_explicit_keys: 'false'
s3_prefix: dcos
ip_detect_filename: genconf/ip-detect
dns_search: ec2.internal
master_discovery: master_http_loadbalancer
exhibitor_address: internal-osdc-123456789.us-east-1.elb.amazonaws.com
num_masters: 1
resolvers:
- XXXX.XXX.XX.XX
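After regenerating the installer with this config and reinstalling, the Mesos DNS names should resolve again; a quick check from any node (these are the standard mesos-dns records):
nslookup leader.mesos
nslookup master.mesos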

Apache Zookeeper multi-nodes cluster not running

I'm following http://jayatiatblogs.blogspot.com/2011/11/storm-installation.html & http://seaip.narlabs.org.tw/upload/content_file/547c1db495987.pdf to set up my Apache Storm & Apache ZooKeeper cluster on Ubuntu 14.04 LTS in Amazon Web Services EC2.
Below is the zoo.cfg for my slave nodes:
## The number of milliseconds of each tick. The length of a single tick, which is the basic time unit used by ZooKeeper, as measured in milliseconds.
## It is used to regulate heartbeats, and timeouts. For example, the minimum session timeout will be 2 ticks.
tickTime=2000
## The number of ticks that the initial synchronization phase can take
## The new entry, initLimit is timeouts ZooKeeper uses to limit the length of time the ZooKeeper servers in quorum have to connect to a leader.
initLimit=1000
## The number of ticks that can pass between sending a request and getting an acknowledgement
## Amount of time, in ticks to allow followers to sync with ZooKeeper. If followers fall too far behind a leader, they will be dropped.
## The entry syncLimit limits how far out of date a server can be from a leader.
syncLimit=500
## The directory where the snapshot is stored. The location where ZooKeeper will store the in-memory database snapshots and, unless specified otherwise, the transaction log of updates to the database.
dataDir=/home/ubuntu/zookeeper-data
## The location of the log file. Write the transaction log to the dataLogDir rather than the dataDir.
#dataLogDir=/home/ubuntu/zookeeper/log/data_log
## The port to listen for client connections; that is, the port that clients attempt to connect to.
clientPort=2181
## No need to put in standalone mode
## server.id = host:port:port
server.1=10.0.0.79:2888:3888
server.2=10.0.0.124:2888:3888
server.3=10.0.0.84:2888:3888
## The number of snapshots to retain in dataDir
## When enabled, ZooKeeper auto purge feature retains the autopurge.snapRetainCount most recent snapshots and the corresponding transaction logs in the dataDir and dataLogDir respectively and deletes the rest.
## Defaults to 3. Minimum value is 3.
autopurge.snapRetainCount=3
## Purge task interval in hours. Enable regular purging of old data and transaction logs every 24 hours
## Set to "0" to disable auto purge feature
## The time interval in hours for which the purge task has to be triggered. Set to a positive integer (1 and above) to enable the auto purging. Defaults to 0.
autopurge.purgeInterval=1
The storm.conf at my slave nodes is as below:
########### These MUST be filled in for a storm configuration
storm.zookeeper.server:
- "10.0.0.79"
- "10.0.0.124"
- "10.0.0.84"
# - "localhost"
storm.zookeeper.port: 2181
# storm.zookeeper.port: 2888:3888
# nimbus.host: "localhost"
nimbus.host: "10.0.0.185"
# nimbus.thrift.port: 6627
#
#ui.port: 8772
#
storm.local.dir: "/home/ubuntu/storm/data"
java.library.path: "/usr/lib/jvm/java-7-oracle"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
- 6704
#
# worker.childopts: "-Xmx768m"
# nimbus.childopts: "-Xmx512m"
# supervisor.childopts: "-Xmx256m"
#
# ##### These may optionally be filled in:
#
## List of custom serializations
# topology.kryo.register:
# - org.mycompany.MyType
# - org.mycompany.MyType2: org.mycompany.MyType2Serializer
#
## List of custom kryo decorators
# topology.kryo.decorators:
# - org.mycompany.MyDecorator
#
## Locations of the drpc servers
# drpc.servers:
# - "server1"
# - "server2"
## Metrics Consumers
# topology.metrics.consumer.register:
# - class: "backtype.storm.metric.LoggingMetricsConsumer"
# parallelism.hint: 1
# - class: "org.mycompany.MyMetricsConsumer"
# parallelism.hint: 1
# argument:
# - endpoint: "metrics-collector.mycompany.org"
The myid at zookeeper/conf for 10.0.0.79 is 1, for 10.0.0.124 is 2, for 10.0.0.84 is 3.
However, when I start Apache ZooKeeper using zkServer.sh start and then show the status using zkServer.sh status, this message appears:
JMX enabled by default
Using config: /home/ubuntu/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
When I issue the command from one of the slave nodes, e.g. from 10.0.0.124: zkCli.sh -server 10.0.0.84:2181, the error below is shown:
2015-05-27 04:44:03,745 [myid:] - INFO [main-SendThread(10.0.0.84:2181):ClientCnxn$SendThread#975] - Opening socket connection to server 10.0.0.84/10.0.0.84:2181. Will not attempt to authenticate using SASL (unknown error)
Welcome to ZooKeeper!
2015-05-27 04:44:03,761 [myid:] - WARN [main-SendThread(10.0.0.84:2181):ClientCnxn$SendThread#1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
JLine support is enabled
[zk: 10.0.0.84:2181(CONNECTING) 0]
Does anyone know the possible causes & how to solve it?
I think ZooKeeper is not running properly.
The myid file should be located in dataDir=/home/ubuntu/zookeeper-data, not in the conf directory.
Refer to the ZooKeeper documentation.
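For example, a sketch assuming the dataDir above (the number must match the server.N entry in zoo.cfg for that host):
echo 1 > /home/ubuntu/zookeeper-data/myid   # on 10.0.0.79
echo 2 > /home/ubuntu/zookeeper-data/myid   # on 10.0.0.124
echo 3 > /home/ubuntu/zookeeper-data/myid   # on 10.0.0.84
zkServer.sh restart && zkServer.sh status   # run on each node after creating its myid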

Storm UI Internal Server Error for local cluster

Internal Server Error
org.apache.thrift7.transport.TTransportException: java.net.ConnectException: Connection refused
at org.apache.thrift7.transport.TSocket.open(TSocket.java:183)
at org.apache.thrift7.transport.TFramedTransport.open(TFramedTransport.java:81)
at backtype.storm.thrift$nimbus_client_and_conn.invoke(thrift.clj:75)
at backtype.storm.ui.core$all_topologies_summary.invoke(core.clj:515)
at backtype.storm.ui.core$fn__8018.invoke(core.clj:851)
at compojure.core$make_route$fn__6199.invoke(core.clj:93)
at compojure.core$if_route$fn__6187.invoke(core.clj:39)
at compojure.core$if_method$fn__6180.invoke(core.clj:24)
at compojure.core$routing$fn__6205.invoke(core.clj:106)
at clojure.core$some.invoke(core.clj:2443)
at compojure.core$routing.doInvoke(core.clj:106)
at clojure.lang.RestFn.applyTo(RestFn.java:139)
at clojure.core$apply.invoke(core.clj:619)
at compojure.core$routes$fn__6209.invoke(core.clj:111)
at ring.middleware.reload$wrap_reload$fn__6234.invoke(reload.clj:14)
at backtype.storm.ui.core$catch_errors$fn__8059.invoke(core.clj:909)
at ring.middleware.keyword_params$wrap_keyword_params$fn__6876.invoke(keyword_params.clj:27)
at ring.middleware.nested_params$wrap_nested_params$fn__6915.invoke(nested_params.clj:65)
at ring.middleware.params$wrap_params$fn__6848.invoke(params.clj:55)
at ring.middleware.multipart_params$wrap_multipart_params$fn__6943.invoke(multipart_params.clj:103)
at ring.middleware.flash$wrap_flash$fn__7124.invoke(flash.clj:14)
I followed the method in https://hadooptips.wordpress.com/2014/05/26/configuring-single-node-storm-cluster/ to set up Storm on Ubuntu 14.04 LTS.
When I try to connect to the Storm UI, I get the error shown above.
My storm.yaml in /home/user/storm/conf is as below:
########### These MUST be filled in for a storm configuration
storm.zookeeper.servers:
- "localhost"
storm.zookeeper.port: 2181
nimbus.host: "localhost"
nimbus.thrift.port: 6627
# ui.port:8772
storm.local.dir: "/home/user/storm/data"
java.library.path: "/usr/lib/jvm/java-7-oracle"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
- 6704
Does anyone know how to solve this? I'm a newbie, so a detailed solution would be helpful.
My zoo.cfg is as below:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/home/user/zookeeper-data
# The location of the log file
dataLogDir=/home/user/zookeeper/log/data_log
# the port at which the clients will connect
clientPort=2181
server.1=10.0.0.2:2888:3888
server.2=10.0.0.3:2888:3888
server.3=10.0.0.4:2888:3888
# The number of snapshots to retain in dataDir
autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
autopurge.purgeInterval=1
I run this in VMware on Ubuntu 14.04 LTS. What IP address should I put in server.1?
I think your ZooKeeper is not running properly. Before running ZooKeeper you have to create a myid file that contains only the id of each node.
Please refer to: Zookeeper - three nodes and nothing but errors
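For example, a sketch assuming the dataDir from the zoo.cfg above (repeat on each server with its own id, and make sure Nimbus is running before opening the UI):
echo 1 > /home/user/zookeeper-data/myid     # use 2 and 3 on the other two servers, matching server.N in zoo.cfg
zkServer.sh start && zkServer.sh status
storm nimbus &                              # the UI's "Connection refused" above means it cannot reach Nimbus on port 6627
storm ui &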
