Standalone mesos master behind load balancer EC2 - mesos

I have a mesos master behind a load balancer and a mesos agent that tries to connect to the mesos master via the load balancer
Everything is good when the agent directly connects to the master by providing the --master flag but as soon as I change the --master to point to the load balancer (dns entry not the LB IP) I keep getting the following error repeatedly on my agent
I0223 11:16:55.776448 4945 slave.cpp:1416] Detecting new master
I0223 11:16:55.796245 4947 slave.cpp:6456] Got exited event for master#xx.xx.xx.xx:8082
W0223 11:16:55.796283 4947 slave.cpp:6461] Master disconnected! Waiting for a new master to be elected
I don't see any logs in master
mesos master port:8082
load balancer listener:8082->8082
mesos agent port:5052
We use the classic load balancer that does not preserve the IP
I then tried advertising the agent IP & Port but that didn't help either
I also tried setting --hostname, --advertise_ip & --advertise_port on master but that didn't help either
Has anyone faced this issue? What should be the right values for --advertise_ip, --advertise_port
I'm not using the standard mesos master/agent ports FYI
At this point I've tried all sorts of combinations

hostname = DNS name
advertise_ip = IP DNS resolves to
advertise_port = External port

Related

Remote Testing in JMeter with AWS EC2 instance

I try to start remote testing using my computer as Client/Master and EC2 instance as Slave
I achieve all these Points:
Disable Firewall on both Master and Slave.
I have the same version of java and JMeter on both Master and Slave.
I have set all communication to be over port 4000.
My configuration for Master:
remote_hosts=10.xx.xx.xxx
server_port=4000
server.rmi.port=4000
server.rmi.localport=4000
server.rmi.ssl.disable=true
My Configuration for Slave [EC2 instance]
server_port=4000
server.rmi.port=4000
server.rmi.localport=4000
server.rmi.ssl.disable=true
Command to start JMeter server on slave [ EC2 instance]:
./jmeter-server -Gjava.rmi.server.hostname:10.xx.xx.xxx
Command to start JMeter server on Master [ My Computer]:
./jmeter-server -Gjava.rmi.server.hostname:192.xx.xx.xxx
After Running the test from the master the test started on the slave and finished.
My issue that the Client/Master didn't get any Result or Summary, it stuck and freeze on this line:
Waiting for possible Shutdown/StopTestNow/HeapDump/ThreadDump message on port 4445.
Your 10.xx.xx.xxx and 192.xx.xx.xxx are class A and class C local networks, it means that they are not accessible from anywhere else, only from their respective local networks.
So you won't be able to reach the EC2 instance internal IP from your computer and vice versa.
In order to be able to connect to the EC2 instance you need to:
Use external (public) IP address or DNS hostname
Open the port 4000 in the AWS Security Groups
In order to get results back to your computer from the EC2 machine you need to have static external IP address, you need to reach out to your ISP or network administrator to get this configured and assigned
An example of master/slave configuration with custom ports can be found in the JMeter Distributed Testing with Docker article.
More information: Remote hosts and RMI configuration
If you have only one slave machine it doesn't make sense to invest into master/slave configuration at all, just run JMeter in command-line non-GUI mode in the EC2 instance and analyze the results locally.
If you plan to use more than 1 slave - it makes sense to transfer the master to the EC2 as well, this way you will be able to use internal IP addresses

Nomad and consul setup

Should I run consul slaves alongside nomad slaves or inside them?
The later might not make sense at all but I'm asking it just in case.
I brought my own nomad cluster up with consul slaves running alongside nomad slaves (inside worker nodes), my deployable artifacts are docker containers (java spring applications).
The issue with my current setup is that my applications can't access consul slaves (to read configurations) (none of 0.0.0.0, localhost, worker node ip worked)
Lets say my service exposes 8080, I configured docker part (in hcl file) to use bridge as network mode. Nomad maps 8080 to 43210.
Everything is fine until my service tries to reach the consul slave to read configuration. Ideally giving nomad worker node IP as consul host to Spring should suffice. But for some reason it's not.
I'm using latest version of nomad.
I configured my nomad slaves like https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/nomad/client1.hcl
And the link below shows how I configured/ran my consul slave:
https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/server2.yml
Note: if I use static port mapping and host as the network mode for docker (in nomad) I'll be fine but then I can't deploy more than one instance of each application in each worker node (due to port conflic)
Nomad jobs listen on a specific host/port pair.
You might want to ssh into the server and run docker ps to see what host/port pair the job is listening on.
a93c5cb46a3e image-name bash 2 hours ago Up 2 hours 10.0.47.2:21435->8000/tcp, 10.0.47.2:21435->8000/udp foo-bar
Additionally, you will need to ensure that the consul nomad job is listening on port 0.0.0.0, or the specific ip of the machine. I believe that is this config value: https://www.consul.io/docs/agent/options.html#_bind
All those will need to match up in order to consul to be reachable.
More generally, I might recommend: if you're going to run consul with nomad, you might want to switch to host networking, so that you don't have to deal with the specifics of the networking within a container. Additionally, you could schedule consul as a system job so that it is automatically present on every host.
So I managed to solve the issue like this:
nomad.job.group.network.mode = host
nomad.job.group.network.port: port "http" {}
nomad.job.group.task.driver = docker
nomad.job.group.task.config.network_mode = host
nomad.job.group.task.config.ports = ["http"]
nomad.job.group.task.service.connect: connect { native = true }
nomad.job.group.task.env: SERVER_PORT= "${NOMAD_PORT_http}"
nomad.job.group.task.env: SPRING_CLOUD_CONSUL_HOST = "localhost"
nomad.job.group.task.env: SPRING_CLOUD_SERVICE_REGISTRY_AUTO_REGISTRATION_ENABLED = "false"
Running consul agent (slaves) using docker-compose alongside nomad agent (slave) with host as network mode + exposing all required ports.
Example of nomad job: https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/nomad/location-update-publisher.hcl
Example of consul agent config (docker-compose file): https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/server2.yml
Disclaimer: The LAB is part of Cluster Visualization Framework called: LiteArch Trafik which I have created as an interesting exercise to understand Nomad and Consul.
It took me long time to shift my mind from K8S to Nomad and Consul,
Integration them was one of my effort I spent in the last year.
When service resolution doesn't work, I found out it's more or less the DNS configuration on servers.
There is a section for it on Hashicorp documentation called DNS Forwarding
Hashicorp DNS Forwarding
I have created a LAB which explains how to set up Nomad and Consul.
But you can use the LAB seperately.
I created the LAB after learning the hard way how to install the cluster and how to integrate Nomad and Consul.
With the LAB you need Ubuntu Multipass installed.
You execute one script and you will get full functional Cluster locally with three servers and three nodes.
It shows you as well how to install docker and integrate the services with Consul and DNS services on Ubuntu.
After running the LAB you will get the links to Nomad, Fabio, Consul.
Hopefully it will guide you through the learning process of Nomad and Consul
LAB: LAB
Trafik:Trafik Visualizer

replication server with same id in docker/openshift cluster

I'm having problems when setting up replication in openshift/docker cluster.
In openshift, each opendj server will have two ips: service ip and pod id. So when I setup two opendj service, two service ip and two pod ip will be there.
I want to set up the replication by service ip, because pod is is not accessible from other pod, but apparently OpenDJ think there are four replication server there, with each two server having same ServerId.
Log snippet:
category=SYNC severity=ERROR msgID=org.opends.messages.replication.55 msg=In Replication server Replication Server 8989 31635: replication servers 172.30.244.127(service ip):8989 and 10.129.0.1:8989(pod ip) have the same ServerId : 11281
My Question is: is it possible to just build the Replication Server cluster by Service IP, not Pod id?
Thanks a lot.
PS: seems this issue is similar with this https://bugster.forgerock.org/jira/browse/OPENDJ-567
Wayne
For anyone having the same issue, please config your opendj service to headless service, that will solve the problem

Marathon loses control over Mesos when Marathon and Mesos leaders mismatch

When mesos or marathon service restart due to some reasons and leader of mesos and marathon is not on the same machine, deployments stuck in marathon and nothing happens in mesos, that leads to terrible results when marathon can not restart failed services and do nothing with deployments until leaders will not match again.
Our cluster has 3 masters (installed through mesosphere website) and this situation happens quite often, is there any way to fix that?
Marathon v.0.9.0
Mesos v0.22.1
It sounds like either Mesos or Marathon use a private ip (localhost/127.0.0.1), thus they weren't able to talk to each other.
You should be able to solve your issue by setting a public ip using the respective --ip command line flag or LIBPROCESS_IP environment var.
One particularly useful setting is LIBPROCESS_IP, which tells the master and slave binaries which IP address to bind to; in some installations, the default interface that the hostname resolves to is not the machine’s external IP address, so you can set the right IP through this variable.
/source http://mesos.apache.org/documentation/latest/deploy-scripts/

Apache Spark error : Could not connect to akka.tcp://sparkMaster#

This is our first steps using big data stuff like apache spark and hadoop.
We have a installed Cloudera CDH 5.3. From the cloudera manager we choose to install spark. Spark is up and running very well in one of the nodes in the cluster.
From my machine I made a little application that connects to read a text file stored on hadoop HDFS.
I am trying to run the application from Eclipse and it displays these messages
15/02/11 14:44:01 INFO client.AppClient$ClientActor: Connecting to master spark://10.62.82.21:7077...
15/02/11 14:44:02 WARN client.AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#10.62.82.21:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#10.62.82.21:7077
15/02/11 14:44:02 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#10.62.82.21:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: no further information: /10.62.82.21:7077
The application is has one class the create a context using the following line
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("Spark Count").setMaster("spark://10.62.82.21:7077"));
where this IP is the IP of the machine spark working on.
Then I try to read a file from HDFS using the following line
sc.textFile("hdfs://10.62.82.21/tmp/words.txt")
When I run the application I got the
Check your Spark master logs, you should see something like:
15/02/11 13:37:14 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster#mymaster:7077]
15/02/11 13:37:14 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkMaster#mymaster:7077]
15/02/11 13:37:14 INFO Master: Starting Spark master at spark://mymaster:7077
Then when your connecting to the master, be sure to use exactly the same hostname as found in the logs above (do not use the IP address):
.setMaster("spark://mymaster:7077"));
Spark standalone is a bit picky with this hostname/IP stuff.
When you create your Spark master using the shell command "sbin/start-master.sh". go the the address http://localhost:8080 and check the "URL" row.
I notice no accepted answer, just for info I thought I'd mention a couple things.
First, in the spark-env.sh file in the conf directory, the SPARK_MASTER_IP and SPARK_LOCAL_IP settings can be hostnames. You don't want them to be, but they can be.
As noted in another answer, Spark can be a little picky about hostname vs. IP address, because of this resolved bug/feature: See bug here. The problem is, it's not clear if they "resolved" is simply by telling us to use IP instead of hostname?
Well I am having this same problem right now, and the first thing you do is check the basics.
Can you ping the box where the Spark master is running? Can you ping the worker from the master? More importantly, can you password-less ssh to the worker from the master box? Per 1.5.2 docs you need to be able to do that with a private key AND have the worker entered in the conf/slaves file. I copied the relevant paragraph at the end.
You can get a situation where the worker can contact the master but the master can't get back to the worker so it looks like no connection is being made. Check both directions.
Finally of all the combinations of settings, in a limited experiment just now I only found one that mattered: On the master, in spark-env.sh, set the SPARK_MASTER_IP to the IP address, not hostname. Then connect from the worker with spark://192.168.0.10:7077 and voila it connects! Seemingly none of the other config parameters are needed here.
Here's the paragraph from the docs about ssh and slaves file in conf:
To launch a Spark standalone cluster with the launch scripts, you
should create a file called conf/slaves in your Spark directory, which
must contain the hostnames of all the machines where you intend to
start Spark workers, one per line. If conf/slaves does not exist, the
launch scripts defaults to a single machine (localhost), which is
useful for testing. Note, the master machine accesses each of the
worker machines via ssh. By default, ssh is run in parallel and
requires password-less (using a private key) access to be setup. If
you do not have a password-less setup, you can set the environment
variable SPARK_SSH_FOREGROUND and serially provide a password for each
worker.
Once you have done that, using the IP address should work in your code. Let us know! This can be an annoying problem, and learning that most of the config params don't matter was nice.

Resources