We are using AWS EC2 instances to run Celery on multiple nodes. But sometimes some of the nodes go down: they show as running in AWS, but we are unable to SSH into them, and their Celery workers no longer show up in the RabbitMQ management tool. We then have to restart those nodes manually from the AWS console.
Is there any way we can avoid these server failures due to Celery? (These servers only run Celery workers.)
I have an EKS cluster running with both Linux and Windows nodes. I schedule pods on the Windows nodes; they run for about 30 minutes and are then removed. The first thing any pod does is download some data from S3 using the AWS CLI installed on it.
I am facing some intermittent connectivity issues. Pods get spun up and sometimes fail with a fatal error:
Could not connect to the endpoint URL: "https://sts.eu-west-1.amazonaws.com"
As far as I can see, this only happens when I schedule more than one pod on a node. I do use a smaller instance type (m5.large), but I am not close to reaching the pod limit of this instance type. When there is one pod per node, they can all connect and download data from S3.
Reading the documentation, I can see it is possible to schedule more than one pod per EC2 instance, but I am unsure what the requirements on the EC2 instance are to give all those pods access to download data from S3. I did try to add more ENIs to the EC2 instances, but this prevented the EC2 instances from registering as nodes in the EKS cluster.
Should I run consul slaves alongside nomad slaves or inside them?
The latter might not make sense at all, but I'm asking just in case.
I brought up my own Nomad cluster with Consul slaves running alongside Nomad slaves (inside the worker nodes); my deployable artifacts are Docker containers (Java Spring applications).
The issue with my current setup is that my applications can't reach the Consul slaves to read their configuration (none of 0.0.0.0, localhost, or the worker node IP worked).
Let's say my service exposes port 8080. I configured the Docker part (in the HCL file) to use bridge as the network mode, and Nomad maps 8080 to 43210.
Everything is fine until my service tries to reach the Consul slave to read its configuration. Ideally, giving Spring the Nomad worker node's IP as the Consul host should suffice, but for some reason it doesn't.
I'm using the latest version of Nomad.
I configured my Nomad slaves like this: https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/nomad/client1.hcl
And the link below shows how I configured/ran my consul slave:
https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/server2.yml
Note: if I use static port mapping and host as the network mode for Docker (in Nomad), I'll be fine, but then I can't deploy more than one instance of each application on each worker node (due to port conflicts).
Nomad jobs listen on a specific host/port pair.
You might want to ssh into the server and run docker ps to see what host/port pair the job is listening on.
a93c5cb46a3e image-name bash 2 hours ago Up 2 hours 10.0.47.2:21435->8000/tcp, 10.0.47.2:21435->8000/udp foo-bar
Additionally, you will need to ensure that the Consul Nomad job is listening on 0.0.0.0, or on the specific IP of the machine. I believe that is this config value: https://www.consul.io/docs/agent/options.html#_bind
All of those need to match up in order for Consul to be reachable.
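For reference, in the Consul agent's own configuration file those two addresses look roughly like this (the values are examples, not taken from the question):

    # Consul agent configuration fragment (HCL)
    bind_addr   = "10.0.47.2"  # address used for cluster (gossip/RPC) traffic
    client_addr = "0.0.0.0"    # where the HTTP and DNS interfaces listen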
More generally, I might recommend: if you're going to run Consul with Nomad, you might want to switch to host networking, so that you don't have to deal with the specifics of networking within a container. Additionally, you could schedule Consul as a system job so that it is automatically present on every host.
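A minimal sketch of that system-job approach, assuming the official Docker image and a placeholder join address, might look like:

    job "consul-client" {
      datacenters = ["dc1"]
      type        = "system"          # run one instance on every eligible client node

      group "consul" {
        task "agent" {
          driver = "docker"

          config {
            image        = "consul:1.9"   # assumed image/tag
            network_mode = "host"         # agent listens on the host's interfaces
            args = [
              "agent",
              "-client=0.0.0.0",          # expose HTTP/DNS beyond loopback
              "-retry-join=10.0.0.10",    # placeholder Consul server address
            ]
          }
        }
      }
    }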
So I managed to solve the issue like this:
nomad.job.group.network.mode = host
nomad.job.group.network.port = port "http" {}
nomad.job.group.task.driver = docker
nomad.job.group.task.config.network_mode = host
nomad.job.group.task.config.ports = ["http"]
nomad.job.group.task.service.connect = connect { native = true }
nomad.job.group.task.env.SERVER_PORT = "${NOMAD_PORT_http}"
nomad.job.group.task.env.SPRING_CLOUD_CONSUL_HOST = "localhost"
nomad.job.group.task.env.SPRING_CLOUD_SERVICE_REGISTRY_AUTO_REGISTRATION_ENABLED = "false"
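Assembled into stanza form, the settings above correspond roughly to the following job sketch (job, group, task, and image names are placeholders; note that in recent Nomad versions the Connect-enabled service block sits at the group level):

    job "my-spring-app" {
      datacenters = ["dc1"]

      group "app" {
        network {
          mode = "host"
          port "http" {}             # dynamic port assigned by Nomad
        }

        service {
          name = "my-spring-app"
          port = "http"
          connect {
            native = true            # Connect-native, no sidecar proxy
          }
        }

        task "server" {
          driver = "docker"

          config {
            image        = "example/my-spring-app:latest"  # placeholder image
            network_mode = "host"    # container shares the host's network namespace
            ports        = ["http"]
          }

          env {
            SERVER_PORT              = "${NOMAD_PORT_http}"  # Spring binds the Nomad-assigned port
            SPRING_CLOUD_CONSUL_HOST = "localhost"           # Consul agent on the same host
            SPRING_CLOUD_SERVICE_REGISTRY_AUTO_REGISTRATION_ENABLED = "false"
          }
        }
      }
    }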
Running the Consul agents (slaves) via docker-compose alongside the Nomad agents (slaves), with host as the network mode and all required ports exposed.
Example of nomad job: https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/nomad/location-update-publisher.hcl
Example of consul agent config (docker-compose file): https://github.com/bmd007/statefull-geofencing-faas/blob/master/infrastructure/server2.yml
Disclaimer: the LAB is part of a cluster visualization framework called LiteArch Trafik, which I created as an interesting exercise to understand Nomad and Consul.
It took me a long time to shift my mind from K8s to Nomad and Consul; integrating them was one of the efforts I put in over the last year.
When service resolution doesn't work, I have found it's more or less always the DNS configuration on the servers.
There is a section for it in the HashiCorp documentation called DNS Forwarding:
Hashicorp DNS Forwarding
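As an illustration of what that forwarding can look like with dnsmasq (one of the approaches in the HashiCorp guide; the file path is a common convention, not a requirement):

    # /etc/dnsmasq.d/10-consul
    # Forward queries for the .consul domain to the local Consul agent's
    # DNS interface on port 8600; everything else resolves as usual.
    server=/consul/127.0.0.1#8600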
I have created a LAB which explains how to set up Nomad and Consul.
But you can use the LAB separately.
I created the LAB after learning the hard way how to install the cluster and how to integrate Nomad and Consul.
To use the LAB you need Ubuntu Multipass installed.
You execute one script and get a fully functional local cluster with three servers and three nodes.
It also shows you how to install Docker and integrate the services with Consul and the DNS services on Ubuntu.
After running the LAB you will get the links to Nomad, Fabio, and Consul.
Hopefully it will guide you through the learning process of Nomad and Consul.
LAB: LAB
Trafik: Trafik Visualizer
I am migrating a standard all-Linux Nomad/Consul cluster. The Nomad/Consul servers use almost no resources with our workloads, and spinning up dedicated Linux VMs just for them in our new environment seems a bit wasteful when that environment has multiple Windows VMs with spare capacity, which I could use for the Nomad server and Consul server processes to give me the necessary redundancy.
So my question boils down to: if I have the Consul server and Nomad server processes exclusively on Windows, and the Nomad agent and Consul agent processes exclusively on Linux, will they all just get along? The Nomad jobs are all Dockerized except for a native system Prometheus exporter.
Both Consul and Nomad are operating-system agnostic. You can use a mix of operating systems within your cluster without issue. The main requirements are that you have direct IP connectivity between the agents (i.e., no NAT), low latency (sub-10 ms), and the required ports opened for Consul and/or Nomad agent communication.
See https://www.consul.io/docs/install/ports and https://www.nomadproject.io/docs/install/production/requirements#ports-used for more detail.
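As an illustration, on a Linux agent the well-known default ports could be opened like this (ufw is just an example firewall; adjust to whatever you actually run):

    # Nomad defaults: 4646 HTTP API, 4647 RPC, 4648 gossip (TCP and UDP)
    sudo ufw allow 4646:4647/tcp
    sudo ufw allow 4648/tcp
    sudo ufw allow 4648/udp
    # Consul defaults: 8300 server RPC, 8301 LAN gossip (TCP and UDP),
    # 8500 HTTP API, 8600 DNS (TCP and UDP)
    sudo ufw allow 8300/tcp
    sudo ufw allow 8301/tcp
    sudo ufw allow 8301/udp
    sudo ufw allow 8500/tcp
    sudo ufw allow 8600/tcp
    sudo ufw allow 8600/udp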
I am using beanstalkd in a Laravel project to handle jobs on a queue. Beanstalkd is running locally. What I want to do is add one or more remote servers to handle some jobs when the queue gets bigger. I know that with Laravel I can send a job to a specific remote connection, but that way I don't know the load on each server before sending the job.
I was wondering if beanstalkd supports load balancing between servers, and error handling when a remote job fails, for example.
Thank you
Beanstalkd doesn't have features for load balancing.
You could set up HAProxy as your balancer and register multiple servers with beanstalkd installed behind it. Then, when you send jobs from the Laravel code, you send them to HAProxy, and HAProxy decides which backend server gets the job, since it knows the load on each one and whether any backend is having an incident.
In the code you just need to change the IP.
In your infrastructure you need a balancer (HAProxy) that is set up with a pool of Beanstalkd servers.
We usually have 2 machines, and they are configured like this:
- Machine 1: HAProxy, Apache, MySQL, Laravel, Beanstalkd
- Machine 2: MySQL, Laravel, Beanstalkd
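A minimal sketch of the HAProxy side might look like this (server names and IPs are hypothetical; beanstalkd speaks a plain TCP protocol on port 11300, hence mode tcp):

    # haproxy.cfg (fragment)
    listen beanstalkd
        bind *:11300               # Laravel's beanstalkd connection points here
        mode tcp                   # beanstalkd is a raw TCP protocol
        balance leastconn          # prefer the backend with the fewest connections
        server queue1 192.168.0.11:11300 check
        server queue2 192.168.0.12:11300 check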
After restarting the 3 masters in my DC/OS cluster, the DC/OS dashboard shows 0 connected nodes. However, from the DC/OS CLI I can see all 6 of my agent nodes:
$ dcos node
HOSTNAME IP ID
172.16.1.20 172.16.1.20 a7af5134-baa2-45f3-892e-5e578cc00b4d-S7
172.16.1.21 172.16.1.21 a7af5134-baa2-45f3-892e-5e578cc00b4d-S12
172.16.1.22 172.16.1.22 a7af5134-baa2-45f3-892e-5e578cc00b4d-S8
172.16.1.23 172.16.1.23 a7af5134-baa2-45f3-892e-5e578cc00b4d-S6
172.16.1.24 172.16.1.24 a7af5134-baa2-45f3-892e-5e578cc00b4d-S11
172.16.1.25 172.16.1.25 a7af5134-baa2-45f3-892e-5e578cc00b4d-S10
I am still able to schedule tasks in Marathon, both from the DC/OS CLI and from the Marathon GUI; they are properly scheduled and executed on the agents. Also, from the Mesos interface on :5050 I can see all of the agents on the slaves page.
I have restarted the agent nodes and the master nodes. I have also rerun the DC/OS GUI installer and run the preflight check, which of course fails with an "already installed" error.
Is there a way to re-register the nodes with the DC/OS GUI, short of uninstalling and reinstalling a node?
For anyone who runs into this: my problem was related to our corporate proxy. In order to get the Universe working in my cluster, I had to add proxy settings to /opt/mesosphere/environment. I then restarted dcos-cosmos.service and life was good. However, upon server restart, dcos-history-service.service was now running with the new environment and was unable to resolve my local names through our proxy server. To solve this, I added a NO_PROXY entry to /opt/mesosphere/environment, and the DC/OS dashboard is happy again.
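For illustration, the resulting /opt/mesosphere/environment entries could look something like this (the proxy host, domains, and addresses are placeholders):

    # /opt/mesosphere/environment (fragment)
    HTTP_PROXY=http://proxy.corp.example:3128
    HTTPS_PROXY=http://proxy.corp.example:3128
    # Keep local and cluster-internal names away from the proxy so services
    # like dcos-history-service can still resolve them directly.
    NO_PROXY=localhost,127.0.0.1,.corp.example,172.16.1.20,172.16.1.21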