Error connecting RabbitMQ cluster on Amazon EC2

I am experiencing some difficulties connecting two RabbitMQ nodes on Amazon EC2.
The two nodes are managed with Puppet; here is my rabbit.config file:
[
  {mnesia, [{dump_log_write_threshold, 1000}]},
  {rabbit, [
    {tcp_listeners, [5672]},
    {kernel, [{inet_dist_listen_min, 55700}, {inet_dist_listen_max, 55800}]},
    {cluster_nodes, ['rabbit@server1', 'rabbit@server2']}
  ]}
].
I believe the right ports for the cluster to connect are open. I am able to telnet from server2 to server1 on both 5672 and 4369.
I have the same /var/lib/rabbitmq/.erlang.cookie on both servers.
And from the Erlang command line, when I net_adm:ping the other node I get pang back.
However, when I run cluster_status on either node, they do not appear to be aware of each other. After stop_app, reset and rabbitmqctl cluster rabbit@server1, I always get the following error:
Error: {no_running_cluster_nodes...
Has anybody solved a similar problem, or know how to solve it?

Have you opened the ports between 55700 and 55800?
Try checking this to understand what other ports RabbitMQ listens on:
netstat -plten | grep beam
And I'd double-check the cookie...
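As a minimal sketch of those two checks (assuming the default cookie location, that the node hostnames resolve, and that your rabbitmqctl supports eval), you could run:
# on each node: the cookie values must be identical
sudo md5sum /var/lib/rabbitmq/.erlang.cookie
# ask epmd which distribution port the rabbit node registered
epmd -names
# from server2: pong means cookie and distribution ports are fine, pang means they are not
sudo rabbitmqctl eval "net_adm:ping('rabbit@server1')."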

Like Ivan suggests, you can first check which ports the servers are listening on and then add those TCP rules to the servers' Security Groups. That's a good first step.
netstat -plten | grep beam
Returns the following (if the server is still running and you have not run stop_app):
tcp 0 0 0.0.0.0:37419 0.0.0.0:* LISTEN 498 118739 15519/beam
tcp 0 0 0.0.0.0:15672 0.0.0.0:* LISTEN 498 119032 15519/beam
tcp 0 0 0.0.0.0:55672 0.0.0.0:* LISTEN 498 119029 15519/beam
tcp 0 0 :::5672 :::* LISTEN 498 119018 15519/beam
Notice the common ports 5672, 15672 and 55672 for AMQP and the web UI; the remaining port (37419 here) is the one the cluster is listening on. Check your other instances, make sure your security-group range includes the cluster port on both of them, then retry and it will work.
Security Group > Inbound > TCP Rule:
Port range 30000-65535 with the source set to the allowed security group sg-XXXXXX; repeat for the reciprocating security groups and don't forget to "Apply Rules".
Next, make sure you share the same /var/lib/rabbitmq/.erlang.cookie (just copy it from one server to all the others and restart the instances).
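A rough sketch of that copy step (the cookie value is a placeholder; paths and the rabbitmq user assume a standard package install):
# on the source server, read the cookie
sudo cat /var/lib/rabbitmq/.erlang.cookie
# on every other server
sudo service rabbitmq-server stop
echo -n 'THECOOKIEVALUE' | sudo tee /var/lib/rabbitmq/.erlang.cookie
sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
sudo chmod 400 /var/lib/rabbitmq/.erlang.cookie
sudo service rabbitmq-server start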
Then on your command line:
[root@ip-172-31-27-150 ~]# rabbitmqctl stop_app
Stopping node 'rabbit@ip-172-31-27-150' ...
...done.
[root@ip-172-31-27-150 ~]# rabbitmqctl reset
Resetting node 'rabbit@ip-172-31-27-150' ...
...done.
[root@ip-172-31-27-150 ~]# rabbitmqctl join_cluster rabbit@ip-172-31-28-79
Clustering node 'rabbit@ip-172-31-27-150' with 'rabbit@ip-172-31-28-79' ...
...done.
Lastly, don't forget to start the app again with rabbitmqctl start_app.
This worked for me on 5 EC2 instances.
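To confirm the join took effect, a quick check (the exact output shape varies by RabbitMQ version):
rabbitmqctl cluster_status
# both node names should now appear in the nodes and running_nodes lists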

Thanks for your answer. What I did was remove the contents of that directory except .erlang.cookie (rm -R inside /var/lib/rabbitmq/), and the cluster connected successfully.
Cheers!
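For anyone repeating that, a hedged sketch of what wiping the node state usually amounts to (the mnesia subdirectory holds the cluster membership and queue metadata; adjust paths to your install and keep the .erlang.cookie file):
sudo service rabbitmq-server stop
sudo rm -rf /var/lib/rabbitmq/mnesia
sudo service rabbitmq-server start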

Related

kubeadm join can't connect

I created a single-node cluster on an Ubuntu 18.04 node on EC2, using kubeadm init. However, I am unable to join the cluster (I am unable to connect to the API) from another node.
Note: this is an EC2 instance.
Kubectl is working fine on the master itself.
I used the following command where MASTER_PRIVATE_IP is 172.31.25.111.
kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=${MASTER_PRIVATE_IP} --apiserver-cert-extra-sans=${MASTER_PUBLIC_IP}
When I try to join a second node on the same private network to the cluster with kubeadm join, it just times out. I can ssh to the master with no problem, and when running netstat on the master I see that it only seems to be listening on port 6443 on IPv6 addresses - why?
I provided the private IPv4 address as the advertise address.
(The kubeconfig has that private IPv4 address, of course.)
kube-apiserver --authorization-mode=Node,RBAC --advertise-address=172.31.25.111 --allow-privileged=true --client-ca-file=/etc/kubernetes/pki/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-key-file=/etc/kubernetes/pki/sa.pub --service-cluster-ip-range=10.96.0.0/12 --tls-cert-file=/etc/kubernetes/pki/apiserver.crt --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
netstat -tulpn | grep -E ":(22|6443)" | grep LISTEN
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 :::6443 :::* LISTEN -
Any ideas?
Add an inbound rule to the master EC2 instance's security group, for instance this way:
Port range: 0 - 6555, or just 6443
Source IP: 172.31.0.0/16
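If you prefer the CLI, the equivalent rule could look roughly like this (the security group ID is a placeholder; the CIDR is the VPC range used above):
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 6443 --cidr 172.31.0.0/16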
To add to @Markownikow's answer: in this instance your issue appears to be just a missing security group rule. I was able to reproduce this on my end, and adding the above rule allowed the worker to reach the master on port 6443. (The tcp6 :::6443 listener is normal, by the way: on Linux a wildcard IPv6 socket also accepts IPv4 connections.) For the sake of completeness, below are a few additional checks you could do:
Make sure both master and worker are in the same VPC and subnet.
If master and worker are in different subnets, make sure appropriate rules are in place to allow both instances to talk to each other.
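Once the rule is in place, a quick connectivity test from the worker (using the master's private IP from the question):
nc -vz 172.31.25.111 6443
# or, without netcat, expect a TLS-level response rather than a timeout:
curl -k https://172.31.25.111:6443/healthz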

Can't connect to public IP for EC2 instance

I have an EC2 instance which is running with the following security groups:
HTTP - TCP - 80 - 0.0.0.0/0
Custom UDP Rule - UDP - 1194 - 0.0.0.0/0
SSH - TCP - 22 - 0.0.0.0/0
Custom TCP Rule - TCP - 943 - 0.0.0.0/0
HTTPS - TCP - 443 - 0.0.0.0/0
However, when I try to access http://{PUBLIC_IP} or https://{PUBLIC_IP} in the browser, I get a "{IP} refused to connect" error. I'm new to AWS. Am I missing something here? What should I do to debug?
One way to debug this particular class of problem is to use netcat in order to determine where the problem lies.
If you run netcat against port 80 on the public IP address of your instance and just get a hang (no output at all), then most likely your security group isn't allowing traffic through. Here is an example from an EC2 instance that is in a security group that doesn't allow port 80 traffic inbound:
% nc -v 55.35.300.45 80
<just hangs>
Whereas if the security group is changed to allow port 80, but the EC2 instance doesn't have any process listening on port 80, you'll get the following:
% nc -v 55.35.300.45 80
nc: connectx to 52.38.300.43 port 80 (tcp) failed: Connection refused
Given that your browser gave you a similar "connection refused", most likely the problem is that there is no web server running on your instance. You can verify this by ssh'ing into the instance and seeing if you can connect to port 80 there:
ssh ec2-user@55.35.300.45
% nc -v localhost 80
nc: connect to localhost port 80 (tcp) failed: Connection refused
If you get something like the above, you're definitely not running a webserver.
I'm not sure if it's too late to help, but I was stuck with a similar issue on my test server:
SG Inbound: SSH -> 22
HTTP -> 80
NACL: default allow/deny settings
I still couldn't reach the server from my browser. Then I realized there was nothing running on the server that could serve the request, so I started an httpd (web) server and it worked.
sudo yum -y install httpd
sudo service httpd start
This way you can test connectivity while you are playing with SGs and NACLs. Of course it's not the only way, just an example if you're figuring out your system's networking.
Have you installed a web server (nginx/Apache) to serve your requests? If so, please share your config files, so that it will help to troubleshoot.
I think the reason is probably that you did not set up a web server on your EC2 instance: if you access http://{PUBLIC_IP} or https://{PUBLIC_IP}, you need a server process listening to serve the HTTP request, as @Niranj Rajasekaran said.
By the way, by simply pinging the {PUBLIC_IP} you can see whether basic connectivity to your EC2 instance is normal or not (provided the security group allows inbound ICMP).
In command prompt or terminal, type
ping {PUBLIC_IP}
In my case, the server was running but listening only on 127.0.0.1, so it refused connections from external hosts. To see if this is your situation, you can run
netstat -an | grep <port number>
If it says 127.0.0.1:<port number> instead of 0.0.0.0:<port number>, you have this problem.
Usually there's a flag or an argument in your server code somewhere to set the host to 0.0.0.0:
app.run(host='0.0.0.0') # flask example
However, in my case I had already set this, so I thought that couldn't possibly be the issue, which is how I ended up on this thread, which asks about the problem more generally. Unfortunately, I was using Docker and had set 0.0.0.0 in the container, but was mapping that explicitly to 127.0.0.1 on the host in the docker-compose port mapping:
ports:
  - "127.0.0.1:<port number>:<port number>"
Changing that line to remove the host IP specification fixed the problem upon re-deploy:
ports:
  - "<port number>:<port number>"

Cannot connect to Cloudera Manager, not listening on port 7180

I'd really appreciate some help to get cloudera manager running on AWS EC2.
It's my first install, and I'm aiming to use the AWS Free Tier to spin up a few nodes and do some training on a Hadoop cluster and the Cloudera distribution. I'm using the RedHat RHEL 7.2 image on AWS EC2.
I am following the instructions here... Cloudera Manager installation
I have installed Cloudera Manager OK, and get to the screen where it invites you to use a browser to log in to the Cloudera Manager server. But that's where the problem starts: it seems the app is not listening on port 7180, so there's no hope of connecting from another machine across the network. I can't even connect locally on the server, yet the service appears to be running OK. But it's not listening on port 7180.
Q1 - How can I confirm the config is set to use port 7180?
Q2 - are there obvious steps that I'm missing here ?
Thanks in advance,
[Edit..]
I'm beginning to wonder if the free EC2 host is running short on memory to run Cloudera Manager. I saw one comment that implied that (AWS Forum post). But the process doesn't crash or report any problems in its log file, so it must be OK, right?
[Edit.... with more diagnostic info....]
Here's a list of the diagnostics I've checked:-
SELinux is not running [for install and testing purposes],
WAN firewalls,
EC2 firewall/Security group,
Local firewall on server,
Cloudera manager log,
Is the service up and running?
Can you connect locally?
Security group on the EC2 instance, it contains:-
SSH and Port 7180,
Firewall/iptables/firewalld on the RedHat instance, tried:-
adding ports to iptables, then
disabling iptables, then
adding ports to firewalld, then
disabling the firewalld service,
$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
ACCEPT tcp -- anywhere anywhere tcp dpt:ssh
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:7180
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:7182
But I'm getting the feeling that the installation of cloudera manager is not happy, or not running correctly.
I've checked the cloudera manager log, and it ends with the following.
$ tail /var/log/cloudera-scm-server/cloudera-scm-server.log
2016-02-25 11:02:23,581 INFO main:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 19264 new metrics
2016-02-25 11:02:28,920 INFO main:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 0 updated metrics
2016-02-25 11:02:28,924 INFO main:com.cloudera.cmon.components.MetricSchemaManager: Cross entity aggregates processed.
And when I use tail -f and restart the cloudera-scm-server service, the log scrolls a lot and comes back to the same state. If I search for ERROR, there are no lines containing "ERR".
$ sudo service cloudera-scm-server start
Starting cloudera-scm-server (via systemctl): [ OK ]
$ sudo systemctl status cloudera-scm-server
● cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server)
Active: active (exited) since Thu 2016-02-25 12:23:03 EST; 44s ago
Docs: man:systemd-sysv-generator(8)
Process: 747 ExecStart=/etc/rc.d/init.d/cloudera-scm-server start (code=exited, status=0/SUCCESS)
So if I try to test the service by connecting from the local machine, I get the sort of behaviour that makes me think it's just not listening, and maybe not started correctly.
Poking it with curl from the same shell the cloudera-scm-server service was started in:
$ curl localhost:7180
curl: (7) Failed connect to localhost:7180; Connection refused
$ wget localhost:7180
--2016-02-25 08:00:16-- http://localhost:7180/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:7180... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:7180... failed: Connection refused.
Checking what ports are listening on that machine: no 7180, what's up with that?
$ netstat -nltp
(No info could be read for "-p": geteuid()=1000 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:7432 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN -
tcp6 0 0 :::7432 :::* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 ::1:25 :::* LISTEN -
Here's what to look for, and a possible solution - give it more memory...
Check the status of the cloudera-scm-server service using [depending on your flavour of linux]
$ sudo service cloudera-scm-server status
OR
$ sudo systemctl status cloudera-scm-server
Look for the status - Active: active (running)
But if you find - Active: active (exited)
you may have a problem during the startup of the cloudera-scm-server.
In which case, look at the log files for cloudera-scm-server
$ sudo ls -l /var/log/cloudera-scm-server
$ sudo cat /var/log/cloudera-scm-server/cloudera-scm-server.out
JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000078dc58000, 265809920, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 265809920 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid831.log
[ec2-user@ip-172-31-31-166 ~]$ sudo tail -100 /var/log/cloudera-scm-server/cloudera-scm-server.out
JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000078dc58000, 265809920, 0) failed; error='Cannot allocate memory' (errno=12)
Use the top command to see how much memory is available on your system.
Possible solution - have a look at this discussion at Cloudera forum
In this case the java heap size was too small.
As we see that heap was exhausted, assuming this is not a memory leak or something of the sort, Cloudera Manager may need more heap to operate. This can be configured in /etc/default/cloudera-scm-server. You could, for instance, change "-Xmx2G" to "-Xmx3G" or "-Xmx4G". If the problem still happens, perhaps the heap dumps will yield some clues.
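A rough sketch of that edit (the exact variable name and default flags differ between Cloudera Manager versions; on many installs the heap is set through CMF_JAVA_OPTS in that file):
sudo vi /etc/default/cloudera-scm-server
# change the -Xmx value in the Java options line, for example
#   export CMF_JAVA_OPTS="-Xmx2G ..."  ->  export CMF_JAVA_OPTS="-Xmx4G ..."
sudo service cloudera-scm-server restart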
I'd suggest you tail the logs. If you are using the free tier, cloudera manager will take a while to come up... possibly up to 5 minutes or more after you start the cloudera-scm-server.
The logs should show if there are any errors, possibly issues with memory allocation since the free tier servers have limited memory available. The little snippet of log entries looks fine and typical - it will go through a long list of processes before the UI comes up on 7180.
Also while that is going on, run top or even free -g to see how much resources are being used - particularly memory.
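If memory really is the bottleneck on a free-tier instance, one common workaround (my assumption, not something suggested in this thread) is to add a swap file so the JVM can at least start:
sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -m   # confirm the swap space is active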
I was having the exact same issue: I could not reach the CM login page using the public DNS or IP on port 7180.
The following steps will help you:
iptables stopped (service iptables stop)
SELinux disabled (go to /etc/selinux/config and disable SELinux)
curl/wget localhost:7180 works (check the curl status)
ufw allow 7180
service httpd status should be running.
check the /var/log/cloudera-scm-server log: if any error is found, troubleshoot it
cloudera-scm-server status (should be in the running state)
netstat -nap | grep 7180 returns something (if another service holds the port, kill it)
telnet localhost 7180 (should connect)
1] Check the status:
sudo service cloudera-scm-server status
cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server; bad; vendor preset: disabled)
Active: active (exited) since UTC; 47min ago
Docs: man:systemd-sysv-generator(8)
rm /var/run/cloudera-scm-server.pid
NOTE: The Cloudera Manager service will not be running, as it exited abnormally.
Running service cloudera-scm-server status will print the following message: "cloudera-scm-server dead but pid file exists".
Reason: Out of memory.
Solution : Examine the heap dump that the Cloudera Manager Server creates when it runs out of memory. The heap dump file is created in the /tmp directory, has file extension .hprof and file permission of 600. Its owner and group will be the owner and group of the Cloudera Manager server process, normally cloudera-scm:cloudera-scm.
Link : http://www.cloudera.com/documentation/manager/5-0-x/Cloudera-Manager-Diagnostics-Guide/cm5dg_troubleshooting_cluster_config.html
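To locate that dump on the host, a trivial check assuming the defaults described above:
sudo ls -lh /tmp/*.hprof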
Check the status of cloudera-scm-server and follow the instructions it prints:
[root@quickstart ~]# service cloudera-scm-server status
By default, Cloudera's QuickStart VM manages CDH using Linux's configuration
and service management. To use Cloudera Manager instead, you must shut down
and disable the existing CDH services and then start Cloudera Manager. You can
do this by running the following command:
sudo /home/cloudera/cloudera-manager
[root@quickstart ~]# sudo /home/cloudera/cloudera-manager
[QuickStart] Shutting down CDH services via init scripts...
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
[QuickStart] Disabling CDH services on boot...
[QuickStart] Starting Cloudera Manager services...
[QuickStart] Deploying client configuration...
[QuickStart] Starting CM Management services...
[QuickStart] Enabling CM services on boot...
[QuickStart] Starting CDH services...
________________________________________________________________________________
Success! You can now log into Cloudera Manager from the QuickStart VM's browser:
http://quickstart.cloudera:7180
Username: cloudera
Password: cloudera

What does it mean that my resource manager does not have an open port 8032?

I have my YARN resource manager on a different node than my namenode, and I can see that something is running, which I take to be the resource manager. Ports 8031 and 8030 are bound, but not port 8032, to which my client tries to connect.
I am on CDH 5.3.1, and the following is part of the output of lsof -i
java 12478 yarn 230u IPv4 61325 0t0 TCP hadoop2.adastragrp.com:48797->hadoop2.adastragrp.com:8031 (ESTABLISHED)
java 13753 yarn 159u IPv4 61302 0t0 TCP hadoop2.adastragrp.com:8031 (LISTEN)
java 13753 yarn 170u IPv4 61308 0t0 TCP hadoop2.adastragrp.com:8030 (LISTEN)
java 13753 yarn 191u IPv4 61326 0t0 TCP hadoop2.adastragrp.com:8031->hadoop2.adastragrp.com:48797 (ESTABLISHED)
How do I diagnose what's wrong here? I suspect that the resource manager is running, but can't bind to port 8032, but I have no idea why that could be.
In the cloudera manager, the ResourceManager is shown as having good health, but at the same time I get this report:
ResourceManager summary: hadoop2.adastragrp.com (Availability:
Unknown, Health: Good). This health test is bad because the Service
Monitor did not find an active ResourceManager.
[Edit]
I can execute yarn application -list locally on the resource manager node, but when I do the same on a different node, it tries to connect to the resource manager correctly, but fails to do so. Both nodes are connected, can ping each other, and so on. I disabled the iptables service on the VM.
nmap output:
PORT STATE SERVICE REASON
8032/tcp filtered unknown host-prohibited
Is the port occupied by another process? For example, if you stopped your Hadoop cluster abnormally, some processes may still be running. If so, try ps -e | grep java, and kill them.
Gotcha, on CentOS 6 stopping the iptables service didn't really disable the firewall. I had to disable it with system-config-firewall.
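For reference, the "host-prohibited" reason in the nmap output above is the signature of an iptables REJECT rule, which fits the finding that the firewall was still active. On CentOS 6 the shell equivalent of what system-config-firewall did would be roughly (my assumption, not part of the original answer):
sudo service iptables stop
sudo chkconfig iptables off
# confirm nothing is rejecting traffic to 8032 any more
sudo iptables -L -n | grep -i reject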

RabbitMQ starts on a different port every time?

I am running RabbitMQ servers on EC2. I am trying to create a cluster, and have ports 4369, 25672 and 5672 open, as specified in the RabbitMQ docs: https://www.rabbitmq.com/clustering.html
Whenever I start my rabbitmq server:
rabbitmq-server -detached
The server starts on a different port. Output of epmd -names gives:
epmd: up and running on port 4369 with data:
name rabbit at port 50696
Where '50696' changes each time I stop the server and start it again. This is making it impossible for me to cluster my instances without allowing all ports inbound in my AWS firewall rules.
Any ideas on what is going on?
Take a look at RABBITMQ_DIST_PORT here: https://www.rabbitmq.com/ec2.html
rabbitmq.config:
[
{kernel,
[
{inet_dist_listen_min, 55555},
{inet_dist_listen_max, 55560}
]}
].
epmd -names
epmd: up and running on port 4369 with data:
name rabbit at port 55555
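With that range in place, only 4369 and 55555-55560 need to be open between the brokers (plus 5672 for clients). As an aside, and as an assumption based on the RABBITMQ_DIST_PORT variable mentioned above rather than anything shown in this thread, you can pin the distribution port to a single value via rabbitmq-env.conf instead of using a range:
# /etc/rabbitmq/rabbitmq-env.conf
RABBITMQ_DIST_PORT=25672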
