RabbitMQ HA Cluster on Amazon EC2

In an Amazon VPC, I have installed RabbitMQ on two nodes.
On Node 1, I ran the following commands:
#Node 1
/etc/init.d/rabbitmq-server stop
rabbitmq-server -detached
rabbitmqctl start_app
rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
On Node 2, I ran the following commands to set up the cluster:
/etc/init.d/rabbitmq-server stop
rabbitmq-server -detached
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@<PrivateIP>
rabbitmqctl start_app
rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
The RabbitMQ nodes are behind an Elastic Load Balancer. I ran a Java program that keeps pushing messages into RabbitMQ.
Case 1: rabbitmqctl list_queues showed the same queue name and message count on both nodes while the Java program was pushing messages to the queue.
Case 2: I stopped RabbitMQ on node 2 and then started it again. I checked the cluster status and queue message counts. The message count was correct (3330 on both node 1 and node 2).
Case 3: I stopped RabbitMQ on node 1 while the Java program was pushing messages to the queue.
I checked the queue message count on node 2; the count was 70.
I started RabbitMQ on node 1 and checked again; the queue count was 75.
I want to set up a RabbitMQ high-availability cluster and ensure no message loss. I have enabled sync_queue on RabbitMQ start in /etc/init.d/rabbitmq-server.
I would appreciate it if you could point out why the message count dropped from approximately 3330 to 70, and also what the best way is to set up and ensure HA.

A few tips:
Does your app use publisher confirms? If you don't want to lose messages, it should (see the sketch after these tips).
Is automatic syncing of queues enabled? If not, you have to manually initiate queue syncing for any queue.
You should not restart any node while queues are being synced, or messages might be lost.
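For the first tip, here is a minimal Java sketch using publisher confirms; the host and queue name are hypothetical. For the second tip, adding "ha-sync-mode": "automatic" to the policy you already set makes mirrors synchronise automatically.
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class ConfirmedPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbit-elb.example.com"); // hypothetical: your ELB or node address
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            channel.confirmSelect(); // enable publisher confirms on this channel
            channel.queueDeclare("ha.test", true, false, false, null); // durable queue (name is hypothetical)
            channel.basicPublish("", "ha.test",
                    MessageProperties.PERSISTENT_TEXT_PLAIN,
                    "hello".getBytes());
            // blocks until the broker confirms the message, or throws after 5 seconds
            channel.waitForConfirmsOrDie(5000);
        }
    }
}
With confirms, the publisher only treats a message as sent once the broker has taken responsibility for it, which is what protects you from the silent loss you saw when a node went down mid-publish.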

If you don't want to lose messages, you should consider using transactions:
channel.txSelect(); // open a transaction on the channel
channel.basicPublish("", yourQueue, MessageProperties.PERSISTENT_TEXT_PLAIN,
        message.getBytes());
channel.txCommit(); // the broker owns the message once the commit returns
This can kill performance if you have a high message rate.
Visit
http://www.rabbitmq.com/blog/2011/02/10/introducing-publisher-confirms/

Related

'partitions have leader brokers without a matching listener' on kubernetes after kafka restart

I have several Spring Boot apps communicating via Kafka, running inside a Kubernetes cluster. I am using the bitnami/kafka Helm chart for deploying Kafka.
Everything works fine until the Kafka broker (I only have a single instance) restarts. After that, the producer gets 'X partitions have leader brokers without a matching listener' ... to fix it I have to set up the whole cluster again: kill all apps, remove Kafka and the volumes, and put everything back.
I found some material regarding "advertised.listeners", but nothing has worked yet.
For example that one:
https://medium.com/@tsuyoshiushio/configuring-kafka-on-kubernetes-makes-available-from-an-external-client-with-helm-96e9308ee9f4
The question for me is: why does it work in the beginning and stop only after the crash?
Thx
Oliver
The question for me is: why does it work in the beginning and stop only after the crash?
When your Kafka broker restarts, it gets a new IP address. You must make sure the new IP address is reflected in your broker properties, i.e. it must be included in the advertised listeners.
As a workaround, you can put a Kubernetes Service in front of your Kafka deployment and include that in your advertised listeners.
The other way is to use a StatefulSet so that your broker pod keeps a stable network identity. Bring down your broker, set advertised.listeners to the broker pod's address, and start it again.
One rule of thumb is that the topic metadata returned by kafkacat must not contain an unreachable address.
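The same metadata check can be sketched from Java with the Kafka AdminClient; the bootstrap address below is hypothetical, and the point is simply that every advertised host it prints back must be resolvable and reachable from your producers.
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class CheckAdvertisedListeners {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // hypothetical service name
        try (AdminClient admin = AdminClient.create(props)) {
            // the host/port pairs printed here are what the broker advertises back to clients
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.println("broker " + node.id() + " advertises " + node.host() + ":" + node.port());
            }
        }
    }
}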

ElasticSearch Client Node Loses Connection on AWS EC2 with Kernel Log "Setting Capacity to 83886080"

I have an ElasticSearch 2.4.4 cluster with 3 client nodes, 3 master nodes, and 4 data nodes, all on AWS EC2. Servers are running Ubuntu 16.04.3 LTS (GNU/Linux 4.4.0-104-generic x86_64). An AWS Application ELB is in front of the client nodes.
At random times, one of the clients will suddenly write this message to the kernel log:
Jan 17 05:54:51 localhost kernel: [2101268.191447] Setting capacity to 83886080
Note, this is the size of the primary boot drive in sectors (it's 40GB). After this message is received, the client node loses its connection to the other nodes in the cluster, and reports:
[2018-01-17 05:56:21,483][INFO ][discovery.zen ] [prod_es_233_client_1] master_left [{prod_es_233_master_2}{0Sat6dx9QxegO2rM03_o9A}{172.31.101.13}{172.31.101.13:9300}{data=false, master=true}],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
The kernel message seems to be coming from xen-blkfront.c
https://github.com/analogdevicesinc/linux/blob/8277d2088f33ed6bffaafbc684a6616d6af0250b/drivers/block/xen-blkfront.c#L2383
This problem seems unrelated to the number or type of requests to ES at the time, or any other load-related parameter. It just occurs randomly.
The Load Balancer will record 504s and 460s when attempting to contact the bad client. Other client nodes are not affected and return with normal speed.
Is this a problem with EC2's implementation of Xen?

Replace ZooKeeper servers

I want to replace the current 3 ZooKeeper servers with 3 new ZooKeeper servers. I have:
added the new ZooKeeper servers to Ambari,
added the new servers to the following variables:
hbase.zookeeper.quorum
ha.zookeeper.quorum
zookeeper.connect
hadoop.registry.zk.quorum
yarn.resourcemanager.zk-address
I restarted the services and the ResourceManager, but I still can't connect to any new ZooKeeper server when I turn off all the old ZooKeeper servers:
zookeeper-client -server zoo-new1
I get the following error:
"Unable to read additional data from server sessionid 0x0, likely server has closed socket"
And on new Zoo server in logs (zookeeper.out):
"Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running"
When I start one of the old ZooKeeper servers, everything works, and I can also connect to the new ZooKeeper servers.
My best guess is that this has to do with one of the most important mechanisms in ZooKeeper, namely leader election. If you start with a ZooKeeper quorum of 3 servers and add 3 more servers to it, you need at least 4 servers running for the quorum to be reachable, since a majority of 6 is 4. When a ZooKeeper node is unable to elect a leader, it looks as if it is down.
This is also why your setup works when you start one of the old ZooKeeper servers: 4 of the 6 possible servers are then alive. If you want the new setup to work, you need to remove the old servers from the configuration so that the quorum only knows about the three new ones. Simply shutting a ZooKeeper server down does not remove it from the quorum.
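Once the configuration lists only the three new servers and they can form a majority among themselves, a quick client-side verification is a connect check like the minimal Java sketch below; the hostnames are hypothetical.
import org.apache.zookeeper.ZooKeeper;

public class NewEnsembleCheck {
    public static void main(String[] args) throws Exception {
        // the connect string lists only the three new servers, so this only reaches
        // CONNECTED if the new ensemble can elect a leader on its own
        ZooKeeper zk = new ZooKeeper("zoo-new1:2181,zoo-new2:2181,zoo-new3:2181", 30000, event -> {});
        Thread.sleep(3000); // give the client a moment to establish the session
        System.out.println("client state: " + zk.getState());
        zk.close();
    }
}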

Kafka Spout fails to acknowledge message in storm while setting multiple workers

I have a Storm topology that subscribes to events from a Kafka queue. The topology works fine while the number of workers (config.setNumWorkers) is set to 1. When I increase the number of workers to 2 or more, the KafkaSpout fails to acknowledge the messages, as seen in the Storm UI. I am not able to figure out the exact cause of the problem.
I have a 3-node cluster running one Nimbus and 2 Supervisors.
My problem got resolved. The reason Kafka was unable to acknowledge the spout messages was a hostname conflict: I had mistakenly put the same hostname in the /etc/hostname and /etc/hosts files of both workers. When I checked the worker I got the exception 'unable to communicate with host', which is how I figured out the problem was the hostname. I updated the hostname in the /etc/hosts mapping and the /etc/hostname file, and the messages started to be acknowledged. Thank you.

Unsubscribe from rabbit queue when machine is shut down

I am running RabbitMQ through the AMQP gem on 3 worker machines. When the machines are rebooted, my queue shows that workers are only added, never unsubscribed. For example, say each machine runs 5 workers:
When I boot 3 machines, I have 15 workers subscribed to the queue
When I shut down all 3 machines, I still have 15 workers subscribed to the queue
When I reboot the 3 machines, I now have 30 workers subscribed to the queue
In reality, I should only have 15 workers.
How can I ensure that my connection to my task queue closes when the machine reboots/shuts down? I have tried:
Signal.trap("INT") do #handles the ctrl c case
connection.close do
EM.stop { exit }
end
end
Signal.trap("TERM") do #handles the reboot and shut down case
connection.close do
EM.stop { exit }
end
end
This does NOT work.
I think what you are looking for is the Consumer Cancellation Notification extension.
In your case, the clients have not been notified of the machines rebooting (in other words, they did not receive a basic.cancel notification from the RabbitMQ broker when the machines rebooted).
See the excerpt taken from the link above:
an extension in which the broker will send to the client a basic.cancel in the case of such unexpected consumer cancellations. This is not sent in the case of the broker receiving a basic.cancel from the client. AMQP 0-9-1 clients don't by default expect to receive basic.cancel methods from the broker asynchronously, and so in order to enable this behaviour, the client must present a capabilities table in its client-properties in which there is a key consumer_cancel_notify and a boolean value true
I'm not a ruby programmer, but I reckon the java example in the link above should give you the full picture.
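Since the answer points to the Java example, here is a minimal Java sketch of a cancel-aware consumer; the host and queue name are hypothetical. Recent versions of the Java client present the consumer_cancel_notify capability by default, so handleCancel is invoked when the broker cancels the consumer unexpectedly.
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

public class CancelAwareConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("rabbit-host"); // hypothetical broker address
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();
        channel.basicConsume("work.queue", true, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body) {
                System.out.println("got: " + new String(body));
            }

            @Override
            public void handleCancel(String consumerTag) {
                // called when the broker cancels this consumer unexpectedly,
                // e.g. because the queue it was subscribed to was deleted
                System.out.println("consumer " + consumerTag + " was cancelled by the broker");
            }
        });
    }
}
The Ruby side would use whatever callback the AMQP gem exposes for this notification; the key point is the same, the client has to opt in and react when the broker tells it the consumer is gone.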
