Openfire Cluster Hazelcast Plugin Issues - windows

Windows Server 2003R2/2008R2/2012, Openfire 3.8.1, Hazelcast 1.0.4, MySQL 5.5.30-ndb-7.2.12-cluster-gpl-log
We've set up 5 servers in an Openfire cluster. Each of them is in a different subnet; the subnets are located in different cities and interconnected through VPN routers (2-8 Mbps):
192.168.0.1 - node0
192.168.1.1 - node1
192.168.2.1 - node2
192.168.3.1 - node3
192.168.4.1 - node4
Openfire is configured to use a MySQL database, which is successfully replicating from the master node0 to all slave nodes (each node uses its own local database server, functioning as a slave).
In Openfire Web Admin > Server Manager > Clustering we are able to see all cluster nodes.
Openfire custom settings for Hazelcast:
hazelcast.max.execution.seconds - 30
hazelcast.startup.delay.seconds - 3
hazelcast.startup.retry.count - 3
hazelcast.startup.retry.seconds - 10
Hazelcast config for node0 (similar on the other nodes except for the interfaces section) (%PROGRAMFILES%\Openfire\plugins\hazelcast\classes\hazelcast-cache-config.xml):
<join>
    <multicast enabled="false" />
    <tcp-ip enabled="true">
        <hostname>192.168.0.1:5701</hostname>
        <hostname>192.168.1.1:5701</hostname>
        <hostname>192.168.2.1:5701</hostname>
        <hostname>192.168.3.1:5701</hostname>
        <hostname>192.168.4.1:5701</hostname>
    </tcp-ip>
    <aws enabled="false" />
</join>
<interfaces enabled="true">
    <interface>192.168.0.1</interface>
</interfaces>
These are the only settings changed from default ones.
The problem is that XMPP clients take too long to authorize, about 3-4 minutes. After authorization, other users in the roster appear inactive for 5-7 minutes, and during this time the logged-in user is marked as Offline in Openfire Web Admin > Sessions. Even after the user can see other logged-in users as active, messages are either not delivered, or delivered after 5-10 minutes or after a few Openfire restarts...
We appreciate any help. We have spent about 5 days trying to set up this monster, and we are out of ideas... :(
Thanks a lot in advance!
UPD 1: Installed Openfire 3.8.2 alpha with Hazelcast 2.5.1 Build 20130427. Same problem.
UPD 2: Tried starting the cluster on two servers that are in the same city, separated by probably 1-2 hops (1-5 ms ping). Everything works perfectly! Then we stopped one of those servers and started one in another city (3-4 hops, 80-100 ms ping), and the problem occurred again: slow authorizations, logged-off users in the roster, messages not delivered on time, etc.
UPD 3: Installed Openfire 3.8.2 without the bundled JRE, and Java SDK 1.7.0_25.
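One rough way to put numbers on the latency difference between the same-city and cross-city setups is to time a plain TCP connect to each member's Hazelcast port. The sketch below is only an illustration under assumptions: `tcp_connect_ms` and `probe_members` are hypothetical helper names, port 5701 and the node addresses are taken from the config above, and a TCP connect time is only a proxy for the round trips Hazelcast actually makes.

```python
import socket
import time


def tcp_connect_ms(host, port, timeout=5.0):
    """Return the time in milliseconds needed to open a TCP connection,
    or None if the connection could not be established."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def probe_members(members, port=5701, samples=3):
    """Map each member address to its lowest observed connect time in ms
    (None if every attempt failed)."""
    results = {}
    for host in members:
        times = []
        for _ in range(samples):
            elapsed = tcp_connect_ms(host, port)
            if elapsed is not None:
                times.append(elapsed)
        results[host] = min(times) if times else None
    return results


# Hypothetical usage with the addresses from the question:
# print(probe_members(["192.168.0.1", "192.168.1.1", "192.168.2.1",
#                      "192.168.3.1", "192.168.4.1"]))
```

Running this from each node gives a per-pair picture of which links are slow, which can then be compared against the authorization and roster delays observed on that node.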
[JMX screenshots for node 0 and node 1]
The red line marks the first client connection (after an Openfire restart). We tested with two users and saw the same thing: the first user (node0) connected instantly, while the second user (node1) took 5 seconds to connect.
Rosters showed users as offline on both sides for 20-30 seconds, then online users started appearing in them.
The first user sends a message to the second user. The second user waits 20 seconds, then receives the first message. The reply and all other messages are transferred instantly.
UPD 4:
While digging through the JConsole "Threads" tab, we discovered various thread states.
For example hz.openfire.cached.thread-3:
WAITING on java.util.concurrent.SynchronousQueue$TransferStack#8a5325
Total blocked: 0 Total waited: 449
Maybe this could help... We don't actually know where to look.
Thanks!

[UPDATE] Note per the Hazelcast documentation - WAN replication is supported in their enterprise version only, not in the community version that is shipped with Openfire. You must obtain an enterprise license key from Hazelcast if you would like to use this feature.
You may opt to set up multiple LAN-based Openfire clusters and then federate them using the S2S integration across separate XMPP domains. This is the preferred approach for scaling up Openfire for a very large user base.
[Original post follows]
My guess is that the longer network latency in your remote cluster configuration might be tying up the Hazelcast executor threads (for queries and events). Some of these events and queries are invoked synchronously within an Openfire cluster. Try tuning the following properties:
hazelcast.executor.query.thread.count (default: 8)
hazelcast.executor.event.thread.count (default: 16)
I would start by setting these values to 40/80 (5x) respectively to see if there is any improvement in the overall application responsiveness, and potentially even higher based on your expected load. Additional Hazelcast settings (including other thread pools) plus instructions for adding these properties into the configuration XML can be found here:
Hazelcast configuration properties
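As an illustration of what adding these properties to the configuration XML might look like, here is a small Python sketch that renders a `<properties>` element with the suggested 40/80 values. The helper name `render_properties` is hypothetical, and the element layout follows Hazelcast's general XML property format; where exactly it belongs in hazelcast-cache-config.xml should be checked against the linked documentation for your Hazelcast version.

```python
import xml.etree.ElementTree as ET


def render_properties(props):
    """Render a dict of Hazelcast property names and values as a
    <properties> XML fragment (returned as a string)."""
    root = ET.Element("properties")
    for name, value in props.items():
        prop = ET.SubElement(root, "property", {"name": name})
        prop.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Suggested starting values from the answer above (5x the defaults).
fragment = render_properties({
    "hazelcast.executor.query.thread.count": 40,
    "hazelcast.executor.event.thread.count": 80,
})
print(fragment)
```

The printed fragment can be pasted into the config file on each node; the values are a starting point for experimentation, not a tuned recommendation.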
Hope that helps ... and good luck!

Related

Defining servers from HA cluster to Consul

I have a cluster of (two) database servers (HA/ High Availability). My application connects to one of them (active) at a time. The other one remains passive and always ready to get connected when the active one fails over.
It’s a typical Windows cluster mechanism. My challenge is handling these two servers: how can I let my app know which one to connect to, since both (active and passive) need to be registered in Consul?

How to handle 4000 sip users and 10000 calls with same ip?

Which technology should I use to handle 4000 SIP users and 10000 calls on the same IP, with billing? I want to configure it so that all the SIP users use the same IP, with proper billing.
High load is not something that can be set up easily by reading a one-page answer, or even any single book.
It requires years of experience to understand the issues that can arise.
From the open-source stack, you can use OpenSIPS or Kamailio together with a cluster of some open-source billing system, the 2600hz platform, or custom billing.
In order to handle such a load you should use Kamailio plus a cluster of RTPProxy servers. The following repository contains a set of Ansible playbooks for deploying an Active-Passive Kamailio cluster with a cluster of load-balanced RTPProxy servers. I think it is a good starting point:
https://github.com/ghrst/Kamailio-HA

Reconnection in Elasticsearch Cluster

I have a question about clustering in Elasticsearch, specifically about reconnection within a cluster.
I have two Elasticsearch instances on two different servers within one network. Both instances are in the same cluster.
In an error scenario the network connection could be broken. I simulate this by pulling the network cable on one server.
After reconnecting the server to the network, clustering no longer works. When I put data into one Elasticsearch instance, it is not transferred to the other one.
Does anybody know if there are settings for reconnecting?
Best Regards
Thomas
Why not just put all the Elasticsearch servers behind a load balancer with a single DNS name? A server may go down and need manual intervention, but once the problem on that server is corrected it will automatically become available behind the load balancer again.
Did you check if all nodes join the cluster again?
You may want to try the following APIs:
Check node status:
http://es-host:9200/_nodes
Check cluster status:
http://es-host:9200/_cluster/health
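A minimal sketch of how the cluster health check could be automated after the cable is plugged back in. It assumes the `_cluster/health` response is JSON containing `status` and `number_of_nodes`; the helper names (`summarize_health`, `has_rejoined`, `fetch_health`) and the sample payload are made up for illustration, and `es-host` is the placeholder host from the URLs above.

```python
import json
from urllib.request import urlopen


def summarize_health(payload):
    """Extract (status, number_of_nodes) from a _cluster/health JSON body."""
    data = json.loads(payload)
    return data.get("status"), data.get("number_of_nodes")


def has_rejoined(payload, expected_nodes):
    """True if the cluster reports the expected node count and is not red."""
    status, nodes = summarize_health(payload)
    return nodes == expected_nodes and status in ("green", "yellow")


def fetch_health(base_url, timeout=5.0):
    """Fetch and summarize cluster health from a live node (not called here)."""
    url = base_url.rstrip("/") + "/_cluster/health"
    with urlopen(url, timeout=timeout) as resp:
        return summarize_health(resp.read().decode("utf-8"))


# Hypothetical response body, used only to show the parsing.
sample = '{"cluster_name": "demo", "status": "yellow", "number_of_nodes": 1}'
print(summarize_health(sample))
print(has_rejoined(sample, expected_nodes=2))
```

If `number_of_nodes` stays at 1 on both servers after the network is restored, each node has formed its own cluster and the nodes have not rejoined, which would explain why data is not transferred.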

Slow Apache response

I have a high-performance SoftLayer server. I am only running a chat room on it (PHP-based; it is not an IRC server). It had been working fine: the average server response for the chat room was 100 ms with 100+ concurrent users. A few days ago a user threatened to DDoS our server, and now the server is very slow: the average ping time is 1500-2000 ms with just 50-60 users. There is no high resource or bandwidth usage. I did the following things to protect my server:
1 - DDoS protection (SoftLayer provides it)
2 - Installed mod_qos and mod_evasive for Apache
3 - Disabled ping-of-death and SYN packets
I performed following analysis:
1 - Analyzed the Apache logs. There aren't any frequent requests from the same IP, or CRLF packets.
2 - Not many UDP packets
3 - Checked connections per IP and they are all normal.
However, nothing is working. That user keeps threatening us and degrades our service whenever he wants. Is there anything else I should look into to protect my server? What kind of attack could he be using to do this?
My guess is that they are exhausting your Apache workers (usually a default of 150). You might want to check how many Apache threads are currently running; if it's around 150, that might be why you have slow response times.
Some good reading on Apache performance tuning:
http://httpd.apache.org/docs/2.2/misc/perf-tuning.html
http://www.monitis.com/blog/2011/07/05/25-apache-performance-tuning-tips/
https://www.devside.net/articles/apache-performance-tuning
The output from the following commands might also be useful in figuring out what's going on.
See what's running:
ps auxf
See what Apache is doing by turning on server-status (http://httpd.apache.org/docs/2.2/mod/mod_status.html):
apachectl fullstatus
See what's going on with network connections:
netstat -npl
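To check the worker-exhaustion guess with numbers, the machine-readable output of mod_status (`/server-status?auto`) can be parsed for its `BusyWorkers` and `IdleWorkers` lines. This is a sketch under assumptions: the helper names and the sample text are hypothetical, and the limit of 150 is just the usual default mentioned above, so substitute your own configured value.

```python
def parse_status_auto(text):
    """Parse 'Key: value' lines from mod_status ?auto output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def worker_usage(text, max_workers=150):
    """Return (busy, idle, fraction of max_workers currently busy)."""
    fields = parse_status_auto(text)
    busy = int(fields["BusyWorkers"])
    idle = int(fields["IdleWorkers"])
    return busy, idle, busy / max_workers


# Made-up sample in the shape of ?auto output, for illustration only.
sample_status = "Total Accesses: 1200\nBusyWorkers: 148\nIdleWorkers: 2\n"
busy, idle, ratio = worker_usage(sample_status)
print(f"busy={busy} idle={idle} usage={ratio:.0%}")
```

A busy count that sits close to the configured limit while CPU and bandwidth stay low would be consistent with connections being held open to tie up workers.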
Anyway, I hope that helps point you in the right direction.

Full Clustering in Apache Traffic Server

I followed the steps mentioned in the official documentation for full clustering of multiple ATS instances. I installed 2 instances of ATS on 2 different Ubuntu machines (with the same specs, OS versions and hardware), and both of these act as a reverse proxy for a web service hosted on a Tomcat server on a different machine. I wasn't able to set up the cluster. Here are some of the questions that I have.
They are on the same switch or same VLAN: The two Ubuntu machines on which I installed ATS are connected to the same switch. They have the same interface specified in /etc/network/interfaces. Is this enough, or is there something else that has to be done to get clustering working?
Running the command traffic_line -r proxy.process.cluster.nodes: This returned 1 after I ran the traffic_line -x and traffic_line -L commands. But there aren't any additions or changes in the cluster.config file.
Moreover, when I make a query to one of these ATS instances (I have mapped the URLs in the remap.config file), each of them caches the responses locally and the cache is not shared between them.
From this information, can anyone tell me if I am doing something wrong? Let me know if any more info is required.
Are these on virtual machines? I almost wasted 2 days trying to figure out what was wrong when I initially set it up on OpenVZ containers. On a wild guess, I decided to migrate to 2 physical nodes, and it went well. See Apache Traffic Server Clustering not working
"proxy.process.cluster.nodes returns 1" means that the node is running as a standalone single node and the second node in the cluster has not been discovered.
Try a tcpdump capture for multicast and broadcast messages. If the other server's IP does not show up in the discovery packets, the problem is at the network level, where the network operators might have disabled multicast packet forwarding across switches.

Resources