Zabbix Performance Tuning (with Proxies)

We set up a distributed monitoring cluster with one Zabbix (3.4.7) server and 8 proxies:
Zabbix:
OS: Debian Stretch
CPU: 16*2.27GHz
RAM: 48GB
Disk: Raid1 10K (Non-SSD)
zabbix_server.conf:
LogFile=/var/log/zabbix/zabbix_server.log
PidFile=/var/run/zabbix/zabbix_server.pid
DBName=zabbix
DBUser=zabbix
DBHost=127.0.0.1
DBPort=3307
LogFileSize=0
DBPassword=****
Timeout=4
AlertScriptsPath=/etc/zabbix/alert.d/
FpingLocation=/usr/bin/fping
LogSlowQueries=3000
Include=/etc/zabbix/zabbix_server.conf.d/*.conf
StartAlerters=10
StartPollers=80
StartPollersUnreachable=80
StartTrappers=20
StartPingers=30
StartEscalators=5
CacheSize=8G
StartDBSyncers=16
HistoryCacheSize=2048M
TrendCacheSize=256M
ValueCacheSize=10G
HistoryIndexCacheSize=2G
ExternalScripts=/etc/zabbix/alert.d/
SSHKeyLocation=/nonexistent/.ssh
Proxy:
OS: Debian Stretch
CPU: 15*2.5GHz
RAM: 6GB
Disk: Raid1 10K (Non-SSD)
zabbix_proxy.conf:
Server=XXXX
Hostname=zbx-lte
LogFile=/var/log/zabbix/zabbix_proxy.log
LogFileSize=0
PidFile=/var/run/zabbix/zabbix_proxy.pid
SocketDir=/var/run/zabbix
DBName=zabbix
DBUser=zabbix
DBPassword=159753
ConfigFrequency=600
DataSenderFrequency=1
StartPollers=240
StartPollersUnreachable=80
StartTrappers=20
StartPingers=80
SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
CacheSize=1G
StartDBSyncers=16
HistoryCacheSize=2048M
HistoryIndexCacheSize=2G
Timeout=6
ExternalScripts=/usr/lib/zabbix/externalscripts
FpingLocation=/usr/bin/fping
LogSlowQueries=3000
We're monitoring nearly 1,650 nodes (SNMP, ICMP, agent, SSH, external scripts, and external applications) with Zabbix.
Since about two months ago we have been seeing many gaps in non-ICMP graphs (Pic: 5.png) on one specific Zabbix proxy (zbx-lte in the pics).
This graph is for a device monitored via SNMPv2. (There are many other graphs like this on this proxy.)
I captured network traffic on the relevant Zabbix proxy. For one of the items (1m interval, captured for 1 hour) the proxy sent only 16 queries instead of the expected 60.
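(For anyone reproducing this, a capture along the following lines is enough to count the SNMP requests per interval; the interface name and device IP are placeholders, not from the actual setup.)
tcpdump -i eth0 -w proxy-snmp.pcap host 192.0.2.10 and udp port 161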
If I move this device to be monitored by the Zabbix server or any other proxy, everything works properly.
It seems there is a problem with this proxy.
Please help me to find the root cause.

I got it!
Surprisingly, the answer is a funny one: the Server address in the zabbix_proxy configuration was an FQDN, and our performance issues were solved when I added it to /etc/hosts.
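For anyone hitting the same symptom, the fix is just a static entry in /etc/hosts on the proxy; the IP and FQDN below are placeholders:
192.0.2.1    zabbix.example.com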

Related

Slow Apache response

I have a high-performance SoftLayer server. I am only running a chat room (PHP-based; it's not an IRC server) on this server. It normally works fine: average server response time (for the chat room) is 100 ms with 100+ concurrent users. Some days ago a user threatened to DDoS our server. Now the server is very slow: average ping time is 1500-2000 ms with just 50-60 users. There is no high resource usage or bandwidth usage. I did the following things to protect my server:
1 - DDoS protection (SoftLayer provides it)
2 - Installed mod_qos and mod_evasive for Apache
3 - Disabled ping of death and SYN packets
I performed the following analysis:
1 - Analyzed Apache logs. There aren't any frequent requests from the same IP, or CRLF packets.
2 - Not many UDP packets
3 - Checked connections per IP and they are all normal.
However, nothing is working. That user makes good on his threats and kills our time whenever he wants. Is there anything else I should look into to protect my server? What kind of attack could he be making to do this?
My guess is that they are exhausting your Apache workers (the default limit is usually 150). You might want to check how many Apache worker processes are currently running; if it's around 150, that might be why you have slow response times.
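For example, a rough way to count running worker processes (assuming the process is named apache2 as on Debian/Ubuntu; on RHEL-based systems it is httpd):
ps -C apache2 --no-headers | wc -l
If that number sits at your MaxClients (MaxRequestWorkers in Apache 2.4) limit, the workers are saturated.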
Some good reading on Apache performance tuning:
http://httpd.apache.org/docs/2.2/misc/perf-tuning.html
http://www.monitis.com/blog/2011/07/05/25-apache-performance-tuning-tips/
https://www.devside.net/articles/apache-performance-tuning
The output from the following commands might also be useful in figuring out what's going on.
See what's running:
ps auxf
See what Apache is doing by turning on server-status (http://httpd.apache.org/docs/2.2/mod/mod_status.html):
apachectl fullstatus
See what's going on with network connections:
netstat -npl
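Another rough sketch for spotting a single chatty client: count established connections per remote IP (assumes IPv4 addresses in the netstat output):
netstat -ant | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head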
Anyway, I hope that helps point you in the right direction.

Full Clustering in Apache Traffic Server

I followed the steps mentioned in the official documentation for full clustering of multiple ATS instances. I installed 2 instances of ATS on 2 different Ubuntu machines (having the same specs, OS versions and hardware), and both of them act as a reverse proxy for a web service hosted on a Tomcat server on a different machine. I wasn't able to set up the cluster. Here are some of the queries that I have.
They are on the same switch or same VLAN: the two Ubuntu machines on which I installed ATS are connected to the same switch. They have the same interface mentioned in /etc/network/interfaces. Is this enough, or is there something else that has to be done to get clustering working?
Running the command traffic_line -r proxy.process.cluster.nodes: this returned 1 after I ran the traffic_line -x and traffic_line -L commands. But in the cluster.config file there aren't any additions or changes.
Moreover, when I make a request to one of these ATS instances (I have mapped the URLs in the remap.config file), both of them cache the responses locally, and the cache is not shared across nodes.
From this information, can anyone tell me if I am doing something wrong? Let me know if any more info is required.
Are these on virtual machines? I almost wasted 2 days trying to figure out what was wrong when I initially set it up on OpenVZ containers. On a wild guess, I decided to migrate to 2 physical nodes, and it went well. See Apache Traffic Server Clustering not working.
proxy.process.cluster.nodes returns 1
means that it is just a standalone single node, and the second node in the cluster has not been discovered.
Try a tcpdump for multicast and broadcast messages. If the other server's IP is not showing up in the discovery packets, it is something at the network level; the netops team might have disabled multicast packet forwarding across switches.
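A sketch of such a capture (the interface name is a placeholder; run it on one node and look for the other node's IP in the output):
tcpdump -i eth0 'ether multicast or ether broadcast'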

Network, improve connecting time

I noticed that the connecting time for my site is slower than for the other sites that I have tried: 100-200 ms.
I am referring to the connecting time on the Network tab (DNS lookup, connecting, waiting, etc.).
How can I improve it? Is it just something that is controlled by my host (Webfaction) or can I change some settings? I am the only person on my site at this time. DNS lookup is fast, not sure if that's relevant.
There are many reasons or parameters that can make a site open slowly:
Bandwidth on the server.
Traffic on the server, in terms of requests and their data size.
A network issue, such as DNS resolving your query slowly (try the 4.2.2.2 or 8.8.8.8 DNS servers).
Last, and least likely, some attack on the network, such as flooding.
My suggestion is to verify your server bandwidth and the rate of new HTTP connections per second.
Also check whether some uploading or downloading is going on.
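A quick way to see where the DNS time goes is to compare resolvers directly (the domain is a placeholder) and look at the "Query time" line in the output:
dig www.example.com
dig www.example.com @8.8.8.8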

Openfire Cluster Hazelcast Plugin Issues

Windows Server 2003R2/2008R2/2012, Openfire 3.8.1, Hazelcast 1.0.4, MySQL 5.5.30-ndb-7.2.12-cluster-gpl-log
We've set up 5 servers in an Openfire cluster. Each of them is in a different subnet; the subnets are located in different cities and interconnected through VPN routers (2-8 Mbps):
192.168.0.1 - node0
192.168.1.1 - node1
192.168.2.1 - node2
192.168.3.1 - node3
192.168.4.1 - node4
Openfire is configured to use a MySQL database which is successfully replicating from the master node0 to all slave nodes (each node uses its own local database server, functioning as a slave).
In Openfire Web Admin > Server Manager > Clustering we are able to see all cluster nodes.
Openfire custom settings for Hazelcast:
hazelcast.max.execution.seconds - 30
hazelcast.startup.delay.seconds - 3
hazelcast.startup.retry.count - 3
hazelcast.startup.retry.seconds - 10
Hazelcast config for node0 (similar on other nodes except for interface section) (%PROGRAMFILES%\Openfire\plugins\hazelcast\classes\hazelcast-cache-config.xml):
<join>
    <multicast enabled="false" />
    <tcp-ip enabled="true">
        <hostname>192.168.0.1:5701</hostname>
        <hostname>192.168.1.1:5701</hostname>
        <hostname>192.168.2.1:5701</hostname>
        <hostname>192.168.3.1:5701</hostname>
        <hostname>192.168.4.1:5701</hostname>
    </tcp-ip>
    <aws enabled="false" />
</join>
<interfaces enabled="true">
    <interface>192.168.0.1</interface>
</interfaces>
These are the only settings changed from default ones.
The problem is that XMPP clients take too long to authorize, about 3-4 minutes. After authorization, other users in the roster appear inactive for 5-7 minutes, and during this time the logged-in user is marked as Offline in Openfire Web Admin > Sessions. Even after the user is able to see other logged-in users as active, messages are not delivered, or are delivered after 5-10 minutes or after a few Openfire restarts...
We appreciate any help. We spent about 5 days trying to set up this monster and are out of ideas... :(
Thanks a lot in advance!
UPD 1: Installed Openfire 3.8.2 alpha with Hazelcast 2.5.1 Build 20130427; same problem.
UPD 2: Tried starting the cluster on two servers that are in the same city, separated by probably 1-2 hops @ 1-5 ms ping. Everything works perfectly! Then we stopped one of those servers and started one in another city (3-4 hops @ 80-100 ms ping) and the problem occurred again... Slow authorizations, logged-off users in the roster, messages not delivered on time, etc.
UPD 3: Installed Openfire 3.8.2 without the bundled JRE, and Java SDK 1.7.0_25.
Here are JMX screenshots:
node 0:
node 1:
The red line is the first client connection (after an Openfire restart). Tested with two users; same thing... The first user (node0) connected instantly, the second user (node1) spent 5 seconds connecting.
Rosters showed offline users on both sides for 20-30 seconds, then online users started appearing in them.
The first user sends a message to the second user. The second user waits 20 seconds, then receives the first message. The reply and all other messages are transferred instantly.
UPD 4:
While digging through the JConsole "Threads" tab we discovered threads in various states.
For example hz.openfire.cached.thread-3:
WAITING on java.util.concurrent.SynchronousQueue$TransferStack@8a5325
Total blocked: 0; Total waited: 449
Maybe this could help... We actually don't know where to look.
Thanks!
[UPDATE] Note per the Hazelcast documentation - WAN replication is supported in their enterprise version only, not in the community version that is shipped with Openfire. You must obtain an enterprise license key from Hazelcast if you would like to use this feature.
You may opt to setup multiple LAN-based Openfire clusters and then federate them using the S2S integration across separate XMPP domains. This is the preferred approach for scaling up Openfire for a very large user base.
[Original post follows]
My guess is that the longer network latency in your remote cluster configuration might be tying up the Hazelcast executor threads (for queries and events). Some of these events and queries are invoked synchronously within an Openfire cluster. Try tuning the following properties:
hazelcast.executor.query.thread.count (default: 8)
hazelcast.executor.event.thread.count (default: 16)
I would start by setting these values to 40/80 (5x) respectively to see if there is any improvement in the overall application responsiveness, and potentially even higher based on your expected load. Additional Hazelcast settings (including other thread pools) plus instructions for adding these properties into the configuration XML can be found here:
Hazelcast configuration properties
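As a rough sketch (check the linked documentation for the exact syntax in your Hazelcast version), such properties usually go into hazelcast-cache-config.xml like this, with the 40/80 values being the 5x starting point suggested above:
<properties>
    <property name="hazelcast.executor.query.thread.count">40</property>
    <property name="hazelcast.executor.event.thread.count">80</property>
</properties>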
Hope that helps ... and good luck!

Is it possible to do performance testing on localhost with an actual network environment?

I need to test the performance of an application running on localhost as if it were in the online environment. I mean a performance test conducted with simulated network traffic, limited bandwidth, or other parameters, as if it were online.
Could Apache ab do the simulation?
We've used Charles and Firefox Throttle in the past to simulate slow networks.
Why can't you connect to a different PC, or even use a virtual machine and rate limit the virtual network connection?
Yes, but you will need to connect to your application by IP address, not "localhost" or 127.0.0.1. Typically for web applications (HTTP) I use Fiddler, which can simulate limited bandwidth, but only if you connect as I have noted. I'm not aware of bandwidth limiters for non-HTTP traffic.
ab won't simulate low-level network errors. If you're on Linux, you can simulate some network conditions with 'tc'. See http://www.kdedevelopers.org/node/1878 for a small example.
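A minimal netem sketch (needs root; the interface and values are only examples) that adds latency and packet loss on the loopback interface, and then removes the rule again when done:
tc qdisc add dev lo root netem delay 100ms loss 1%
tc qdisc del dev lo root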
You can set up a local tunnel to expose your localhost to the world using ngrok. From there you can use any number of online performance tools.
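For example, assuming your app listens locally on port 8080:
ngrok http 8080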
You can throttle bandwidth via Chrome DevTools.
