How to delete zero counts and not display them in graphs in Zabbix (SNMP)?

I have to monitor the temperature, CPU usage, and uptime of devices. I've created discovery items for this, but there is a problem: the device has several {ENT_NAME} entities.
One of the ENT_NAME entities has information about uptime, but none about CPU and temperature (it shows 0).
Another one has information about CPU and temperature, but none about uptime.
Because of this, Zabbix shows some unused graphs. How can I delete these graphs?

Choosing a safe number of members for a CP Subsystem

Tried scouring the documentation, but I'm still uncertain about the CP Subsystem setup for my current situation.
We have a Hazelcast cluster spread across 2 data centers, each data center having an even number of members, say 4, but potentially as many as double that during a rollout.
The boxes in each data center are configured to be part of a separate partition group => 2 data centers - 2 partition groups, with 4-8 members each at a snapshot in time.
What would be the best number to set as CP Subsystem member count, considering that one data center might be decoupled as part of BAU?
I initially thought of setting the count to 5, to enforce having at least one box from each data center in the Raft consensus in the general case (rollout happens only for a short time during redeployment, so maybe it is not that big of a deal), but that might mean that consensus will not be possible when one data center is decoupled. On the other hand, if I set a value smaller than the box count in one DC, say 3, what would happen if all the boxes in the consensus group were assigned to the same DC and that DC went away abruptly due to network conditions? These are mostly assumptions, since CP is a relatively new topic for me, so please correct me if I am wrong.
We prefer three datacenters, but sometimes a third is not available.
My team was faced with this same decision several years ago when expanding into a new jurisdiction. There were a lot of options; here are some. In all of these scenarios we did extensive testing of how the system behaved under network partitions.
Make a primary datacenter and a secondary datacenter
This is the option we ended up going with. We put 2/3 of the hosts in one datacenter and 1/3 in the secondary data-center. As much as possible, we weighted client traffic towards the primary datacenter. We also communicated with our customers about this preference so they could do the same if they wanted.
If the datacenter had multiple rooms, we made sure to have hosts spread across the different rooms to help mitigate power/network outages within the datacenter. At the minimum, we ensured the hosts are on different racks.
We also had multiple clusters and for each cluster we usually switched which datacenter was the primary and which was the secondary. We didn't do this in some jurisdictions with notorious power troubles.
Split half and half
It's up to the gods what happens when a datacenter goes down. This is why we chose the first option: we wanted the choice of what happens when each datacenter goes down.
Have a tie-breaker in a different region
Put a host in an entirely different region from the two datacenters. Most of the time the latency will be too high for this host to fully participate in making consensus decisions, but in the case of a network partition it can help move the majority to one of the partitions.
The tie-breaker host must be a part of the quorum and cannot be kicked out because of latency delays.
Build a new datacenter
These things are very expensive, but it makes the durability story much nicer. Not always an option.
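Whatever layout you pick, the member count itself is just configuration. For reference, it maps onto Hazelcast's declarative config roughly like this (a sketch only, assuming Hazelcast 4.x+ YAML; the partition-group details depend on your deployment):

    hazelcast:
      cp-subsystem:
        # 5 CP members => Raft majority is 3, so whichever side of a split
        # still reaches at least 3 CP members can keep making progress.
        cp-member-count: 5
        group-size: 5
      partition-group:
        enabled: true
        # ZONE_AWARE needs zone metadata from a discovery plugin; a CUSTOM
        # group with explicit member-group interface lists also works.
        group-type: ZONE_AWARE

This only shows where the member count lives; whether 3, 5 or 7 is right still comes down to how many members you can afford to lose when a whole data center is decoupled.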

Why is TSync (Time Synchronization) needed in Adaptive AUTOSAR?

I'm a rookie in Adaptive AUTOSAR.
I can't imagine why Time Synchronization (TSync) is needed. The system time of ECUs can be synchronized by PTP.
Could you explain why TSync is needed even though PTP synchronizes time across a distributed system? I also welcome any documents or materials that would help me understand TSync's usages or use cases.
The reason for the existence of time sync, along with the definition of time domains, is that you need to be able to define different time domains across different bus systems within the vehicle. One example of a not directly obvious time domain is the metering of operating hours.
On top of that, time domains can cross AUTOSAR platforms, i.e. a time domain may consist of both CP and AP nodes.
You can find explanations of time sync in, e.g., the AUTOSAR documents TPS Manifest and TPS System Template.
There need to be different time bases in a vehicle.
Examples of Time Bases in vehicles are:
• Absolute, which is based on GPS time.
• Relative, which represents the accumulated overall operating time of a vehicle, i.e. this Time Base does not start with a value of zero whenever the vehicle starts operating.
• Relative, starting at zero when the ECU begins its operation.

Algorithms for establishing baselines from time series data

In my app I collect a lot of metrics: hardware/native system metrics (such as CPU load, available memory, swap memory, network IO in terms of packets and bytes sent/received, etc.), JVM metrics (garbage collections, heap size, thread utilization, etc.), as well as app-level metrics (instrumentation that only has meaning to my app, e.g. # orders per minute, etc.).
Throughout the week, month, year I see trends/patterns in these metrics. For instance when cron jobs all kick off at midnight I see CPU and disk thrashing as reports are being generated, etc.
I'm looking for a way to assess/evaluate metrics as healthy/normal vs unhealthy/abnormal but that takes these patterns into consideration. For instance, if CPU spikes around (+/- 5 minutes) midnight each night, that should be considered "normal" and not set off alerts. But if CPU pins during a "low tide" in the day, say between 11:00 AM and noon, that should definitely cause some red flags to trigger.
I have the ability to store my metrics in a time-series database, if that helps kickstart this analytical process at all, but I don't have the foggiest clue as to what algorithms, methods and strategies I could leverage to establish these cyclical "baselines" that act as a function of time. Obviously, such a system would need to be pre-seeded or even trained with historical data that was mapped to normal/abnormal values (which is why I'm leaning towards a time-series DB as the underlying store), but this is new territory for me and I don't even know what to begin Googling to get back meaningful/relevant/educated solution candidates in the search results. Any ideas?
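To make the "baseline as a function of time" idea concrete before the answer below: one simple formulation (not taken from the answer, just a sketch assuming the samples sit in a pandas DataFrame, with invented column names) is to bucket history by day-of-week and hour, and flag a new sample when it falls far outside its bucket's usual range:

    import pandas as pd

    # df: one row per sample, columns "ts" (datetime64) and "cpu" (the metric).
    df["dow"] = df["ts"].dt.dayofweek       # 0 = Monday .. 6 = Sunday
    df["hour"] = df["ts"].dt.hour

    # Baseline: mean and standard deviation per (day-of-week, hour) bucket.
    baseline = df.groupby(["dow", "hour"])["cpu"].agg(["mean", "std"])

    def is_abnormal(ts, value, n_sigmas=3.0):
        """Flag a sample that is far outside its bucket's historical range."""
        mu, sigma = baseline.loc[(ts.dayofweek, ts.hour)]
        return abs(value - mu) > n_sigmas * sigma

With this, the nightly cron spike just becomes part of the midnight bucket's baseline, while the same spike at 11:30 AM lands several standard deviations outside its bucket and gets flagged.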
You could label each metric (CPU load, available memory, swap memory, network IO), together with the day and time, as good or bad.
Come up with a set of data for a given time frame containing the metric values and whether each is good or bad. Train a model using 70% of the data, with the good/bad answers included.
Then test the trained model on the other 30% of the data, without the answers, to see whether you get the predicted results (good, bad) from the model. You could use a classification algorithm.
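As a rough sketch of that approach (assuming scikit-learn, with a hypothetical file name and feature columns purely for illustration):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Labelled history: one row per sample, "label" is "good" or "bad".
    df = pd.read_csv("metrics_labelled.csv")   # hypothetical file
    features = ["day_of_week", "hour", "cpu_load", "free_mem", "swap_used", "net_io"]

    # 70/30 split: train on 70% with the answers, hold out 30% for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["label"], test_size=0.30, random_state=42)

    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_train, y_train)

    # Score the held-out 30% and compare predictions against the known labels.
    print(classification_report(y_test, clf.predict(X_test)))

Any classifier would do here; a random forest is just a convenient default that copes well with metrics on very different scales.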

Performance Counters - Tool for monitoring in Windows Server 2008

I am able to get performance counters every two seconds on a Windows Server 2008 machine using a PowerShell script. But when I go to Task Manager and check the CPU usage, powershell.exe is taking 50% of the CPU. So I am trying to get those performance counters using other third-party tools. I have searched and found this and this. Those two need to be refreshed manually and don't collect automatically every two seconds. Can anyone please suggest a tool that collects the performance counters every two seconds, analyzes the maximum and average of those counters, and stores the results in text/xls or any other format?
I found some Performance tools from here, listed below:
Apache JMeter
NeoLoad
LoadRunner
LoadUI
WebLOAD
WAPT
Loadster
LoadImpact
Rational Performance Tester
Testing Anywhere
OpenSTA
QEngine (ManageEngine)
Loadstorm
CloudTest
Httperf.
There are a number of tools that do this -- Google for "server monitor". Off the top of my head:
PA Server Monitor
Tembria FrameFlow
ManageEngine
SolarWinds Orion
GFI Max Nagios
SiteScope. This tool leverages either the perfmon API or the SNMP interface to collect the stats without having to run an additional non-native app on the box. If you go the open source route then you might consider Hyperic. Hyperic does require an agent to be on the box.
In either case I would look at your sample window as part of the culprit for the high CPU, and not PowerShell. The higher your sample rate, the higher you will drive the CPU, independent of the tool. You can see this yourself just by running perfmon. Use the same set of stats and watch what happens to the CPU as you adjust the sample rate from once every 30 seconds, to once in 20, then ten, 5 and finally 2 seconds as the interval. When engaged in performance testing we rarely go below ten seconds on a host, as this will cause the sampling tool to distort the performance of the host. If we have a particularly long-term test, say 24 hours, then adjusting the interval to once in 30 seconds will be enough to spot long-term trends in resource utilization.
If you are looking to collect information over a long period of time, 12 hours or more, consider going to a longer sampling interval. If you are going for a short period of sampling, an hour for instance, you may want to run a couple of different one-hour periods at lesser and greater levels of sampling (2 seconds vs 10 seconds) to ensure that the shorter sample interval is generating additional value for the additional overhead to the system.
To repeat, tools just to collect OS stats:
Commercial: SiteScope (Agentless). Leverages native interfaces
Open Source: Hyperic (Agent)
Native: Perfmon. Can dump data to a file for further analysis
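If you do end up staying with PowerShell, the whole collect/summarize/save pipeline fits in a few native cmdlets. A minimal sketch (counter path, sample count and output path are only examples):

    # Sample total CPU every 2 seconds for 10 minutes (300 samples);
    # raise -SampleInterval per the advice above to lighten the load.
    $samples = Get-Counter -Counter '\Processor(_Total)\% Processor Time' `
                           -SampleInterval 2 -MaxSamples 300

    # One row per sample: timestamp plus the cooked counter value.
    $rows = $samples | ForEach-Object {
        [pscustomobject]@{
            Time  = $_.Timestamp
            Value = $_.CounterSamples[0].CookedValue
        }
    }

    # Maximum / average summary, plus the raw data as CSV for Excel.
    $rows | Measure-Object -Property Value -Maximum -Average
    $rows | Export-Csv -Path C:\perflogs\cpu.csv -NoTypeInformation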
This should be possible without third party tools. You should be able to collect the data using Windows Performance Monitor (see Creating Data Collector Sets) and then translate that data to a custom format using Tracerpt.
If you are still looking for other tools, I have compiled a list of windows server performance monitoring tools that also includes third party solutions.

Block Replication Limits in HDFS

I'm currently rebuilding the servers that host our region servers and data nodes. When I take down a data node, after 10 minutes the blocks it had are re-replicated among the other data nodes, as they should be. We have 10 data nodes, so I see heavy network traffic as the blocks are re-replicated. However, that traffic is only about 500-600 Mbps per server (the machines all have gigabit interfaces), so it's definitely not network-bound. I'm trying to figure out what is limiting the speed at which the data nodes send and receive blocks. Each data node has six 7200 RPM SATA drives, and the IO usage is very low during this, only peaking at 20-30% per drive. Is there a limit built into HDFS that limits the speed at which blocks are replicated?
The rate of replication work is throttled by HDFS to not interfere with cluster traffic when failures happen during regular cluster load.
The properties that control this are dfs.namenode.replication.work.multiplier.per.iteration (2), dfs.namenode.replication.max-streams (2) and dfs.namenode.replication.max-streams-hard-limit (4). The first controls the rate of work scheduled to a DN at every heartbeat, and the other two further limit the maximum number of parallel threaded network transfers done by a DataNode at a time. The values in () indicate their defaults. Some description of this is available at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
You can perhaps try increasing the set of values to (10, 50, 100) respectively to increase the network usage (this requires a NameNode restart), but note that your DN memory usage may increase slightly as a result of more block information being propagated to it. A reasonable heap size for these values for the DN role would be about 4 GB.
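In hdfs-site.xml on the NameNode, added inside the existing <configuration> element, that would look roughly like this (a sketch using the suggested starting values above):

    <property>
      <name>dfs.namenode.replication.work.multiplier.per.iteration</name>
      <value>10</value>
    </property>
    <property>
      <name>dfs.namenode.replication.max-streams</name>
      <value>50</value>
    </property>
    <property>
      <name>dfs.namenode.replication.max-streams-hard-limit</name>
      <value>100</value>
    </property>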
P.S. These values were not tried by me on production systems personally. You also do not want to max out the re-replication workload such that it affects regular cluster work, as recovery of one out of three replicas may be of lesser priority than missing job/query SLAs due to lack of network resources (unless you have a really fast network that's always under-utilised, even during loaded periods). Try to tune it till you're satisfied with the results.
