NTP - debugging weird offset stats

Setup
I have a setup with 2 routers acting as NTP servers for an NTP client (Meinberg NTP) installed on a test server on the network. I was synchronizing with both servers, but after seeing weird behavior I switched to a single NTP server so that I could debug the problem.
The offset varies a whole lot during the day, with values between +10 ms and -150 ms. This is way off, and much more than our setup tolerates, which is a few ms at most.
Screenshots of statistics
I have configured the NTP client to write log files with statistics, and I used a graphical plotting tool to create graphs of the offset, jitter etc. over time. The following shows the average time of the logfile for today:
Overview statistics
Following are the offset graph, the delay, dispersion and the jitter graphs:
Offset
Delay
Dispersion
Jitter
My observations
It seems that there were four remarkable spikes in the graphs today. Each time a spike occurred, the offset seemed to come back into sync, reaching an offset of only a few ms. But then it drifts away again.
The NTP Client has been running for a week, so I would expect any initial calibrations to be done.
Can anyone point out some obvious reasons for this behavior?
Thank you in advance!

I found out that the NTP servers I was synchronizing with delayed their responses depending on the load on the servers. After giving NTP a higher priority on those servers, the problem was gone.
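For anyone debugging something similar: a quick way to see whether large offsets coincide with slow server responses is to print offset and delay side by side from the statistics files. This is only a minimal sketch, assuming the usual ntpd peerstats layout (MJD, seconds past midnight, peer address, status, offset, delay, dispersion, jitter, with times in seconds). The 5 ms tolerance, the helper names and the sample lines are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PeerSample:
    seconds: float    # seconds past midnight
    peer: str         # peer address
    offset_ms: float  # clock offset, converted to milliseconds
    delay_ms: float   # round-trip delay, converted to milliseconds


def parse_peerstats(lines: Iterable[str]) -> List[PeerSample]:
    """Parse peerstats lines (assumed layout, see the text above)."""
    samples = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        samples.append(PeerSample(
            seconds=float(fields[1]),
            peer=fields[2],
            offset_ms=float(fields[4]) * 1000.0,
            delay_ms=float(fields[5]) * 1000.0,
        ))
    return samples


def suspicious(samples: List[PeerSample], max_offset_ms: float = 5.0) -> List[PeerSample]:
    """Return samples whose absolute offset exceeds the (hypothetical) tolerance."""
    return [s for s in samples if abs(s.offset_ms) > max_offset_ms]


# Made-up sample lines for illustration only, not taken from my log files.
EXAMPLE = [
    "56789 3600.123 192.0.2.1 9614 0.002100 0.001500 0.000900 0.000300",
    "56789 7200.456 192.0.2.1 9614 -0.148000 0.085000 0.001200 0.040000",
]

if __name__ == "__main__":
    for s in suspicious(parse_peerstats(EXAMPLE)):
        print(f"{s.seconds:>9.1f}s  offset={s.offset_ms:+.1f} ms  delay={s.delay_ms:.1f} ms")
```

If the flagged samples also show an unusually high delay, a slow or loaded server is a plausible suspect, which is what it turned out to be in my case.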

Related

Kafka message timestamps for request/response

I am building a performance monitoring tool which works in a cluster with Kafka topics.
For example, I am monitoring two topics: request and response. I need two timestamps, one from the request and one from the response. Then I can calculate the difference to see how much time was spent in the service that received the request and produced the response.
Please take into account that this runs on a cluster, so different components may run on different hosts with different physical clocks. These clocks could be out of sync, which would distort the results significantly.
Also, I cannot reliably use the clock of the monitoring tool itself, as its own processing time would influence the timing results.
So I would like to design a proper way to reliably calculate the time difference. What is the most reliable way to measure the time difference between two events in Kafka?
Solution 1:
We had a similar problem before, and our solution was to set up NTP (Network Time Protocol).
With this, one of your nodes acts as the NTP server and runs a daemon to keep time in sync across all your nodes. We kept it on UTC, and all other nodes ran NTP clients, which kept the same time across all the servers.
Solution 2:
Build a common clock API for all your components which provides the current time. This makes your system design independent of each node's local clock.

How to account for clock offsets in a distributed system?

Background
I have a system consisting of several distributed services, each of which is continuously generating events and reporting these to a central service.
I need to present a unified timeline of the events, where the ordering in the timeline corresponds to the moment each event occurred. The frequency of event occurrence and the network latency are such that I cannot simply use time of arrival at the central collector to order the events.
E.g. in the following scenario:
E1 needs to be rendered in the timeline above E2, despite arriving at the collector afterwards, which means the events need to come with timestamp metadata. This is where the problem arises.
Problem
Due to constraints on how the environment is set up, it is not possible to ensure that the local time services on each machine are reliably aware of current UTC time. I can assume that each machine can accurately gauge relative time, i.e. the clock speeds are close enough to make measurement of short timespans identical, but problems like NTP misconfiguration/partitioning make it impossible to guarantee that every machine agrees on the current UTC time.
This means that a naive approach of simply generating a local timestamp for each event as it occurs, then ordering events using that will not work: every machine has its own opinion of what universal time is.
So the question is: how can I recover an ordering for events generated in a distributed system where the clocks do not agree?
Approaches I've considered
Most solutions I find online go down the path of trying to synchronize all the clocks, which is not possible for me since:
I don't control the machines in question
The reason the clocks are out of sync in the first place is due to network flakiness, which I can't fix
My own idea was to query some kind of central time service every time an event is generated, then stamp that event with the retrieved time minus network flight time. This gets hairy, because I have to add another service to the system and ensure its availability (I'm back to square zero if the other services can't reach this one). I was hoping there is some clever way to do this that doesn't require me to centralize timekeeping in this way.
A simple solution, somewhat inspired by your own at the end, is to periodically ping what I'll call the time-source server. In the ping include the service's chip clock; the time-source echoes that and includes its timestamp. The service can then deduce the round-trip-time and guess that the time-source's clock was at the timestamp roughly round-trip-time/2 nanoseconds ago. You can then use this as an offset to the local chip clock to determine a globalish time.
You don't have to use a different service for this; the Collector server will do. The important part is that you don't have to call the time-source server at every request; it removes it from the critical path.
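The arithmetic behind the ping can be sketched as follows. This is a minimal illustration, assuming the reply spends about half of the round trip in flight; the function names and the numbers in the example are hypothetical.

```python
def estimate_offset(t_send, server_time, t_recv):
    """Estimate (offset, rtt) from a single ping.

    t_send, t_recv: local clock readings when the ping left and the echo arrived.
    server_time:    the time-source's clock reading carried in the echo.
    Assumption: the echo was in flight for roughly rtt / 2.
    """
    rtt = t_recv - t_send
    # By the time the echo arrives, the time-source's clock has advanced
    # by about the return flight time.
    server_now = server_time + rtt / 2.0
    return server_now - t_recv, rtt


def global_time(local_now, offset):
    """Translate a local clock reading onto the time-source's timeline."""
    return local_now + offset


if __name__ == "__main__":
    # Made-up readings in seconds: the local clock is about 150 s behind.
    offset, rtt = estimate_offset(t_send=100.000, server_time=250.030, t_recv=100.040)
    print(f"rtt={rtt:.3f}s offset={offset:.3f}s")
    print(f"event at local 101.000 -> global {global_time(101.000, offset):.3f}")
```

The local readings should come from a clock that never jumps (for example `time.monotonic()` in Python), since only relative time is trusted on each machine.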
If you don't want a sawtooth function for the time, you can smooth the time difference.
Congratulations, you've rebuilt NTP!
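One simple way to do the smoothing mentioned above is an exponential moving average over the measured offsets. This is just one possible choice, not what NTP itself does; the class name and the `alpha` value are assumptions for illustration.

```python
class SmoothedOffset:
    """Exponentially smoothed clock offset (a sketch, not NTP's algorithm)."""

    def __init__(self, alpha=0.2):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha  # weight given to each new measurement
        self.value = None   # no estimate until the first measurement

    def update(self, measured_offset):
        """Fold one new offset measurement into the running estimate."""
        if self.value is None:
            self.value = measured_offset
        else:
            # Move only a fraction of the way towards the new measurement,
            # so a single noisy ping cannot make the timeline jump.
            self.value += self.alpha * (measured_offset - self.value)
        return self.value


if __name__ == "__main__":
    smoother = SmoothedOffset(alpha=0.5)
    for measured in (10.0, 20.0, 15.0):
        print(smoother.update(measured))
```

A smaller `alpha` gives a smoother timeline but reacts more slowly when the real offset changes.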

How to speed up nagios to monitor hosts over the cloud

When using Nagios with multiple hosts spread over the network, host status shows a noticeable lag and takes a long time to be reflected in the Nagios server CGI. What is the optimal NRPE/Nagios configuration to speed up status updates in a distributed host environment?
In my case I use nagios core 4.1
nrpe 1.5
server/clients: Amazon ec2
The GUI is usually only updated once each minute (automatically), though clicking refresh can provide you with 'nearly' the latest information. I say nearly because there is a distinct processing loop inside of the Nagios core that causes it to never be real time. NRPE is going to run at the speed of your network connection - it does little else besides sending and receiving tiny amounts of data. About the only delay here is the time it takes to actually perform the check and send back the response - which, of course, has way too many factors to mention. Try looking at the output of
[nagioshome]/bin/nagiostats
There are several entries that tell you:
'Latency' - the time between when the check was scheduled to start, and the actual start time.
'Execution Time' - the amount of time checks are actually taking to run.
These entries will have three numbers, which are: Min / Max / Avg
High latency numbers (in my book that means Avg is greater than 1 second) usually mean your Nagios server is overworked. There are a few things you can do to improve latency times, and these are outlined in the 'nagios.cfg' file. This latency has nothing to do with network speed or the speed of NRPE - it is primarily hardware speed. If you're already using the optimal values specified in nagios.cfg, then it's time to find some faster hardware.
High execution times (for me an Avg greater than 5 seconds) can be blamed on just about everything except your Nagios system. This can be caused by faulty networks (improper packet routing), overloaded networks, faulty and/or poorly designed checks, slow target systems, ... the list is endless. Nothing you do with the Nagios and/or NRPE configs will help lower these values. Well, you could disable NRPE's encryption to improve wire time; but if you have encryption enabled in the first place, then it's not likely you'd want it disabled.
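If you want to check those two rules of thumb automatically, something like the sketch below could work. It assumes the relevant nagiostats lines look like `Active Service Latency: 0.001 / 0.271 / 0.069 sec`; the exact layout may differ between versions, so treat the pattern, the function names and the sample text as assumptions. The 1 second and 5 second limits are the personal thresholds mentioned above.

```python
import re

# Matches "<name>: <min> / <max> / <avg> sec" (assumed layout; adjust the
# pattern if your nagiostats output looks different).
TRIPLE = re.compile(
    r"^(?P<name>[^:]+):\s*(?P<min>[\d.]+)\s*/\s*(?P<max>[\d.]+)\s*/\s*(?P<avg>[\d.]+)\s*sec"
)


def parse_stats(text):
    """Return {entry name: (min, max, avg)} for every matching line."""
    stats = {}
    for line in text.splitlines():
        match = TRIPLE.match(line.strip())
        if match:
            stats[match.group("name").strip()] = tuple(
                float(match.group(key)) for key in ("min", "max", "avg")
            )
    return stats


def diagnose(stats, latency_limit=1.0, exec_limit=5.0):
    """Apply the two rules of thumb to the Avg column."""
    findings = []
    for name, (_, _, avg) in stats.items():
        if "Latency" in name and avg > latency_limit:
            findings.append(f"{name}: avg {avg:.3f}s, Nagios server may be overworked")
        elif "Execution Time" in name and avg > exec_limit:
            findings.append(f"{name}: avg {avg:.3f}s, look at network, checks or targets")
    return findings


# Made-up output for illustration only.
SAMPLE = """\
Active Service Latency:                 0.120 / 4.310 / 1.850 sec
Active Service Execution Time:          0.010 / 10.020 / 0.640 sec
"""

if __name__ == "__main__":
    for finding in diagnose(parse_stats(SAMPLE)):
        print(finding)
```

In practice you would feed it the captured output of the nagiostats binary instead of `SAMPLE`.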

Spread waiting time among connection requests and performance issues

I developed a server for a custom protocol based on tcp/ip-stack with Netty. Writing this was a pleasure.
Right now I am testing performance. I wrote a test-application on netty that simply connects lots (20.000+) of "clients" to the server (for-loop with Thread.wait(1) after each bootstrap-connect). As soon as a client-channel is connected it sends a login-request to the server, that checks the account and sends a login-response.
The overall performance seems to be quite OK. All clients are logged in within 60s. What's not so good is the spread of waiting time per connection: I have extremely fast logins and extremely slow logins, varying from 9ms to 40.000ms over the whole test. Is it somehow possible to share the waiting time fairly among the requesting channels (FIFO)?
I measured a lot of significant timestamps and found a strange phenomenon. I have a lot of connections where the server's timestamp of "channel-connected" is way after the client's timestamp (up to 19 seconds). I also do have the "normal" case, where they match and just the time between client-sending and server-reception is several seconds. And there are cases of everything in between those two cases. How can it be, that client and server "channel-connected" are so much time away from each other?
What is certain is that the client receives the server's login-response immediately after it has been sent.
Tuning:
I think I read most of the performance-articles around here. I am using the OrderMemoryAwareThreadPool with 200 Threads on a 4CPU-Hyper-Threading-i7 for the incoming connections and also do start the server-application with the known aggressive-options. I also completely tweaked my Win7-TCP-Stack.
The server runs very smoothly on my machine. CPU usage and memory consumption are at about 50% of what is available.
Too much information:
I also started 2 of my test-apps from 2 separate machines "attacking" the server in parallel with 15.000 connections each. There I had about 800 connections that got a timeout from the server. Any comments here?
Best regards and cheers to Netty,
Martin
Netty has a dedicated boss thread that accepts incoming connections. When the boss thread accepts a new connection, it forwards the connection to a worker thread. Because of this, the latency between the acceptance and the actual socket read might be larger than expected under load. Although we are looking into different ways to improve the situation, in the meantime you might want to increase the number of worker threads so that each worker thread handles fewer connections.
If you think it's performing way worse than a non-Netty application, please feel free to file an issue with a reproducing test case. We will try to reproduce and fix the problem.

Sync client time to server time, i.e. make the client application independent of the local computer time

Ok, so the situation is as follows.
I have a server with services for a game. A particular command from the server sends a timestamp for when the next game round should commence. To get this perfectly synced on all connected clients, I also have a web service that returns a timestamp of the server's current time.
What I know: the time between request sent and answer received.
What I don't know: where the latency lies, whether in client processing, server processing or bandwidth issues.
What is the best practice to get a reasonable result here? I guess that GPS must have solved this in some fashion, but I've been unable to find a good pattern.
What I do now is add half the latency of the request to the server timestamp, but it's not quite good enough. This may have to do with the fact that the time between send and receive can be as high as 11 seconds.
Suggestions?
There are many common solutions for syncing time between machines, including the PLL-based clock discipline implemented by ntpd. These are useful to you if you can change the machine's local time. If not, perhaps you should do more or less what you did, but drop sync points where the latency is unreasonable.
The best practice is usually not to synchronise the absolute times but to work with relative times instead.
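Dropping the unreasonable sync points and then working with a relative countdown could look roughly like this. It is only a sketch: it keeps the half-latency assumption from the question, and the `max_rtt` cut-off, the function names and the sample numbers are hypothetical.

```python
def best_offset(samples, max_rtt=0.5):
    """Estimate server-minus-client clock offset from several requests.

    samples: iterable of (t_send, server_time, t_recv) in seconds, where
             t_send and t_recv are client clock readings.
    Samples slower than max_rtt are dropped; of the rest, the fastest round
    trip is trusted most, since it leaves the least room for asymmetry.
    Returns None if no sample is usable.
    """
    best = None  # (rtt, offset) of the fastest usable sample so far
    for t_send, server_time, t_recv in samples:
        rtt = t_recv - t_send
        if rtt < 0 or rtt > max_rtt:
            continue  # unreasonable latency: ignore this sync point
        if best is None or rtt < best[0]:
            best = (rtt, server_time + rtt / 2.0 - t_recv)
    return None if best is None else best[1]


def seconds_until(round_start_server, client_now, offset):
    """Relative time until the round starts, as seen from the client."""
    return round_start_server - (client_now + offset)


if __name__ == "__main__":
    # Made-up measurements; the first one took 11 seconds and is discarded.
    samples = [
        (0.0, 1000.00, 11.0),
        (20.0, 1020.15, 20.1),
        (30.0, 1030.30, 30.4),
    ]
    offset = best_offset(samples)
    print(f"offset={offset:.3f}s")
    print(f"round starts in {seconds_until(1100.0, 40.0, offset):.3f}s")
```

The client then only needs to count down the returned number of seconds on its own clock, so the absolute local time never matters.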

Resources