Kafka message timestamps for request/response - performance

I am building a performance monitoring tool that works in a cluster with Kafka topics.
For example, I am monitoring two topics: request and response. I need two timestamps, one from the request and one from the response, so I can calculate the difference and see how much time was spent in the service that received the request and produced the response.
Please take into account that this runs on a cluster, so different components may run on different hosts with different physical clocks; those clocks can be out of sync, which would distort the results significantly.
Also, I cannot reliably use the clock of the monitoring tool itself, as that would skew the timing results by its own processing time.
So I would like a design that calculates the time difference reliably. What is the most reliable way to measure the time difference between two events in Kafka?
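For context, here is a minimal sketch of how such a monitor could read both timestamps from Kafka itself, assuming requests and responses are correlated by record key and the Kafka Java client is used (the broker address, group id, and String serialization are placeholders):

    import java.time.Duration;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class RequestResponseLatency {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("group.id", "latency-monitor");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            // Remember when each request was stamped, keyed by correlation key.
            Map<String, Long> requestTimestamps = new HashMap<>();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Arrays.asList("request", "response"));
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        if (record.topic().equals("request")) {
                            requestTimestamps.put(record.key(), record.timestamp());
                        } else {
                            Long start = requestTimestamps.remove(record.key());
                            if (start != null) {
                                System.out.printf("key=%s service time ~ %d ms%n",
                                        record.key(), record.timestamp() - start);
                            }
                        }
                    }
                }
            }
        }
    }

Note that record.timestamp() is CreateTime by default, i.e. stamped by each producer's own clock, which is exactly the skew problem described above.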

Solution 1:
We had a similar problem before, and the solution was to set up NTP (Network Time Protocol).
In this setup, one of your nodes acts as the NTP server and runs daemons to keep time in sync across all your nodes. We kept UTC on the server, and all other nodes ran NTP clients, which kept the same time across all the servers.
Solution 2:
Build a common clock API for all your components that provides the current time. This makes your system design independent of each node's local clock. A minimal sketch follows.
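Here is one way such an API could look in Java (the names are illustrative; how the offset is obtained from the central time source is up to whichever sync mechanism you pick):

    /** Common clock abstraction: components ask this instead of System.currentTimeMillis(). */
    public interface ClusterClock {
        /** Current cluster-wide time in epoch milliseconds. */
        long currentTimeMillis();
    }

    /** A clock that applies a centrally obtained offset to the local clock. */
    public class OffsetClusterClock implements ClusterClock {
        private volatile long offsetMillis; // (central time - local time), refreshed periodically

        public OffsetClusterClock(long initialOffsetMillis) {
            this.offsetMillis = initialOffsetMillis;
        }

        /** Called by a background task after each sync with the central time source. */
        public void updateOffset(long newOffsetMillis) {
            this.offsetMillis = newOffsetMillis;
        }

        @Override
        public long currentTimeMillis() {
            return System.currentTimeMillis() + offsetMillis;
        }
    }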

Related

JMeter in different geographical locations

I need some advice. I have always worked on on-prem setups and had my JMeter instances in the same data center as the application servers under test. Now we have a cloud setup, and I can request JMeter instances in different geographical locations in the cloud to mimic real production load behavior. Is that what I should do? Will the response times of the transactions include network disturbances, or will they in fact be like production? In on-prem testing, when the JMeter instances are in the same data center as the application servers under test, we eliminate network issues from the response times entirely!
There are 2 different types of performance metrics for websites:
Perceived System Performance
Perceived User Experience
Perceived System Performance won't be impacted by a geo-distributed JMeter setup; the backend doesn't care where the requests originate from or how long the request/response takes to travel over the wire back and forth. In fact, the system will receive less load compared to the same test scenario running in the same network.
Perceived User Experience will be different: you will see larger response times, as it takes time for a packet to physically travel around the globe and pass through all the routers and switches on its way to the system under test and back.
In terms of the JMeter Glossary:
Latency will be higher
Delta between latency and elapsed time will be higher

How to account for clock offsets in a distributed system?

Background
I have a system consisting of several distributed services, each of which is continuously generating events and reporting these to a central service.
I need to present a unified timeline of the events, where the ordering in the timeline corresponds to the moment each event occurred. The frequency of event occurrence and the network latency are such that I cannot simply use the time of arrival at the central collector to order the events.
For example, if event E1 occurs before E2 but arrives at the collector afterwards, E1 still needs to be rendered in the timeline above E2. This means the events need to carry timestamp metadata, and this is where the problem arises.
Problem
Due to constraints on how the environment is set up, it is not possible to ensure that the local time services on each machine are reliably aware of current UTC time. I can assume that each machine can accurately gauge relative time, i.e. the clock speeds are close enough to make measurement of short timespans identical, but problems like NTP misconfiguration/partitioning make it impossible to guarantee that every machine agrees on the current UTC time.
This means that the naive approach of simply generating a local timestamp for each event as it occurs, and then ordering events by it, will not work: every machine has its own opinion of what universal time is.
So the question is: how can I recover an ordering for events generated in a distributed system where the clocks do not agree?
Approaches I've considered
Most solutions I find online go down the path of trying to synchronize all the clocks, which is not possible for me since:
I don't control the machines in question
The reason the clocks are out of sync in the first place is due to network flakiness, which I can't fix
My own idea was to query some kind of central time service every time an event is generated, then stamp that event with the retrieved time minus the network flight time. This gets hairy, because I have to add another service to the system and ensure its availability (I'm back to square one if the other services can't reach this one). I was hoping there is some clever way to do this that doesn't require me to centralize timekeeping in this way.
A simple solution, somewhat inspired by your own idea at the end, is to periodically ping what I'll call the time-source server. In the ping, include the service's chip clock; the time-source echoes that back along with its own timestamp. The service can then deduce the round-trip time and estimate that the time-source's clock read that timestamp roughly round-trip-time/2 nanoseconds ago. You can then use this as an offset to the local chip clock to determine a global-ish time.
You don't have to use a separate service for this; the Collector server will do. The important part is that you don't have to call the time-source server on every request; that keeps it out of the critical path.
If you don't want a sawtooth function for the time, you can smooth the time difference (see the sketch below).
Congratulations, you've rebuilt NTP!
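A minimal sketch of that estimate, smoothing included (the names and the smoothing factor are illustrative, not a definitive implementation):

    /** Offset estimation from ping/echo exchanges with the time source, per the scheme above. */
    public class ClockOffsetEstimator {
        private double smoothedOffsetNanos;
        private boolean initialized;

        /**
         * @param sentLocalNanos local chip clock (System.nanoTime()) when the ping was sent
         * @param recvLocalNanos local chip clock when the echo was received
         * @param sourceNanos    the time-source's timestamp carried in the echo
         * @return the current smoothed offset estimate
         */
        public long addSample(long sentLocalNanos, long recvLocalNanos, long sourceNanos) {
            long roundTrip = recvLocalNanos - sentLocalNanos;
            // The source stamped the echo roughly half a round trip before it arrived.
            long rawOffset = (sourceNanos + roundTrip / 2) - recvLocalNanos;

            // An exponential moving average avoids the sawtooth mentioned above.
            if (!initialized) {
                smoothedOffsetNanos = rawOffset;
                initialized = true;
            } else {
                smoothedOffsetNanos = 0.9 * smoothedOffsetNanos + 0.1 * rawOffset;
            }
            return (long) smoothedOffsetNanos;
        }

        /** Global-ish time: local chip clock plus the smoothed offset. */
        public long globalTimeNanos() {
            return System.nanoTime() + (long) smoothedOffsetNanos;
        }
    }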

JMeter breakup of response time

Is there any way to get a breakdown of the response time reported by JMeter? I.e.:
Travel time of the request
Processing time
Travel time of the response
I know JMeter works entirely on the client side, and the response time it reports is the time to last byte (TTLB). But is there any plugin, or any other means, to achieve this?
Thanks in advance.
You are asking for more than JMeter can give you on its own.
There is no plugin that will give you such a breakdown (getting the server's processing time is impossible unless you have monitoring agents installed on the target server, and monitoring agents are not part of JMeter as of now).
You can get the approximate request travel time by using the newer Connect Time feature of JMeter.
In practice,
Response time = processing time + latency
You can find latency with various network tools, or get a rough idea using ping (JMeter also reports latency; cross-verify it against ping or WANem).
Once you know the latency, you can derive the processing time.
That should give you the breakdown you need.
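As a toy illustration of that arithmetic (the sample numbers are made up):

    /** Toy breakdown of a JMeter response time, per the formula above. */
    public class ResponseTimeBreakdown {
        public static void main(String[] args) {
            long responseTimeMs = 250; // elapsed time reported by JMeter (made-up sample)
            long latencyMs = 40;       // network latency cross-verified with ping (made-up sample)

            // Response time = processing time + latency, so:
            long processingTimeMs = responseTimeMs - latencyMs;
            System.out.printf("processing time ~ %d ms%n", processingTimeMs);
        }
    }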
1. Add listeners to the thread group:
jp@gc - Composite Graph
jp@gc - Connect Times Over Time
jp@gc - Response Times Over Time
2. In the Composite Graph configuration, combine Connect Times Over Time and Response Times Over Time.
3. Run the test and compare the two graphs: the larger the difference between the two listeners, the more the bottleneck is at the server layer; the smaller the difference, the more it is at the network layer.
4. You can also view the specific numbers by adding a View Results in Table listener:
Server processing time = Latency - Connect Time
The larger this difference is, the more the bottleneck is at the service layer; the smaller it is, the more the bottleneck is at the network layer.
Server processing time covers program processing time, queue waiting time, database query time, and so on, so the term "server processing time" is somewhat loose. This method can confirm whether the response-time bottleneck is at the network layer or the service layer; if it is at the service layer, further analysis is needed.
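As a toy sketch of that classification, using the elapsed, Latency, and Connect values JMeter records per sample (the numbers and the network-side decomposition are illustrative):

    /** Toy classification of where a sample's response time went, per the formulas above. */
    public class BottleneckCheck {
        public static void main(String[] args) {
            // Made-up sample values, in milliseconds, as JMeter reports them.
            long elapsedMs = 300; // full response time (time to last byte)
            long latencyMs = 260; // time to first byte
            long connectMs = 20;  // TCP connect time

            long serverProcessingMs = latencyMs - connectMs;      // formula above
            long networkMs = connectMs + (elapsedMs - latencyMs); // connect + body transfer

            System.out.printf("server processing ~ %d ms, network ~ %d ms%n",
                    serverProcessingMs, networkMs);
            System.out.println(serverProcessingMs > networkMs
                    ? "bottleneck looks like the service layer"
                    : "bottleneck looks like the network layer");
        }
    }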

NTP - debugging weird offset stats

Setup
I have a setup with 2 routers acting as NTP servers for an NTP client (Meinberg NTP) installed on a test server on the network. I had been synchronizing with both servers, but after experiencing weird behavior I switched to a single NTP server so I could debug the problem.
The offset varies a whole lot during the day, with values between +10 ms and -150 ms. This is way off, and much more than our setup requires, which is a few ms at most.
Screenshots of statistics
I have configured the NTP client to write log files with statistics, and I have used a graphical plotting tool to create graphs of the offset, jitter, etc. over time from today's log file:
[Screenshots: overview statistics, followed by the offset, delay, dispersion, and jitter graphs.]
My observations
Four times today there have been some remarkable spikes on the graphs. Each time a spike occurred, the offset snapped back into sync, hitting just a few ms, but then it seems to drift away again.
The NTP client has been running for a week, so I would expect any initial calibration to be done.
Can anyone point out some obvious reasons for this behavior?
Thank you in advance!
I found out that the NTP servers I was synchronizing with delayed their responses depending on the load on the servers. After giving the NTP service a higher priority on those servers, the problem was gone.

sync client time to server time, i.e. to make a client application independent of the local computer time

OK, so the situation is as follows.
I have a server with services for a game; a particular command from the server sends a timestamp for when the next game round should commence. To get this perfectly synced on all connected clients, I also have a web service that returns a timestamp of the server's current time.
What I know: the time between the request being sent and the answer being received.
What I don't know: where the latency lies, whether in client processing, server processing, or bandwidth issues.
What is the best practice to get a reasonable result here? I guess GPS must have solved this in some fashion, but I've been unable to find a good pattern.
What I do now is add half the round-trip latency to the server timestamp, but it's not quite good enough. This may be because the time between send and receive can be as high as 11 seconds.
Suggestions?
There are many common solutions for syncing time between machines, including the correct PLL implementation done by ntpd with NTP. This is useful to you if you can change the machine's local time. If not, you should do more or less what you are doing now, but drop sync points where the latency is unreasonable, as sketched below.
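A minimal sketch of that filtering: collect several offset samples, keep only those with a reasonable round trip, and take the median (the threshold and names are arbitrary):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    /** Estimates the client-to-server clock offset from several samples,
     *  discarding sync points whose round trip is unreasonably long. */
    public class FilteredOffsetEstimator {
        private final List<Long> offsets = new ArrayList<>();

        /**
         * @param sentMs   client clock when the request was sent
         * @param recvMs   client clock when the response arrived
         * @param serverMs server timestamp carried in the response
         * @param maxRttMs drop samples slower than this (arbitrary threshold)
         */
        public void addSample(long sentMs, long recvMs, long serverMs, long maxRttMs) {
            long rtt = recvMs - sentMs;
            if (rtt > maxRttMs) {
                return; // unreasonable latency: discard this sync point
            }
            // Assume the server stamped the response half a round trip before it arrived.
            offsets.add(serverMs + rtt / 2 - recvMs);
        }

        /** Median of the surviving samples; robust against remaining outliers. */
        public long offsetMs() {
            if (offsets.isEmpty()) {
                throw new IllegalStateException("no usable samples yet");
            }
            List<Long> sorted = new ArrayList<>(offsets);
            Collections.sort(sorted);
            return sorted.get(sorted.size() / 2);
        }
    }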
The best practice is usually not to synchronise the absolute times but to work with relative times instead.
