Netty TrafficCounter - performance

I am currently using io.netty.handler.traffic.ChannelTrafficShapingHandler and io.netty.handler.traffic.TrafficCounter to measure performance across a Netty client and server. I consistently see a discrepancy between the value of Current Write on the server and Current Read on the client. How can I account for this difference, considering that the Write/Read KB/s figures nearly match all the time?
2014-10-28 16:57:50,099 [Timer-4] INFO PerfLogging 130 - Netty Traffic stats TrafficShaping with Write Limit: 0 Read Limit: 0 and Counter: Monitor ChannelTC431885482 Current Speed Read: 3049 KB/s, Write: 0 KB/s Current Read: 90847 KB Current Write: 0 KB
2014-10-28 16:57:42,230 [ServerStreamingLogging] DEBUG c.f.s.r.l.ServerStreamingLogger:115 - Traffic Statistics WKS226-39843-MTY6NDU6NTAvMDAwMDAw TrafficShaping with Write Limit: 0 Read Limit: 0 and Counter: Monitor ChannelTC385810078 Current Speed Read: 0 KB/s, Write: 3049 KB/s Current Read: 0 KB Current Write: 66837 KB
Is there some sort of compression between client and server?
I can see that my client-side value is approximately 3049 * 30 = 91470 KB, where 30 is the number of seconds over which the cumulative figure is calculated.

Scott is right; there are some fixes in progress that also take this into consideration.
Some explanation:
Read figures reflect the real read bandwidth and the real number of bytes read (since the system is not the origin of the data it receives).
For write events, the system is the source and manages them, so there are 2 kinds of writes (and there will be in the next fix):
proposed writes, which are not yet sent but, before the fix, are already counted in the bandwidth (lastWriteThroughput) and in the current write (currentWrittenBytes)
real writes, counted when they are effectively pushed to the wire
Currently the issue is that currentWrittenBytes can be higher than the real writes, since most writes are scheduled in the future; they therefore depend on the write speed of the handler that is the source of the write events.
After the fix, we will be more precise about what is "proposed/scheduled" and what is really "sent":
proposed writes are counted in lastWriteThroughput and currentWrittenBytes
real write operations are counted in realWriteThroughput and realWrittenBytes when the writes occur on the wire (or at least on the pipeline)
Now there is a second element: if you set the checkInterval to 30s, this implies the following:
the bandwidth (the global average, and therefore the traffic control) is computed over those 30s (read or write)
every 30s the "small" counters (currentXxxx) are reset to 0, while the cumulative counters are not: if you use the cumulative counters, the bytes received and sent should be almost the same
The smaller the checkInterval, the more precise the bandwidth measurement, but it should not be so small that it causes too-frequent resets and too much thread activity for the bandwidth computations. In general, a default of 1s is quite efficient.
The difference seen could be, for instance, because the 30s event of the sender is not "synchronized" with the 30s event of the receiver (and should not be). So according to your numbers: when the receiver (read) resets its counters on its 30s event, the writer resets its own counters 8s later (24 010 KB).
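The roughly 8s offset can be checked against the two log lines in the question. The following is only a back-of-envelope check in Python (it is not Netty code); it assumes both counters grow at the logged 3049 KB/s and that the clocks of the two machines roughly agree.

```python
from datetime import datetime

# Timestamps copied from the client and server log lines in the question.
FMT = "%Y-%m-%d %H:%M:%S,%f"
client_time = datetime.strptime("2014-10-28 16:57:50,099", FMT)
server_time = datetime.strptime("2014-10-28 16:57:42,230", FMT)

client_read_kb = 90847    # "Current Read" on the client
server_write_kb = 66837   # "Current Write" on the server
rate_kb_per_s = 3049      # logged speed on both sides

# Gap between the two "current" counters, expressed in KB and in seconds of traffic.
gap_kb = client_read_kb - server_write_kb
gap_seconds = gap_kb / rate_kb_per_s

# Wall-clock gap between the two log lines.
log_gap_seconds = (client_time - server_time).total_seconds()

print(gap_kb)
print(round(gap_seconds, 2))
print(log_gap_seconds)
```

Both figures come out close to 7.87 s, which is consistent with the two snapshots simply having been taken about 8s apart within their check intervals rather than with data being lost or compressed.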

Related

Synchronisation for audio decoders

I have the following setup (it's basically a pair of TWS earbuds and a smartphone):
2 audio sink devices (or buds), both connected to the same source device. One of these devices is primary (and is responsible for handling the connection); the other is secondary (and simply sniffs data).
The source device transmits a stream of encoded data, and the sink devices need to decode and play it in sync with each other. The problem is that there's a considerable delay between the receivers (~5 ms @ 300 kbps, ~10 ms @ 600 kbps and @ 900 kbps).
It seems that the synchronisation mechanism that is already implemented simply doesn't work, so my only option seems to be to implement another one.
It's possible to send messages between the buds (but because this uses the same radio interface as sink-to-source communication, only a small number of bytes can be transferred at a relatively long interval, e.g. 48 bytes per 300 ms, maybe a few times more, but probably not by much) and to control the decoder library.
I tried the following simple algorithm: every 50 milliseconds the secondary sends the primary a message containing the number of decoded packets. The primary receives it and updates the state of its decoder accordingly. The decoder on the primary only decodes if the difference between its number of already decoded frames and the number received from the peer is between 0 and 100 (every frame is 2.(6) ms), and the cycle continues.
This actually only makes things worse: now latency is about 200 ms or even higher.
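One thing the numbers above suggest (an observation, not a verified diagnosis) is that a gate of 0 to 100 frames tolerates a large skew. A quick check in Python, using only the figures from the question:

```python
from fractions import Fraction

frame_ms = Fraction(8, 3)     # 2.(6) ms per frame, i.e. 2.666... ms
gate_frames = 100             # upper bound of the allowed frame difference
report_interval_ms = 50       # how often the secondary reports its count

# Largest skew the gate still accepts, in milliseconds.
gate_window_ms = float(gate_frames * frame_ms)

# Frames decoded between two consecutive reports.
frames_per_report = float(report_interval_ms / frame_ms)

print(round(gate_window_ms, 1))
print(frames_per_report)
```

A tolerance of about 267 ms is of the same order as the ~200 ms latency observed, and the secondary decodes almost 19 frames between two reports, so a much narrower window may be worth trying before replacing the approach.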
Is there something that could be done to improve my synchronization method, or would I be better off using something else? If so, what would be best in this case? Fixing the existing implementation would probably be the best way, but it seems to be closed-source, so I cannot modify it.

Performance Counters and IMC Counter Not Matching

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. On a relatively idle system, I ran the following perf command; its output is shown below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':
3,713,037 offcore_response.all_data_rd.l3_miss.any_response
2,909,573 mem_load_uops_retired.l3_miss
10.016644133 seconds time elapsed
These two values seem consistent, as the latter excludes prefetch requests and those not targeted at DRAM. But they do not match the read counter in the IMC. This counter is called UNC_IMC_DRAM_DATA_READS and documented here. I read the counter and reread it 1 second later. The difference was around 30,000,000 (EDITED). Multiplied by 10 (to estimate for 10 seconds), the resulting value is around 300 million (EDITED), which is 100 times the value of the above-mentioned performance counters (EDITED). It is nowhere near 3 million! What am I missing?
P.S.: The difference is much smaller (but still large) when the system has more load.
The question is also asked here:
https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
UPDATE:
Please note that PCM output matches my IMC counter reads.
This is the relevant PCM output:
The values for columns READ, WRITE and IO are calculated based on UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO will be either READ or WRITE. In other words, during the depicted one second interval, almost (because of the inaccuracy reported in the above-mentioned doc) 2.01GB of the 2.42GB READ and WRITE requests belong to IO. Based on this explanation, the above three columns seem consistent with each other.
The problem is that there still exists a LARGE gap between the IMC and PMC values!
The situation is the same when I boot in runlevel 1. The processes on the scheduler are one of swapper, kworker and migration. Disk IO is almost 85KB/s. I'm wondering what leads to such a (relatively) huge amount of IO. Is it possible to detect that (e.g., using a counter or a tool)?
UPDATE 2:
I think that there is something wrong with the IO column. It is always something in the range [1.99,2.01], regardless of the amount of load in the system!
UPDATE 3:
In runlevel 1, the average number of occurrences of the uops_retired.all event in a 1-second interval is 15,000,000. During the same period, the number of read requests recorded by the associated IMC counter is around 30,000,000. In other words, assuming that all memory accesses are directly caused by CPU instructions, there are two memory accesses for each retired micro-operation. This seems impossible, especially given that there are multiple levels of caches. Therefore, in the idle scenario, perhaps the read accesses are caused by IO.
Actually, it was mostly caused by the GPU device. This was the reason for exclusion from performance counters. Here is the relevant output for a sample execution of PCM on a relatively idle system with resolution 3840x2160 and refresh rate 60 using xrandr:
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):
As can be seen, changing screen resolution reduced read and IO traffic considerably (more than 100x!).
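As a rough plausibility check of the GPU explanation, the display scanout bandwidth at 3840x2160 can be estimated and compared with the IMC figures quoted above. This is only a sketch in Python under two assumptions that are not stated in the post: a 32-bit (4-byte) framebuffer that is read from DRAM once per refresh, and each IMC counter increment standing for one 64-byte cache line.

```python
width, height = 3840, 2160
refresh_hz = 60
bytes_per_pixel = 4           # assumption: 32-bit framebuffer

# Bytes the display engine must fetch per second to refresh the screen.
scanout_bytes_per_s = width * height * bytes_per_pixel * refresh_hz
scanout_gb_per_s = scanout_bytes_per_s / 1e9

imc_reads_per_s = 30_000_000  # counter delta over 1 second, from the question
line_bytes = 64               # assumption: one increment = one 64-byte line
imc_gb_per_s = imc_reads_per_s * line_bytes / 1e9

print(scanout_bytes_per_s)
print(round(scanout_gb_per_s, 2))
print(round(imc_gb_per_s, 2))
```

Under these assumptions the scanout alone is about 1.99 GB/s, which falls in the [1.99,2.01] range seen in the IO column, and about 30,000,000 reads per second correspond to roughly 1.92 GB/s.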

How to prevent SD card from creating write delays during logging?

I've been working on an Arduino (ATMega328p) prototype that has to log data during certain events. An LSM6DS33 sensor is used to generate 6 values (2 bytes each) at a sample rate of 104 Hz. This data needs to be logged for a period of 500-20000ms.
In my code, I generate an interrupt every 1/104 sec using Timer1. When this interrupt occurs, data is read from the sensor, calibrated and then written to an SD card. Normally, this is not an issue. Reading the data from the sensor takes ~3350us, calibrating ~5us and writing ~550us. This means a total cycle takes ~4000us, whereas 9615us is available.
In order to save power, I wish to lower the voltage to 3.3V. According to the atmel datasheet, this also means that the clock frequency should be lowered to 8MHz. Assuming everything will go twice as slow, a measurement cycle would still be possible because ~8000us < 9615us.
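The timing budget above can be written out explicitly. The sketch below is plain Python arithmetic on the figures given in the question and keeps the same simplifying assumption that every step takes twice as long at 8MHz.

```python
sample_rate_hz = 104
period_us = 1_000_000 / sample_rate_hz   # time available per sample

# Measured at 5V / 16MHz (values from the question).
read_us, calibrate_us, write_us = 3350, 5, 550
cycle_16mhz_us = read_us + calibrate_us + write_us

# Assumption from the question: everything scales by 2x at 8MHz.
cycle_8mhz_us = 2 * cycle_16mhz_us
slack_us = period_us - cycle_8mhz_us

print(round(period_us))
print(cycle_16mhz_us)
print(cycle_8mhz_us)
print(round(slack_us))
```

That leaves roughly 1800us of headroom per sample at 8MHz, so a single cycle that runs more than about 1.8ms longer than usual would overrun the sample period.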
After some testing (still 5V@16MHz), however, I noticed that every now and then a write cycle would take ~1880us instead of ~550us. I am using the SdFat library to write to and test SD cards (RawWrite example). The following results came in when I tested the card:
Start raw write of 100000 KB
Target rate: 100 KB/sec
Target time: 100 seconds
Min block write time: 1244 micros
Max block write time: 12324 micros
Avg block write time: 1247 micros
As seen, the average time to write is fairly consistent, but sometimes a peak duration of 10x average occurs! According to the writer of the library, this is because the SD card needs some erase cycles in between x amount of write cycles. This causes a write delay (src:post#18&#22). This delay, however, pushes the time required for a cycle out of the available 9615us bracket, because the total measure cycle would be 10672us.
The data I am trying to write, is first put into a string using sprintf:
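// NOTE: six values plus five tabs can exceed 20 bytes; snprintf truncates instead of overflowing buf.
char buf[20] = "";
snprintf(buf, sizeof(buf), "%li\t%li\t%li\t%li\t%li\t%li", rawData[0], rawData[1], rawData[2], rawData[3], rawData[4], rawData[5]);
myLog.println(buf);  // println appends CR+LF to the line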
This writes the data to a txt file. At my sample rate, only 21*104=2184 B/s is needed. Lowering the speed of the RawWrite example to 6 KB/s causes the SD card to write without an extended write delay. Yet my code still shows these delays, even though less data is written.
My question is: how do I prevent this delay from occurring (if possible)? And if not possible, how can I work around it? It would help if I understood why exactly the delay occurs, because the interval is not always the same (every 10-15 writes).
Some additional info:
The sketch currently uses 69% of RAM (2kB) with variables. Creating two 512 byte buffers - like suggested in the same forum - is not possible for me.
Initially, I used two strings. Merging them into one, didn't affect the write speed with any significance.
I don't know how to work around the delay, but I get more stable and faster write times when I write to a binary file instead of a ".csv" or ".txt" file.
The following link provides a fine script for writing data as a binary struct to the SD card. (There are some small typos in the example, but they are easily fixed.)
https://hackingmajenkoblog.wordpress.com/2016/03/25/fast-efficient-data-storage-on-an-arduino/
This will not help with the time variation, but it might reduce the write time enough to make the timing issue negligible.
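To illustrate why a binary record is smaller, the sketch below uses Python's struct module to pack six 2-byte values (the sensor output size given in the question) and compares the result with a tab-separated text line. It only illustrates the record layout; on the Arduino the equivalent would be a C struct written as raw bytes, as in the linked post. The sample values are made up.

```python
import struct

sample = (-123, 456, -789, 1011, -1213, 1415)  # made-up sensor values

# Text form: tab-separated decimal values plus the CR+LF that println appends.
text_record = ("\t".join(str(v) for v in sample) + "\r\n").encode("ascii")

# Binary form: six little-endian signed 16-bit integers, always 12 bytes.
binary_record = struct.pack("<6h", *sample)

sample_rate_hz = 104
binary_bytes_per_s = len(binary_record) * sample_rate_hz
records_per_block = 512 // len(binary_record)   # whole records per 512-byte SD block

print(len(text_record), len(binary_record))
print(binary_bytes_per_s)
print(records_per_block)
```

A fixed 12-byte record also makes the data rate predictable (1248 B/s at 104 Hz), whereas the length of a text line varies with the magnitude of the values.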

Application not running at full speed?

I have the following scenario:
machine 1: receives messages from outside and processes them (via a Java application). For processing it relies on a database (on machine 2)
machine 2: an Oracle DB
As performance metrics I usually look at the value of processed messages per time.
Now, what puzzles me: neither of the 2 machines is working at "full speed". If I look at typical parameters (CPU utilization, CPU load, I/O bandwidth, etc.), both machines look as if they do not have enough to do.
What I expect is that one machine, or one of the performance-related parameters, limits the overall processing speed. Since I cannot observe this, I would expect a higher message processing rate.
Any ideas what might limit the overall performance? What is the bottleneck?
Here are some key values during workload:
Machine 1:
CPU load average: 0.75
CPU Utilization: System 12%, User 13%, Wait 5%
Disk throughput: 1 MB/s (write), almost no reads
average tps (as reported by iostat): 200
network: 500 kB/s in, 300 kB/s out, 1600 packets/s in, 1600 packets/s out
Machine 2:
CPU load average: 0.25
CPU Utilization: System 3%, User 15%, Wait 17%
Disk throughput: 4.5 MB/s (write), 3.5 MB/s (read)
average tps (as reported by iostat): 190 (very short peaks to 1000-1500)
network: 250 kB/s in, 800 kB/s out, 1100 packets/s in, 1100 packets/s out
So for me, all values seem not to be at any limit.
PS: for testing of course the message queue is always full, so that both machines have enough work to do.
To find bottlenecks you typically also need to measure INSIDE the application. That means profiling the Java application code and possibly what happens inside Oracle.
The good news is that you have excluded at least some possible hardware bottlenecks.
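As a minimal illustration of what measuring inside the application can look like, the sketch below accumulates wall-clock time per named section. It is written in Python for brevity (the real application is Java, where a profiler would do the same job), and handle_message, the section names, and the sleep that stands in for a database round trip are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)   # section name -> accumulated seconds

@contextmanager
def timed(section):
    """Add the wall-clock time spent inside the with-block to totals[section]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[section] += time.perf_counter() - start

def handle_message(msg):
    # Hypothetical handler: waiting on the database, then local processing.
    with timed("db_wait"):
        time.sleep(0.002)          # stands in for a database round trip
    with timed("compute"):
        sum(range(1000))           # stands in for CPU work on the message

for i in range(5):
    handle_message(i)

for section, seconds in sorted(totals.items()):
    print(f"{section}: {seconds * 1000:.1f} ms")
```

If most of the time lands in a waiting section, a single processing thread can be limited by round-trip latency even though CPU, disk and network on both machines look idle.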

Using time command for benchmarking

I'm trying to use the time command as a simple solution for benchmarking some scripts that do a lot of text processing and make a number of network calls. To evaluate whether it's a good fit, I tried:
/usr/bin/time -f "\n%E elapsed,\n%U user,\n%S system, \n %P CPU, \n%M max-mem footprint in KB, \n%t avg-mem footprint in KB, \n%K Average total (data+stack+text) memory,\n%F major page faults, \n%I file system inputs by the process, \n%O file system outputs by the process, \n%r socket messages received, \n%s socket messages sent, \n%x status" yum install nmap
and got:
1:35.15 elapsed,
3.17 user,
0.40 system,
3% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
127 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
0 socket messages received,
0 socket messages sent,
0 status
which is not exactly what I was expecting, especially the 0 values. Even when I change the command to, say, ping google.com, the socket messages are 0. What's going on? Is there any alternative?
[And I'm confused if it should stay here or be posted in serverfault]
I think it's not working with Linux; I assume you're using Linux since you said "strace". The manual page says:
Bugs
Not all resources are measured by all versions of Unix,
so some of the values might be reported as zero. The present
selection was mostly inspired by the data provided by 4.2 or
4.3BSD.
I tried "wget" on an OSX system (which is BSD-ish) to check whether it reports socket statistics, and there at least the socket counters work:
0.00 user,
0.01 system,
1% CPU,
0 max-mem footprint in KB,
0 avg-mem footprint in KB,
0 Average total (data+stack+text) memory,
0 major page faults,
0 file system inputs by the process,
0 file system outputs by the process,
151 socket messages received,
8 socket messages sent,
0 status
Hope that helps,
Alex.
Do not use time to benchmark. Some of the fields of the time command are broken, as described in [1]. However, the basic functionality of time (real, user and CPU time) is still intact.
[1] Maximum resident set size does not make sense
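As one possible alternative for the fields that do work, the sketch below wraps a command from Python and reads the kernel's resource usage for child processes through resource.getrusage (Unix only). On Linux, the getrusage(2) manual page lists several fields, including the socket message counters, as unmaintained, so they are left out here. The measure function name and the test command are illustrative.

```python
import resource
import subprocess
import sys
import time

def measure(cmd):
    """Run cmd and return elapsed time plus the rusage deltas of child processes."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    status = subprocess.run(cmd).returncode
    elapsed = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "elapsed_s": elapsed,
        "user_s": after.ru_utime - before.ru_utime,
        "system_s": after.ru_stime - before.ru_stime,
        # High-water mark over all waited-for children (KB on Linux, bytes on macOS).
        "max_rss": after.ru_maxrss,
        "major_faults": after.ru_majflt - before.ru_majflt,
        "status": status,
    }

# Illustrative command: a short Python one-liner run as a child process.
stats = measure([sys.executable, "-c", "sum(range(100000))"])
print(stats)
```

For network-heavy scripts the elapsed time is usually the most informative figure, since user and system time do not include time spent waiting on the network.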
