NFC APDU READ command performance tuning - performance

I am reading several hundred bytes from a DESFire card using APDU commands.
The data application is authenticated, and the response MAC'ed.
I submit a series of READ_DATA commands (0xBD), each retrieving 54 bytes+MAC while increasing read offset for each command.
Will this operation go much quicker if I use a long READ with ADDITIONAL_FRAME (AF) instead of many sequential reads?
I understand that a simple AF is 1 byte vs 8 bytes for a full READ DATA command, thus reducing the number of bytes transferred by ~10%.
But will use of AF give additional performance benefits, for example because of less processing needed by the card?
I am asking this since I am getting only ~220kbit/s effective transfer rate when the theoretical limit is 424kbit/s. See my question on this here

I modified my reads to use ADDITIONAL FRAME.
This reduced the total bytes sent+received from 1628 to 1549 bytes, a 5% reduction.
The time used by tranciecve() was reduced from 602ms to 593ms, a 1.5% reduction.
The conclusion is that use of AF will not give additional performance other that the reduced time for bytes transfered.
The finding also indicates that since the time was reduced by a much lower factor than the data reduction, there must be operations that introduce significant latency that is not dependent on the data trancieved, but ReadFile is not the one.
Authenticate, SelectApplication or ReadRecord may be significantly more expensive than the time used for data transfer.

(Wanted to write a comment, but it got quite long...)
I would use the ADDITIONAL FRAME (AF) way, some reasoning follows:
in addition to the mentioned 7 bytes saved in the command, you save 4 MAC bytes in all of the responses (but the last one, of course)
every time you read 54 bytes, you wastefully MAC 2 zero bytes of padding, which might have been MACed as data (given by DES block size of 8). In the "AF way" there is only a single MAC run covering all the data, so this does not happen here
you are not enforcing the actual frame size. It is up to the reader and the card to select the right frame size (here I am not 100% sure how DESFire handles the ISO 14443-4 chaining and FSD)
some readers can handle the AF situation on their own and (magically?) give you the complete answer (doing the AF somehow in their firmware -- I have seen at least one reader doing this)
If my thoughts are (at least partially) correct, these points make only 9ms difference in your scenario. But under another scenario they might make much more.
Additional notes:
I would exclude the SELECT APPLICATION and AUTHENTICATE from the benchmark and measure them separately. It is up to you, but I would say that they only interfere with the desired "raw" data transfer measurement
I would recommend to benchmark the pure "plain data transfer" mode which is (presumably) the fastest "raw" data transfer possible
Thank you for sharing your results, not many people do so...Good luck!

Related

Performance Counters and IMC Counter Not Matching

I have an Intel(R) Core(TM) i7-4720HQ CPU # 2.60GHz (Haswell) processor. In a relatively idle situation, I ran the following Perf commands and their outputs are shown, below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':
3,713,037 offcore_response.all_data_rd.l3_miss.any_response
2,909,573 mem_load_uops_retired.l3_miss
10.016644133 seconds time elapsed
These two values seem consistent, as the latter excludes prefetch requests and those not targeted at DRAM. But they do not match the read counter in the IMC. This counter is called UNC_IMC_DRAM_DATA_READS and documented here. I read the counter reread it 1 second later. The difference was around 30,000,000 (EDITED). If multiplied by 10 (to estimate for 10 seconds) the resulting value will be around 300 million (EDITED), which is 100 times the value of the above-mentioned performance counters (EDITED). It is nowhere near 3 million! What am I missing?
P.S.: The difference is much smaller (but still large), when the system has more load.
The question is also asked, here:
https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
UPDATE:
Please note that PCM output matches my IMC counter reads.
This is the relevant PCM output:
The values for columns READ, WRITE and IO are calculated based on UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO will be either READ or WRITE. In other words, during the depicted one second interval, almost (because of the inaccuracy reported in the above-mentioned doc) 2.01GB of the 2.42GB READ and WRITE requests belong to IO. Based on this explanation, the above three columns seem consistent with each other.
The problem is that there still exists a LARGE gap between the IMC and PMC values!
The situation is the same when I boot in runlevel 1. The processes on the scheduler are one of swapper, kworker and migration. Disk IO is almost 85KB/s. I'm wondering what leads to such a (relatively) huge amount of IO. Is it possible to detect that (e.g., using a counter or a tool)?
UPDATE 2:
I think that there is something wrong with the IO column. It is always something in the range [1.99,2.01], regardless of the amount of load in the system!
UPDATE 3:
In runlevel 1, the average number of occurrences of the uops_retired.all event in a 1-second interval is 15,000,000. During the same period, the number of read requests recorded by the associated IMC counter is around 30,000,000. In other words, assuming that all memory accesses are directly caused by cpu instructions, for each retired micro-operation, there exists two memory accesses. This seems impossible specially concerning the fact that there exist multiple levels of caches. Therefore, in the idle scenario, perhaps, the read accesses are caused by IO.
Actually, it was mostly caused by the GPU device. This was the reason for exclusion from performance counters. Here is the relevant output for a sample execution of PCM on a relatively idle system with resolution 3840x2160 and refresh rate 60 using xrandr:
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):
As can be seen, changing screen resolution reduced read and IO traffic considerably (more than 100x!).

Data Oriented Design with Mike Acton - Are 'loops per cache line' calculations right?

I've watched Mike Acton's talks about DOD a few times now to better understand it (it is not an easy subject to me). I'm referring to CppCon 2014: Mike Acton "Data-Oriented Design and C++"
and GDC 2015: How to Write Code the Compiler Can Actually Optimize.
But in both talks he presents some calculations that I'm confused with:
This shows that FooUpdateIn takes 12 bytes, but if you stack 32 of them you will get 6 fully packed cache lines. Same goes for FooUpdateOut, it takes 4 bytes and 32 of them gives you 2 fully packed cache lines.
In the UpdateFoos function, you can do ~5.33 loops per each cache line (assuming that count is indeed 32), then he proceeds by assuming that all the math done takes about 40 cycles which means that each cache line would take about 213.33 cycles.
Now here's where I'm confused, isn't he forgetting about reads and writes? Even though he has 2 fully packed data structures they are in different memory spaces.
In my head this is what's happening:
Read in[0].m_Velocity[0] (which would take about 200 cycles based on his previous slides)
Since in[0].m_Velocity[1] and in[0].m_Foo are in the same cache line as in[0].m_Velocity[0] their access is free
Do all the calculation
Write the result to out[0].m_Foo - Here is what I don't know what happens, I assume that it would discard the previous cache line (fetched in 1.) and load the new one to write the result
Read in[1].m_Velocity[0] which would discard again another cache line (fetched in 4.) (which would take again about 200 cycles)
...
So jumping from in and out the calculations goes from ~5.33 loops/cache line to 0.5 loops/cache line which would do 20 cycles per cache line.
Could someone explain why wasn't he concerned about reads/writes? Or what is wrong in my thinking?
Thank you.
If we assume L1 cache is 64KB and one cache line is 64 bytes then there are total 1000 cache lines. So, in step 4 write to the result out[0].m_Foo will not discard the data cache in step 2 as they both are in different memory locations. This is the reason why he is using separate structure for updating out m_Foo instead directly mutating it in inplace like in his first implementation. He is just talking till point of calculation value. Updating value/writing value will have same cost as in his first implementation. Also, processor can optimize loops quite well as it can do multiple calculations in parallel(not sequential as result of first loop and second loop are not dependent). I hope this helps

Writing a full cache line at an uncached address before reading it again on x64

On x64 if you first write within a short period of time the contents of a full cache line at a previously uncached address, and then soon after read from that address again can the CPU avoid having to read the old contents of that address from memory?
As effectively it shouldn't matter what the contents of the memory was previously because the full cache line worth of data was fully overwritten? I can understand that if it was a partial cache line write of an uncached address, followed by a read then it would incur the overhead of having to synchronise with main memory etc.
Looking at documentation regards write allocate, write combining and snooping has left me a little confused about this matter. Currently I think that an x64 CPU cannot do this?
In general, the subsequent read should be fast - as long as store-to-load forwarding is able to work. In fact, it has nothing to do with writing an entire cache line at all: it should also work (with the same caveat) even for smaller writes!
Basically what happens on normally (i.e., WB memory regions) mapped memory is that the store(s) will add several entries to the store buffer of the CPU. Since the associated memory isn't currently cached, these entries are going to linger for some time, since an RFO request will occur to pull that line into cache so that it can be written.
In the meantime, you issue some loads that target the same memory just written, and these will usually be satisfied by store-to-load forwarding, which pretty much just notices that a store is already in the store buffer for the same address and uses it as the result of the load, without needing to go to memory.
Now, store forwarding doesn't always work. In particular, it never works on any Intel (or likely, AMD) CPU when the load only partially overlaps the most recent involved store. That is, if you write 4 bytes to address 10, and then read 4 bytes from addresss 9, only 3 bytes come from that write, and the byte at 9 has to come from somewhere else. In that case, all Intel CPUs simply wait for all the involved stores to be written and then resolve the load.
In the past, there were many other cases that would also fail, for example, if you issued a smaller read that was fully contained in an earlier store, it would often fail. For example, given a 4-byte write to address 10, a 2-byte read from address 12 is fully contained in the earlier write - but often would not forward as the hardware was not sophisticated enough to detect that case.
The recent trend, however, is that all the cases other than the "not fully contained read" case mentioned above successfully forward on modern CPUs. The gory details are well-covered, with pretty pictures, on stuffedcow and Agner also covers it well in his microarchitecture guide.
From the above linked document, here's what Agner says about store-forwarding on Skylake:
The Skylake processor can forward a memory write to a subsequent read
from the same address under certain conditions. Store forwarding is
one clock cycle faster than on previous processors. A memory write
followed by a read from the same address takes 4 clock cycles in the
best case for operands of 32 or 64 bits, and 5 clock cycles for other
operand sizes.
Store forwarding has a penalty of up to 3 clock cycles extra when an
operand of 128 or 256 bits is misaligned.
A store forwarding usually takes 4 - 5 clock cycles extra when an
operand of any size crosses a cache line boundary, i.e. an address
divisible by 64 bytes.
A write followed by a smaller read from the same address has little or
no penalty.
A write of 64 bits or less followed by a smaller read has a penalty of
1 - 3 clocks when the read is offset but fully contained in the
address range covered by the write.
An aligned write of 128 or 256 bits followed by a read of one or both
of the two halves or the four quarters, etc., has little or no
penalty. A partial read that does not fit into the halves or quarters
can take 11 clock cycles extra.
A read that is bigger than the write, or a read that covers both
written and unwritten bytes, takes approximately 11 clock cycles
extra.
The last case, where the read is bigger than the write is definitely a case where the store forwarding stalls. The quote of 11 cycles probably applies to the case that all of the involved bytes are in L1 - but the case that some bytes aren't cached at all (your scenario) it could of course take on the order of a DRAM miss, which can be hundreds of cycles.
Finally, note that none of the above has to do with writing an entire cache line - it works just as well if you write 1 byte and then read that same byte, leaving the other 63 bytes in the cache line untouched.
There is an effect similar to what you mention with full cache lines, but it deals with write combining writes, which are available either by marking memory as write-combining (rather than the usual write-back) or using the non-temporal store instructions. The NT instructions are mostly targeted towards writing memory that won't soon be subsequently read, skipping the RFO overhead, and probably don't forward to subsequent loads.

How to prevent SD card from creating write delays during logging?

I've been working on an Arduino (ATMega328p) prototype that has to log data during certain events. An LSM6DS33 sensor is used to generate 6 values (2 bytes each) at a sample rate of 104 Hz. This data needs to be logged for a period of 500-20000ms.
In my code, I generate an interrupt every 1/104 sec using Timer1. When this interrupt occurs, data is read from the sensor, calibrated and then written to an SD card. Normally, this is not an issue. Reading the data from the sensor takes ~3350us, calibrating ~5us and writing ~550us. This means a total cycle takes ~4000us, whereas 9615us is available.
In order to save power, I wish to lower the voltage to 3.3V. According to the atmel datasheet, this also means that the clock frequency should be lowered to 8MHz. Assuming everything will go twice as slow, a measurement cycle would still be possible because ~8000us < 9615us.
After some testing (still 5V#16MHz), however, it occured to me that every now and then, a write cycle would take ~1880us instead of ~550us. I am using the library SdFat to write and test SD cards (RawWrite example). The following results came in when I tested the card:
Start raw write of 100000 KB
Target rate: 100 KB/sec
Target time: 100 seconds
Min block write time: 1244 micros
Max block write time: 12324 micros
Avg block write time: 1247 micros
As seen, the average time to write is fairly consistent, but sometimes a peak duration of 10x average occurs! According to the writer of the library, this is because the SD card needs some erase cycles in between x amount of write cycles. This causes a write delay (src:post#18&#22). This delay, however, pushes the time required for a cycle out of the available 9615us bracket, because the total measure cycle would be 10672us.
The data I am trying to write, is first put into a string using sprintf:
char buf[20] = "";
sprintf(buf,"%li\t%li\t%li\t%li\t%li\t%li",rawData[0],rawData[1],rawData[2],rawData[3],rawData[4],rawData[5]);
myLog.println(buf);
This writes the data to a txt file. But at my speed rate, only 21*104=2184 B/s would suffice. Lowering the speed of the RawWrite example to 6 KB/s, causes the SD card to write without getting an extended write delay. Yet my code still has them, even though less data is written.
My question is: how do I prevent this delay from occurring (if possible)? And if not possible, how can I work around it? It would help if I understood why exactly the delay occurs, because the interval is not always the same (every 10-15 writes).
Some additional info:
The sketch currently uses 69% of RAM (2kB) with variables. Creating two 512 byte buffers - like suggested in the same forum - is not possible for me.
Initially, I used two strings. Merging them into one, didn't affect the write speed with any significance.
I don't know how to work around the delay, but I experience a more stable and faster writing time, if I wrote to a binary file instead of a ".csv" or .txt" file.
The following link provide a fine script to write data as a binary struct to the SD card. (There are some small typo in his example, it is easily fixed)
https://hackingmajenkoblog.wordpress.com/2016/03/25/fast-efficient-data-storage-on-an-arduino/
This will not help you with the time variation, but it might minimize the writing time, and thus negleting the time issue.

What's the reason behind ZigZag encoding in Protocol Buffers and Avro?

ZigZag requires a lot of overhead to write/read numbers. Actually I was stunned to see that it doesn't just write int/long values as they are, but does a lot of additional scrambling. There's even a loop involved:
https://github.com/mardambey/mypipe/blob/master/avro/lang/java/avro/src/main/java/org/apache/avro/io/DirectBinaryEncoder.java#L90
I don't seem to be able to find in Protocol Buffers docs or in Avro docs, or reason myself, what's the advantage of scrambling numbers like that? Why is it better to have positive and negative numbers alternated after encoding?
Why they're not just written in little-endian, big-endian, network order which would only require reading them into memory and possibly reverse bit endianness? What do we buy paying with performance?
It is a variable length 7-bit encoding. The first byte of the encoded value has it high bit set to 0, subsequent bytes have it at 1. Which is the way the decoder can tell how many bytes were used to encode the value. Byte order is always little-endian, regardless of the machine architecture.
It is an encoding trick that permits writing as few bytes as needed to encode the value. So an 8 byte long with a value between -64 and 63 takes only one byte. Which is common, the range provided by long is very rarely used in practice.
Packing the data tightly without the overhead of a gzip-style compression method was the design goal. Also used in the .NET Framework. The processor overhead needed to en/decode the value is inconsequential. Already much lower than a compression scheme, it is a very small fraction of the I/O cost.

Resources