How to prevent SD card from creating write delays during logging?

I've been working on an Arduino (ATmega328P) prototype that has to log data during certain events. An LSM6DS33 sensor generates 6 values (2 bytes each) at a sample rate of 104 Hz, and this data needs to be logged for a period of 500-20000ms.
In my code, I generate an interrupt every 1/104 sec using Timer1. When this interrupt occurs, data is read from the sensor, calibrated and then written to an SD card. Normally, this is not an issue. Reading the data from the sensor takes ~3350us, calibrating ~5us and writing ~550us. This means a total cycle takes ~4000us, whereas 9615us is available.
In order to save power, I wish to lower the supply voltage to 3.3V. According to the Atmel datasheet, this also means that the clock frequency should be lowered to 8MHz. Assuming everything then runs twice as slowly, a measurement cycle would still be possible, because ~8000us < 9615us.
After some testing (still at 5V/16MHz), however, I noticed that every now and then a write cycle would take ~1880us instead of ~550us. I am using the SdFat library to write to and benchmark SD cards (RawWrite example). Testing the card gave the following results:
Start raw write of 100000 KB
Target rate: 100 KB/sec
Target time: 100 seconds
Min block write time: 1244 micros
Max block write time: 12324 micros
Avg block write time: 1247 micros
As can be seen, the average write time is fairly consistent, but occasionally a peak of about 10x the average occurs! According to the author of the library, this is because the SD card needs to run erase cycles after a certain number of write cycles, which causes an occasional write delay (source: forum posts #18 and #22). This delay, however, pushes the time required for a cycle outside the available 9615us budget, because the total measurement cycle would then be 10672us.
The data I am trying to write is first put into a string using sprintf:
char buf[80] = "";  // six tab-separated long values can need up to ~72 characters including the terminator
sprintf(buf,"%li\t%li\t%li\t%li\t%li\t%li",rawData[0],rawData[1],rawData[2],rawData[3],rawData[4],rawData[5]);
myLog.println(buf);
This writes the data to a txt file. At my sample rate, only 21*104 = 2184 B/s would be needed. Lowering the speed of the RawWrite example to 6 KB/s makes the SD card write without the extended write delay. Yet my code still suffers from it, even though it writes less data.
My question is: how do I prevent this delay from occurring (if that is possible)? And if it is not possible, how can I work around it? It would also help to understand why exactly the delay occurs, because the interval is not always the same (every 10-15 writes).
Some additional info:
The sketch currently uses 69% of RAM (2kB) for variables. Creating two 512-byte buffers - as suggested in the same forum thread - is not possible for me.
Initially, I used two strings. Merging them into one didn't significantly affect the write time.
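For reference, a minimal sketch of the timing setup described above - assuming the ISR only sets a flag and the sensor read, calibration and SD write happen in loop(); readSensor(), calibrate() and writeToSd() are placeholders, not my actual code:
// Placeholders for the real sensor read, calibration and SdFat write.
int16_t rawData[6];
void readSensor(int16_t *d)      { /* read the LSM6DS33 registers, ~3350 us */ }
void calibrate(int16_t *d)       { /* apply calibration, ~5 us */ }
void writeToSd(const int16_t *d) { /* format and write to the SD card, ~550 us */ }

volatile bool sampleDue = false;

ISR(TIMER1_COMPA_vect) {
  sampleDue = true;                  // keep the ISR short; do the work in loop()
}

void setup() {
  // Timer1 in CTC mode, prescaler 8: (16 MHz / 8) / 104 Hz ≈ 19231 ticks ≈ 9615 us
  noInterrupts();
  TCCR1A  = 0;
  TCCR1B  = (1 << WGM12) | (1 << CS11);
  OCR1A   = 19230;                   // 19231 ticks - 1
  TIMSK1 |= (1 << OCIE1A);
  interrupts();
}

void loop() {
  if (sampleDue) {
    sampleDue = false;
    readSensor(rawData);
    calibrate(rawData);
    writeToSd(rawData);
  }
}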

I don't know how to work around the delay, but I experienced more stable and faster write times when I wrote to a binary file instead of a ".csv" or ".txt" file.
The following link provides a fine script for writing data as a binary struct to the SD card. (There are a few small typos in his example, but they are easily fixed.)
https://hackingmajenkoblog.wordpress.com/2016/03/25/fast-efficient-data-storage-on-an-arduino/
This will not help you with the time variation, but it might minimize the write time and thus mitigate the timing issue.
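As a rough illustration of the idea (a sketch only, not the code from the linked post - the struct layout, file name and chip-select pin are assumptions), each sample can be packed into a fixed-size struct and written with SdFat's write():
#include <SPI.h>
#include <SdFat.h>
#include <string.h>

SdFat sd;
SdFile logFile;

// One sample: a timestamp plus the six raw 16-bit values.
struct LogRecord {
  uint32_t timestampMs;
  int16_t  raw[6];
};                                   // 16 bytes per sample instead of ~21 bytes of text

void setup() {
  sd.begin(10);                      // chip-select pin is an assumption
  logFile.open("data.bin", O_WRITE | O_CREAT | O_AT_END);
}

void logSample(const int16_t rawData[6]) {
  LogRecord rec;
  rec.timestampMs = millis();
  memcpy(rec.raw, rawData, sizeof(rec.raw));
  logFile.write(&rec, sizeof(rec));  // raw bytes, no sprintf formatting cost
}

void loop() {}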

Related

Is it possible to do a perf stat with an interval that is not a time?

I would like to know whether it is possible to easily modify Linux perf's stat module to use an interval of cycles (or of instructions per cycle) instead of an interval of time. The goal is to improve the precision of the counters collected per interval; the time unit is not accurate enough.
A friend suggested this idea, and after looking at the source code a little, I understood (perhaps wrongly) that perf:
creates a condition on a time computed with rdtsc (or clock_gettime)
creates a "wait" in the perf process
launches the program under test
checks whether the time condition is met: either it continues, or it breaks out of the wait, saves the memory-mapped counter information in perf, and calls the wait function again if the run is not over
I would like a result like this:
cycles    counts    event
10000     25000     instructions
10000     450       branch-misses
20000     21000     instructions
20000     850       branch-misses
Unfortunately, I see a big problem: I would be using the value of a counter as a condition before I have even collected it. Or should I continuously read the counter(s) that define my "interval condition"? I also read that, for a time interval, counters should not be collected more often than every 100ms because of the overhead this generates. If I read some counters every 10,000 cycles, would I have the same problem? I don't know where this overhead comes from (system calls?).
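(For reference, one userspace approximation of cycle-based intervals, without modifying perf itself, is to open a counter group with perf_event_open - cycles as the leader, instructions and branch-misses as members - and poll it, printing a row each time the cycles value crosses the next threshold. This is only a sketch, measuring the calling process itself for brevity; because it polls, the boundaries are approximate rather than exact 10,000-cycle steps.)
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

static int open_counter(uint32_t type, uint64_t config, int group_fd) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = (group_fd == -1);        // only the group leader starts disabled
    attr.exclude_kernel = 1;
    attr.read_format = PERF_FORMAT_GROUP;    // one read() returns the whole group
    return static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0));
}

int main() {
    int cycles = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1);
    int instrs = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, cycles);
    int brmiss = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_MISSES, cycles);
    if (cycles < 0 || instrs < 0 || brmiss < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(cycles, PERF_EVENT_IOC_RESET,  PERF_IOC_FLAG_GROUP);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    const uint64_t step = 10000;             // desired interval, in cycles
    uint64_t next = step;

    std::puts("cycles\tinstructions\tbranch-misses");
    for (long i = 0; i < 1000000; ++i) {     // the workload being measured
        volatile long x = i * i; (void)x;

        uint64_t buf[4];                     // { nr, cycles, instructions, branch-misses }
        if (read(cycles, buf, sizeof(buf)) != sizeof(buf)) break;
        if (buf[1] >= next) {                // crossed the next cycle boundary
            std::printf("%llu\t%llu\t%llu\n",
                        (unsigned long long)buf[1],
                        (unsigned long long)buf[2],
                        (unsigned long long)buf[3]);
            next = (buf[1] / step + 1) * step;
        }
    }
    return 0;
}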

Performance Counters and IMC Counter Not Matching

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. In a relatively idle situation, I ran the following perf command; its output is shown below. The counters are offcore_response.all_data_rd.l3_miss.any_response and mem_load_uops_retired.l3_miss:
sudo perf stat -a -e offcore_response.all_data_rd.l3_miss.any_response,mem_load_uops_retired.l3_miss sleep 10
Performance counter stats for 'system wide':
3,713,037 offcore_response.all_data_rd.l3_miss.any_response
2,909,573 mem_load_uops_retired.l3_miss
10.016644133 seconds time elapsed
These two values seem consistent, as the latter excludes prefetch requests and requests not targeted at DRAM. But they do not match the read counter in the IMC. That counter is called UNC_IMC_DRAM_DATA_READS and is documented here. I read the counter and reread it 1 second later; the difference was around 30,000,000. Multiplied by 10 (to estimate 10 seconds), this gives around 300 million, which is about 100 times the value of the above-mentioned performance counters. It is nowhere near 3 million! What am I missing?
P.S.: The difference is much smaller (but still large) when the system is under more load.
The question is also asked here:
https://community.intel.com/t5/Software-Tuning-Performance/Performance-Counters-and-IMC-Counter-Not-Matching/m-p/1288832
UPDATE:
Please note that the PCM output matches my IMC counter reads.
In the relevant PCM output (not reproduced here), the values of the READ, WRITE and IO columns are calculated from UNC_IMC_DRAM_DATA_READS, UNC_IMC_DRAM_DATA_WRITES and UNC_IMC_DRAM_IO_REQUESTS, respectively. It seems that requests classified as IO are also counted as either READ or WRITE. In other words, during the depicted one-second interval, almost (because of the inaccuracy reported in the above-mentioned doc) 2.01GB of the 2.42GB of READ and WRITE requests belong to IO. Based on this explanation, the three columns seem consistent with each other.
The problem is that there still exists a LARGE gap between the IMC and PMC values!
The situation is the same when I boot into runlevel 1. The only processes on the scheduler are swapper, kworker and migration, and disk IO is only about 85KB/s. I'm wondering what causes such a (relatively) huge amount of IO. Is it possible to detect its source (e.g., using a counter or a tool)?
UPDATE 2:
I think that there is something wrong with the IO column. It is always somewhere in the range [1.99, 2.01] GB, regardless of the load on the system!
UPDATE 3:
In runlevel 1, the average number of occurrences of the uops_retired.all event in a 1-second interval is 15,000,000. During the same period, the number of read requests recorded by the associated IMC counter is around 30,000,000. In other words, assuming that all memory accesses are directly caused by CPU instructions, there would be two memory accesses per retired micro-operation. This seems impossible, especially given that there are multiple levels of caches. Therefore, in the idle scenario, the read accesses are perhaps caused by IO.
Actually, it was mostly caused by the GPU, which is also why it is not captured by the core performance counters. Here is the relevant output of a sample PCM run on a relatively idle system with a 3840x2160 resolution and a 60 Hz refresh rate (set using xrandr):
And this is for the situation with resolution 800x600 and the same refresh rate (i.e., 60):
As can be seen, changing the screen resolution reduced the read and IO traffic considerably (by more than 100x!).
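For what it's worth, the ~2 GB/s figure is roughly what scanout of a 4K framebuffer would read from memory, assuming 4 bytes per pixel, one full framebuffer read per refresh and no framebuffer compression: 3840 x 2160 pixels x 4 B x 60 Hz ≈ 1.99 GB/s, i.e. about 30 million 64-byte cache lines per second - the same order as the IMC read counts observed above.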

How to measure CPU/GPU data transfer overhead in Metal

We have been porting some of our CPU pipeline to Metal to speed up some of the slowest parts, with success. However, since only parts of it are ported, we are transferring data back and forth to the GPU, and I want to know how much time this actually takes.
Using the frame capture in Xcode, I can see that the kernels take around 5-20 ms each, for a total of 149.5 ms (all encoded in the same command buffer).
Using Instruments I see some quite different numbers:
The entire operation takes 1.62 seconds (Points - Code 1).
MTLTexture replaceRegion takes up the first 180 ms, followed by the CPU being stalled for the next 660 ms at MTLCommandBuffer waitUntilCompleted (highlighted area); the last 800 ms is then spent in MTLTexture getBytes, which maxes out that CPU thread.
Using the Metal instruments I get a few more measurements: 46 ms for "Compute Command 0", 460 ms for "Command Buffer 0", and 210 ms for "Page Off". But I don't see how any of this relates to the workload.
The closest thing to an explanation of "Page off" I could find is this:
Texture Page Off Data (Non-AGP)
The number of bytes transferred for texture page-off operations. Under most conditions, textures are not paged off but are simply thrown away since a backup exists in system memory. Texture page-off traffic usually happens when VRAM pressure forces a page-off of a texture that only has valid data in VRAM, such as a texture created using the function glCopyTexImage, or modified using the functions glCopyTexSubImage or glTexSubImage.
Source: Xcode 6 - OpenGL Driver Monitor Parameters
This makes me think it could be the part that copies the memory off the GPU, but then there would be no reason for getBytes to take that long. I also can't see where the 149.5 ms from Xcode fits into the data from Instruments.
Questions
When exactly does it transfer the data? If this cannot be inferred from the measurements I did, how do I acquire those measurements?
Does the GPU code actually only take 149.5 ms to execute, or is Xcode lying to me? If not, where is the remaining 660 - 149.5 ms being spent?

NFC APDU READ command performance tuning

I am reading several hundred bytes from a DESFire card using APDU commands.
The application holding the data is authenticated, and the responses are MAC'ed.
I submit a series of READ_DATA commands (0xBD), each retrieving 54 bytes + MAC, increasing the read offset for each command.
Will this operation go much quicker if I use a long READ with ADDITIONAL_FRAME (AF) instead of many sequential reads?
I understand that a simple AF command is 1 byte vs 8 bytes for a full READ_DATA command, thus reducing the number of bytes transferred by ~10%.
But will use of AF give additional performance benefits, for example because of less processing needed by the card?
I am asking because I am only getting an effective transfer rate of ~220 kbit/s, while the theoretical limit is 424 kbit/s. See my question on this here.
I modified my reads to use ADDITIONAL FRAME.
This reduced the total bytes sent+received from 1628 to 1549 bytes, a 5% reduction.
The time used by transceive() was reduced from 602ms to 593ms, a 1.5% reduction.
The conclusion is that use of AF does not give any additional performance benefit beyond the reduced time for the bytes transferred.
The finding also indicates that, since the time was reduced by a much smaller factor than the data, there must be operations that introduce significant latency independent of the amount of data transceived - and ReadFile is not the one.
Authenticate, SelectApplication or ReadRecord may be significantly more expensive than the data transfer itself.
(Wanted to write a comment, but it got quite long...)
I would use the ADDITIONAL FRAME (AF) way, some reasoning follows:
in addition to the mentioned 7 bytes saved in each command, you save 4 MAC bytes in every response (except the last one, of course)
every time you read 54 bytes, you wastefully MAC 2 zero bytes of padding which could have been MACed as data (given the DES block size of 8). In the "AF way" there is only a single MAC run covering all the data, so this does not happen
you are not enforcing the actual frame size; it is up to the reader and the card to select the right frame size (here I am not 100% sure how DESFire handles ISO 14443-4 chaining and FSD)
some readers can handle the AF situation on their own and (magically?) give you the complete answer (doing the AF somehow in their firmware -- I have seen at least one reader doing this)
If my thoughts are (at least partially) correct, these points make only a 9 ms difference in your scenario. But in another scenario they might matter much more.
Additional notes:
I would exclude SELECT APPLICATION and AUTHENTICATE from the benchmark and measure them separately. It is up to you, but I would say that they only interfere with the desired "raw" data transfer measurement
I would also recommend benchmarking the pure "plain data transfer" mode, which is (presumably) the fastest "raw" data transfer possible
Thank you for sharing your results, not many people do so...Good luck!
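To make the two flows concrete, here is a rough sketch of the ISO 7816-wrapped exchanges (a sketch only: transceive() stands in for whatever your reader API provides, e.g. IsoDep.transceive on Android, and error/MAC handling is omitted):
#include <cstdint>
#include <vector>

// Stand-in for the reader API: sends one command APDU and returns the full
// response, including the trailing status bytes 0x91 0xXX.
std::vector<uint8_t> transceive(const std::vector<uint8_t>& cmd);

// ReadData (0xBD) wrapped as an ISO 7816-4 APDU: CLA=0x90, INS=native command,
// P1=P2=0x00, Lc=0x07 (file no + 3-byte offset + 3-byte length, LSB first), Le=0x00.
std::vector<uint8_t> readDataApdu(uint8_t fileNo, uint32_t offset, uint32_t length) {
    return { 0x90, 0xBD, 0x00, 0x00, 0x07,
             fileNo,
             uint8_t(offset), uint8_t(offset >> 8), uint8_t(offset >> 16),
             uint8_t(length), uint8_t(length >> 8), uint8_t(length >> 16),
             0x00 };
}

// Variant A (many short reads): one full ReadData APDU per 54-byte chunk,
// advancing the offset each time.
//
// Variant B (additional frames): one ReadData for the whole length, then keep
// sending the 1-byte AdditionalFrame command (0xAF) while the card answers 0x91 0xAF.
std::vector<uint8_t> readWholeFile(uint8_t fileNo, uint32_t length) {
    std::vector<uint8_t> data;
    std::vector<uint8_t> rsp = transceive(readDataApdu(fileNo, 0, length));
    for (;;) {
        data.insert(data.end(), rsp.begin(), rsp.end() - 2);  // drop the 2 status bytes
        if (rsp.back() != 0xAF) break;                        // 0x00 = no more frames
        rsp = transceive({ 0x90, 0xAF, 0x00, 0x00, 0x00 });   // request the next frame
    }
    return data;
}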

Random write Vs Seek time

I have a very weird question here...
I am trying to write data randomly to a 100 MB file.
The block size is 4KB and the random offsets are page-aligned (4KB).
In total, I am writing 1 GB of data at random offsets within the 100 MB file.
If I remove the actual code that writes the data to the disk, the entire operation takes less than a second (say 0.04 sec).
If I keep the code that writes the data, it takes several seconds.
For random writes, what happens internally? Is the cost the seek time or the write time? The scenario above really confuses me!
Can anybody explain this in depth, please?
The same procedure with sequential offsets writes very fast.
Thank you ......
If you're writing all over the file, then the disk (I presume this is on a spinning disk) needs to seek to a new place for every write.
Also, the write speed of hard disks isn't particularly stunning.
Say, for the sake of example (numbers taken from a WD Raptor EL150), that we have a 5.9 ms seek time. If you are writing 1 GB randomly all over the file in 4 KB chunks, that is 1,000,000,000 B ÷ 4,096 B ≈ 244,000 seeks × 0.0059 s ≈ 1,440 seconds of seek time alone!
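A minimal sketch of the experiment described in the question (POSIX pwrite of 4 KB blocks at random 4 KB-aligned offsets within a 100 MB file; the file name, iteration count and final fsync are assumptions):
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <chrono>

int main() {
    const size_t blockSize = 4096;
    const off_t  fileSize  = 100L * 1024 * 1024;                      // 100 MB target file
    const long   numWrites = (1L * 1024 * 1024 * 1024) / blockSize;   // 1 GB written in total

    char block[blockSize];
    for (size_t i = 0; i < blockSize; ++i) block[i] = char(i);

    int fd = open("test.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < numWrites; ++i) {
        // Random page-aligned offset within the 100 MB file.
        off_t offset = (rand() % (fileSize / blockSize)) * blockSize;
        if (pwrite(fd, block, blockSize, offset) != (ssize_t)blockSize) {
            perror("pwrite"); return 1;
        }
    }
    fsync(fd);   // force the data out of the page cache so the disk time becomes visible
    auto end = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(end - start).count();
    std::printf("wrote %ld blocks in %.2f s\n", numWrites, secs);
    close(fd);
    return 0;
}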
