mmv_stats_interval_start usage example in C

I am using the MMV libraries provided by the Performance Co-Pilot (PCP) framework. I need to reset the value in an MMV instance at regular intervals. I came across the mmv_stats_interval_start() API of PCP. Please explain how this API works and how to get a notification when the interval has elapsed.
pmAtomValue *
mmv_stats_interval_start(void *addr, pmAtomValue *value,
                         const char *metric, const char *instance)
{
    if (addr) {
        if (value == NULL)
            value = mmv_lookup_value_desc(addr, metric, instance);
        if (value) {
            struct timeval tv;
            __pmtimevalNow(&tv);
            /* store the start time as a negative microsecond count */
            mmv_inc_value(addr, value, -(tv.tv_sec*1e6 + tv.tv_usec));
        }
    }
    return value;
}
An example on the PCP GitHub page is mmv_genstats.c.

The mmv_stats_interval_start function pairs with mmv_stats_interval_end to accumulate the elapsed time (in microseconds) between matching pairs of those calls. See for example http://git.pcp.io/cgi-bin/gitweb.cgi?p=pcp.git;a=blob;f=qa/src/mmv_genstats.c;hb=HEAD
MMV is not about notifications; it is only about efficiently exposing internal statistics of the program that links against -lpcp_mmv. Notifications would have to come from another PCP client such as pmie.
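A minimal sketch of how the pair is typically used (the metric name "work_time" is illustrative, and the map handle is assumed to come from an earlier mmv_stats_init() registration as done in mmv_genstats.c; see that example for the full metric setup):

#include <pcp/pmapi.h>
#include <pcp/mmv_stats.h>

void timed_work(void *map)    /* map obtained earlier from mmv_stats_init() */
{
    pmAtomValue *value;

    /* records the start timestamp into the metric (as a negative value) */
    value = mmv_stats_interval_start(map, NULL, "work_time", NULL);

    /* ... the work being timed ... */

    /* adds the end timestamp; the metric now accumulates the elapsed microseconds */
    mmv_stats_interval_end(map, value);
}

Any periodic action (such as resetting a value every N seconds) has to be scheduled by your own code or by another PCP tool; the library itself only stores the values.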


Send a struct array across a FreeRTOS queue

I am starting with the ESP32 and FreeRTOS, and I am having problems sending a struct array across a queue. I have already sent other kinds of variables, but never an array of structs, and I am getting an exception.
The sender and the receiver are in different source files, and I am starting to think that maybe that is the problem (or at least part of the problem).
My simplified code looks like this:
common.h
struct dailyWeather {
    // Day of the week starting in Monday (1)
    int dayOfWeek;
    // Min and Max daily temperature
    float minTemperature;
    float maxTemperature;
    int weather;
};
file1.h
#pragma once
#ifndef _FILE1_
#define _FILE1_
// Queue
extern QueueHandle_t weatherQueue;
#endif
file1.cpp
#include "common.h"
#include "file1.h"
// Queue
QueueHandle_t weatherQueue = xQueueCreate( 2, sizeof(dailyWeather *) ); // also tried "dailyWeather" without pointer and "Struct dailyWeather"
void task1(void *pvParameters) {
for (;;) {
dailyWeather weatherDATA[8] = {};
// Code to fill the array of structs with data
if (xQueueSend( weatherQueue, &weatherDATA, ( TickType_t ) 0 ) == pdTRUE) {
// The message was sent sucessfully
}
}
}
file2.cpp
#include "common.h"
#include "file1.h"
void task2(void *pvParameters) {
for (;;) {
dailyWeather *weatherDATA_P; // Also tried without pointer and as an array of Structs
if( xQueueReceive(weatherQueue, &( weatherDATA_P ), ( TickType_t ) 0 ) ) {
Serial.println("Received");
dailyWeather weatherDATA = *weatherDATA_P;
Serial.println(weatherDATA.dayOfWeek);
}
}
}
When I run this code on my ESP32 it works until I try to print the data with Serial.println. The "Received" message is printed, but it crashes on the next Serial.println with this error:
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
I am stuck with this problem and not able to find a way to fix it, so any help will be very appreciated.
EDIT:
I am thinking that maybe a solution would be to add an order field to the struct, make the queue bigger (in number of items), and send all the structs to the queue one at a time, then use that order field in the reader to reassemble the array.
Anyway, it would be nice to learn what I am doing wrong with the above code.
FreeRTOS queues operate by copying: using the item size you specify during initialization, when you call xQueueCreate(), they make copies of the data you want to send and receive.
When you call xQueueSend(), which is equivalent to xQueueSendToBack(), it copies the item into that buffer.
If another task is waiting on the queue in a call to xQueueReceive(), then at the moment it becomes ready to run, xQueueReceive() copies the item at the front of the queue's buffer into the destination buffer you specify in your call to xQueueReceive().
If the data you want to send is of pointer/array type, dailyWeather * in your case, then you need to make sure the memory pointed to does not go out of scope before being read by the task that receives the pointer via xQueueReceive(). Local variables are created on the calling task's stack and will certainly go out of scope, and very likely be overwritten, after the function returns.
IMO, the best solution if you really need to pass pointers is to allocate the structure array in the function that generates the data and deallocate it in the task that consumes the data. For example:
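A minimal sketch of that idea, assuming the dailyWeather struct from the question and a weatherQueue created with sizeof(dailyWeather *); error handling omitted:

#include "FreeRTOS.h"
#include "queue.h"
#include "common.h"                   // dailyWeather

extern QueueHandle_t weatherQueue;    // created with sizeof(dailyWeather *)

void senderStep(void) {
    // Sender: allocate from the FreeRTOS heap, fill, and queue the pointer value.
    dailyWeather *weatherDATA = (dailyWeather *) pvPortMalloc(8 * sizeof(dailyWeather));
    // ... fill weatherDATA[0..7] with data ...
    xQueueSend(weatherQueue, &weatherDATA, portMAX_DELAY);
}

void receiverStep(void) {
    // Receiver: the pointer stays valid until the consuming task frees it.
    dailyWeather *received;
    if (xQueueReceive(weatherQueue, &received, portMAX_DELAY) == pdTRUE) {
        // ... use received[0..7] ...
        vPortFree(received);          // the receiving task owns the buffer and frees it
    }
}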
For many scenarios it is highly desirable not to abuse dynamic memory handling, so in several communications stacks you will find the use of buffer pools, which in the end are also queues that are initialized during application startup. Operation is approximately as follows (a minimal sketch follows the list):
Initialization:
Initialize the buffer pool queues (simple queues of pointers).
Fill the buffer pools with dynamically allocated buffers of appropriated sizes.
Initialize the queues for inter-task communications.
Task that provides the data:
Get (Receive) a buffer pointer from one buffer pool.
Fill the buffer with data.
Send the buffer pointer to the communications queue.
Task that receives the data:
Get (Receive) the data buffer pointer from the communications queue.
Use the data.
Return (Send) the buffer pointer to the buffer pool.
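A minimal sketch of that pattern with FreeRTOS primitives (the pool size and the buffer size of 8 structs are illustrative; error handling omitted):

#include "FreeRTOS.h"
#include "queue.h"
#include "common.h"                    // dailyWeather

#define POOL_LEN 4

static QueueHandle_t bufferPool;       // holds pointers to free buffers
static QueueHandle_t weatherQueue;     // carries pointers to filled buffers

void poolInit(void) {
    bufferPool   = xQueueCreate(POOL_LEN, sizeof(dailyWeather *));
    weatherQueue = xQueueCreate(POOL_LEN, sizeof(dailyWeather *));
    for (int i = 0; i < POOL_LEN; i++) {
        dailyWeather *buf = (dailyWeather *) pvPortMalloc(8 * sizeof(dailyWeather));
        xQueueSend(bufferPool, &buf, 0);               // the pool starts out full
    }
}

void producerStep(void) {
    dailyWeather *buf;
    if (xQueueReceive(bufferPool, &buf, portMAX_DELAY) == pdTRUE) {
        // ... fill buf[0..7] with data ...
        xQueueSend(weatherQueue, &buf, portMAX_DELAY);
    }
}

void consumerStep(void) {
    dailyWeather *buf;
    if (xQueueReceive(weatherQueue, &buf, portMAX_DELAY) == pdTRUE) {
        // ... use the data ...
        xQueueSend(bufferPool, &buf, portMAX_DELAY);   // return the buffer to the pool
    }
}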
In case your structures are small, so the copy-in/copy-out overhead is more or less constrained, it makes more sense to create the queue so you work directly with structure instances and structure copies instead of buffer pointers.
Firstly, it's not a good idea to create the queue at global scope like you do. A global queue handle is OK, but run xQueueCreate() in the same function that creates task1 and task2 (the queue must be created before the tasks), something like this:
QueueHandle_t weatherQueue = NULL;

void main() {
    weatherQueue = xQueueCreate(32, sizeof(struct dailyWeather));
    if (!weatherQueue) {
        // Handle error
    }
    if (xTaskCreate(task1, ...) != pdPASS) {
        // Handle error
    }
    if (xTaskCreate(task2, ...) != pdPASS) {
        // Handle error
    }
}
Secondly, the code in task1() does the following in a loop:
Create a new array of 8 dailyWeather structs on the stack (in the scope of a single loop iteration).
Copy a pointer to the first item of weatherDATA[] to the queue (task2 will receive it a bit later, when it's time to switch tasks).
Release the array of 8 dailyWeather structs (because we're exiting the loop scope).
A bit later task2() executes and tries to read through the pointer to the first item of weatherDATA[]. However, that memory has probably been released already, so you can't dereference it.
So you're passing pointers to invalid memory over the queue.
It's much, much easier to work with a queue if you just pass the data you want to send instead of a pointer to it. Your structure is small and consists of elementary data types, so it's a good idea to pass it over the queue in its entirety, one struct at a time (you can pass an entire array if you want, but this way is simpler).
Something like this:
void task1(void *pvParameters) {
    for (;;) {
        dailyWeather weatherDATA[8] = {};
        // Code to fill the array of structs with data
        for (int i = 0; i < 8; i++) {
            // Copy the structs to the queue, one at a time
            if (xQueueSend( weatherQueue, &(weatherDATA[i]), ( TickType_t ) 0 ) == pdTRUE) {
                // The message was sent successfully
            }
        }
    }
}
On the receiver side:
void task2(void *pvParameters) {
    for (;;) {
        dailyWeather weatherDATA;
        if( xQueueReceive(weatherQueue, &( weatherDATA ), ( TickType_t ) 0 ) ) {
            Serial.println("Received");
            Serial.println(weatherDATA.dayOfWeek);
        }
    }
}
I cannot recommend the official FreeRTOS book enough, it's a great resource for beginners.
Thanks to all for the answers.
Finally I added a variable to track the item position and passed all the data through the queue to the destination task, one struct at a time. Then I put those structs back into another array.
common.h
#include <stdint.h>
struct dailyWeather {
    // Day of the week starting in Monday (1)
    uint8_t dayOfWeek;
    // Min and Max daily temperature
    float minTemperature;
    float maxTemperature;
    uint8_t weather;
    uint8_t itemOrder;
};
file1.h
// Queue
extern QueueHandle_t weatherQueue;
file1.cpp
#include "common.h"
#include "file1.h"
// Queue
QueueHandle_t weatherQueue = xQueueCreate( 2 * 8, sizeof(struct dailyWeather) );
void task1(void *pvParameters) {
for (;;) {
dailyWeather weatherDATA[8];
// Code to fill the array of structs with data
for (uint8_t i = 0; i < 8; i++) {
weatherDATA[i].itemOrder = i;
if (xQueueSend( weatherQueue, &weatherDATA[i], ( TickType_t ) 0 ) == pdTRUE) {
// The message was sent sucessfully
}
}
}
}
file2.cpp
#include "common.h"
#include "file1.h"
void task2(void *pvParameters) {
dailyWeather weatherDATA_D[8];
for (;;) {
dailyWeather weatherDATA;
if( xQueueReceive(weatherQueue, &( weatherDATA ), ( TickType_t ) 0 ) ) {
Serial.println("Received");
weatherDATA_D[weatherDATA.itemOrder] = weatherDATA;
}
}
}
Best regards.

Why does the ZeroMQ ROUTER-DEALER pattern have high latency?

Using libzmq 4.2.5 on CentOS 7. I am getting very high latency when messages are sent from DEALER to ROUTER and even from ROUTER to DEALER. So I wrote a simple client-server program using plain TCP and sent messages between them just for comparison. TCP appears to be fast.
Sending a single byte from DEALER to ROUTER, ZeroMQ takes 900 microseconds.
Sending a single byte from client to server, plain TCP takes 150 microseconds.
What am I doing wrong? I thought ZeroMQ would be at least as fast as plain TCP. Is there any tuning I can do to make ZeroMQ faster?
Update
router.cpp
#include <zmq.hpp>
#include <cstdio>
#include <cstring>
#include <ctime>
#include <cerrno>

struct data
{
    char one[21];
    unsigned long two;
};

data * pdata;
std::size_t counter=0;

int main()
{
    zmq::context_t context(1);
    zmq::socket_t Device(context,ZMQ_ROUTER);

    int iHighWaterMark=0;
    Device.setsockopt(ZMQ_SNDHWM,&iHighWaterMark,sizeof(int));
    Device.setsockopt(ZMQ_RCVHWM,&iHighWaterMark,sizeof(int));
    Device.bind("tcp://0.0.0.0:5555");

    pdata=new data[10000];
    struct timespec ts_dtime;
    unsigned long sec;
    zmq::message_t message;
    zmq::pollitem_t arrPollItems[]={{Device, 0, ZMQ_POLLIN, 0},{NULL, 0, ZMQ_POLLIN, 0}};

    while(counter < 10000)
    {
        try
        {
            int iAssert = zmq::poll(arrPollItems, 1, -1);
            if (iAssert <= 0)
            {
                if (-1 == iAssert)
                {
                    printf("zmq_poll failed errno: %d error:%s", errno,
                           zmq_strerror(errno));
                }
                continue;
            }
            if (arrPollItems[0].revents == ZMQ_POLLIN)
            {
                while(true)
                {
                    if(! Device.recv(&message,ZMQ_DONTWAIT))   // identity frame
                        break;
                    Device.recv(&message);                     // payload frame
                    strncpy(pdata[counter].one,
                            (char*)message.data(),message.size());
                    clock_gettime(CLOCK_REALTIME, &ts_dtime);
                    pdata[counter].two = (ts_dtime.tv_sec*1e9)+ ts_dtime.tv_nsec;
                    ++counter;
                }
            }
        }
        catch(...)
        {
        }
    }

    for(int i=0;i<counter;++i)
        printf("%d %s %lu\n",i+1,pdata[i].one,pdata[i].two);
    return 0;
}
dealer.cpp
#include <zmq.hpp>
#include <unistd.h>
#include <cstdio>
#include <ctime>

int main()
{
    zmq::context_t context(1);
    zmq::socket_t Device(context,ZMQ_DEALER);

    int iHighWaterMark=0;
    Device.setsockopt(ZMQ_SNDHWM,&iHighWaterMark,sizeof(int));
    Device.setsockopt(ZMQ_RCVHWM,&iHighWaterMark,sizeof(int));
    Device.setsockopt(ZMQ_IDENTITY,"TEST",4);
    Device.connect("tcp://0.0.0.0:5555");
    usleep(100000);

    struct timespec ts_dtime;
    unsigned long sec;

    for(std::size_t i=0;i<10000;++i)
    {
        clock_gettime(CLOCK_REALTIME, &ts_dtime);
        sec=(ts_dtime.tv_sec*1e9)+ ts_dtime.tv_nsec;
        zmq::message_t message(21);
        sprintf((char *)message.data(),"%lu",sec);
        Device.send(message);
        usleep(500);
    }
    return 0;
}
Update 2:
router.cpp
#include <zmq.hpp>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main (int argc, char *argv[])
{
    const char *bind_to;
    int roundtrip_count;
    size_t message_size;
    int rc;
    int i;

    if (argc != 4) {
        printf ("usage: local_lat <bind-to> <message-size> "
                "<roundtrip-count>\n");
        return 1;
    }
    bind_to = argv[1];
    message_size = atoi (argv[2]);
    roundtrip_count = atoi (argv[3]);

    zmq::context_t ctx(1);
    zmq::socket_t s(ctx,ZMQ_ROUTER);
    zmq::message_t msg,id;

    int iHighWaterMark=0;
    s.setsockopt(ZMQ_SNDHWM , &iHighWaterMark, sizeof (int));
    s.setsockopt(ZMQ_RCVHWM , &iHighWaterMark, sizeof (int));
    s.bind( bind_to);

    struct timespec ts_dtime;
    unsigned long sec;

    for (i = 0; i != roundtrip_count; i++) {
        rc = s.recv(&id);
        if (rc < 0) {
            printf ("error in zmq_recvmsg: %s\n", zmq_strerror (errno));
            return -1;
        }
        rc = s.recv(&msg, 0);
        if (rc < 0) {
            printf ("error in zmq_recvmsg: %s\n", zmq_strerror (errno));
            return -1;
        }
        clock_gettime(CLOCK_REALTIME, &ts_dtime);
        sec=((ts_dtime.tv_sec*1e9)+ ts_dtime.tv_nsec);
        printf("%.*s %lu\n",20,(char *)msg.data(),sec);
    }
    s.close();
    return 0;
}
dealer.cpp
#include <zmq.hpp>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>

int main (int argc, char *argv[])
{
    const char *connect_to;
    int roundtrip_count;
    size_t message_size;
    int rc;
    int i;
    void *watch;
    unsigned long elapsed;
    double latency;

    if (argc != 4) {
        printf ("usage: remote_lat <connect-to> <message-size> "
                "<roundtrip-count>\n");
        return 1;
    }
    connect_to = argv[1];
    message_size = atoi (argv[2]);
    roundtrip_count = atoi (argv[3]);

    zmq::context_t ctx(1);
    zmq::socket_t s(ctx,ZMQ_DEALER);
    struct timespec ts_dtime;
    unsigned long sec;

    int iHighWaterMark=0;
    s.setsockopt(ZMQ_SNDHWM , &iHighWaterMark, sizeof (int));
    s.setsockopt(ZMQ_RCVHWM , &iHighWaterMark, sizeof (int));
    s.connect(connect_to);

    for (i = 0; i != roundtrip_count; i++) {
        zmq::message_t msg(message_size+20);
        clock_gettime(CLOCK_REALTIME, &ts_dtime);
        sec=(ts_dtime.tv_sec*1e9)+ ts_dtime.tv_nsec;
        sprintf((char *)msg.data(),"%lu",sec);
        rc = s.send(msg);
        if (rc < 0) {
            printf ("error in zmq_sendmsg: %s\n", zmq_strerror (errno));
            return -1;
        }
        sleep(1);
    }
    s.close();
    return 0;
}
Output (sender's send timestamp and router's receive timestamp, in nanoseconds):
1562125527489432576 1562125527489773568
1562125528489582848 1562125528489961472
1562125529489740032 1562125529490124032
1562125530489944832 1562125530490288896
1562125531490101760 1562125531490439424
1562125532490261248 1562125532490631680
1562125533490422272 1562125533490798080
1562125534490555648 1562125534490980096
1562125535490745856 1562125535491161856
1562125536490894848 1562125536491245824
1562125537491039232 1562125537491416320
1562125538491229184 1562125538491601152
1562125539491375872 1562125539491764736
1562125540491517184 1562125540491908352
1562125541491657984 1562125541492027392
1562125542491816704 1562125542492193536
1562125543491963136 1562125543492338944
1562125544492103680 1562125544492564992
1562125545492248832 1562125545492675328
1562125546492397312 1562125546492783616
1562125547492543744 1562125547492926720
1562125564495211008 1562125564495629824
1562125565495372032 1562125565495783168
1562125566495515904 1562125566495924224
1562125567495660800 1562125567496006144
1562125568495806464 1562125568496160000
1562125569495896064 1562125569496235520
1562125570496080128 1562125570496547584
1562125571496235008 1562125571496666624
1562125572496391424 1562125572496803584
1562125573496532224 1562125573496935680
1562125574496652800 1562125574497053952
1562125575496843776 1562125575497277184
1562125576496997120 1562125576497417216
1562125577497182208 1562125577497726976
1562125578497336832 1562125578497726464
1562125579497549312 1562125579497928704
1562125580497696512 1562125580498115328
1562125581497847808 1562125581498198528
1562125582497998336 1562125582498340096
1562125583498140160 1562125583498622464
1562125584498295296 1562125584498680832
1562125585498445312 1562125585498842624
1562125586498627328 1562125586499025920
All the differences are in the range of 350-450 us.
Q1: What am I doing wrong? I thought zmq would be at least as fast as tcp.
Code-wise, nothing.
Performance-wise, ZeroMQ is fantastic, plus it has so many features that tcp does not and will not provide right out of the box:
Test-setup "Sending single byte..." seems to step right into the left edge of the high-performance / low-latency messaging service:
Lets first understand the Latency and where did it come from:
The observed resulting latency figures are product of the overall sum of the resources-usage ( resources allocations + resources pools management operations + data manipulations ) and processing-efforts ( all we try to do with the data, here including times, that our task had to spend in a waiting queue, due to the system-scheduler planned multi-tasking workunits scheduling, that are not from our testing workload, but the operating system has to schedule and execute, according to the fair-scheduling-policy and actual process-priority settings ) and communications channels transport-delays ( comms E2E transport latency )
Lets next understand what do we try to compare with:
A difference between a Transmission Control Protocol ( raw tcp ) and a ZeroMQ zmq framework of smart Scalable Formal Communication Archetypes with a rich set of high-level, distributed behaviours, is about a few galaxies big.
ZeroMQ was designed rather as a Signalling and Messaging infrastructure, using some of this feature-rich set of behaviours that match together - often depicted by some human-like behaviour archetype:
One PUSH-es, any number of joined counterparties PULL
One REQ-ests, someone from a group on the other end of the phone REP-lies
One, potentially one from some larger group of agents, PUB-lishes; any number of already subscribed subscribers receive such a SUB-scribed message.
For details, one may kindly read a brief overview about the main conceptual differences in [ ZeroMQ hierarchy in less than a five seconds ] Section.
This is nothing a TCP-protocol will ever provide on its own.
This is a comfort one likes to pay for by some negligible amount of latency. Negligible? Yes, negligible once compared to the many man*years of ultimate software craftsmanship anyone would have to pay for designing another at least similarly smart messaging framework to compete with ZeroMQ.
Q2: Is there any tuning I can do to make zmq faster?
Maybe yes, maybe not.
Update:
- try avoiding the Identity management ( tcp has no such thing either, so the measured RTTs are the less comparable or meaningful )
- try avoiding the blocking manner of the HWM configurations ( tcp has no such thing either )
- may try to measure the same over a non-tcp protocol ( a PAIR/PAIR Formal Scalable Communication Archetype, best over the least complex protocol data-pumps such as inproc:// or ipc:// in case your SandBox test bed still needs to keep distributed, non-local copies etc. ), so as to isolate the ZeroMQ context-instance's internal overheads spent on the .send() resp. .receive() methods
- may try to allow for a slight increase in performance by using more threads available for the Context instance ( a minimal sketch follows this list )
- ( other performance-demasking tricks depend on the nature of real-world usage - such as robustness to dropped messages, feasibility to use a conflated mode of operation, better buffer alignment with the O/S, zero-copy tricks - all being of some interest here, yet they have to keep the smart ZeroMQ infrastructure of distributed behaviours operational, which is a by far more complex task to execute than a trivial serial sequence of otherwise blind and isolated tcp-socket byte-level operations - so, comparing times is possible, but comparing an individual draconic dragster-class vehicle with something like a globally operating infrastructure of distributed behaviour ( like Taxify or Uber, named here just to use a trivialised (dis-)similarity of approximately the same order of magnitude ) leaves the numbers reporting about phenomena that do not provide the same comfort, scalability of use-cases, almost linear performance scaling and robustness of real-world use )
- may add more scheduling determinism by hard-wiring the Context instance's respective IoTHREADs onto respective CPU core(s), so that the overall I/O performance never gets evicted from the CPU schedule and remains deterministically mapped / pre-locked on even exclusively, administratively dedicated CPU core(s) - whether to attempt this ultimate performance hack depends on the level of need and on administrative policies
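A minimal sketch of the "more I/O threads" idea (the thread count and affinity mask are illustrative values; whether this helps at all depends entirely on the workload):

#include <zmq.hpp>
#include <cstdint>

int main()
{
    zmq::context_t ctx(2);                          // 2 I/O threads instead of the default 1
    zmq::socket_t  s(ctx, ZMQ_DEALER);

    uint64_t affinity = 1;                          // steer this socket onto I/O thread #0
    s.setsockopt(ZMQ_AFFINITY, &affinity, sizeof(affinity));

    s.connect("tcp://127.0.0.1:5555");
    // ... send / receive exactly as in the benchmark above ...
    return 0;
}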
For any performance related tweaking, one will need to post an MCVE + a fully described benchmark test suite. The ZeroMQ Latency test results report shows:
Conclusion
In a controlled environment RDTSC instruction can be used to measure time rapidly. This allows us to measure latency/density for individual messages instead of computing averages for the whole test.
We've used this approach to get performance figures of ØMQ lightweight messaging kernel (version 0.1) and we've got following results:
-- In low-volume case the latency is almost the same as the latency of the underlying transport (TCP): 50 microseconds.
-- The average jitter of latency is minimal: 0.225 microsecond.
-- The throughput on sender side is 4.8 million messages a second.
-- The density on sender side is mostly about 0.140 microsecond, however, with occasional peaks the mean density is 0.208 microsecond.
-- The throughput on receiver side is 2.7 million messages a second.
-- The density on receiver side is mostly about 0.3 microsecond. Approximately each 100 messages new batch is received causing density to grow to 3-6 microseconds. The mean density is 0.367 microsecond.
If in ultimate need of latency shaving, one may try nanomsg, ZeroMQ's younger sister, originated by Martin SUSTRIK, the co-father of ZeroMQ ( now maintained, AFAIK, by someone else ).

MQL4 Function pointer / function callback solution

As far as I have seen, function pointers do not exist in MQL4.
As a workaround I use:
// included for both the caller and the callee side
class Callback {
public:
    virtual void callback() { return; }
};
Then in the source where a callback is passed from:
class mycb : Callback {
public:
    virtual void callback() {
        // call whatever function needs to be called back in this source
    }
} mcbi;
Now mcbi can be passed as follows:
afunction() {
    fie_to_receive_callback((Callback *)mcbi);
}
and the receiver can call back as:
fie_to_receive_callback(Callback *mcbi) {
    mcbi.callback();   // call the callback function
}
Is there a simpler way to pass a function callback in MQL4?
Actually there is a way, using function pointers in MQL4.
Here is an example:
typedef int (*MyFuncType)(int, int);

int addition (int a, int b)
{ return (a+b); }

int subtraction (int a, int b)
{ return (a-b); }

int operation (int x, int y, MyFuncType myfunc)
{
    int g;
    g = myfunc(x, y);
    return (g);
}

int OnInit()
{
    int m, n;
    m = operation (7, 5, addition);
    n = operation (20, m, subtraction);
    Print(n);
    return(INIT_FAILED); // just to close the expert
}
No. Fortunately there is not. ( . . . . . . . however MQL4 language syntax creeps * )
The MQL4 Runtime Execution Engine ( MT4 ) has rather fragile process/thread handling, and adding more ( and smarter ) constructs ( beyond the rudimentary { OnTimer() | OnTick() | OnCalculate() } event-bound callbacks ) constitutes rather a threat to the already unguaranteed RealTime execution of the main MT4 duties. While "New"-MQL4.56789 may provide hacks for doing so, a safer strategy is to off-load: go distributed and let the MT4-legacy handlers receive "pre-baked" results from an external processing cluster, rather than trying to hang more and more flittering gadgets on a years-old, poor Xmas tree.
To realise how drastic this danger avoidance is, just notice that the original OnTimer() used a 1 second resolution ( yes, 1.000.000.000 ns steps, in a world where stream providers label events in nanoseconds ... ).
*): Yes, since the "new"-MQL4 introduction, there have been many stealth-mode changes to the original MQL4 language. After each update it is more than recommendable to review the "new" Help file, as there might be both new options & nasty surprises. Maintaining an MQL4 code base of more than a few hundred man*years, this is indeed a very devastating experience.

What's the correct method for a CoreAudio realtime thread to communicate with the UI thread?

I need to pass data between CoreAudio's realtime thread and the UI thread (one way, RT->UI). I know I can't use any Cocoa/Objective-C methods like performSelectorOnMainThread or NSNotification, and I can't use anything that allocates memory, as this could block the RT thread.
What is the correct method for communicating between these threads? Can I use GCD message queues, or is there a more basic mechanism to use?
Edit:
Thinking about this a bit more, I suppose I could use a lock-free ring buffer that the RT thread puts messages into and that the UI thread checks for messages to pull out. Is this the best way, and if so, is there a system already to do this in CoreAudio or available elsewhere, or do I need to code it up myself?
It turns out this was a lot simpler than I expected, and the solution I came up with was just to use the PortAudio ring buffer. I needed to add pa_ringbuffer.[ch] and pa_memorybarrier.h to my project and then define a MessageData structure to store in the ring buffer.
typedef struct MessageData {
    MessageType type;
    union {
        struct {
            NSUInteger position;
        } position;
    } data;
} MessageData;
Then I allocated some space to store 32 messages and created the ring buffer.
_playbackData->RTToMainBuffer = malloc(sizeof(MessageData) * 32);
PaUtil_InitializeRingBuffer(&_playbackData->RTToMainRB, sizeof(MessageData),
                            32, _playbackData->RTToMainBuffer);
Finally I started an NSTimer that fires every 20 ms to pull data from the ring buffer:
while (PaUtil_GetRingBufferReadAvailable(&_playbackData->RTToMainRB)) {
    MessageData *dataPtr1, *dataPtr2;
    ring_buffer_size_t sizePtr1, sizePtr2;

    // Should we read more than one at a time?
    if (PaUtil_GetRingBufferReadRegions(&_playbackData->RTToMainRB, 1,
                                        (void *)&dataPtr1, &sizePtr1,
                                        (void *)&dataPtr2, &sizePtr2) != 1) {
        continue;
    }

    // Parse message
    switch (dataPtr1->type) {
        case MessageTypeEOS:
            break;
        case MessageTypePosition:
            break;
        default:
            break;
    }

    PaUtil_AdvanceRingBufferReadIndex(&_playbackData->RTToMainRB, 1);
}
Then in the realtime thread, pushing a message to the ringbuffer was simply
MessageData *dataPtr1, *dataPtr2;
ring_buffer_size_t sizePtr1, sizePtr2;

if (PaUtil_GetRingBufferWriteRegions(&data->RTToMainRB, 1,
                                     (void *)&dataPtr1, &sizePtr1,
                                     (void *)&dataPtr2, &sizePtr2)) {
    dataPtr1->type = MessageTypePosition;
    dataPtr1->data.position.position = currentPosition;
    PaUtil_AdvanceRingBufferWriteIndex(&data->RTToMainRB, 1);
}
A ring buffer is a good solution. Use two if you need to communicate both ways, i.e. inbox/outbox message passing.
This is a good implementation for iOS/Mac if you don't want to use PortAudio:
https://github.com/michaeltyson/TPCircularBuffer
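For comparison, a minimal sketch of what the same one-way message passing could look like with TPCircularBuffer, assuming the MessageData struct defined above (the function names are from that project, but exact signatures should be checked against its header; error handling omitted):

#include "TPCircularBuffer.h"

static TPCircularBuffer messageBuffer;

// Setup (non-realtime): room for 32 MessageData records.
TPCircularBufferInit(&messageBuffer, 32 * sizeof(MessageData));

// Realtime thread: write one message if there is room.
int32_t availableBytes;
MessageData *head = (MessageData *)TPCircularBufferHead(&messageBuffer, &availableBytes);
if (head && availableBytes >= (int32_t)sizeof(MessageData)) {
    head->type = MessageTypePosition;
    head->data.position.position = currentPosition;
    TPCircularBufferProduce(&messageBuffer, sizeof(MessageData));
}

// UI thread (e.g. on a timer): drain whatever is available.
MessageData *tail = (MessageData *)TPCircularBufferTail(&messageBuffer, &availableBytes);
while (tail && availableBytes >= (int32_t)sizeof(MessageData)) {
    // handle *tail here ...
    TPCircularBufferConsume(&messageBuffer, sizeof(MessageData));
    tail = (MessageData *)TPCircularBufferTail(&messageBuffer, &availableBytes);
}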

How to put my structure variable into CPU caches to eliminate main memory page access time?

It's clear that there is no explicit way, nor certain system calls, that help programmers put a variable into the CPU cache.
But I think that a certain programming style, or a well designed algorithm, can increase the chances that the variable gets cached in the CPU caches.
Here is my example:
I want to append an 8-byte structure at the end of an array consisting of the same type of structures, declared in the global main memory region.
This process is repeated 4 million times. It takes 6 seconds, 1.5 us for each operation. I think this result tells me that the two memory areas have not been cached.
I got some clues from cache-oblivious algorithms, so I tried several ways to enhance this. Until now, no enhancement.
I think some clever code could reduce the elapsed time by a factor of 10 to 100. Please show me the way.
-------------------------------------------------------------------------
Appended (2011-04-01)
Damon~ thank you for your comment!
After reading your comment, I analyzed my code again and found several things that I missed. The following code that I attached is the abbreviated version of my original code.
To accurately measure each operation's execution time (in the original code, there are several different types of operations), I inserted time measuring code using the clock_gettime() function. I thought that if I measured each operation's execution time and accumulated them, the additional cost of the main loop could be avoided.
In the original code, the time measuring code was hidden by a macro function, so I totally forgot about it.
The running time of this code is almost 6 seconds. But if I get rid of the time measuring function in the main loop, it becomes 0.1 seconds.
Since the clock_gettime() function supports very high precision (up to 1 nanosecond), is executed on the basis of an independent thread, and also requires a very big structure, I think the function caused the cache-out of the main memory area where the consecutive insertions are performed.
Thank you again for your comment. For further enhancement, any suggestion will be very helpful for me to optimize my code.
I think the hierarchically defined structure variable might cause unnecessary time cost, but first I want to know how much it would be, before I change it to more C-style code.
typedef struct t_ptr {
    uint32 isleaf :1, isNextLeaf :1, ptr :30;
    t_ptr(void) {
        isleaf = false;
        isNextLeaf = false;
        ptr = NIL;
    }
} PTR;

typedef struct t_key {
    uint32 op :1, key :31;
    t_key(void) {
        op = OP_INS;
        key = 0;
    }
} KEY;

typedef struct t_key_pair {
    KEY key;
    PTR ptr;
    t_key_pair() {
    }
    t_key_pair(KEY k, PTR p) {
        key = k;
        ptr = p;
    }
} KeyPair;

typedef struct t_op {
    KeyPair keyPair;
    uint seq;
    t_op() {
        seq = 0;
    }
} OP;

#define MAX_OP_LEN 4000000
typedef struct t_opq {
    OP ops[MAX_OP_LEN];
    int freeOffset;
    int globalSeq;
    bool queueOp(register KeyPair keyPair);
} OpQueue;

bool OpQueue::queueOp(register KeyPair keyPair) {
    bool isFull = false;
    if (freeOffset == (int) (MAX_OP_LEN - 1)) {
        isFull = true;
    }
    ops[freeOffset].keyPair = keyPair;
    ops[freeOffset].seq = globalSeq++;
    freeOffset++;
    return isFull;   // report whether the queue was already full
}

OpQueue opQueue;

#include <sys/time.h>
#include <time.h>
#include <stdio.h>

int main() {
    struct timespec startTime, endTime, totalTime = {0, 0};
    for(int i = 0; i < 4000000; i++) {
        clock_gettime(CLOCK_REALTIME, &startTime);
        opQueue.queueOp(KeyPair());
        clock_gettime(CLOCK_REALTIME, &endTime);
        totalTime.tv_sec += (endTime.tv_sec - startTime.tv_sec);
        totalTime.tv_nsec += (endTime.tv_nsec - startTime.tv_nsec);
    }
    printf("\n elapsed time: %lld", totalTime.tv_sec * 1000000LL + totalTime.tv_nsec / 1000L);
}
YOU don't put the structure into any cache; the CPU does that automatically for you. The CPU is even more clever than that: if you access memory sequentially, it will start prefetching things from memory into the cache before you read them.
And really, it should be common sense that for a simple bit of code like this, the time you spend on measuring is ten times more than the time to perform the code (apparently 60 times in your case).
Since you put so much confidence in clock_gettime(): I suggest you call it five times in a row and store the results, then print the differences. There's resolution, there's precision, and there's how long it takes to return the current time, which is pretty damned long.
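A minimal sketch of that suggestion (just to see the timer's own overhead and granularity; the numbers will vary per machine):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t[5];

    /* call the clock five times back to back */
    for (int i = 0; i < 5; i++)
        clock_gettime(CLOCK_REALTIME, &t[i]);

    /* print the deltas between consecutive calls */
    for (int i = 1; i < 5; i++) {
        long delta_ns = (t[i].tv_sec - t[i-1].tv_sec) * 1000000000L
                      + (t[i].tv_nsec - t[i-1].tv_nsec);
        printf("call %d -> %d: %ld ns\n", i - 1, i, delta_ns);
    }
    return 0;
}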
I have been unable to force caching, but you can force memory to be uncacheable. If you have other large data structures, you might exclude those so that they will not pollute your caches. This can be done by specifying PAGE_NOCACHE for the Windows VirtualAllocXXX functions.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366786(v=vs.85).aspx
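A minimal sketch of that idea on Windows (the region size is illustrative; whether excluding a region this way actually helps depends on your access pattern):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* commit a region whose pages are marked non-cacheable,
       so accesses to it do not evict other data from the CPU caches */
    void *big = VirtualAlloc(NULL, 64 * 1024 * 1024,
                             MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE | PAGE_NOCACHE);
    if (big == NULL) {
        printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }

    /* ... place the large, rarely reused data structures here ... */

    VirtualFree(big, 0, MEM_RELEASE);
    return 0;
}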
