Fastest way to pass data inside a process using ZeroMQ

This is my first ZeroMQ application, so sorry about the simple question.
In my financial software I receive quotes from a stock exchange. Each update looks like this:
struct OrderUpdate {
    uint32_t instrumentId;
    uint32_t MDEntryID;
    uint32_t MDUpdateAction; // 0 - New, 1 - Change, 2 - Delete
    double /*decimal*/ MDEntryPx;
    double /*decimal*/ MDEntrySize;
    uint32_t RptSeq;
    uint32_t MDEntryTime;
    uint32_t OrigTime;
    char MDEntryType;
};
I do not allocate these structures at runtime; instead I pre-allocate them and then reuse (reconfigure) them.
I need to pass these structures from C++ to C# (later C++ to C++, because I will move to Linux).
Which ZeroMQ techniques should I use? As far as I understand, I should:
use PUB-SUB, because I have one reader and one writer
use inproc as the fastest transport (I understand its limitations and am OK with them)
use zero-copy to deliver my OrderUpdate structure to the ZeroMQ publisher buffer
Am I correct about that, and can you suggest anything more?


CGO: converting Go byte array into C char* and back, issue with null terminator in byte array

Brief Context:
I am trying to create a cgo cache from uint64->[]byte for my company. Golang's map[uint64][]byte incurs considerable latency due to the garbage collector, as []byte is considered a pointer. As such, I would like to try to use CGO to avoid the issue. I am currently implementing a C++ unordered_map<unsigned long long, char*> to handle that. I managed to get the C++ wrappers to work, but I am facing a major issue.
Currently, I am converting my Go byte array using
b := []byte{1, 0, 3, 32, 2, 2, 2, 2}
str := C.CString(string(b))
bb := []byte(C.GoString(str))
However, it turns out that bb is []byte{1}. The 0 in the byte array is treated as the '\0' terminator and thus shortens the string. Furthermore, it seems to have caused an out-of-memory issue when I delete entries with
delete (map->find(key))->second. I suspect this is because the chars after the first '\0' do not get deallocated.
I am not sure how else to do this. I am new to CGO and never used it before this project, so any help would be appreciated.
There are two problems:
Use C.CBytes, not C.CString.
Please read this in its entirety before using cgo.
The C.C* functions which allocate (C.CString and C.CBytes do that) internally call malloc() from the linked-in libc to get the destination memory block, and you're supposed to eventually call C.free() on the result (which calls libc's free()), as documented.
AFAIK, a C++ compiler is by no means obliged to use libc's malloc() and free() to implement new and delete, so calling delete on the results of C.C* functions is a sure path to disaster.
The simplest way to solve this, IMO, is to export a "constructor" function from your C++ side: something like
extern "C" {
    char* clone(char *src, size_t len);
}
…which would 1) allocate a memory block of length len using whatever method works best for C++; 2) copy len bytes from src to it; 3) return it.
You could then call it from the Go side, as something like C.clone((*C.char)(unsafe.Pointer(&b[0])), C.size_t(len(b))), and call it a day: the C++ side is free to call delete on the result (delete[] if the block was allocated with new[]).
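A minimal sketch of what such a "constructor" could look like on the C++ side. The names clone_bytes and release_bytes are hypothetical (the answer only suggests something like clone), and pairing new[] with delete[] is one possible choice of allocator, not the only one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

extern "C" {

// Hypothetical helper: copies exactly len bytes from src into a block that
// the C++ side owns. Embedded zero bytes are copied like any other byte,
// because the length is passed explicitly instead of relying on '\0'.
char* clone_bytes(const char* src, std::size_t len) {
    assert(src != nullptr || len == 0);
    char* dst = new char[len];
    if (len != 0) {
        std::memcpy(dst, src, len);
    }
    return dst;
}

// Hypothetical helper: releases a block returned by clone_bytes.
// The block came from new[], so it must go back through delete[].
void release_bytes(char* block) {
    delete[] block;
}

}  // extern "C"
```

Because the block carries no terminator, the map would also have to remember each value's length (for example by storing a pointer-plus-length pair) so the bytes can be handed back to Go in full.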

Proper way to manipulate registers (PUT32 vs GPIO->ODR)

I'm learning how to use microcontrollers without a bunch of abstractions. I've read somewhere that it's better to use PUT32() and GET32() instead of volatile pointers and stuff. Why is that?
With a basic pin wiggle "benchmark," the performance of GPIO->ODR=0xFFFFFFFF seems to be about four times faster than PUT32(GPIO_ODR, 0xFFFFFFFF), as shown by the scope:
(The one with lower frequency is PUT32)
This is my code using PUT32
PUT32(0x40021034, 0x00000002); // RCC IOPENR B
PUT32(0x50000400, 0x00555555); // PB MODER
while (1) {
    PUT32(0x50000414, 0x0000FFFF); // PB ODR
    PUT32(0x50000414, 0x00000000);
}
This is my code using the arrow thing
*(volatile uint32_t *) 0x40021034 = 0x00000002; // RCC IOPENR B
GPIOB->MODER = 0x00555555; // PB MODER
while (1) {
    GPIOB->ODR = 0x00000000; // PB ODR
    GPIOB->ODR = 0x0000FFFF;
}
I shamelessly adapted the assembly for PUT32 from somewhere
STR R1,[R0]
My questions are:
Why is one method slower when it looks like they're doing the same thing?
What's the proper or best way to interact with GPIO? (Or rather what are the pros and cons of different methods?)
Additional information:
Chip is STM32G031G8Ux, using Keil uVision IDE.
I didn't configure the clock to go as fast as it can, but it should be consistent for the two tests.
Here's my hardware setup: (Scope probe connected to the LEDs. The extra wires should have no effect here)
Thank you for your time, sorry for any misunderstandings
PUT32 is a totally non-standard method that the poster in that other question made up. They have done this to avoid the complication and possible mistakes in defining the register access methods.
When you use the standard CMSIS header files and assign to the registers in the standard way, then all the complication has already been taken care of for you by someone who has specific knowledge of the target that you are using. They have designed it in a way that makes it hard for you to make the mistakes that the PUT32 is trying to avoid, and in a way that makes the final syntax look cleaner.
Writing to the registers directly is quicker because a register write can take as little as a single cycle of the processor clock, whereas calling a function, writing to the register, and then returning takes four times longer in the context of your experiment.
By using this generic access method you also risk introducing bugs that are not possible with the manufacturer-provided header files: for example, using a 32-bit access when the register is 16 or 8 bits wide.
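To illustrate the call overhead point, here is a rough sketch of a PUT32-style helper written as an inline function, so the compiler can see its body and may emit the same single store as a direct register write. All names here (fake_odr, put32_inline, get32_inline, wiggle) are hypothetical, and fake_odr is an ordinary variable standing in for a register so the sketch runs on a PC. Whether the call is actually inlined depends on the compiler and optimization level.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a memory-mapped register so the sketch can run on a PC.
// On the target this would be a fixed address from the device header.
static volatile std::uint32_t fake_odr = 0;

// Hypothetical inline counterpart of an out-of-line PUT32 routine.
static inline void put32_inline(volatile std::uint32_t* reg, std::uint32_t value) {
    *reg = value;  // volatile store: the compiler may not drop or merge it
}

// Hypothetical inline counterpart of GET32.
static inline std::uint32_t get32_inline(const volatile std::uint32_t* reg) {
    return *reg;  // volatile load
}

// Drives the register high then low 'cycles' times, like the pin-wiggle loop.
void wiggle(volatile std::uint32_t* reg, unsigned cycles) {
    assert(reg != nullptr);
    for (unsigned i = 0; i < cycles; ++i) {
        put32_inline(reg, 0x0000FFFFu);
        put32_inline(reg, 0x00000000u);
    }
}
```

An assembly routine in a separate file, like the PUT32 in the question, cannot be inlined this way, so every access pays for the branch and the return.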

How to send multipart messages using libnl and generic netlink?

I'm trying to send a relatively big string (6 KB) through libnl and generic netlink; however, I'm receiving error -5 (NLE_NOMEM) from the function nla_put_string in this process. I've done a lot of research, but I didn't find any information about these two questions:
What's the maximum string size supported by generic netlink and libnl's nla_put_string function?
How to use the multipart mechanism of generic netlink to break this string into smaller parts, send them, and reassemble it on the kernel side?
If there is a place to study this subject, I would appreciate a pointer to it.
How to use the multipart mechanism of generic netlink to break this string into smaller parts, send them, and reassemble it on the kernel side?
Netlink's Multipart feature "might" help you transmit an already fragmented string, but it won't help you with the actual string fragmentation operation. That's your job. Multipart is a means to transmit several small correlated objects through several packets, not one big object. In general, Netlink as a whole is designed with the assumption that any atomic piece of data you want to send will fit in a single packet. I would agree with the notion that 6 KB worth of string is a bit of an oddball.
In actuality, Multipart is a rather ill-defined gimmick in my opinion. The problem is that the kernel doesn't actually handle it in any generic capacity; if you look at all the NLMSG_DONE usage instances, you will notice not only that it is very rarely read (most of them are writes), but also that it's not the Netlink code but rather some specific protocol doing it for some static (i.e. private) operation. In other words, the semantics of NLMSG_DONE are given by you, not by the kernel. Linux will not save you any work if you choose to use it.
On the other hand, libnl-genl-3 does appear to perform some automatic juggling with the Multipart flags (NLMSG_DONE and NLM_F_MULTI), but that only applies when you're sending something from Kernelspace to Userspace, and on top of that, even the library itself admits that it doesn't really work.
Also, NLMSG_DONE is supposed to be placed in the "type" Netlink header field, not in the "flags" field. This is baffling to me, because Generic Netlink stores the family identifier in type, so it doesn't look like there's a way to simultaneously tell Netlink that the message belongs to you, AND that it's supposed to end some data stream. Unless I'm missing something important, Multipart and Generic Netlink are incompatible with each other.
I would therefore recommend implementing your own message control if necessary, and forgetting about Multipart.
What's the maximum string size supported by generic netlink and libnl nla_put_string function?
It's not a constant. nlmsg_alloc() reserves getpagesize() bytes per packet by default. You can tweak this default with nlmsg_set_default_size(), or more to the point you can override it with nlmsg_alloc_size().
Then you'd have to query the actual allocated size (because it's not guaranteed to be what you requested) and build from there. To get the available payload you'd have to subtract the Netlink header length, the Generic Header length and the Attribute Header lengths for any attributes you want to add. Also the user header length, if you have one. You would also have to align all these components because their sizeof is not necessarily their actual size (example).
All that said, the kernel will still reject packets which exceed the page size, so even if you specify a custom size you will still need to fragment your string.
So really, just forget all of the above. Just fragment the string to something like getpagesize() / 2 or whatever, and send it in separate chunks.
This is the general idea:
static void do_request(struct nl_sock *sk, int fam, char const *string)
{
    struct nl_msg *msg;

    msg = nlmsg_alloc();
    genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, fam,
            0, 0, DOC_EXMPL_C_ECHO, 1);
    nla_put_string(msg, DOC_EXMPL_A_MSG, string);
    nl_send_auto(sk, msg);
    nlmsg_free(msg); /* the message is no longer needed once sent */
}

int main(int argc, char **argv)
{
    struct nl_sock *sk;
    int fam;

    sk = nl_socket_alloc();
    genl_connect(sk); /* connect before resolving the family */
    fam = genl_ctrl_resolve(sk, FAMILY_NAME);
    do_request(sk, fam, "I'm sending a string.");
    do_request(sk, fam, "Let's pretend I'm biiiiiig.");
    do_request(sk, fam, "Look at me, I'm so big.");
    do_request(sk, fam, "But I'm already fragmented, so it's ok.");
    nl_socket_free(sk);
    return 0;
}
I left a full sandbox in my Dropbox. See the README. (Tested in kernel 5.4.0-37-generic.)
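The fragmentation step itself is left to the caller. A minimal sketch of one way to do it, using only the standard library: split_into_chunks is a hypothetical helper, and the chunk size is whatever budget you settle on (the answer suggests something like getpagesize() / 2).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits text into consecutive pieces of at most
// max_chunk bytes each. Concatenating the pieces in order gives back text.
std::vector<std::string> split_into_chunks(const std::string& text,
                                           std::size_t max_chunk) {
    assert(max_chunk > 0);
    std::vector<std::string> chunks;
    for (std::size_t pos = 0; pos < text.size(); pos += max_chunk) {
        // substr clamps the count, so the last piece may be shorter.
        chunks.push_back(text.substr(pos, max_chunk));
    }
    return chunks;
}
```

Each piece could then be passed to something like do_request() in turn. Note that nla_put_string() also stores the terminating NUL, so the budget should leave a byte for it, and the receiver needs its own convention (a sequence number or a "last chunk" attribute, say) to know when the string is complete.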

Using Thrust Functions with raw pointers: Controlling the allocation of memory

I have a question regarding the thrust library when using CUDA.
I am using a thrust function, i.e. exclusive_scan, and I want to use raw pointers. I am using raw (device) pointers because I want to have full control of when the memory is allocated and deallocated.
After the function call, I will hand the pointer over to another data structure and then free the memory either in the destructor of this data structure or in the next function call, when I recompute my (device) pointers. I came across, for example, this problem here, which recommends wrapping the data in a device_vector. But then I run into the problem that the memory is freed once my device_vector goes out of scope, which I do not want. Making the device pointer global is also not an option, since I am hacking existing code, i.e. the pointer is used as a buffer and I would have to rewrite a lot if I wanted to do something like that.
Does anyone have a good workaround regarding this? The only chance I do see right now is to rewrite the thrust-function on my own, only using raw device-pointers.
EDIT: I misread, I can wrap it in a device_ptr instead of a device_vector.
Asking further though, how could I solve this if there wasn't the option of using a device_ptr?
There is no problem using plain pointers in thrust methods.
For data on the device do:
struct DoSomething {
    // thrust::generate calls the generator with no arguments
    __device__ int operator()() const { return 1; }
};
int* IntData;
cudaMalloc(&IntData, sizeof(int) * count);
auto dev_data = thrust::device_pointer_cast(IntData);
thrust::generate(dev_data, dev_data + count, DoSomething());
thrust::sort(dev_data, dev_data + count);
cudaFree(IntData); // release the memory whenever you decide to
For data on the host, use plain malloc/free and pass the raw pointers directly; raw_pointer_cast goes the other way from device_pointer_cast, turning a device_ptr back into a raw pointer.
See: thrust: Memory management

Apple SSL Secure Transport

I have just started to work with OS X and have no experience with it at all. All I want to do now is replace old OpenSSL code with the Apple Security API. I'm using Secure Transport and I'm a bit confused by these functions: SSLSetIOFuncs, SSLWrite, and SSLRead.
So SSLSetIOFuncs sets callbacks that perform the writing/reading operations (which I should implement). A lot of questions come up at this point:
First, I just don't understand why I should implement them (in OpenSSL this is implemented already). But OK, I just have to.
Should this implementation do encryption? I guess not.
Also, there are the following two functions:
OSStatus
SSLWrite (SSLContextRef context,
          const void * __nullable data,
          size_t dataLength,
          size_t *processed);
OSStatus
SSLRead (SSLContextRef context,
         void * data,
         size_t dataLength,
         size_t *processed);
They are "Normal application-level read/write." according to the code comments. So why do I need to define those two callbacks for reading and writing, then? And if the first two are callbacks, which functions should I call for reading/writing in my code (when I really need to read some data from the server)?
There is no good documentation and I got stuck on all of it. Maybe I'm just way too dumb, but a little help would be just perfect anyway. Please help!
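SecureTransport is callback-based, unlike OpenSSL's SSL_read() and SSL_write() functions: the callbacks you register with SSLSetIOFuncs only move the already-encrypted bytes over your socket, and your application still calls SSLRead and SSLWrite for its own data. This could require big changes to your code. If you want a read/write-style API that can use SecureTransport for encryption, look at CFNetwork and specifically CFStream.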
