nanopb oneof size requirements - protocol-buffers

I came across nanopb and want to use it in my project. I'm writing code for an embedded device so memory limitations are a real concern.
My goal is to transfer data items from device to device. Each data item has a 32-bit identifier and a value. The value can be anything from a single char to a float to a long string. I'm wondering what would be the most efficient way of declaring the messages for this type of problem.
I was thinking something like this:
message data_msg {
    message data_item {
        int32 id = 1;
        oneof value {
            int32 ival = 2;  // protobuf has no int8; small values are varint-encoded anyway
            float fval = 3;
            string sval = 4;
        }
    }
    repeated data_item items = 1;
}
But as I understand it, this is converted into a C union, which has the size of its largest member. Say I limit the string to 50 chars; then the union is always 50 bytes long, even if I only need 4 bytes for a float.
Have I understood this correctly or is there some other way to accomplish this?
Thanks!

Your understanding is correct: the structure size in C will equal the size of the largest member of the oneof. However, this is only the size in memory; the serialized message will only be as large as its contents require.
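Roughly speaking, the struct that nanopb generates for such a message would look something like the sketch below (names and the exact tag field are approximate, not actual generator output, and the string size would come from a max_size setting in a .options file):

typedef struct {
    int32_t id;
    pb_size_t which_value;   /* records which oneof member is currently set */
    union {
        int32_t ival;
        float fval;
        char sval[50];       /* bounded by max_size; this member alone dictates
                                the size of the whole union */
    } value;
} data_item_t;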
If the size in memory is an issue, there are several options available. The default of allocating the maximum possibly needed size makes memory management easy. If you want to dynamically allocate only the needed amount of memory, you'll have to decide how you want to do it:
Using the PB_ENABLE_MALLOC compilation option, you can use the FT_POINTER field type for the string and other large fields. The memory will then be allocated with malloc() from the system heap.
With FT_CALLBACK, instead of allocating any memory, you'll get a callback where you can read out the string and process or store it any way you want. For example, if you wanted to write the string to SD card, you could do so without ever storing it completely in memory.
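As a rough sketch of the FT_CALLBACK approach (the callback signature is the one nanopb uses for decode callbacks; everything else here is illustrative, not generated code), a decode callback can read the string in small chunks so it never has to sit in RAM in full:

#include <pb_decode.h>

/* Invoked by pb_decode() each time the callback-typed string field appears. */
static bool handle_sval(pb_istream_t *stream, const pb_field_t *field, void **arg)
{
    uint8_t chunk[16];

    while (stream->bytes_left > 0)
    {
        size_t n = stream->bytes_left;
        if (n > sizeof(chunk))
            n = sizeof(chunk);

        if (!pb_read(stream, chunk, n))
            return false;

        /* Process the chunk here: write it to an SD card, a UART, etc. */
    }
    return true;
}

You would then point the field's pb_callback_t at this function (its .funcs.decode member) before calling pb_decode().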
In overall system design, static allocation for maximum size required is often the easiest to test for. If the data fits once, it will always fit. If you go for dynamic allocation, you'll need to analyze more carefully what is the maximum possible memory usage needed.


Is there any performance difference between Buffer, StructuredBuffer and ByteAddressBuffer (also their RW variants)?

I tried looking this up on various websites, including the MS Docs on DirectX 11 compute shader types, but I haven't found anything mentioning performance differences between these buffer types.
Are they exactly the same performance-wise?
If not, what is the optimal way of using each in various scenarios?
Performance will ultimately differ depending on the GPU/driver combination.
There is a project (linked below) that benchmarks access patterns for those buffer types (the linear/random cases are the most useful).
The constant-access case is also useful if you want to compare cbuffer access against other buffer access (on NVIDIA it is common to perform a buffer-to-cbuffer GPU copy before running an expensive shader, for example).
https://github.com/sebbbi/perftest
Note also that different buffer types (in D3D11 land) have different limitations, so the performance benefit can be offset by those.
Structured buffers cannot be bound as vertex/index buffers, so if you want to use them that way you need to perform an extra copy. (For vertex buffers you can just fetch from the vertex id; there is no penalty for this. Index buffers can be read too, but are a bit more problematic.)
Byte address buffers allow you to store anything in a non-structured way (essentially a raw pointer). Reads are still aligned to 4 bytes (int size). Converting reads to float needs asfloat, and writes from float need asuint, but in practice the driver treats these as no-ops, so there is no performance impact.
Byte address buffers (and typed buffers) can be used as index or vertex buffers. No copy is necessary.
Typed buffers do not support interlocked operations well; in that case you need to use a structured/byte address buffer (note that you can use interlocked operations on a small buffer and perform the reads/writes on a typed buffer if you want).
Byte address buffers can be more annoying to use if you have an array of elements of the same type (even a float4x4 takes a decent amount of code to fetch compared to a StructuredBuffer<float4x4>).
Structured buffers allow you to bind "partial views". So even if your buffer has, say, 2048 floats, you can bind the range 4-456 (and you can bind 500-600 as writable at the same time, since the two ranges do not overlap).
For all buffers, if you use them as read-only, don't bind them as RW; that generally carries a decent penalty.
To add to the accepted answer,
There is also a performance penalty when elements in a StructuredBuffer are not aligned to a 128-bit stride [sizeof float4]. If they are not, a single float4, for example, could span two cache lines, causing up to a 5% performance penalty.
An example of how to solve this is to use padding to re-align elements:
struct Foo
{
    float4 Position;  // 16 bytes
    float  Radius;    // 4 bytes
    float  pad0;      // padding so that Rotation starts on a 16-byte boundary
    float  pad1;
    float  pad2;
    float4 Rotation;  // 16 bytes; total struct size is 48 bytes, a multiple of 16
};
NVIDIA post with more detail

How do I correctly adjust the Argon2 parameters in Go to consume less memory?

Argon2 by design is memory hungry. In the semi-official Go implementation, the following parameters are recommended when using IDKey:
key := argon2.IDKey([]byte("some password"), salt, 1, 64*1024, 4, 32)
where 1 is the time parameter and 64*1024 is the memory parameter. This means the library will create a 64 MB buffer when hashing a value. In scenarios where many hashing procedures might run at the same time, this creates high memory pressure on the host.
In cases where this is too much memory consumption it is advised to decrease the memory parameter and increase the time factor:
The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number. If using that amount of memory (64 MB) is not possible in some contexts then the time parameter can be increased to compensate.
So, assuming I would like to limit memory consumption to 16MB (1/4 of the recommended 64MB), it is still unclear to me how I should be adjusting the time parameter: is this supposed to be times 4 so that the product of memory and time stays the same? Or is there some other logic behind the correlation of time and memory at play?
The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number. If using that amount of memory (64 MB) is not possible in some contexts then the time parameter can be increased to compensate.
I think the key here is the phrase "to compensate", so in this context it is trying to say: to achieve similar hashing complexity to IDKey([]byte("some password"), salt, 1, 64*1024, 4, 32), you can try IDKey([]byte("some password"), salt, 4, 16*1024, 4, 32).
But if you want to decrease the hashing complexity (and the performance overhead), you can decrease the memory parameter without regard to the time parameter.
is this supposed to be times 4 so that the product of memory and time stays the same?
I don't think so. I believe the memory here means the length of the resulting hash, but the time parameter could mean "how many times the hashing result needs to be re-hashed until I get the end result".
So these two parameters are independent of each other. They just control how much "brute-force cost savings due to time-memory tradeoffs" you want to achieve.
Difficulty is roughly equal to time_cost * memory_cost (and possibly / parallelism). So if you 0.25x the memory cost, you should 4x the time cost. See also this answer.
// The time parameter specifies the number of passes over the memory and the
// memory parameter specifies the size of the memory in KiB.
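To make the compensation concrete, here is a hedged sketch using the reference Argon2 C library (the Go package implements the same algorithm, so the cost relationship carries over); the two calls below should represent roughly comparable difficulty if cost scales with time * memory, and the salt is illustrative only:

#include <argon2.h>   /* reference implementation: phc-winner-argon2 */
#include <string.h>

int main(void)
{
    const char pwd[]  = "some password";
    const char salt[] = "illustrative16B!";   /* use a random salt in real code */
    unsigned char hash[32];

    /* Recommended baseline: time=1, memory=64 MiB, parallelism=4. */
    argon2id_hash_raw(1, 64 * 1024, 4,
                      pwd, strlen(pwd), salt, sizeof(salt) - 1,
                      hash, sizeof(hash));

    /* A quarter of the memory, four times the passes: comparable overall cost. */
    argon2id_hash_raw(4, 16 * 1024, 4,
                      pwd, strlen(pwd), salt, sizeof(salt) - 1,
                      hash, sizeof(hash));

    return 0;
}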
Check out the Argon2 API itself. I'm going to cross-reference a little bit and use the argon2-cffi documentation. It looks like the Go interface uses the C FFI (foreign function interface) under the hood, so the prototype should be the same.
Parameters
time_cost (int) – Defines the amount of computation realized and therefore the execution time, given in number of iterations.
memory_cost (int) – Defines the memory usage, given in kibibytes.
parallelism (int) – Defines the number of parallel threads (changes the resulting hash value).
hash_len (int) – Length of the hash in bytes.
salt_len (int) – Length of random salt to be generated for each password.
encoding (str) – The Argon2 C library expects bytes. So if hash() or verify() are passed a unicode string, it will be encoded using this encoding.
type (Type) – Argon2 type to use. Only change for interoperability with legacy systems.
Indeed, if we look at the Go docs:
// The draft RFC recommends[2] time=1, and memory=64*1024 is a sensible number.
// If using that amount of memory (64 MB) is not possible in some contexts then
// the time parameter can be increased to compensate.
//
// The time parameter specifies the number of passes over the memory and the
// memory parameter specifies the size of the memory in KiB. For example
// memory=64*1024 sets the memory cost to ~64 MB. The number of threads can be
// adjusted to the numbers of available CPUs. The cost parameters should be
// increased as memory latency and CPU parallelism increases. Remember to get a
// good random salt.
I'm not 100% clear on the impact of the thread count, but I believe it does parallelize the hashing, and like any multithreaded job this reduces the total time taken to roughly 1/N for N cores. Apparently, you should essentially set the parallelism to the CPU count.

Indexing text file for microcontroller

I need to search for a specific record in a large file. The search will be performed on a microcontroller (ESP8266), so I'm working with limited storage and RAM.
The list looks like this:
BSSID,data1,data2
001122334455,float,float
001122334466,float,float
...
I was thinking using an index to speed up the search. The data are static, and the index will be built on a computer and then loaded onto the microcontroller.
What I've done so far is very simplistic.
I created an index on the first byte of the BSSID that points to the first and last offsets of the records with that BSSID prefix.
The performance is terrible, but the index file is very small and uses very little RAM. I thought about taking this method further and looking at the first two bytes, but the index table would become 256 times larger, resulting in a table about 1/3 the size of the data file.
This is the index with the first method:
00,0000000000,0000139984
02,0000139984,0000150388
04,0000150388,0000158812
06,0000158812,0000160900
08,0000160900,0000171160
What indexing algorithm do you suggest that I use?
EDIT: Sorry, I didn't include enough background before. I'm storing the data and index file in the flash memory of the chip. I currently have 30000 records, but this number could potentially grow until the chip's memory limit is hit. The set is indeed static once it is stored on the microcontroller, but it can be updated later with the help of a computer. The data isn't spread symmetrically between indexes. My goal is to find a good compromise between search speed, index size and RAM usage.
I'm not sure where you're stuck, but I can comment on what you've done so far.
Most of all, the way to determine the "best" method is to
define "best" for your purposes;
research indexing algorithms (basic ones have been published for over 50 years);
choose a handful to implement;
evaluate those implementations according to your definition of "best".
Keep in mind your basic resource restriction: you have limited RAM. If a method requires more RAM than you have, it doesn't work, and is therefore infinitely slower than any method that does work.
You've come close to a critical idea, however: you want your index table to expand to consume any free RAM, using that space as effectively as possible. If you can index 16 bits instead of 8 and still fit the table comfortably into your available space, then you've cut down your linear search time by roughly a factor of 256.
Indexing considerations
Don't put the ending value in each row: it's identical to the starting value in the next row. Omit that, and you save one word in each row of the table, giving you twice the table room.
Will you get better performance if you slice the file into equal parts (the same quantity of BSSIDs for each row of your table), and then store the entire starting BSSID with its record number? If your BSSIDs are heavily clumped, this might improve your overall processing, even though your table would have fewer rows. You can't use a direct index in this case; you have to search the first column to get the proper starting point, as sketched below.
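A minimal sketch of that table search, assuming (purely for illustration) a 256-slice index and a 64-bit integer representation of the 48-bit BSSID; the binary search returns the record number at which to start scanning:

#include <stdint.h>

/* One slice of the data file: the full BSSID that starts the slice and the
   record number where the slice begins. Sizes here are illustrative. */
typedef struct {
    uint64_t first_bssid;   /* 48-bit BSSID packed into a 64-bit field */
    uint32_t first_record;
} slice_t;

#define NUM_SLICES 256

/* Returns the record number of the last slice whose first BSSID is <= target,
   i.e. where the scan should start. */
uint32_t slice_start(const slice_t index[NUM_SLICES], uint64_t target)
{
    uint32_t lo = 0, hi = NUM_SLICES;

    while (hi - lo > 1)
    {
        uint32_t mid = (lo + hi) / 2;
        if (index[mid].first_bssid <= target)
            lo = mid;
        else
            hi = mid;
    }
    return index[lo].first_record;
}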
Does that move you toward a good solution?
I'm not sure how much memory you've got (I am not familiar with that MCU), but do not forget that these tables are static/constant, so they can be stored in EEPROM instead of RAM; some chips have quite a lot of EEPROM, usually far more than RAM...
Assume your file is sorted by the index. So you have (assuming 32-bit addresses) per entry:
BYTE ix, DWORD beg, DWORD end
Why not this:
struct entry { DWORD beg, end; };
entry ix0[256];
where the first BYTE is also the address into the index array. This saves one byte per entry.
Now, as Prune suggested, you can ignore the end address, as you will scan the following entries in the file anyway until you hit the correct index or an index with a different first BYTE. So you can use:
DWORD ix[256];
where you have only the start address beg.
Now we do not know how many entries you actually have, nor how many entries share the same second BYTE of the index, so we cannot make any further assumptions to improve this...
You wanted to do something like:
DWORD ix[65536];
but do not have enough memory for it... How about doing something like this instead:
const int   N   = 1024;                        // number of entries you can store
const DWORD dix = (max_index_value + 1) / N;   // range of indexes covered by one entry
const DWORD ix[N] = { ..... };                 // start address in the file for each range
so each entry ix[i] will cover all the indexes from i*dix to ((i+1)*dix)-1. So to find an index you do this:
i = ix[index / dix];
for (; i < file_size; )
{
    read entry from file at i-th position;
    update position i;
    if (file_index == index) { do your stuff; break; }
    if (file_index >  index) { index not found; break; }
}
To improve performance you can rewrite this linear scan as a binary search between the address ix[index/dix] and ix[(index/dix)+1] (or the file size for the last index)... assuming each entry in the file has the same size...
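A minimal sketch of that binary search, assuming fixed-size records; record_bssid_at() is a hypothetical helper that reads one record from flash at the given byte offset and returns its BSSID:

#include <stdint.h>

#define RECORD_SIZE 20u   /* assumed fixed size of one record in the file */

/* Hypothetical helper: reads the record starting at byte offset 'off' and
   returns its BSSID packed into a uint64_t. */
uint64_t record_bssid_at(uint32_t off);

/* Binary search between the byte offsets lo (inclusive) and hi (exclusive)
   taken from two adjacent index slots. Returns the matching record's offset,
   or UINT32_MAX if the BSSID is not present. */
uint32_t find_record(uint32_t lo, uint32_t hi, uint64_t target)
{
    uint32_t first = lo / RECORD_SIZE;
    uint32_t count = (hi - lo) / RECORD_SIZE;

    while (count > 0)
    {
        uint32_t mid = first + count / 2;
        uint64_t bssid = record_bssid_at(mid * RECORD_SIZE);

        if (bssid == target)
            return mid * RECORD_SIZE;

        if (bssid < target)
        {
            first = mid + 1;
            count -= count / 2 + 1;
        }
        else
        {
            count = count / 2;
        }
    }
    return UINT32_MAX;   /* not found */
}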

A heap manager for C/Pascal that automatically fills freed memory with zero bytes

What do you think about an option to fill freed (not actually used) pages with zero bytes? This may improve performance under Windows, and also under VMware and other virtual machine environments. For example, VMware and Hyper-V calculate a hash of each memory page and, if the contents are the same, mark the page as "shared" inside a virtual machine and between virtual machines on the same host, until the page is modified. This effectively decreases memory consumption. Windows does something similar: it handles zero pages differently, treating them as free.
We could have a heap manager that automatically fills memory with zeros when we call FreeMem/ReallocMem. As an alternative, we could have a function that zeroizes empty memory on demand, i.e. only when it is explicitly called. Of course, this function has to be thread-safe.
The drawback of filling memory with zeros is touching the memory, which might already have been paged out, thus triggering page faults. Besides that, memory store operations take time, so our program will be slower, albeit to an unknown (maybe negligible) extent.
If we manage to fill 4K pages completely with zeros, the hypervisor or Windows will explicitly mark them as zero pages. But even partial zeroizing may be beneficial, since the hypervisor may compress pages using LZ or similar algorithms to save physical memory.
I just want to know your opinion whether the benefits of filling emptied heap memory with zero bytes by the heap manager itself will outweigh the disadvantages of such a technique.
Is zeroizing worth its price when we buy reduced physical memory consumption?
When you have a page whose contents you no longer care about but you still want to keep it allocated, you can call VirtualAlloc (and variants) and pass the MEM_RESET flag.
From VirtualAlloc on MSDN:
MEM_RESET
Indicates that data in the memory range specified by lpAddress and
dwSize is no longer of interest. The pages should not be read from or
written to the paging file. However, the memory block will be used
again later, so it should not be decommitted. This value cannot be
used with any other value.
Using this value does not guarantee that
the range operated on with MEM_RESET will contain zeros. If you want
the range to contain zeros, decommit the memory and then recommit it.
This gives the best of both worlds: you don't pay the cost of zeroing the memory, and the system does not pay the cost of paging it back in. You also get to take advantage of the well-tuned memory manager, which already maintains a pool of zeroed pages.
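A minimal sketch of that call (Windows-specific, error handling elided); the block stays committed, but its contents are declared no longer interesting:

#include <windows.h>

/* Tell the memory manager we no longer care about the current contents of this
   committed block, but that we intend to reuse the address range later. */
void reset_block(void *block, SIZE_T size)
{
    /* The protection argument is required but ignored when MEM_RESET is used. */
    VirtualAlloc(block, size, MEM_RESET, PAGE_NOACCESS);
}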
Similar functionality also exists on Linux under the MADV_FREE (or MADV_DONTNEED for POSIX) flag to madvise. glibc uses this in the implementation of its heap:
/*
 * Call stack:
 *   int shrink_heap (heap_info *h, long diff)
 *   int heap_trim (heap_info *heap, size_t pad)             at arena.c:660
 *   void _int_free (mstate av, mchunkptr p, int have_lock)  at malloc.c:4097
 *   void __libc_free (void *mem)                            at malloc.c:2948
 *   void free (void *mem)
 */
static int
shrink_heap (heap_info *h, long diff)
{
  long new_size;

  new_size = (long) h->size - diff;
  /* ... snip ... */
  __madvise ((char *) h + new_size, diff, MADV_DONTNEED);
  /* ... snip ... */
  h->size = new_size;
  return 0;
}
If your heap is in user space this will never work. The kernel can only trust itself, not user space. If the kernel zeroes a page, it can treat it as zero. If user space claims it zeroed a page, the kernel would still have to verify that, and it might just as well zero the page itself. One thing user space can do is discard pages, which marks them as "don't care"; the kernel can then treat them as zero. But manually zeroing pages in user space is futile.

Does mmap allocate heap memory contiguously?

Provided that:
The size I request is a multiple of the page size
The start address I request is the size + start address of the last allocation
If I always follow these rules when using mmap to allocate memory on the heap, will the addresses returned be contiguous? Or could there be gaps between them?
You can get the behavior you want with the MAP_FIXED flag. Unfortunately for your goal, it's not universally supported, so you'd want to check the return value to ensure that it gave you the allocation you requested. For good portability, you'd need a backup plan for when the call fails (mmap returns MAP_FAILED on failure).
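A sketch of one way to do that check, as an assumption rather than the only approach: instead of forcing the mapping with MAP_FIXED (which on Linux can silently replace existing mappings), pass the desired address as a plain hint and verify what came back:

#include <stddef.h>
#include <sys/mman.h>

/* Try to map 'len' bytes exactly at 'hint' (e.g. the end of the previous
   allocation). Returns NULL if the kernel did not honor the hint, so the
   caller can fall back to a non-contiguous strategy. */
static void *map_at(void *hint, size_t len)
{
    void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (p != hint) {   /* hint ignored: mapping is not contiguous */
        munmap(p, len);
        return NULL;
    }
    return p;
}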
Quick answer: not necessarily. There's a good chance it will "almost always work" in both limited and extensive testing on a variety of machines, but it's definitely not good practice. The MAP_FIXED flag is supported on most flavors of Linux, but it is also buggy in my experience. Avoid it.
Better in your case is to simply allocate everything you need at once, and then assign pointers manually to each sub-section of the mapping:
int LengthOf_FirstThing = 0x18000;
int LengthOf_SecondThing = 0x10100;
int LengthOf_ThirdThing = 0x20000;
int _pagesize = getpagesize();
int _pagemask = _pagesize - 1;
size_t sizeOfEverything = LengthOf_FirstThing + LengthOf_SecondThing + LengthOf_ThirdThing;
sizeOfEverything = (sizeOfEverything + _pagemask) & ~(_pagemask);
int8_t* result = (int8_t*)mmap(nullptr, sizeOfEverything, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
int8_t* myFirstThing = result;
int8_t* mySecondThing = myFirstThing + LengthOf_FirstThing;
int8_t* myThirdThing = mySecondThing + LengthOf_SecondThing;
An advantage of this approach is also that the individual things you're mapping don't have to be strictly aligned to the page size. And most importantly, it assures fully contiguous memory.
Longer answer:
Implementations of mmap() can freely disregard the 'hint' address entirely and so you should never expect the address to be honored. This may be more common than expected, because some implementations may not actually support pagesize granularity for new mmap()'s. They may limit valid starting maps to 16k or 64k boundaries to help reduce the overhead needed to manage very large virtual address spaces. Such an implementation would always disregard an mmap() hint that isn't aligned to such boundary.
Additionally, mmap() does not allocate memory from the heap at all. The heap is an area of memory created/reserved by the C runtime libraries (glibc on *nix) when a process is created. malloc() and new/delete are typically the only functions that pull from the heap, along with any libraries that may use malloc/new internally. The heap itself is typically created and managed by calls to mmap() internally.
I think this is not specified but is a so-called "implementation detail", i.e. you should not rely on one behaviour or the other, but assume that the pointer is opaque and not be concerned with its exact value.
(That said, there can be a place and time for hacks. In that case you need to find out exactly how your OS behaves.)
