Difference between Arena::CreateMessage and Arena::CreateMaybeMessage - protocol-buffers

When using Protocol Buffers with an Arena, what is the difference between these two functions:
google::protobuf::Arena::CreateMaybeMessage<LPD::MyObj>();
And
google::protobuf::Arena::CreateMessage<LPD::MyObj>();
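For reference, here is a minimal sketch of how Arena::CreateMessage is typically used in application code (the generated header name and the set_name field are hypothetical, standing in for whatever LPD::MyObj actually defines):

#include <google/protobuf/arena.h>
#include "my_obj.pb.h"  // hypothetical generated header declaring LPD::MyObj

void Example() {
  google::protobuf::Arena arena;
  // The message is allocated on the arena and freed when `arena` is destroyed;
  // do not delete it manually.
  LPD::MyObj* obj = google::protobuf::Arena::CreateMessage<LPD::MyObj>(&arena);
  obj->set_name("example");  // hypothetical field
}

As far as I can tell, CreateMaybeMessage mainly shows up in protoc-generated code rather than in code you write yourself.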


Is there any performance difference between Buffer, StructuredBuffer and ByteAddressBuffer (also their RW variants)?

I tried looking this up on various websites, including the MS Docs on DirectX 11 compute shader types, but I haven't found anything mentioning performance differences between these buffer types.
Are they exactly the same performance-wise?
If not, what is the optimal way of using each in various scenarios?
Performance will ultimately differ depending on the GPU/driver combination.
There is a project here that benchmarks access for those (the linear/random cases are the most useful).
Constant access is also useful if you want to compare cbuffer access versus other buffer access (on NVIDIA it is common to perform a buffer-to-cbuffer GPU copy before running an expensive shader, for example).
https://github.com/sebbbi/perftest
Note also that different buffers (in D3D11 land) have different limitations, so the performance benefit can be hindered by those.
Structured buffers cannot be bound as vertex/index buffers, so if you want to use them that way you need to perform an extra copy. (For vertex buffers you can just fetch from the vertex id, there is no penalty for this; index buffers can be read but are a bit more problematic.)
Byte address buffers allow you to store anything in a non-structured way (essentially a raw pointer). Reads are still aligned to 4 bytes (int size). Converting to float on reads needs an asfloat, and from float on writes needs an asuint, but in most drivers this is a no-op, so there is no performance impact.
Byte address (and typed) buffers can be used as index or vertex buffers. No copy necessary.
Typed buffers do not support Interlocked operations too well; in that case you need to use a Structured/ByteAddress buffer (note that you can do the interlocked operations on a small separate buffer and perform the reads/writes on a typed buffer if you want).
Byte address buffers can be more annoying to use if you have an array of elements of the same type (even a float4x4 takes a decent amount of code to fetch versus a StructuredBuffer<float4x4>).
Structured buffers allow you to bind "partial views". So even if your buffer has, let's say, 2048 floats, you can bind a range from 4-456 (it also allows you to bind 500-600 as write at the same time, since they are not overlapping).
For all buffers, if you use them as read-only, don't bind them as RW; this generally has a decent penalty.
To add to the accepted answer,
There is also a performance penalty if elements in the StructuredBuffer are not aligned to a 128-bit stride [sizeof(float4)]. If they are not, a single float4, for example, could span across cache lines, causing up to a 5% perf penalty.
An example of how to solve this is to use padding to re-align elements:
struct Foo
{
    float4 Position;
    float  Radius;
    float  pad0;
    float  pad1;
    float  pad2;
    float4 Rotation;
};
NVIDIA post with more detail

ACP and DMA: how do they work?

I'm using an ARM A53 platform; it has an ACP component, and I'm trying to use DMA to transfer data through the ACP.
From the ARM TRM document, if I understand it correctly, the DMA transfer data size is limited to 64 bytes per DMA transfer when using the ACP.
If so, does this limitation make DMA impractical? It seems pointless to configure a DMA descriptor only to transfer 64 bytes at a time.
Or should the DMA automatically divide its transfer length into many ACP-size-limited (64 byte) packets, without any software intervention?
I need an expert to explain how the ACP and DMA work together.
Somewhere in the interface from the DMA to the ACP's AXI port, something should automatically divide the transfer length as needed into transfers of an appropriate length. For the Cortex-A53 ACP, AXI transfers are limited to 64 B (perhaps intentionally one cache line).
From https://developer.arm.com/documentation/ddi0500/e/level-2-memory-system/acp/transfer-size-support :
x byte INCR request characterized by: (some list of limitations)
Note the use of INCR instead of FIXED. INCR will automatically increment the address according to the size of the transfer, while FIXED will not. This makes it simple for the peripheral to break a large transfer into a series of multiple INCR transfers.
However, do note that on the Cortex-A53, the transfer size (x in the quote) is fixed at 16 or 64 byte aligned transfers. If the DMA sends an inappropriately sized transfer (because it is misconfigured or the correct size is unsupported), the AXI port will emit a SLVERR. If the buffer is not appropriately aligned, I think this also causes a SLVERR.
Lastly, the on-chip network routing must support connecting the DMA to the ACP at chip design time. In my experience this is more commonly done for network accelerators and FPGA fabric glue, but tends to be less often connected for low speed peripherals like UART/SPI/I2C.

What's the reason behind ZigZag encoding in Protocol Buffers and Avro?

ZigZag requires a lot of overhead to write/read numbers. Actually I was stunned to see that it doesn't just write int/long values as they are, but does a lot of additional scrambling. There's even a loop involved:
https://github.com/mardambey/mypipe/blob/master/avro/lang/java/avro/src/main/java/org/apache/avro/io/DirectBinaryEncoder.java#L90
I can't find in the Protocol Buffers docs or the Avro docs, nor work out myself, what the advantage of scrambling numbers like that is. Why is it better to have positive and negative numbers alternate after encoding?
Why aren't they just written in little-endian, big-endian, or network order, which would only require reading them into memory and possibly reversing byte order? What do we buy by paying with performance?
It is a variable-length 7-bit encoding. Each byte's high bit is a continuation flag: it is set to 1 on every byte except the last, which has it at 0. That is how the decoder can tell how many bytes were used to encode the value. Byte order is always little-endian, regardless of the machine architecture.
It is an encoding trick that permits writing as few bytes as needed to encode the value. So an 8-byte long with a value between -64 and 63 takes only one byte. Which is common; the full range provided by long is very rarely needed in practice.
Packing the data tightly without the overhead of a gzip-style compression method was the design goal. The same scheme is also used in the .NET Framework. The processor overhead needed to encode/decode the value is inconsequential: already much lower than a compression scheme, it is a very small fraction of the I/O cost.
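To make the two steps concrete (the ZigZag mapping followed by the 7-bit varint bytes), here is a small self-contained C++ sketch; the function names are mine, not part of the protobuf or Avro APIs:

#include <cstdint>
#include <vector>

// ZigZag: map signed to unsigned so that small magnitudes, positive or
// negative, become small unsigned values: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
uint64_t ZigZagEncode(int64_t n) {
  // arithmetic right shift spreads the sign bit, as in the protobuf sources
  return (static_cast<uint64_t>(n) << 1) ^ static_cast<uint64_t>(n >> 63);
}
int64_t ZigZagDecode(uint64_t z) {
  return static_cast<int64_t>(z >> 1) ^ -static_cast<int64_t>(z & 1);
}

// Varint: 7 payload bits per byte, high bit set on every byte except the last,
// least significant group first (little-endian, as the answer says).
void WriteVarint(uint64_t v, std::vector<uint8_t>& out) {
  while (v >= 0x80) {
    out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

With this, every value in -64..63 zigzags into 0..127 and therefore fits in a single varint byte, which is the one-byte case mentioned above.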

AXI4 (Lite) Narrow Burst vs. Unaligned Burst Clarification/Compatibility

I'm currently writing an AXI4 master that is supposed to support AXI4 Lite (AXI4L) as well.
My AXI4 master is receiving data from a 16-bit interface. This is on a Xilinx Spartan 6 FPGA and I plan on using the EDK AXI4 Interconnect IP, which has a minimum WDATA width of 32 bits.
At first I wanted to use narrow burst, i.e. AWSIZE = x"01" (2 bytes in transfer). However, I found that Xilinx' AXI Reference Guide UG761 states "narrow bursts [are] supported but [...] not recommended." Unaligned transactions are supposed to be supported.
This had me thinking. Say I start an unaligned burst:
AWLEN = x"01" (2 beats)
AWSIZE = x"02" (4 bytes in transfer")
And do the following:
AX (32-bit word #0: send hi16)
XB (32-bit word #1: send lo16)
Where A, B are my 16-bit words that start off at an unaligned (2-byte aligned) address. X means WSTRB is deasserted for the indicated 16 bits.
Is this supported, or does this fall under the category "narrow burst" even though AWSIZE = x"02" (4 bytes in transfer) as opposed to AWSIZE = x"01" (2 bytes in transfer)?
Now, if this was just for AXI4, I would probably not care as much about this use case, because AXI4 peripherals are required to use the WSTRB signals. However, the AXI Reference Guide UG761 states "[AXI4L] Slaves interface can elect to ignore WSTRB (assume all bytes valid)."
I read here that many (but not all; and there is no list?) Xilinx AXI4L peripherals do elect to ignore WSTRB.
Does this mean that I'm essentially barred from doing narrow burst ("not recommended") as well as unaligned bursts ("WSTRB can be ignored") or is there an easy way to unload some of the implementation work from my master into the interconnect, guaranteeing proper system behavior when accessing AXI4L peripherals?
Your example is not a narrow burst, and should work.
The reason narrow burst is not recommended is that it gives sub-optimal performance. Both narrow bursts and data realignment cost area and are not recommended IMHO. However, DRE (data realignment) has minimal bandwidth cost, while narrow burst has a real one. If your AXI port is 100 MHz and 32 bits wide, you have 3.2 Gbit/s maximum throughput; if you use narrow bursts of 16 bits 50% of the time, then your maximum throughput is reduced to 2.4 Gbit/s (32 bits x 50 MHz + 16 bits x 50 MHz). Also, I'm not sure AXI-Lite supports narrow bursts or data realignment.
Your example has 2 major flaws. First, it requires 3 data beats to transfer 32 bits, which is worse than a narrow burst (I don't think AXI is smart enough to cancel the last beat when WSTRB is all 0). Second, you can't burst more than two 16-bit words at a time, which will hurt your AXI infrastructure's performance if you have a lot of data to transfer.
The best way to deal with this is to concatenate the 16-bit words together to form 32 bits in your block. Then you buffer these 32-bit words and burst them when you have enough. This is the high-performance AXI way to do it.
However, if you receive data as 16 bits, it seems you would be better off using AXI-Stream, which supports 16 bits but doesn't have the notion of addresses. You can map an AXI-Stream to AXI4 using Xilinx's IP cores. Either AXI-Datamover or AXI-DMA can do that. Both do the same thing (in fact, AXI-DMA includes a datamover), but AXI-DMA is controlled through an AXI-Lite interface while the Datamover is controlled through additional AXI-Streams.
As a final note, the Xilinx cores never require narrow bursts or DRE. If you need DRE in AXI-DMA, it's done by the AXI-DMA core and not the AXI Interconnect. Also, these cores are clear-source, so you can easily check out how they operate.

Implementing Stack and Queue with O(1/B)

This is an exercise from this text book (page 77):
Exercise 48 (External memory stacks and queues). Design a stack data structure that needs O(1/B) I/Os per operation in the I/O model from Section 2.2. It suffices to keep two blocks in internal memory. What can happen in a naive implementation with only one block in memory? Adapt your data structure to implement FIFOs, again using two blocks of internal buffer memory. Implement deques using four buffer blocks.
I don't want the code. Can anyone explain to me what the question is asking, and how I can do operations in O(1/B) I/Os?
As the book goes, quoting Section 2.2 on page 27:
External Memory: <...> There are special I/O operations that transfer B consecutive words between slow and fast memory. For example, the external memory could be a hard disk, M would then be the main memory size and B would be a block size that is a good compromise between low latency and high bandwidth. On current technology, M = 1 GByte and B = 1 MByte are realistic values. One I/O step would then be around 10 ms, which is 10^7 clock cycles of a 1 GHz machine. With another setting of the parameters M and B, we could model the smaller access time difference between a hardware cache and main memory.
So, doing things in O(1/B) most likely means, in other words, using only a constant number of these block I/O operations for every B stack/queue operations, i.e. amortizing one block transfer over roughly B pushes or pops.
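A minimal sketch of what the two-block stack could look like, assuming B is the block size in words and that readBlock/writeBlock are hypothetical stand-ins for the single block transfers of the I/O model (not real library calls):

#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t B = 1 << 20;  // block size in words, an example value

struct ExternalStack {
  std::vector<int> buf = std::vector<int>(2 * B);  // two blocks of internal memory
  std::size_t top = 0;           // number of elements currently in buf
  std::size_t blocksOnDisk = 0;  // full blocks already written to external memory

  void push(int x) {
    if (top == 2 * B) {  // both blocks full: write out the older one (1 I/O)
      writeBlock(blocksOnDisk++, &buf[0]);
      std::copy(buf.begin() + B, buf.end(), buf.begin());
      top = B;
    }
    buf[top++] = x;
  }

  int pop() {            // precondition: the stack is not empty
    if (top == 0) {      // buffer empty: read one block back in (1 I/O)
      readBlock(--blocksOnDisk, &buf[0]);
      top = B;
    }
    return buf[--top];
  }

  // Hypothetical stand-ins for the block transfers of the I/O model.
  void writeBlock(std::size_t /*index*/, const int* /*data*/) { /* one I/O */ }
  void readBlock(std::size_t /*index*/, int* /*data*/) { /* one I/O */ }
};

After any I/O, top is exactly B, so at least B further pushes are needed before the next write-out and at least B further pops before the next read-in; that gives the amortized O(1/B) I/Os per operation. With only one block in memory, a naive implementation that flushes whenever the block fills and refills whenever it empties can be forced into an I/O on every single operation by alternating push/pop at a block boundary, which is the pathological case the exercise hints at.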
