Why does the pseudocode of _mm_insert_ps calculate %8? - intrinsics

In the Intel Intrinsics Guide, the pseudocode for the operation of _mm_insert_ps includes the following:
FOR j := 0 to 3
    i := j*32
    IF imm8[j%8]
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := tmp2[i+31:i]
    FI
ENDFOR
The access into imm8 confuses me: IF imm8[j%8]. As j is within the range 0..3, the modulo 8 part doesn't seem to do anything. Does this maybe signal a conversion that I am not aware of? Or is % not "modulo" in this case?

Seems like a pointless modulo.
Intel's documentation for the corresponding asm instruction, insertps, doesn't use any % modulo operations in the pseudocode. It uses ZMASK ←imm8[3:0] and then basically unrolls that part of the pseudocode where this uses a loop, with checks like
IF (ZMASK[2] = 1) THEN DEST[95:64]←00000000H
ELSE DEST[95:64]←TMP2[95:64]
This is just showing how the low 4 bits of the immediate perform zero-masking on the 4 dword elements of the final result, after the insert of an element from another vector, or a scalar in memory.
(There's no intrinsic for insert directly from memory; you'd need an intrinsic for movss and then hope the compiler folds that load into a memory operand for insertps. With a memory source, imm8[7:6] are ignored, just taking that scalar dword as the element to insert (that's the ELSE COUNT_S←0 in the asm pseudocode), but then everything else works the same, including the zero-masking you're asking about.)
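As a rough illustration of the zero-masking described above, here is a small Python model of the register-source form of insertps. It is only a sketch: the function name insertps_model and the use of plain lists for the four dword elements are assumptions made for this example, not anything from Intel's documentation.

```python
def insertps_model(a, b, imm8):
    """Model insertps (register source) on two 4-element lists."""
    count_s = (imm8 >> 6) & 0x3   # imm8[7:6]: which element of b to read
    count_d = (imm8 >> 4) & 0x3   # imm8[5:4]: which element of a to overwrite
    zmask = imm8 & 0xF            # imm8[3:0]: one zeroing bit per result element

    tmp2 = list(a)
    tmp2[count_d] = b[count_s]    # the insert step

    # The loop from the pseudocode: j only runs over 0..3, so j % 8 == j.
    return [0.0 if (zmask >> j) & 1 else tmp2[j] for j in range(4)]


result = insertps_model([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], 0b10010101)
```

With this immediate, element 2 of b is inserted at position 1, and result elements 0 and 2 are then zeroed by the mask.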


GLSL: Why is a random write in local array significantly slower than a looped write?

Let's look at a simplified example function in GLSL:
void foo() {
    vec2 localData[16];
    // ...
    int i = ... // somehow dependent on dynamic data (not known at compile time)
    localData[i] = x; // THE IMPORTANT LINE
}
It writes some value x to a dynamically determined index in a local array.
Now, replacing the line localData[i] = x; with
for( int j = 0; j < 16; ++j )
if( i == j )
localData[j] = x;
makes the code significantly faster. In several tested examples (different shaders) the execution time almost halved, even though there was much more going on than this write.
For example: in an order-independent transparency shader which, among other things, fetches 16 texels, the timings are 39ms with the direct write and 23ms with the looped write. Nothing else changed!
The test hardware is a GTX 1080. The assembly returned by glGetProgramBinary is still too high-level. It contains one line in the first case and a loop+if surrounding an identical line in the second.
Why does this performance issue happen?
Is this true for all vendors?
Guess: localData is stored in 8 vec4 registers (the assembly does not say anything about that). Further, I assume that registers cannot be addressed with an index. If both are true, then the final binary must use some branch construct. The loop variant might be unrolled and result in a switch-like pattern which is faster. But is that common for all vendors? Why can't the compiler use whatever results from the for loop as the default for such writes?
Further experiments have shown that the reason is the use of a different memory type for the array. The (unrolled) looped variant uses registers, while the random access variant switches to local memory.
Local memory is usually placed in global memory, but is private to each thread. It is likely that accesses to this local array are cached (L2?).
The experiments to verify this reasoning were the following:
Manual versions of unrolled loops (measured in an insertion sort with 16 elements over 1M pixels):
Base line: localData[i] = x 33ms
For loop: for j + if i=j 16.8ms
Switch: switch(i) { case 0: localData[0] ...: 16.92ms
If else tree (splitting in halves): 16.92ms
If list (plain manual unrolled): 16.8ms
=> All kinds of branch constructs result in more or less the same timings. So it is not a bad branching behavior as initially guessed.
Multiple vs. one vs no random access (32 element insertion sort)
2x localData[i] = x 47ms
1x localData[i] = x 45ms
0x localData[i] = x 16ms
=> As long as there is at least one random access, the performance will be bad. This means there is a global decision changing the behavior of localData, most likely the use of a different memory type. Using more than one random access does not make things much worse, because of caching.

Increment or decrement in boundaries

I'll make examples in Python, since I use Python, but the question is not about Python.
Let's say I want to increment a variable by a specific value so that it stays within given boundaries.
So for increment and decrement I have these two functions:
def up (a, s, Bmax):
    r = a + s
    if r > Bmax : return Bmax
    else : return r
def down (a, s, Bmin):
    r = a - s
    if r < Bmin : return Bmin
    else : return r
Note: it is assumed that the initial value of the variable "a" is already within the boundaries (min <= a <= max), so additional initial checking does not belong in these functions. What makes me curious is that almost every program I write needs these functions.
The question is:
Are these classified as typical operations, and do they have specific names?
If yes, is there some corresponding intrinsic processor functionality, so that they are optimised by some compilers?
The reason I ask is pure curiosity; of course I cannot optimise this in Python, and I know little about CPU architecture.
To be more specific, on a lower level for an unsigned 8-bit integer the increment would look I suppose like this:
def up (a, s, Bmax):
    counter = 0
    while True:
        if counter == s : break
        if a == Bmax : break
        if a == 255 : break
        a += 1
        counter += 1
    return a
I know the latter would not make any sense in Python so treat it as my naive attempt to imagine low level code which adds the value in place. There are some nuances, e.g. signed, unsigned, but I was interested merely about unsigned integers since I came across it more often.
It is called saturation arithmetic. It has native support on DSPs and GPUs (not a random pair: both deal with signals).
For example, the NVIDIA PTX ISA lets the programmer choose whether an addition is saturated or not:
add.type d, a, b;
add{.sat}.s32 d, a, b; // .sat applies only to .s32
limits result to MININT..MAXINT (no overflow) for the size of the operation.
The TI TMS320C64x/C64x+ DSP has support for
Dual 16-bit saturated arithmetic operations
and instructions like SADD to perform a saturated add, and even a whole register (the Saturation Status Register) dedicated to collecting precise information about saturation while executing a sequence of instructions.
Even the mainstream x86 has support for saturation with instructions like vpaddsb and similar (including conversions).
Another example is the GLSL clamp function, used to make sure color values are not outside the range [0, 1].
In general if the architecture must be optimized for signal/media processing it has support for saturation arithmetic.
Much rarer is support for saturation with arbitrary bounds, e.g. asymmetrical bounds, non-power-of-two bounds, non-word-sized bounds.
However, saturation can be implemented easily as min(max(v, b), B), where v is the result of the unsaturated (and not overflowed) operation, b the lower bound and B the upper bound.
So any architecture that supports finding the minimum and the maximum without a branch can implement any form of saturation efficiently.
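The min/max formulation can be sketched in Python, the language of the question. The helper names clamp, add_sat_u8 and add_wrap_u8 are made up for this illustration; up and down mirror the functions from the question, under the same assumption that the input is already within bounds.

```python
def clamp(v, lo, hi):
    """Saturate v to the closed range [lo, hi]."""
    return min(max(v, lo), hi)


def up(a, s, bmax):
    """Increment a by s, saturating at bmax (a is assumed to be <= bmax)."""
    return min(a + s, bmax)


def down(a, s, bmin):
    """Decrement a by s, saturating at bmin (a is assumed to be >= bmin)."""
    return max(a - s, bmin)


def add_sat_u8(a, b):
    """Unsigned 8-bit saturating add: the result sticks at 255."""
    return min(a + b, 255)


def add_wrap_u8(a, b):
    """Unsigned 8-bit wrapping add, for comparison."""
    return (a + b) & 0xFF
```

Python integers do not overflow, so the intermediate sum is always the "unsaturated and not overflowed" value the formula requires; in a fixed-width language the sum would have to be computed in a wider type first.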
See also this question for a more real example of how saturated addition is implemented.
As a side note the default behavior is wrap around: for 8-bit quantities the sum 255 + 1 equals 0 (i.e. operations are modulo 2^8).

Strange pointer arithmetic

I came across some very strange behaviour of pointer arithmetic. I am developing a program to access an SD card from an LPC2148 using the ARM GNU toolchain (on Linux). A sector of my SD card contains data (in hex) like this (checked with the Linux "xxd" command):
fe 2a 01 34 21 45 aa 35 90 75 52 78
When printing individual bytes, the output is correct.
char *ch = buffer; /* char buffer[512]; */
for(i=0; i<12; i++)
debug("%x ", *ch++);
Here the debug function sends its output over the UART.
However, pointer arithmetic, especially adding a number which is not a multiple of 4, gives very strange results.
uint32_t *p; // uint32_t is typedef to unsigned long.
p = (uint32_t*)((char*)buffer + 0);
debug("%x ", *p); // prints 34012afe // correct
p = (uint32_t*)((char*)buffer + 4);
debug("%x ", *p); // prints 35aa4521 // correct
p = (uint32_t*)((char*)buffer + 2);
debug("%x ", *p); // prints 0134fe2a // TOO STRANGE??
Am I choosing a wrong compiler option? Please help.
I tried the optimization options -O0 and -Os, but there was no change.
I could think of little/big endian, but here I am getting unexpected data (from the previous bytes) and no order reversal.
For this to work, your CPU architecture must support unaligned load and store operations.
To the best of my knowledge, it doesn't (and I've been using STM32, which is based on an ARM Cortex core).
If you try to read a uint32_t value from an address which is not divisible by the size of uint32_t (i.e. not divisible by 4), then in the "good" case you will just get the wrong output.
I'm not sure what's the address of your buffer, but at least one of the three uint32_t read attempts that you describe in your question, requires the processor to perform an unaligned load operation.
On STM32, you would get a memory-access violation (resulting in a hard-fault exception).
The data-sheet should provide a description of your processor's expected behavior.
Even if your processor does support unaligned load and store operations, you should try to avoid using them, as it might affect the overall running time (in comparison with "normal" load and store operations).
So in either case, you should make sure that whenever you perform a memory access (read or write) operation of size N, the target address is divisible by N. For example:
uint8_t x = *(uint8_t*)y; // 'y' must point to a memory address divisible by 1
uint16_t x = *(uint16_t*)y; // 'y' must point to a memory address divisible by 2
uint32_t x = *(uint32_t*)y; // 'y' must point to a memory address divisible by 4
uint64_t x = *(uint64_t*)y; // 'y' must point to a memory address divisible by 8
In order to ensure this with your data structures, always define them so that every field x is located at an offset which is divisible by sizeof(x). For example:
typedef struct
{
    uint16_t a; // offset 0, divisible by sizeof(uint16_t), which is 2
    uint8_t  b; // offset 2, divisible by sizeof(uint8_t), which is 1
    uint8_t  c; // offset 3, divisible by sizeof(uint8_t), which is 1
    uint32_t d; // offset 4, divisible by sizeof(uint32_t), which is 4
    uint64_t e; // offset 8, divisible by sizeof(uint64_t), which is 8
}
myStruct_t;
Please note, that this does not guarantee that your data-structure is "safe", and you still have to make sure that every myStruct_t* variable that you are using, is pointing to a memory address divisible by the size of the largest field (in the example above, 8).
There are two basic rules that you need to follow:
Every instance of your structure must be located at a memory address which is divisible by the size of the largest field in the structure.
Each field in your structure must be located at an offset (within the structure) which is divisible by the size of that field itself.
Rule #1 may be violated if the CPU architecture supports unaligned load and store operations. Nevertheless, such operations are usually less efficient (requiring the compiler to add NOPs "in between"). Ideally, one should strive to follow rule #1 even if the compiler does support unaligned operations, and let the compiler know that the data is well aligned (using a dedicated #pragma), in order to allow the compiler to use aligned operations where possible.
Rule #2 may be violated if the compiler automatically generates the required padding. This, of course, changes the size of each instance of the structure. It is advisable to always use explicit padding (instead of relying on the current compiler, which may be replaced at some later point in time).
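One way to check rule #2 without target hardware is to inspect a layout with Python's ctypes module, which follows the host C compiler's alignment rules. This is only a sketch: the class and field names are invented for the example, and the offsets reflect the host ABI, which may differ from the embedded target's.

```python
import ctypes


class MyStruct(ctypes.Structure):
    # Same field sizes and order as the example structure: 16, 8, 8, 32, 64 bits.
    _fields_ = [
        ("f16", ctypes.c_uint16),
        ("f8a", ctypes.c_uint8),
        ("f8b", ctypes.c_uint8),
        ("f32", ctypes.c_uint32),
        ("f64", ctypes.c_uint64),
    ]


class Padded(ctypes.Structure):
    # A layout that breaks rule #2 unless padding is inserted after f8.
    _fields_ = [("f8", ctypes.c_uint8), ("f32", ctypes.c_uint32)]


def field_offsets(struct_type):
    """Return {field name: byte offset} for a ctypes Structure class."""
    return {name: getattr(struct_type, name).offset
            for name, _ in struct_type._fields_}


def obeys_rule_2(struct_type):
    """True if every field sits at an offset divisible by its own size."""
    return all(getattr(struct_type, name).offset % getattr(struct_type, name).size == 0
               for name, _ in struct_type._fields_)
```

For Padded, the host ABI silently inserts three padding bytes before f32, which is the automatic padding that rule #2 warns about relying on.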
LDR is the ARM instruction to load data. You have lied to the compiler by claiming that the pointer refers to a 32-bit value; it is not aligned properly, and you pay the price. Here is the LDR documentation:
If the address is not word-aligned, the loaded value is rotated right by 8 times the value of bits [1:0].
See: 4.2.1. LDR and STR, words and unsigned bytes, especially the section Address alignment for word transfers.
Basically your code is like,
uint32_t v = *(uint32_t*)((char*)buffer + 0); // aligned load of the containing word
v = (v>>16)|(v<<16); // rotate right by 8 * 2 bits
debug("%x ", v); // the rotated word, not the four bytes at offset 2
but encoded as one instruction on the ARM. This behavior depends on the ARM CPU type and possibly on co-processor settings. It is also highly non-portable code.
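The rule quoted from the LDR documentation can be modelled in a few lines of Python. This is a sketch of the documented rotate behaviour only, assuming a little-endian memory image whose first byte sits at a word-aligned address; the names ror32 and ldr_rotating are made up for the illustration, and real cores may instead fault or perform a true unaligned access.

```python
def ror32(value, bits):
    """Rotate a 32-bit value right by the given number of bits."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & 0xFFFFFFFF


def ldr_rotating(memory, address):
    """Model a word load that ignores address bits [1:0] and rotates instead."""
    aligned = address & ~3                       # load the containing aligned word
    word = int.from_bytes(memory[aligned:aligned + 4], "little")
    return ror32(word, 8 * (address & 3))        # rotate right by 8 * bits [1:0]


# First bytes of the sector shown in the question.
sector = bytes.fromhex("fe2a01342145aa3590755278")
aligned_word = ldr_rotating(sector, 0)
rotated_word = ldr_rotating(sector, 2)
```

At offset 2 the model never touches bytes 4 and 5; it only rearranges the bytes of the first word, which is why data from "previous bytes" shows up in the result.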
It's called "undefined behavior". Your code is casting a value which is not a valid unsigned long * into an unsigned long *. The semantics of that operation are undefined behavior, which means pretty much anything can happen.
In this case, the reason two of your examples behaved as you expected is that you got lucky and buffer happened to be word-aligned. Your third example was not as lucky (if it was, the other two would not have been), so you ended up with a pointer with extra garbage in the 2 least significant bits. Depending on the version of ARM you are using, that could result in an unaligned read (which it appears is what you were hoping for), or it could result in an aligned read (using the most significant 30 bits) and a rotation (word rotated by the number of bytes indicated in the least significant 2 bits). It looks pretty clear that the latter is what happened in your 3rd example.
Anyway, technically, all 3 of your example outputs are correct. It would also be correct for the program to crash on all 3 of them.
Basically, don't do that.
A safer alternative is to write the bytes into a uint32_t. Something like:
uint32_t w;
memcpy(&w, buffer, 4);
debug("%x ", w);
memcpy(&w, buffer+4, 4);
debug("%x ", w);
memcpy(&w, buffer+2, 4);
debug("%x ", w);
Of course, that's still assuming sizeof(uint32_t) == 4 && CHAR_BIT == 8, but that's a much safer assumption. (I.e., it should work on pretty much any machine with 8-bit bytes.)
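The effect of the byte-copy approach can be checked off-target with Python's struct module, which assembles a value from individual bytes regardless of offset. This is a sketch that assumes the data is little-endian, as on the LPC2148; read_u32_le is a name invented for the example.

```python
import struct


def read_u32_le(buf, offset):
    """Assemble a little-endian 32-bit value from 4 bytes at any offset."""
    return struct.unpack_from("<I", buf, offset)[0]


# First bytes of the sector shown in the question.
sector = bytes.fromhex("fe2a01342145aa3590755278")
words = [read_u32_le(sector, off) for off in (0, 4, 2)]
```

Offsets 0 and 4 give the same values the question printed for the aligned cases, while offset 2 yields the four bytes that actually start at that position.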

Can this Delphi 6 bitmap modification code be sped up with SIMD or another approach?

I have a Delphi 6 application that modifies bitmaps in real time. Currently I am using the code shown below to do quickie brightness boost and contrast changes. If the operation were just an addition or just a multiplication, I could see how SIMD could be used, but since both an addition and a multiplication are involved, and since there is also the Trunc() operation to restrict it to the range of a Byte, I'm not sure if SIMD could be used here. Here are my questions:
Can SIMD be used with this code and do you know of a good code sample I could work from? What kind of a speed boost could I expect?
Would the (potential) padding of the scan lines be a problem?
Any general optimization tips on speeding up the code?
// A fast version of this function would be to only allow range reductions
// as a power of 2 and then use shl operations instead of divisions.
procedure doBrightnessAndContrast(var clip: tbitmap; compressionRatio: double; shiftValue: Byte);
var
    p0: PByte;
    x,y: Integer;
begin
    // Assumes 3 bytes per pixel (24-bit bitmap).
    for y := 0 to clip.Height-1 do
    begin
        p0 := clip.scanline[y];
        // Can't just do the whole buffer as a big block of bytes since the
        // individual scan lines may be padded for CPU alignment.
        for x := 0 to clip.Width - 1 do
        begin
            // Red
            p0^ := IntToByte(Trunc(p0^ * compressionRatio) + shiftValue);
            Inc(p0);
            // Green
            p0^ := IntToByte(Trunc(p0^ * compressionRatio) + shiftValue);
            Inc(p0);
            // Blue
            p0^ := IntToByte(Trunc(p0^ * compressionRatio) + shiftValue);
            Inc(p0);
        end;
    end;
end;
Sure, SSE or MMX is possible.
In your case, however, you may get almost the same speed improvement if you precompute a 256-entry table using your equations.
Then replace all computations with a simple table lookup. My bet is that on modern processors this will give nearly the same speed as MMX/SSE.
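The table idea can be sketched in Python to show the shape of the approach before porting it to Delphi. The sketch assumes that IntToByte clamps its argument to 0..255, which the question does not show; build_lut and apply_lut are names made up for this example.

```python
def build_lut(compression_ratio, shift_value):
    """Precompute the output byte for each of the 256 possible input bytes."""
    return bytes(
        min(255, max(0, int(v * compression_ratio) + shift_value))  # int() truncates like Trunc()
        for v in range(256)
    )


def apply_lut(pixels, lut):
    """Map every byte of a scan line through the table."""
    return pixels.translate(lut)


lut = build_lut(0.5, 10)
scan_line = bytes([0, 100, 255])
adjusted = apply_lut(scan_line, lut)
```

Because the same mapping applies to every channel, the table is built once per (compressionRatio, shiftValue) pair, and the per-pixel work shrinks to one indexed read per byte with no floating-point math.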

linear interpolation on 8bit microcontroller

I need to do a linear interpolation over time between two values on an 8 bit PIC microcontroller (Specifically 16F627A but that shouldn't matter) using PIC assembly language. Although I'm looking for an algorithm here as much as actual code.
I need to take an 8 bit starting value, an 8 bit ending value and a position between the two (Currently represented as an 8 bit number 0-255 where 0 means the output should be the starting value and 255 means it should be the final value but that can change if there is a better way to represent this) and calculate the interpolated value.
Now, the PIC doesn't have a divide instruction, so I could code up a general-purpose divide routine and effectively calculate (B-A)*(x/255)+A at each step, but I feel there is probably a much better way to do this on a microcontroller than the way I'd do it on a PC in C++.
Has anyone got any suggestions for implementing this efficiently on this hardware?
The value you are looking for is (A*(255-x)+B*x)/255. It requires only 8x8 multiplication, and a final division by 255, which can be approximated by simply taking the high byte of the sum.
Choosing x in range 0..128, no approximation is needed: take the high byte of (A*(128-x)+B*x)<<1.
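The difference between the three variants can be checked with a short Python sketch. The function names are invented for the illustration, and every intermediate value is kept below 65536 so that it would fit in a 16-bit register pair on the PIC.

```python
def lerp_exact(a, b, x):
    """Reference result with a true division by 255 (x in 0..255)."""
    return (a * (255 - x) + b * x) // 255


def lerp_high_byte(a, b, x):
    """Approximate /255 by taking the high byte of the 16-bit sum (x in 0..255)."""
    return (a * (255 - x) + b * x) >> 8


def lerp_128(a, b, x):
    """x in 0..128: shift the sum left once, then take the high byte."""
    return ((a * (128 - x) + b * x) << 1) >> 8


endpoints_approx = (lerp_high_byte(10, 200, 0), lerp_high_byte(10, 200, 255))
endpoints_128 = (lerp_128(10, 200, 0), lerp_128(10, 200, 128))
```

The high-byte approximation lands one below each endpoint in this example, while the 0..128 variant reaches both endpoints exactly.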
Assuming you interpolate a sequence of values where the previous endpoint is the new start point, dividing by 255 sounds like a bad idea. If you use base 255 as a fixed-point representation, you get the same interpolant twice: you get B when x=255, and B again as the new A when x=0.
Use 256 as the fixed-point base. Divides become shifts, but you need 16-bit arithmetic and 8x8 multiplication with a 16-bit result. The previous issue can be fixed by simply ignoring any bits in the higher bytes, as x mod 256 becomes 0. This suggestion uses 16-bit multiplication, but can't overflow, and you don't interpolate over the same x twice.
interp = (a*(256 - x) + b*x) >> 8
256 - x becomes just a subtract-with-borrow, as you get 0 - x.
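A quick Python sketch of the base-256 formula shows how chained segments behave. The name lerp256 and the sample values are assumptions made for the illustration.

```python
def lerp256(a, b, x):
    """Interpolate from a towards b; x in 0..255 is a fraction of 256."""
    return (a * (256 - x) + b * x) >> 8   # the sum never exceeds 255 * 256


# Two chained segments, 10 -> 200 and 200 -> 50, sampled at a few positions.
first = [lerp256(10, 200, x) for x in (0, 128, 255)]
second = [lerp256(200, 50, x) for x in (0, 128, 255)]

# In 8-bit arithmetic, 256 - x and 0 - x share the same low byte.
low_bytes_match = all((256 - x) & 0xFF == (0 - x) & 0xFF for x in range(256))
```

Within a segment, x = 255 stops just short of the endpoint, and the endpoint itself is produced only by x = 0 of the following segment, so no value is emitted twice.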
The PIC lacks these operations in its instruction set:
Right and left shift. (both logical and arithmetic)
Any form of multiplication.
You can get right-shifting by using rotate-right instead, followed by masking out the extra bits on the left with bitwise-and. A straight-forward way to do 8x8 multiplication with 16-bit result:
void mul16(
    unsigned char* hi, /* in: operand1, out: the most significant byte */
    unsigned char* lo  /* in: operand2, out: the least significant byte */
)
{
    unsigned char a,b;
    /* loop over the smallest value */
    a = (*hi <= *lo) ? *hi : *lo;
    b = (*hi <= *lo) ? *lo : *hi;
    *hi = *lo = 0;
    while(a--) { /* repeated addition: add b to the result a times */
        *lo += b;
        if(*lo < b) /* unsigned overflow. Use the carry flag instead.*/
            ++*hi;
    }
}
The techniques described by Eric Bainville and Mads Elvheim will work fine; each one uses two multiplies per interpolation.
Scott Dattalo and Tony Kubek have put together a super-optimized PIC-specific interpolation technique called "twist" that is slightly faster than two multiplies per interpolation.
Is using this difficult-to-understand technique worth running a little faster?
You could do it using 8.8 fixed-point arithmetic. Then a number from range 0..255 would be interpreted as 0.0 ... 0.996 and you would be able to multiply and normalize it.
Tell me if you need any more details or if it's enough for you to start.
You could characterize this instead as:
using a value range of x=0..255, precompute the values of 256/(x+1) as a fixed-point number in a table, then code a general-purpose multiply and adjust for the position of the binary point. This might not be small space-wise; I'd expect you to need a 256-entry table of 16-bit values plus the multiply code. (If you don't need speed, this would suggest your division method is fine.) But it only takes one multiply and an add.
My guess is that you don't need every possible value of X. If there are only a few values of X, you can compute them offline, do a case-select on the specific value of X and then implement the multiply in terms of a fixed sequence of shifts and adds for the specific value of X. That's likely to be pretty efficient in code and very fast for a PIC.
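To make "a fixed sequence of shifts and adds" concrete, here is a Python sketch of a general shift-and-add 8x8 multiply next to a version specialised for one constant. The constant 10 and both function names are arbitrary choices for the example.

```python
def mul8x8(a, b):
    """General 8x8 -> 16-bit multiply: add a shifted copy of a per set bit of b."""
    result = 0
    for bit in range(8):
        if (b >> bit) & 1:
            result += a << bit
    return result & 0xFFFF


def mul_by_10(a):
    """Multiply by the constant 10 = 0b1010 with two shifts and one add."""
    return ((a << 3) + (a << 1)) & 0xFFFF
```

For a constant known at build time, the loop and the bit tests disappear, leaving one shift per set bit of the constant, which is the specialisation suggested above.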
Given two values X & Y, it's basically:
X/2 + Y/2 (to prevent the odd case that X+Y might overflow the size of the register)
Hence try the following:
Initially A=MAX, B=MIN
Loop {
Right-Shift A by 1-bit.
Right-Shift B by 1-bit.
C = ADD the two results.
Check MSB of 8-bit interpolation value
if MSB=0, then B=C
if MSB=1, then A=C
Left-Shift 8-bit interpolation value
}Repeat until 8-bit interpolation value becomes zero.
The actual code is just as easy; I just don't remember the registers and instructions off-hand.
