I have little bit difficulty understanding max work group limit reported by OpenCL and how it affects the program.
So my program is reporting following thing,
CL_DEVICE_MAX_WORK_ITEM_SIZES : 1024, 1024, 1024
CL_DEVICE_MAX_WORK_GROUP_SIZE : 256
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS : 3
Now I am writing program to add vectors with 1 million entries.
So the calculation for globalSize and localSize for NDRange is as follows
int localSize = 64;
// Number of total work items - localSize must be devisor
globalSize = ceil(n/(float)localSize)*localSize;
.......
// Execute the kernel over the entire range of the data set
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
0, NULL, NULL);
Here as per my understanding OpenCL indirectly calculates the number of work groups it will launch.
For above example
globalSize = 15625 * 64 -> 1,000,000 -> So this is total number of threads that will be launched
localSize = 64 -> So each work group will have 64 work items
Hence from above we get
Total Work Groups Launched = globalSize/ localSize -> 15625 Work Groups
Here my confusion starts,
If you see value reported by OpenCL CL_DEVICE_MAX_WORK_GROUP_SIZE : 256 So, I was thinking this means max my device can launch 256 work groups in one dimension,
but above calculations showed that I am launching 15625 work groups.
So how is this thing working ?
I hope some one can clarify my confusion.
I am sure I am understanding something wrong.
Thanks in advance.
According to the specification of clEnqueueNDRangeKernel: https://www.khronos.org/registry/OpenCL/sdk/2.2/docs/man/html/clEnqueueNDRangeKernel.html,
CL_DEVICE_MAX_WORK_ITEM_SIZES and CL_DEVICE_MAX_WORK_GROUP_SIZE indicate the limits of local size (CL_KERNEL_WORK_GROUP_SIZE is CL_DEVICE_MAX_WORK_GROUP_SIZE in OpenCL 1.2).
const int dimension = n;
const int localSizeDim[n] = { ... }; // Each element must be less than or equal to 'CL_DEVICE_MAX_WORK_ITEM_SIZES[i]'
const int localSize = localSizeDim[0] * localSizeDim[1] * ... * localSizeDim[n-1]; // The size must be less than or equal to 'CL_DEVICE_MAX_WORK_GROUP_SIZ'
I couldn't find the device limit of global work items, but maximum value representable by size t is the limit of global work items in the description of the error CL_INVALID_GLOBAL_WORK_SIZE.
This is a bit of a turn around.
Usually one is attempting to use shifts to perform multiplication and not the other way around.
On the Hitachi/Motorola 6309 there is no shift by n bits. There is only shift by 1 bit.
However there is a 16 bit x 16 bit signed multiply (provides a 32 bit signed result).
(EDIT) Using this is no problem for a 16 bit shift (left) however I'm trying to use 2 x 16x16 signed mults to do a 32 bit shift. The high order word of the result for the low order word shift is the problem. (Does that make sence?)
Some pseudo code might help:
result.highword = low word of (val.highword * shiftmulttable[shift])
temp = val.lowword * shiftmulttable[shift]
result.lowword = temp.lowword
result.highword = or (result.highword, temp.highword)
(with some magic on temp.highword to consider signed values)
I have been exercising my logic in an attempt to use this instruction to perform the shifts but so far I have failed.
I can easily achieve any positive value shifts by 0 to 14 but when it comes to shifting by 15 bits (mult by 0x8000) or shifting any negative values certain combinations of values require either:
complementing the result by 1
complementing the result by 2
adding 1 to the result
doing nothing to the result
And I just can't see any pattern to these values.
Any ideas appreciated!
Best I can tell from the problem description, implementing the 32-bit shift would work as desired by using an unsigned 16x16->32 bit multiply. This can easily be synthesized from a signed 16x16->32 multiply instruction by exploiting the two's complement integer representation. If the two factors are a and b, adding b to the high-order 16 bits of the signed product when a is negative, and adding a to the high-order 16 bits of the signed product when b is negative will give us the unsigned multiplication result.
The following C code implements this approach and tests it exhaustively:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
/* signed 16x16->32 bit multiply. Hardware instruction */
int32_t mul16_wide (int16_t a, int16_t b)
{
return (int32_t)a * (int32_t)b;
}
/* unsigned 16x16->32 bit multiply (synthetic) */
int32_t umul16_wide (int16_t a, int16_t b)
{
int32_t p = mul16_wide (a, b); // signed 16x16->32 bit multiply
if (a < 0) p = p + (b << 16); // add 'b' to upper 16 bits of product
if (b < 0) p = p + (a << 16); // add 'a' to upper 16 bits of product
return p;
}
/* unsigned 16x16->32 bit multiply (reference) */
uint32_t umul16_wide_ref (uint16_t a, uint16_t b)
{
return (uint32_t)a * (uint32_t)b;
}
/* test synthetic unsigned multiply exhaustively */
int main (void)
{
int16_t a, b;
int32_t res, ref;
uint64_t count = 0;
a = -32768;
do {
b = -32768;
do {
res = umul16_wide (a, b);
ref = umul16_wide_ref (a, b);
count++;
if (res != ref) {
printf ("!!!! a=%d b=%d res=%d ref=%d\n", a, b, res, ref);
return EXIT_FAILURE;
}
if (b == 32767) break;
b = b + 1;
} while (1);
if (a == 32767) break;
a = a + 1;
} while (1);
printf ("test cases passed: %llx\n", count);
return EXIT_SUCCESS;
}
I am not familiar with the Hitachi/Motorola 6309 architecture. I assume it uses a special 32-bit register to hold the result of a wide multiply, from which high and low half can be extracted into 16-bit general-purpose registers, and the conditional corrections can then be applied to the register holding the upper 16 bits.
Are you using fixed-point multiplicative inverses to use the high half result for a right shift?
If you're just left-shifting, multiply by 0x8000 should work. The low half of an NxN => 2N-bit multiply is the same whether inputs are treated as signed or unsigned. Or do you need a 32-bit shift result from your 16-bit input?
Is the multiply instruction actually faster than a few 1-bit shifts for small shift counts? (I wouldn't be surprised if compile-time-constant counts of 2 or 3 would be faster with just a chain of 2 or 3 add same,same or left-shift instructions.)
Anyway, for a compile-time-constant shift count of 15, maybe just multiply by 1<<14 and then do the last count with a 1-bit shift (add same,same).
Or if your ISA has rotates, rotate right by 1 and mask away the low bits, skipping the multiply. Or zero a register, right-shift the low bit into the carry flag, then rotate-through-carry into the top of the zeroed register.
(The latter might be useful on an ISA that doesn't have large immediates and couldn't "mask away all the low bits" in one instruction. Or an ISA that only has RCR not ROR. I don't know 6309 at all)
If you're using a runtime count to look up a multiplier from a table, maybe branch for that case, or adjust your LUT so every entry needs an extra 1-bit shift, so you can do mul(lut[count]) and an unconditional extra shift.
(Only works if you don't need to support a shift-count of zero.)
Not that there would be many interested people who would want to see the 6309 code, but here it is:
Compliant with OS9 C ABI.
Pointer to result and arguments pushed on stack right to left.
U,PC,val(4bytes),shift(2bytes),*result(2bytes)
0 2 4 8 10
:
* 10,s pointer to long result
* 4,s 4 byte value
* 8,s 2 byte shift
* x = pointer to result
pshs u
ldx 10,s * load pointer to result
ldd 8,s * load shift
* if shift amount is greater than 31 then
* just return zero. OS9 C standard.
cmpd #32
blt _10x
ldq #0
stq 4,s
bra _13x
* if shift amount is greater than 16 than
* move bottom word of value into top word
* and clear bottom word
_10x
cmpb #16
blt _1x
ldu 6,s
stu 4,s
clr 6,s
clr 7,s
_1x
* setup pointer u and offset e into mult table _2x
leau _2x,pc
andb #15
* if there is no shift value just return value
beq _13x
aslb * need to double shift to use as word table offset
stb 8,s * save double shft
tfr b,e
* shift top word q = val.word.high * multtab[shft]
ldd 4,s
muld e,u
stw ,x * result.word.high = low word of mult
* shift bottom word q = val.word.low * multtab[shft]
lde 8,s * reload double shft
ldd 6,s
muld e,u
stw 2,x * result.word.low = low word of mult
* The high word or mult needs to be corrected for sign
* if val is negative then muld will return negated results
* and need to un negate it
lde 8,s * reload double shift
tst 4,s * test top byte of val for negative
bge _11x
addd e,u * add the multtab[shft] again to top word
_11x
* if multtab[shft] is negative (shft is 15 or shft<<1 is 30)
* also need to un negate result
cmpe #30
bne _12x
addd 6,s * add val.word.low to top word
_12x
* combine top and bottom and save bottom half of result
ord ,x
std ,x
bra _14x
* this is only reached if the result is in value (let result = value)
_13x
ldq 4,s * load value
stq ,x * result = value
_14x
puls u,pc
_2x fdb $01,$02,$04,$08,$10,$20,$40,$80,$0100,$0200,$0400,$0800
fdb $1000,$2000,$4000,$8000
I cannot understand what is mean slot.
It is the device number from the dbdf(Domain:bus:device:function)?
Codes from here::
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/pci.h#L32
/*
* The PCI interface treats multi-function devices as independent
* devices. The slot/function address of each device is encoded
* in a single byte as follows:
*
* 7:3 = slot
* 2:0 = function
*/
#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn) ((devfn) & 0x07)
I have problem how to solve this one, Iam thinking about return
int product = 3 * n;
return (!n || product/n == 3);
however, I cant use those operators.
/*
* Overflow detection of 3*n
* Input is positive
* Example: overflow( 10 ) = 0
* Example: overlfow( 1<<30 ) = 1
* Legal ops: & | >> << ~
* Max ops: 10
*
* Number of X86 instructions:
*/
int overflow_3( int n ) {
return 2;
}
The condition is equivalent to checking whether x is larger than MAX_INT / 3, that is, x > 0x2aaaaaaa. Since x is known to be nonnegative, we know that the top bit is zero and thus we can check the condition as follows:
unsigned overflow(unsigned x) {
return (x + 0x55555555) >> 31;
}
There are two possible options for a number to overflow when multiplied by 3.
Let's look at X3 multiplication. There are two actions:
1. Shift left by 1 leaves the leftmost bit set. This could only happen if the near leftmost (i.e the 30) bit is set
2. Shift left by 1 leaves the leftmost bit unset. However the following addition of the original number results in having the bits set. This could only happen if the 29 bit is set (since it is the only one that will become the 30 after the shift) and if either the 28 or the 27 bit is set (since they can overflow to the 30 bit). However the 27 but by itself being set is not enough (since we need the 26 bit to be set, or the 25th and 24th) and etc.
So basically you need a loop here. However since loops are not allowed I would use recursion. So:
int overflow_3(int n){
return n >> 30 || (n >> 29 && overflow_3( (n & ( (1 << 29) - 1)) << 2 ) );
}
I am currently reading "Linux Kernel Development" by Robert Love, and I got a few questions about the CFS.
My question is how calc_delta_mine calculates :
delta_exec_weighted= (delta_exec * weight)/lw->weight
I guess it is done by two steps :
calculation the (delta_exec * 1024) :
if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
tmp = (u64)delta_exec * scale_load_down(weight);
else
tmp = (u64)delta_exec;
calculate the /lw->weight ( or * lw->inv_weight ) :
if (!lw->inv_weight) {
unsigned long w = scale_load_down(lw->weight);
if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
lw->inv_weight = 1;
else if (unlikely(!w))
lw->inv_weight = WMULT_CONST;
else
lw->inv_weight = WMULT_CONST / w;
}
/*
* Check whether we'd overflow the 64-bit multiplication:
*/
if (unlikely(tmp > WMULT_CONST))
tmp = SRR(SRR(tmp, WMULT_SHIFT/2) * lw->inv_weight,
WMULT_SHIFT/2);
else
tmp = SRR(tmp * lw->inv_weight, WMULT_SHIFT);
return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
The SRR (Shift right and round) macro is defined via :
#define SRR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))
And the other MACROS are defined :
#if BITS_PER_LONG == 32
# define WMULT_CONST (~0UL)
#else
# define WMULT_CONST (1UL << 32)
#endif
#define WMULT_SHIFT 32
Can someone please explain how exactly the SRR works and how does this check the 64-bit multiplication overflow?
And please explain the definition of the MACROS in this function((~0UL) ,(1UL << 32))?
The code you posted is basically doing calculations using 32.32 fixed-point arithmetic, where a single 64-bit quantity holds the integer part of the number in the high 32 bits, and the decimal part of the number in the low 32 bits (so, for example, 1.5 is 0x0000000180000000 in this system). WMULT_CONST is thus an approximation of 1.0 (using a value that can fit in a long for platform efficiency considerations), and so dividing WMULT_CONST by w computes 1/w as a 32.32 value.
Note that multiplying two 32.32 values together as integers produces a result that is 232 times too large; thus, WMULT_SHIFT (=32) is the right shift value needed to normalize the result of multiplying two 32.32 values together back down to 32.32.
The necessity of using this improved precision for scheduling purposes is explained in a comment in sched/sched.h:
/*
* Increase resolution of nice-level calculations for 64-bit architectures.
* The extra resolution improves shares distribution and load balancing of
* low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
* hierarchies, especially on larger systems. This is not a user-visible change
* and does not change the user-interface for setting shares/weights.
*
* We increase resolution only if we have enough bits to allow this increased
* resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
* when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
* increased costs.
*/
As for SRR, mathematically, it computes the rounded result of x / 2y.
To round the result of a division x/q you can calculate x + q/2 floor-divided by q; this is what SRR does by calculating x + 2y-1 floor-divided by 2y.