I'll make examples in Python, since I use Python, but the question is not about Python.
Let's say I want to increment a variable by a specific value so that it stays within given boundaries.
So for increment and decrement I have these two functions:
def up(a, s, Bmax):
    r = a + s
    if r > Bmax: return Bmax
    else: return r

def down(a, s, Bmin):
    r = a - s
    if r < Bmin: return Bmin
    else: return r
Note: it is assumed that the initial value of the variable "a" is already within the boundaries (min <= a <= max), so additional initial checking does not belong in these functions. What makes me curious is that almost every program I write needs these functions.
The question is:
Are these classified as typical operations, and do they have specific names?
If yes, is there some corresponding intrinsic processor functionality, so that it is optimised in some compilers?
The reason I ask is pure curiosity; of course I cannot optimise it in Python, and I know little about CPU architecture.
To be more specific, on a lower level, for an unsigned 8-bit integer I suppose the increment would look like this:
def up(a, s, Bmax):
    counter = 0
    while True:
        if counter == s: break
        if a == Bmax: break
        if a == 255: break
        a += 1
        counter += 1
I know the latter would not make any sense in Python, so treat it as my naive attempt to imagine low-level code which adds the value in place. There are some nuances, e.g. signed vs. unsigned, but I was interested merely in unsigned integers since I come across them more often.
It is called saturation arithmetic; it has native support on DSPs and GPUs (not a random pair: both deal with signals).
For example the NVIDIA PTX ISA lets the programmer choose whether an addition is saturated or not:
add.type d, a, b;
add{.sat}.s32 d, a, b; // .sat applies only to .s32
.sat limits result to MININT..MAXINT (no overflow) for the size of the operation.
The TI TMS320C64x/C64x+ DSP has support for
Dual 16-bit saturated arithmetic operations
and instructions like sadd to perform a saturated add, and even a whole register (the Saturation Status Register) dedicated to collecting precise information about saturation while executing a sequence of instructions.
Even the mainstream x86 has support for saturation with instructions like vpaddsb and similar (including conversions).
Another example is the GLSL clamp function, used to make sure color values are not outside the range [0, 1].
In general, if an architecture is optimized for signal/media processing, it has support for saturation arithmetic.
Much rarer is support for saturation with arbitrary bounds, e.g. asymmetrical bounds, non-power-of-two bounds, non-word-sized bounds.
However, saturation can be implemented easily as min(max(v, b), B), where v is the result of the unsaturated (and not overflowed) operation, b the lower bound and B the upper bound.
So any architecture that supports finding the minimum and the maximum without a branch can implement any form of saturation efficiently.
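For illustration, here is a minimal Python sketch of that formulation (the names clamp and sat_add_u8 are mine, just for the example):
def clamp(v, lo, hi):
    # saturate v into [lo, hi]; assumes lo <= hi
    return min(max(v, lo), hi)

def sat_add_u8(a, s):
    # unsigned 8-bit saturating add: result never exceeds 255
    return clamp(a + s, 0, 255)
This mirrors the up/down pair from the question, with the branches replaced by min/max.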
See also this question for a more realistic example of how saturated addition is implemented.
As a side note, the default behavior is wrap-around: for 8-bit quantities the sum 255 + 1 equals 0 (i.e. operations are modulo 2^8).
Let's say I have N sequences x[i], each with length seqLength[i] for 0 <= i < N. As far as I understand from the cuDNN docs, they have to be ordered by sequence length, the longest first, so assume that seqLength[i] >= seqLength[i+1]. Let's say that they have the feature dimension D, so x[i] is a 2D tensor of shape (seqLength[i], D). As far as I understand, I should prepare a tensor x where all x[i] are contiguously behind each other, i.e. it would be of shape (sum(seqLength), D).
According to the cuDNN docs, the functions cudnnRNNForwardInference / cudnnRNNForwardTraining gets the argument int seqLength and cudnnTensorDescriptor_t* xDesc, where:
seqLength: Number of iterations to unroll over.
xDesc: Array of tensor descriptors. Each must have the same second dimension. The first dimension may decrease from element n to element n + 1 but may not increase.
I'm not exactly sure I understand this correctly.
Is seqLength my max(seqLength)?
And xDesc is an array. Of what length? max(seqLength)? If so, I assume that it describes one batch of features for each frame but some of the later frames will have less sequences in it. It sounds like the number of sequences per frame is described in the first dimension.
So:
xDesc[t].shape[0] = len([i for i in range(N) if t < seqLength[i]])
for all 0 <= t < max(seqLength). I.e. 0 <= xDesc[t].shape[0] <= N.
How many dimensions does each xDesc[t] describe, i.e. what is len(xDesc[t].shape)? I would assume that it is 2 and that the second dimension is the feature dimension, i.e. D, i.e.:
xDesc[t].shape = (len(...), D)
The strides would have to be set accordingly, although it's also not totally clear. If x is stored in row-major order, then
xDesc[0].strides[0] = D * xDesc[0].shape[0]
xDesc[0].strides[1] = 1
But how does cuDNN compute the offset for frame t? I guess it will keep track and thus calculate sum([xDesc[t2].strides[0] for t2 in range(t)]).
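To make my assumption concrete, here is a small Python sketch of the layout I have in mind (variable names are mine, not cuDNN's; it only illustrates my reading of the docs):
seq_length = [5, 5, 3, 2]   # example: N = 4 sequences, sorted longest first
D = 8                       # feature dimension
max_T = max(seq_length)

# number of sequences still active at time step t -> first dimension of xDesc[t]
batch_per_step = [sum(1 for s in seq_length if t < s) for t in range(max_T)]
# -> [4, 4, 3, 2, 2]

# element offset of frame t inside the packed tensor x of shape (sum(seq_length), D)
offsets = [D * sum(batch_per_step[:t]) for t in range(max_T)]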
Most example code I have seen assumes that all sequences are of the same length. Also, they all describe 3 dimensions per xDesc[t], not 2. Why is that? The third dimension is always 1, as are the strides of the second and third dimensions, and the stride for the first dimension is N. So this assumes that the tensor x is row-major ordered and of shape (max(seqLength), N, D). The code is actually a bit strange. E.g. from TensorFlow:
int dims[] = {batch_size, data_size, 1};
int strides[] = {dims[1] * dims[2], dims[2], 1};
cudnnSetTensorNdDescriptor(
    ...,
    sizeof(dims) / sizeof(dims[0]) /*nbDims*/, dims /*dimA*/,
    strides /*strideA*/);
The code looks really similar in all examples I have found. Search for cudnnSetTensorNdDescriptor or cudnnRNNForwardTraining. E.g.:
TensorFlow (issue 6633)
Theano
mxnet
Torch
Baidu persistent-rnn
Caffe2
Chainer
I found one example which can handle sequences of different length. Again search for cudnnSetTensorNdDescriptor:
Microsoft CNTK
That claims that there must be 3 dimensions for every xDesc[t]. It has the comment:
these dimensions are what CUDNN expects: (the minibatch dimension, the data dimension, and the number 1 (because each descriptor describes one frame of data)
Edit: Support for this was added to PyTorch at the end of 2018, in this commit.
Am I missing something from the cuDNN documentation? I really have not found that information in it.
My question is basically: is my conclusion about how to set the arguments x, seqLength and xDesc for cudnnRNNForwardInference / cudnnRNNForwardTraining correct (including my implicit assumptions), and if not, how would I use it, what does the memory layout look like, etc.?
Is there any way to print the value of n^1000 without using BigInt? I have been thinking along the lines of using some sort of shift logic, but haven't been able to come up with anything good yet.
You can certainly do this, and I recommend it as an exercise. Beyond that there's little reason to implement this in a language with an existing BigInteger implementation.
In case you're looking for an exercise, it's really helpful to do it in a language that supports BigIntegers out of the box. That way you can gradually replace BigInteger operations with your own until there's nothing left to replace.
BigInteger libraries typically represent values larger than the largest primitive by using an array of the same primitive type, such as byte or int. Here's some Python I wrote that models unsigned bytes (UByte) and lists of unsigned bytes (BigUInt). Any BigUInt with multiple UBytes treats index 0 as the most-significant byte, making it a big-endian representation. Doing the opposite is fine too.
class UByte:
    def __init__(self, n=0):
        n = int(n)
        if (n < 0) or (n > 255):
            raise ValueError("Expecting integer in range [0,255].")
        self.__n = n

    def value(self):
        return self.__n


class BigUInt:
    def __init__(self, b=None):
        self.__b = b if b is not None else []

    def value(self):
        # treating index 0 as most-significant byte (big endian)
        byte_count = len(self.__b)
        if byte_count == 0:
            return 0
        result = 0
        for i in range(byte_count):
            place_value = 8 * (byte_count - i - 1)
            byte_value = self.__b[i].value() << place_value
            result += byte_value
        return result

    def __str__(self):
        # base 10 representation
        return "%s" % self.value()
The code above doesn't quite do what you want. Several parts of BigUInt#value depend on Python's built-in BigIntegers, for instance the left-shifting to compute byte_value doesn't overflow, even when place_value is really large. In lower-level machine code, each value has a fixed number of bits and left shifting without care can result in lost information. Similarly, the += operation to update the result would eventually overflow for the same reason in lower-level code, but Python handles that for you.
Notice that __str__ is implemented by calling value(). One way to bypass Python's magic is to reimplement __str__ so it doesn't call value(). Figure out how to translate a binary number into a string of base-10 digits. Once that's done, you can implement value() in terms of __str__ simply with return int(self.__str__()).
Here are some sample tests for the code above. They may help as a sanity check while you rework the code.
ten_as_byte = UByte(10)
ten_as_big_uint = BigUInt([UByte(10)])
print("ten as byte ?= ten as ubyte: %s" % (ten_as_byte.value() == ten_as_big_uint.value()))

three_hundred = 300
three_hundred_as_big_uint = BigUInt([UByte(0x01), UByte(0x2c)])
print("three hundred ?= three hundred as big uint: %s" % (three_hundred == three_hundred_as_big_uint.value()))

two_to_1000th_power = 2**1000
two_to_1000th_power_as_big_uint = BigUInt([UByte(0x01)] + [UByte() for x in range(125)])
print("2^1000 ?= 2^1000 as big uint: %s" % (two_to_1000th_power == two_to_1000th_power_as_big_uint.value()))
EDIT: For a better low-level description of what's required, refer to chapter 2 of the From NAND to Tetris curriculum. The project in that chapter is to implement a 16-bit ALU (Arithmetic Logic Unit). If you then extend the ALU to output an overflow bit, an arbitrary number of these ALUs can be chained together to handle fundamental computations over arbitrarily large input numbers.
Raising a small number - one that fits into a normal integer variable - to a high power is one of the easiest big integer operations to implement, and it is often used as a task to let people discover some big integer math implementation principles.
A good discussion - including lots of different examples in C - is in the topic Sum of digits in a^b over on Code Review. My contribution there shows how to do fast exponentiation via repeated squaring, using a std::vector<uint32_t> as a sort of 'fake' big integer. But there are even simpler solutions in that topic, just take your pick.
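To sketch the repeated-squaring idea in the same spirit, here is my own Python mock-up of a 'fake' big integer as a list of base-2^16 limbs (not the code from that Code Review answer):
BASE = 1 << 16   # each limb holds 16 bits; limbs stored least-significant first

def limbs_mul(a, b):
    # schoolbook multiplication of two limb lists
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            cur = out[i + j] + ai * bj + carry
            out[i + j] = cur % BASE
            carry = cur // BASE
        out[i + len(b)] += carry
    while len(out) > 1 and out[-1] == 0:
        out.pop()
    return out

def limbs_pow(n, e):
    # n must fit in one limb; computes n**e by repeated squaring
    result, base = [1], [n]
    while e:
        if e & 1:
            result = limbs_mul(result, base)
        base = limbs_mul(base, base)
        e >>= 1
    return result
For example, limbs_pow(2, 1000) yields the limbs of 2^1000, which could then be printed with a digit-extraction routine like the one sketched earlier.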
An easy way of testing C/C++ big integer code without having to go hunt for a big integer library is to compile the code as managed C++ in Visual C++ (Express), which gives you access to the .NET BigInteger class:
#using "System.Numerics.dll"
using System::Numerics::BigInteger;
BigInteger n = BigInteger::Parse("123456789");
System::Console::WriteLine(BigInteger::Pow(n, 1000));
A few days ago, my teacher told me it was possible to check whether a given point is inside a given rectangle using only bit operators. Is it true? If so, how can I do that?
This might not answer your question directly, but what you are looking for could be this.
These are the bit-twiddling tricks compiled by Sean Eron Anderson, and he even put a bounty of $10 on finding a single bug. The closest thing I found there is a macro that determines whether a word contains a byte between m and n:
Determine if a word has a byte between m and n
When m < n, this technique tests if a word x contains an unsigned byte value, such that m < value < n. It uses 7 arithmetic/logical operations when n and m are constant.
Note: Bytes that equal n can be reported by likelyhasbetween as false positives, so this should be checked by character if a certain result is needed.
Requirements: x>=0; 0<=m<=127; 0<=n<=128
#define likelyhasbetween(x,m,n) \
((((x)-~0UL/255*(n))&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
This technique would be suitable for a fast pretest. A variation that takes one more operation (8 total for constant m and n) but provides the exact answer is:
#define hasbetween(x,m,n) \
((~0UL/255*(127+(n))-((x)&~0UL/255*127)&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
It is possible if the numbers are finite positive integers.
Suppose we have a rectangle represented by the corner points (a1,b1) and (a2,b2). Given a point (x,y), we only need to evaluate the expression (a1<x) & (x<a2) & (b1<y) & (y<b2). So the problem now is to find a pure bit-operation equivalent for the comparison c < d.
Let ci be the i-th bit of the number c (which can be obtained by masking and bit shifting). We show that, for numbers with at most n bits, c < d is equivalent to r_(n-1), where
r_i = ((ci^di) & ((!ci)&di)) | (!(ci^di) & r_(i-1))
Proof: when ci and di differ, the left expression may be true (it depends on ((!ci)&di)); otherwise the right expression may be true (it depends on r_(i-1), the comparison of the next lower bit).
The expression ((!ci)&di) is equivalent to the bit comparison ci < di. Hence, this recursive relation compares the numbers bit by bit from left to right until it can decide whether c is smaller than d.
Hence there is a purely bitwise expression corresponding to the comparison operator, and so it is possible to test whether a point is inside a rectangle with pure bit operations.
Edit: There is actually no need for a conditional statement; just expand r_(n-1) and you are done.
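As a rough illustration of the recurrence, here is my own Python sketch for unsigned n-bit values (not the answer's notation; bit extraction uses masking and shifting as described above):
def bit_less(c, d, nbits=8):
    # builds c < d from xor/and/not only, scanning from the LSB up so the
    # MSB comparison ends up outermost (r_(n-1) in the notation above)
    result = 0   # r_(-1): if every bit is equal, c < d is false
    for i in range(nbits):
        ci = (c >> i) & 1
        di = (d >> i) & 1
        differ = ci ^ di
        result = (differ & (~ci & di & 1)) | (~differ & 1 & result)
    return result

def in_rect(x, y, a1, b1, a2, b2, nbits=8):
    # strict interior test, assuming a1 < a2 and b1 < b2
    return (bit_less(a1, x, nbits) & bit_less(x, a2, nbits) &
            bit_less(b1, y, nbits) & bit_less(y, b2, nbits))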
x,y is in the rectangle {x0<x<x1 and y0<y<y1} if {x0<x and x<x1 and y0<y and y<y1}
If we can simulate < with bit operators, then we're good to go.
What does it mean to say something is < in binary? Consider
a: 0 0 0 0 1 1 0 1
b: 0 0 0 0 1 0 1 1
In the above, a>b, because at the leftmost bit where they differ, a has a 1 and b has a 0. We are thus seeking the leftmost bit such that myBit != otherBit. (== or equiv is a bitwise operator which can be represented with and/or/not.)
However, we need some way to propagate information from one bit to many bits. So we ask ourselves this: can we "code", using only bit operators, a function if(q,k,a,b) = if q[k] then a else b? The answer is yes:
We create a bit-word consisting of q[k] replicated onto every bit. There are two ways I can think of to do this:
1) Left-shift by k so the bit lands in the sign position, then arithmetic-right-shift by wordsize-1 (efficient, but only works if you have an arithmetic shift which duplicates the sign bit).
2) Inefficient but theoretically correct way:
We left-shift q by k bits.
We take this result and AND it with 10000...0.
We right-shift this by 1 bit and OR it with the non-right-shifted version. This copies the bit in the first place to the second place. We repeat this process until the entire word equals the first bit (e.g. 64 times).
Calling this result mask, our function is (mask and a) or (!mask and b): the result will be a if the k-th bit of q is true, otherwise the result will be b.
Taking the bit-vector c=a!=b and a==1111..1 and b==0000..0, we use our if function to successively test whether the first bit is 1, then the second bit is 1, etc:
a<b :=
if(c,0,
if(a,0, B_LESSTHAN_A, A_LESSTHAN_B),
if(c,1,
if(a,1, B_LESSTHAN_A, A_LESSTHAN_B),
if(c,2,
if(a,2, B_LESSTHAN_A, A_LESSTHAN_B),
if(c,3,
if(a,3, B_LESSTHAN_A, A_LESSTHAN_B),
if(...
if(c,64,
if(a,64, B_LESSTHAN_A, A_LESSTHAN_B),
A_EQUAL_B)
)
...)
)
)
)
)
This takes wordsize steps. It can however be written in 3 lines by using a recursively-defined function, or a fixed-point combinator if recursion is not allowed.
Then we just turn that into an even larger function: xMin<x and x<xMax and yMin<y and y<yMax
I want to calculate whether a point, (x,y), is inside a rectangle which is determined by two points, (a,b) and (c,d).
If a<=c and b<=d, then it is simple:
a<=x&&x<=c&&b<=y&&y<=d
However, since it is unknown whether a<=c or b<=d, the code should be
(a<=x&&x<=c||c<=x&&x<=a)&&(b<=y&&y<=d||d<=y&&y<=b)
This code may work, but it is too long. I can write a function and use it, but I wonder if there's a shorter way (which should also execute very fast, since the code is called a lot) to write it.
One I can imagine is:
((c-x)*(x-a)>=0)&&((d-y)*(y-b)>=0)
Is there more clever way to do this?
(And, is there any good way to iterate from a to c?)
Swap the variables as needed so that a = xmin and b = ymin:
if a > c: swap(a,c)
if b > d: swap(b,d)
a <= x <= c and b <= y <= d
Shorter but slightly less efficient:
min(a,c) <= x <= max(a,c) and min(b,d) <= y <= max(b,d)
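For instance, a minimal Python sketch of the min/max variant (inclusive bounds assumed; the function name is just for illustration):
def point_in_rect(x, y, a, b, c, d):
    # corners (a, b) and (c, d) may come in any order
    return min(a, c) <= x <= max(a, c) and min(b, d) <= y <= max(b, d)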
As always when optimizing you should profile the different options and compare hard numbers. Pipelining, instruction reordering, branch prediction, and other modern day compiler/processor optimization techniques make it non-obvious whether programmer-level micro-optimizations are worthwhile. For instance it used to be significantly more expensive to do a multiply than a branch, but this is no longer always the case.
I like this one:
((c-x)*(x-a)>=0)&&((d-y)*(y-b)>=0)
but with more whitespace and more symmetry:
(c-x)*(a-x) <= 0 && (d-y)*(b-y) <= 0
It's mathematically elegant, and probably the fastest too. You will need to measure to really determine which is the fastest. With modern pipelined processors, I would expect that straight-line code with the minimum number of operators will run fastest.
While sorting the (a, b) and (c, d) pairs as suggested in the accepted answer is probably the best solution in this case, an even better application of this method would probably be to elevate the a <= c and b <= d requirement to the level of a program-wide invariant. I.e. require that all rectangles in your program are created and maintained in this "normalized" form from the very beginning. Thus, inside your point-in-rectangle test function you would simply assert that a <= c and b <= d instead of wasting CPU resources on actually sorting them in every call.
Define intermediate variables i = min(a,c), j = min(b,d), k = max(a,c), l = max(b,d).
Then you only need i<=x && x<=k && j<=y && y<=l.
EDIT: Mind you, efficiency-wise it's probably better to use your "too long" code in a function.
I need to do a linear interpolation over time between two values on an 8 bit PIC microcontroller (Specifically 16F627A but that shouldn't matter) using PIC assembly language. Although I'm looking for an algorithm here as much as actual code.
I need to take an 8 bit starting value, an 8 bit ending value and a position between the two (Currently represented as an 8 bit number 0-255 where 0 means the output should be the starting value and 255 means it should be the final value but that can change if there is a better way to represent this) and calculate the interpolated value.
Now the PIC doesn't have a divide instruction, so I could code up a general-purpose divide routine and effectively calculate (B-A)/(x/255)+A at each step, but I feel there is probably a much better way to do this on a microcontroller than the way I'd do it on a PC in C++.
Has anyone got any suggestions for implementing this efficiently on this hardware?
The value you are looking for is (A*(255-x)+B*x)/255. It requires only 8x8 multiplication, and a final division by 255, which can be approximated by simply taking the high byte of the sum.
Choosing x in range 0..128, no approximation is needed: take the high byte of (A*(128-x)+B*x)<<1.
Assuming you interpolate a sequence of values where the previous endpoint is the new start point:
(B-A)/(x/255)+A
sounds like a bad idea. If you use base 255 as a fixed-point representation, you get the same interpolant twice: you get B when x=255, and B again as the new A when x=0.
Use 256 as the fixed-point base. Divides become shifts, but you need 16-bit arithmetic and an 8x8 multiplication with a 16-bit result. The previous issue can be fixed by simply ignoring any bits in the higher bytes, as x mod 256 becomes 0. This suggestion uses a 16-bit multiplication but can't overflow, and you don't interpolate over the same x twice.
interp = (a*(256 - x) + b*x) >> 8
256 - x becomes just a subtract-with-borrow, as you get 0 - x.
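As a quick sanity check of that formula (a Python sketch just for illustration; on the PIC this would be an 8x8 -> 16-bit multiply followed by taking the high byte):
def interp(a, b, x):
    # a, b: 8-bit endpoints; x: 0..255 position within the segment
    # (x = 256, i.e. the start of the next segment, would give exactly b)
    return (a * (256 - x) + b * x) >> 8

assert interp(10, 200, 0) == 10
assert interp(10, 200, 256) == 200
assert interp(0, 255, 128) == 127   # midpoint, truncated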
The PIC lacks these operations in its instruction set:
Right and left shifts (both logical and arithmetic).
Any form of multiplication.
You can get right-shifting by using rotate-right instead, followed by masking out the extra bits on the left with a bitwise AND. A straightforward way to do an 8x8 multiplication with a 16-bit result:
void mul16(
    unsigned char* hi, /* in: operand1, out: the most significant byte */
    unsigned char* lo  /* in: operand2, out: the least significant byte */
)
{
    unsigned char a, b;
    /* loop over the smallest value */
    a = (*hi <= *lo) ? *hi : *lo;
    b = (*hi <= *lo) ? *lo : *hi;
    *hi = *lo = 0;
    while (a) {
        *lo += b;
        if (*lo < b)   /* unsigned overflow. Use the carry flag instead. */
            (*hi)++;
        --a;
    }
}
The techniques described by Eric Bainville and Mads Elvheim will work fine; each one uses two multiplies per interpolation.
Scott Dattalo and Tony Kubek have put together a super-optimized PIC-specific interpolation technique called "twist" that is slightly faster than two multiplies per interpolation.
Is using this difficult-to-understand technique worth running a little faster?
You could do it using 8.8 fixed-point arithmetic. Then a number from range 0..255 would be interpreted as 0.0 ... 0.996 and you would be able to multiply and normalize it.
Tell me if you need any more details or if it's enough for you to start.
You could characterize this instead as:
(B-A)*(256/(x+1))+A
using a value range of x=0..255, precompute the values of 256/(x+1) as fixed-point numbers in a table, then code a general-purpose multiply and adjust for the position of the binary point. This might not be small space-wise; I'd expect you to need a 256-entry table of 16-bit values plus the multiply code. (If you don't need speed, this suggests your division method is fine.) But it only takes one multiply and an add.
My guess is that you don't need every possible value of X. If there are only a few values of X, you can compute them offline, do a case-select on the specific value of X and then implement the multiply in terms of a fixed sequence of shifts and adds for the specific value of X. That's likely to be pretty efficient in code and very fast for a PIC.
Interpolation
Given two values X and Y, it's basically:
(X+Y)/2
or
X/2 + Y/2 (to prevent the odd case that X+Y might overflow the size of the register)
Hence try the following:
(Pseudo-code)
Initially A=MAX, B=MIN
Loop {
    Right-shift A by 1 bit.
    Right-shift B by 1 bit.
    C = ADD the two results.
    Check the MSB of the 8-bit interpolation value.
    if MSB=0, then B=C
    if MSB=1, then A=C
    Left-shift the 8-bit interpolation value.
} Repeat until the 8-bit interpolation value becomes zero.
The actual code is just as easy; only I do not remember the registers and instructions off-hand.