Halide: How to avoid unwanted execution overhead in Halide LUT index

The mapping from input value to LUT index is constant over multiple calls,
so I calculate the contents of 'indexToLut' up front.
However, this also means that the values in that buffer cannot be checked here.
The LUT itself has only 17 elements.
#define LUT_SIZE 17 /* Size in each dimension of the 4D LUT */
class ApplyLut : public Halide::Generator<ApplyLut> {
public:
// We declare the Inputs to the Halide pipeline as public
// member variables. They'll appear in the signature of our generated
// function in the same order as we declare them.
Input < Buffer<uint8_t>> Lut { "Lut" , 1}; // LUT to apply
Input < Buffer<int>> indexToLut { "indexToLut" , 1}; // Precalculated mapping of uint8_t to LUT index
Input < Buffer<uint8_t >> inputImageLine { "inputImageLine" , 1}; // Input line
Output< Buffer<uint8_t >> outputImageLine { "outputImageLine", 1}; // Output line
void generate();
};
HALIDE_REGISTER_GENERATOR(ApplyLut, outputImageLine)
void ApplyLut::generate()
{
Var x("x");
outputImageLine(x) = Lut(indexToLut(inputImageLine(x)));
inputImageLine .dim(0).set_min(0); // Input image sample index
outputImageLine.dim(0).set_bounds(0, inputImageLine.dim(0).extent()); // Output line matches input line
Lut .dim(0).set_bounds(0, LUT_SIZE); //iccLut[...]: , limited number of values
indexToLut .dim(0).set_bounds(0, 256); //chan4_offset[...]: value index: 256 values
}
The question "Are there any restrictions with LUT: unbounded way in dimension" already states that such an issue can be solved by using the 'clamp' functionality.
This changes the expression to
outputImageLine(x) = Lut(clamp(indexToLut(inputImageLine(x)), 0, LUT_SIZE));
However, the generated code shows the following expression
outputImageLine[outputImageLine.s0.x] = Lut[max(min(indexToLut[int32(inputImageLine[outputImageLine.s0.x])], 17), 0)]
I think that this means that the execution will do a min/max evaluation which can be omitted in my case, because I know that all values of indexToLut are limited to 0..16.
Is there a way to avoid the execution overhead in such a case?

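You can use unsafe_promise_clamped instead of clamp to promise that the input is bounded in the way you describe, e.g. unsafe_promise_clamped(indexToLut(inputImageLine(x)), 0, LUT_SIZE - 1). Note that both bounds are inclusive, so for a 17-element LUT the upper bound should be LUT_SIZE - 1 rather than LUT_SIZE. It might not be any faster though: min and max on integer indices are very cheap compared to the indirect load.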

How to implement "i++ and i>=max ? 0: i" that only use atomic in Go

Using only atomic operations, I want to implement the following code:
const Max = 8
var index int
func add() int {
index++
if index >= Max {
index = 0
}
return index
}
For example, I tried:
func add() int {
atomic.AddUint32(&index, 1)
// error: race condition
atomic.CompareAndSwapUint32(&index, Max, 0)
return index
}
But this is wrong: there is a race condition.
Can it be implemented without using a lock?
Solving it without loops and locks
A simple implementation may look like this:
const Max = 8
var index int64
func Inc() int64 {
value := atomic.AddInt64(&index, 1)
if value < Max {
return value // We're done
}
// Must normalize, optionally reset:
value %= Max
if value == 0 {
atomic.AddInt64(&index, -Max) // Reset
}
return value
}
How does it work?
It simply adds 1 to the counter; atomic.AddInt64() returns the new value. If it's less than Max, "we're done", we can return it.
If it's greater than or equal to Max, then we have to normalize the value (make sure it's in the range [0..Max)) and reset the counter.
Reset may only be done by a single caller (a single goroutine), which will be selected by the counter's value. The winner will be the one that caused the counter to reach Max.
And the trick to avoid the need of locks is to reset it by adding -Max, not by setting it to 0. Since the counter's value is normalized, it won't cause any problems if other goroutines are calling it and incrementing it concurrently.
Of course, with many goroutines calling this Inc() concurrently, the counter may be incremented more than Max times before the goroutine that ought to reset it can actually carry out the reset, which would cause the counter to reach or exceed 2 * Max or even 3 * Max (in general: n * Max). We handle this by using a value % Max == 0 condition to decide whether a reset should happen, which again will only be true in a single goroutine for each possible value of n.
Simplification
Note that the normalization does not change values already in the range [0..Max), so you may opt to always perform the normalization. If you want to, you may simplify it to this:
func Inc() int64 {
value := atomic.AddInt64(&index, 1) % Max
if value == 0 {
atomic.AddInt64(&index, -Max) // Reset
}
return value
}
Reading the counter without incrementing it
The index variable should not be accessed directly. If there's a need to read the counter's current value without incrementing it, the following function may be used:
func Get() int64 {
return atomic.LoadInt64(&index) % Max
}
Extreme scenario
Let's analyze an "extreme" scenario. In this, Inc() is called 7 times, returning the numbers 1..7. Now the next call to Inc() after the increment will see that the counter is at 8 = Max. It will then normalize the value to 0 and wants to reset the counter. Now let's say before the reset (which is to add -8) is actually executed, 8 other calls happen. They will increment the counter 8 times, and the last one will again see that the counter's value is 16 = 2 * Max. All the calls will normalize the values into the range 0..7, and the last call will again go on to perform a reset. Let's say this reset is again delayed (e.g. for scheduling reasons), and yet another 8 calls come in. For the last, the counter's value will be 24 = 3 * Max, the last call again will go on to perform a reset.
Note that all calls will only return values in the range [0..Max). Once all reset operations are executed, the counter's value will be 0, properly, because it had a value of 24 and there were 3 "pending" reset operations. In practice there's only a slight chance for this to happen, but this solution handles it nicely and efficiently.
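The claim above can be checked empirically. The following is a minimal sketch, not part of the original answer: stress is a hypothetical helper that calls the simplified Inc() from several goroutines and records how often each value is returned. It assumes the counter starts at 0 and that the total number of calls is a multiple of Max.

```go
package main

import (
	"sync"
	"sync/atomic"
)

const Max = 8

var index int64

// Inc is the simplified lock-free increment from the answer above.
func Inc() int64 {
	value := atomic.AddInt64(&index, 1) % Max
	if value == 0 {
		atomic.AddInt64(&index, -Max) // Reset
	}
	return value
}

// stress is a hypothetical helper: it calls Inc() perWorker times from each
// of `workers` goroutines. It returns how many times each value in [0..Max)
// was returned, plus the raw counter once every call (and reset) has finished.
func stress(workers, perWorker int) ([Max]int64, int64) {
	var counts [Max]int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				v := Inc()
				atomic.AddInt64(&counts[v], 1)
			}
		}()
	}
	wg.Wait()
	return counts, atomic.LoadInt64(&index)
}
```

With stress(8, 1000), i.e. 8000 calls in total, every value in [0..Max) should be returned exactly 1000 times and the raw counter should end at 0, because each reset only subtracts Max and therefore never changes the counter's value modulo Max.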
I assume your goal is to never let index have a value equal to or greater than Max. This can be solved using a CAS (Compare-And-Swap) loop:
const Max = 8
var index int32
func add() int32 {
	var next int32
	for {
		prev := atomic.LoadInt32(&index)
		next = prev + 1
		if next >= Max {
			next = 0
		}
		if atomic.CompareAndSwapInt32(&index, prev, next) {
			break
		}
	}
	return next
}
CAS can be used to implement almost any operation atomically like this. The algorithm is:
1. Load the value.
2. Perform the desired operation.
3. Use CAS; go to step 1 on failure.
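The three steps can be factored into a reusable function. This is only a sketch: casUpdate and wrapInc are hypothetical names, not part of sync/atomic. The function passed in must be free of side effects, because it may be called more than once when the CAS fails and the loop retries.

```go
package main

import "sync/atomic"

// casUpdate is a hypothetical helper: it atomically replaces *addr with
// f(*addr) using a CAS loop and returns the value that was stored.
func casUpdate(addr *int32, f func(int32) int32) int32 {
	for {
		prev := atomic.LoadInt32(addr) // 1. load the value
		next := f(prev)                // 2. perform the desired operation
		if atomic.CompareAndSwapInt32(addr, prev, next) {
			return next // 3. CAS succeeded
		}
		// CAS failed: another goroutine changed *addr, so start over.
	}
}

// wrapInc increments v and wraps around to 0 once the result reaches 8.
func wrapInc(v int32) int32 {
	if v+1 >= 8 {
		return 0
	}
	return v + 1
}
```

Calling casUpdate(&index, wrapInc) then behaves like add() above.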

Calculating sensing range from sensing sensitivity of the device in Castalia?

I am implementing a WSN algorithm in Castalia. I need to calculate sensing range of the sensing device. I know I will need to use the sensing sensitivity parameter but what will be the exact equation?
The answer will vary depending on the behaviour specified by the PhysicalProcess module used. Since you say in your comment that you may be using the CarsPhysicalProcess let's use that as an example.
A sensor reading request initiated by the application is first sent to the SensorManager via a SensorReadingMessage message. In SensorManager.cc you can see how this is processed in its handleMessage function:
...
case SENSOR_READING_MESSAGE: {
SensorReadingMessage *rcvPacket =check_and_cast<SensorReadingMessage*>(msg);
int sensorIndex = rcvPacket->getSensorIndex();
simtime_t currentTime = simTime();
simtime_t interval = currentTime - sensorlastSampleTime[sensorIndex];
int getNewSample = (interval < minSamplingIntervals[sensorIndex]) ? 0 : 1;
if (getNewSample) { //the last request for sample was more than minSamplingIntervals[sensorIndex] time ago
PhysicalProcessMessage *requestMsg =
new PhysicalProcessMessage("sample request", PHYSICAL_PROCESS_SAMPLING);
requestMsg->setSrcID(self); //insert information about the ID of the node
requestMsg->setSensorIndex(sensorIndex); //insert information about the index of the sensor
requestMsg->setXCoor(nodeMobilityModule->getLocation().x);
requestMsg->setYCoor(nodeMobilityModule->getLocation().y);
// send the request to the physical process (using the appropriate
// gate index for the respective sensor device )
send(requestMsg, "toNodeContainerModule", corrPhyProcess[sensorIndex]);
// update the most recent sample times in sensorlastSampleTime[]
sensorlastSampleTime[sensorIndex] = currentTime;
} else { // send back the old sample value
rcvPacket->setSensorType(sensorTypes[sensorIndex].c_str());
rcvPacket->setSensedValue(sensorLastValue[sensorIndex]);
send(rcvPacket, "toApplicationModule");
return;
}
break;
}
....
As you can see, what it's doing is first working out how much time has elapsed since the last sensor reading request for this sensor. If it's less time than specified by the minSamplingInterval possible for this sensor (this is determined by the maxSampleRates NED parameter of the SensorManager), it just returns the last sensor reading given. If it's greater, a new sensor reading is made.
A new sensor reading is made by sending a PhysicalProcessMessage message to the PhysicalProcess module (via the toNodeContainerModule gate). In the message we pass the X and Y coordinates of the node.
Now, if we have specified CarsPhysicalProcess as the physical process to be used in our omnetpp.ini file, the CarsPhysicalProcess module will receive this message. You can see this in CarsPhysicalProcess.cc:
....
case PHYSICAL_PROCESS_SAMPLING: {
PhysicalProcessMessage *phyMsg = check_and_cast < PhysicalProcessMessage * >(msg);
// get the sensed value based on node location
phyMsg->setValue(calculateScenarioReturnValue(
phyMsg->getXCoor(), phyMsg->getYCoor(), phyMsg->getSendingTime()));
// Send reply back to the node who made the request
send(phyMsg, "toNode", phyMsg->getSrcID());
return;
}
...
You can see that we calculate a sensor value based on the X and Y coordinates of the node, and the time at which the sensor reading was made. The response is sent back to the SensorManager via the toNode gate. So we need to look at the calculateScenarioReturnValue function to understand what's going on:
double CarsPhysicalProcess::calculateScenarioReturnValue(const double &x_coo,
const double &y_coo, const simtime_t &stime)
{
double retVal = 0.0f;
int i;
double linear_coeff, distance, x, y;
for (i = 0; i < max_num_cars; i++) {
if (sources_snapshots[i][1].time >= stime) {
linear_coeff = (stime - sources_snapshots[i][0].time) /
(sources_snapshots[i][1].time - sources_snapshots[i][0].time);
x = sources_snapshots[i][0].x + linear_coeff *
(sources_snapshots[i][1].x - sources_snapshots[i][0].x);
y = sources_snapshots[i][0].y + linear_coeff *
(sources_snapshots[i][1].y - sources_snapshots[i][0].y);
distance = sqrt((x_coo - x) * (x_coo - x) +
(y_coo - y) * (y_coo - y));
retVal += pow(K_PARAM * distance + 1, -A_PARAM) * car_value;
}
}
return retVal;
}
We start with a sensor return value of 0. Then we loop over every car that is on the road (if you look at the TIMER_SERVICE case statement in the handleMessage function, you will see that CarsPhysicalProcess puts cars on the road randomly according to the car_interarrival rate, up to a maximum of max_num_cars number of cars). For every car, we calculate how far the car has travelled down the road, and then calculate the distance between the car and the node. Then for each car we add to the return value based on the formula:
pow(K_PARAM * distance + 1, -A_PARAM) * car_value
Where distance is the distance we have calculated between the car and the node, K_PARAM = 0.1, A_PARAM = 1 (defined at the top of CarsPhysicalProcess.cc) and car_value is a number specified in the CarsPhysicalProcess.ned parameter file (default is 30).
This value is passed back to the SensorManager. The SensorManager then may change this value depending on the sensitivity, resolution, noise and bias of the sensor (defined as SensorManager parameters):
....
case PHYSICAL_PROCESS_SAMPLING:
{
PhysicalProcessMessage *phyReply = check_and_cast<PhysicalProcessMessage*>(msg);
int sensorIndex = phyReply->getSensorIndex();
double theValue = phyReply->getValue();
// add the sensor's Bias and the random noise
theValue += sensorBias[sensorIndex];
theValue += normal(0, sensorNoiseSigma[sensorIndex], 1);
// process the limitations of the sensing device (sensitivity, resoultion and saturation)
if (theValue < sensorSensitivity[sensorIndex])
theValue = sensorSensitivity[sensorIndex];
if (theValue > sensorSaturation[sensorIndex])
theValue = sensorSaturation[sensorIndex];
theValue = sensorResolution[sensorIndex] * lrint(theValue / sensorResolution[sensorIndex]);
....
So you can see that if the value is below the sensitivity of the sensor, it is raised to the sensitivity value, which acts as a floor.
So basically you can see that there is no specific 'sensing range' in Castalia: it all depends on how the specific PhysicalProcess handles the message. In the case of CarsPhysicalProcess, as long as there is a car on the road, it will always return a value, regardless of the distance. The value just might be very small if the car is a long distance away from the node. If the value is very small, you may receive the lowest sensor sensitivity instead. You could increase or decrease the car_value parameter to get a stronger or weaker response from the sensor (so this is kind of like a sensor range).
EDIT---
The default sensitivity (which you can find in SensorManager.ned) is 0. Therefore for CarsPhysicalProcess, any car on the road at any distance should be detected and returned as a value greater than 0. In other words, there is an unlimited range. If the car is very, very far away it may return a number so small that it is truncated to zero (this depends on the precision limits of a double value in the C++ implementation).
If you wanted to implement a sensing range, you would have to set a value for devicesSensitivity in SensorManager.ned. Then in your application, you would test to see if the returned value is greater than the sensitivity value - if it is, the car is 'in range', if it is (almost) equal to the sensitivity it is out of range. I say almost because (as we have seen earlier) the SensorManager adds noise to the value returned, so for example if you have a sensitivity value of 5, and no cars, you will get values which will hover slightly around 5 (e.g. 5.0001, 4.99)
With a sensitivity value set, to calculate the sensing range (assuming only 1 car on the road), this means simply solving the equation above for distance, using the minimum sensitivity value as the returned value. i.e. if we use a sensitivity value of 5:
5 = pow(K_PARAM * distance + 1, -A_PARAM) * car_value
Substitute values for the parameters, and use algebra to solve for distance.
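Rearranging the formula gives distance = (pow(car_value / sensitivity, 1 / A_PARAM) - 1) / K_PARAM. A minimal sketch of that calculation follows; SensingRange is a hypothetical helper (it is not part of Castalia), and it assumes a single car and ignores sensor noise and bias.

```go
package main

import "math"

// SensingRange is a hypothetical helper that solves
//   sensitivity = carValue * pow(k*distance + 1, -a)
// for distance. It assumes a single car and no sensor noise or bias.
func SensingRange(sensitivity, carValue, k, a float64) float64 {
	if sensitivity <= 0 {
		// With a floor of 0, any car at any distance is above the floor.
		return math.Inf(1)
	}
	if sensitivity >= carValue {
		// Even a car at distance 0 does not exceed the floor.
		return 0
	}
	return (math.Pow(carValue/sensitivity, 1/a) - 1) / k
}
```

With the values quoted above (K_PARAM = 0.1, A_PARAM = 1, car_value = 30) and a sensitivity of 5, this works out to (30 / 5 - 1) / 0.1, i.e. a distance of about 50.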

Costs of new AVX512 instruction - Scatter store

I'm playing around with the new AVX512 instruction sets and I am trying to understand how they work and how one can use them.
What I am trying to do is interleave specific data, selected by a mask.
My little benchmark loads x*32 bytes of aligned data from memory into two vector registers and compresses them using a dynamic mask (fig. 1). The resulting vector registers are scattered into memory, so that the two vector registers are interleaved (fig. 2).
Figure 1: Compressing the two data vector registers using the same dynamically created mask.
Figure 2: Scatter store to interleave the compressed data.
My code looks like the following:
void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) {
/* Initialize a vector register containing zeroes to get the store mask */
__m512i zeroVec = _mm512_setzero_epi32();
/* Load data */
__m512i dataVec_1 = _mm512_load_epi32( data );
__m512i dataVec_2 = _mm512_load_epi32( data + 16 );
/* Compress the data */
__m512i compVec_1 = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 );
__m512i compVec_2 = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 );
/* Get the store mask by comparing the compressed register with the zero-register (4 means !=) */
__mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, 4 );
/* Interleave the selected data */
_mm512_mask_i32scatter_epi32(
result,
maskStore,
vindex,
compVec_1,
1
);
_mm512_mask_i32scatter_epi32(
result + 1,
maskStore,
vindex,
compVec_2,
1
);
}
I compiled everything with
-O3 -march=knl -lmemkind -mavx512f -mavx512pf
I call the method for 100'000'000 elements. To actually get an overview of the behaviour of the scatter store I repeated this measurement with different values for maskCompress.
I expected some kind of dependence between the time needed for execution and the number of set bits within the maskCompress. But I observed, that the tests needed roughly the same time for execution. Here is the result of the performance test:
Figure 3: Results of the measurements. The x-axis represents the number of written elements, depending on maskCompressed. The y-axis shows the performance.
As one can see, the performance gets higher when more data is actually written to memory.
I did a little bit of research and came across this: Instruction latency of avx512. According to that link, the latencies of the instructions used are constant. But to be honest, I am a little bit confused about this behaviour.
Following the answers from Christoph and Peter, I changed my approach a little. Since I have no idea how I can use unpackhi / unpacklo to interleave sparse vector registers, I just combined the AVX512 compress intrinsic with a shuffle (vpermi):
int zip_store_vpermit_cnt(
uint32_t const * const data,
int const compressMask,
uint32_t * const result,
std::ofstream & log
) {
__m512i data1 = _mm512_undefined_epi32();
__m512i data2 = _mm512_undefined_epi32();
__m512i comp_vec1 = _mm512_undefined_epi32();
__m512i comp_vec2 = _mm512_undefined_epi32();
__mmask16 comp_mask = compressMask;
__mmask16 shuffle_mask;
uint32_t store_mask = 0;
__m512i shuffle_idx_lo = _mm512_set_epi32(
23, 7, 22, 6,
21, 5, 20, 4,
19, 3, 18, 2,
17, 1, 16, 0 );
__m512i shuffle_idx_hi = _mm512_set_epi32(
31, 15, 30, 14,
29, 13, 28, 12,
27, 11, 26, 10,
25, 9, 24, 8 );
std::size_t pos = 0;
int pcount = 0;
int fullVec = 0;
for( std::size_t i = 0; i < ELEM_COUNT; i += 32 ) {
/* Loading the current data */
data1 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i]) ) );
data2 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i+16]) ) );
shuffle_mask = _mm512_cmp_epi32_mask( zero, data2, 4 );
/* Interleaving the two vector register, depending on the compressMask */
pcount = 2*( __builtin_popcount( comp_mask ) );
store_mask = std::pow( 2, (pcount) ) - 1;
fullVec = pcount / 17;
comp_vec1 = _mm512_permutex2var_epi32( data1, shuffle_idx_lo, data2 );
_mm512_mask_storeu_epi32( &(result[pos]), store_mask, comp_vec1 );
pos += (fullVec) * 16 + ( ( 1 - ( fullVec ) ) * pcount ); // same as pos += ( pCount >= 16 ) ? 16 : pCount;
_mm512_mask_storeu_epi32( &(result[pos]), (store_mask >> 16) , comp_vec2 );
pos += ( fullVec ) * ( pcount - 16 ); // same as pos += ( pCount >= 16 ) ? pCount - 16 : 0;
//a simple _mm512_store_epi32 produces a segfault, because the memory isn't aligned anymore :(
}
return pos;
}
That way the sparse data within the two vector registers can be interleaved. Unfortunately I have to manually calculate the mask for the store. This seems to be quite expensive. One could use a LUT to avoid the calculation, but I think that is not the way it should be.
Figure 4: Results of the performance test of 4 different kinds of store.
I know that this is not the usual way to ask, but I have 3 questions related to this topic and I am hopeful that someone can help me out.
1. Why does a masked store with only one set bit need the same time as a masked store where all bits are set?
2. Does anyone have experience with the AVX512 scatter store, or know of good documentation for understanding its behaviour?
3. Is there an easier or more performant way to interleave two vector registers?

Distinct number of changes in real time data

Hi, I am taking in data in real time where the value goes from 1009, 1008 or 1007 to 0. I am trying to count the number of distinct times this occurs; for example, the snippet below should count 2 distinct periods of change.
1008
1009
1008
0
0
0
1008
1007
1008
1008
1009
9
0
0
1009
1008
I have written a for loop as below, but I can't figure out if the logic is correct, as I get multiple increments instead of just one.
if(current != previous && current < 100)
x++;
else
x = x;
You tagged this with the LabVIEW tag. Is this actually supposed to be LabVIEW code?
Your logic has a bug related to the noise you say you have - if the value is less than 100 and it changes (for instance from 9 to 0), you log that as a change. You also have a line which doesn't do anything (x=x), although if this is supposed to be LV code, then this could make sense.
The code you posted here does not seem to make sense to me if I understand your goal. My understanding is that you want to identify this specific pattern:
1009
1008
1007
0
And that any deviation from this sequence of numbers would constitute data that should be ignored. To this end, you should be monitoring the history of the past 3 numbers. In C you might write this logic in the following way:
#include <stdio.h>
//Function to get the next value from our data stream.
int getNext(int *value) {
//Variable to hold our return code.
int code;
//Replace following line to get next number from the stream. Possibly read from a file?
*value = 0;
//Replace following logic to set 'code' appropriately.
if(*value == -1)
code = -1;
else
code = 0;
//Return 'code' to the caller.
return code;
}
//Example application for counting the occurrences of the sequence '1009','1008','1007','0'.
int main(int argc, char **argv) {
//Declare 4 items to store the past 4 items in the sequence (x0-x3)
//Declare a count and a flag to monitor the occurrence of our pattern
int x0 = 0, x1 = 0, x2 = 0, x3 = 0, count = 0, occurred = 0;
//Run forever (just as an example, you would provide your own looping structure or embed the algorithm in your app).
while(1) {
//Get the next element (implement getNext to provide numbers from your data source).
//If the function returns non-zero, exit the loop and print the count.
if( getNext(&x0) != 0 )
break;
//If the newest element is 0, we can trigger a check of the prior 3.
if(x0 == 0) {
//occurred is 1 only if the prior elements match our pattern.
occurred = (x1 == 1007) && (x2 == 1008) && (x3 == 1009);
if(occurred) {
//Occurred was 1, meaning the pattern was matched. Increment our count.
count++;
//Reset occurred
occurred = 0;
}
}
//Always shift the elements down our list, so that a run of zeros is only counted once.
x3 = x2; //Shift 3rd element to 4th position
x2 = x1; //Shift 2nd element to 3rd position
x1 = x0; //Shift 1st element to 2nd position
}
printf("The pattern count is %d\n", count);
//Exit application
return 0;
}
Note that the getNext function is just shown here as an example but obviously what I have implemented will not work. This function should be implemented based on how you are extracting data from the stream.
Writing the application in this way might not make sense within your larger application but the algorithm is what you should take away from this. Essentially you want to buffer 4 elements in a rolling window. You push the newest element into x0 and shift the others down. After this process you check the four elements to see if they match your desired pattern and increment the count accordingly.
If the requirement is to count falling edges and you don't care about the specific level, and want to reject noise band or ripple in the steady state then just make the conditional something like
if ((previous - current) > threshold)
No complex shifting, history, or filtering required. Depending on the application you can follow up with a debounce (persistency check) to ignore spurious samples (just keep track of falling/rising, or fell/rose as simple toggling state spanning a desired number of samples).
Code to the pattern, not the specific values; use constant or adjustable parameters to control the value sensitivity.
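A minimal sketch of that threshold check, applied to a whole series of samples. CountFallingEdges is a hypothetical name and the threshold is an adjustable parameter, not a value taken from the question.

```go
package main

// CountFallingEdges counts how many times a sample drops by more than
// threshold relative to the sample immediately before it.
func CountFallingEdges(samples []int, threshold int) int {
	count := 0
	for i := 1; i < len(samples); i++ {
		if samples[i-1]-samples[i] > threshold {
			count++
		}
	}
	return count
}
```

For the sample data in the question and a threshold of 100, this returns 2: the drops 1008 to 0 and 1009 to 9 are counted, while the step from 9 to 0 is too small to register.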

floating point operations anomaly

I am reading the temperature from a TMP36 temperature sensor using an ATmega2560. After reading the sensor's digital values and converting them into readable form on two ATmega2560 microcontrollers, I get different answers. Why do I get these different answers?
The relevant code is below:
float temp; // global variable
{
unsigned long temp_volt;
unsigned char temp_h, temp_l;
unsigned int temp_buf;
temp_l=ADCL;
temp_h=ADCH;
temp_buf=((int)temp_h<<8)|temp_l;
temp_volt =(((unsigned long)temp_buf*256*10)/1023) - 993; // subtract offset gain
temp = ((float)temp_volt*1000/1014*100/196)/10; // adjust the gain
printf("temp_buf: %d, temp_volt: %d, temp: %0.2f\r\n", temp_buf, temp_volt, temp);
}
On one ATmega2560 the output is:
temp_buf: 55, temp_volt: 447, temp: 22.4
On another ATmega2560 the output is:
temp_buf: 53, temp_volt: -861, temp: 0.00
Because of this I made the following adjustment:
temp_volt =(((unsigned long)temp_buf*256*100)/1023) - 904;
Why are the two microcontrollers behaving differently when I am using the same code?
Use the double type for temp_volt and temp_buf so that you don't lose data to integer arithmetic; for example, 7/4 = 1 but 7.0/4.0 = 1.75. A double is also signed, so subtracting the offset from a smaller value gives a negative number instead of wrapping around as it does in an unsigned long.
So,
double temp_volt;
double temp_buf;
and your computations as:
temp_volt = ((temp_buf*256.0*10.0)/1023.0) - 993.0; // subtract offset gain
temp = ((float)temp_volt*1000.0/1014.0*100.0/196.0)/10.0; // adjust the gain
If you need your result as int, then do that in the final step, e.g.
temp_volt = (double)(int)(((temp_buf*256.0*10.0)/1023.0) - 993.0);
