Unexpected results using ImageParam inside a loop with the same input/output buffer - halide

With the code below I noticed unexpected differences for the same input. Sometimes the while loop executes a different number of iterations, even with exactly the same code and input data. I get the same problem when scheduling for CPU and for GPU.
int main() {
int dx, dy;
Buffer<uint8_t> output;
Buffer<int> count = Buffer<int>::make_scalar();
ImageParam param(UInt(8), 2);
Func update_func = build_hysteresis_update(param);
Func count_func = build_hysteresis_count(param);
output = ini_func.realize(dx, dy); // ini_func always returns the same output.
do {
param.set(output);
update_func.realize(output); // --> same input/output buffer. I do it this way because performance is much better.
param.set(output);
count_func.realize(count);
} while (count(0) > 0); // --> The number of iterations differs between runs.
}
Func build_hysteresis_update(ImageParam input) {
RDom rh(-1, 3, -1, 3);
Func inb, update;
Var x, y;
inb = BoundaryConditions::repeat_edge(input);
update(x, y) = cast<uint8_t>(select(inb(x, y) == 1 && sum(cast<uint16_t>(inb(x+rh.x, y+rh.y))) > 9, 254, inb(x, y) == 254, 255, inb(x, y)));
Var xi, yi;
update.tile(x, y, xi, yi, 32, 8).parallel(y).vectorize(xi, 8);
return update;
}
Func build_hysteresis_count(ImageParam input) {
RDom rc(0, input.width(), 0, input.height());
Func count;
count() = sum(select(input(rc.x, rc.y) == 254, 1, 0));
return count;
}
I have tried the two solutions below. Both work, but the performance is not as good. Is there a better way to avoid this problem (it seems to be a race condition) while keeping the original performance? I also tried "device_sync", but it did not solve the problem.
// Solution 1:
Func build_hysteresis_update(ImageParam input) {
...
// Add "aux" and use it in place of "inb" inside the "update" func.
aux(x, y) = inb(x, y);
update(x, y) = cast<uint8_t>(select(aux(x, y) == 1 && sum(cast<uint16_t>(aux(x+rh.x, y+rh.y))) > 9, 254, aux(x, y) == 254, 255, aux(x, y)));
...
// Also schedule the "aux" func with "compute_root".
aux.compute_root();
...
}
// Solution 2:
int main() {
...
do {
// Don't pass the allocated "output" to the "realize" method.
param.set(output);
output = update_func.realize(dx, dy);
...
} while (count(0) > 0);
...
}

While there is a race condition, that's not really the issue here. You're doing a small blur, which has a very different meaning if the input and output are the same buffer. Blurring in-place requires a different sort of algorithm. One solution is double-buffering:
do {
param.set(output1);
update_func.realize(output2);
param.set(output2);
update_func.realize(output1);
...
} while (count(0) > 0);
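To see why the in-place version behaves differently, here is a minimal sketch in plain Java (not Halide) that applies the same update rule to a tiny one-row image. The class and method names are made up for illustration, and the input row is just a convenient test case. The sequential left-to-right order used here is only one possible order; with a parallel schedule the order is not fixed, which would explain the varying iteration counts.

```java
import java.util.Arrays;

public class DoubleBufferDemo {
    // Clamp an index into [0, n-1], mimicking a repeat-edge boundary.
    static int clamp(int v, int n) {
        return Math.max(0, Math.min(n - 1, v));
    }

    // One pass of the update rule, reading from src and writing to dst.
    // Passing the same array for both reproduces the in-place behaviour.
    static void step(int[][] src, int[][] dst) {
        int h = src.length, w = src[0].length;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int sum = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        sum += src[clamp(y + dy, h)][clamp(x + dx, w)];
                    }
                }
                int v = src[y][x];
                if (v == 1 && sum > 9) {
                    dst[y][x] = 254;
                } else if (v == 254) {
                    dst[y][x] = 255;
                } else {
                    dst[y][x] = v;
                }
            }
        }
    }

    static int[][] copy(int[][] a) {
        int[][] c = new int[a.length][];
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i].clone();
        }
        return c;
    }

    // Input and output share one buffer: later pixels see earlier updates.
    static int[][] inPlace(int[][] input) {
        int[][] buf = copy(input);
        step(buf, buf);
        return buf;
    }

    // Separate input and output buffers: every pixel sees the old image.
    static int[][] doubleBuffered(int[][] input) {
        int[][] out = new int[input.length][input[0].length];
        step(copy(input), out);
        return out;
    }

    public static void main(String[] args) {
        int[][] image = {{254, 1, 1, 1}};
        System.out.println(Arrays.deepToString(doubleBuffered(image))); // [[255, 254, 1, 1]]
        System.out.println(Arrays.deepToString(inPlace(image)));        // [[255, 254, 254, 254]]
    }
}
```

In the in-place run the value 254 spreads across the whole row in a single pass, because each pixel reads neighbours that were already rewritten. With two buffers it advances by one pixel per pass, which is what the algorithm as written describes.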

Could this algorithm be made better?

I have a question about an algorithm I wrote to rearrange an array so that all the even numbers are on the left and all the odd numbers are on the right.
Example: Input:
{ 12, 10, 52, 57, 14, 91, 34, 100, 245, 78, 91, 32, 354, 80, 13, 67, 65 }
Output:
{12,10,52,80,14,354,34,100,32,78,91,245,91,57,13,67,65}
Below is the algorithm
public int[] sortEvenAndOdd(int[] combinedArray) {
int endPointer = combinedArray.length - 1;
int startPointer = 0;
for (int i = endPointer; i >= startPointer; i--) {
if (combinedArray[i] % 2 == 0) {
if (combinedArray[startPointer] % 2 != 0) {
int temp = combinedArray[i];
combinedArray[i] = combinedArray[startPointer];
combinedArray[startPointer] = temp;
startPointer = startPointer + 1;
} else {
while (combinedArray[startPointer] % 2 == 0 &&
(startPointer != i)) {
startPointer = startPointer + 1;
}
int temp = combinedArray[i];
combinedArray[i] = combinedArray[startPointer];
combinedArray[startPointer] = temp;
startPointer = startPointer + 1;
}
}
}
return combinedArray;
}
Does anybody have suggestions for making it O(n) or better?
Your code is O(n), but it's a bit more complicated than it needs to be. Here's an improvement.
int startPointer = 0;
int endPointer = a.length - 1;
while (startPointer < endPointer)
{
if (a[startPointer] % 2 != 0)
{
// move endPointer backwards to an even number
while (endPointer > startPointer && a[endPointer] % 2 != 0)
{
--endPointer;
}
// swap the two elements (a no-op if the pointers have met)
int temp = a[startPointer];
a[startPointer] = a[endPointer];
a[endPointer] = temp;
}
++startPointer;
}
By the way, the operation is more of a partition than a sort. I think a better function name would be partitionEvenOdd.
Make two queues, one for even numbers and one for odd numbers. Push each number into its respective queue as you read it. When all the numbers have been processed, traverse the even queue and push its elements into the answer vector, then do the same with the odd queue. This is an O(n) solution. I hope I have explained it clearly.
Sorry for my English.
I can post an implementation if you want, but you should try it yourself first.
You can't do it better than O(n) time, but you can make your code more concise.
Looking at your solution, since order of elements doesn't matter, you can simply keep a pointer variable which goes from last to first and keep swapping elements with this pointer.
Snippet:
private static void solve(int[] arr){
for(int i=arr.length-1,ptr = i;i>=0;--i){
if((arr[i] & 1) == 1){
swap(arr,i,ptr--);
}
}
}
private static void swap(int[] arr,int x,int y){
int temp = arr[x];
arr[x] = arr[y];
arr[y] = temp;
}
Demo: https://onlinegdb.com/HyKNVMbwL
If the order of the elements matters:
You can collect all odd ones in a new list.
Move all even ones to the left.
Assign all odd ones one by one from the list to the array from where even ones ended.
This will increase space complexity to O(n).
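The order-preserving steps above can be sketched as follows. This is only one way to write it; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StablePartition {
    // Moves even numbers to the front and odd numbers to the back,
    // keeping the original relative order within each group.
    static int[] partitionEvenOdd(int[] arr) {
        List<Integer> odds = new ArrayList<>();
        int write = 0;
        for (int value : arr) {
            if (value % 2 == 0) {
                // write never runs ahead of the read position,
                // so no unread element is overwritten
                arr[write++] = value;
            } else {
                odds.add(value);
            }
        }
        // append the collected odd numbers after the last even one
        for (int value : odds) {
            arr[write++] = value;
        }
        return arr;
    }

    public static void main(String[] args) {
        int[] input = {12, 10, 52, 57, 14, 91, 34, 100, 245, 78, 91, 32, 354, 80, 13, 67, 65};
        System.out.println(Arrays.toString(partitionEvenOdd(input)));
    }
}
```

It makes one pass over the array plus one pass over the odd list, so the time stays O(n) and the extra space is O(n) in the worst case.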

Matrix Text rain effect in Processing 3.3

I'm working on making a matrix text rain effect in Processing 3.3 as a simple starter project for learning the processing library and Java. My code so far:
class Symbol {
int x, y;
int switchInterval = round(random(2, 50));
float speed;
char value;
Symbol(int x, int y, float speed) {
this.x = x;
this.y = y;
this.speed = speed;
}
//Sets to random symbol based on the Katakana Unicode block
void setToRandomSymbol() {
if(frameCount % switchInterval == 0) {
value = char((int) random(0x30A0, 0x3100));
}
}
//rains the characters down the screen and loops them to the top when they
// reach the bottom of the screen
void rain() {
if(y <= height) {
y += speed;
}else {
y = 0;
}
}
}
Symbol symbol;
class Stream {
int totalSymbols = round(random(5, 30));
Symbol[] symbols = new Symbol[500];
float speed = random(5, 20);
//generates the symbols and adds them to the array, each symbol one symbol
//height above the one previous
void generateSymbols() {
int y = 0;
int x = width / 2;
for (int i = 0; i <= totalSymbols; i++) {
symbols[i] = new Symbol(x, y, speed);
symbols[i].setToRandomSymbol();
y -= symbolSize;
}
}
void render() {
for(Symbol s : symbols) {
fill(0, 255, 70);
s.setToRandomSymbol();
text(s.value, s.x, s.y);
s.rain();
}
}
}
OK, that was a lot of code, so let me explain my dilemma. When I run the code I get a NullPointerException at the s.setToRandomSymbol(); call in the for-each loop in the render function. The part I don't understand is that the exception is thrown on a method that takes no arguments that could be null, and the method itself is void, so it shouldn't be returning anything, right? Why is this returning null, and what did I do wrong?
First you come up with a random number between 5 and 30:
int totalSymbols = round(random(5, 30));
Then you create an array that holds 500 instances of your Symbol class:
Symbol[] symbols = new Symbol[500];
Note that this array holds 500 null values at this point.
Then you add a maximum of 31 instances of Symbol to your array (the loop condition is i <= totalSymbols, so it runs totalSymbols + 1 times):
for (int i = 0; i <= totalSymbols; i++) {
symbols[i] = new Symbol(x, y, speed);
Note that this array still holds at least 469 null values at this point.
Then you iterate over all 500 indexes:
for(Symbol s : symbols) {
s.setToRandomSymbol();
But remember that at least 469 of these indexes are null, which is why you're getting a NullPointerException.
Some basic debugging would have told you all of this. I would have started by adding a basic println() statement just before you get the error:
for(Symbol s : symbols) {
println("s: " + s);
s.setToRandomSymbol();
This would have shown you that you're iterating over null values.
Anyway, to fix your problem you need to stop iterating over your entire array, or you need to stop making room for indexes you never use.
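Both options can be sketched in plain Java (outside Processing). The Symbol class here is a stripped-down stand-in, and SYMBOL_SIZE is an assumed value because symbolSize is not shown in the question.

```java
import java.util.ArrayList;
import java.util.List;

public class StreamFix {
    // Minimal stand-in for the question's Symbol class (hypothetical).
    static class Symbol {
        final int y;
        Symbol(int y) { this.y = y; }
    }

    // Assumed value; the question does not show how symbolSize is defined.
    static final int SYMBOL_SIZE = 20;

    // Option 1: allocate exactly as many slots as will be filled,
    // so iterating over the whole array never meets a null.
    static Symbol[] generateArray(int totalSymbols) {
        Symbol[] symbols = new Symbol[totalSymbols];
        for (int i = 0; i < symbols.length; i++) {
            symbols[i] = new Symbol(-i * SYMBOL_SIZE);
        }
        return symbols;
    }

    // Option 2: use a list, which only contains what was added to it.
    static List<Symbol> generateList(int totalSymbols) {
        List<Symbol> symbols = new ArrayList<>();
        for (int i = 0; i < totalSymbols; i++) {
            symbols.add(new Symbol(-i * SYMBOL_SIZE));
        }
        return symbols;
    }

    public static void main(String[] args) {
        for (Symbol s : generateArray(7)) {
            System.out.println("s.y = " + s.y); // s is never null here
        }
        System.out.println(generateList(7).size() + " symbols in the list");
    }
}
```

A third option is to keep the oversized array and loop with an index up to the number of symbols you actually created instead of using a for-each loop.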
In the future, please try to narrow your problem down to an MCVE before posting. Note that this much smaller example program shows your error:
String[] array = new String[10];
array[0] = "test";
for(String s : array){
println(s.length());
}

How to set up if statements so that loop goes forward and then in reverse? Processing

int x = 31;
int y = 31;
int x_dir = 4;
int y_dir = 0;
void setup ()
{
size (800, 800);
}
void draw ()
{
background (150);
ellipse (x,y,60, 60);
if (x+30>=width)
{
x_dir =-4;
y_dir = 4;
}
if (y+30>=height)
{
x_dir=4;
y_dir = 0;
}
if (x+30>=width)
{
x_dir = -4;
}
x+=x_dir;
y+=y_dir;
println(x,y);
}
Hi,
I have to create this program in processing which produces an animation of a ball going in a Z pattern (top left to top right, diagonal top right to bottom left, and then straight from bottom left to bottom right) which then goes backwards along the same path it came.
While I have the code written out for the forward direction, I don't know what 2 if or else statements I need to write for the program so that based on one condition it goes forwards, and based on another condition it will go backwards, and it will continue doing so until it terminates.
If I am able to figure out which two if statements I need to write, all I need to do is copy and reverse the x_dir and y_dir signs on the forward loop.
There are a ton of different ways you can do this.
One approach is to keep track of which "mode" you're in. You could do this using an int variable that's 0 when you're on the first part of the path, 1 when you're on the second part of the path, etc. Then just use an if statement to decide what to do, how to move the ball, etc.
Here's an example:
int x = 31;
int y = 31;
int mode = 0;
void setup ()
{
size (800, 800);
}
void draw ()
{
background (150);
ellipse (x, y, 60, 60);
if (mode == 0) {
x = x + 4;
if (x+30>=width) {
mode = 1;
}
} else if (mode == 1) {
x = x - 4;
y = y + 4;
if (y+30>=height) {
mode = 2;
}
} else if (mode == 2) {
x = x + 4;
if (x+30>=width) {
mode = 3;
}
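} else if (mode == 3) {
// retrace the bottom edge, right to left
x = x - 4;
if (x <= 31) {
mode = 4;
}
} else if (mode == 4) {
// retrace the diagonal, bottom left to top right
x = x + 4;
y = y - 4;
if (y <= 31) {
mode = 5;
}
} else if (mode == 5) {
// retrace the top edge, right to left, then start over
x = x - 4;
if (x <= 31) {
mode = 0;
}
}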
}
Like I said, this is only one way to approach the problem, and there are some obvious improvements you could make. For example, you could store the movement speeds and the conditions that change the mode in an array (or better yet, in objects) and get rid of all of the if statements.
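As a rough sketch of that improvement, written in plain Java so it runs outside Processing: each leg of the path is an object holding a velocity and a condition that ends the leg, and the update step just looks up the current leg. The names ZPath, Leg and Done are invented for this example, and the constants mirror the 800x800 window, radius 30 and start position 31 used above.

```java
public class ZPath {
    static final int SIZE = 800;  // window width and height
    static final int R = 30;      // ball radius
    static final int START = 31;  // starting x and y
    static final int STEP = 4;    // pixels moved per frame

    // Condition that ends a leg of the path.
    interface Done {
        boolean test(int x, int y);
    }

    // One leg: a per-frame velocity plus the condition that ends it.
    static final class Leg {
        final int dx, dy;
        final Done done;
        Leg(int dx, int dy, Done done) {
            this.dx = dx;
            this.dy = dy;
            this.done = done;
        }
    }

    static final Leg[] LEGS = {
        new Leg( STEP,     0, (x, y) -> x + R >= SIZE), // top edge, left to right
        new Leg(-STEP,  STEP, (x, y) -> y + R >= SIZE), // diagonal down to bottom left
        new Leg( STEP,     0, (x, y) -> x + R >= SIZE), // bottom edge, left to right
        new Leg(-STEP,     0, (x, y) -> x <= START),    // back along the bottom edge
        new Leg( STEP, -STEP, (x, y) -> y <= START),    // back up the diagonal
        new Leg(-STEP,     0, (x, y) -> x <= START),    // back along the top edge
    };

    int x = START;
    int y = START;
    int mode = 0;

    // Call once per frame: move along the current leg and
    // switch to the next leg when its end condition is met.
    void update() {
        Leg leg = LEGS[mode];
        x += leg.dx;
        y += leg.dy;
        if (leg.done.test(x, y)) {
            mode = (mode + 1) % LEGS.length;
        }
    }

    public static void main(String[] args) {
        ZPath ball = new ZPath();
        for (int frame = 0; frame < 1110; frame++) {
            ball.update();
        }
        System.out.println(ball.x + ", " + ball.y + ", mode " + ball.mode);
    }
}
```

In a Processing sketch the body of update() would go in draw(), and adding or reordering legs only means editing the LEGS table.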

Swift 2.0 compiler error #"Use of unresolved identifier"

I have written a program to find the greatest number in an array, using an Xcode 7.0 playground and Swift 2.0:
var arrofNumbare:Array = [1065782,4234,234,23,234,234,23,443,3978909,234000990];
var n:Int = 0
var greatestnumbar : Int = arrofNumbare[0]
while (n < arrofNumbare.count - 1)
{
greatestnumbar = greatest(greatestnumbar, b:arrofNumbare[n+1])
n++;
}
print("\(greatestnumbar)")
func greatest(a:Int , b:Int) -> Int
{
if(a > b)
{
return a
}
else
{
return b
}
}
Everything else works properly, but I get the error from the title ("Use of unresolved identifier") when calling the method.
Playgrounds run the code written in them sequentially.
The trouble you are having is that you have defined a function after the place where it was used.
If you move your function to above where it is called, then the error goes away.
var arrofNumbare:Array = [1065782, 4234, 234, 23, 234, 234, 23, 443, 3978909, 234000990];
var n: Int = 0
var greatestnumbar : Int = arrofNumbare[0]
func greatest(a:Int , b:Int) -> Int
{
if(a > b)
{
return a
}
else
{
return b
}
}
while (n < arrofNumbare.count - 1)
{
greatestnumbar = greatest(greatestnumbar, b:arrofNumbare[n+1])
n++;
}
print("\(greatestnumbar)")
Edited to add
Although not part of your question, here is a better way of doing what you want in Swift; using the reduce function
let array: [Int] = [1065782, 4234, 234, 23, 234, 234, 23, 443, 3978909, 234000990]
let maximumValue = array.reduce(Int.min) { (accumulator, value) -> Int in
return max(accumulator, value)
}
The reduce function takes an initial value for an accumulator and a closure that is applied to the accumulator and the next value in the array. After the closure has run on each element of the array, the accumulator value is returned. In the closure above I am just storing the larger of the accumulator and the next value back into the accumulator.
This is a much clearer way of getting the maximum value.
It's also easier to reason about what your code is doing, because the mechanics of iterating through the array and updating the initial value is all taken care of. You can see that in your attempt, most of the code is related to iterating through the array.
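For comparison, the same fold can be sketched in Java, where streams offer a reduce with the same shape: an initial accumulator value and a function that combines the accumulator with the next element. The class and method names are made up for this example.

```java
import java.util.Arrays;

public class MaxByReduce {
    // Folds the array into a single value, keeping the larger of
    // the accumulator and the next element at every step.
    static int maximum(int[] values) {
        return Arrays.stream(values).reduce(Integer.MIN_VALUE, Math::max);
    }

    public static void main(String[] args) {
        int[] numbers = {1065782, 4234, 234, 23, 234, 234, 23, 443, 3978909, 234000990};
        System.out.println(maximum(numbers)); // prints 234000990
    }
}
```

As in the Swift version, an empty array yields the initial value (here Integer.MIN_VALUE).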

Cuda Bayer/CFA demosaicing example

I've written a CUDA 4 Bayer demosaicing routine, but it's slower than single-threaded CPU code when running on a 16-core GTS250.
Blocksize is (16,16) and the image dims are a multiple of 16 - but changing this doesn't improve it.
Am I doing anything obviously stupid?
--------------- calling routine ------------------
uchar4 *d_output;
size_t num_bytes;
cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);
cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes, cuda_pbo_resource);
// Do the conversion, leave the result in the PBO for display
kernel_wrapper( imageWidth, imageHeight, blockSize, gridSize, d_output );
cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);
--------------- cuda -------------------------------
texture<uchar, 2, cudaReadModeElementType> tex;
cudaArray *d_imageArray = 0;
__global__ void convertGRBG(uchar4 *d_output, uint width, uint height)
{
uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;
uint i = __umul24(y, width) + x;
// input is GR/BG output is BGRA
if ((x < width) && (y < height)) {
if ( y & 0x01 ) {
if ( x & 0x01 ) {
d_output[i].x = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2; // B
d_output[i].y = (tex2D(tex,x,y)); // G in B
d_output[i].z = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2; // R
} else {
d_output[i].x = (tex2D(tex,x,y)); //B
d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4; // G
d_output[i].z = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4; // R
}
} else {
if ( x & 0x01 ) {
// odd col = R
d_output[i].x = (tex2D(tex,x+1,y+1) + tex2D(tex,x+1,y-1)+tex2D(tex,x-1,y+1)+tex2D(tex,x-1,y-1))/4; // B
d_output[i].z = (tex2D(tex,x,y)); //R
d_output[i].y = (tex2D(tex,x+1,y) + tex2D(tex,x-1,y)+tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/4; // G
} else {
d_output[i].x = (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2; // B
d_output[i].y = (tex2D(tex,x,y)); // G in R
d_output[i].z = (tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2; // R
}
}
}
}
void initTexture(int imageWidth, int imageHeight, uchar *imagedata)
{
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);
cutilSafeCall( cudaMallocArray(&d_imageArray, &channelDesc, imageWidth, imageHeight) );
uint size = imageWidth * imageHeight * sizeof(uchar);
cutilSafeCall( cudaMemcpyToArray(d_imageArray, 0, 0, imagedata, size, cudaMemcpyHostToDevice) );
cutFree(imagedata);
// bind array to texture reference with point sampling
tex.addressMode[0] = cudaAddressModeClamp;
tex.addressMode[1] = cudaAddressModeClamp;
tex.filterMode = cudaFilterModePoint;
tex.normalized = false;
cutilSafeCall( cudaBindTextureToArray(tex, d_imageArray) );
}
There aren't any obvious bugs in your code, but there are several obvious performance opportunities:
1) for best performance, you should use texture to stage into shared memory - see the 'SobelFilter' SDK sample.
2) As written, the code is writing bytes to global memory, which is guaranteed to incur a large performance hit. You can use shared memory to stage results before committing them to global memory.
3) There is a surprisingly big performance advantage to sizing blocks in a way that matches the hardware's texture cache attributes. On Tesla-class hardware, the optimal block size for kernels using the same addressing scheme as your kernel is 16x4 (64 threads per block).
For workloads like this, it may be hard to compete with optimized CPU code. SSE2 can do 16 byte-sized operations in a single instruction, and CPUs are clocked about 5 times as fast.
Based on an answer on the Nvidia forums, here (for the search engines) is a slightly more optimised version that writes a 2x2 block of pixels in each thread, although the difference in speed isn't measurable on my setup.
Note that it should be called with a grid size half that of the image in each dimension:
dim3 blockSize(16, 16); // for example
dim3 gridSize((width/2) / blockSize.x, (height/2) / blockSize.y);
__global__ void d_convertGRBG(uchar4 *d_output, uint width, uint height)
{
uint x = 2 * (__umul24(blockIdx.x, blockDim.x) + threadIdx.x);
uint y = 2 * (__umul24(blockIdx.y, blockDim.y) + threadIdx.y);
uint i = __umul24(y, width) + x;
// input is GR/BG output is BGRA
if ((x < width-1) && (y < height-1)) {
// x+1, y+1:
d_output[i+width+1] = make_uchar4( (tex2D(tex,x+2,y+1)+tex2D(tex,x,y+1))/2, // B
(tex2D(tex,x+1,y+1)), // G in B
(tex2D(tex,x+1,y+2)+tex2D(tex,x+1,y))/2, // R
0xff);
// x, y+1:
d_output[i+width] = make_uchar4( (tex2D(tex,x,y+1)), //B
(tex2D(tex,x+1,y+1) + tex2D(tex,x-1,y+1)+tex2D(tex,x,y+2)+tex2D(tex,x,y))/4, // G
(tex2D(tex,x+1,y+2) + tex2D(tex,x+1,y)+tex2D(tex,x-1,y+2)+tex2D(tex,x-1,y))/4, // R
0xff);
// x+1, y:
d_output[i+1] = make_uchar4( (tex2D(tex,x,y-1) + tex2D(tex,x+2,y-1)+tex2D(tex,x,y+1)+tex2D(tex,x+2,y+1))/4, // B
(tex2D(tex,x+2,y) + tex2D(tex,x,y)+tex2D(tex,x+1,y+1)+tex2D(tex,x+1,y-1))/4, // G
(tex2D(tex,x+1,y)), //R
0xff);
// x, y:
d_output[i] = make_uchar4( (tex2D(tex,x,y+1)+tex2D(tex,x,y-1))/2, // B
(tex2D(tex,x,y)), // G in R
(tex2D(tex,x+1,y)+tex2D(tex,x-1,y))/2, // R
0xff);
}
}
There are many ifs and elses in the code. If you restructure the code to eliminate all the conditional statements, you will get a huge performance boost, as branching is a performance killer. It is indeed possible to remove the branches. There are exactly 30 cases which you will have to code explicitly. I have implemented it on the CPU, and it does not contain any conditional statements. I am thinking of writing a blog post explaining it, and will post it once it's done.
