Optimizing GPU allocation/transfer of matrix tiles - performance

I am working with very large matrices (>1GB) but imagine that I have the following matrix:
A = [1 1 2 2;
1 1 2 2;
3 3 4 4;
3 3 4 4]
I need to pin each tile of the previous matrix so that I can transfer them to the GPU asynchronously (using the CUDA.jl package).
The following code allocates the space for each tile on the GPU, and it works:
using CUDA

function allocGPU!(gpu_buf, m, n)
    dev_buf = CUDA.Mem.alloc(CUDA.Mem.DeviceBuffer, m*n*8)
    dev_ptr = convert(CuPtr{Float64}, dev_buf)
    push!(gpu_buf, dev_buf)
    tile_gpu = unsafe_wrap(CuArray{Float64}, dev_ptr, (m, n))
    return tile_gpu
end
A_coor = [(1:2,1:2) (1:2,3:4);
          (3:4,1:2) (3:4,3:4)]
A_tiles = [A[A_coor[i,j][1], A_coor[i,j][2]] for i=1:size(A_coor,1), j=1:size(A_coor,2)]
m, n = 2, 2   # tile size
gpu_buf = []
A_tiles_gpu = [allocGPU!(gpu_buf, m, n) for i=1:size(A_tiles,1), j=1:size(A_tiles,2)]
But it's copying each tile into a new object, which takes more time than I would like. Is there any way to wrap an Array around each tile in order to reduce the number of allocations?
I also tried this line:
A_tiles = [unsafe_wrap(Array{Float64}, pointer(A[A_coor[i,j][1], A_coor[i,j][2]]), (m,n)) for i=1:size(A_coor,1), j=1:size(A_coor,2)]
I also thought of pinning matrix A and then transferring to the GPU as:
copyto!(tile_gpu, A[1:2,1:2])
but I'm guessing Julia will copy A[1:2,1:2] into a new object and then transfer the tile, yielding the same result as the first method.
Edit:
As I suspected,
copyto!(tile_gpu, A[1:2,1:2])
creates a new object in a different memory location. I also tried the @view macro; although it works on the CPU, it doesn't seem to work with copyto! to GPU memory.
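A minimal sketch of one workaround (hypothetical variable names; tile size m = n = 2 as in the toy matrix above): keep each tile as its own contiguous Array from the start, so copyto! has a contiguous host source and no temporary copy of A[1:2,1:2] is ever made. Whether the host side can then be pinned for truly async transfers depends on the CUDA.jl version in use.

using CUDA

m, n = 2, 2
A_tiles     = [Array{Float64}(undef, m, n) for i in 1:2, j in 1:2]   # fill these instead of one big A
A_tiles_gpu = [CuArray{Float64}(undef, m, n) for i in 1:2, j in 1:2]
for idx in eachindex(A_tiles)
    copyto!(A_tiles_gpu[idx], A_tiles[idx])   # contiguous host source, one device allocation per tile
end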

Related

Divide an image into non-overlapping blocks and apply the 2D DWT to each block

I am working on creating image splicing detection software, so I need to divide the image into non-overlapping blocks and apply the discrete Meyer wavelet transform to each block of the image.
I have tried the blockproc function to do that, but I got no result:
I = imread('pears.png');
fun = @(block_struct)...
dwt2(block_struct.data,'dmey');
C = blockproc(I,[64 64],fun);
So how can I access the [cA,cH,cV,cD] of dwt2 using the above code?
blockproc assumes that you are outputting an actual image. You cannot use this for multiple outputs. If you truly want this to work with blockproc, you will unfortunately need to call blockproc four times, with each time extracting the different set of coefficients for the directions. Also note that the 2D DWT only works for grayscale images, so you need to convert to grayscale before actually doing any processing. The pears image you've chosen is a colour / RGB image.
I'd like to reference this post on how to select the Nth output given an input function: How do I get the second return value from a function without using temporary variables?. You will need to save this code to a file called nth_output.m, which allows you to programmatically extract all output variables from a function and choose only one output.
function value = nth_output(N,fcn,varargin)
[value{1:N}] = fcn(varargin{:});
value = value{N};
end
Simply omitting the extra output arguments when you call the function only gives you the first output, which is what your blockproc code is doing. Once you do that, it's a matter of creating 4 anonymous functions to capture each output from dwt2 and running blockproc 4 times. Make sure you specify which output you want for each of the anonymous functions (1 up to 4), and simply provide a handle to the function you want to run in addition to the input arguments that go into the function.
Therefore, try something like this:
I = rgb2gray(imread('pears.png'));
fun1 = @(block_struct) nth_output(1, @dwt2, block_struct.data,'dmey');
fun2 = @(block_struct) nth_output(2, @dwt2, block_struct.data,'dmey');
fun3 = @(block_struct) nth_output(3, @dwt2, block_struct.data,'dmey');
fun4 = @(block_struct) nth_output(4, @dwt2, block_struct.data,'dmey');
cA = blockproc(I,[64 64],fun1);
cH = blockproc(I,[64 64],fun2);
cV = blockproc(I,[64 64],fun3);
cD = blockproc(I,[64 64],fun4);
cA, cH, cV, and cD contain the DWT coefficients you need for each set of directions.

Deep Learning Dataset Design for an Image that Refers to Different Classes?

I want to train an image classifier using the Inception model.
Now, I have a dish called chicken rice.
Suppose I want to create a rice class and a chicken meat class.
So can I design the output ground-truth probability as [0.5, 0.5, 0, 0, 0, ...]?
In other words, if the target image contains content from two classes, what should I do to make it reasonable?
Has somebody tried this?
I have tried to train on the images separately, and Google did this too.
import sys
import numpy as np
import h5py

keycnt = 0
imagcnt = 0
TestNumber_byclass = np.zeros([keycount], np.int32)
for key in TestKeys:
    TestNumber_byclass[keycnt] = len(json_data_test[key])
    for imagedata in json_data_test[key]:
        imgdata = tf_resize_images(imagdir + imagedata + '.jpg')
        imgdata = np.array(imgdata, dtype=np.uint8)
        # make image center at 0 in the range of (-1,1]
        # imgdata = (imgdata - mean - 128) / 128
        h5f = h5py.File(h5filedir_test + str(imagcnt) + ".h5", "w")
        h5f.create_dataset('image', data=imgdata)
        h5f.create_dataset('label', data=keycnt)
        h5f.create_dataset('name', data=key)
        h5f.close()
        imagcnt = imagcnt + 1
    keycnt = keycnt + 1
    message = '\r[%d/%d] progress...' % (keycnt, keycount)
    sys.stdout.write(message)
    sys.stdout.flush()
Many thanks.
What you're trying to do is multi-label classification, where M out of N classes may be present at the same time. This is usually done by setting a class's flag to 1 if the object appears in the image and setting it to 0 if it does not.
The really important piece of information is that the last activation function needs to be a sigmoid instead of a softmax. That way you decouple the confidence for each class from the other classes and the sum will be between 0 and N.
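As a rough illustration only (the class list, feature shape, and the Keras-style head below are assumptions, not part of the original training code), the label for a chicken-rice image becomes a multi-hot vector, and the classifier head ends in a sigmoid trained with binary cross-entropy:

import numpy as np
import tensorflow as tf

classes = ['rice', 'chicken', 'beef', 'noodles']            # hypothetical class names
label = np.array([1.0, 1.0, 0.0, 0.0], dtype=np.float32)    # "chicken rice" contains rice and chicken

head = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling2D(input_shape=(8, 8, 2048)),  # e.g. on top of Inception features
    tf.keras.layers.Dense(len(classes), activation='sigmoid'),         # sigmoid, not softmax
])
head.compile(optimizer='adam', loss='binary_crossentropy')             # one independent flag per class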

Scala: Iterate over 20k images once, crop 100 ways to make 100 Iterators

requirement
I need to iterate over images, splitting each into 100 blocks (ROIs) and calculating something independently per block. I can't store anything other than the file paths in a list in memory, and I can't perform disk IO more than once. Performance is more important than simplicity here. How do I build 100 iterators while iterating over images?
code
I've written this a few ways but always get a StackOverflowError after ~5 hours (should finish in under 20 minutes).
The following is the way that made the most sense to me: Iterate over an in-memory list of paths and build a map of iterators.
def calcAll(run: ImageBase, rois: Traversable[Roi]): Map[Roi, TraversableOnce[T]] = {
  val results: mutable.Map[Roi, Iterator[T]] = emptyMutableMap(rois)
  // calculate the feature between every two frames
  var prevImage: RichImage = null // it'll be ok, I promise
  for (frame <- ImageStore.walk(run)) { // iterates over nio Path objects
    val image = RichImages.of(frame)
    if (prevImage != null) for (roi <- rois) {
      val next: Iterator[T] = calc(Iterator(prevImage.crop(roi), image.crop(roi)))
      results(roi) = results(roi) ++ next // StackOverflowError!!
    }
    prevImage = image
  }
  results.toMap // immutable
}
background
I have a directory of 20k grayscale frames from a video. The video has a set of 100 Regions of Interest (ROIs), non-overlapping rectangles that we care about. I need to calculate features between consecutive images, but independently for each ROI. The amount of data and number of ROIs prohibits reading an image more than once.
I believe you need something similar to this:
def calcAll(run: ImageBase, rois: Seq[Roi]): Traversable[Map[Roi, T]] = {
  ImageStore.walk(run).map(RichImages.of).sliding(2).map {
    case Seq(image1, image2) =>
      rois.map(roi => roi -> calc(Iterator(image1.crop(roi), image2.crop(roi)))).toMap
  }
}
Given that ImageStore.walk returns an Iterator or Traversable, this code will load each image only once and won't have to store more than two images in memory at a time.
This gives you a single iterator, though. Having 100 iterators would require either storing all images in memory or traversing them 100 times. So, unfortunately, I believe you'd have to make do with a Traversable[Map[Roi, T]].
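If you do need the results keyed by ROI afterwards, a minimal sketch (hypothetical Roi and T types; it assumes the computed T values, unlike the images, fit in memory) is to materialise the single pass and regroup it:

val perPair: Traversable[Map[Roi, T]] = calcAll(run, rois)
val perRoi: Map[Roi, Seq[T]] =
  perPair.toSeq                 // materialise the per-frame-pair maps (not the images)
    .flatMap(_.toSeq)           // Seq[(Roi, T)]
    .groupBy(_._1)              // Map[Roi, Seq[(Roi, T)]]
    .map { case (roi, pairs) => roi -> pairs.map(_._2) }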

RGB to norm rgb transformation. Vectorizing

I'm writing a piece of code that has to transform an RGB image into a normalized rgb space. I've got it working with a for-loop version, but it runs too slowly and I need to evaluate lots of images. I'm trying to vectorize the full function in order to speed it up. What I have at the moment is the following:
R = im(:,:,1);
G = im(:,:,2);
B = im(:,:,3);
r=reshape(R,[],1);
g=reshape(G,[],1);
b=reshape(B,[],1);
clear R G B;
VNormalizedRed = r(:)/(r(:)+g(:)+b(:));
VNormalizedGreen = g(:)/(r(:)+g(:)+b(:));
VNormalizedBlue = b(:)/(r(:)+g(:)+b(:));
NormalizedRed = reshape(VNormalizedRed,height,width);
NormalizedGreen = reshape(VNormalizedGreen,height,width);
NormalizedBlue = reshape(VNormalizedBlue,height,width);
The main problem is that when it arrives at VNormalizedRed = r(:)/(r(:)+g(:)+b(:)); it displays an out-of-memory error (which is really strange, because I just freed three vectors of the same size). Where is the error? (solved)
Is it possible to do the same process in a more efficient way?
Edit:
After using Martin's suggestions I found that the reshape function was not necessary, and I was able to do the same with simpler code:
R = im(:,:,1);
G = im(:,:,2);
B = im(:,:,3);
NormalizedRed = R(:,:)./sqrt(R(:,:).^2+G(:,:).^2+B(:,:).^2);
NormalizedGreen = G(:,:)./sqrt(R(:,:).^2+G(:,:).^2+B(:,:).^2);
NormalizedBlue = B(:,:)./sqrt(R(:,:).^2+G(:,:).^2+B(:,:).^2);
norm(:,:,1) = NormalizedRed(:,:);
norm(:,:,2) = NormalizedGreen(:,:);
norm(:,:,3) = NormalizedBlue(:,:);
I believe you want
VNormalizedRed = r(:)./(r(:)+g(:)+b(:));
Note the dot in front of the /, which specifies an element-by-element divide. Without the dot, you're solving a system of equations -- which is likely not what you want to do. This probably also explains why you're seeing the high memory consumption.
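For a quick illustration of the difference (toy vectors, not from the question):

a = [2 4 6]; b = [1 2 3];
a ./ b   % element-by-element divide: [2 2 2]
a / b    % solves x*b = a in a least-squares sense, giving the scalar 2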
Your entire first code can be rewritten in one vectorized line:
im_normalized = bsxfun(@rdivide, im, sum(im,3,'native'));
Your second slightly modified version as:
im_normalized = bsxfun(@rdivide, im, sqrt(sum(im.^2,3,'native')));
BTW, you should be aware of the data type used for the image, otherwise one can get unexpected results (due to integer division for example). Therefore I would convert the image to double before performing the normalization calculations:
im = im2double(im);
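Putting the two points together, a minimal sketch (using one of MATLAB's sample images purely as an example) would be:

im = im2double(imread('peppers.png'));           % convert to double first
im_normalized = bsxfun(@rdivide, im, sum(im,3)); % then normalize in one vectorized step

With a double image, the 'native' option to sum is no longer needed.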

Animating with Ruby

This relates both to physical computing and to Ruby running on a web server. I have an array of RGB LEDs, which is 5x5, so 25 LEDs in total. They are numbered and individually addressable as follows:
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
As for the hardware (which really isn't important, because it works fine), the system consists of 25 BlinkMs, an Arduino, and various cabling and connectors.
The LEDs are sent commands over serial with a command like this:
#sp.write ["\x01", led, "\x04\x00", "c", color]
This writes the byte array out to serial using Ruby's SerialPort gem. The variables "led" and "color" are substituted with the hex of each; for example, if I wanted to make LED number 8 turn red, my output would read:
#sp.write ["\x01","\x08", "\x04\x00", "c", "\xff\x00\x00"]
So far all of this works wonders, and I'm really happy with what I have. My question relates pretty much to general mathematics and simple programming, but somehow the implementation goes over my head.
Here is a sample of such an animation. Mostly I'm interested in how one could animate patterns using Ruby here. I recall certain Processing animation scripts that just loop over a function, using the array as an object and affecting its elements, creating interesting animations purely from the mathematics of the output.
Does anyone have any idea how I could get started with something like that? I'm currently able to affect the LEDs one at a time with my script, and I can string commands together with sleep x after each one to manually build animations, but how could I make one run indefinitely with some sort of procedural animation?
EDIT
I really didn't describe the byte array in its entirety; here is what each part does:
#sp.write ["\x01", led, "\x04\x00", "c", color]
^ ^ ^ ^ ^ ^
a b c d e f
a. start byte (not important, tells serial that it is the start of a command)
b. hex for the LED address, e.g. `\x07` is LED 7
c. length of command (starting at "e")
d. bytes to be read (always 0 in our case)
e. the "fade to color" command
f. the color we want to fade to in rrggbb hex format.
It should be easy to map your LEDs to a 2D array:
@led = []
led = 1
5.times do |y|
  5.times do |x|
    @led[x] ||= []
    @led[x][y] = led
    led += 1
  end
end
I'd probably make an LED class that encapsulates the ability to write out colors, so instead of this:
@led[x][y] = led
it becomes
@led[x][y] = Led.new(:id => led)
And then write a method so you can easily do something like this:
@led[1][5].color(255, 255, 255)
or whatever.
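A minimal sketch of such an Led class (hypothetical; it reuses @sp and the byte layout from the question):

class Led
  def initialize(opts)
    @id   = opts[:id]
    @port = opts[:port]   # the open SerialPort, e.g. @sp from the question
  end

  def color(r, g, b)
    # same byte layout as the question: start byte, address, length, reads, 'c' (fade to color), rrggbb
    @port.write(["\x01", @id.chr, "\x04\x00", "c", [r, g, b].pack("C3")].join)
  end
end

# while building the grid:
@led[x][y] = Led.new(:id => led, :port => @sp)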
If you just want to make animations, you should try to abstract away the hardware so it can be represented by some data structure that is easy to work with. I'm not familiar with Ruby so I don't know the best way to go about it. If you were making something like that table, which is just a grid, I would try to map the LEDs to a 2D array.
I would then create an infinite loop. This loop would contain another set of loops that iterates through each pixel in that array and writes the color in each element out to the corresponding hardware. Once it writes out all the pixels, it could then sleep for a few ms, call some function that steps your animation when it wakes and repeat the loop again.
Once you do this then all you'll have to manipulate is that data structure. Does that make any sense?
So something like this:
def step_animation
  # modify the 2D array for each step of the animation here
end

# I'm assuming you have a function that gets looped forever. In Wiring you do;
# I'm not sure about working with the Arduino using Ruby. If not,
# just wrap the call to main_loop in loop do ... end.
def main_loop
  5.times do |y|
    5.times do |x|
      @sp.write(@led[x][y]) # write the color from the array out to the hardware
    end
  end
  sleep(0.06) # a few ms (roughly 60 ms here)
  step_animation
end
