Optimal number of filters in a Convolutional network - algorithm

I'm building a convolutional Network image classification purposes, my network is inspired by VGG conv network but I changed the number of layers and filters per layers because my image dataset is quite simple.
Nevertheless I'm wondering why the number of fitlers in VGG is always a power of 2 : 64 -> 128 -> 256 -> 512 -> 4096
I guessed that's because each pooling divide the output size by 2 x 2 and therefore one would want to multiply the number of filters by 2.
But I'm still wondering what's the real reason behind this choice; is this for optimization ? is it easier to distribute calculation ? And should I keep this logic in my network.

Yes, it is mainly for optimization. If the network is going to run on a GPU, threads in GPUs come in groups and blocks, normally a group is of 32 threads.
Roughly speaking, if you have a layer with 40 filters, you will need 2 groups = 64 threads. So why not making use of the rest threads and make the layer of 64 filters that can be computed in parallel.

Related

Scalability of a Parallel MPI based Poisson Solver using Finite Difference Scheme

I am developing a MPI based parallel numerical solver for a Laplace equation subjected to Dirichlet Boundary condition using Finite Difference method.I have taken a square computational domain with a grid size of 1024 x 1024. My objective is to evaluate the Scalability of the developed solver.
I have tested the code using 4,8,16,32,64,128 and 256 processors on a HPC facility having 8 nodes with each node consisting of 48 cores. Until 128 processors I could find a linear scalability in the computational time.(i.e between two consecutive cores say 4 and 8, 8 and 16, 16 and 32, 32 and 64 the solution converges and attains the set L2 and L-infinity true error norms values in the order of 10^-7 and 10^-8 without any issues and also the computational time reduces by half. whereas when i run the code using 128 and 256 cores the error norm values are not following a similar trend as that for 64 cores. The convergence rate drastically reduces say for 128 processors the error norm value is not reducing beyond 10^-6 and it takes more than half the computational time for 64 cores.Similarly with 256 cores the convergence rate is very poor and the computational time is very high.So, I request the experts in this field to help me to resolve this issue.

Cache blocking brings no improvement for image filter on ARM

I'm experimenting with cache blocking. To do that, I implemented 2 convolution based smoothing algorithms. The gaussian kernel I'm using looks like this:
The first algorithm is just the simple double for loop, looping from left to right, top to bottom as shown below.
Image source: (https://people.engr.ncsu.edu/efg/521/f02/common/lectures/notes/lec9.html)
In the second algorithm I tried to play with cache blocking by spliting the loops into chunks, which became something like the following. I used a BLOCK size of 512x512.
Image source: (https://people.engr.ncsu.edu/efg/521/f02/common/lectures/notes/lec9.html)
I'm running the code on a raspberry pi 3B+, which has a Cortex-A53 with 32KB of L1 and 256KB of L2, I believe. I ran the two algorithms with different image sizes (2048x1536, 6000x4000, 12000x8000, 16000x12000. 8bit gray scale images). But across different image sizes, I saw the run time being very similar.
The question is shouldn't the first algorithm experience access latency which the second should not, especially when using large size image (like 12000x8000). Base on the description of cache blocking in this link, when processing data at the end of image rows using the 1st algorithm, the data at the beginning of the rows should have been evicted from the L1 cache. Using 12000x8000 size image as an example, since we are using 5x5 kernel, 5 rows of data is need, which is 12000x5=60KB, already larger than the 32KB L1 size. When we start processing data for a new row, 4 rows of previous data are still needed but they are likely gone in L1 so needs to be re-fetched. But for the second algorithm it shouldn't have this problem because the block size is small. Can anyone please tell me what am I missing?
I also profiled the algorithm using oprofile with the following data:
Algorithm 1
event
count
L1D_CACHE_REFILL
13,933,254
PREFETCH_LINEFILL
13,281,559
Algorithm 2
event
count
L1D_CACHE_REFILL
9,456,369
PREFETCH_LINEFILL
8,725,250
So it looks like the 1st algorithm does have more cache miss compared to the second, reflecting by the L1D_CACHE_REFILL counts. But it also has higher data prefetching rate, which maybe due to the simple behavior of the loop. So is the whole story of cache blocking not taking into account data prefetching?
Conceptually, you're right blocking will reduce cache misses by keeping the input window in cache.
I suspect the main reason you're not seeing a speedup is because the cache is prefetching from all 5 input rows. Your performance counters show more prefetch loads in the unblocked implementation. I suspect many textbook examples are out of date since cache prefetching has kept getting better. Intel's L2 cache can detect and prefetch from up to 16 linear streams about 10 years ago, I think.
Assume the filter takes 5 * 5 cycles. So that would be 20.8 ns = 25 / 1.2GHz on RPI3. The IO cost will be reading a 5 high column of new input pixels. The amortized IO cost will be 5 bytes / 20.8ns = 229 MiB/s, which is much less than the ~2 GiB/s DRAM bandwidth. So in theory, the relatively slow computation combined with prefetching (I'm not certain how effective) means that memory access isn't a bottleneck.
Try increasing the filter height. The cache can only detect and prefetch from a certain # streams. Or try vectorizing the computation so that memory access becomes the bottleneck.

How does OpenCL distribute work items?

I'm testing and comparing GPU speed up with different numbers of work-items (no work-groups). The kernel I'm using is a very simple but long operation. When I test with multiple work-items, I use a barrier function and split the work in smaller chunks to get the same result as with just one work-item. I measure the kernel execution time using cl_event and the results are the following:
1 work-item: 35735 ms
2 work-items: 11822 ms (3 times faster than with 1 work-item)
10 work-items: 2380 ms (5 times faster than with 2 work-items)
100 work-items: 239 ms (10 times faster than with 10 work-items)
200 work-items: 122 ms (2 times faster than with 100 work-items)
CPU takes about 580 ms on average to do the same operation.
The only result I don't understand and can't explain is the one with 2 work items. I would expect the speed up to be about 2 times faster compared to the result with just one work item, so why is it 3?
I'm trying to make sense of these numbers by looking at how these work-items were distributed on processing elements. I'm assuming if I have just one kernel, only one compute unit (or multiprocessor) will be activated and the work items distributed on all processing elements (or CUDA cores) of that compute unit. What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
CL_DEVICE_MAX_WORK_ITEM_SIZES are 1024 / 1024 / 64 and CL_DEVICE_MAX_WORK_GROUP_SIZE 1024. Since I'm using just one dimension, does that mean I can have 1024 work-items running at the same time per processing element or per compute unit? When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
My GPU info: Nvidia GeForce GT 525M, 96 CUDA cores (2 compute units, 48 CUDA cores per unit)
The only result I don't understand and can't explain is the one with 2
work items. I would expect the speed up to be about 2 times faster
compared to the result with just one work item, so why is it 3?
The exact reasons will probably be hard to pin down, but here are a few suggestions:
GPUs aren't optimised at all for small numbers of work items. Benchmarking that end of the scale isn't especially useful.
35 seconds is a very long time for a GPU. Your GPU probably has other things to do, so your work-item is probably being interrupted many times, with its context saved and resumed every time.
It will depend very much on your algorithm. For example, if your kernel uses local memory, or a work-size dependent amount of private memory, it might "spill" to global memory, which will slow things down.
Depending on your kernel's memory access patterns, you might be running into the effects of read/write coalescing. More work items means fewer memory accesses.
What I'm also not sure about is whether a processing element can process multiple work-items at the same time, or is it just one work-item per processing element?
Most GPU hardware supports a form of SMT to hide memory access latency. So a compute core will have up to some fixed number of work items in-flight at a time, and if one of them is blocked waiting for a memory access or barrier, the core will continue executing commands on another work item. Note that the maximum number of simultaneous threads can be further limited if your kernel uses a lot of local memory or private registers, because those are a finite resource shared by all cores in a compute unit.
Work-groups will normally run on only one compute unit at a time, because local memory and barriers don't work across units. So you don't want to make your groups too large.
One final note: compute hardware tends to be grouped in powers of 2, so it's usually a good idea to make your work group sizes a multiple of e.g. 16 or 64. 1000 is neither, which usually means some cores will be doing nothing.
When I tried with 1000 work-items, the result was a smaller number so I figured not all of them got executed, but why would that be?
Please be more precise in this question, it's not clear what you're asking.

why executation time of tf.nn.conv2d function different while the multiply times are the same?

I am using tensorflow to build cnn net in image classification experiment,I found such phenomenon as:
operation 1:tf.nn.conv2d(x, [3,3,32,32], strides=[1,1,1,1], padding='SAME')
the shape of x is [128,128,32],means convolution using 3x3 kernel on x,both input channels and output channels are 32,the total multiply times is
3*3*32*32*128*128=150994944
operation 2:tf.nn.conv2d(x, [3,3,64,64], strides=[1,1,1,1], padding='SAME')
the shape of x is [64,64,64],means convolution using 3x3 kernel on x,both input channels and output channels are 64,the total multiply times is
3*3*64*64*64*64=150994944
In contrast with operation 1,the feature map size of operation 2 scale down to 1/2 and the channel number doubled. The multiply times are the same so the running time should be same.But in practice the running time of operation 1 is longer than operation 2.
My measure method was shown below
eliminate an convolution of operation 1,the training time for one epoch reduced 23 seconds,means the running time of operation 1 is 23 seconds.
eliminate an convolution of operation 2,the training time for one epoch reduced 13 seconds,means the running time of operation 2 is 13 seconds.
the phenomenon can reproduction every time。
My gpu is nvidia gtx980Ti,os is ubuntu 16.04。
So that comes the question: Why the running time of operation 1 was longer than operation 2?
If I had to guess it has to do with how the image is ordered in memory. Remember that in memory everything is stored in a flattened format. This means that if you have a tensor of shape [128, 128, 32], the 32 features/channels are stored next to eachover. Then all of the rows, then all of the columns. https://en.wikipedia.org/wiki/Row-major_order
Accessing closely packed memory is very important to performance especially on a GPU which has a large memory bus and is optimized for aligned in order memory access. In case with the larger image you have to skip around the image more and the memory access is more out of order. In case 2 you can do more in order memory access which gives you more speed. Multiplications are very fast operations. I bet with a convolution memory access if the bottleneck which limits performance.
chasep255's answer is good and probably correct.
Another possibility (or alternative way of thinking about chasep255's answer) is to consider how caching (all the little hardware tricks that can speed up memory fetches, address mapping, etc) could be producing what you see...
You have basically two things: a stream of X input data and a static filter matrix. In case 1, you have 9*1024 static elements, in case 2 you have 4 times as many. Both cases have the same total multiplication count, but in case 2 the process is finding more of its data where it expects (i.e. where it was last time it was asked for.) Net result: less memory access stalls, more speed.

micro-programmed control circuit and one questions

I ran into a question:
in digital system with micro-programmed control circuit, total of distinct operation pattern of 32 signal is 450. if the micro-programmed memory contains 1K micro instruction, by using Nano memory, how many bits is reduced from micro-programmed memory?
1) 22 Kbits
2) 23 Kbits
3) 450 Kbits
4) 450*32 Kbits
I read in my notes, that (1) is true, but i couldn't understand how we get this?
Edit: Micro instructions are stored in the micro memory (control memory). There is a chance that a group of micro instructions may occur several times in a micro program. As a result the more memory space isneeded.By making use of the nano memory we can have significant saving in the memory when a group of micro operations occur several times in a micro program. Please see for nano technique ref:
Control Units
Back in the day, before .NET, when you actually had to know what a computer was, before you could make it do stuff. This question would have gotten a ton of answers.
Except, back then, the internet wasn't really a thing, and Stack overflow was not really a problem, as the concept of a stack and a heap, wasn't really a standard..
So just to make sure that we are in fact talking about the same thing, I will just tr to explain this..
The control unit in a digital computer initiates sequences of microoperations. In a bus-oriented system, the control signals that specify microoperations are
groups of bits that select the paths in multiplexers, decoders, and ALUs.
So we are looking at the control unit, and the instruction set for making it capable of actually doing stuff.
We are dealing with what steps should happen, when the compiled assembly requests a bit shift, clear a register, or similar "low level" stuff.
Some of theese instructions may be hardwired, but usually not all of them.
Micro-programs
Quote: "Microprogramming is an orderly method of designing the control unit
of a conventional computer"
(http://www2.informatik.hu-berlin.de/rok/ca/data/slides/english/ca9.pdf)
The control variables, for the control unit can be represented by a string of 1’s and 0’s called a "control word". A microprogrammed control unit is a control unit whose binary control variables are not hardwired, but are stored in a memory. Before we optimized stuff we called this memory the micro memory ;)
Typically we would actually be looking at two "memories" a control memory, and a main memory.
the control memory is for the microprogram,
and the main memory is for instructions and data
The process of code generation for the control memory is called
microprogramming.
... ok?
Transfer of information among registers in the processor is through MUXs rather
than a bus, we typically have a few register, some of which are familiar to programmers, some are not. The ones that should ring a bell for most in here, is the processor registers. The most common 4 Processor registers are:
Program counter – PC
Address register – AR
Data register – DR
Accumulator register - AC
Examples where microcode uses processor registers to do stuff
Assembly instruction "ADD"
pseudo micro code: " AC ← AC + M[EA] " where M[EA] is data from main memory register
control word: 0000
Assembly instruction "BRANCH"
pseudo micro code "If (AC < 0) then (PC ← EA) "
control word: 0001
Micro-memory
The micro memory only concerns how we organize whats in the control memory.
However when we have big instruction sets, we can do better than simply storing all the instructions. We can subdivide the control memory into "control memory" and "nano memory" (since nano is smaller than micro right ;) )
This is good as we don't waste a lot of valuable space (chip area) on microcode.
The concept of nano memory is derived from a combination of vertical and horizontal instructions, but also provides trade-offs between them.
The motorola M68k microcomputer is one the earlier and popular µComputers with this nano memory control design. Here it was shown that a significant saving of memory could be achieved when a group of micro instructions occur often in a microprogram.
Here it was shown that by structuring the memory properly, that a few bits could be used to address the instructions, without a significant cost to speed.
The reduction was so that only the upper log_2(n) bits are required to specify the nano-address, when compared to the micro-address.
what does this mean?
Well let's stay with the M68K example a bit longer:
It had 640 instructions, out of which only 280 where unique.
had the instructions been coded as simple micro memory, it would have taken up:
640x70 bits. or 44800 bits
however, as only the 280 unique instructions where required to fill all 70 bits, we could apply the nano memory technique to the remaining instructions, and get:
8 < log_2(640-280) < 9 = 9
640*9 bit micro control store, and 280x70 bit nano memory store
total of 25360 bits
or a memory savings of 19440 bits.. which could be laid out as main memory for programmers :)
this shows that the equation:
S = Hm x Wm + Hn x Wn
where:
Hm = Number of words High Level
Wm = Length of words in High Level
Hn = Number of Low Level words
Wn = Length of low level words
S = Control Memory Size (with Nano memory technique)
holds in real life.
note that, micro memory is usually designed vertically (Hm is large, Wm is small) and nano programs are usually opposite Hn small, Wn Large.
Back to the question
I had a few problems understanding the wording of the problem, - that may because my first language is Danish, but still I tried to make some sense of it and got to:
proposition 1:
1000 instructions
32 bits
450 uniques
µCode:
1000 * 32 = 32.000 bits
bit width required for nano memory:
log2(1000-450) > 9 => 10
450 * 32 = 14400
(1000-450) * 10 = 5500
32000 - (14400 + 5500) = 12.100 bits saved
Which is not any of your answers.
please provide clarification?
UPDATE:
"the control word is 32 bit. we can code the 450 pattern with 9 bit and we use these 9 bits instead of 32 bit control word. reduce memory from 1000*(32+x) to 1000*(9+x) is equal to 23kbits. – Ali Movagher"
There is your problem, we cannot code the 450 pattern with 9 bits, as far as I can see we need 10..

Resources