Sliding window operation with arbitrary slide distance and no loops (matlab) - performance

I am trying to completely vectorize a sliding window operation that involves an arbitrary slide distance in order to minimize the run time for this code.
I have two vectors: (1) A time vector which records in sample number a series of event times and (2) A channel vector which indicates on which channel each event time was recorded. So:
time = [1,13,58,96,1002];
channel = [1,1,1,2,2];
Which means, for example, that an event was detected at sample number 1 on channel 1. I want to calculate a sliding event count with an arbitrary slide length. For example, if there were only one channel, it would look something like this:
binary = sparse(1,time,1,1,max(time));
nx = max(time); %length of sequence
nwind = <some window size>;
noverlap = <some size smaller than nwind>;
ncol = fix((nx-noverlap)/(nwind-noverlap)); %number of sliding windows
colindex = 1 + (0:(ncol-1))*(nwind-noverlap); %starting index of each
idx = bsxfun(@plus, (1:nwind)', colindex)-1;
eventcounts=sum(binary(idx),1);
I was wondering (1) whether anyone has an idea how to extend this to multiple channels without adding a loop, and (2) whether there is an even faster way of making the calculation in general?
Thanks a lot for any ideas. :)
Here is a sample solution without vectorization:
fs = 100; %number of samples in a window
lastwin = max(time);
slide = fs/2; %slide by a half a window
winvec = 0:slide:lastwin;
for x=1:max(channel)
t = histcounts(time(channel==x),winvec);
t = conv(t, [1,1], 'same'); %convolve to get window size of fs
eventcounts(x,:) = t;
end
Ideally, the script would return an [MxN] array, called eventcounts, where M is the total number of channels and N is the total number of windows, numel(winvec). Each position eventcounts(i,j) would contain the number of events for channel i and window j.
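For intuition, here is a hedged NumPy sketch (Python, not Matlab) of one loop-free multi-channel approach: np.histogram2d bins channel and time simultaneously, so no channel loop is needed, and summing adjacent half-window bins mirrors the conv(t,[1,1],'same') step from the loop version. The variable names follow the question; the exact binning details are an assumption, not the poster's spec:

```python
import numpy as np

# Sample data from the question (sample numbers and their channels).
time = np.array([1, 13, 58, 96, 1002])
channel = np.array([1, 1, 1, 2, 2])

fs = 100         # window size in samples
slide = fs // 2  # slide by half a window

nchan = channel.max()
edges = np.arange(0, time.max() + slide, slide)  # half-window bin edges

# Count events per (channel, half-window) in one call: no channel loop.
counts, _, _ = np.histogram2d(
    channel, time,
    bins=[np.arange(0.5, nchan + 1), edges])

# Sum each half-window bin with its right neighbor to get full-window
# counts, mirroring conv(t, [1,1], 'same') in the loop version.
eventcounts = counts + np.pad(counts[:, 1:], ((0, 0), (0, 1)))
```

Here eventcounts is [number of channels x number of windows]; for this data, channel 1's first window (samples 0-99) contains 3 events.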


Converting Scratch to Algorithm

I am learning algorithms for the first time and trying to figure them out with Scratch. I am following the tutorials on the Scratch wiki. How can I convert this to an algorithm (with a flow chart or normal steps)? Especially the loop. (I uploaded it as a picture.) Please click here to see the picture.
I started:
Step 1: Start
Step 2: Init: delete all of numbers, iterator, amount, sum
Step 3: Ask how many numbers you want
Step 4: Initialize sum = 0, amount = 0, iterator = 1
Step 5: Enter the element values
Step 6: Find the sum by using a loop over the array and updating the sum value; the loop must continue (number of elements - 1) times
Step 7: avg = sum / number of elements
Step 8: Print the average value
I don't think it's right. I mean, I feel there are errors? Thank you for your time.
Scratch
Here is the algorithm of variant 2 (see the Java algorithm below) in Scratch. The output should be identical.
Java
Here is the algorithm in Java, where I have commented the steps; this should give you a step-by-step guide on how to do it in Scratch as well.
I have also implemented two variants of the algorithm to show you some considerations a programmer often has to weigh when implementing an algorithm, mainly time (= time required for the algorithm to complete) and space (= memory used on your computer).
Please note: the following algorithms do not handle errors. E.g. if a user entered the letter a instead of a number, the program would crash. It is easy to adjust the program to handle this, but for simplicity I did not.
Variant 1: Storing all elements in array numbers
This variant stores all numbers in an array numbers and calculates the sum at the end using those numbers which is slower than variant 2 as the algorithm goes over all the numbers twice. The upside is that you will preserve all the numbers the user entered and you could use that later on if you need to but you will need storage to store those values.
public static void yourAlgorithm() {
// needed in Java to get input from user
var sc = new Scanner(System.in);
// print to screen (equivalent to "say"/ "ask")
System.out.print("How many numbers do you want? ");
// get amount of numbers as answer from user
var amount = sc.nextInt();
// create array to store all elements
var numbers = new int[amount];
// set iterator to 1
int iterator = 1;
// as long as the iterator is smaller or equal to the number of required numbers, keep asking for new numbers
// equivalent to "repeat amount" except that retries are possible if no number was entered
while (iterator <= amount) {
// ask for a number
System.out.printf("%d. number: ", iterator);
// insert the number at position iterator - 1 in the array
numbers[iterator - 1] = sc.nextInt();
// increase iterator by one
iterator++;
}
// calculate the sum after all the numbers have been entered by the user
int sum = 0;
// go over all numbers again! (this is why it is slower) and calculate the sum
for (int i = 0; i < amount; i++) {
sum += numbers[i];
}
// print average to screen
System.out.printf("Average: %s / %s = %s", sum, amount, (double)sum / (double)amount);
}
Variant 2: Calculating sum when entering new number
This algorithm does not store the numbers the user enters but immediately uses the input to calculate the sum, hence it is faster as only one loop is required and it needs less memory as the numbers do not need to be stored.
This would be the best solution (fastest, least space/ memory needed) in case you do not need all the numbers the user entered later on.
// needed in Java to get input from user
var sc = new Scanner(System.in);
// print to screen (equivalent to "say"/ "ask")
System.out.print("How many numbers do you want? ");
// get amount of numbers as answer from user
var amount = sc.nextInt();
// set iterator to 1
int iterator = 1;
int sum = 0;
// as long as the iterator is smaller or equal to the number of required numbers, keep asking for new numbers
// equivalent to "repeat amount" except that retries are possible if no number was entered (e.g. character was entered instead)
while (iterator <= amount) {
// ask for a number
System.out.printf("%d. number: ", iterator);
// get number from user
var newNumber = sc.nextInt();
// add the new number to the sum
sum += newNumber;
// increase iterator by one
iterator++;
}
// print average to screen
System.out.printf("Average: %s / %s = %s", sum, amount, (double)sum / (double)amount);
Variant 3: Combining both approaches
You could also combine both approaches, i.e. calculating the sum within the first loop and additionally storing the values in a numbers array so you can use them later if you need to.
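As a sketch of this combined variant (in Python rather than Java, for brevity; the function name average_with_storage is mine, not from the original):

```python
def average_with_storage(inputs):
    """Variant 3: keep every entered number (variant 1's upside)
    while accumulating the sum in the same loop (variant 2's upside)."""
    numbers = []  # storage for later use
    total = 0     # running sum
    for value in inputs:
        numbers.append(value)
        total += value
    return numbers, total / len(numbers)

# Example: three numbers entered by the user.
nums, avg = average_with_storage([2, 4, 6])  # avg is 4.0
```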

Different way to index threads in CUDA C

I have a 9x9 matrix and I flattened it into a vector of 81 elements; then I defined a grid of 9 blocks with 9 threads each, for a total of 81 threads; here's a picture of the grid
Then I tried to verify the index related to thread (0,0) of block (1,1); first I calculated the i-th column and the j-th row like this:
i = blockDim.x*blockIdx.x + threadIdx.x = 3*1 + 0 = 3
j = blockDim.y*blockIdx.y + threadIdx.y = 3*1 + 0 = 3
therefore the index is:
index = N*i + j = 9*3 +3 = 30
As a matter of fact, thread (0,0) of block (1,1) is indeed the 30th element of the matrix.
Now here's my problem: let's say I choose a grid with 4 blocks (0,0)(1,0)(0,1)(1,1) with 4 threads each (0,0)(1,0)(0,1)(1,1).
Let's say I keep the original vector with 81 elements; what should I do to get the index of a generic element of the vector using just 4*4 = 16 threads? I have tried the formulas written above but they don't seem to apply.
My goal is that every thread handles a single element of the vector...
A common way to have a smaller number of threads cover a larger number of data elements is to use a "grid-striding loop". Suppose I had a vector of length n elements, and I had some smaller number of threads, and I wanted to take every element, add 1 to it, and store it back in the original vector. That code could look something like this:
__global__ void my_inc_kernel(int *data, int n){
int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y) + (threadIdx.x+blockDim.x*blockIdx.x);
while(idx < n){
data[idx]++;
idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y);}
}
(the above is coded in browser, not tested)
The only complicated parts above are the indexing parts. The initial calculation of idx is just a typical creation/assignment of a globally unique id (idx) to each thread in a 2D threadblock/grid structure. Let's break it down:
int idx = (gridDim.x*blockDim.x)*(threadIdx.y+blockDim.y*blockIdx.y) +
(width of grid in threads)*(thread y-index)
(threadIdx.x+blockDim.x*blockIdx.x);
(thread x-index)
The amount added to idx on each pass of the while loop is the size of the 2D grid in total threads. Therefore, each iteration of the while loop does one "grid's width" of elements at a time, and then "strides" to the next grid-width, to process the next group of elements. Let's break that down:
idx += (gridDim.x*blockDim.x)*(gridDim.y*blockDim.y);
(width of grid in threads)*(height of grid in threads)
This methodology does not require that the total number of elements be evenly divisible by the number of threads. The conditional check of the while-loop handles all cases of the relationship between vector size and grid size.
This particular grid-striding loop methodology has the additional benefit (in terms of mapping elements to threads) that it tends to naturally promote coalesced access. The reads and writes to the data vector in the code above will coalesce perfectly, due to the behavior of the grid-striding loop. You can enhance coalescing behavior in this case by choosing block sizes that are a whole-number multiple of 32, but that is not central to your question.
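To sanity-check the index math, here is a hedged Python simulation (not CUDA) of the same 2-D grid-striding loop; it enumerates every (block, thread) pair of a 2x2 grid of 2x2 blocks and collects the indices each thread would touch. The function name and tuple parameters are mine:

```python
def grid_stride_indices(n, grid_dim, block_dim):
    """Collect the indices each thread of a 2-D grid would process
    with the grid-striding loop from the kernel above.
    grid_dim and block_dim are (x, y) pairs, mirroring gridDim/blockDim."""
    gx, gy = grid_dim
    bx, by = block_dim
    width = gx * bx             # width of grid in threads
    stride = width * (gy * by)  # total threads in the grid
    visited = []
    for bidx in range(gx):
        for bidy in range(gy):
            for tx in range(bx):
                for ty in range(by):
                    # same formula as the kernel's initial idx
                    idx = width * (ty + by * bidy) + (tx + bx * bidx)
                    while idx < n:
                        visited.append(idx)
                        idx += stride
    return visited

# 4 blocks of 4 threads (16 threads total) covering 81 elements:
v = grid_stride_indices(81, (2, 2), (2, 2))
```

Here sorted(v) is exactly 0..80, so the 16 threads cover all 81 elements with no gaps or duplicates.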

matlab code optimization - clustering algorithm KFCG

Background
I have a large set of vectors (orientation data in an axis-angle representation... the axis is the vector) that I want to apply a clustering algorithm to. I tried kmeans, but the computational time was too long (it never finished). So instead I am trying to implement the KFCG algorithm, which is faster (Kirke 2010):
Initially we have one cluster with the entire training vectors and the codevector C1 which is centroid. In the first iteration of the algorithm, the clusters are formed by comparing first element of training vector Xi with first element of code vector C1. The vector Xi is grouped into the cluster 1 if xi1< c11 otherwise vector Xi is grouped into cluster2 as shown in Figure 2(a) where codevector dimension space is 2. In second iteration, the cluster 1 is split into two by comparing second element Xi2 of vector Xi belonging to cluster 1 with that of the second element of the codevector. Cluster 2 is split into two by comparing the second element Xi2 of vector Xi belonging to cluster 2 with that of the second element of the codevector as shown in Figure 2(b). This procedure is repeated till the codebook size is reached to the size specified by user.
I'm unsure what ratio is appropriate for the codebook, but it shouldn't matter for the code optimization. Also note mine is 3-D so the same process is done for the 3rd dimension.
My code attempts
I've tried implementing the above algorithm in Matlab 2013 (Student Version). Here are some different structures I've tried - BUT they take way too long (I have never seen one complete):
%training vectors:
Atgood = Nx4 vector (see test data below if want to test);
vecA = Atgood(:,1:3);
roA = size(vecA,1);
%Codebook size, Nsel, is ratio of data
remainFrac2=0.5;
Nseltemp = remainFrac2*roA; %codebook size
%Ensure selected size after nearest power of 2 is NOT greater than roA
if 2^round(log2(Nseltemp)) < roA
NselIter = round(log2(Nseltemp));
else
NselIter = ceil(log2(Nseltemp)-1);
end
Nsel = 2^NselIter; %power of 2 - for LGB and other algorithms
MAIN BLOCK TO OPTIMIZE:
%KFCG:
%%cluster = cell(1,Nsel); %Unsure #rows - Don't know how to initialize if need mean...
codevec(1,1:3) = mean(vecA,1);
count1=1;
count2=1;
ind=1;
for kk = 1:NselIter
hh2 = 1:2:size(codevec,1)*2;
for hh1 = 1:length(hh2)
hh=hh2(hh1);
% for ii = 1:roA
% if vecA(ii,ind) < codevec(hh1,ind)
% cluster{1,hh}(count1,1:4) = Atgood(ii,:); %want all 4 elements
% count1=count1+1;
% else
% cluster{1,hh+1}(count2,1:4) = Atgood(ii,:); %want all 4
% count2=count2+1;
% end
% end
%EDIT: My ATTEMPT at optimizing above for loop:
repcv=repmat(codevec(hh1,ind),[size(vecA,1),1]);
splitind = vecA(:,ind)>=repcv;
splitind2 = vecA(:,ind)<repcv;
cluster{1,hh}=vecA(splitind,:);
cluster{1,hh+1}=vecA(splitind2,:);
end
clear codevec
%Only mean the 1x3 vector portion of the cluster - for centroid
codevec = cell2mat((cellfun(@(x) mean(x(:,1:3),1),cluster,'UniformOutput',false))');
if ind < 3
ind = ind+1;
else
ind=1;
end
end
if length(codevec) ~= Nsel
warning('codevec ~= Nsel');
end
Alternatively, instead of cells, I thought 3-D matrices might be faster? I tried, but it was slower with my method of appending the next row on each iteration (temp=[]; for ... temp=[temp;new];).
Also, I wasn't sure what was best to loop with, for or while:
%If initialize cell to full length
while length(find(~cellfun('isempty',cluster))) < Nsel
Well, anyways, the first method was fastest for me.
Questions
Is the logic standard? Not in the sense that it matches with the algorithm described, but from a coding perspective, any weird methods I employed (especially with those multiple inner loops) that slows it down? Where can I speed up (you can just point me to resources or previous questions)?
My array size, Atgood, is 1,000,000x4, making NselIter = 19 - do I just need to find a way to decrease this size, or can the code be optimized?
Should this be asked on CodeReview? If so, I'll move it.
Testing Data
Here's some random vectors you can use to test:
for ii=1:1000 %My size is ~ 1,000,000
omega = 2*rand(3,1)-1;
omega = (omega/norm(omega))';
Atgood(ii,1:4) = [omega,57];
end
Your biggest issue is re-iterating through all of vecA for each codevector, rather than just the samples that belong to the corresponding cluster. You're supposed to split each cluster on its own codevector. As it is, your cluster structure grows and grows, and each iteration processes more and more samples.
Your second issue is the loop around the comparisons and the appending of samples to build up the clusters. Both of those can be solved by vectorizing the comparison operation. Oh, I just saw your edit, where this was optimized. Much better. But codevec(hh1,ind) is just a scalar, so you don't even need the repmat.
Try this version:
% (preallocs added in edit)
cluster = cell(1,Nsel);
codevec = zeros(Nsel, 3);
codevec(1,:) = mean(Atgood(:,1:3),1);
cluster{1} = Atgood;
nClusters = 1;
ind = 1;
while nClusters < Nsel
for c = 1:nClusters
lower_cluster_logical = cluster{c}(:,ind) < codevec(c,ind);
cluster{nClusters+c} = cluster{c}(~lower_cluster_logical,:);
cluster{c} = cluster{c}(lower_cluster_logical,:);
codevec(c,:) = mean(cluster{c}(:,1:3), 1);
codevec(nClusters+c,:) = mean(cluster{nClusters+c}(:,1:3), 1);
end
ind = rem(ind,3) + 1;
nClusters = nClusters*2;
end
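The same corrected splitting loop can be sketched in Python/NumPy (the function name kfcg and its n_iter parameter are mine; this mirrors the answer's Matlab version, splitting every cluster on one coordinate of its centroid and cycling through the dimensions):

```python
import numpy as np

def kfcg(data, n_iter):
    """KFCG sketch: each pass splits every cluster on coordinate `ind`
    of that cluster's centroid, doubling the codebook each iteration."""
    clusters = [data]
    ind = 0
    for _ in range(n_iter):
        new_clusters = []
        for c in clusters:
            centroid = c.mean(axis=0)
            lower = c[:, ind] < centroid[ind]
            new_clusters.append(c[lower])    # below the centroid
            new_clusters.append(c[~lower])   # at or above the centroid
        clusters = new_clusters
        ind = (ind + 1) % data.shape[1]      # cycle the split dimension
    return clusters

# 100 random 3-D vectors, 3 doublings -> codebook of 2**3 = 8 clusters.
rng = np.random.default_rng(0)
clusters = kfcg(rng.random((100, 3)), 3)
```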

Exclude matrix elements from calculation with respect to performance

I am trying to save some calculation time. I am doing some Image processing with the well known Lucas Kanade algorithm. Starting point was this paper by Baker / Simon.
I am doing this in Matlab and I also use a background subtractor. I want the subtractor to set all background pixels to 0, or to produce a logical mask with 1 for foreground and 0 for background.
What I want is to exclude all matrix elements that are background from the calculation, in order to save computation time. I am aware that I can use syntax like
A(A>0) = ...
But that doesn't work in a way like
B(A>0) = A.*C.*D
because I am getting an error:
In an assignment A(I) = B, the number of elements in B and I must be the same.
This is probably because the right-hand side A.*C.*D has more elements than the subset of B selected by the index.
In C code I would just loop over the matrix, check whether the pixel has the value 0, and continue if so; that way I would save a whole bunch of calculations.
In Matlab, however, it's not very fast to loop through a matrix. So is there a fast way to solve my problem? I couldn't find a sufficient answer to it here.
In case anybody is interested: I am trying to use robust error functions instead of quadratic ones.
Update:
I tried the following approach to test the speed as suggested by @Acorbe:
function MatrixTest()
n = 100;
A = rand(n,n);
B = rand(n,n);
C = rand(n,n);
D = rand(n,n);
profile clear, profile on;
for i=1:10000
tests(A,B,C,D);
end
profile off, profile report;
function result = tests(A,B,C,D)
idx = (B>0);
t = A(idx).*B(idx).*C(idx).*D(idx);
LGS1a(idx) = t;
LGS1b = A.*B.*C.*D;
And I got the following results with the Matlab profiler:
t = A(idx).*B(idx).*C(idx).*D(idx); 1.520 seconds
LGS1a(idx) = t; 0.513 seconds
idx = (B>0); 0.264 seconds
LGS1b = A.*B.*C.*D; 0.155 seconds
As you can see, accessing the matrices by index has far more overhead than just computing the full element-wise product.
What about the following?
mask = A>0;
B = zeros(size(A)); % # some initialization
t = A.*C.*D;
B( mask ) = t( mask );
in this way you select just the needed elements of t. Maybe there is some overhead in the calculation, although it is likely negligible compared to the slowness of for loops.
EDIT:
If you want more speed, you can try a more selective approach which uses the mask everywhere.
t = A(mask).*C(mask).*D(mask);
B( mask ) = t;
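For comparison, here is a rough NumPy (Python, not Matlab) sketch of both suggestions with random stand-in matrices; it confirms the mask-everywhere form gives the same result as masking after the full computation:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((100, 100)) - 0.5  # mixed signs, so the mask is nontrivial
C = rng.random((100, 100))
D = rng.random((100, 100))

mask = A > 0

# First suggestion: full computation, then masked assignment.
B1 = np.zeros_like(A)
t = A * C * D
B1[mask] = t[mask]

# Second, more selective suggestion: compute only the needed elements.
B2 = np.zeros_like(A)
B2[mask] = A[mask] * C[mask] * D[mask]
```

Both leave the background elements at 0 and agree on the foreground.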

How to keep a random subset of a stream of data?

I have a stream of events flowing through my servers. It is not feasible for me to store all of them, but I would like to periodically be able to process some of them in aggregate. So, I want to keep a subset of the stream that is a random sampling of everything I've seen, but is capped to a max size.
So, for each new item, I need an algorithm to decide if I should add it to the stored set, or if I should discard it. If I add it, and I'm already at my limit, I need an algorithm to evict one of the old items.
Obviously, this is easy as long as I'm below my limit (just save everything). But how can I maintain a good random sampling without being biased towards old items or new items once I'm past that limit?
Thanks,
This is a common interview question.
One easy way to do it is to save the nth element with probability k/n (or 1, whichever is smaller). If you need to remove an element to make room for the new sample, evict a random element.
This gives you a uniformly random subset of the n elements. If you don't know n, you can estimate it and get an approximately uniform subset.
This is called reservoir sampling. Source: http://en.wikipedia.org/wiki/Reservoir_sampling
array R[k]; // result
integer i, j;
// fill the reservoir array
for each i in 1 to k do
R[i] := S[i]
done;
// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
j := random(1, i); // important: inclusive range
if j <= k then
R[j] := S[i]
fi
done
A decent explanation/proof: http://propersubset.com/2010/04/choosing-random-elements.html
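The pseudocode above translates to a short runnable Python function (the name reservoir_sample is mine; random.randint uses the same inclusive range as random(1, i)):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniformly random subset of size k from a stream of
    unknown length, using O(k) memory (reservoir sampling)."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = random.randint(1, i)     # inclusive range, as in the pseudocode
            if j <= k:
                reservoir[j - 1] = item  # replace with probability k/i
    return reservoir

sample = reservoir_sample(range(1000), 10)  # any 10 of the 1000 events
```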
While this paper isn't precisely what you're looking for, it may be a good starting point in your search.
store samples in a first in first out (FIFO) queue.
set a sampling rate of so many events between samples, or randomize this a bit - depending on your patterns of events.
save every nth event, or whenever your rate tells you to, then stick it in to the end of the queue.
pop one off the top if the size is too big.
This assumes you don't know the total number of events that will be received and that you don't need a minimum number of elements in the subset.
arr = arr[MAX_SIZE] //Create a new array that will store the events. Assuming first index 1.
counter = 1 //Initialize a counter.
while(receiving event){
if( counter <= MAX_SIZE ){
arr[counter] = event //Reservoir not full yet: always store.
}
else{
random = //Generate a random number between 1 and counter
if( random <= MAX_SIZE ){
arr[random] = event //Replace a random slot with probability MAX_SIZE/counter.
}
}
counter += 1
}
Assign a probability of recording each event and store the event in an indexable data structure. When the size of the structure gets to the threshold, remove a random element and add new elements. In Ruby, you could do this:
@storage = []
prob = 0.002
while ( message = getnextMessage ) do
@storage.delete_at((rand() * @storage.length).floor) if @storage.length > MAX_LEN
@storage << message if (rand() < prob)
end
This addresses your max size AND your non-bias toward when the event occurred. You could also choose which element gets deleted by partitioning your stored elements into buckets and then removing an element from any bucket that has more than one element. The bucket method allows you to keep one from each hour, for example.
You should also know that sampling theory is Big Math. If you need more than a layman's idea about this you should consult a qualified mathematician in your area.
