K means clustering in matlab - algorithm

I found the below code to segment the images using K means clustering,but in the below code,they are using some calculation to find the min,max values.I know the basic concept of K-means algorithm.but I couldn't understand this code.Can any one please explain.
function [Centroid,new_cluster]=kmeans_algorithm(input_image,k)
% k = 4;
input_image=double(input_image);
new_image=input_image;
input_image=input_image(:);
min_val=min(input_image);
input_image=round(input_image-min_val+1);
length_input_image=length(input_image);
max_val=max(input_image)+1;
hist_gram=zeros(1,max_val);
hist_gram_count=zeros(1,max_val);
for i=1:length_input_image
if(input_image(i)>0)
hist_gram(input_image(i))=hist_gram(input_image(i))+1;
end;
end
IDX=find(hist_gram);
hist_length=length(IDX);
Centroid=(1:k)*max_val/(k+1);
while(true)
old_Centroid=Centroid;
for i=1:hist_length
new_val=abs(IDX(i)-Centroid);
hist_val=find(new_val==min(new_val));
hist_gram_count(IDX(i))=hist_val(1);
end
for i=1:k,
loop_count=find(hist_gram_count==i);
Centroid(i)=sum(loop_count.*hist_gram(loop_count))/sum(hist_gram(loop_count));
end
if(Centroid==old_Centroid) break;end;
end
length_input_image=size(new_image);
new_cluster=zeros(length_input_image);
for i=1:length_input_image(1),
for j=1:length_input_image(2),
new_val=abs(new_image(i,j)-Centroid);
loop_count=find(new_val==min(new_val));
new_cluster(i,j)=loop_count(1);
end
end
Centroid=Centroid+min_val-1;
especially what's the purpose of this input_image(:)in the above code. In google they said it like matrix.but still I'm confused,whether this is matrix or array

The notation (:) collapses a multi-dimensional vector into a column vector.
data = rand(10,4);
size(data(:))
% 40 1
Then you can apply a normal function to an entire multi-dimensional array
min(data(:));
Instead of to each dimension independently
min(min(data));
In the code that you have posted, they collapse input_image to a column vector just to make it easier to apply functions like min, max, and length.
Update
The code that you have posted doesn't actually perform k-means clustering. It simply creates a histogram of all values in the image. They use min and max to determine the number of bins to use for the histogram.

Related

Julia multiply 2 matrices in a for loop with a formula

Hello hopefully someone can help me... I am absolutely new to programming.
I just want to multiply two matrices in Julia with a for loop. It is about the frequency of a oscillation table at continuous casting in a steel plant. The below function calculates the frequency (=freqres) for a given mold stroke and casting speed(vc). The result is the frequency of the oscillation table for a given casting speed (from 1000mm/min to 6000mm/min).
vc = [1000:100:6000;]
function freqcalc(vc,stroke)
for i in vc
push!(freqres, 3i/4stroke)
end
end
Now I want to calculate the negative strip time with the following formula:
nst = (60/(pi*freqres))*acos(vc/(pi*stroke*freqres))
With the below command I did not achieve the desired result, I think due to a nested condition.
nstres = Float64[]
for i in vc, j in freqres
push!(nstres, (60/(pi*j))*acos(i/(pi*stroke*j)))
end
I just want to have the result of every entry of vc and freqres of the above formula in nstres.
push!(nstres, (60/(pi*freqres1))*acos(vc1/(pi*stroke*freqres1)))
push!(nstres, (60/(pi*freqres2))*acos(vc2/(pi*stroke*freqres2)))
I hope I could clarify my question, and thanks for your comments, I very appreciate it.

Is preallocation possible in this code snippet?

The following is a snippet of code of Ant Colony Optimization. I've removed whatever I feel would absolutely not be necessary to understand the code. The rest I'm not sure as I'm unfamiliar with coding on matlab. However, I'm running this algorithm on 500 or so cities with 500 ants and 1000 iterations, and the code runs extremely slow as compared to other algorithm implementations on matlab. For the purposes of my project, I simply need the datasets, not demonstrate coding capability on matlab, and I had time constraints that simply did not allow me to learn matlab from scratch, as that was not taken into consideration nor expected when the deadline was given, so I got the algorithm from an online source.
Matlab recommends preallocating two variables inside a loop as they are arrays that change size I believe. However I do not fully understand the purpose of those two parts of the code, so I haven't been able to do so. I believe both arrays increment a new item every iteration of the loop, so technically they should both be zero-able and could preallocate the size of both expected final sizes based on the for loop condition, but I'm not sure. I've tried preallocating zeroes to the two arrays, but it does not seem to fix anything as Matlab still shows preallocate for speed recommendation.
I've added two comments on the two variables recommended by MATLAB to preallocate below. If someone would be kind as to skim over it and let me know if it is possible, it'd be much appreciated.
x = 10*rand(50,1);
y = 10*rand(50,1);
n=numel(x);
D=zeros(n,n);
for i=1:n-1
for j=i+1:n
D(i,j)=sqrt((x(i)-x(j))^2+(y(i)-y(j))^2);
D(j,i)=D(i,j);
end
end
model.n=n;
model.x=x;
model.y=y;
model.D=D;
nVar=model.n;
MaxIt=100;
nAnt=50;
Q=1;
tau0=10*Q/(nVar*mean(model.D(:)));
alpha=1;
beta=5;
rho=0.6;
eta=1./model.D;
tau=tau0*ones(nVar,nVar);
BestCost=zeros(MaxIt,1);
empty_ant.Tour=[];
empty_ant.Cost=[];
ant=repmat(empty_ant,nAnt,1);
BestSol.Cost=inf;
for it=1:MaxIt
for k=1:nAnt
ant(k).Tour=randi([1 nVar]);
for l=2:nVar
i=ant(k).Tour(end);
P=tau(i,:).^alpha.*eta(i,:).^beta;
P(ant(k).Tour)=0;
P=P/sum(P);
r=rand;
C=cumsum(P);
j=find(r<=C,1,'first');
ant(k).Tour=[ant(k).Tour j];
end
tour = ant(k).Tour;
n=numel(tour);
tour=[tour tour(1)]; %MatLab recommends preallocation here
ant(k).Cost=0;
for i=1:n
ant(k).Cost=ant(k).Cost+model.D(tour(i),tour(i+1));
end
if ant(k).Cost<BestSol.Cost
BestSol=ant(k);
end
end
for k=1:nAnt
tour=ant(k).Tour;
tour=[tour tour(1)];
for l=1:nVar
i=tour(l);
j=tour(l+1);
tau(i,j)=tau(i,j)+Q/ant(k).Cost;
end
end
tau=(1-rho)*tau;
BestCost(it)=BestSol.Cost;
figure(1);
tour=BestSol.Tour;
tour=[tour tour(1)]; %MatLab recommends preallocation here
plot(model.x(tour),model.y(tour),'g.-');
end
If you change the size of an array, that means copying it to a new location in memory. This is not a huge problem for small arrays but for large arrays this slows down your code immensely. The tour arrays you're using are fixed size (51 or n+1 in this case) so you should preallocate them as zero arrays. the only thing you do is add the first element of the tour again to the end so all you have to do is set the last element of the array.
Here is what you should change:
x = 10*rand(50,1);
y = 10*rand(50,1);
n=numel(x);
D=zeros(n,n);
for i=1:n-1
for j=i+1:n
D(i,j)=sqrt((x(i)-x(j))^2+(y(i)-y(j))^2);
D(j,i)=D(i,j);
end
end
model.n=n;
model.x=x;
model.y=y;
model.D=D;
nVar=model.n;
MaxIt=1000;
nAnt=50;
Q=1;
tau0=10*Q/(nVar*mean(model.D(:)));
alpha=1;
beta=5;
rho=0.6;
eta=1./model.D;
tau=tau0*ones(nVar,nVar);
BestCost=zeros(MaxIt,1);
empty_ant.Tour=zeros(n, 1);
empty_ant.Cost=[];
ant=repmat(empty_ant,nAnt,1);
BestSol.Cost=inf;
for it=1:MaxIt
for k=1:nAnt
ant(k).Tour=randi([1 nVar]);
for l=2:nVar
i=ant(k).Tour(end);
P=tau(i,:).^alpha.*eta(i,:).^beta;
P(ant(k).Tour)=0;
P=P/sum(P);
r=rand;
C=cumsum(P);
j=find(r<=C,1,'first');
ant(k).Tour=[ant(k).Tour j];
end
tour = zeros(n+1,1);
tour(1:n) = ant(k).Tour;
n=numel(ant(k).Tour);
tour(end) = tour(1); %MatLab recommends preallocation here
ant(k).Cost=0;
for i=1:n
ant(k).Cost=ant(k).Cost+model.D(tour(i),tour(i+1));
end
if ant(k).Cost<BestSol.Cost
BestSol=ant(k);
end
end
for k=1:nAnt
tour(1:n)=ant(k).Tour;
tour(end) = tour(1);
for l=1:nVar
i=tour(l);
j=tour(l+1);
tau(i,j)=tau(i,j)+Q/ant(k).Cost;
end
end
tau=(1-rho)*tau;
BestCost(it)=BestSol.Cost;
figure(1);
tour(1:n) = BestSol.Tour;
tour(end) = tour(1); %MatLab recommends preallocation here
plot(model.x(tour),model.y(tour),'g.-');
end
I think that the warning that the MATLAB Editor gives in this case is misplaced. The array is not repeatedly resized, it is just resized once. In principle, tour(end+1)=tour(1) is more efficient than tour=[tour,tour(1)], but in this case you might not notice the difference in cost.
If you want to speed up this code you could think of vectorizing some of its loops, and of reducing the number of indexing operations performed. For example this section:
tour = ant(k).Tour;
n=numel(tour);
tour=[tour tour(1)]; %MatLab recommends preallocation here
ant(k).Cost=0;
for i=1:n
ant(k).Cost=ant(k).Cost+model.D(tour(i),tour(i+1));
end
if ant(k).Cost<BestSol.Cost
BestSol=ant(k);
end
could be written as:
tour = ant(k).Tour;
ind = sub2ind(size(model.D),tour,circshift(tour,-1));
ant(k).Cost = sum(model.D(ind));
if ant(k).Cost < BestSol.Cost
BestSol = ant(k);
end
This rewritten code doesn't have a loop, which usually makes things a little faster, and it also doesn't repeatedly do complicated indexing (ant(k).Cost is two indexing operations, within a loop that will slow you down more than necessary).
There are more opportunities for optimization like these, but rewriting the whole function is outside the scope of this answer.
I have not tried running the code, please let me know if there are any errors when using the proposed change.

Matlab subscript error

I am writing a simple code in matlab which has the purpose of creating the histogram of a grayscale image without using the function hist. I am stuck at the point in which mathlab displays the error "Subscript indices must either be real positive integers or logicals." Can you help me finding where is the wrong indices?
indirizzo='file.jpg';
immagine=imread(indirizzo);
immaginebn=rgb2gray(immagine);
n=zerps(0,255);
for x=0:255;
numeroennesimo=sum(sum(immaginebn==x));
n(x)=numeroennesimo;
end
plot(x,n)
you cant use 0 as index. Either make n(x+1) or for x = 1:256 and substract the 1 in your comparison. And there is a typo, I guess it means zeros instead of zerps, which also doesnt work with a 0. And one more, your plot will also not work as the x has only a size of 1 while n is an array of 266 and for a histogram I would use a barplot instead.
indirizzo='file.jpg';
immagine=imread(indirizzo);
immaginebn=rgb2gray(immagine);
n=zeros(1,256);
for x=0:255;
numeroennesimo=sum(sum(immaginebn==x-1));
n(x+1)=numeroennesimo;
end
bar(0:255,n)
or
indirizzo='file.jpg';
immagine=imread(indirizzo);
immaginebn=rgb2gray(immagine);
n=zeros(1,256);
xplot=zeros(1,256);
for x=1:256;
numeroennesimo=sum(sum(immaginebn==x-1));
n(x)=numeroennesimo;
xplot(x) = x-1;
end
plot(xplot,n)

how can I get the location for the maximum value in fortran?

I have a 250*2001 matrix. I want to find the location for the maximum value for a(:,i) where i takes 5 different values: i = i + 256
a(:,256)
a(:,512)
a(:,768)
a(:,1024)
a(:,1280)
I tried using MAXLOC, but since I'm new to fortran, I couldn't get it right.
Try this
maxloc(a(:,256:1280:256))
but be warned, this call will return a value in the range 1..5 for the second dimension. The call will return the index of the maxloc in the 2001*5 array section that you pass to it. So to get the column index of the location in the original array you'll have to do some multiplication. And note that since the argument in the call to maxloc is a rank-2 array section the call will return a 2-element vector.
Your question is a little unclear: it could be either of two things you want.
One value for the maximum over the entire 250-by-5 subarray;
One value for the maximum in each of the 5 250-by-1 subarrays.
Your comments suggest you want the latter, and there is already an answer for the former.
So, in case it is the latter:
b(1:5) = MAXLOC(a(:,256:1280:256), DIM=1)

Removing a "row" from a structure array

This is similar to a question I asked before, but is slightly different:
So I have a very large structure array in matlab. Suppose, for argument's sake, to simplify the situation, suppose I have something like:
structure(1).name, structure(2).name, structure(3).name structure(1).returns, structure(2).returns, structure(3).returns (in my real program I have 647 structures)
Suppose further that structure(i).returns is a vector (very large vector, approximately 2,000,000 entries) and that a condition comes along where I want to delete the jth entry from structure(i).returns for all i. How do you do this? or rather, how do you do this reasonably fast? I have tried some things, but they are all insanely slow (I will show them in a second) so I was wondering if the community knew of faster ways to do this.
I have parsed my data two different ways; the first way had everything saved as cell arrays, but because things hadn't been working well for me I parsed the data again and placed everything as vectors.
What I'm actually doing is trying to delete NaN data, as well as all data in the same corresponding row of my data file, and then doing the very same thing after applying the Hampel filter. The relevant part of my code in this attempt is:
for i=numStock+1:-1:1
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
stock(i).return = sort(stock(i).return);
stock(i).returnLength = length(stock(i).return);
stock(i).medianReturn = median(stock(i).return);
stock(i).madReturn = mad(stock(i).return,1);
end;
for i=numStock:-1:1
for j = length(stock(i+1).volume):-1:1
if(isnan(stock(i+1).volume(j)))
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end
end
stock(i+1).volume = sort(stock(i+1).volume);
stock(i+1).volumeLength = length(stock(i+1).volume);
stock(i+1).medianVolume = median(stock(i+1).volume);
stock(i+1).madVolume = mad(stock(i+1).volume,1);
end;
for i=numStock+1:-1:1
for j=stock(i).returnLength:-1:1
if (abs(stock(i).return(j) - stock(i).medianReturn) > 3*stock(i).madReturn)
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end;
end;
end;
for i=numStock:-1:1
for j=stock(i+1).volumeLength:-1:1
if (abs(stock(i+1).volume(j) - stock(i+1).medianVolume) > 3*stock(i+1).madVolume)
for k=numStock:-1:1
stock(k+1).volume(j) = [];
end
end;
end;
end;
However, this returns an error:
"Matrix index is out of range for deletion.
Error in Failure (line 110)
stock(k).return(j) = [];"
So instead I tried by parsing everything in as vectors. Then I decided to try and delete the appropriate entries in the vectors prior to building the structure array. This isn't returning an error, but it is very slow:
%% Delete bad data, Hampel Filter
% Delete bad entries
id=strcmp(returns,'');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
id=strcmp(returns,'C');
returns(id)=[];
volume(id)=[];
date(id)=[];
ticker(id)=[];
name(id)=[];
permno(id)=[];
sp500(id) = [];
% Convert returns from string to double
returns=cellfun(#str2double,returns);
sp500=cellfun(#str2double,sp500);
% Delete all data for which a return is not a number
nanid=isnan(returns);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Delete all data for which a volume is not a number
nanid=isnan(volume);
returns(nanid)=[];
volume(nanid)=[];
date(nanid)=[];
ticker(nanid)=[];
name(nanid)=[];
permno(nanid)=[];
% Apply the Hampel filter, and delete all data corresponding to
% observations deleted by the filter.
medianReturn = median(returns);
madReturn = mad(returns,1);
for i=length(returns):-1:1
if (abs(returns(i) - medianReturn) > 3*madReturn)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
medianVolume = median(volume);
madVolume = mad(volume,1);
for i=length(volume):-1:1
if (abs(volume(i) - medianVolume) > 3*madVolume)
returns(i) = [];
volume(i)=[];
date(i)=[];
ticker(i)=[];
name(i)=[];
permno(i)=[];
end;
end
As I said, this is very slow, probably because I'm using a for loop on a very large data set; however, I'm not sure how else one would do this. Sorry for the gigantic post, but does anyone have a suggestion as to how I might go about doing what I'm asking in a reasonable way?
EDIT: I should add that getting the vector method to work is probably preferable, since my aim is to put all of the return vectors into a matrix and get all of the volume vectors into a matrix and perform PCA on them, and I'm not sure how I would do that using cell arrays (or even if princomp would work on cell arrays).
EDIT2: I have altered the code to match your suggestion (although I did decide to give up speed and keep with the for-loops to keep with the structure array, since reparsing this data will be way worse time-wise). The new code snipet is:
stock_return = zeros(numStock+1,length(stock(1).return));
for i=1:numStock+1
for j=1:length(stock(i).return)
stock_return(i,j) = stock(i).return(j);
end
end
stock_return = stock_return(~any(isnan(stock_return)), : );
This returns an Index exceeds matrix dimensions error, and I'm not sure why. Any suggestions?
I could not find a convenient way to handle structures, therefore I would restructure the code so that instead of structures it uses just arrays.
For example instead of stock(i).return(j) I would do stock_returns(i,j).
I show you on a part of your code how to get rid of for-loops.
Say we deal with this code:
for j=length(stock(i).return):-1:1
if(isnan(stock(i).return(j)))
for k=numStock+1:-1:1
stock(k).return(j) = [];
end
end
end
Now, the deletion of columns with any NaN data goes like this:
stock_return = stock_return(:, ~any(isnan(stock_return)) );
As for the absolute difference from medianVolume, you can write a similar code:
% stock_return_length is a scalar
% stock_median_return is a column vector (eg. [1;2;3])
% stock_mad_return is also a column vector.
median_return = repmat(stock_median_return, stock_return_length, 1);
is_bad = abs(stock_return - median_return) > 3.* stock_mad_return;
stock_return = stock_return(:, ~any(is_bad));
Using a scalar for stock_return_length means of course that the return lengths are the same, but you implicitly assume it in your original code anyway.
The important point in my answer is using any. Logical indexing is not sufficient in itself, since in your original code you delete all the values if any of them is bad.
Reference to any: http://www.mathworks.co.uk/help/matlab/ref/any.html.
If you want to preserve the original structure, so you stick to stock(i).return, you can speed-up your code using essentially the same scheme but you can only get rid of one less for-loop, meaning that your program will be substantially slower.

Resources