How to represent clusters in MATLAB? - algorithm

Suppose I have the following data sets:
A:
1 8 9 12
2 1 0 35
7 0 0 23
B:
6 3
1 9
0 7
What I want to do is, for each row in B, find the smallest value and the index of the column it appears in. For example, for row 1 of B, the smallest value is 3, which comes from column 2. Therefore add row 1 of A to Cluster 2.
For row 2 from B, the smallest value is 1, which comes from column 1. Therefore add row 2 from A to Cluster 1. And so on...
Now I want to make an array called C (this will represent my clusters) with 2 items. Item 1 contains the matrix of all rows from A that should be in Cluster 1, and Item 2 contains the matrix of all rows from A that should be in Cluster 2. This is where I'm having problems. This is my current attempt:
function clusterSet = buildClusters(A, B)
clusterSet = zeros(size(B, 2)); % Number of clusters = number of columns in B
for i = 1:size(A, 1)
[value, index] = min(B(i,:)); % Get the minimum value of B in row i, and its index (column number)
clusterSet(index) = A(i,:); % Add row i from A to its corresponding cluster's matrix.
end
end
I'm getting the following error on the last line (note: this is not explicitly referring to my data sets 'A' and 'B', but talks about a general A and B):
In an assignment A(I) = B, the number of elements in B and I must
be the same.
If the minimum value of B in row 1 comes from column 2, then row 1 from A should be added to a matrix Cluster 2 (row of B corresponds to which row of A to add to the cluster, and the column of B represents which cluster to add it to). This is what I want that line to do but I get the above error.
Any suggestions?

Here's a way without loops:
[~, cluster] = min(B,[],2); %// get cluster index of each row
[clusterSort, indSort] = sort(cluster); %// sort cluster indices
sz = accumarray(clusterSort,1); %// size of each cluster
C = mat2cell(A(indSort,:), sz); %// split A into cell array based on clusters
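For readers who want to see the same logic step by step, here is a minimal sketch in plain Python (not MATLAB). The function name build_clusters is made up for this illustration, and clusters are numbered from 0 here, while MATLAB counts from 1.

```python
def build_clusters(A, B):
    """Group the rows of A by the column index of the minimum in each row of B."""
    n_clusters = len(B[0])                      # one cluster per column of B
    clusters = [[] for _ in range(n_clusters)]  # each cluster holds a list of rows
    for row_a, row_b in zip(A, B):
        idx = row_b.index(min(row_b))           # column of the smallest value (0-based)
        clusters[idx].append(row_a)
    return clusters


A = [[1, 8, 9, 12], [2, 1, 0, 35], [7, 0, 0, 23]]
B = [[6, 3], [1, 9], [0, 7]]
clusters = build_clusters(A, B)
```

The point is that each cluster is a container that can hold a variable number of rows. That is what the cell array C provides in the MATLAB answer, and what a plain numeric array cannot do.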

Related

How to extract vectors from a given condition matrix in Octave

I'm trying to extract a matrix with two columns. The first column is the data that I want to group into a vector, while the second column is information about the group.
A =
1 1
2 1
7 2
9 2
7 3
10 3
13 3
1 4
5 4
17 4
1 5
6 5
The results I am looking for are:
A1 =
1
2
A2 =
7
9
A3 =
7
10
13
A4 =
1
5
17
A5 =
1
6
As an illustration, I tried the eval function, but it didn't give the results I wanted.
Assuming that you don't actually need individually named, separate variables, the following will put the values into separate cells of a cell array. Each cell can be an arbitrary size and can then be retrieved using cell index syntax. It makes use of logical indexing, so that each iteration of the for loop assigns to that cell in B just the values from the first column of A that have the matching number in the second column of A.
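num_cells = max (A(:,2));   % number of groups = largest label in column 2
B = cell (num_cells,1);     % pre-allocate one cell per group
for idx = 1:num_cells
B{idx} = A((A(:,2)==idx),1);   % column-1 values whose label equals idx
end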
B =
{
[1,1] =
1
2
[2,1] =
7
9
[3,1] =
7
10
13
[4,1] =
1
5
17
[5,1] =
1
6
}
Cell arrays are accessed a bit differently than normal numeric arrays. Array indexing (with ()) will return another cell, e.g.:
>> B(1)
ans =
{
[1,1] =
1
2
}
To get the contents of the cell so that you can work with them like any other variable, index them using {}.
>> B{1}
ans =
1
2
How it works:
Use max(A(:,2)) to find out how many array elements are going to be needed. A(:,2) uses subscript notation to indicate every value of A in column 2.
Create an empty cell array B with the right number of cells to contain the separated parts of A. This isn't strictly necessary, but with large amounts of data, things can slow down a lot if you keep adding on to the end of an array. Pre-allocating is usually better.
For each iteration of the for loop, it determines which elements in the 2nd column of A have the value matching the value of idx. This returns a logical array. For example, for the third time through the for loop, idx = 3, and:
>> A_index3 = A(:,2)==3
A_index3 =
0
0
0
0
1
1
1
0
0
0
0
0
That is a logical array of true/false values indicating which elements equal 3. You are allowed to mix logical and subscript indexing. So using this we can retrieve just those values from the first column:
A(A_index3, 1)
ans =
7
10
13
We get the same result if we do it in a single line without the A_index3 intermediate placeholder:
>> A(A(:,2)==3, 1)
ans =
7
10
13
Putting it in a for loop where 3 is replaced by the loop variable idx, and we assign the answer to the idx location in B, we get all of the values separated into different cells.
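The same grouping idea can be sketched in plain Python (not Octave), with a dictionary playing the role of the cell array. The name split_by_group is hypothetical and only serves this illustration.

```python
def split_by_group(rows):
    """Collect first-column values under their second-column group label."""
    groups = {}
    for value, label in rows:
        groups.setdefault(label, []).append(value)  # start a new list on first sight
    return groups


A = [(1, 1), (2, 1), (7, 2), (9, 2), (7, 3), (10, 3), (13, 3),
     (1, 4), (5, 4), (17, 4), (1, 5), (6, 5)]
groups = split_by_group(A)
```

Unlike the max(A(:,2)) approach, this does not assume the labels are the consecutive integers 1..n. It just uses whatever labels appear.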

Finding the optimal sum of a 2D array

The problem statement goes like this:
Given an N x M array of (nonnegative) integers, find the optimal value of each column in the array, taking into account free rows.
A free row is anything in the range of [prevOld, prevNew]
Free rows are given starting values of:
prevOld = 0
prevNew = 0
so at the first step, only row 0 is free. If an element lies in a free row, it incurs no penalty.
If an element is not in a free row, then the penalty incurred is 2 * the distance to the closest free row (e.g. if the free rows are [1,3] and our element is in row 5, then the element loses 2*(5-3) value). But if the element was in row 2, then no penalty is incurred, since 1 <= 2 <= 3.
Once an element has been selected, free rows are updated as such:
prevOld = prevNew
prevNew = row of selected element
So if we begin with [x,y] and choose element in row z, then for the next column, free rows are now [y,z]
We are asked to solve this using a dynamic programming algorithm.
I am having a hell of a time coming up with a recurrence relation for this problem. I originally tried an algorithm that chooses the maximum element in a column based on a "real" value given the free rows, but this doesn't take into account the fact that sometimes we want to choose a lower value in our current column for a higher value in the next column. Any point in the right direction would be greatly appreciated.
EDIT
Sample input/output (no input/output was given, so putting this together from instructions):
Input: 3 x 3 array
4 5 7
7 8 7
1 9 10
First column is 4 7 1 and our starting rows are prevOld = 0 and prevNew = 0, [0,0]
so the 0th row is the only free row, with that, the "real" values of the first column are: 4 5 -3
4 is row 0, so it is free therefore its value is not affected
7 is in row 1, which is 1 away from the closest free row 0, so it loses 2*1 value, so 7 has a "real" value of 5
1 is in row 2, which is 2 away from the closest row, so it loses 2*2 value, so 1 has a "real" value of -3
With these real values we would select 7 (row 1) as our choice. Now we update prevOld and prevNew
prevOld = prevNew
prevNew = 1 (because we selected row 1)
so now we have [0,1] as the free rows and we move onto the next column: 5 8 9
Skipping ahead: real values of this column are
5 (row 0 is free, no loss)
8 (row 1 is free, no loss)
7 (row 2 is 1 away from the closest free row, so it loses 2*1 value)
so we choose 8 (row 1) and update prevOld and prevNew again:
prevOld = prevNew
prevNew = 1 (selected row)
free rows are now [1,1] for the final column: 7 7 10
real values are: 5 7 8, so we choose 8 and we're done
Output is: 1 1 2 (the rows we selected in each column), total: 21 (total of the "real" values we selected in the rows)
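One possible recurrence, offered as a sketch rather than a confirmed solution: let the state after each column be the pair (prevOld, prevNew), and for each state keep only the best total that reaches it. The code below assumes the array is given as a list of rows, that the free range is every row between the two stored rows regardless of their order, and that the goal is to maximise the total of the "real" values. The name best_path is made up for this example.

```python
def best_path(grid):
    """Return (best_total, chosen_rows) for a grid given as a list of rows."""
    n_rows, n_cols = len(grid), len(grid[0])
    # state (prev_old, prev_new) -> (best total so far, rows chosen so far)
    states = {(0, 0): (0, [])}
    for col in range(n_cols):
        next_states = {}
        for (old, new), (total, rows) in states.items():
            lo, hi = min(old, new), max(old, new)   # assumed free range
            for r in range(n_rows):
                dist = max(lo - r, 0, r - hi)       # 0 when r is a free row
                value = total + grid[r][col] - 2 * dist
                key = (new, r)                      # prevOld = prevNew, prevNew = r
                if key not in next_states or value > next_states[key][0]:
                    next_states[key] = (value, rows + [r])
        states = next_states
    return max(states.values(), key=lambda s: s[0])


sample = [[4, 5, 7],
          [7, 8, 7],
          [1, 9, 10]]
total, rows = best_path(sample)
```

On the sample above this picks rows 1 2 2 for a total of 22, one more than the greedy walk-through: giving up 1 in the second column (9 - 2 = 7 instead of 8) makes row 2 free for the 10 in the last column. There are at most N^2 states per column and N choices per state, so the sketch runs in O(M * N^3) time.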

Split sequence of numbers from 1 to n^2 in n subsequences so they all have the same sum

Given the number n and the sequence of numbers from 1 to n^2, how can it be split into n subsequences so that each subsequence has length n and they all have the same sum?
For example if n = 3 answer could be:
3 4 8 = 15
2 6 7 = 15
1 5 9 = 15
I feel this problem can be solved by making a few observations about it.
For example, let's say we have n=3. Then n^2=9.
Now total sum of all the numbers from 1 to 9 = 9 * (9+1) / 2 = 45.
So, now we can split 45 into three equal groups each having sum = 45/3 = 15.
Similarly:-
n = 4, sum of 1 to 16 numbers = 16 * 17/2 = 136. each group sum = 136/4 = 34.
n = 5, sum of 1 to 25 numbers = 25 * 26/2 = 25*13. each group sum = 25*13/5 = 65.
Now, we know what should be sum of each set of groups in order to split numbers into n sub sequences.
Another observation we make is whether our n is odd or even.
For n even, the splitting is very easy.
n = 2, so we have numbers 1 to 4.
1 4
2 3.
Let's assume a matrix of n x n , in above case it will be 2 x 2.
Rules for even n:-
1. Keep a counter = 1.
2. Fill the first column (1 to n), incrementing the counter by 1.
3. When we reach the bottom of the column, for column 2, we do a reverse iteration (n to 1) and fill the cells with the counter, incrementing it by 1 each time.
You can verify this technique will work by taking n=2,4,6 ... and filling the array.
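A minimal sketch of the even-n rules, for checking them by machine. The rules only spell out columns 1 and 2, so this assumes the direction keeps alternating for later columns. The name snake_split is hypothetical.

```python
def snake_split(n):
    """Fill an n x n grid column by column, reversing direction on every other column."""
    rows = [[] for _ in range(n)]
    counter = 1
    for col in range(n):
        # even-numbered columns go top to bottom, odd-numbered ones bottom to top
        order = range(n) if col % 2 == 0 else range(n - 1, -1, -1)
        for r in order:
            rows[r].append(counter)
            counter += 1
    return rows


groups = snake_split(4)
sums = [sum(g) for g in groups]
```

Each pair of adjacent columns contributes the same amount to every row, which is why the row sums come out equal when n is even and why the same fill does not balance for odd n.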
Now let's see how to fill this matrix n x n for n odd.
Rules for odd n:-
1. Keep a counter = 1.
2. Fill the first column (1 to n), incrementing the counter by 1.
3. This case is slightly different from the even case: from the next column onwards,
we don't reverse our direction (n to 1) but keep moving down the column, wrapping around to the top.
Let's understand this step by looking at an example.
Let's take n=3 and number the rows 0, 1 and 2.
Our first column will be 1,2,3 (rows 0, 1 and 2).
For the second column we start at the bottom row, which is row n-1 = 2 in our example.
Row 2 gets the value 4. The next row is (2+1)%n = 0, which gets 5, and the next row is (2+2)%n = 1, which gets 6. Now all of column 2 is filled, so let's move on to the third column.
We start at row 1, so row 1 of column 3 gets 7, then row 2 gets 8, and then row 0 gets 9.
The rows are now 1 5 9, 2 6 7 and 3 4 8, each summing to 15.
Hope this helps!
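A sketch of the odd-n rules, again only to check them. The example covers n=3 only, so this assumes that for larger n each column starts one row above where the previous column started, wrapping around. The name cyclic_split is made up for the illustration.

```python
def cyclic_split(n):
    """Fill an n x n grid column by column, shifting the start row up by one each column."""
    rows = [[] for _ in range(n)]
    counter = 1
    for col in range(n):
        start = (n - col) % n                  # rows 0, n-1, n-2, ... for columns 0, 1, 2, ...
        for step in range(n):
            rows[(start + step) % n].append(counter)   # move down, wrapping to the top
            counter += 1
    return rows


groups = cyclic_split(3)
```

With this shift every row receives each offset 1..n exactly once across the columns, so each row sums to n*(n^2+1)/2, for example 15 for n=3 and 65 for n=5.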

SAS grouping algorithm

I have the following mock up table
#n a b group
1 1 1 1
2 1 2 1
3 2 2 1
4 2 3 1
5 3 4 2
6 3 5 2
7 4 5 2
I am using SAS for this problem. In column group, the rows that are interconnected through a and b are grouped. I will try to explain why these rows are in the same group
rows 1 and 2 are in group 1 since they both have a = 1
row 3 is in group 1 since b = 2 in rows 2 and 3 and row 2 is in group 1
row 3 and 4 are in group 1 since a = 2 in both rows and row 3 is in group 1
The overall logic is that if a row x contains the same value of a or b as row y, row x also belongs to the same group as y is a part of.
Following the same logic, row 5,6 and 7 are in group 2.
Is there any way to make an algorithm to find these groups?
Case I:
Grouping defined as to be item linkage within contiguous rows.
Use the LAG function to examine both variables' prior values. Increase the group value if both have changed. For example
group + ( a ne lag(a) and b ne lag(b) );
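To make the effect of that sum statement concrete, here is a rough Python sketch (not SAS) of the same idea: the group counter goes up whenever both a and b differ from the previous row. The name lag_groups is hypothetical, and None stands in for the missing value that LAG returns on the first row.

```python
def lag_groups(rows):
    """Assign group numbers to (a, b) rows; a new group starts when both values change."""
    group = 0
    prev_a = prev_b = None          # stand-in for LAG's missing value on row 1
    result = []
    for a, b in rows:
        if a != prev_a and b != prev_b:
            group += 1
        result.append(group)
        prev_a, prev_b = a, b
    return result


table = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
groups = lag_groups(table)
```

This only links each row to the one directly before it, so it depends on the rows being ordered so that linked rows are adjacent.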
Case II:
Grouping determined from pair item slot value linkages over all data.
From grouping pairs by either key
General statement of problem:
-----------------------------
Given: P = p{i} = (p{i,1},p{i,2}), a set of pairs (key1, key2).
Find: The distinct groups, G = g{x}, of P,
such that each pair p in a group g has this property:
key1 matches key1 of any other pair in g.
-or-
key2 matches key2 of any other pair in g.
Demonstrates
… an iterative way using hashes.
Two hashes maintain the groupId assigned to each key value.
Two additional hashes are used to maintain group mapping paths.
When the data can be passed without causing a mapping, then the groups have
been fully determined.
A final pass is done, at which point the groupIds are assigned to each
pair and the data is output to a table.
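The hash-based SAS program itself is not shown here. As an illustration of the general problem statement only, the sketch below uses a different technique, union-find, in Python: every a value and every b value is a node, each pair joins its two nodes, and rows whose nodes end up connected share a group. The names group_pairs, find and union are hypothetical.

```python
def group_pairs(pairs):
    """Return a group id per (key1, key2) pair; pairs linked through either key share an id."""
    parent = {}

    def find(x):
        # follow parent links to the representative, shortening the path on the way
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for a, b in pairs:
        union(("a", a), ("b", b))   # tag keys so a=1 and b=1 stay distinct nodes

    group_ids = {}
    result = []
    for a, _ in pairs:
        root = find(("a", a))
        result.append(group_ids.setdefault(root, len(group_ids) + 1))
    return result


table = [(1, 1), (1, 2), (2, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
groups = group_pairs(table)
```

Unlike the LAG approach, this does not depend on row order, so linked rows may be anywhere in the data.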

Determining the Longest Contiguous Subsequence

There are N nodes (1 <= N <= 100,000) at various positions along a
long one-dimensional length. The ith node is at position x_i (an
integer in the range 0...1,000,000,000) and has a node type b_i (an integer in
the range 1..8). No two nodes can be at the same position.
You want to get a range on this one-dimensional length in which all of the types of nodes are fairly represented. Therefore, you want to ensure that, for whatever types of nodes are present in the range, there is an equal number of each node type (for example, a range with 27 each of types 1 and 3 is ok, a range with 27 each of types 1, 3, and 4 is
ok, but 9 of type 1 and 10 of type 3 is not ok). You also want
at least K (K >= 2) types (out of the 8 total) to be represented in the
range. Find the maximum size of a range that satisfies the constraints. The size of a range is the difference between the maximum and minimum positions of the nodes in the range.
If there are no ranges satisfying the constraints, output -1 instead.
INPUT:
* Line 1: N and K separated by a space
* Lines 2..N+1: Each line contains a description of a node as two
integers separated by a space; x(i) and its node type.
SAMPLE INPUT:
9 2
1 1
5 1
6 1
9 1
100 1
2 2
7 2
3 3
8 3
INPUT DETAILS:
Node types: 1 2 3 - 1 1 2 3 1 - ... - 1
Locations: 1 2 3 4 5 6 7 8 9 10 ... 99 100
OUTPUT:
* Line 1: A single integer indicating the maximum size of a fair
range. If no such range exists, output -1.
SAMPLE OUTPUT:
6
OUTPUT DETAILS:
The range from x = 2 to x = 8 has 2 each of types 1, 2, and 3. The range
from x = 9 to x = 100 has 2 of type 1, but this is invalid because K = 2
and so you need at least 2 distinct types of nodes.
Could you please suggest an algorithm to solve this? I have thought about using some sort of priority queue or stack data structure, but am really unsure how to proceed.
Thanks, Todd
It's not too difficult to come up with an almost linear-time algorithm, because a similar problem was recently discussed on CodeChef: "ABC-Strings".
1. Sort nodes by their positions.
2. Prepare all possible subsets of node types (for example, we could expect types 1,2,4,5,7 to be present in the resulting interval and all other types not present there). For K=2 there may be only 256-8-1=247 subsets. For each subset perform the remaining steps:
3. Initialize 8 type counters to [0,0,0,0,0,0,0,0].
4. For each node perform the remaining steps:
5. Increment the counter for the current node type.
6. Take the L counters for types included in the current subset and subtract the first of them from the other L-1 counters, which produces L-1 values. Take the remaining 8-L counters and combine them with those L-1 values into a tuple of 7 values.
7. Use this tuple as a key for a hash map. If the hash map contains no value for this key, add a new entry with this key and a value equal to the position of the next node (a range matched later starts just after the current node). Otherwise subtract the value in the hash map from the position of the current node and (possibly) update the best result.
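The steps above can be sketched in Python as follows. This is one reading of them, not a tested contest solution. Two details are assumptions on my part: the all-zero key for the empty prefix is stored before the sweep, and each new key records the position of the following node, because a range that later matches the key begins there. The name max_fair_range is hypothetical.

```python
from itertools import combinations


def max_fair_range(nodes, k, num_types=8):
    """nodes: list of (position, type) pairs. Returns the largest fair range size, or -1."""
    nodes = sorted(nodes)
    if not nodes:
        return -1
    all_types = range(1, num_types + 1)
    best = -1
    for size in range(k, num_types + 1):
        for subset in combinations(all_types, size):
            outside = [t for t in all_types if t not in subset]
            counts = [0] * (num_types + 1)

            def make_key():
                # differences inside the subset, raw counts outside it
                base = counts[subset[0]]
                diffs = tuple(counts[t] - base for t in subset[1:])
                return diffs + tuple(counts[t] for t in outside)

            # empty prefix: a range matching it starts at the first node
            start_for_key = {make_key(): nodes[0][0]}
            for i, (pos, node_type) in enumerate(nodes):
                counts[node_type] += 1
                key = make_key()
                if key in start_for_key:
                    best = max(best, pos - start_for_key[key])
                elif i + 1 < len(nodes):
                    start_for_key[key] = nodes[i + 1][0]
    return best


sample = [(1, 1), (5, 1), (6, 1), (9, 1), (100, 1), (2, 2), (7, 2), (3, 3), (8, 3)]
answer = max_fair_range(sample, 2)
```

Two prefixes with the same key differ by an equal amount in every counter of the subset and by nothing outside it, so the nodes between them form a fair range made up of exactly the subset's types. On the sample input this gives 6, the range from x = 2 to x = 8.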
