SAS grouping algorithm - algorithm

I have the following mock up table
#n a b group
1 1 1 1
2 1 2 1
3 2 2 1
4 2 3 1
5 3 4 2
6 3 5 2
7 4 5 2
I am using SAS for this problem. In column group, the rows that are interconnected through a and b are grouped. I will try to explain why these rows are in the same group
row 1 to 2 are in group 2 since they both have a = 1
row 3 is in group 2 since b = 2 in row 2 and 3 and row 2 is in group 1
row 3 and 4 are in group 1 since a = 2 in both rows and row 3 is in group 1
The overall logic is that if a row x contains the same value of a or b as row y, row x also belongs to the same group as y is a part of.
Following the same logic, row 5,6 and 7 are in group 2.
Is there any way to make an algorithm to find these groups?

Case I:
Grouping defined as to be item linkage within contiguous rows.
Use the LAG function to examine both variables prior values. Increase the group value if both have changed. For example
group + ( a ne lag(a) and b ne lag(b) );
Case II:
Grouping determined from pair item slot value linkages over all data.
From grouping pairs by either key
General statement of problem:
-----------------------------
Given: P = p{i} = (p{i,1},p{i,2}), a set of pairs (key1, key2).
Find: The distinct groups, G = g{x}, of P,
such that each pair p in a group g has this property:
key1 matches key1 of any other pair in g.
-or-
key2 matches key2 of any other pair in g.
Demonstrates
… an iterative way using hashes.
Two hashes maintain the groupId assigned to each key value.
Two additional hashes are used to maintain group mapping paths.
When the data can be passed without causing a mapping, then the groups have
been fully determined.
A final pass is done, at which point the groupIds are assigned to each
pair and the data is output to a table.

Related

Sorting/ordering values from smallest to biggest in an array

I have a formula like this : =ArrayFormula(sort(INDEX($B$1:$B$10,MATCH(E1,$A$1:$A$10,0))))
in columns A:B:
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9
j 10
and
the data to convert in E:H
a c f e
f a c b
b a c d
I get the following results using the above formula
in columns L:O:
1 3 6 5
6 1 3 2
2 1 3 4
My desired output is like this:
1 3 5 6
1 2 3 6
1 2 3 4
I'd like to arrange the numbers from smallest to biggest in value. I can do this with additional helper cells. but if possible i'd like to get the same result without any additional cells. can i get a little help please? thanks.
To sort by row, use SORT BYROW. But unfortunately, nested array results aren't supported in BYROW. So, we need to JOIN and SPLIT the resulting array.
=ARRAYFORMULA(SPLIT(BYROW(your_formula,LAMBDA(row,JOIN("🌆",SORT(TRANSPOSE(row))))),"🌆"))
Here's another way using Makearray with Index to get the current row and Small to get the smallest, next smallest etc. within the row:
=ArrayFormula(makearray(3,4,lambda(r,c,small(index(vlookup(E1:H3,A1:B10,2,false),r,0),c))))
Or you could change the order (might be a little faster) as you don't need to vlookup the entire array, just the current row:
=ArrayFormula(makearray(3,4,lambda(r,c,small(vlookup(index(E1:H3,r,0),A1:B10,2,false),c))))
It's interesting (to me at any rate) that you can interrogate the row and column number of the current cell using Map or Scan, so this is also possible:
=ArrayFormula(map(E1:H3,lambda(cell,small(vlookup(index(E1:H3,row(cell),0),A1:B10,2,false),column(cell)-column(E:E)+1))))
Thanks to #JvdV for this insight (which may be obvious to some but wasn't to me) shown here in Excel.
try:
=INDEX(TRIM(SPLIT(FLATTEN(QUERY(QUERY(QUERY(SPLIT(FLATTEN(E1:H3&"×​"&ROW(E1:H3)), "​"),
"select max(Col1) group by Col1 pivot Col2"), "offset 1", 0),,9^9)), "×")))
or if you want numbers:
=INDEX(IFNA(VLOOKUP(TRIM(SPLIT(FLATTEN(QUERY(QUERY(QUERY(SPLIT(FLATTEN(E1:H3&"×​"&ROW(E1:H3)), "​"),
"select max(Col1) group by Col1 pivot Col2"), "offset 1", 0),,9^9)), "×")), A:B, 2, 0)))

Best approach for formula of matching values in sheet 2

I need to populate column A in sheet two based on multiple columns in sheet one.
For example, here are two of multiple conditions:
If columns A,B,C,D (of sheet 2) are all 5/6 then populate corresponding row in sheet one with "mid".
If columns A,B,C,D (of sheet 2) contain at least one 3 and L,M,O contain all 0s, populate "low".
I believe using SWITCH would make the most sense, unless someone can reccommend a simpler approach?
My main issue is with the syntax of writing this, I am getting a formula parse error:
=SWITCH(Sheet 1!G2:G&K2:K,ISBETWEEN(5,6),"mid")
Sheet 1
A B C D E F G H I J K L M N O
2 2 3 2 0 0 0 0
5 5 6 6
In row one of my example sheet 2 would get "mid" and row 2 would get "low"
try:
=ARRAYFORMULA(
IF( 4=LEN(REGEXREPLACE(FLATTEN(QUERY(TRANSPOSE(A1:D5),,9^9)), "[^5-6]+", )), "mid",
IF((4=LEN(REGEXREPLACE(FLATTEN(QUERY(TRANSPOSE(L1:O5),,9^9)), "[^0]+", )))*(REGEXMATCH(FLATTEN(QUERY(TRANSPOSE(A1:D5),,9^9)), "3")), "low", )))

How to extract vectors from a given condition matrix in Octave

I'm trying to extract a matrix with two columns. The first column is the data that I want to group into a vector, while the second column is information about the group.
A =
1 1
2 1
7 2
9 2
7 3
10 3
13 3
1 4
5 4
17 4
1 5
6 5
the result that i seek are
A1 =
1
2
A2 =
7
9
A3 =
7
10
13
A4=
1
5
17
A5 =
1
6
as an illustration, I used the eval function but it didn't give the results I wanted
Assuming that you don't actually need individually named separated variables, the following will put the values into separate cells of a cell array, each of which can be an arbitrary size and which can be then retrieved using cell index syntax. It makes used of logical indexing so that each iteration of the for loop assigns to that cell in B just the values from the first column of A that have the correct number in the second column of A.
num_cells = max (A(:,2));
B = cell (num_cells,1);
for idx = 1:max(A(:,2))
B(idx) = A((A(:,2)==idx),1);
end
B =
{
[1,1] =
1
2
[2,1] =
7
9
[3,1] =
7
10
13
[4,1] =
1
5
17
[5,1] =
1
6
}
Cell arrays are accessed a bit differently than normal numeric arrays. Array indexing (with ()) will return another cell, e.g.:
>> B(1)
ans =
{
[1,1] =
1
2
}
To get the contents of the cell so that you can work with them like any other variable, index them using {}.
>> B{1}
ans =
1
2
How it works:
Use max(A(:,2)) to find out how many array elements are going to be needed. A(:,2) uses subscript notation to indicate every value of A in column 2.
Create an empty cell array B with the right number of cells to contain the separated parts of A. This isn't strictly necessary, but with large amounts of data, things can slow down a lot if you keep adding on to the end of an array. Pre-allocating is usually better.
For each iteration of the for loop, it determines which elements in the 2nd column of A have the value matching the value of idx. This returns a logical array. For example, for the third time through the for loop, idx = 3, and:
>> A_index3 = A(:,2)==3
A_index3 =
0
0
0
0
1
1
1
0
0
0
0
0
That is a logical array of trues/falses indicating which elements equal 3. You are allowed to mix both logical and subscripts when indexing. So using this we can retrieve just those values from the first column:
A(A_index3, 1)
ans =
7
10
13
we get the same result if we do it in a single line without the A_index3 intermediate placeholder:
>> A(A(:,2)==3, 1)
ans =
7
10
13
Putting it in a for loop where 3 is replaced by the loop variable idx, and we assign the answer to the idx location in B, we get all of the values separated into different cells.

count the number of split points

I just got a question about counting the split points in a integer array, to ensure there is at least one duplicated integer on the two sides.
ex:
1 1 4 2 4 2 4 1
we can either split it into:
1 1 4 2 | 4 2 4 1
or
1 1 4 2 4 | 2 4 1
so that there is at least one '1', '2' ,and '4' are in both sides.
The integer can range from 1 to 100,000
The complexity requires O(n). How to solve this question?
Make one pass over the array and build count[i] = how many times the value i appears in the array. The problem is only solvable if count[i] >= 2 for all non-zero values. You can use this array to tell how many distinct values you have in your array.
Next, make another pass and using another array count2[i] (or you can reuse the first one), keep track of when you have visited each value at least once. Then use that position as your split point.
Example:
1 1 4 2 4 2 4 1
count = [3, 2, 0, 4] => 3 distinct values
1 1 4 2 4 2 4 1
^ => 1 distinct value so far
^ => 1 distinct value so far
^ => 2 distinct values so far
^ => 3 distinct values so far => this is your split point
There might be cases for which there is no solution, for example if the last 1 was at the beginning as well. To detect this, you can just make another pass over the rest of the array after you have decided on the split point and see if you still have all the values on that side.
You can avoid this last pass by using the count and count2 arrays to detect when you can no longer have a split point. This is left as an exercise.

Array problem using if and do loop

This is my code:
data INDAT8; set INDAT6;
Array myarray{24,27};
goodgroups=0;
do i=2 to 24 by 2;
do j=2 to 27;
if myarray[i,j] gt 1 then myarray[i+1,j] = 'bad';
else if myarray[i,j] eq 1 and myarray[i+1,j] = 1 then myarray[i+1,j]= 'good';
end;
end;
run;
proc print data=INDAT8;
run;
Problem:
I have the data in this format- it is just an example: n=2
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
Now, the above data is in sorted manner (total 7 rows). I need to make a group of 2 , 3 or 4 rows separately and generate a graph. In the above data, I made a group of 2 rows. The third row is left alone as there is no other column in 3rd row to form a group. A group can be formed only within the same row. NOT with other rows.
Now, I will check if both the rows have “good” in the info column or not. If both rows have “good” – the group formed is also good , otherwise bad. In the above example, 3rd /last group is “good” group. Rest are all bad group. Once I’m done with all the rows, I will calculate the total no. of Good groups formed/Total no. of groups.
In the above example, the output will be: Total no. of good groups/Total no. of groups => 1/3.
This is the case of n=2(size of group)
Now, for n=3, we make group of 3 rows and for n=4, we make a group of 4 rows and find the good /bad groups in a similar way. If all the rows in a group has “good” block—the result is good block, otherwise bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the above case, I left the 4th row and last 2 rows as I can’t make group of 3 rows with them. The first group result is “bad” and last group result is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I make a group of 4 and finds the result. The 5th,6th,7th,8th row are left behind or ignored. I made 2 groups of 4 rows and both are “good” blocks.
Output: 2/2
So, After getting 3 output values from n=2 , n-3, and n=4 I will plot a graph of these values.
If you can help in any any language using array, if and do loop. it would be great.
I can change my code accordingly.
Update:
The answer for this doesn't have to be in sas. Since it is more algorithm-related than anything, I will accept suggestions in any language as long as they show how to accomplish this using arrays and do.
I am having trouble understanding your problem statement, but from what I can gather here is what I can suggest:
Place data into bins and the process the summary data.
Implementation 1
Assumption: You don't know what the range of the first column will be or distriution will be sparse
Create a hash table. The Key will be the item you are doing your grouping on. The value will be the count seen so far.
Proces each record. If the key already exists, increment the count (value for that key in the hash). Otherwise add the key and set the value to 1.
Continue until you have processed all records
Count the number of keys in the hash table and the number of values that are greater than your threshold.
Implementation 2
Assumption: You know the range of the first column and the distriution is reasonably dense
Create an array of integers with enough elements so the index can match the column value. Initialize all elements to zero. This array will hold your count for each item you are grouping on
Process each record. Examine value of first column. Increment corresponding index in array. (So if you have "2 1 good", do groupCount[2]++)
Continue until you have processed all records
Walk each element in the array. Count how many items are non zero (meaning they appeared at least once) and how many items meet your threshold.
You can use the same approach for gathering the good and bad counts.

Resources