I have a problem with clustering clients.
I have a dataset with columns such as name, address, email, phone, etc. (in the example: A, B, C). Each row has a unique identifier (ID). I need to assign a CLUSTER_ID (X) to each row. Within one cluster, every row shares one or more attribute values with some other row in the cluster. So if clients with ID = 1, 2, 3 share the same A attribute and clients with ID = 3, 10 share the same B attribute, then ID = 1, 2, 3, 10 should all be in the same cluster.
How can I solve this problem using SQL?
If it's not possible, how would I write the algorithm (pseudocode)?
Performance is very important, because the dataset contains millions of rows.
Sample Input:
ID A B C
1 A1 B3 C1
2 A1 B2 C5
3 A1 B10 C10
4 A2 B1 C5
5 A2 B8 C1
6 A3 B1 C4
7 A4 B6 C3
8 A4 B3 C5
9 A5 B7 C2
10 A6 B10 C3
11 A8 B5 C4
Sample Output:
ID A B C X
1 A1 B3 C1 1
2 A1 B2 C5 1
3 A1 B10 C10 1
4 A2 B1 C5 1
5 A2 B8 C1 1
6 A3 B1 C4 1
7 A4 B6 C3 1
8 A4 B3 C5 1
9 A5 B7 C2 2
10 A6 B10 C3 1
11 A8 B5 C4 1
Thanks for any help.
A possible way is by repeating updates for the rows with an empty X.
Start with cluster_id 1, e.g. by using a variable:
DECLARE @CurrentClusterID int = 1;
Take the top 1 record and update its X to 1.
Now loop an update for all records that still have an empty X and that can be linked to a record with X = @CurrentClusterID through the same A, B, or C.
Disclaimer:
The statement will vary depending on the RDBMS.
This is just intended as pseudo-code.
DECLARE @RowsUpdated int = 1;

WHILE (@RowsUpdated > 0)
BEGIN
    UPDATE t
    SET    X = @CurrentClusterID
    FROM   yourtable t
    WHERE  t.X IS NULL
      AND EXISTS (
            SELECT 1
            FROM yourtable d
            WHERE d.X = @CurrentClusterID
              AND (d.A = t.A OR d.B = t.B OR d.C = t.C)
          );
    SET @RowsUpdated = @@ROWCOUNT;
END
Loop that until it updates 0 records.
Now repeat the method for the other clusters, until there are no more empty X values in the table:
1) Increase @CurrentClusterID by 1.
2) Update the next top 1 record with an empty X to the new @CurrentClusterID.
3) Loop the update until no more updates are done.
An example test on db<>fiddle here for MS SQL Server.
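Given the performance requirement (millions of rows), it may also be worth computing the clusters outside the database: this is a connected-components problem, and a union-find (disjoint-set) structure solves it in near-linear time. Below is a minimal sketch in Python, a different technique than the SQL loop above; the `cluster` function name and the `rows` input format are my own assumptions:

def cluster(rows):
    """rows: list of (id, a, b, c) tuples; returns {id: cluster_id}."""
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees flat
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry

    first_owner = {}  # (column, value) -> first row id seen with that value
    for rid, a, b, c in rows:
        parent[rid] = rid
        for key in (("A", a), ("B", b), ("C", c)):
            if key in first_owner:
                union(rid, first_owner[key])  # a shared value links the rows
            else:
                first_owner[key] = rid

    # number the clusters 1, 2, ... in order of first appearance
    numbering, result = {}, {}
    for rid, *_ in rows:
        root = find(rid)
        result[rid] = numbering.setdefault(root, len(numbering) + 1)
    return result

On the sample input this yields X = 1 for every row except ID = 9, which gets X = 2, matching the expected output.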
Table A

id  Name
1   A1
2   A2
3   A3
4   A4
5   A5

Table B

id  id_table_A  id_table_C  Name
1   1           1           Test-1
2   2           1           Test-2
3   1           2           Test-3
4   3           2           Test-4
5   3           1           Test-5
6   5           2           Test-6
7   2           2           Test-7

Table C

id  Name
1   C1
2   C2
3   C3
My Question
I want to select all data from table A, together with the matching Name from table B, according to an id in table C, in Laravel (Query Builder or Eloquent, it doesn't matter). The result should look like this:
Choosing C1 would show:
id_table_A  Name
A1          Test-1
A2          Test-2
A3          Test-5
A4          NULL
A5          NULL
Or when I choose C2, it would show:
id_table_A  Name
A1          Test-3
A2          Test-7
A3          Test-4
A4          NULL
A5          Test-6
And C3 would show:
id_table_A  Name
A1          NULL
A2          NULL
A3          NULL
A4          NULL
A5          NULL
NB: Sorry I don't know how to simplify my question
You can use a many-to-many relation.
A::class
class A extends Model
{
    public function Cs()
    {
        return $this->belongsToMany(C::class, 'b', 'id_table_a', 'id_table_c')->withPivot('id', 'name');
    }
}
Then use this relation to get what you need
$as = A::with('Cs')->get();
I have a temp table I am using to test, and I need direction with some analytic functions. I am still trying to figure out my real solution; any help to lead me in the right direction will be appreciated.
A1 B1
40 5
50 4
60 3
70 2
90 1
Trying to take the previous value, subtract, and add the columns:
SELECT A1, B1,
(A1-B1) AS C1,
(A1-B1) + LEAD((A1-B1),1,0) OVER (ORDER BY ROWNUM) AS G1
FROM TEST;
The output is not what I expect
A1 B1 C1
40 5 35
50 4 46
60 3 57
70 2 68
90 1 89
From the last row (5th row): first subtract A1 - B1 to get C1; then take (C1 + previous row's A1) - previous row's B1, i.e. 89 + 70 - 2 = 157 (save the result in the previous row's C1).
4th row: 157 + 60 - 3 = 214.
Repeat until the first row.
Expected final output should be:
A1 B1 C1
40 5 295
50 4 260
60 3 214
70 2 157
90 1 89
LAG and LEAD only get a single row's value, not an aggregation of multiple rows, and they are not applied recursively. What you describe is a running sum of A1 - B1 from the current row down to the last row, e.g. for the 4th row: (70 - 2) + (90 - 1) = 157.
You want:
SELECT A1,
B1,
SUM( A1 - B1 ) OVER ( ORDER BY ROWNUM
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) AS C1
FROM test;
Task
I want to calculate the permanent P of an NxN matrix for N up to 100. I can make use of the fact that the matrix features only M = 4 (or slightly more) distinct rows and columns. The matrix might look like
A1 ... A1 B1 ... B1 C1 ... C1 D1 ... D1 |
...                                     | r1 identical rows
A1 ... A1 B1 ... B1 C1 ... C1 D1 ... D1 |
A2 ... A2 B2 ... B2 C2 ... C2 D2 ... D2
...
A2 ... A2 B2 ... B2 C2 ... C2 D2 ... D2
A3 ... A3 B3 ... B3 C3 ... C3 D3 ... D3
...
A3 ... A3 B3 ... B3 C3 ... C3 D3 ... D3
A4 ... A4 B4 ... B4 C4 ... C4 D4 ... D4
...
A4 ... A4 B4 ... B4 C4 ... C4 D4 ... D4
---------
c1 identical cols
where c and r are the multiplicities of the columns and rows. All values in the matrix lie between 0 and 1 and are encoded as double-precision floating-point numbers.
Algorithm
I tried to use the Ryser formula to calculate the permanent. For the formula, one needs to first calculate the sum of each row and multiply all the row sums. For the matrix above this yields
S0 = (c1 * A1 + c2 * B1 + c3 * C1 + c4 * D1)^r1 * ...
* (c1 * A4 + c2 * B4 + c3 * C4 + c4 * D4)^r4
As a next step the same is done with col 1 deleted
S1 = ((c1-1) * A1 + c2 * B1 + c3 * C1 + c4 * D1)^r1 * ...
* ((c1-1) * A4 + c2 * B4 + c3 * C4 + c4 * D4)^r4
and this number is subtracted from S0.
The algorithm continues with all possible ways to delete single columns and groups of columns; the products of the row sums of the remaining matrix are added (even number of columns deleted) or subtracted (odd number of columns deleted).
The task can be solved relatively efficiently if one makes use of the identical columns (for example, the result S1 will pop up exactly c1 times).
Problem
Even if the final result is small, the values of the intermediate results S0, S1, ... can reach values up to N^N. A double can hold such a number, but its absolute precision at that magnitude is below or on the order of the expected overall result. The expected result P is on the order of c1!*c2!*c3!*c4! (actually I am interested in P/(c1!*c2!*c3!*c4!), which should lie between 0 and 1).
I tried to arrange the additions and subtractions of the values S so that the intermediate sums stay around 0. This helps in the sense that I can avoid intermediate results exceeding N^N, but it improves things only a little. I also thought about using logarithms for the intermediate results to keep the absolute numbers down, but the relative accuracy of the encoded numbers is still bounded by the floating-point encoding, so I think I will run into the same problem. If possible, I want to avoid data types that implement variable-precision arithmetic, for performance reasons (currently I am using MATLAB).
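For reference, the grouped Ryser enumeration described above can be written down directly: instead of iterating over all 2^N column subsets, iterate over how many columns of each type are deleted and weight each term by a binomial coefficient. Here is a minimal sketch in Python (not MATLAB); the function name and the (M, r, c) input format are my own. It uses exact Fraction arithmetic, which avoids the cancellation entirely but is exactly the variable-precision cost the question hopes to avoid, so treat it as a reference to check faster floating-point variants against:

from fractions import Fraction
from itertools import product
from math import comb

def permanent_grouped_ryser(M, r, c):
    # M[j][t]: entry shared by all rows of type j and columns of type t
    # r[j]: multiplicity of row type j; c[t]: multiplicity of column type t
    # Ryser with inclusion-exclusion over COUNTS of deleted columns per type:
    # perm = sum over k (0 <= k[t] <= c[t]) of
    #        (-1)^(sum k) * prod_t C(c[t], k[t])
    #                     * prod_j (sum_t (c[t]-k[t]) * M[j][t]) ^ r[j]
    M = [[Fraction(x) for x in row] for row in M]  # exact copies of the doubles
    total = Fraction(0)
    for k in product(*(range(ct + 1) for ct in c)):
        sign = -1 if sum(k) % 2 else 1
        weight = 1
        for kt, ct in zip(k, c):
            weight *= comb(ct, kt)  # ways to choose the deleted columns
        term = Fraction(weight)
        for j, rj in enumerate(r):
            row_sum = sum((ct - kt) * M[j][t]
                          for t, (kt, ct) in enumerate(zip(k, c)))
            term *= row_sum ** rj
        total += sign * term
    return total

# 2x2 check: perm([[a, b], [c, d]]) = a*d + b*c
assert permanent_grouped_ryser([[1, 2], [3, 4]], [1, 1], [1, 1]) == 1*4 + 2*3

With M = 4 row/column types and c_t around 25, this is roughly 26^4 ≈ 457,000 terms, far fewer than the 2^100 subsets of plain Ryser.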
I am creating a dashboard using Excel Power Query (aka M), in which I need to create a measure that rolls up values over the last 12 months for two dimensions.
Example:
Input:
D1 | D2 | MonthYear(D3) | Value
A1 B1 Mar2016 1
A2 B1 Mar2016 2
A3 B1 Mar2016 3
A1 B1 Apr2016 4
A2 B1 Apr2016 5
A3 B1 Apr2016 6
A1 B1 May2016 7
A2 B1 May2016 8
A3 B1 May2016 9
Output:
D1 | D2 | MonthYear(D3) | Value
A1 B1 Mar2016 1
A2 B1 Mar2016 2
A3 B1 Mar2016 3
A1 B1 Apr2016 4+1
A2 B1 Apr2016 5+2
A3 B1 Apr2016 6+3
A1 B1 May2016 7+4+1
A2 B1 May2016 8+5+2
A3 B1 May2016 9+6+3
Also, the sum should cover only the last 12 months if more data is available. Any help is appreciated.
I covered a very similar scenario to this in my demo file: Power Query demo - Running Total.xlsx
You can download it from my OneDrive and review the steps:
https://1drv.ms/f/s!AGLFDsG7h6JPgw4
Basically you add an Index, Group By the "group columns" (in your scenario D1 and D2) and create an "All Rows" Aggregate column. Then you Copy the "All Rows" column, Expand both "All Rows" columns, Filter and finally Group By and Sum to create the Running Total.
The only bit of code is the Added column to produce a true/false column for the filter, e.g.
[Index] >= [#"All Rows - Copy.Index"]
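If it helps to see what those steps compute, here is the same rolling logic written out imperatively as a small Python sketch (not Power Query / M code; the column names, the integer month index, and the 12-month cutoff handling are my assumptions):

from collections import defaultdict

def rolling_12_month_sum(rows):
    """rows: list of (d1, d2, month_index, value); month_index is an integer
    that increases by 1 per calendar month (e.g. Mar2016 = 0, Apr2016 = 1)."""
    history = defaultdict(list)  # (d1, d2) -> [(month_index, value), ...]
    out = []
    for d1, d2, m, v in sorted(rows, key=lambda r: r[2]):
        history[(d1, d2)].append((m, v))
        # sum the current month and the 11 months before it
        total = sum(val for mi, val in history[(d1, d2)] if m - 11 <= mi <= m)
        out.append((d1, d2, m, total))
    return out

On the sample input this produces 1, 5, 12 for A1/B1 across Mar, Apr, May 2016, i.e. the 1, 4+1, 7+4+1 from the expected output.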
I'm starting to work with MVC, but I'm stuck on one piece of logic with a LINQ query. I have attached the image which explains the scenario and logic. Kindly help me with the LINQ query.
Column A Column B Column C
Test A A1 C1
Test A A2 C2
Test A A4 C3
Test A A5
Test B B1
Test B B2 C7
Test B B3
Test B B4 C9
Test C D1
Test C D2
Count of (rows of a Column A value whose Column B entry has a Column C value) / (total rows of that Column A value):
Test A 3/5= 0.6
Test B 2/4= 0.5
Test C 0/2= 0
Something like this (by the way, it's not clear what the impact of Column B is):
yourTable.GroupBy(m => m.ColumnA)
         .Select(m => new {
             key = m.Key,
             // count the rows that DO have a Column C value
             count = m.Count(x => x.ColumnC != null) / (decimal)m.Count()
         });
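For comparison, the corrected counting logic (rows whose Column C is non-empty, divided by all rows in the group) as a plain Python sketch, with an assumed input format:

from collections import defaultdict

def ratios(rows):
    # rows: list of (column_a, column_b, column_c) tuples; column_c may be None
    totals, with_c = defaultdict(int), defaultdict(int)
    for a, _b, c in rows:
        totals[a] += 1
        if c is not None:
            with_c[a] += 1
    return {a: with_c[a] / totals[a] for a in totals}

On the sample data this gives Test A = 0.6, Test B = 0.5, Test C = 0, as expected.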