Oracle LEAD & LAG analytics functions - oracle

I have a temp table that I am using for testing, and I need direction with some analytic functions. I am still trying to work out my real solution, and any help pointing me in the right direction would be appreciated.
A1 B1
40 5
50 4
60 3
70 2
90 1
I am trying to find the previous row's value, subtract, and add the column:
SELECT A1, B1,
(A1-B1) AS C1,
(A1-B1) + LEAD((A1-B1),1,0) OVER (ORDER BY ROWNUM) AS G1
FROM TEST;
The output is not what I expect
A1 B1 C1
40 5 35
50 4 46
60 3 57
70 2 68
90 1 89
Starting from the last row (5th row), first subtract A1 - B1 to get C1. Then take (C1 + previous row's A1) - previous row's B1, that is, 89 + 70 - 2 = 157, and save the result in the previous row's C1.
From the 4th row: 157 + 60 - 3 = 214 (saved in the 3rd row's C1).
Repeat until the first row.
Expected final output:
A1 B1 C1
40 5 295
50 4 260
60 3 214
70 2 157
90 1 89

LAG and LEAD only get a single row's value, not an aggregation over multiple rows, and they are not applied recursively.
You want:
SELECT A1,
B1,
SUM( A1 - B1 ) OVER ( ORDER BY ROWNUM
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) AS C1
FROM test;
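To see what the windowed SUM computes, here is a minimal Python sketch (not Oracle code) of the same "current row to last row" sum over the sample data. The function name suffix_sums is made up for illustration.

```python
from itertools import accumulate

# Sample rows (A1, B1) from the question, in table order.
rows = [(40, 5), (50, 4), (60, 3), (70, 2), (90, 1)]


def suffix_sums(rows):
    """For each row, sum (A1 - B1) over that row and every row after it."""
    diffs = [a - b for a, b in rows]
    # Accumulate from the last row backwards, then restore the original order.
    return list(accumulate(reversed(diffs)))[::-1]


print(suffix_sums(rows))  # [295, 260, 214, 157, 89]
```

The result matches the expected C1 column, which is why a single window aggregate is enough and no recursion is needed.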

Related

Need help for finding optimal path which visits multiple sequences of nodes

Summary
Recently I came across a path-finding puzzle with some complex constraints, and I currently have no solution for it.
A 2D matrix represents the graph. The length of a path is the number of traversed cells.
One or more number sequences are to be found inside the matrix. Each sequence is scored with a value.
A maximum path length is given. The number of picked cells must not exceed this value.
At any given moment, you can only choose cells in a specific column or row.
On each turn, you need to switch between column and row and stay on the same line as the last cell you picked. You have to move at right angles (the movement is like in the Snake game).
Always start by picking the first cell from the top row, then go vertically down to pick the second cell, and then continue switching between column and row as usual.
You can't choose the same cell twice. The resulting path must not contain duplicate cells.
For example:
The task is to find the shortest possible path in the graph that contains one or more sequences with the highest total score, where the path's length does not exceed the provided maximum length.
The picture below demonstrates the solved puzzle with the resulting path marked in red:
Here, we have the path 3A-10-9B. This path contains the given sequence 3A-10-9B, which earns 10pts. More complex graphs typically have longer paths containing several sequences at once.
More complex examples
Multiple Sequences
You can complete sequences in any order. The order in which the sequences are listed doesn't matter.
Wasted Moves
Sometimes we are forced to waste moves and choose different cells that don't belong to any sequence. Here are the rules:
You may waste 1 or 2 moves before the first sequence.
You may waste 1 or 2 moves between any neighboring sequences.
However, you cannot break sequences and waste moves in the middle of them.
Here, we must waste one move before the sequence 3A-9B and two moves between sequences 3A-9B and 72-D4. Also, notice how red lines between 3A and 9B as well as between 72 and D4 "cross" previously selected cells D4 and 9B, respectively. You can pick different cells from the same row or column multiple times.
Optimal Sequences
Sometimes, it is not possible to have a path that contains all of the provided sequences. In this case, choose the path that achieves the highest score.
In the above example, we can complete either 9B-3A-72-D4 or 72-D4-3A but not both due to the maximum path length of 5 cells. We have chosen the sequence 9B-3A-72-D4 since it grants more score points than 72-D4-3A.
Unsolvable solution
The first sequence 3A-D4 can't be completed since the code matrix doesn't contain code D4 at all. The second sequence, 72-10, can't be completed for another reason: codes 72 and 10 aren't located in the same row or column anywhere in the matrix and, therefore, can't form a sequence.
Performance advice
One brute force way is to generate all possible paths in the code matrix, loop through them and choose the best one. This is the easiest but also the slowest approach. Solving larger matrices with larger maximum length of path might take dozens of minutes, if not hours.
Try to implement a faster algorithm that doesn’t iterate through all possible paths and can solve puzzles with the following parameters in less than 10 seconds:
Matrix size: 10x10
Number of sequences: 5
Average length of sequences: 4
Maximum path length: 12
At least one solution exists
For example:
Matrix:
41,0f,32,18,29,4b,55,3f,10,3a,
19,4f,57,43,3a,25,19,1e,5e,42,
13,5a,54,3c,1b,32,29,1c,15,30,
49,45,22,2e,25,51,2f,21,4c,37,
1a,5e,49,12,55,1e,49,19,43,2d,
34,26,53,48,49,60,32,3c,50,10,
0f,1e,30,3d,64,37,5b,5e,22,61,
4e,4f,15,5a,13,56,44,22,40,26,
43,2c,17,2b,1f,25,43,60,50,1f,
3c,2b,54,46,42,4d,32,46,30,24,
Sequences:
30, 26, 44, 32, 3c - 25pts
5a, 3c, 12, 1e, 4d - 10pts
1e, 5a, 12 - 10pts
4d, 1e - 5pts
32, 51, 2f, 49, 55, 42 - 30pts
Optimal solution
3f, 1c, 30, 26, 44, 32, 3c, 22, 5a, 12, 1e, 4d
Which contains
30, 26, 44, 32, 3c
5a, 12, 1e
1e, 4d
Conclusion
I am looking for any advice on this puzzle, since I have no idea what keywords to search for. Pseudocode or hints would be helpful, and I would appreciate them. All that has come to my mind is Dijkstra:
For each sequence, since the order doesn't matter, I have to find all possible paths for every permutation, then find the highest-scoring path that contains other input sequences.
After that, choose the best of the best.
In this case, I doubt the performance will be the issue.
The first step is to find whether a required sequence exists.
- SET found FALSE
- LOOP C1 over cells in first row
    - CLEAR foundSequence
    - ADD C1 to foundSequence
    - LOOP C2 over cells in column containing C1
        - IF C2 value == first value in sequence
            - ADD C2 to foundSequence
            - SET found TRUE
            - break from LOOP C2
    - IF found
        - SET direction VERT
        - LOOP V over remaining values in sequence
            - TOGGLE direction
            - SET found FALSE
            - LOOP C2 over cells in same column or row ( depending on direction ) containing last cell in foundSequence
                - IF C2 value == V
                    - ADD C2 to foundSequence
                    - SET found TRUE
                    - break from LOOP C2
            - IF ! found
                - break out of LOOP V
    - IF foundSequence == required sequence
        - RETURN foundSequence
RETURN failed
Note: this doesn't find sequences that are feasible with "wasted moves". I would implement this first and get it working. Then, using the same ideas, it can be extended to allow wasted moves.
You have not specified an input format! I suggest a space-delimited text file with lines beginning with 'm' containing matrix values and lines beginning with 's' containing sequences, like this:
m 3A 3A 10 9B
m 9B 72 3A 10
m 10 3A 3A 3A
m 3A 10 3A 9B
s 3A 10 9B
I have implemented the sequence finder in C++
std::vector<int> findSequence()
{
    int w, h;
    pA->size(w, h);
    std::vector<int> foundSequence;
    bool found = false;
    bool vert = false;

    // loop over cells in first row
    for (int c = 0; c < w; c++)
    {
        foundSequence.clear();
        found = false;
        // reset so that the first move from every starting cell is vertical
        vert = false;
        if (pA->cell(c, 0)->value == vSequence[0][0])
        {
            // found possible starting cell
            foundSequence.push_back(pA->cell(c, 0)->ID());
            found = true;
        }
        while (found)
        {
            // toggle search direction
            vert = (!vert);

            // start from last cell found
            // (named lastCol, lastRow to avoid shadowing the loop variable c)
            auto pmCell = pA->cell(foundSequence.back());
            int lastCol, lastRow;
            pA->coords(lastCol, lastRow, pmCell);

            // look for next value in required sequence
            std::string nextValue = vSequence[0][foundSequence.size()];
            found = false;
            if (vert)
            {
                // loop over cells in column: the row index is bounded by the height
                for (int r2 = 1; r2 < h; r2++)
                {
                    if (pA->cell(lastCol, r2)->value == nextValue)
                    {
                        foundSequence.push_back(pA->cell(lastCol, r2)->ID());
                        found = true;
                        break;
                    }
                }
            }
            else
            {
                // loop over cells in row: the column index is bounded by the width
                for (int c2 = 0; c2 < w; c2++)
                {
                    if (pA->cell(c2, lastRow)->value == nextValue)
                    {
                        foundSequence.push_back(pA->cell(c2, lastRow)->ID());
                        found = true;
                        break;
                    }
                }
            }
            if (!found)
            {
                // dead end - try starting from next cell in first row
                break;
            }
            if (foundSequence.size() == vSequence[0].size())
            {
                // success!!!
                return foundSequence;
            }
        }
    }
    std::cout << "Cannot find sequence\n";
    exit(1);
}
This outputs:
3A 3A 10 9B
9B 72 3A 10
10 3A 3A 3A
3A 10 3A 9B
row 0 col 1 3A
row 3 col 1 10
row 3 col 3 9B
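For comparison, here is a minimal Python sketch of the same search written with backtracking instead of stopping at the first matching cell. It is not the code from the repository. It assumes, as the C++ above does, that the first value of the sequence sits in the top row and that the first move is vertical; the name find_sequence and the (row, col) tuples are choices made for this illustration, and wasted moves are not handled.

```python
def find_sequence(matrix, sequence):
    """Return [(row, col), ...] spelling `sequence`, or None if there is none.

    Assumptions: the first value is in the top row, the first move is
    vertical, moves alternate between column and row, no cell is reused.
    """
    height, width = len(matrix), len(matrix[0])

    def extend(path, vertical):
        if len(path) == len(sequence):
            return path
        row, col = path[-1]
        wanted = sequence[len(path)]
        if vertical:
            candidates = [(r, col) for r in range(height)]
        else:
            candidates = [(row, c) for c in range(width)]
        for cell in candidates:
            r, c = cell
            if cell not in path and matrix[r][c] == wanted:
                found = extend(path + [cell], not vertical)
                if found:
                    return found
        return None  # dead end: the caller tries its next candidate

    for col in range(width):
        if matrix[0][col] == sequence[0]:
            found = extend([(0, col)], True)
            if found:
                return found
    return None


matrix = [line.split() for line in (
    "3A 3A 10 9B",
    "9B 72 3A 10",
    "10 3A 3A 3A",
    "3A 10 3A 9B",
)]
print(find_sequence(matrix, ["3A", "10", "9B"]))  # [(0, 1), (3, 1), (3, 3)]
```

On the sample matrix this yields the same three cells as the output above. Backtracking matters when a row or column holds the wanted value more than once, because the first match may lead to a dead end while a later one does not.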
You can check out the code for the complete application at https://github.com/JamesBremner/stackoverflow75410318
I have added the ability to find sequences that start elsewhere than the first row ( i.e. with "wasted moves" ). You can see the code in the github repo.
Here are the results of a timing profile run on a 10 by 10 matrix. The algorithm searches for 5 sequences, taking about 0.6 milliseconds per sequence on average:
Searching
41 0f 32 18 29 4b 55 3f 10 3a
19 4f 57 43 3a 25 19 1e 5e 42
13 5a 54 3c 1b 32 29 1c 15 30
49 45 22 2e 25 51 2f 21 4c 37
1a 5e 49 12 55 1e 49 19 43 2d
34 26 53 48 49 60 32 3c 50 10
0f 1e 30 3d 64 37 5b 5e 22 61
4e 4f 15 5a 13 56 44 22 40 26
43 2c 17 2b 1f 25 43 60 50 1f
3c 2b 54 46 42 4d 32 46 30 24
for sequence 4d 1e
Cannot find sequence starting in 1st row, using wasted moves
row 9 col 5 4d
row 4 col 5 1e
for sequence 30 26 44 32 3c
Cannot find sequence starting in 1st row, using wasted moves
Cannot find sequence
for sequence 5a 3c 12 1e 4d
Cannot find sequence starting in 1st row, using wasted moves
row 2 col 1 5a
row 2 col 3 3c
row 4 col 3 12
row 4 col 5 1e
row 9 col 5 4d
for sequence 1e 5a 12
Cannot find sequence starting in 1st row, using wasted moves
row 6 col 1 1e
row 4 col 5 1e
row 4 col 3 12
for sequence 32 51 2f 49 55 42
Cannot find sequence starting in 1st row, using wasted moves
row 2 col 5 32
row 3 col 5 51
row 3 col 6 2f
row 4 col 6 49
row 4 col 4 55
row 9 col 4 42
raven::set::cRunWatch code timing profile
Calls Mean (secs) Total Scope
5 0.00059034 0.0029517 findSequence

Clusterization algorithm

I have a problem with clustering clients.
I have a dataset with columns such as name, address, email, phone, etc. (A, B, C in the example). Each row has a unique identifier (ID). I need to assign a CLUSTER_ID (X) to each row. Within one cluster, every row shares one or more attribute values with at least one other row. So if clients with ID=1,2,3 have the same A attribute and clients with ID=3,10 have the same B attribute, then ID=1,2,3,10 should be in the same cluster.
How can I solve this problem using SQL?
If that is not possible, how would I write the algorithm (pseudocode)?
Performance is very important, because the dataset contains millions of rows.
Sample Input:
ID A B C
1 A1 B3 C1
2 A1 B2 C5
3 A1 B10 C10
4 A2 B1 C5
5 A2 B8 C1
6 A3 B1 C4
7 A4 B6 C3
8 A4 B3 C5
9 A5 B7 C2
10 A6 B10 C3
11 A8 B5 C4
Sample Output:
ID A B C X
1 A1 B3 C1 1
2 A1 B2 C5 1
3 A1 B10 C10 1
4 A2 B1 C5 1
5 A2 B8 C1 1
6 A3 B1 C4 1
7 A4 B6 C3 1
8 A4 B3 C5 1
9 A5 B7 C2 2
10 A6 B10 C3 1
11 A8 B5 C4 1
Thanks for any help.
One possible way is to repeatedly update the rows whose X is still empty.
Start with cluster ID 1, for example by using a variable:
SET @CurrentClusterID = 1
Take the top 1 record and update its X to 1.
Now loop an update over all records that have an empty X and that can be linked, through the same A or B or C, to a record with X = 1.
Disclaimer:
The statement will vary depending on the RDBMS.
This is just intended as pseudo-code.
WHILE (<<some check to see if there were records updated>>)
BEGIN
    UPDATE yourtable t
    SET t.X = @CurrentClusterID
    WHERE t.X IS NULL
    AND EXISTS (
        SELECT 1 FROM yourtable d
        WHERE d.X = @CurrentClusterID
        AND (d.A = t.A OR d.B = t.B OR d.C = t.C)
    );
END
Loop that until it updates 0 records.
Now repeat the method for the other clusters, until there are no more empty X values in the table:
1) Increase @CurrentClusterID by 1.
2) Update the next top 1 record with an empty X to the new @CurrentClusterID.
3) Loop the update until no more updates are made.
An example test on db<>fiddle here for MS Sql Server.
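Since the question also asks for an algorithm, here is a minimal Python sketch of a different technique from the looped UPDATE: a union-find (disjoint set) pass that links every row to the first row seen with the same value in the same column. The function name cluster_rows and the tuple layout (ID, A, B, C) are assumptions made for this illustration.

```python
def cluster_rows(rows):
    """rows: list of (id, a, b, c). Returns {id: cluster_id}."""
    parent = {row[0]: row[0] for row in rows}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    first_seen = {}  # (column index, value) -> first row id holding that value
    for row_id, *attrs in rows:
        for column, value in enumerate(attrs):
            key = (column, value)
            if key in first_seen:
                union(first_seen[key], row_id)
            else:
                first_seen[key] = row_id

    # Number the clusters 1, 2, ... in order of first appearance.
    cluster_of_root, result = {}, {}
    for row_id, *_ in rows:
        root = find(row_id)
        cluster_of_root.setdefault(root, len(cluster_of_root) + 1)
        result[row_id] = cluster_of_root[root]
    return result


sample = [
    (1, "A1", "B3", "C1"), (2, "A1", "B2", "C5"), (3, "A1", "B10", "C10"),
    (4, "A2", "B1", "C5"), (5, "A2", "B8", "C1"), (6, "A3", "B1", "C4"),
    (7, "A4", "B6", "C3"), (8, "A4", "B3", "C5"), (9, "A5", "B7", "C2"),
    (10, "A6", "B10", "C3"), (11, "A8", "B5", "C4"),
]
print(cluster_rows(sample))
```

On the sample input this puts ID 9 in cluster 2 and every other row in cluster 1, matching the sample output. It reads each row once, which may suit millions of rows better than repeated table-wide updates, although that has not been measured here.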

Is it possible to find maximum value of 2 or more column in a table?

For example, I have the following table:
id math science english history
1 80 90 90 90
2 70 60 81 78
3 69 50 45 80
4 30 40 10 80
I only want to find the maximum value in the math and science columns.
Is it possible?
Simply use this:
select max(science),max(math) from your_table

Computing lag in Hive by a variable

My input table looks like:
guest_id days
101 79
101 70
101 68
101 61
102 101
102 90
102 55
103 99
103 90
Note that days are in descending order within each guest_id.
Desired output table:
guest_id days days_diff
101 79 0
101 70 9
101 68 2
101 61 7
102 101 0
102 90 11
102 55 35
103 99 0
103 90 9
days_diff is the first order difference by guest_id (not throughout days column)
You need to have a unique id column as well (otherwise Hive doesn't know about the order of your rows).
Then you can just self join each row to the previous one (a.id = b.id + 1) to get your differences. The left join keeps the very first row, which has no previous row:
select a.guest_id,
a.days,
case when a.guest_id = b.guest_id then b.days-a.days else 0 end days_diff
from
input a
left join input b on a.id=b.id+1
Edit: As pointed out by Kunal in the comments, Hive does have a LAG window function, which requires a PARTITION BY ... ORDER BY clause. You still need something to order your table by; for example, if you have a date column, you would use it like the following:
SELECT guest_id,
days,
COALESCE(LAG(days, 1) OVER (PARTITION BY guest_id ORDER BY date) - days, 0) AS days_diff
FROM input;
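To make the intended result concrete, here is a minimal Python sketch (not Hive code) of the per-guest first-order difference. It assumes the rows arrive already in the desired order; the function name days_diff is made up for illustration.

```python
def days_diff(rows):
    """rows: list of (guest_id, days), already in the desired order."""
    result = []
    previous = {}  # guest_id -> days value of that guest's previous row
    for guest_id, days in rows:
        # The first row of each guest has no previous row, so its diff is 0.
        diff = previous[guest_id] - days if guest_id in previous else 0
        result.append((guest_id, days, diff))
        previous[guest_id] = days
    return result


rows = [
    (101, 79), (101, 70), (101, 68), (101, 61),
    (102, 101), (102, 90), (102, 55),
    (103, 99), (103, 90),
]
for row in days_diff(rows):
    print(*row)
```

The third column printed is 0, 9, 2, 7, 0, 11, 35, 0, 9, which is the days_diff column of the desired output.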

how to group photos with similar faces together

Most face recognition SDKs provide only two major functions:
1) detecting faces and extracting templates from photos; this is called detection.
2) comparing two templates and returning the similarity score; this is called recognition.
However, beyond those two functions, what I am looking for is an algorithm or SDK for grouping photos with similar faces together, e.g. based on similarity scores.
Thanks
First, perform step 1 to extract the templates, then compare each template with all the others by applying step two on all the possible pairs, obtaining their similarity scores.
Sort the matches based on this similarity score, decide on a threshold and group together those templates whose score reaches or exceeds it.
Take, for instance, the following case:
Ten templates: A, B, C, D, E, F, G, H, I, J.
Scores between: 0 and 100.
Similarity threshold: 80.
Similarity table:
A B C D E F G H I J
A 100 85 8 0 1 50 55 88 90 10
B 85 100 5 30 99 60 15 23 8 2
C 8 5 100 60 16 80 29 33 5 8
D 0 30 60 100 50 50 34 18 2 66
E 1 99 16 50 100 8 3 2 19 6
F 50 60 80 50 8 100 20 55 13 90
G 55 15 29 34 3 20 100 51 57 16
H 88 23 33 18 2 55 51 100 8 0
I 90 8 5 2 19 13 57 8 100 3
J 10 2 8 66 6 90 16 0 3 100
Sorted matches list:
BE 99
AI 90
FJ 90
AH 88
AB 85
CF 80
------- <-- Threshold cutoff line
DJ 66
.......
Iterate through the list down to the threshold cutoff point, below which the scores no longer reach the threshold. While iterating, maintain a full set of templates and an association set (group) for each template to obtain the final groups:
// Empty initial full templates set
fullSet = {};
// Iterate through the pairs list
foreach (templatePair : pairList)
{
    // If the full set contains the first template from the pair
    if (fullSet.contains(templatePair.first))
    {
        // Add the second template to its group
        templatePair.first.addTemplateToGroup(templatePair.second);
        // If the full set also contains the second template
        if (fullSet.contains(templatePair.second))
        {
            // The second template is removed from the full set
            fullSet.remove(templatePair.second);
            // The second template's group is added to the first template's group
            templatePair.first.addGroupToGroup(templatePair.second.group);
        }
    }
    // If the full set contains only the second template from the pair
    else if (fullSet.contains(templatePair.second))
    {
        // Add the first template to its group
        templatePair.second.addTemplateToGroup(templatePair.first);
    }
    else
    {
        // If none of the templates are present in the full set, add the first one
        // to the full set and the second one to the first one's group
        fullSet.add(templatePair.first);
        templatePair.first.addTemplateToGroup(templatePair.second);
    }
}
Execution details on the list:
BE: fullSet.add(B); B.addTemplateToGroup(E);
AI: fullSet.add(A); A.addTemplateToGroup(I);
FJ: fullSet.add(F); F.addTemplateToGroup(J);
AH: A.addTemplateToGroup(H);
AB: A.addTemplateToGroup(B); fullSet.remove(B); A.addGroupToGroup(B.group);
CF: F.addTemplateToGroup(C);
In the end, you end up with the following similarity groups:
A - I, H, B, E
F - J, C
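As a runnable illustration, here is a minimal Python sketch that reaches the same groups with a different technique: union-find over the pairs whose score reaches the threshold, rather than the full-set bookkeeping described above. The function name group_templates and the (first, second, score) tuple layout are assumptions made for this example.

```python
def group_templates(scored_pairs, threshold):
    """Group templates linked by any pair scoring at least `threshold`."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for first, second, score in scored_pairs:
        if score >= threshold:
            root_first, root_second = find(first), find(second)
            parent[root_second] = root_first  # merge the two groups

    groups = {}
    for template in list(parent):
        groups.setdefault(find(template), set()).add(template)
    return sorted(sorted(group) for group in groups.values())


pairs = [
    ("A", "I", 90), ("F", "J", 90), ("B", "E", 99), ("A", "H", 88),
    ("A", "B", 85), ("C", "F", 80), ("D", "J", 66),
]
print(group_templates(pairs, 80))
# [['A', 'B', 'E', 'H', 'I'], ['C', 'F', 'J']]
```

Templates that never appear in a pair at or above the threshold (D and G in the table) are left out of the result, so each of them would be a group on its own. The grouping is transitive: B joins A's group through the AB pair even though B and I score only 8 against each other, so the threshold should be chosen with that in mind.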
