Show duplicates in internal table - algorithm

Each and every item should have a unique SecNo + Drawing combination. Due to misentries, some combinations occur twice.
I need to create an ABAP report that identifies those combinations and leaves out the others.
Item: SecNo: Drawing:
121 904 5000 double
122 904 5000 double
123 816 5100
124 813 5200
125 812 4900 double
126 812 4900 double
127 814 5300
How can I solve this? I tried two approaches and failed:
* Sorting the data and printing each row whose value is equal to the value in the next row
* Counting the duplicates and showing every combination that occurs more than once
Where do I put the condition? In the loop?
I tried this:
REPORT duplicates.
DATA: BEGIN OF lt_duplicates OCCURS 0,
f2(10),
f3(10),
END OF lt_duplicates,
it_test TYPE TABLE OF ztest WITH HEADER LINE,
i TYPE i.
SELECT DISTINCT f2 f3 FROM ztest INTO TABLE lt_duplicates.
LOOP AT lt_duplicates.
IF f2 = lt_duplicates-f2 AND f3 = lt_duplicates-f3.
ENDIF.
i = LINES( it_test ).
IF i > 1.
LOOP AT it_test.
WRITE :/ it_test-f1,it_test-f2,it_test-f3.
ENDLOOP.
ENDIF.
ENDLOOP.

From ABAP 7.40, you can use the GROUP BY constructs with the GROUP SIZE addition, so that only the groups with at least 2 elements are taken into account.
There are two constructs:
* ABAP statement LOOP AT ... GROUP BY ( <columns...> gs = GROUP SIZE ) ...
  Loop at grouped lines: either LOOP AT GROUP ... or ... FOR ... IN GROUP ...
* ABAP expression ... VALUE|REDUCE|NEW type|#( FOR GROUPS ... GROUP BY ( <columns...> gs = GROUP SIZE ) ...
  Loop at grouped lines: ... FOR ... IN GROUP ...
Taking the table from the question, with line numbers added:
Line# Item SecNo Drawing
1 121 904 5000 double
2 122 904 5000 double
3 123 816 5100
4 124 813 5200
5 125 812 4900 double
6 126 812 4900 double
7 127 814 5300
You might want to produce the following table containing the duplicates:
SecNo Drawing Lines
904 5000 [1,2]
812 4900 [5,6]
Solution with LOOP AT ... GROUP BY ...:
TYPES: BEGIN OF t_line,
item TYPE i,
secno TYPE i,
drawing TYPE i,
END OF t_line,
BEGIN OF t_duplicate,
secno TYPE i,
drawing TYPE i,
num_dup TYPE i, " number of duplicates
lines TYPE STANDARD TABLE OF REF TO t_line WITH EMPTY KEY,
END OF t_duplicate,
t_lines TYPE STANDARD TABLE OF t_line WITH EMPTY KEY,
t_duplicates TYPE STANDARD TABLE OF t_duplicate WITH EMPTY KEY.
DATA(table) = VALUE t_lines(
( item = 121 secno = 904 drawing = 5000 )
( item = 122 secno = 904 drawing = 5000 )
( item = 123 secno = 816 drawing = 5100 )
( item = 124 secno = 813 drawing = 5200 )
( item = 125 secno = 812 drawing = 4900 )
( item = 126 secno = 812 drawing = 4900 )
( item = 127 secno = 814 drawing = 5300 ) ).
DATA(expected_duplicates) = VALUE t_duplicates(
( secno = 904 drawing = 5000 num_dup = 2 lines = VALUE #( ( REF #( table[ 1 ] ) ) ( REF #( table[ 2 ] ) ) ) )
( secno = 812 drawing = 4900 num_dup = 2 lines = VALUE #( ( REF #( table[ 5 ] ) ) ( REF #( table[ 6 ] ) ) ) ) ).
DATA(actual_duplicates) = VALUE t_duplicates( ).
LOOP AT table
ASSIGNING FIELD-SYMBOL(<line>)
GROUP BY
( secno = <line>-secno
drawing = <line>-drawing
gs = GROUP SIZE )
ASSIGNING FIELD-SYMBOL(<group_table>).
IF <group_table>-gs >= 2.
actual_duplicates = VALUE #( BASE actual_duplicates
( secno = <group_table>-secno
drawing = <group_table>-drawing
num_dup = <group_table>-gs
lines = VALUE #( FOR <line2> IN GROUP <group_table> ( REF #( <line2> ) ) ) ) ).
ENDIF.
ENDLOOP.
WRITE : / 'List of duplicates:'.
SKIP 1.
WRITE : / 'Secno Drawing List of concerned items'.
WRITE : / '---------- ---------- ---------------------------------- ...'.
LOOP AT actual_duplicates ASSIGNING FIELD-SYMBOL(<duplicate>).
WRITE : / <duplicate>-secno, <duplicate>-drawing NO-GROUPING.
LOOP AT <duplicate>-lines INTO DATA(line).
WRITE line->*-item.
ENDLOOP.
ENDLOOP.
ASSERT actual_duplicates = expected_duplicates. " short dump if not equal
Output:
List of duplicates:
Secno Drawing List of concerned items
---------- ---------- ---------------------------------- ...
904 5000 121 122
812 4900 125 126
Solution with ... VALUE type|#( FOR GROUPS ... GROUP BY ...:
DATA(actual_duplicates) = VALUE t_duplicates(
FOR GROUPS <group_table> OF <line> IN table
GROUP BY
( secno = <line>-secno
drawing = <line>-drawing
gs = GROUP SIZE )
( secno = <group_table>-secno
drawing = <group_table>-drawing
num_dup = <group_table>-gs
lines = VALUE #( FOR <line2> IN GROUP <group_table> ( REF #( <line2> ) ) ) ) ).
DELETE actual_duplicates WHERE num_dup = 1.
Note: instead of removing the non-duplicates with an additional DELETE statement, you can do it inside the VALUE construct by adding a LINES OF COND construct, which adds 1 line if the group size is >= 2, or none otherwise (group size = 1):
...
gs = GROUP SIZE )
( LINES OF COND #( WHEN <group_table>-gs >= 2 THEN VALUE #( "<== new line
( secno = <group_table>-secno
...
... REF #( <line2> ) ) ) ) ) ) ) ). "<== 3 extra right parentheses
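For readers who do not know ABAP, the grouping idea can be sketched in Python. This is only an illustration of the same technique (group by the two columns, keep groups of size 2 or more); the names `rows` and `find_duplicates` are made up for this example and the data is the sample from the question.

```python
from collections import defaultdict

# Sample rows from the question: (item, secno, drawing)
rows = [
    (121, 904, 5000),
    (122, 904, 5000),
    (123, 816, 5100),
    (124, 813, 5200),
    (125, 812, 4900),
    (126, 812, 4900),
    (127, 814, 5300),
]


def find_duplicates(rows):
    """Group rows by (secno, drawing) and keep only groups with 2+ rows."""
    groups = defaultdict(list)
    for item, secno, drawing in rows:
        groups[(secno, drawing)].append(item)
    # The size check plays the role of "gs >= 2" in the ABAP version.
    return {key: items for key, items in groups.items() if len(items) >= 2}


duplicates = find_duplicates(rows)
for (secno, drawing), items in duplicates.items():
    print(secno, drawing, items)
```

As in the ABAP solution, the grouping does not require the input to be sorted first.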

You can use AT...ENDAT for this, provided that you arrange the fields correctly:
TYPES: BEGIN OF t_my_line,
secno TYPE foo,
drawing TYPE bar,
item TYPE baz, " this field has to appear AFTER the other ones in the table
END OF t_my_line.
DATA: lt_my_table TYPE TABLE OF t_my_line,
lt_duplicates TYPE TABLE OF t_my_line.
FIELD-SYMBOLS: <ls_line> TYPE t_my_line.
START-OF-WHATEVER.
* ... fill the table ...
SORT lt_my_table BY secno drawing.
LOOP AT lt_my_table ASSIGNING <ls_line>.
AT NEW drawing. " whenever drawing or any field left of it changes...
FREE lt_duplicates.
ENDAT.
APPEND <ls_line> TO lt_duplicates.
AT END OF drawing.
IF lines( lt_duplicates ) > 1.
* congrats, here are your duplicates...
ENDIF.
ENDAT.
ENDLOOP.
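The same sort-then-detect-breaks idea can be sketched in Python with `itertools.groupby`, which, like AT NEW / AT END OF, only works on data that is already sorted by the grouping columns. The function name and the tuple layout (item, secno, drawing) are assumptions made for this illustration.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (item, secno, drawing)
rows = [
    (121, 904, 5000),
    (122, 904, 5000),
    (123, 816, 5100),
    (124, 813, 5200),
    (125, 812, 4900),
    (126, 812, 4900),
    (127, 814, 5300),
]


def duplicates_by_control_break(rows):
    """Sort by (secno, drawing), then report every run longer than one row."""
    key = itemgetter(1, 2)  # secno, drawing
    result = []
    for (secno, drawing), run in groupby(sorted(rows, key=key), key=key):
        run = list(run)  # the rows collected between "AT NEW" and "AT END OF"
        if len(run) > 1:
            result.append((secno, drawing, [item for item, _, _ in run]))
    return result


control_break_result = duplicates_by_control_break(rows)
for entry in control_break_result:
    print(entry)
```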

I simply needed to report duplicate lines as errors based on two fields, so I used the following.
LOOP AT gt_data INTO DATA(gs_data)
GROUP BY ( columnA = gs_data-columnA columnB = gs_data-columnB
size = GROUP SIZE index = GROUP INDEX ) ASCENDING
REFERENCE INTO DATA(group_ref).
IF group_ref->size > 1.
PERFORM insert_error USING group_ref->columnA group_ref->columnB.
ENDIF.
ENDLOOP.

Here is my 2p worth. You could cut some of this out depending on what you want to do, and you should also consider the amount of data being processed: this method is really only suitable for smaller data sets.
Personally, I prefer to prevent erroneous records at the source by catching the error during input. But if you do end up in a pickle, there is definitely more than one way to solve the issue.
TYPES: BEGIN OF ty_itab,
item TYPE i,
secno TYPE i,
drawing TYPE i,
END OF ty_itab.
TYPES: itab_tt TYPE STANDARD TABLE OF ty_itab.
DATA: lt_itab TYPE itab_tt,
lt_itab2 TYPE itab_tt,
lt_itab3 TYPE itab_tt.
lt_itab = VALUE #(
( item = '121' secno = '904' drawing = '5000' )
( item = '122' secno = '904' drawing = '5000' )
( item = '123' secno = '816' drawing = '5100' )
( item = '124' secno = '813' drawing = '5200' )
( item = '125' secno = '812' drawing = '4900' )
( item = '126' secno = '812' drawing = '4900' )
( item = '127' secno = '814' drawing = '5300' )
).
APPEND LINES OF lt_itab TO lt_itab2.
APPEND LINES OF lt_itab TO lt_itab3.
SORT lt_itab2 BY secno drawing.
DELETE ADJACENT DUPLICATES FROM lt_itab2 COMPARING secno drawing.
* Loop at what is hopefully the smaller itab.
LOOP AT lt_itab2 ASSIGNING FIELD-SYMBOL(<line>).
DELETE TABLE lt_itab3 FROM <line>.
ENDLOOP.
* itab1 has all originals.
* itab2 has the unique.
* itab3 has the duplicates.
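A rough Python sketch of the idea behind this answer (not of its exact ABAP semantics): keep the first row seen for each SecNo + Drawing combination as "unique" and treat every later row with the same combination as a duplicate. The function name is hypothetical, and note that this variant reports only the extra occurrences, not both rows of a pair.

```python
# Sample rows from the question: (item, secno, drawing)
rows = [
    (121, 904, 5000),
    (122, 904, 5000),
    (123, 816, 5100),
    (124, 813, 5200),
    (125, 812, 4900),
    (126, 812, 4900),
    (127, 814, 5300),
]


def split_unique_and_extra(rows):
    """Return (first occurrence per key, later occurrences of a seen key)."""
    seen = set()
    unique, extra = [], []
    for row in rows:
        key = (row[1], row[2])  # secno, drawing
        if key in seen:
            extra.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, extra


unique_rows, extra_rows = split_unique_and_extra(rows)
print([r[0] for r in unique_rows])
print([r[0] for r in extra_rows])
```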


Power BI DAX : multi-column and multi-row condition and group by 2 columns

Folks, I am trying to create a calculated column/measures and experiencing issues.
My Data-set looks like this:
City  | Building Name | Test Date | Component   | Test Result | Calculated Result
City1 | Build1        | 1/3/2014  | Component A | Pass        | None
City1 | Build1        | 1/11/2014 | Component 1 | Fail        | Fail1
City1 | Build1        | 1/11/2014 | Component 2 | Pass        | Fail1
City1 | Build1        | 1/11/2014 | Component 3 | Pass        | Fail1
City1 | Build1        | 1/06/2014 | Component A | Fail        | MultiFail
City1 | Build1        | 1/06/2014 | Component 1 | Fail        | MultiFail
City1 | Build1        | 1/06/2014 | Component 2 | Pass        | MultiFail
City1 | Build1        | 1/06/2014 | Component 3 | Fail        | MultiFail
I am looking at the Component and Test Result columns: I want to count the Fails, grouped by Building Name and Test Date, and then generate the Calculated Result depending on the number of components that failed.
If Single component Test Result = fail - then Calculated Result = Fail1
If CountA(components Test Result = fail) <=2 then Calculated Result = Fail2
If CountA(components Test Result = fail) > 2 then Calculated Result = MultiFail
If Component1 AND ComponentA Test Result = Fail then Calculated Result = FailMail
So far I have tried various ways of solving this, taking one step forward and two steps back:
I created a calculated column to count the number of Fails, to be used for the Calculated Result, but I am struggling to generate the Calculated Result itself.
Tests_Failed =
CALCULATE(COUNT(Table[TestResult]),FILTER(Table,Table[date]=MAX(Table[date])
&& Table[TestResult]="Fail"))
Another way I tried approaching the problem
Calculated Result =
VAR Component = Table[Component]
VAR Date1 = Table[Test date]
VAR Build = Table[Building Name]
RETURN
CALCULATE(DISTINCTCOUNT('Table'[Component]), ALL(Table), FILTER('Table',
'Table'[Test Result]="Fail" && 'Table'[date] = Date1 &&
'Table'[Building Name]=Build)))
Can you try this
Measure =
VAR _1 =
CALCULATE (
CALCULATE (
DISTINCTCOUNT ( 'Table'[Component] ),
FILTER ( 'Table', 'Table'[Test Result] = "Fail" )
),
ALLEXCEPT ( 'Table', 'Table'[Building Name], 'Table'[Test Date] )
)
VAR _2 =
SWITCH (
TRUE (),
ISBLANK ( _1 ) = TRUE (), "None",
_1 = 1, "Fail1",
"MultiFail"
)
RETURN
_2
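To make the logic of the measure easier to check against the sample data, here is a small Python sketch of the same rule: count the distinct failed components per Building Name and Test Date, then map 0 to "None", 1 to "Fail1" and anything larger to "MultiFail". The function name `classify` and the tuple layout are assumptions for this illustration; like the measure, it does not cover the Fail2 and FailMail rules from the question.

```python
from collections import defaultdict

# Sample rows: (city, building, test_date, component, test_result)
rows = [
    ("City1", "Build1", "1/3/2014", "Component A", "Pass"),
    ("City1", "Build1", "1/11/2014", "Component 1", "Fail"),
    ("City1", "Build1", "1/11/2014", "Component 2", "Pass"),
    ("City1", "Build1", "1/11/2014", "Component 3", "Pass"),
    ("City1", "Build1", "1/06/2014", "Component A", "Fail"),
    ("City1", "Build1", "1/06/2014", "Component 1", "Fail"),
    ("City1", "Build1", "1/06/2014", "Component 2", "Pass"),
    ("City1", "Build1", "1/06/2014", "Component 3", "Fail"),
]


def classify(rows):
    """Label each (building, test_date) by its number of distinct failed components."""
    failed = defaultdict(set)
    for _city, building, test_date, component, result in rows:
        key = (building, test_date)
        failed[key]  # touch the key so groups without any fail still appear
        if result == "Fail":
            failed[key].add(component)
    labels = {}
    for key, components in failed.items():
        count = len(components)
        labels[key] = "None" if count == 0 else "Fail1" if count == 1 else "MultiFail"
    return labels


labels = classify(rows)
for key, label in labels.items():
    print(key, label)
```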

Create a measure depending on a filter

I have two measures:
% PARTIC_REC_CORR =
IF(
SUM('RECEITA _SERVIÇO'[REC_CORRENTES])=0,
"_",
DIVIDE(
SUM(BD_RH[PESSOAL + ENCARGOS + BENEFÍCIOS]),
SUM('RECEITA _SERVIÇO'[REC_CORRENTES])
)
)
% PARTIC_REC_CORR TOTAL =
IF(
SUM('REC_CORRENTES_1'[RECEITAS CORRENTES])=0,
"_",
DIVIDE(
SUM(BD_RH[PESSOAL + ENCARGOS + BENEFÍCIOS]),
SUM('REC_CORRENTES_1'[RECEITAS CORRENTES])
)
)
Now I want to create a new one depending on the selected filter (dLocal).
If nothing is selected, I want % PARTIC_REC_CORR TOTAL, otherwise % PARTIC_REC_CORR. I tried
% PARTIC_REC_CORR 2 =
IF(
ALL('dLocal'[SIESTADO]),
RH_INDICADORES[% PARTIC_REC_CORR TOTAL],
RH_INDICADORES[% PARTIC_REC_CORR]
)
unsuccessfully.
dLocal
SISESTADO CDESTADO
AC 24
AL 02
... ...
RECEITA _SERVIÇO
DATA_BASE CDESTADO REC_CORRENTES SISTESTADO
31/12/2018 24 99999,99 AC
31/12/2018 02 99999,99 AL
... ... ... ...
31/12/2019 24 99999,99 AC
31/12/2019 02 99999,99 AL
... ... ... ...
REC_CORRENTES_1
DATA_BASE REC_CORRENTES
31/12/2018 99999999,99
31/12/2018 99999999,99
This should work.
Edit: the measures were in the wrong order.
% PARTIC_REC_CORR 2 =
IF(
ISFILTERED('dLocal'[SIESTADO])
,RH_INDICADORES[% PARTIC_REC_CORR]
,RH_INDICADORES[% PARTIC_REC_CORR TOTAL]
)
Note that ISFILTERED only considers direct filters on a column.
Your formula does not work because ALL() returns the whole column with all its values (ALL removes the filters) rather than a true/false condition; how that is evaluated inside the IF condition, I have no idea.

VB6 MSFlexGrid - Unable to set columns and rows count at runtime

I have a Visual Basic 6 form with an MSFlexGrid control on it, which takes data from an ADODB record set and displays it.
Before copying the data to the FlexGrid, I try to set the row count based on the number of records. I also have an array containing the column names, so I can get the number of columns from it.
The following is a code snippet:
v_colsCount = UBound(aCols) + 2   ' aCols = array with the column names
v_regCount = rs.RecordCount       ' rs = my ADODB record set
myFlexGrid.Rows = 0               ' clears the rows from a previous display
myFlexGrid.Rows = IIf(v_regCount > 0, v_regCount + 1, 2)
myFlexGrid.Cols = v_colsCount
myFlexGrid.FixedRows = 1
myFlexGrid.FixedCols = 0
There are 7532 rows and 52 columns. The problem comes when I run the application and try to execute this part of the code (fill the FlexGrid with data from the record set):
For iRow = 1 To v_regCount
For iCol = 0 To v_colsCount -2
sAux = ConvStr(rs.Fields(aCols(iCol)).Value)
myFlexGrid.TextMatrix(iRow, iCol) = sAux
I notice that
v_regCount = 7532 but v_colsCount = 2 ,
and I get an error ("Substring out of range"). If I swap the order of the settings (i.e. if I set myFlexGrid.Cols after setting myFlexGrid.Rows), then
v_regCount = 0 and v_colsCount = 52
I don't understand why I can't set rows and columns count at the same time.
Any ideas?
Thanks in advance

advice to make my below Pig code simple

Here is my code. It does two group-all operations and it works. My goal is to generate the unique user count of all students together with their total scores, plus the unique user count of the students located in CA. Is there a way to simplify the code so that it uses only one group operation, or any other constructive idea to make it simpler, for example using only one FOREACH operation? Thanks.
student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2;
student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);
Sample input (student ID, location ID, mathScore, verbScore),
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
Sample output (unique user, unique user in CA, sum of mathScore of all students, sum of verb Score of all students),
7 3 150 240
thanks in advance,
Lin
You might be looking for this.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
student_CA = filter data by lid == 1;
student_CA_sum = SUM( student_CA.sid ) ;
student_CA_count = COUNT( student_CA.sid ) ;
mathScore = SUM(data.ms);
verbScore = SUM(data.vs);
GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output is:
grunt> dump result
(6,3,150,240)
grunt> describe result
result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}
First load the file (student) into the Hadoop file system. Then perform the steps below.
split student into student_CA if locationId == 1, student_Other if locationId != 1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount;
student_Other_all = group student_Other all;
student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count,0 as locationOtherCount:long, SUM(student_Other.mathScore) as mScoreCount,SUM(student_Other.verbScore) as vScoreCount;
student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;
student_summary_all = group student_CAandOther_all_summary all;
student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;
output:
dump student_summary;
(6,3,150,240)
Hope this helps :)
While solving your problem, I also ran into an issue with Pig. I assume it is caused by improper exception handling in the UNION command: running that command can hang your command-line prompt without a proper error message. If you want, I can share the snippet for that.
The accepted answer has a logical error.
Try to have the below input file
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
7 1 10 10
The output will be
(13,4,160,250)
The output should be
(7,4,160,250)
I have modified the script so that it works correctly.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
student_CA_sum = COUNT( data.sid ) ;
student_CA = filter data by lid == 1;
student_CA_count = COUNT( student_CA.sid ) ;
mathScore = SUM(data.ms);
verbScore = SUM(data.vs);
GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output
(7,4,160,250)
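The expected numbers for the 7-row input can be double-checked with a short Python sketch that computes all four values in a single pass. The function name `summarize` is made up for this illustration, and it assumes (as the Pig scripts do) that each student ID occurs only once, so counting rows equals counting unique users.

```python
# Rows from the 7-line input file: (student_id, location_id, math_score, verb_score)
students = [
    (1, 1, 10, 20),
    (2, 1, 20, 30),
    (3, 1, 30, 40),
    (4, 2, 30, 50),
    (5, 2, 30, 50),
    (6, 3, 30, 50),
    (7, 1, 10, 10),
]


def summarize(students, ca_location_id=1):
    """Return (student count, CA student count, math score sum, verb score sum)."""
    total = ca_total = math_sum = verb_sum = 0
    for _sid, lid, math_score, verb_score in students:
        total += 1
        if lid == ca_location_id:  # location 1 stands for CA in the question
            ca_total += 1
        math_sum += math_score
        verb_sum += verb_score
    return total, ca_total, math_sum, verb_sum


summary = summarize(students)
print(summary)
```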

bash merging tables on unique id

I have two similar, 'table format' text files, each several million records long. In the inputfile1, the unique identifier is a merger of values in two other columns (neither of which are unique identifiers on their own). In inputfile2, the unique identifier is two letters followed by a random four-digit number.
How can I replace the unique identifiers in inputfile1 with the corresponding unique identifiers in inputfile2? All of the records in the first table are present in the second, though not vice versa. Below are toy examples of the files.
Input file 1:
Grp Len ident data
A 20 A_20 3k3bj52
A 102 A_102 3k32rf2
A 352 A_352 3w3bj52
B 60 B_60 3k3qwrg
B 42 B_42 3kerj52
C 89 C_89 3kftj55
C 445 C_445 fy5763b
Input file 2:
Grp Len ident
A 20 fz2525
A 102 fz5367
A 352 fz4678
A 356 fz1543
B 60 fz5732
B 11 fz2121
B 42 fz3563
C 89 fz8744
C 245 fz2653
C 445 fz2985
C 536 fz8983
Desired output:
Grp Len ident data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b
My provisional plan is:
Generate extra identifiers for input2, in the style of input1 (easy)
Filter out lines from input2 that don't occur in input1 (hardish)
Then stick on the data from input1 (easy)
I might be able to do this in R but the data is large and complex, and I was wondering if there was a way in bash or perl. Any tips in the right direction would be good.
This should work for you, assuming the Grp and Len values are in the same order in both files, as per my comment
Essentially it reads a line from the first file and then reads from the second file, forming the Grp_Len key from each record until it finds an entry that matches. Then it's just a matter of building the new output record
use strict;
use warnings;
open my $f1, '<', 'file1.txt';
print scalar <$f1>;
open my $f2, '<', 'file2.txt';
<$f2>;
while ( <$f1> ) {
    my @f1 = split;
    my @f2;
    while () {
        @f2 = split ' ', <$f2>;
        last if join('_', @f2[0,1]) eq $f1[2];
    }
    print "@f2 $f1[3]\n";
}
output
Grp Len ident data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b
Update
Here's another version which is identical except that it builds a printf format string from the spacing of the column headers in the first file. That results in a much neater output
use strict;
use warnings;
open my $f1, '<', 'file1.txt';
my $head = <$f1>;
print $head;
my $format = create_format($head);
open my $f2, '<', 'file2.txt';
<$f2>;
while ( <$f1> ) {
    my @f1 = split;
    my @f2;
    while () {
        @f2 = split ' ', <$f2>;
        last if join('_', @f2[0,1]) eq $f1[2];
    }
    printf $format, @f2, $f1[3];
}
sub create_format {
    my ($head) = @_;
    my ($format, $pos);
    while ( $head =~ /\b\S/g ) {
        $format .= sprintf("%%-%ds", $-[0] - $pos) if defined $pos;
        $pos = $-[0];
    }
    $format . "%s\n";
}
output
Grp Len ident data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b
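If the two files are not guaranteed to be in the same order, a different technique than the sequential read above is a lookup table: load the (Grp, Len) to ident mapping from the second file into a dictionary, then rewrite the first file line by line. The following Python sketch shows that variant under the assumption that the second file's mapping fits in memory; the function name `merge_idents` and the in-memory toy files are made up for this illustration.

```python
import io


def merge_idents(file1, file2, out):
    """Replace the ident column of file1 using the (Grp, Len) -> ident map of file2."""
    lookup = {}
    next(file2)  # skip the header line of the second file
    for line in file2:
        fields = line.split()
        if len(fields) >= 3:
            lookup[(fields[0], fields[1])] = fields[2]

    out.write(next(file1))  # copy the header line of the first file unchanged
    for line in file1:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or incomplete lines
        grp, length, _old_ident, data = fields[:4]
        out.write(f"{grp} {length} {lookup[(grp, length)]} {data}\n")


# Toy data: a subset of the example files from the question.
FILE1 = "Grp Len ident data\nA 20 A_20 3k3bj52\nB 42 B_42 3kerj52\nC 445 C_445 fy5763b\n"
FILE2 = "Grp Len ident\nA 20 fz2525\nA 356 fz1543\nB 11 fz2121\nB 42 fz3563\nC 445 fz2985\n"

merged = io.StringIO()
merge_idents(io.StringIO(FILE1), io.StringIO(FILE2), merged)
print(merged.getvalue(), end="")
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(...)`. A record of the first file that has no match in the second raises a `KeyError`, which fits the statement that all records of the first table are present in the second.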
