Merging tables on unique id - bash

I have two similar, 'table format' text files, each several million records long. In inputfile1, the unique identifier is a merger of the values in two other columns (neither of which is a unique identifier on its own). In inputfile2, the unique identifier is two letters followed by a random four-digit number.
How can I replace the unique identifiers in inputfile1 with the corresponding unique identifiers from inputfile2? All of the records in the first table are present in the second, though not vice versa. Below are toy examples of the files.
Input file 1:
Grp Len ident  data
A   20  A_20   3k3bj52
A   102 A_102  3k32rf2
A   352 A_352  3w3bj52
B   60  B_60   3k3qwrg
B   42  B_42   3kerj52
C   89  C_89   3kftj55
C   445 C_445  fy5763b
Input file 2:
Grp Len ident
A   20  fz2525
A   102 fz5367
A   352 fz4678
A   356 fz1543
B   60  fz5732
B   11  fz2121
B   42  fz3563
C   89  fz8744
C   245 fz2653
C   445 fz2985
C   536 fz8983
Desired output:
Grp Len ident  data
A   20  fz2525 3k3bj52
A   102 fz5367 3k32rf2
A   352 fz4678 3w3bj52
B   60  fz5732 3k3qwrg
B   42  fz3563 3kerj52
C   89  fz8744 3kftj55
C   445 fz2985 fy5763b
My provisional plan is:
Generate extra identifiers for input2, in the style of input1 (easy)
Filter out lines from input2 that don't occur in input1 (hardish)
Then stick on the data from input1 (easy)
I might be able to do this in R, but the data is large and complex, and I was wondering if there was a way to do it in bash or perl. Any tips in the right direction would be good.

This should work for you, assuming the Grp and Len values are in the same order in both files, as per my comment.
Essentially it reads a line from the first file and then reads from the second file, forming the Grp_Len key from each record until it finds an entry that matches. Then it's just a matter of building the new output record.
use strict;
use warnings;

open my $f1, '<', 'file1.txt';
print scalar <$f1>;            # copy the header line through

open my $f2, '<', 'file2.txt';
<$f2>;                         # skip the header of file2

while ( <$f1> ) {
    my @f1 = split;
    my @f2;
    while (1) {                # advance file2 until the Grp_Len key matches
        @f2 = split ' ', <$f2>;
        last if join('_', @f2[0,1]) eq $f1[2];
    }
    print "@f2 $f1[3]\n";
}
output
Grp Len ident  data
A 20 fz2525 3k3bj52
A 102 fz5367 3k32rf2
A 352 fz4678 3w3bj52
B 60 fz5732 3k3qwrg
B 42 fz3563 3kerj52
C 89 fz8744 3kftj55
C 445 fz2985 fy5763b
Update
Here's another version which is identical except that it builds a printf format string from the spacing of the column headers in the first file. That results in much neater output.
use strict;
use warnings;

open my $f1, '<', 'file1.txt';
my $head = <$f1>;
print $head;
my $format = create_format($head);   # derive column widths from the header

open my $f2, '<', 'file2.txt';
<$f2>;                               # skip the header of file2

while ( <$f1> ) {
    my @f1 = split;
    my @f2;
    while (1) {                      # advance file2 until the Grp_Len key matches
        @f2 = split ' ', <$f2>;
        last if join('_', @f2[0,1]) eq $f1[2];
    }
    printf $format, @f2, $f1[3];
}

sub create_format {
    my ($head) = @_;
    my ($format, $pos);
    while ( $head =~ /\b\S/g ) {     # find the start column of each header word
        $format .= sprintf("%%-%ds", $-[0] - $pos) if defined $pos;
        $pos = $-[0];
    }
    $format . "%s\n";
}
output
Grp Len ident  data
A   20  fz2525 3k3bj52
A   102 fz5367 3k32rf2
A   352 fz4678 3w3bj52
B   60  fz5732 3k3qwrg
B   42  fz3563 3kerj52
C   89  fz8744 3kftj55
C   445 fz2985 fy5763b
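If the Grp and Len values aren't guaranteed to be in the same order in both files, a lookup table keyed on Grp_Len avoids the ordering assumption entirely. A minimal sketch in awk, assuming whitespace-separated columns exactly as in the toy files:

awk 'NR == FNR { id[$1 "_" $2] = $3; next }   # pass 1 (file2): map Grp_Len to the new ident
     FNR == 1  { print; next }                # pass 2 (file1): pass the header through
     { $3 = id[$1 "_" $2]; print }            # swap in the new ident
' file2.txt file1.txt

This keeps one entry per record of file2 in memory, which should still be manageable at a few million rows, though the output is single-space separated rather than column-aligned.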


Add one column from one file to the end of multiple files

I want to put one column from one file, column 7 (i.e. motherfile), at the end column of many files (i.e. child1.c, child2.c, child3.c and so on).
motherfile
38 WAT1 1 TIP3 OH2 OT -0.834000 15.9994 0
39 WAT1 1 TIP3 H1 HT 0.417000 1.0080 0
40 WAT1 1 TIP3 H2 HT 0.417000 1.0080 0
41 WAT1 2 TIP3 OH2 OT -0.834000 15.9994 0
42 WAT1 2 TIP3 H1 HT 0.417000 1.0080 0
child1.c
O -5.689000 -0.628000 -10.423000
H -6.663000 -0.744000 -10.224000
H -5.166000 -1.340000 -9.957000
O 11.405000 3.612000 1.674000
H 11.331000 4.609000 1.663000
child2.c
O -4.689000 -0.628000 -10.423000
H -5.663000 -0.744000 -10.224000
H -6.166000 -1.340000 -9.957000
O 1.4405000 3.612000 1.674000
H 14.331000 4.609000 1.663000
and so on ...
I tried to use
awk '{f1 = $0; getline<"motherfile"; print f1, $7}' < child1.c > newchild1.c
but this only works to add the column to one file, and I want to add the column to many files.
Note that newchild1.c needs to look like this:
O -5.689000 -0.628000 -10.423000 -0.834000
H -6.663000 -0.744000 -10.224000 0.417000
H -5.166000 -1.340000 -9.957000 0.417000
O 11.405000 3.612000 1.674000 -0.834000
H 11.331000 4.609000 1.663000 0.417000
In awk, print statements can be redirected to a file using > or >>. The following example reads column 7 of the motherfile into memory and writes each child file to a new file whose name is prefixed with the string "new", with the saved column appended:
awk 'NR==FNR{a[FNR]=$7;next}{print $0, a[FNR] > ("new" FILENAME)}' motherfile child1.c child2.c ...
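For a large batch of child files, a shell glob can supply the file list; a quick sketch, assuming the children all match child*.c in the current directory:

awk 'NR==FNR { a[FNR] = $7; next }            # slurp column 7 of motherfile
     { print $0, a[FNR] > ("new" FILENAME) }  # append it, writing to newchildN.c
' motherfile child*.c

One caveat: awk keeps every output file open, so with very many children you may hit the per-process open-file limit; calling close("new" FILENAME) whenever FILENAME changes avoids that.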

advice to make my below Pig code simple

Here is my code; I do two GROUP ALL operations and it works. My purpose is to generate the unique user count of all students together with their total scores, plus the unique user count of students located in CA. I'm wondering if there is good advice to simplify my code to use only one group operation, or any constructive ideas to make the code simpler, for example using only one FOREACH operation? Thanks.
student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2;
student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);
Sample input (student ID, location ID, mathScore, verbScore),
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
Sample output (unique user count, unique user count in CA, sum of mathScore of all students, sum of verbScore of all students),
7 3 150 240
thanks in advance,
Lin
You might be looking for this.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int, lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    student_CA = filter data by lid == 1;
    student_CA_sum = SUM( student_CA.sid );
    student_CA_count = COUNT( student_CA.sid );
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output is:
grunt> dump result
(6,3,150,240)
grunt> describe result
result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}
First load the file (student) into the Hadoop file system, then perform the actions below.
split student into student_CA if locationId == 1, student_Other if locationId != 1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount;
student_Other_all = group student_Other all;
student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count,0 as locationOtherCount:long, SUM(student_Other.mathScore) as mScoreCount,SUM(student_Other.verbScore) as vScoreCount;
student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;
student_summary_all = group student_CAandOther_all_summary all;
student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;
output:
dump student_summary;
(6,3,150,240)
Hope this helps :)
While solving your problem, I also encountered an issue with Pig, which I assume comes from improper exception handling in the UNION command: executing it can hang your command-line prompt without a proper error message. If you want, I can share the snippet for that.
The accepted answer has a logical error.
Try the below input file:
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
7 1 10 10
The output will be
(13,4,160,250)
The output should be
(7,4,160,250)
I have modified the script to work correctly.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int, lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    student_CA_sum = COUNT( data.sid );
    student_CA = filter data by lid == 1;
    student_CA_count = COUNT( student_CA.sid );
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output
(7,4,160,250)

Read and write tab-delimited text data

I have an excel output in the tab-delimited format:
temperature H2O CO2 N2 NH3
10 2.71539E+12 44374931376 7410673406 2570.560804
20 2.34216E+12 38494172272 6429230649 3148.699673
30 2.04242E+12 33759520581 5639029060 3856.866413
40 1.75491E+12 29172949817 4882467457 4724.305292
.
.
.
I need to convert these numbers to FORMAT(1X,F7.0,2X,1P4E11.3) so that they are readable by another code.
This is what I've come up with:
program fixformat
real temp, neuts(4)
integer i,j
character header
open(11,file='./unformatted.txt',status='old')
open(12,file='./formatted.txt',status='unknown')
read(11,*) header
write(12,*) header
do i = 1, 200
read(11,*) temp, (neuts(j),j=1,4)
write(12,23) temp, (neuts(j),j=1,4)
end do
23 FORMAT(1X,F7.0,2X,1P4E11.3)
close(11)
close(12)
return
end
I keep getting this error:
Fortran runtime error: Bad real number in item 1 of list input
Is there any other way to convert the data to that format?
You need a character string, not a single character, for the header:
character(80) header
Other than that your program works for me. Make sure you have the right number of lines in your loop:
do i = 1, 200
Adjust 200 to the real number of your data lines.
If for some reason you still cannot read even a single line, you can also use the format:
read(11,'(f2.0,4(1x,f11.0))') temp, (neuts(j),j=1,4)
because the tab is just a character you can easily skip.
Notes:
Unformatted and formatted mean something completely different in Fortran. Unformatted is what you may know as "binary".
Use some indentation and blank lines in your programs to make them readable.
There is no reason to explicitly use status='unknown'. Just don't put anything there. In your case status='replace' may be more appropriate.
The FORMAT statement is quite obsolete; in modern Fortran we use format strings:
write(12,'(1X,F7.0,2X,1P4E11.3)') temp, (neuts(j),j=1,4)
There is absolutely no reason for your return before the end. RETURN is for early return from a procedure. Some put stop before end program, but it is superfluous.
To read tab delimited data, I'd use a simple algorithm like the one below. NOTE: This is assuming that there is no tab character in any of your fields.
integer :: delim_index, line_index, line_length
character(500) :: data_line, field_data_string
double precision :: dp_value

Open(Unit=1001, File="C:\MY\PATH\Data.txt")
DO
    Read(Unit=1001, End=105, Fmt='(A)') data_line    ! jump to 105 at end of file
    line_length = LEN(TRIM(data_line))
    delim_index = SCAN(data_line, achar(9))          ! achar(9) is the tab character
    line_index = 0
    DO WHILE ( delim_index .NE. 0 )
        line_index = line_index + delim_index
        IF (delim_index .EQ. 1) THEN                 ! found a NULL (no value), so skip
            GOTO 101
        END IF
        field_data_string = data_line( (line_index - delim_index + 1) : line_index )
        READ( field_data_string, FMT=*, ERR=100 ) dp_value
        PRINT *, "Is a double precision ", dp_value
        GOTO 101
100     Continue
        PRINT *, "Not a double precision"
101     Continue
        IF ( (line_index + 1) .GT. line_length ) THEN
            GOTO 104                                 ! found end of line prematurely
        END IF
        delim_index = SCAN( data_line( line_index + 1 : ), achar(9) )
    END DO
    ! handle the last field on the line (no trailing tab)
    field_data_string = data_line( line_index + 1 : )
    READ( field_data_string, FMT=*, ERR=102 ) dp_value
    PRINT *, "Is a double precision ", dp_value
    GOTO 103
102 Continue
    PRINT *, "Not a double precision"
103 Continue
104 Continue
END DO
105 Continue
Close(1001)

Show duplicates in internal table

Each and every item should have a unique SecNo + Drawing combination. Due to misentries, some combinations appear twice.
I need to create a report in ABAP which identifies those combinations and does not show the others.
Item:  SecNo:  Drawing:
121    904     5000      double
122    904     5000      double
123    816     5100
124    813     5200
125    812     4900      double
126    812     4900      double
127    814     5300
How can I solve this? I tried two approaches and failed:
Sorting the data and printing a row whenever it equals the row above it
Counting the duplicates and showing all of those which occur more than once
Where do I put the condition? In the loop area?
I tried this:
REPORT duplicates.
DATA: BEGIN OF lt_duplicates OCCURS 0,
f2(10),
f3(10),
END OF lt_duplicates,
it_test TYPE TABLE OF ztest WITH HEADER LINE,
i TYPE i.
SELECT DISTINCT f2 f3 FROM ztest INTO TABLE lt_duplicates.
LOOP AT lt_duplicates.
IF f2 = lt_duplicates-f2 AND f3 = lt_duplicates-f3.
ENDIF.
i = LINES( it_test ).
IF i > 1.
LOOP AT it_test.
WRITE :/ it_test-f1,it_test-f2,it_test-f3.
ENDLOOP.
ENDIF.
ENDLOOP.
From ABAP 7.40, you may use the GROUP BY constructs with the GROUP SIZE words so as to take into account only the groups with at least 2 elements:
ABAP statement LOOP AT ... GROUP BY ( <columns...> gs = GROUP SIZE ) ...
ABAP expression ... VALUE|REDUCE|NEW type|#( FOR GROUPS ... GROUP BY ( <columns...> gs = GROUP SIZE ) ...
For both constructs, it's possible to loop at the grouped lines in two ways:
* LOOP AT GROUP ...
* ... FOR ... IN GROUP ...
Assume the following input lines:
Line#  Item  SecNo  Drawing
1      121   904    5000     double
2      122   904    5000     double
3      123   816    5100
4      124   813    5200
5      125   812    4900     double
6      126   812    4900     double
7      127   814    5300
You might want to produce the following table containing the duplicates:
SecNo  Drawing  Lines
904    5000     [1,2]
812    4900     [5,6]
Solution with LOOP AT ... GROUP BY ...:
TYPES: BEGIN OF t_line,
         item    TYPE i,
         secno   TYPE i,
         drawing TYPE i,
       END OF t_line,
       BEGIN OF t_duplicate,
         secno   TYPE i,
         drawing TYPE i,
         num_dup TYPE i, " number of duplicates
         lines   TYPE STANDARD TABLE OF REF TO t_line WITH EMPTY KEY,
       END OF t_duplicate,
       t_lines      TYPE STANDARD TABLE OF t_line WITH EMPTY KEY,
       t_duplicates TYPE STANDARD TABLE OF t_duplicate WITH EMPTY KEY.

DATA(table) = VALUE t_lines(
  ( item = 121 secno = 904 drawing = 5000 )
  ( item = 122 secno = 904 drawing = 5000 )
  ( item = 123 secno = 816 drawing = 5100 )
  ( item = 124 secno = 813 drawing = 5200 )
  ( item = 125 secno = 812 drawing = 4900 )
  ( item = 126 secno = 812 drawing = 4900 )
  ( item = 127 secno = 814 drawing = 5300 ) ).

DATA(expected_duplicates) = VALUE t_duplicates(
  ( secno = 904 drawing = 5000 num_dup = 2
    lines = VALUE #( ( REF #( table[ 1 ] ) ) ( REF #( table[ 2 ] ) ) ) )
  ( secno = 812 drawing = 4900 num_dup = 2
    lines = VALUE #( ( REF #( table[ 5 ] ) ) ( REF #( table[ 6 ] ) ) ) ) ).

DATA(actual_duplicates) = VALUE t_duplicates( ).

LOOP AT table
     ASSIGNING FIELD-SYMBOL(<line>)
     GROUP BY ( secno   = <line>-secno
                drawing = <line>-drawing
                gs      = GROUP SIZE )
     ASSIGNING FIELD-SYMBOL(<group_table>).
  IF <group_table>-gs >= 2.
    actual_duplicates = VALUE #( BASE actual_duplicates
      ( secno   = <group_table>-secno
        drawing = <group_table>-drawing
        num_dup = <group_table>-gs
        lines   = VALUE #( FOR <line2> IN GROUP <group_table> ( REF #( <line2> ) ) ) ) ).
  ENDIF.
ENDLOOP.

WRITE : / 'List of duplicates:'.
SKIP 1.
WRITE : / 'Secno Drawing List of concerned items'.
WRITE : / '---------- ---------- ---------------------------------- ...'.
LOOP AT actual_duplicates ASSIGNING FIELD-SYMBOL(<duplicate>).
  WRITE : / <duplicate>-secno, <duplicate>-drawing NO-GROUPING.
  LOOP AT <duplicate>-lines INTO DATA(line).
    WRITE line->*-item.
  ENDLOOP.
ENDLOOP.

ASSERT actual_duplicates = expected_duplicates. " short dump if not equal
Output:
List of duplicates:
Secno Drawing List of concerned items
---------- ---------- ---------------------------------- ...
904 5000 121 122
812 4900 125 126
Solution with ... VALUE type|#( FOR GROUPS ... GROUP BY ...:
DATA(actual_duplicates) = VALUE t_duplicates(
  FOR GROUPS <group_table> OF <line> IN table
  GROUP BY ( secno   = <line>-secno
             drawing = <line>-drawing
             gs      = GROUP SIZE )
  ( secno   = <group_table>-secno
    drawing = <group_table>-drawing
    num_dup = <group_table>-gs
    lines   = VALUE #( FOR <line2> IN GROUP <group_table> ( REF #( <line2> ) ) ) ) ).
DELETE actual_duplicates WHERE num_dup = 1.
Note: for deleting non-duplicates, instead of using an additional DELETE statement, it can be done inside the VALUE construct by adding a LINES OF COND construct which adds 1 line if group size >= 2, or none otherwise (if group size = 1):
...
gs = GROUP SIZE )
( LINES OF COND #( WHEN <group_table>-gs >= 2 THEN VALUE #( "<== new line
( secno = <group_table>-secno
...
... REF #( <line2> ) ) ) ) ) ) ) ). "<== 3 extra right parentheses
You can use AT...ENDAT for this, provided that you arrange the fields correctly:
TYPES: BEGIN OF t_my_line,
         secno   TYPE foo,
         drawing TYPE bar,
         item    TYPE baz, " this field has to appear AFTER the other ones in the table
       END OF t_my_line.
DATA: lt_my_table   TYPE TABLE OF t_my_line,
      lt_duplicates TYPE TABLE OF t_my_line.
FIELD-SYMBOLS: <ls_line> TYPE t_my_line.

START-OF-WHATEVER.
* ... fill the table ...
  SORT lt_my_table BY secno drawing.
  LOOP AT lt_my_table ASSIGNING <ls_line>.
    AT NEW drawing. " whenever drawing or any field left of it changes...
      FREE lt_duplicates.
    ENDAT.
    APPEND <ls_line> TO lt_duplicates.
    AT END OF drawing.
      IF lines( lt_duplicates ) > 1.
*       congrats, here are your duplicates...
      ENDIF.
    ENDAT.
  ENDLOOP.
I simply needed to report duplicate lines in error based on two fields, so I used the following.
LOOP AT gt_data INTO DATA(gs_data)
     GROUP BY ( columnA = gs_data-columnA
                columnB = gs_data-columnB
                size    = GROUP SIZE
                index   = GROUP INDEX ) ASCENDING
     REFERENCE INTO DATA(group_ref).
  IF group_ref->size > 1.
    PERFORM insert_error USING group_ref->columnA group_ref->columnB.
  ENDIF.
ENDLOOP.
Here is my 2p worth. You could cut some of this out depending on what you want to do, and you should consider the amount of data being processed too; this method is only really for smaller sets.
Personally I like to prevent erroneous records at the source, catching errors during input. But if you do end up in a pickle, there is definitely more than one way to solve the issue.
TYPES: BEGIN OF ty_itab,
         item    TYPE i,
         secno   TYPE i,
         drawing TYPE i,
       END OF ty_itab.
TYPES: itab_tt TYPE STANDARD TABLE OF ty_itab.

DATA: lt_itab  TYPE itab_tt,
      lt_itab2 TYPE itab_tt,
      lt_itab3 TYPE itab_tt.

lt_itab = VALUE #(
  ( item = '121' secno = '904' drawing = '5000' )
  ( item = '122' secno = '904' drawing = '5000' )
  ( item = '123' secno = '816' drawing = '5100' )
  ( item = '124' secno = '813' drawing = '5200' )
  ( item = '125' secno = '812' drawing = '4900' )
  ( item = '126' secno = '812' drawing = '4900' )
  ( item = '127' secno = '814' drawing = '5300' )
).

APPEND LINES OF lt_itab TO lt_itab2.
APPEND LINES OF lt_itab TO lt_itab3.

SORT lt_itab2 BY secno drawing.
DELETE ADJACENT DUPLICATES FROM lt_itab2 COMPARING secno drawing.

* Loop at what is hopefully the smaller itab.
LOOP AT lt_itab2 ASSIGNING FIELD-SYMBOL(<line>).
  DELETE TABLE lt_itab3 FROM <line>.
ENDLOOP.

* itab1 has all originals.
* itab2 has the unique entries.
* itab3 has the duplicates.

Awk Calc Avg Rows Below Certain Line

I'm having trouble using awk to calculate an average of specific numbers in a column BELOW a specific text identifier. I have two columns of data, and I'm trying to start the average at a repeating identifier, 01/1991: awk should calculate the average of all lines starting at each 01/1991 line, using that line and the next 21 lines, for a total of 22 rows covering the years 1991 - 2012. The desired output is an average for each TextID/Name entry over all the Januarys (01) of 1991 - 2012, shown below:
TextID/Name 1
Avg: 50.34
TextID/Name 2
Avg: 45.67
TextID/Name 3
Avg: 39.97
...
sample data:
TextID/Name 1
01/1991, 57.67
01/1992, 56.43
01/1993, 49.41
..
01/2012, 39.88
TextID/Name 2
01/1991, 45.66
01/1992, 34.77
01/1993, 56.21
..
01/2012, 42.11
TextID/Name 3
01/1991, 32.22
01/1992, 23.71
01/1993, 29.55
..
01/2012, 35.10
continues with the same data for TextID/Name 4
I'm getting an answer using the code shown below, but the average starts calculating BEFORE the specific identifier line, not on and below that line (01/1991).
awk '$1="01/1991" {sum+=$2} (NR%22==0){avg=sum/22;print"Average: "avg;sum=0;next}' myfile
Thanks, and explanations of the solution are greatly appreciated! I have edited the original post with more description - thank you again.
If you look at your file, the first field is "01/1991," with a comma at the end, not "01/1991" (and note that $1="01/1991" is an assignment, not a comparison; you need $1=="01/1991,"). Also, NR%22==0 looks at line numbers divisible by 22, not at the 22 lines after the point you care about.
You can do something like this instead:
awk '
    BEGIN { l = -1; }
    $1 == "01/1991," {
        l = 22;
        s = 0;
    }
    l > 0  { s += $2; l--; }
    l == 0 { print s/22; l--; }'
It has a counter l that it sets to the number of lines to count, then it sums up that number of lines.
You may want to consider simply summing all lines from one 01/1991 to the next though, which might be more robust; see the sketch below.
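That variant could look something like this; a rough sketch, assuming each block is introduced by a TextID/Name line and every data row is a January entry like those shown above:

awk '/^TextID/ { if (n) printf "%s\nAvg: %.2f\n", name, s / n   # flush the previous block
                 name = $0; s = n = 0; next }
     /^01\//   { s += $2; n++ }                                 # sum every 01/yyyy row
     END       { if (n) printf "%s\nAvg: %.2f\n", name, s / n }' myfile

It doesn't depend on there being exactly 22 rows per block; it averages whatever falls between one TextID/Name header and the next.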
If you're allowed to use Perl instead of Awk, you could do:
#!/usr/bin/env perl
use strict;
use warnings;

my $have_started = 0;
my $count = 0;
my $sum = 0;

while (<>) {
    # Start summing once the first 01/1991 line is seen
    $have_started = 1 if /01\/1991,/;
    # Grab the value after the date and comma
    if ($have_started && /\d+\/\d+,\s+([\d.]+)/) {
        $count++;
        $sum += $1;
    }
}
print "Average of all values = " . $sum / $count . "\n";
Run it like so:
$ cat your-text-file.txt | above-perl-script.pl
