File processing: Combining multiple files with different numbers of columns and rows - bash

I have multiple tab-delimited files where only the first two columns are in common. I'm trying to combine them into one tab-delimited file.
Example: let's say we have 3 files (file1, file2, file3) that we want to combine into file4.
(row and column names are just for demonstration purposes and are not included in any of the files)
Input files =>
File1: 2 rows (r1, r2), 3 columns (c1, c2, c3)
c1 c2 c3
r1 a b c
r2 d e f
File2: 3 rows (r3, r4, r5), 3 columns (c1, c2, c4)
c1 c2 c4
r3 1 2 3
r4 4 5 6
r5 7 8 9
File3: 1 row (r6), 4 columns (c1, c2, c5, c6)
c1 c2 c5 c6
r6 w x y z
Output file =>
For all 3 files, the first 2 columns (c1, c2) have the same names.
File4:
c1 c2 c3 c4 c5 c6
r1 a b c - - -
r2 d e f - - -
r3 1 2 - 3 - -
r4 4 5 - 6 - -
r5 7 8 - 9 - -
r6 w x - - y z
What I'm trying to do is: for each file, add the needed empty columns so that all files have the same number of columns, then reorder the columns with awk, then use cat to stack them vertically. But I don't know if this is the best way, or whether there is a more efficient way to do it.
Thanks,

The following awk script does the task. It builds up a matrix of entries indexed by the row and column names.
awk 'FNR==1 {                              # header line of each file: record new column names, in order
    for (i=1; i<=NF; ++i) {
        if (!($i in columns)) { column_order[++cn] = $i; columns[$i] }
        c[i+1] = $i                        # data field i+1 belongs to this column name
    }
    next
}
!($1 in rows) { row_order[++rn] = $1; rows[$1] }   # remember new row names, in order
{ for (i=2; i<=NF; ++i) entry[$1, c[i]] = $i }     # store each value under (row name, column name)
END {
    s = ""; for (j=1; j<=cn; ++j) s = s OFS column_order[j]; print s   # header, starting with an OFS for the row-name column
    for (i=1; i<=rn; ++i) {
        row_name = row_order[i]
        s = row_name
        for (j=1; j<=cn; ++j) {
            col_name = column_order[j]
            s = s OFS ((row_name, col_name) in entry ? entry[row_name, col_name] : "-")
        }
        print s
    }
}' file1 file2 file3 ... filen
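Since the question asks for a tab-delimited File4, one way to run it (assuming the program above is saved as combine.awk, a file name chosen here only for illustration) is to set OFS to a tab on the command line:

awk -v OFS='\t' -f combine.awk file1 file2 file3 > file4

The header line then starts with a leading tab so that the column names sit past the row-name column.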

Related

How to create a pivot table from a CSV file having subgroups and getting the count of the last values using shell script?

I want to group by the first columns, then form subgroups within them, and get the count of the last column's values.
For example: main group A, subgroups D, J, P, and the count of P within those subgroups, as well as the total count of the last column.
I am able to form the groups, but the subgroups seem a little hard. Any help on how to get this is appreciated.
Input:
A,D,J,P
A,D,J,Q
A,D,K,P
A,D,K,P
A,E,J,Q
A,E,K,Q
A,E,J,Q
B,F,L,R
B,F,L,R
B,F,M,S
C,H,N,T
C,H,O,U
C,H,N,T
C,H,O,U
Output:
A D J P 1
         Q 1
      K P 2
A E J Q 2
      K Q 1
B F L R 2
      M S 1
C H N T 2
      O U 2
    Total 14
Here's a different approach: a shell script that uses SQLite to calculate the group counts (requires SQLite 3.25 or newer because it uses window functions):
#!/bin/sh
file="$1"
sqlite3 -batch -noheader <<EOF
CREATE TABLE data(c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT);
.mode csv
.import "$file" data
.mode list
.separator " "
SELECT (CASE c1 WHEN lag(c1, 1) OVER (PARTITION BY c1 ORDER BY c1) THEN ' ' ELSE c1 END)
, (CASE c2 WHEN lag(c2, 1) OVER (PARTITION BY c1,c2 ORDER BY c1,c2) THEN ' ' ELSE c2 END)
, (CASE c3 WHEN lag(c3, 1) OVER (PARTITION BY c1,c2,c3 ORDER BY c1,c2,c3) THEN ' ' ELSE c3 END)
, c4
, count(*)
FROM data
GROUP BY c1, c2, c3, c4
ORDER BY c1, c2, c3, c4;
SELECT 'Total ' || count(*) FROM data;
EOF
Running this gives:
$ ./group.sh example.csv
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
Also a one-liner using datamash, though it doesn't include the fancy output format:
$ datamash -st, groupby 1,2,3,4 count 4 < example.csv | tr , ' '
A D J P 1
A D J Q 1
A D K P 2
A E J Q 2
A E K Q 1
B F L R 2
B F M S 1
C H N T 2
C H O U 2
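If the blanked-prefix layout is also wanted, the datamash output could be piped through a small awk in the same spirit as the scripts further down; this is only a sketch (it reuses the example.csv name from above and does not print the Total line):

datamash -st, groupby 1,2,3,4 count 4 < example.csv | tr , ' ' |
awk '{
    new = 0
    for (i = 1; i <= 4; i++) {
        if ($i != p[i]) new = 1          # once a field differs, print it and everything after it
        printf "%s ", (new ? $i : " ")   # otherwise blank the repeated leading field
        p[i] = $i
    }
    print $5                             # the count column
}'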
Using Perl
Script
perl -0777 -lne '
s/^(.+?)$/$x++;$kv{$1}++/mge;
foreach my $k (sort keys %kv)
{ $q=$c=$k;
while(length($p) > 0)
{
last if $c=~/^$p/g;
$q=substr($c,length($p)-1);
$p=~s/(.$)//;
}
printf( "%9s\n", "$q $kv{$k}") ;
$p=$k;
}
print "Total $x";
' anurag.txt
Output:
A,D,J,P 1
      Q 1
    K,P 2
  E,J,Q 2
    K,Q 1
B,F,L,R 2
    M,S 1
C,H,N,T 2
    O,U 2
Total 14
$ cat tst.awk
BEGIN { FS="," }
!($0 in cnt) { recs[++numRecs] = $0 }
{ cnt[$0]++ }
END {
    for (recNr=1; recNr<=numRecs; recNr++) {
        rec = recs[recNr]
        split(rec,f)
        newVal = 0
        for (i=1; i<=NF; i++) {
            if (f[i] != p[i]) {
                newVal = 1
            }
            printf "%s%s", (newVal ? f[i] : " "), OFS
            p[i] = f[i]
        }
        print cnt[rec]
        tot += cnt[rec]
    }
    print "Total", tot+0
}
$ awk -f tst.awk file
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
I'll propose a multi-stage solution in the spirit of the Unix toolset.
First, create a sorted, counted, de-delimited data format:
$ sort file | uniq -c | awk '{print $2,$1}' | tr ',' ' '
A D J P 1
A D J Q 1
A D K P 2
A E J Q 2
A E K Q 1
B F L R 2
B F M S 1
C H N T 2
C H O U 2
Now the task is removing the longest common prefix from consecutive lines:
... | awk 'NR==1 {p=$0}
           NR>1  {k=0;
                  while (p ~ (t = substr($0,1,++k)));
                  gsub(/./," ",t); sub(/^ /,"",t);
                  p=$0; $0=t substr(p,k)}1'
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K Q 1
B F L R 2
    M S 1
C H N T 2
    O U 2
Whether it's easier to understand than a single script remains to be seen.
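For reference, the two stages strung together into a single pipeline:

sort file | uniq -c | awk '{print $2,$1}' | tr ',' ' ' |
awk 'NR==1 {p=$0}
     NR>1  {k=0;
            while (p ~ (t = substr($0,1,++k)));
            gsub(/./," ",t); sub(/^ /,"",t);
            p=$0; $0=t substr(p,k)}1'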
At first I didn't have an answer that produced exactly your example output, but I was close enough to dare posting one.
Now I have an answer that produces exactly your example output... :-)
$ cat ABCD
A,D,J,P
A,D,J,Q
A,D,K,P
A,D,K,P
A,E,J,Q
A,E,K,Q
A,E,J,Q
B,F,L,R
B,F,L,R
B,F,M,S
C,H,N,T
C,H,O,U
C,H,N,T
C,H,O,U
$ awk '{a[$0]+=1}END{for(i in a) print i","a[i];print "Total",NR}' ABCD |\
sort | \
awk -F, '
/Total/{print;next}
{print a1==$1?" ":$1,a2==$2?" ":$2,a3==$3?" ":$3,a4==$4?" ":$4,$5
a1=$1;a2=$2;a3=$3;a4=$4}'
A D J P 1
      Q 1
    K P 2
  E J Q 2
    K   1
B F L R 2
    M S 1
C H N T 2
    O U 2
Total 14
$
The first awk script iterates over every line, and at every line it increments an element of the array a indexed by the whole line. Then, in the END block, it loops over the indices of a to print each index and its associated value, that is, the number of times that line occurs in the data. Finally it also outputs the total number of lines processed, which is automatically tracked in the variable NR (number of records).
The second awk script either prints the Total line and skips any further processing, or it compares each field (split on commas) with the corresponding field of the previous line and outputs the new field or a space accordingly.

Create expandable varlist

I am trying to convert two data files into matrices in Stata.
In the first data file there are only 10 columns, so I used:
mkmat d1 d2 d3 d4 d5 d6 d7 d8 d9 d10, matrix(dataname)
However, the second data file contains more than 100 columns.
Do I have to manually include all variable names in mkmat, or is there a better way to do this?
Consider the following toy example:
clear
set obs 5
forvalues i = 1 / 5 {
generate d`i' = rnormal()
}
list
+-----------------------------------------------------------+
| d1 d2 d3 d4 d5 |
|-----------------------------------------------------------|
1. | .2347558 .255076 -1.309553 1.202226 -1.188903 |
2. | .1994864 .5560354 -.7548561 1.353276 -1.836232 |
3. | 1.444645 -1.798258 1.189875 -.0599763 .4022007 |
4. | .2568011 -1.27296 .5404224 -.1167567 1.853389 |
5. | -.4792487 .175548 1.846101 .4198408 -1.182597 |
+-----------------------------------------------------------+
You could simply use wildcard characters:
mkmat d*, matrix(d)
or
mkmat d?, matrix(d)
Alternatively, the commands ds and unab can be used to create a local macro containing a list of qualifying variable names, which can then be used in mkmat:
ds d*
mkmat `r(varlist)', matrix(d1)
matrix list d1
d1[5,5]
d1 d2 d3 d4 d5
r1 .23475575 .25507599 -1.3095527 1.2022264 -1.1889035
r2 .19948645 .5560354 -.75485611 1.3532759 -1.8362321
r3 1.4446446 -1.7982582 1.1898755 -.0599763 .4022007
r4 .25680107 -1.2729601 .54042244 -.11675671 1.8533887
r5 -.47924873 .175548 1.846101 .41984081 -1.1825972
unab varlist : d*
mkmat `varlist', matrix(d2)
matrix list d2
d2[5,5]
d1 d2 d3 d4 d5
r1 .23475575 .25507599 -1.3095527 1.2022264 -1.1889035
r2 .19948645 .5560354 -.75485611 1.3532759 -1.8362321
r3 1.4446446 -1.7982582 1.1898755 -.0599763 .4022007
r4 .25680107 -1.2729601 .54042244 -.11675671 1.8533887
r5 -.47924873 .175548 1.846101 .41984081 -1.1825972
The advantage of ds is that it can be used to further filter results with its has() or not() options.
For example, if some of your variables are strings, mkmat will complain:
tostring d3 d5, force replace
mkmat d*, matrix(d)
string variables not allowed in varlist;
d3 is a string variable
However, the following will work fine:
ds d*, has(type numeric)
d1 d2 d4
mkmat `r(varlist)', matrix(d)
matrix list d
d[5,3]
d1 d2 d4
r1 -1.5934615 2.1092126 -.99447298
r2 -.51445526 -.62898564 .56975317
r3 -1.8468649 -.68184066 .26716048
r4 -.02007644 -.29140079 2.2511463
r5 -.62507766 .6255222 1.0599482
Type help ds or help unab from Stata's command prompt for full syntax details.

replacing specific value (from another file) using awk

I have the following files.
File1
a b 1
c d 2
e f 3
File2
x l
y m
z n
I want to replace 1 with x and save the result in file3; next, replace 1 with y and save in file4.
Then the files look like:
File3
a b x
c d 2
e f 3
File4
a b y
c d 2
e f 3
Once x, y and z are done, then replace 2 with l, m and n in the same way.
I started with this, but it inserts rather than replaces:
awk -v r=1 -v c=3 -v val=x -F, '
BEGIN{OFS=" "}; NR != r; NR == r {$c = val; print}
' file1 >file3
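As an aside, the likely reason it inserts rather than replaces (assuming the input really is whitespace-separated, as in the sample): with -F, the whole line is a single field, so assigning $c creates new fields 2 and 3 and appends them to the line instead of changing the third column. A sketch of a corrected one-liner, dropping -F, so that $c is the real third column:

awk -v r=1 -v c=3 -v val=x 'BEGIN{OFS=" "} NR==r{$c=val} 1' file1 > file3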
Here's a GNU awk script (because it uses true multidimensional arrays and relies on array ordering) that will do what you want:
#!/usr/bin/awk -f
BEGIN { fcnt=3 }
FNR==NR { for(i=1;i<=NF;i++) f2[i][NR]=$i; next }
{
    fout[FNR][1] = $0
    ff = $NF
    if (ff in f2) {
        for (r in f2[ff]) {
            $NF = f2[ff][r]
            fout[FNR][fcnt++] = $0
        }
    }
}
END {
    for (f=fcnt-1; f>=3; f--) {
        for (row in fout) {
            if (fout[row][f] != "") out = fout[row][f]
            else out = fout[row][1]
            print out > "file" f
        }
    }
}
I made at least one major assumption about your input data:
The field number in file2 corresponds exactly to the value that needs to be replaced in file1. For example, x is field 1 in file2, and 1 is what needs replacing in the output files.
Here's the breakdown:
Set fcnt=3 in the BEGIN block.
FNR==NR - store the contents of File2 in the f2 array by (field number, line number).
Store the original f1 line in fout as (line number,1) - where 1 is a special, available array position ( because fcnt starts at 3 ).
Save off $NF as ff because it's going to be reset
Whenever ff is a field number in the first subscript of the f2 array, then reset $NF to the value from file2 and then assign the result to fout at (line number, file number) as $0 ( recomputed ).
In the END, loop over the fcnt in reverse order, and either set out to a replaced line value or an original line value in row order, then print out to the desired filename.
It could be run like gawk -f script.awk file2 file1 (notice the file order). I get the following output:
$ cat file[3-8]
a b x
c d 2
e f 3
a b y
c d 2
e f 3
a b z
c d 2
e f 3
a b 1
c d l
e f 3
a b 1
c d m
e f 3
a b 1
c d n
e f 3
This could be made more efficient for memory by only performing the lookup in the END block, but I wanted to take advantage of the $0 recompute instead of needing calls to split in the END.
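A rough sketch of that memory-leaner variant (still gawk, run with the same file order, e.g. gawk -f sketch.awk file2 file1; the file name sketch.awk and the variable names are illustrative, not part of the script above):

#!/usr/bin/gawk -f
FNR==NR {                                   # file2: lookup table (column, line) -> replacement value
    for (i=1; i<=NF; i++) f2[i][FNR] = $i
    ncols = NF; nrepl = FNR
    next
}
{ lines[FNR] = $0; nlines = FNR }           # file1: keep only the raw lines
END {
    fnum = 3                                # first output file is "file3"
    for (col = 1; col <= ncols; col++) {    # each replacement column of file2 ...
        for (r = 1; r <= nrepl; r++) {      # ... and each value in that column
            out = "file" fnum++
            for (l = 1; l <= nlines; l++) {
                n = split(lines[l], f)
                if (f[n] == col) f[n] = f2[col][r]   # replace the matching last field
                s = f[1]
                for (i = 2; i <= n; i++) s = s " " f[i]
                print s > out
            }
            close(out)
        }
    }
}

It keeps only the f2 lookup table and the unmodified file1 lines in memory, and does the splitting and substitution while writing each output file in the END block.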

Match closest value from two different files and print specific columns

Hi guys, I have two files, each with N columns and M rows.
File1
1 2 4 6 8
20 4 8 10 12
15 5 7 9 11
File2
1 a1 b1 c5 d1
2 a1 b2 c4 d2
3 a2 b3 c3 d3
19 a3 b4 c2 d4
14 a4 b5 c1 d5
What I need is to search for the closest value in column 1 and print specific columns in the output. So, for example, the output should be:
File3
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 11
14 a4 b5 c1 d5
Since 1 = 1, 19 is the closest to 20, and 14 to 15, the output is those lines.
How can I do this in awk or any other tool?
Help!
This is what I have so far:
echo "ARGIND == 1 {
s1[\$1]=\$1;
s2[\$1]=\$2;
s3[\$1]=\$3;
s4[\$1]=\$4;
s5[\$1]=\$5;
}
ARGIND == 2 {
bestdiff=-1;
for (v in s1)
if (bestdiff < 0 || (v-\$1)**2 <= bestdiff)
{
s11=s1[v];
s12=s2[v];
s13=s3[v];
s14=s4[v];
s15=s5[v];
bestdiff=(v-\$1)**2;
if (bestdiff < 2){
print \$0
print s11,s12,s13,s14,s15}}}">diff.awk
awk -f diff.awk file2 file1
output:
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 11
14 a4 b5 c1 d5
1 2
1 1
14 15
I have no idea why the last three lines are there.
Here is what I ended up with, trying to give a way to answer:
function closest(b,i) {                  # define a function
    distance=999999;                     # this should be higher than the max index to avoid returning null
    for (x in b) {                       # loop over the array to get its keys
        tmp = (x+0 > i+0) ? x - i : i - x    # +0 to compare as numbers, ternary operator to reduce code; compute the diff between the key and the target
        if (tmp < distance) {            # if the distance is less than the preceding one, update
            distance = tmp
            found = x                    # and save the key actually found closest
        }
    }
    return found                         # return the closest key
}
{                                        # parse the files for each line (no condition)
    if (NR>FNR) {                        # if we changed file (the per-file record number FNR is lower than the overall NR), switch arrays
        b[$1]=$0                         # make an array with $1 as key
    } else {
        akeys[max++] = $1                # store the keys to preserve order at the end, as for (x in array) does not guarantee the order
        a[$1]=$0                         # make an array with $1 as key
    }
}
END {                                    # now that both files are parsed, print the result
    for (i=0; i<max; i++) {              # loop over the first file's keys in their original order
        print a[akeys[i]]                # print the value for this file
        if (akeys[i] in b) {             # if the same key exists in the second file
            print b[akeys[i]]            # then print it
        } else {
            bindex = closest(b,akeys[i]) # call the function to find the closest key from the second file
            print b[bindex]              # print what we found
        }
    }
}
I hope this is commented enough to be clear; feel free to comment if needed.
Warning: this may become really slow if you have a large number of lines in the second file, as the second array will be scanned for each key of the first file that is not present in the second file.
Given your sample inputs a1 and a2:
$ mawk -f closest.awk a1 a2
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 11
14 a4 b5 c1 d5

Conditional Filter in GROUP BY in Pig

I have the following dataset, in which I need to merge multiple rows into one if they have the same key. At the same time, I need to pick which of the grouped tuples to keep.
1 N1 1 10
1 N1 2 15
2 N1 1 10
3 N1 1 10
3 N1 2 15
4 N2 1 10
5 N3 1 10
5 N3 2 20
For example
A = LOAD 'data.txt' AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
DUMP G;
((1,N1),{(1,N1,1,10),(1,N1,2,15)})
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,1,10),(3,N1,2,15)})
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,1,10),(5,N3,2,20)})
Now, if there are multiple tuples in the collected bag, I want to keep only those which have f3 == 2. Here is the final data that I want:
((1,N1),{(1,N1,2,15)}) -- f3==2, f3==1 is removed from this set
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,2,15)}) -- f3==2, f3==1 is removed from this bag
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,2,20)})
Any idea how to achieve this?
I did it my way, as specified in the comment above. Here is how I did it.
A = LOAD 'group.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
CNT = FOREACH G GENERATE group, COUNT($1) AS cnt, $1;
SPLIT CNT INTO
CNT1 IF (cnt > 1),
CNT2 IF (cnt == 1);
M1 = FOREACH CNT1 {
row = FILTER $2 BY (f3 == 2);
GENERATE FLATTEN(row);
};
M2 = FOREACH CNT2 GENERATE FLATTEN($2);
O = UNION M1, M2;
DUMP O;
(2,N1,1,10)
(4,N2,1,10)
(1,N1,2,15)
(3,N1,2,15)
(5,N3,2,20)
